LegitPhishSet Dataset – A Phishing and Legitimate Dataset for Rapid Benchmarking
The LegitPhishSet dataset contains 30,000 webpage samples: 15,000 legitimate and 15,000 phishing. Each sample has its own folder holding all of its webpage elements (images, URLs, HTML, screenshot and WHOIS information).

Anti-phishing is an active research field in information security, but most studies run their experiments on their own datasets, which makes benchmarking challenging and inefficient. The main objective of this project is to propose and construct a standard offline dataset that is universal and suitable for a wide range of anti-phishing research. The dataset is balanced, with phishing and legitimate webpages each making up 50 percent of the samples. To make the dataset as comprehensive as possible, the project surveyed the major anti-phishing studies in the literature and identified the raw elements needed, the sources of the samples, the dataset size and the factors influencing the dataset; these form the basic criteria for its construction. The final outcome is a readily downloadable dataset of 30,000 phishing and legitimate webpage samples, each containing the elements used by the different studies in the literature. The dataset is intended for benchmarking across a wide range of anti-phishing research and for running rapid proof-of-concept experiments.
Details of the dataset:
Size: 64GB
Folder structure: 2 folders (the first with 150 zip files, the second with 15 zip files)
Contains 2 large zip files: (1) Legitimate.zip (56GB) and (2) Phishing.zip (7.3GB)
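
Because each sample is stored in its own folder, a quick sanity check after extraction is to count the sample folders per class and confirm the 50/50 split. The following is a minimal sketch, not part of the dataset itself. It assumes the archives have been extracted into class folders named Legitimate and Phishing (names borrowed from the zip files) and that each sample is an immediate subfolder of its class folder. The function names count_samples and class_balance are hypothetical, and the real layout inside the archives may differ.

```python
from pathlib import Path


def count_samples(class_dir: Path) -> int:
    """Count immediate subdirectories, treating each one as one sample.

    Assumption: every sample lives in its own folder directly under class_dir.
    Loose files in class_dir are ignored.
    """
    return sum(1 for entry in class_dir.iterdir() if entry.is_dir())


def class_balance(root: Path, classes=("Legitimate", "Phishing")) -> dict:
    """Return the fraction of samples found in each class folder under root.

    The class folder names are assumptions taken from the archive names.
    """
    counts = {name: count_samples(root / name) for name in classes}
    total = sum(counts.values())
    return {name: (n / total if total else 0.0) for name, n in counts.items()}
```

Calling class_balance(Path("path/to/extracted/dataset")) on a complete extraction would be expected to return a fraction of 0.5 for each class if the layout matches these assumptions; a different result suggests either an incomplete extraction or a different folder layout.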