Phishing Dataset

Phishing Dataset – A Phishing and Legitimate Dataset for Rapid Benchmarking


The Phishing Websites Dataset contains 30,000 webpage samples: 15,000 legitimate and 15,000 phishing. The elements of each webpage (images, URLs, HTML, a screenshot and WHOIS information) are stored in a separate folder for each sample.


Anti-phishing is an active research field in information security. Most studies use their own datasets for experiments, which makes benchmarking difficult and inefficient. The main objective of this project is to propose and construct a standard offline dataset that is suitable for a wide range of anti-phishing research. The dataset contains phishing and legitimate webpage samples in equal proportion (50 percent each). To make the dataset as comprehensive as possible, the project surveyed the major anti-phishing studies in the literature and identified the raw elements needed, the sources of the samples, the dataset size and the factors that influence the dataset; these form the basic criteria for its construction. The final outcome is a readily downloadable dataset of 30,000 phishing and legitimate webpage samples. The samples contain all the required elements that have been used in the different studies from the literature. The dataset is suitable for benchmarking and rapid proof-of-concept experiments across a wide range of anti-phishing research, and it is also useful for machine learning-based phishing detection and phishing feature analysis.


Details of the dataset:
Size: 64GB
Folder Structure: 2 folders (the first contains 150 zip files, the second 15 zip files)
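
Because the dataset is delivered as many zip files spread over two folders, unpacking can be scripted. The following is a minimal sketch in Python, not an official tool for this dataset. The function name extract_archives and the paths in the usage note are hypothetical, and the sketch assumes only that the archives are ordinary zip files somewhere under one root folder.

```python
import zipfile
from pathlib import Path


def extract_archives(root, out_dir):
    """Extract every .zip found under `root` into `out_dir`.

    Each archive is unpacked into its own folder that mirrors its
    location under `root`, so archives with the same file name in
    different folders do not overwrite each other.
    Returns the list of archive paths that were extracted.
    """
    root, out_dir = Path(root), Path(out_dir)
    # Collect the archive list first so the walk is not affected by
    # files created during extraction.
    archives = sorted(root.rglob("*.zip"))
    for archive in archives:
        # e.g. root/folder1/part01.zip -> out_dir/folder1/part01/
        target = out_dir / archive.relative_to(root).with_suffix("")
        target.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
    return archives
```

A call such as extract_archives("downloads/phishing_dataset", "extracted") would unpack everything below the (hypothetical) download folder. With 64GB of data, it may be worth checking free disk space before running it.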


Download Link:
Option 1: Download as smaller zip file:
Contains only the phishing dataset zip files (7.3GB)

Mirror 1


Option 2: Download as large zip file:
Contains only the legitimate dataset zip files (56GB)

Mirror 1


How to cite:
K. L. Chiew, E. H. Chang, C. L. Tan, J. Abdullah and K. S. C. Yong (2018), “Building Standard Offline Anti-Phishing Dataset for Benchmarking”, International Journal of Engineering and Technology, 7 (4.31) pp. 7-14.


Phishing Dataset for Machine Learning: Feature Evaluation


A total of 48 features were extracted from 5,000 phishing webpages and 5,000 legitimate webpages, which were downloaded from January to May 2015 and from May to June 2017. The features were selected after surveying 11 research papers on machine learning-based phishing website detection published between 2007 and 2016. A detailed listing of the features is provided in the Appendix of our paper: https://www.sciencedirect.com/science/article/pii/S0020025519300763


Details of the dataset:
Size: 1.30 MB
Type: ARFF (Weka-ready)
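
For readers who want to inspect the ARFF file outside Weka, the sketch below shows one way to read a dense ARFF file with the Python standard library and count the samples per class. It is a hand-written minimal reader, not the format's reference parser: it ignores sparse rows and quoted attribute names that contain spaces. The attribute names in the inline sample (NumDots, UrlLength, CLASS_LABEL) are placeholders used for illustration and should be checked against the downloaded file. A library reader such as scipy.io.arff.loadarff is an alternative if SciPy is installed.

```python
import csv
import io
from collections import Counter


def read_arff(stream):
    """Read a dense ARFF stream and return (attribute_names, rows).

    Each row is a dict that maps attribute name to the raw string value.
    """
    names, rows = [], []
    in_data = False
    for raw in stream:
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        if not in_data:
            lower = line.lower()
            if lower.startswith("@attribute"):
                # "@attribute <name> <type>": keep only the name
                names.append(line.split(None, 2)[1])
            elif lower.startswith("@data"):
                in_data = True
            continue
        # After @data, every line is one comma-separated sample
        values = next(csv.reader([line]))
        rows.append(dict(zip(names, (v.strip() for v in values))))
    return names, rows


# Tiny inline sample with placeholder attribute names (assumed, not
# taken from the real file), so the sketch runs without the download.
sample = io.StringIO(
    "@relation demo\n"
    "@attribute NumDots numeric\n"
    "@attribute UrlLength numeric\n"
    "@attribute CLASS_LABEL {0,1}\n"
    "@data\n"
    "3,72,1\n"
    "1,21,0\n"
    "2,40,1\n"
    "1,18,0\n"
)
names, rows = read_arff(sample)
counts = Counter(row["CLASS_LABEL"] for row in rows)
```

To use the real file, replace the inline sample with an open file handle, for example open(path, encoding="utf-8"), where path points to the downloaded ARFF file. If the file matches the description above, the class counts should come out balanced at 5,000 each.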


Download Link:
https://data.mendeley.com/datasets/h3cgnj8hft/1


How to cite:
K. L. Chiew, C. L. Tan, K. Wong, K. S. C. Yong, and W. K. Tiong (2019), "A new hybrid ensemble feature selection framework for machine learning-based phishing detection system", Information Sciences, 484, pp. 153-166.


Contact Person:


If you need further information or to collaborate on this dataset, please contact:


Associate Professor Dr Chiew Kang Leng
Faculty of Computer Science and Information Technology
Universiti Malaysia Sarawak
94300 Kota Samarahan
Sarawak, Malaysia
Phone: +6082 58 3735


AGREEMENT AND DISCLAIMER
By downloading the dataset, you agree to the following terms and conditions:
1. The dataset should be used only for non-commercial research and educational purposes.
2. The copyright of all components in the dataset fully belongs to the original owners.
3. You will act according to the terms of use of each component as specified on its source site.
4. In no event will we be liable for any loss or damage including without limitation, indirect or consequential loss or damage, or any loss or damage whatsoever arising from loss of data or profits arising out of, or in connection with, the use of this dataset.
5. We reserve the right to terminate access to the dataset at any time without notice.
6. This agreement must be retained with the dataset.