FACULTY OF COMPUTER SCIENCE & INFORMATION TECHNOLOGY

LegitPhishSet Dataset – A Phishing and Legitimate Dataset for Rapid Benchmarking

The LegitPhishSet Dataset contains a total of 30,000 webpage samples: 15,000 legitimate samples and 15,000 phishing samples. All webpage elements (i.e., images, URLs, HTML, screenshots and WHOIS information) are organized into a separate folder for each sample.
Anti-phishing is an active research field in information security. Most studies use their own datasets for their experiments, which makes benchmarking challenging and inefficient. The main objective of this project is to propose and construct a standard offline dataset that is universal and suitable for a wide range of anti-phishing research. The dataset comprises phishing and legitimate webpage samples, with each type making up 50 percent of the total. To make the dataset as comprehensive as possible, the project surveyed the major anti-phishing studies in the literature and performed a thorough investigation. This work included identifying the raw elements needed, the sources of the samples, the dataset size and the influencing factors that form the basic criteria for constructing the dataset. The final outcome of this project is a readily downloadable dataset of 30,000 phishing and legitimate webpage samples. Each sample contains all the elements required by the different studies in the literature. The dataset is intended to support benchmarking across a wide range of anti-phishing research, and to help researchers conduct rapid proof-of-concept experiments.
Details of the dataset:
Size: 64GB
Folder Structure: 2 folders (the first folder contains 150 zip files and the second contains 15 zip files)
Download Link:
Option 1: Download as smaller zip files (if you want to try a few samples before downloading the whole dataset):
Mirror 1
Option 2: Download as large zip file:
Contains 2 large zip files: (1) Legitimate.zip (56GB) and (2) Phishing.zip (7.3GB)
Mirror 1
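Once the archives are downloaded, a quick sanity check is to extract them and count the sample folders for each class. The following is only a minimal sketch in Python, not an official tool for this dataset. It assumes that each archive unpacks to one folder per sample directly under the chosen destination; the internal layout of the archives is not documented here, so adjust the paths if it differs. The function names and the directory names used in the usage note are hypothetical.

```python
import zipfile
from pathlib import Path


def extract_archives(archive_dir, dest_dir):
    """Extract every .zip file found in archive_dir into dest_dir.

    Returns the names of the archives that were extracted, in sorted order.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(archive_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
        extracted.append(archive.name)
    return extracted


def count_sample_folders(class_dir):
    """Count immediate subdirectories (assumption: one folder per sample)."""
    return sum(1 for entry in Path(class_dir).iterdir() if entry.is_dir())


def check_balance(legit_dir, phish_dir, expected_per_class=15000):
    """Compare per-class folder counts against the documented 15,000 each."""
    counts = {
        "legitimate": count_sample_folders(legit_dir),
        "phishing": count_sample_folders(phish_dir),
    }
    balanced = all(n == expected_per_class for n in counts.values())
    return counts, balanced
```

For example, after calling extract_archives("downloads/legitimate", "data/legitimate") and extract_archives("downloads/phishing", "data/phishing"), check_balance("data/legitimate", "data/phishing") reports the number of sample folders found for each class and whether both match the expected 15,000. Note that the full dataset is about 64GB, so make sure enough disk space is available before extracting.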
Publication:
C. L. Tan, K. L. Chiew and S. N. Sze (2016), “Phishing Webpage Detection Using Weighted URL Tokens for Identity Keywords Retrieval”, The 9th International Conference on Robotics, Vision, Signal Processing & Power Applications.
Contact Person:
If you need further information or wish to collaborate on this dataset, please contact:
Dr. Chiew Kang Leng
Department of Computational Science & Mathematics,
Faculty of Computer Science & IT,
Universiti Malaysia Sarawak,
94300 Kota Samarahan,
Sarawak, Malaysia.
Phone: +6082 58 3735
AGREEMENT AND DISCLAIMER
By downloading the dataset, you hereby agree to the following terms and conditions:
1. The dataset should be used only for non-commercial research and educational purposes.
2. The copyright of all components in the dataset fully belongs to the original owners.
3. You will act according to the terms of use of each component as specified on its source site.
4. In no event will we be liable for any loss or damage including without limitation, indirect or consequential loss or damage, or any loss or damage whatsoever arising from loss of data or profits arising out of, or in connection with, the use of this dataset.
5. We reserve the right to terminate access to the dataset at any time without notice.
6. This agreement must be retained with the dataset.