Introduction
This project is part of the Machine Learning for Cybersecurity course I am taking as a cybersecurity student at the university. I am not an ML engineer, but this blog post outlines how to implement a deep learning model that detects malicious URLs. I will demonstrate how a neural network can classify URLs as benign or malicious using features extracted from the URL itself. Basic ML knowledge is assumed, but deep expertise is not necessary.
Dataset
The dataset used is the Malicious URLs Phishing Dataset, which contains benign and malicious URLs. The dataset and full project code are available on GitHub.
Step-by-Step Code Breakdown
Step 1: Importing Necessary Libraries
Step 2: Loading the Dataset
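A minimal sketch of the loading step, assuming pandas and a CSV with a `url` column next to the `type` column used later. The file name in the comment is hypothetical, and a tiny inline sample stands in for the real file so the snippet runs on its own:

```python
import io

import pandas as pd

# Tiny inline stand-in for the real CSV so the snippet runs anywhere.
# Assumption: the real file has a `url` column and a `type` column.
SAMPLE_CSV = """url,type
https://www.example.com/index.html,benign
http://192.168.0.1/paypal/login.php?id=1,phishing
http://example.org/download.exe,malware
"""

# With the real dataset this would be something like
# pd.read_csv("malicious_phish.csv")  # hypothetical file name
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)                   # (rows, columns)
print(df["type"].value_counts())  # class balance
print(df.isna().sum())            # missing values per column
```

Checking the class balance and missing values right after loading is a cheap way to catch problems before any features are computed.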
Step 3: Feature Engineering
Custom functions extract features like the presence of an IP address, URL length, and count of sensitive words. The features and their descriptions are listed below:
Feature | Description |
---|---|
use_of_ip | Whether the URL contains an IP address. |
url_length | The total length of the URL. |
numOf-https | Count of ‘https’ occurrences in the URL. |
numOf-http | Count of ‘http’ occurrences in the URL. |
hostname_length | Length of the hostname part of the URL. |
count-digits | Number of digits in the URL. |
count-letters | Number of alphabetic characters in the URL. |
NumSensitiveWords | Count of sensitive words such as ‘PayPal’, ‘login’, or ‘bank’ in the URL. |
numOf. | Count of periods (‘.’) in the URL. |
numOf% | Count of ‘%’ in the URL. |
numOf? | Count of ‘?’ in the URL. |
numOf- | Count of dashes (‘-’) in the URL. |
numOf= | Count of equal signs (‘=’) in the URL. |
abnormal_url | Whether the URL contains abnormal structures. |
binary | Binary representation of the URL’s classification (0 for benign, 1 for malicious). |
These features were chosen because they capture patterns that are common in malicious URLs: a raw IP address in place of a domain name, which is often used to evade detection, and sensitive keywords such as ‘login’ or ‘bank’, which appear frequently in phishing URLs.
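The features in the table can be sketched with the standard library alone. This is one possible implementation, not necessarily the project's: the keyword list, the IPv4 pattern, and the rule used for `abnormal_url` (no parseable hostname) are assumptions made for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword list; the post only names PayPal, login and bank.
SENSITIVE_WORDS = ("paypal", "login", "bank")

# Matches a dotted-quad IPv4 address anywhere in the URL.
IPV4_PATTERN = re.compile(r"(?<!\d)(?:\d{1,3}\.){3}\d{1,3}(?!\d)")


def extract_features(url: str) -> dict:
    """Return the lexical features from the table for a single URL."""
    # urlparse only fills in hostname when a scheme or '//' is present.
    hostname = urlparse(url).hostname or ""
    lowered = url.lower()
    return {
        "use_of_ip": int(bool(IPV4_PATTERN.search(url))),
        "url_length": len(url),
        "numOf-https": lowered.count("https"),
        "numOf-http": lowered.count("http"),  # also counts the 'http' in 'https'
        "hostname_length": len(hostname),
        "count-digits": sum(ch.isdigit() for ch in url),
        "count-letters": sum(ch.isalpha() for ch in url),
        "NumSensitiveWords": sum(lowered.count(w) for w in SENSITIVE_WORDS),
        "numOf.": url.count("."),
        "numOf%": url.count("%"),
        "numOf?": url.count("?"),
        "numOf-": url.count("-"),
        "numOf=": url.count("="),
        # Assumption: "abnormal" here means no hostname could be parsed.
        "abnormal_url": int(hostname == ""),
    }


features = extract_features("http://192.168.0.1/paypal/login.php?id=1")
print(features)
```

Applied to a DataFrame, something like `df["url"].apply(extract_features).apply(pd.Series)` would produce one column per feature.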
Step 4: Label Encoding
In this project, label encoding is applied to the `type` column, which contains the categorical URL types (e.g., “benign”, “phishing”, “malware”), to convert them into numerical values that the model can process. The `LabelEncoder` from `sklearn.preprocessing` is used to transform the text labels into integers.
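A short sketch of what `LabelEncoder` does, using a handful of illustrative labels rather than the real `type` column:

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels; the real values come from the `type` column.
url_types = ["benign", "phishing", "malware", "benign", "phishing"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(url_types)

# Classes are sorted alphabetically before being numbered.
print(encoder.classes_.tolist())  # ['benign', 'malware', 'phishing']
print(encoded.tolist())           # [0, 2, 1, 0, 2]

# The mapping can be inverted to turn predictions back into names.
print(encoder.inverse_transform([0, 1, 2]).tolist())
```

Because the integers follow alphabetical order, they carry no ranking; they only serve as class indices for the output layer.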
Step 5: Data Splitting
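A minimal sketch of the split, assuming scikit-learn's `train_test_split`. The 80/20 ratio, the random seed, and the synthetic arrays are placeholders, not values taken from the project:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 URLs, 14 features, 3 imbalanced classes.
rng = np.random.default_rng(42)
X = rng.random((100, 14))
y = np.repeat([0, 1, 2], [60, 20, 20])

# Assumption: an 80/20 split. `stratify=y` keeps the class proportions
# the same in the training and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
print(np.bincount(y_test))  # per-class counts in the test set
```

Stratifying matters when benign URLs heavily outnumber the malicious classes, since a plain random split can leave a rare class underrepresented in the test set.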
Step 6: Neural Network Model
The model uses two dense layers with ReLU activation to capture non-linear patterns, and the final layer uses softmax to output one probability per class for multi-class classification.
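The architecture described above could look like the following in Keras. This is a sketch under stated assumptions: the layer widths (64 and 32), the number of input features (14, one per row of the feature table excluding the label), and the number of classes (3) are illustrative choices.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_FEATURES = 14  # assumption: one input per engineered feature
NUM_CLASSES = 3    # assumption: e.g. benign, phishing, malware

model = keras.Sequential(
    [
        keras.Input(shape=(NUM_FEATURES,)),
        layers.Dense(64, activation="relu"),  # layer widths are assumptions
        layers.Dense(32, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ]
)
model.summary()
```

The softmax layer has one unit per class, so its output for each URL is a probability vector that sums to 1.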
![[Head.png]]
Step 7: Compiling and Training the Model
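A hedged sketch of compiling and training with Keras on synthetic data. The optimizer, epoch count, batch size, and validation share are placeholders, and the model is rebuilt here only so the snippet runs on its own:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic stand-in for the engineered features and encoded labels.
rng = np.random.default_rng(0)
X_train = rng.random((200, 14)).astype("float32")
y_train = rng.integers(0, 3, size=200)

model = keras.Sequential([
    keras.Input(shape=(14,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(3, activation="softmax"),
])

# Integer labels from LabelEncoder pair with the sparse variant of
# categorical cross-entropy, so no one-hot encoding is needed.
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Assumption: epochs, batch size and validation share are placeholders.
history = model.fit(
    X_train, y_train,
    epochs=3, batch_size=32, validation_split=0.2, verbose=0,
)
print(sorted(history.history))  # metric names recorded per epoch
```

The returned `history` object keeps per-epoch training and validation metrics, which is what the accuracy and loss plots are drawn from.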
Step 8: Model Evaluation
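One way to evaluate a softmax classifier is to turn its predicted probabilities into class indices and compare them with the true labels. The sketch below uses scikit-learn metrics and made-up probabilities for five test URLs; with a Keras model the probabilities would come from `model.predict(X_test)`. The project itself may evaluate differently.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up softmax outputs for five test URLs (each row sums to 1).
probabilities = np.array([
    [0.90, 0.05, 0.05],
    [0.10, 0.20, 0.70],
    [0.20, 0.70, 0.10],
    [0.60, 0.30, 0.10],
    [0.30, 0.30, 0.40],
])
y_test = np.array([0, 2, 1, 1, 2])  # true encoded labels

# The predicted class is the index of the largest probability.
y_pred = probabilities.argmax(axis=1)

print("accuracy:", accuracy_score(y_test, y_pred))
# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_test, y_pred))
```

The confusion matrix shows which classes get mixed up, which a single accuracy number hides when the classes are imbalanced.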
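The accuracy curve shows steady learning over the epochs, and the low loss indicates that the model is fitting the data well. Low training loss alone does not rule out overfitting; that conclusion rests on the validation curves staying close to the training curves.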
![[Pasted image 20241026151743.png]]
Step 9: Visualizing Accuracy and Loss
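A sketch of plotting the training curves with matplotlib. The numbers are invented purely to make the snippet runnable; in practice the dictionary would be the `history.history` attribute returned by Keras `model.fit`, and the output file name is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt

# Made-up numbers shaped like the `history.history` dict Keras returns.
history = {
    "accuracy": [0.80, 0.88, 0.91, 0.93],
    "val_accuracy": [0.82, 0.87, 0.90, 0.91],
    "loss": [0.55, 0.33, 0.25, 0.20],
    "val_loss": [0.50, 0.35, 0.28, 0.25],
}

fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(10, 4))
for ax, metric in ((ax_acc, "accuracy"), (ax_loss, "loss")):
    ax.plot(history[metric], label="train")
    ax.plot(history["val_" + metric], label="validation")
    ax.set_xlabel("epoch")
    ax.set_ylabel(metric)
    ax.legend()
fig.tight_layout()
fig.savefig("training_curves.png")  # hypothetical output file name
```

Plotting training and validation curves on the same axes makes a widening gap between them, the usual sign of overfitting, easy to spot.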
Conclusion
This project demonstrates the application of machine learning in detecting malicious URLs. The full code and dataset can be accessed on GitHub.