Phishing is a fraudulent activity that aims to steal user credentials, credit card and bank account information, or to deploy malicious software on the victim's infrastructure. For instance, fraudsters can send mass emails containing malicious links; a user who clicks one gives the attacker access to sensitive data.
There are various phishing URL detection techniques: whitelists and blacklists, and heuristic approaches based on content, visual and URL features. Here we discuss a URL-based method that uses standard machine learning algorithms (logistic regression, random forest, SGD classifier, etc.) in combination with NLP-driven features extracted from the URL.
Introduction
Machine learning is now widely used in different cybersecurity areas. Among them are the following:
- Malware detection and classification
- Domain generation algorithms and botnet detection
- Network intrusion detection
- URL detection
- Spam filtering
- Malicious insider threats
- Cyber-physical systems (CPS) and industrial control systems (ICS)
- Biometric systems (face recognition, speaker verification/recognition, fingerprint systems)
- Anomalous user behaviour analysis
This tutorial covers only one of these cases: detecting phishing URLs by combining NLP features with standard (non-deep-learning) machine learning algorithms.
There are different types of URL obfuscation:
| Type | Sample |
|---|---|
| Obfuscation with other domains | http://school497.ru/222/www.paypal.com/... |
| Typo-squatting domains | http://cgi-3.paypal-secure.de/... |
| Obfuscation with IP address | http://69.72.130.98/javaseva/https://paypal.com/... |
| Obfuscation with URL shorteners | http://goo.gl/HQx5g |
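The patterns in the table can be spotted with a few simple checks on the parsed URL. The following is a minimal sketch, not part of the original feature set: the function names and the tiny `SHORTENER_HOSTS` and `BRAND_DOMAINS` lists are hypothetical placeholders, and a real system would rely on maintained lists.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical, illustrative lists (assumptions, not from the tutorial).
SHORTENER_HOSTS = {"goo.gl", "bit.ly", "tinyurl.com"}
BRAND_DOMAINS = {"paypal.com"}


def host_is_ip(host: str) -> bool:
    """Return True if the hostname is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def obfuscation_flags(url: str) -> dict:
    """Flag the obfuscation patterns from the table for a single URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    rest = (parts.path + "?" + parts.query).lower()
    return {
        # Hostname is a raw IP address
        "ip_host": host_is_ip(host),
        # Hostname belongs to a known URL shortener
        "shortener": host in SHORTENER_HOSTS,
        # A brand domain appears after the hostname (path or query)
        "brand_in_path": any(b in rest for b in BRAND_DOMAINS),
        # A brand name is embedded in a host that is not the brand's own domain
        "brand_in_host": any(
            b.split(".")[0] in host
            and not (host == b or host.endswith("." + b))
            for b in BRAND_DOMAINS
        ),
    }


flags = obfuscation_flags("http://69.72.130.98/javaseva/https://paypal.com/...")
```

Flags like these could be added as extra binary columns alongside the other numeric features.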
Dataset
In this tutorial I used the Kaggle dataset "Phishing Site URLs", which contains 549,346 URLs, each labeled with one of two categories ("Good", "Bad").
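A minimal loading sketch is shown below. It assumes the downloaded file is a CSV with a `URL` column and a `Label` column; those column names, the `load_urls` helper and the three sample rows are assumptions for illustration, so adjust them to the actual file.

```python
import csv
import io
from collections import Counter

# Stand-in for the downloaded CSV; column names and rows are assumptions.
SAMPLE_CSV = """URL,Label
paypal.com/signin,good
school497.ru/222/www.paypal.com/,bad
en.wikipedia.org/wiki/Phishing,good
"""


def load_urls(handle):
    """Read (url, label) rows and encode the label as 1 for bad, 0 for good."""
    urls, labels = [], []
    for row in csv.DictReader(handle):
        urls.append(row["URL"].strip())
        labels.append(1 if row["Label"].strip().lower() == "bad" else 0)
    return urls, labels


# With a real file, pass open(path, newline="", encoding="utf-8") instead.
urls, labels = load_urls(io.StringIO(SAMPLE_CSV))
class_counts = Counter(labels)  # quick look at the class balance
```

Checking `class_counts` early is useful because the ratio of good to bad URLs affects how precision and recall should be read later on.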
Measuring ML Model Performance
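Important metrics for measuring the quality of cybersecurity ML models are precision and recall. With TP, FP and FN denoting true positives, false positives and false negatives (phishing being the positive class), precision = TP / (TP + FP) and recall = TP / (TP + FN).
In cybersecurity, recall is critical because 1 - recall is the share of malicious samples we pass as legitimate, so FN should be as low as possible. Precision reflects the friction for legitimate users: the lower it is, the more of the flagged URLs are actually legitimate. There is always a trade-off between recall and precision, and the right balance depends on the application.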
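To make the two metrics concrete, here is a small sketch that computes them from true and predicted labels. The function name and the toy label vectors are illustrative assumptions; label 1 stands for a phishing URL.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when nothing is predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy example: 4 phishing URLs, one missed (FN) and one false alarm (FP)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
```

In this toy case both values are 0.75: three of the four flagged URLs are truly malicious, and three of the four malicious URLs are caught.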
Feature Engineering
A URL generally consists of a protocol, optional credentials (username and password), a hostname, a port, a path and a query:
https://user:password@hostname.com:443/path?query
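These components can be pulled apart with the standard library. The sketch below uses `urllib.parse.urlsplit` on the example URL; the dictionary keys are just illustrative names.

```python
from urllib.parse import urlsplit

parts = urlsplit("https://user:password@hostname.com:443/path?query")

# Map the parsed pieces onto the component names used in the text
components = {
    "protocol": parts.scheme,
    "username": parts.username,
    "password": parts.password,
    "hostname": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "query": parts.query,
}
```

Note that `urlsplit` only recognizes the hostname when the URL includes a scheme or starts with `//`; a bare string such as `hostname.com/path` is parsed as a path.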
The following helper functions extract NLP features from the URL:
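import re
import string
import math
from collections import Counter

VOWELS = set("aeiou")
CONSONANTS = set(string.ascii_lowercase) - VOWELS
# Escape the punctuation so that "\", "]", "^" and "-" stay literal in the class
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]+")

def url_length(s: str) -> int:
    return len(s)

def vowels_pct(s: str) -> float:
    """Share of characters that are vowels (0.0 for an empty string)"""
    if not s:
        return 0.0
    count = sum(1 for ch in s.lower() if ch in VOWELS)
    return count / len(s)

def count_dots(s: str) -> int:
    return s.count('.')

def count_slash(s: str) -> int:
    return s.count('/')

def count_digits(s: str) -> int:
    return len(re.sub(r"\D", "", s))

def extract_doc(s: str) -> str:
    """Split URL by punctuation and join with spaces for TF-IDF"""
    return " ".join(token for token in PUNCT_RE.split(s) if token)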
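The helpers import `math` and `Counter` without using them. One feature that would use both is the character-level Shannon entropy of the URL, which tends to be higher for random-looking strings. Whether the original feature set included it is an assumption; the sketch below is only one plausible way to write it.

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Character-level Shannon entropy of s in bits (0.0 for an empty string)."""
    if not s:
        return 0.0
    n = len(s)
    counts = Counter(s)
    # H = -sum(p * log2(p)) over the relative frequency p of each character
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


low = shannon_entropy("aaaa")   # a single repeated character
high = shannon_entropy("abcd")  # four equally likely characters
```

A string of one repeated character has zero entropy, while four distinct, equally frequent characters give 2 bits.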
After feature extraction, we apply TF-IDF on the doc field and combine with numeric features:
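from scipy.sparse import coo_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import RobustScaler

# TF-IDF over word 1- to 3-grams of the tokenized URL text
tf_idf_vec = TfidfVectorizer(
    encoding='utf-8',
    stop_words='english',
    ngram_range=(1, 3),
    max_df=0.8,   # ignore n-grams that occur in more than 80% of URLs
    min_df=1000   # ignore n-grams that occur in fewer than 1000 URLs
)

# Scaler for the numeric features, based on median and interquartile range
sc = RobustScaler()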
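The snippet stops after creating the vectorizer and the scaler. One way the two feature groups might then be combined is sketched below on a toy corpus. The documents and numeric values are made up for illustration, `min_df` is lowered to 1 because a three-document corpus cannot satisfy `min_df=1000`, and the variable names are assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import RobustScaler

# Toy "doc" strings (URLs split on punctuation) and made-up numeric features
docs = [
    "school497 ru 222 www paypal com",
    "en wikipedia org wiki phishing",
    "cgi 3 paypal secure de",
]
numeric = np.array([
    [31.0, 3.0, 3.0],  # e.g. length, dots, slashes (illustrative values)
    [30.0, 2.0, 2.0],
    [22.0, 2.0, 0.0],
])

vec = TfidfVectorizer(ngram_range=(1, 3), min_df=1)
text_features = vec.fit_transform(docs)           # sparse TF-IDF matrix

scaler = RobustScaler()
numeric_scaled = scaler.fit_transform(numeric)    # dense, scaled columns

# Stack the sparse text block and the numeric block side by side
X = hstack([text_features, csr_matrix(numeric_scaled)]).tocsr()
```

In a real pipeline, both the vectorizer and the scaler should be fitted on the training folds only and then applied to the held-out data with `transform`.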
Results
| Algorithm | Mean Precision (5-fold) | Mean Recall (5-fold) |
|---|---|---|
| Logistic Regression | 0.94 | 0.91 |
| SGD Classifier | 0.94 | 0.93 |
| Random Forest | 0.96 | 0.98 |
| Linear SVM | 0.95 | 0.92 |
Random Forest showed the best performance with mean recall of 0.982 over 5 folds and no misclassifications on the manually selected test samples.
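Mean precision and recall over 5 folds can be obtained with scikit-learn's `cross_validate`. The sketch below runs on synthetic data from `make_classification`, so its scores say nothing about the phishing dataset; the hyperparameters are placeholder assumptions, not the settings behind the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the real feature matrix and labels
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X,
    y,
    cv=cv,
    scoring=("precision", "recall"),
)

mean_precision = scores["test_precision"].mean()
mean_recall = scores["test_recall"].mean()
```

Swapping in another estimator, such as `LogisticRegression` or `SGDClassifier`, gives the corresponding figures for that algorithm under the same folds.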
Conclusion
URL-based phishing detection using NLP features combined with standard ML classifiers achieves strong performance. The Random Forest classifier is the clear winner here: its ensemble nature handles the high-dimensional TF-IDF feature space well, and it achieves the highest recall, which is the most critical metric for security applications.
Key takeaways:
- TF-IDF on tokenized URL text is a powerful and simple feature
- Structural URL features (dots, slashes, IP presence, port) add important signal
- Random Forest dominates; Logistic Regression is competitive but has lower recall
- Class imbalance should be handled explicitly, for example via class_weight='balanced'
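For intuition on the last point, scikit-learn documents the "balanced" heuristic as `n_samples / (n_classes * count_of_class)`. The sketch below reproduces that formula with the standard library only; the function name and the toy labels are illustrative assumptions.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights following n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy labels: 8 legitimate URLs (0) and 2 phishing URLs (1)
weights = balanced_class_weights([0] * 8 + [1] * 2)
```

Here the minority class gets weight 2.5 and the majority class 0.625, so errors on phishing URLs cost four times as much during training.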