Phishing Classification Using an Ensemble Model

As more everyday activity moves online, it is becoming easier for attackers to steal personal information through phishing, one of the most common and dangerous cybercrimes. This article walks through phishing classification with an ensemble model: using a curated dataset, we train and evaluate a model that distinguishes legitimate URLs from phishing URLs.

What is Phishing?

Phishing is a type of cyberattack that tricks people into revealing sensitive information, such as passwords or financial details. Attackers often use emails, messages, or websites that look legitimate to deceive victims. Phishing campaigns aim to create a false sense of security and can involve techniques like using fake links, infected attachments, or fake login pages to trick people into giving away their information.

How to be safe from Phishing?

Some countermeasures against phishing are discussed below:

Security training: Stay aware of the dangers of phishing through regular training. Learn to spot suspicious emails or messages so you can avoid falling prey to phishing. Awareness lets you make better decisions online.
Web filtering and URL analysis: Apply filters that scan websites in real time to block access to known phishing sites. It’s like having a security alarm that warns you before you open a potentially dangerous website.
Advanced Threat Protection (ATP): Use advanced security tools that learn from patterns and behaviors to detect phishing attempts. These tools can catch suspicious emails or files before they reach you, acting like a vigilant watchdog against cyber threats.
Endpoint security solutions: Equip your devices with security tools that specifically protect against phishing. It’s like a personal security guard for your computer, ready to block any phishing attempts that target your device.

Implementation: Phishing Classification using Ensemble Model

Importing required modules

First, we import the required Python modules: pandas, Matplotlib, Seaborn, scikit-learn and XGBoost.

Python3

import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, roc_auc_score
import xgboost as xgb
import matplotlib.pyplot as plt
import seaborn as sns

Dataset loading and Preview

Now we load the phishing dataset and map its target column to numeric labels. We then preview the first few rows of the raw dataset.

Python3

df = pd.read_csv('phishing_data')

# Map 'phishing' to 1 and 'legitimate' to 0 in the 'status' column as target
df['status'] = df['status'].map({'phishing': 1, 'legitimate': 0})

print("Dataset Preview:")
df.head()

Output:

Note: The screenshot is for demonstration purposes only, as all 89 columns cannot be visualized at once.
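After mapping the target, it can be worth checking that every label was actually converted, because `Series.map` with a dictionary returns NaN for any value that is not a key. The sketch below uses a small made-up DataFrame (not the article's dataset) with one deliberately mis-cased label to show the check.

```python
import pandas as pd

# Hypothetical toy data; the third label is deliberately mis-cased.
toy = pd.DataFrame({"status": ["phishing", "legitimate", "Phishing", "legitimate"]})

# Same mapping as in the article.
mapped = toy["status"].map({"phishing": 1, "legitimate": 0})

# Values missing from the mapping dictionary become NaN.
n_unmapped = int(mapped.isna().sum())
print(f"rows with an unmapped label: {n_unmapped}")
```

If the count is non-zero on the real data, the raw labels probably need normalizing (for example, lower-casing) before mapping.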

Exploratory Data Analysis

For Exploratory Data Analysis (EDA), we first draw a pie chart of the class distribution in the ‘status’ column, which records whether a URL is classified as ‘Legitimate’ or ‘Phishing’.
The chart is built from the count of each class, so each slice shows the proportion of ‘Legitimate’ or ‘Phishing’ instances. Custom colors and labels make the two classes easy to tell apart.
This visualization shows at a glance whether the two classes in the dataset are balanced.

Python3

# Plot a pie chart for the target column
plt.figure(figsize=(6, 6))

# Sort by class value so the labels line up: 0 -> Legitimate, 1 -> Phishing
df['status'].value_counts().sort_index().plot(kind='pie', autopct='%1.1f%%', colors=['lightcoral', 'lightblue'], labels=['Legitimate', 'Phishing'])

plt.title('Distribution of Target Classes')
plt.show()

Output:

Distribution of the target feature

The output above shows that our dataset is balanced, so no resampling method is required.
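The balance visible in the pie chart can also be checked numerically. The following is a minimal sketch on a made-up label series (6 phishing, 4 legitimate), not the article's data; on the real dataset the same two lines would be applied to `df['status']`.

```python
import pandas as pd

# Hypothetical labels: 1 = phishing, 0 = legitimate.
status = pd.Series([1, 0, 1, 0, 0, 1, 1, 0, 1, 1], name="status")

# Count rows per class, ordered by class value (0 first, then 1).
counts = status.value_counts().sort_index()

# Ratio of the larger class to the smaller one; 1.0 means perfectly balanced.
imbalance_ratio = counts.max() / counts.min()
print(counts.to_dict(), f"imbalance ratio: {imbalance_ratio:.2f}")
```

A ratio close to 1 supports skipping resampling; a much larger ratio would suggest considering class weights or resampling.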

Next we draw a pair plot as a second exploratory view. We select a few relevant columns, ‘length_url’, ‘nb_dots’, ‘nb_hyphens’, ‘nb_subdomains’ and ‘web_traffic’, together with the target column ‘status’, and plot them pairwise. This shows how the selected features relate to one another and how well they separate the two classes. Adding more features to the selection can give a fuller picture.

Python3

# Create a pairplot for select features
selected_features = ['length_url', 'nb_dots', 'nb_hyphens', 'nb_subdomains', 'web_traffic', 'status']

sns.pairplot(df[selected_features], hue='status', palette='husl')
plt.tight_layout()
plt.show()

Output:
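Columns such as ‘length_url’, ‘nb_dots’, ‘nb_hyphens’ and ‘nb_subdomains’ are lexical features derived from the URL string. The dataset's exact definitions are not given in this article, so the sketch below is only an assumed illustration of how such features might be computed; in particular, the subdomain rule and the sample URL are hypothetical.

```python
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Compute a few simple lexical features for one URL (illustrative definitions)."""
    hostname = urlparse(url).hostname or ""
    labels = [part for part in hostname.split(".") if part]
    return {
        "length_url": len(url),      # total number of characters
        "nb_dots": url.count("."),   # dots anywhere in the URL
        "nb_hyphens": url.count("-"),
        # Assumption: every hostname label left of the last two is a subdomain.
        "nb_subdomains": max(len(labels) - 2, 0),
    }


sample_url = "http://login.secure-bank.example.com/verify.php"
features = lexical_features(sample_url)
print(features)
```

The "last two labels" rule is a simplification: it miscounts hosts under multi-part suffixes such as `.co.uk`, where a public-suffix list would be needed.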

Data preprocessing and splitting

Now the dataset is prepared for training a phishing classification model.
The features and the target variable are separated into X (features) and y (target). Next, the features are split into numerical and categorical types.
One-hot encoding is applied to the categorical features using the pd.get_dummies function. The numerical features and the one-hot encoded categorical features are then concatenated into a processed feature set, denoted as X_processed.
XGBoost does not accept feature names containing ‘[’, ‘]’ or ‘<’, so these characters are removed from the column names. Additionally, one specific column with a problematic name is dropped to prevent issues during model training.
Finally, the dataset is split into training and testing sets using the train_test_split function, with 80% for training and 20% for testing, and a fixed random seed for reproducibility.

Python3

# Separate features and target variable
X = df.drop('status', axis=1)
y = df['status']

# Split the features into numerical and categorical columns
numerical_features = X.select_dtypes(include=['float64', 'int64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

# Perform one-hot encoding for categorical features
X_categorical = pd.get_dummies(X[categorical_features])
X_numerical = X[numerical_features]

# Concatenate numerical and one-hot encoded categorical features
X_processed = pd.concat([X_numerical, X_categorical], axis=1)

# Remove characters that XGBoost rejects in feature names ('[', ']', '<')
cleaned_columns = [col.replace('[','').replace(']','').replace('<','') for col in X_processed.columns]

# Replace the old feature names with the cleaned ones
X_processed.columns = cleaned_columns

# Drop one specific column with a problematic name
X_processed.drop(columns=['url_https://user7770001255.el.r.appspot.com/E-mail%20Address'], inplace=True)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=42)
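The name-cleaning step can have a side effect worth checking: two different original names may collapse into the same cleaned name. The sketch below reproduces the encode-and-clean steps on a made-up three-row DataFrame and lists any duplicate names afterwards. The toy URLs are hypothetical, and this is only one possible way a problematic column could arise, not necessarily the reason for the drop in the code above.

```python
import re

import pandas as pd

# Hypothetical toy data: two URLs that differ only by square brackets.
toy = pd.DataFrame({
    "length_url": [20, 18, 18],
    "url": ["http://a.example/[x]", "http://a.example/x", "http://b.example/y"],
})

# One-hot encode the text column, as pd.get_dummies does in the article.
encoded = pd.get_dummies(toy, columns=["url"])

# Strip the characters '[', ']' and '<' from every column name.
encoded.columns = [re.sub(r"[\[\]<]", "", name) for name in encoded.columns]

# Names that now occur more than once would confuse the model.
duplicates = encoded.columns[encoded.columns.duplicated()].tolist()
print("duplicate names after cleaning:", duplicates)
```

Running the same duplicate check on `X_processed.columns` would find such collisions programmatically instead of relying on a hard-coded column name.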

Defining an Ensemble model with instance of Hyperparameter tuning

In this stage we set up XGBoost, a popular gradient-boosted tree ensemble for classification tasks. Any other ensemble model could be used on the same data.
We tune it with RandomizedSearchCV over three hyperparameters: the number of trees (`n_estimators`), the step size with which each new tree corrects the previous ones (`learning_rate`), and the maximum depth of each tree (`max_depth`).
Additionally, we set a `random_state` to ensure that the process is reproducible.
The search selects the candidate with the best cross-validated F1-score. A model with those best hyperparameters is then trained on the training set, where it learns patterns in the features (X_train) that predict the target labels (y_train), that is, whether a URL is legitimate or phishing.

Python3

# Ensemble Model as XGBoost with Hyperparameter Tuning
param_dist = {
    'n_estimators': [100, 150, 200],
    'learning_rate': [0.05, 0.1, 0.2],
    'max_depth': [3, 4, 5]
}

xgb_model = xgb.XGBClassifier(random_state=42)

random_search = RandomizedSearchCV(estimator=xgb_model, param_distributions=param_dist, n_iter=10, cv=3, scoring='f1', random_state=42, verbose=1, n_jobs=-1)
random_search.fit(X_train, y_train)

# Use the best parameters from the randomized search
best_params = random_search.best_params_
xgb_model_tuned = xgb.XGBClassifier(**best_params, random_state=42)

# Fit the tuned model
xgb_model_tuned.fit(X_train, y_train)
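To make the search budget concrete: the parameter lists above define 3 × 3 × 3 = 27 possible combinations, and `n_iter=10` with `cv=3` means 10 of them are tried with 3 fits each. The sketch below illustrates that arithmetic with the standard library only; it is not scikit-learn's own sampler, so the combinations it draws will generally differ from those `RandomizedSearchCV` picks.

```python
import itertools
import random

# Same candidate values as in the article's param_dist.
param_dist = {
    "n_estimators": [100, 150, 200],
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [3, 4, 5],
}

# Enumerate the full grid of combinations.
names = list(param_dist)
grid = [dict(zip(names, values)) for values in itertools.product(*param_dist.values())]

# Draw 10 distinct combinations, mimicking n_iter=10 (illustrative sampler only).
rng = random.Random(42)
sampled = rng.sample(grid, k=10)

# Each sampled candidate is fitted once per cross-validation fold (cv=3).
n_cv_fits = len(sampled) * 3
print(f"grid size: {len(grid)}, sampled: {len(sampled)}, cross-validation fits: {n_cv_fits}")
```

Because the grid holds only 27 combinations, an exhaustive search would also be affordable here; random search pays off more as the grid grows.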

Model evaluation

Now we evaluate the model on several performance metrics: accuracy, precision, recall, F1-score, the classification report and the AUC-ROC score.

Python3

# Make predictions on the test set
y_pred_xgb_tuned = xgb_model_tuned.predict(X_test)

# Evaluate the XGBoost ensemble model with tuned hyperparameters
accuracy_xgb_tuned = accuracy_score(y_test, y_pred_xgb_tuned)
precision_tuned = precision_score(y_test, y_pred_xgb_tuned)
recall_tuned = recall_score(y_test, y_pred_xgb_tuned)
f1_tuned = f1_score(y_test, y_pred_xgb_tuned)

# AUC-ROC is computed from predicted probabilities of the positive class, not hard labels
auc_roc_value_tuned = roc_auc_score(y_test, xgb_model_tuned.predict_proba(X_test)[:, 1])

# Print the results
print("Classification Report:")
print(classification_report(y_test, y_pred_xgb_tuned))
print(f"AUC-ROC Value: {auc_roc_value_tuned}")
print(f"Tuned XGBoost Accuracy: {accuracy_xgb_tuned}")
print(f"Tuned XGBoost Precision: {precision_tuned}")
print(f"Tuned XGBoost Recall: {recall_tuned}")
print(f"Tuned XGBoost F1-score: {f1_tuned}")

Output:

Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.94      0.93       101
           1       0.94      0.92      0.93        99

    accuracy                           0.93       200
   macro avg       0.93      0.93      0.93       200
weighted avg       0.93      0.93      0.93       200

AUC-ROC Value: 0.9820982098209821
Tuned XGBoost Accuracy: 0.93
Tuned XGBoost Precision: 0.9381443298969072
Tuned XGBoost Recall: 0.9191919191919192
Tuned XGBoost F1-score: 0.9285714285714285
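The figures in the output above can be cross-checked by hand. The confusion-matrix counts below are not printed by the code; they are inferred from the reported support (101 legitimate, 99 phishing) and the reported precision and recall, so treat them as a reconstruction. With those counts, the standard formulas reproduce the reported metrics.

```python
# Inferred counts for the positive class (phishing = 1); a reconstruction, not program output.
tp = 91  # phishing URLs correctly flagged
fn = 8   # phishing URLs missed (91 + 8 = 99 phishing samples)
fp = 6   # legitimate URLs wrongly flagged
tn = 95  # legitimate URLs correctly passed (6 + 95 = 101 legitimate samples)

precision = tp / (tp + fp)                           # share of flagged URLs that are phishing
recall = tp / (tp + fn)                              # share of phishing URLs that are flagged
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.2f}")
```

The same counts also give the class 0 row of the report: precision 95/103 ≈ 0.92 and recall 95/101 ≈ 0.94.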

So, our model performs well on the test set: the AUC-ROC is about 98.2%, and accuracy, precision, recall and F1-score all fall between roughly 92% and 94%.
