Phishing Classification using Ensemble model

As more of daily life moves online, it is becoming easier for attackers to steal personal information through phishing, one of the most common and dangerous cybercrimes. This article covers phishing classification using an ensemble model: using a curated dataset, we will train and evaluate a model that distinguishes legitimate URLs from phishing URLs.

What is Phishing?

Phishing is a type of cyberattack that tricks people into revealing sensitive information, such as passwords or financial details. Attackers often use emails, messages, or websites that look legitimate to deceive victims. Phishing campaigns aim to create a false sense of security and can involve techniques like using fake links, infected attachments, or fake login pages to trick people into giving away their information.



How to be safe from Phishing?

Some of the countermeasures of Phishing are discussed below:

Implementation: Phishing Classification using Ensemble Model

Importing required modules

First, we will import the required Python modules: Pandas, scikit-learn, XGBoost, Matplotlib and Seaborn.






import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, roc_auc_score
import xgboost as xgb
import matplotlib.pyplot as plt
import seaborn as sns

Dataset loading and Preview

Now we will load the phishing dataset and map its target column to numeric labels. After that, we will preview the first few rows of the raw dataset for better understanding.




df = pd.read_csv('phishing_data')
# Map 'phishing' to 1 and 'legitimate' to 0 in the 'status' column as target
df['status'] = df['status'].map({'phishing': 1, 'legitimate': 0})
print("Dataset Preview:")
df.head()

Output:

Note: The screenshot is for demonstration only, as all 89 columns cannot be visualized at once.
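When a table is too wide to preview in one view, its shape and column types can be inspected instead. The sketch below does this on a small, made-up frame: the column names mirror those used in this article, but `toy_df` and all of its values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the phishing dataset; values are invented for illustration
toy_df = pd.DataFrame({
    "url": ["http://example.com", "http://login-example.test/verify"],
    "length_url": [18, 32],
    "nb_dots": [1, 1],
    "status": ["legitimate", "phishing"],
})

# Number of rows and columns
n_rows, n_cols = toy_df.shape
print(f"{n_rows} rows, {n_cols} columns")

# How many columns there are of each dtype (numeric vs. text)
dtype_counts = toy_df.dtypes.astype(str).value_counts()
print(dtype_counts)

# Print every column of the preview instead of a truncated view
with pd.option_context("display.max_columns", None):
    print(toy_df.head())
```

The same three calls can be applied to `df` once the real dataset is loaded.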

Exploratory Data Analysis




# Plot a pie chart for the target column
# sort_index() orders the counts as 0, 1 so the labels match the classes
plt.figure(figsize=(6, 6))
df['status'].value_counts().sort_index().plot(kind='pie', autopct='%1.1f%%', colors=['lightcoral', 'lightblue'], labels=['Legitimate', 'Phishing'])
plt.title('Distribution of Target Classes')
plt.show()

Output:

distribution of target feature

So, from the above output we can see that our dataset is balanced and does not require any extra resampling.
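Class balance can also be checked numerically rather than read off a chart. The following is a minimal sketch on hypothetical labels (0 = legitimate, 1 = phishing), not the article's data; the name `imbalance_ratio` is introduced here for illustration.

```python
import pandas as pd

# Hypothetical labels (0 = legitimate, 1 = phishing); not the article's data
status = pd.Series([0, 1, 1, 0, 1, 0, 0, 1, 1, 0], name="status")

# Share of each class, ordered by label
class_share = status.value_counts(normalize=True).sort_index()
print(class_share)

# Ratio of the smallest class to the largest: 1.0 means perfectly balanced
imbalance_ratio = class_share.min() / class_share.max()
print(f"minority/majority ratio: {imbalance_ratio:.2f}")
```

A ratio close to 1 suggests resampling is unnecessary; how low a ratio is acceptable depends on the problem.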

Next, we will explore the data with a pair plot. We will select a few relevant columns, ‘length_url’, ‘nb_dots’, ‘nb_hyphens’, ‘nb_subdomains’ and ‘web_traffic’, together with the target column ‘status’, and plot them in pairwise comparison. This helps us understand how the selected features relate to each other and to the target. Selecting more features is recommended for a fuller picture.
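To make the feature names above more concrete, here is a rough sketch of how such lexical counts might be derived from a raw URL. The function `lexical_features` and its definitions are assumptions for illustration; the dataset's own extraction rules may differ (in particular for `nb_subdomains`), and `web_traffic` cannot be computed from the URL string alone.

```python
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Compute simple counts from a URL string (assumed definitions)."""
    hostname = urlparse(url).hostname or ""
    return {
        "length_url": len(url),
        "nb_dots": url.count("."),
        "nb_hyphens": url.count("-"),
        # Assumption: hostname labels beyond the last two count as subdomains
        "nb_subdomains": max(len(hostname.split(".")) - 2, 0),
    }


# Made-up URL used only to exercise the function
example_url = "http://secure-login.accounts.example.com/index.php"
features = lexical_features(example_url)
print(features)
```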




# Create a pairplot for select features
selected_features = ['length_url', 'nb_dots', 'nb_hyphens', 'nb_subdomains', 'web_traffic', 'status']
sns.pairplot(df[selected_features], hue='status', palette='husl')
plt.tight_layout()
plt.show()

Output:

Data preprocessing and splitting




# Separate features and target variable
X = df.drop('status', axis=1)
y = df['status']
 
# Split the dataset into numerical and categorical features
numerical_features = X.select_dtypes(include=['float64', 'int64']).columns
categorical_features = X.select_dtypes(include=['object']).columns
 
# Perform one-hot encoding for categorical features
X_categorical = pd.get_dummies(X[categorical_features])
X_numerical = X[numerical_features]
 
# Concatenate numerical and one-hot encoded categorical features
X_processed = pd.concat([X_numerical, X_categorical], axis=1)
# Remove invalid characters from feature names
cleaned_columns = [col.replace('[','').replace(']','').replace('<','') for col in X_processed.columns]
 
# Replace the old feature names with the cleaned ones
X_processed.columns = cleaned_columns
# Drop the duplicate column
X_processed.drop(columns=['url_https://user7770001255.el.r.appspot.com/E-mail%20Address'], inplace=True)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=42)
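One thing to watch in the preprocessing above: if a text column such as the raw URL is nearly unique per row, `pd.get_dummies` creates roughly one new column per row. The sketch below illustrates this on invented data and shows one possible alternative, dropping identifier-like columns before encoding. This is a suggestion rather than the approach used in this article, and the `max_unique_ratio` threshold is a hypothetical choice.

```python
import pandas as pd

# Invented data: every URL is unique, as is typical for an identifier column
toy = pd.DataFrame({
    "url": ["http://a.test", "http://b.test", "http://c.test", "http://d.test"],
    "length_url": [13, 13, 13, 13],
    "nb_dots": [1, 1, 1, 1],
})

# One-hot encoding a unique-per-row text column yields one new column per row
encoded = pd.get_dummies(toy, columns=["url"])
print(encoded.shape)

# Alternative (assumption, not the article's method): drop identifier-like columns
max_unique_ratio = 0.5  # hypothetical threshold
text_cols = toy.select_dtypes(exclude="number").columns
id_like = [c for c in text_cols if toy[c].nunique() / len(toy) > max_unique_ratio]
reduced = toy.drop(columns=id_like)
print(reduced.columns.tolist())
```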

Defining an Ensemble model with Hyperparameter tuning




# Ensemble Model as XGBoost with Hyperparameter Tuning
param_dist = {
    'n_estimators': [100, 150, 200],
    'learning_rate': [0.05, 0.1, 0.2],
    'max_depth': [3, 4, 5]
}
 
xgb_model = xgb.XGBClassifier(random_state=42)
random_search = RandomizedSearchCV(estimator=xgb_model, param_distributions=param_dist, n_iter=10, cv=3, scoring='f1', random_state=42, verbose=1, n_jobs=-1)
random_search.fit(X_train, y_train)
# Use the best parameters from the randomized search
best_params = random_search.best_params_
xgb_model_tuned = xgb.XGBClassifier(**best_params, random_state=42)
 
# Fit the tuned model
xgb_model_tuned.fit(X_train, y_train)
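As a side note, `RandomizedSearchCV` refits the best configuration on the full training data by default (`refit=True`), so the fitted model is also available as `best_estimator_`. The sketch below shows this on synthetic data. To stay self-contained it swaps in scikit-learn's `GradientBoostingClassifier` for XGBoost, and the smaller parameter grid is an assumption chosen to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data standing in for the phishing features
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Reduced grid (assumption) so the search runs quickly
param_dist = {
    "n_estimators": [20, 40],
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [2, 3],
}

search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=4,
    cv=3,
    scoring="f1",
    random_state=42,
)
search.fit(X_train, y_train)

# With refit=True (the default), the best configuration is already fit on X_train
best_model = search.best_estimator_
print(search.best_params_)
print(f"Mean cross-validated F1: {search.best_score_:.3f}")
print(f"Test accuracy: {best_model.score(X_test, y_test):.3f}")
```

Reusing `best_estimator_` avoids a second training pass; building a fresh classifier from `best_params_`, as done above, should give an equivalent model when the same random seed is used.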

Model evaluation

Now, we will evaluate our model on several performance metrics: accuracy, precision, recall, F1-score, the classification report and the AUC-ROC score.




# Make predictions on the test set
y_pred_xgb_tuned = xgb_model_tuned.predict(X_test)
# Evaluate the XGBoost ensemble model with tuned hyperparameters
accuracy_xgb_tuned = accuracy_score(y_test, y_pred_xgb_tuned)
precision_tuned = precision_score(y_test, y_pred_xgb_tuned)
recall_tuned = recall_score(y_test, y_pred_xgb_tuned)
f1_tuned = f1_score(y_test, y_pred_xgb_tuned)
auc_roc_value_tuned = roc_auc_score(y_test, xgb_model_tuned.predict_proba(X_test)[:, 1])
# Print the results
print("Classification Report:")
print(classification_report(y_test, y_pred_xgb_tuned))
print(f"AUC-ROC Value: {auc_roc_value_tuned}")
print(f"Tuned XGBoost Accuracy: {accuracy_xgb_tuned}")
print(f"Tuned XGBoost Precision: {precision_tuned}")
print(f"Tuned XGBoost Recall: {recall_tuned}")
print(f"Tuned XGBoost F1-score: {f1_tuned}")

Output:

Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.94      0.93       101
           1       0.94      0.92      0.93        99

    accuracy                           0.93       200
   macro avg       0.93      0.93      0.93       200
weighted avg       0.93      0.93      0.93       200
AUC-ROC Value: 0.9820982098209821
Tuned XGBoost Accuracy: 0.93
Tuned XGBoost Precision: 0.9381443298969072
Tuned XGBoost Recall: 0.9191919191919192
Tuned XGBoost F1-score: 0.9285714285714285

So, our model performs well on the test set, with an AUC-ROC value of about 98.2% and accuracy, precision, recall and F1-score all between roughly 92% and 94%.
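The printed scores can be tied back to the underlying counts. The values below (91 true positives, 6 false positives, 8 false negatives, 95 true negatives) are not part of the original output; they are inferred from the printed precision, recall and support figures, and the sketch simply checks that they reproduce the reported metrics for the phishing class.

```python
# Confusion-matrix counts inferred from the printed report (an assumption)
tp, fp, fn, tn = 91, 6, 8, 95

# Standard definitions for the positive (phishing) class
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-score:  {f1:.4f}")
```

Read this way, the model missed 8 of the 99 phishing URLs in the test set and flagged 6 of the 101 legitimate ones.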

