
What is Isolation Forest?

Last Updated : 02 Apr, 2024

Isolation Forest is a state-of-the-art anomaly detection algorithm known for its efficiency and simplicity. By isolating anomalies through binary partitioning, it quickly identifies outliers with minimal computational overhead, making it a go-to choice for anomaly detection in areas ranging from cybersecurity to finance. In this article, we are going to explore the fundamentals of the Isolation Forest algorithm.

What is Isolation Forest?

Isolation Forest is a powerful anomaly detection algorithm renowned for its efficiency and versatility. Anomaly detection is a core task in data analysis: identifying patterns or events that deviate significantly from the norm in a dataset. Isolation Forest operates by isolating anomalies within a dataset through a process of recursive partitioning.

  • Unlike traditional methods that rely on proximity measures, Isolation Forest takes a unique approach by randomly selecting features and splitting them along random values until individual data points are isolated.
  • This “isolating” process is responsible for creating partitions or “trees” that aim to separate anomalies from normal observations.
  • Anomalies, being fewer in number and further from the norm, typically require fewer splits to isolate, making them easier to detect.

By leveraging the concept of isolation, this algorithm efficiently distinguishes between normal and abnormal behavior, facilitating prompt action to mitigate potential risks or exploit valuable insights hidden within data anomalies.

Isolation Forest Algorithm with Example

[Figure: an input dataset is recursively partitioned; outliers are isolated after fewer splits than normal points]

In the diagram, “Input Dataset” is at the top. This dataset is then split into two branches, labeled “Normal with uncommon” and “Outliers.”

The “Normal with uncommon” branch keeps splitting until it reaches a label of “Normal.” This suggests that normal data points, even those with some unusual characteristics, require many splits before they are isolated.

The “Outliers” branch reaches a label of “Outliers” more quickly, suggesting that outliers can be identified relatively easily using Isolation Forest.

How Does the Isolation Forest Algorithm Work?

Before jumping into the working principle of the Isolation Forest algorithm, let’s discuss its two essential concepts:

  • Random Partitioning: In Isolation Forest, random partitioning involves selecting a random feature and then choosing a random value within the range of that feature’s values to split the data. This process is repeated recursively to create a partitioning tree, where each partition isolates a subset of the data. By randomly partitioning the data, Isolation Forest efficiently separates anomalies from normal data points, as anomalies are more likely to end up in smaller, isolated partitions.
  • Isolation Path: The isolation path of a data point within an isolation tree is the number of splits required to isolate that data point. Anomalies, being less representative of the overall data distribution, typically require fewer splits to isolate than normal data points. By averaging the isolation path lengths across multiple trees, Isolation Forest computes an anomaly score for each data point, enabling the identification of outliers based on their deviation from the norm (a small sketch of this idea follows the list).
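
To make the isolation path concrete, here is a minimal, hypothetical pure-Python sketch (a toy illustration, not how scikit-learn implements the algorithm) that counts how many random splits are needed to isolate a single value in a one-dimensional list. The obvious outlier is usually isolated after far fewer splits than an inlier.

Python3
import random

def path_length(point, data, depth=0):
    """Count the random splits needed to isolate `point` within `data` (1-D toy)."""
    # Stop when the point is alone or the remaining values are all identical
    if len(data) <= 1 or min(data) == max(data):
        return depth
    split = random.uniform(min(data), max(data))                 # random split value
    side = [v for v in data if (v < split) == (point < split)]   # keep the point's side
    if len(side) == len(data):                                   # degenerate split, stop
        return depth
    return path_length(point, side, depth + 1)

values = [10, 11, 12, 13, 14, 95]   # 95 is an obvious outlier
print("outlier path length:", path_length(95, values))   # usually 1
print("normal  path length:", path_length(12, values))   # usually larger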

Workings of the Isolation Forest Algorithm

  1. Random Partitioning: Isolation Forest operates by randomly selecting features and splitting data points along these features at random thresholds, creating isolation trees.
  2. Recursive Isolation: Each partition isolates a subset of data points, aiming to separate anomalies from normal observations by creating increasingly smaller partitions.
  3. Anomaly Identification: Anomalies are identified as data points that require fewer splits to isolate, because they lie in sparse regions far from the bulk of the data, so a random split is likely to separate them from the rest early on.
  4. Creating Isolation Path: The isolation path of a data point within the tree is measured by the number of splits required to isolate it, serving as a measure of its anomaly score.
  5. Ensemble of Trees: Isolation Forest constructs multiple isolation trees independently, forming an ensemble that collectively evaluates anomalies based on their isolation paths across the trees.
  6. Anomaly score calculation: The average path length across all trees is computed for each data point and converted into an anomaly score indicating how much it deviates from the norm (a small numeric sketch of this score follows the list).
  7. Classification: A predefined threshold is used to distinguish normal from abnormal patterns; data points with anomaly scores above the threshold are flagged as anomalies.
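
The anomaly score in the original Isolation Forest formulation is s(x, n) = 2^(-E[h(x)] / c(n)), where E[h(x)] is the average path length of point x across the trees and c(n) normalises by the path length expected for a sample of size n. The small sketch below (a standalone illustration of that formula, not taken from the implementation later in this article) shows how a short average path maps to a score near 1 while a longer path maps to a score near or below 0.5.

Python3
import numpy as np

def c(n):
    """Average path length of an unsuccessful search in a binary tree of n points."""
    if n <= 1:
        return 0.0
    harmonic = np.log(n - 1) + 0.5772156649          # harmonic number approximation
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n_samples):
    """s(x, n) = 2 ** (-E[h(x)] / c(n)); values well above 0.5 indicate anomalies."""
    return 2.0 ** (-mean_path_length / c(n_samples))

# Trees built on 256-sample subsets (a common sub-sampling size)
print(anomaly_score(3.0, 256))    # ~0.82 -> likely anomaly
print(anomaly_score(12.0, 256))   # ~0.44 -> likely normal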

Implementation of Isolation Forest

In this section, we delve into the implementation of Isolation Forest. We will perform anomaly detection on credit card transactions using the following steps:

Step 1: Importing required libraries

Python3
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler

Step 2: Dataset loading and pre-processing

Now we will load the well-known Credit Card Fraud Detection dataset and limit it to 40,000 rows for faster processing. We then standardize all features except the target variable ‘Class’ using StandardScaler, ensuring that each feature has a mean of 0 and a standard deviation of 1, and convert the standardized array into a DataFrame. Finally, we separate the features (X) from the target variable (y), where ‘X’ contains all columns except ‘Class’ and ‘y’ contains only the ‘Class’ column indicating whether a transaction is fraudulent.

Python3
# Load the first 40,000 rows of the Credit Card Fraud dataset
# https://www.kaggle.com/mlg-ulb/creditcardfraud
credit_data = pd.read_csv('creditcard.csv', nrows=40000)

# Standardize every feature except the target column 'Class'
scaled_data = StandardScaler().fit_transform(
    credit_data.loc[:, credit_data.columns != 'Class'])
df = pd.DataFrame(data=scaled_data)

# Separate features and target variable
X = credit_data.drop(columns=['Class'])
y = credit_data['Class']

Step 3: Defining and fitting the Isolation Forest model

Now it is time to train our Isolation Forest model. First, the fraction of outliers in the dataset is estimated as the ratio of fraudulent transactions (‘Class’ equals 1) to non-fraudulent transactions (‘Class’ equals 0). An Isolation Forest model is then created and fitted to the scaled data. Its hyperparameters are defined as follows: ‘n_estimators’ is set to 100, the number of base estimators in the ensemble; ‘contamination’ is set to the previously calculated outlier fraction, the expected proportion of outliers in the dataset; and ‘random_state’ is fixed for reproducibility.

Python3
# Determine the fraction of outliers
outlier_fraction = len(credit_data[credit_data['Class']==1])/float(len(credit_data[credit_data['Class']==0]))
# Create and fit the Isolation Forest model
model =  IsolationForest(n_estimators=100, contamination=outlier_fraction, random_state=42)
model.fit(df)

Output:

IsolationForest(contamination=0.0026067776218167233, random_state=42)

Step 4: Model evaluation

Now we will evaluate how accurately the model separates outliers, i.e. potential anomalies, from normal transactions. We compute anomaly scores with the model’s decision_function, map its predictions to the dataset’s 0/1 labels, and print the resulting accuracy.

Python3
# Anomaly scores from the decision function (lower scores are more anomalous)
scores_prediction = model.decision_function(df)
# predict() returns 1 for inliers and -1 for outliers; map them to 0/1 labels
y_pred = model.predict(df)
y_pred[y_pred == 1] = 0
y_pred[y_pred == -1] = 1
# Print the accuracy in separating outliers or anomalies
print("Accuracy in finding anomaly:", accuracy_score(y, y_pred))

Output:

Accuracy in finding anomaly: 0.997175

So, we have achieved above 99% accuracy.
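
Because fraudulent transactions are extremely rare here (the contamination value above is roughly 0.26%), a model that labels every transaction as normal would also score above 99% accuracy, so per-class precision and recall are more informative. A short follow-up using the classification_report already imported in Step 1 (assuming y and y_pred from the cells above) could look like this:

Python3
# Per-class precision and recall give a clearer picture than raw accuracy
print(classification_report(y, y_pred, target_names=['normal', 'anomaly']))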

Step 5: Comparative visualization

Now we will plot normal vs. anomalous instances for a chosen feature of the dataset. Here we plot the ‘Amount’ feature, but you can change the feature name to visualize another feature’s results.

Python3
# Selecting the feature for y-axis
y_feature = credit_data['Amount']    # change the feature name to visualize another

# Adding the predicted labels to the original dataset
credit_data['predicted_class'] = y_pred

# Plotting the graph
plt.figure(figsize=(7, 4))
sns.scatterplot(x=credit_data.index, y=y_feature, hue=credit_data['predicted_class'], palette={0: 'blue', 1: 'red'}, s=50)
plt.title('Visualization of Normal vs Anomalous Transactions')
plt.xlabel('Data points')
plt.ylabel(y_feature.name)
plt.legend(title='Predicted Class', loc='best')
plt.show()

Output:

From the above plot, we can see that the normal and anomalous instances are well separated, with very little overlap.

Advantages of Isolation Forest

  1. Efficiency and flexibility: Isolation Forest exhibits remarkable robustness, especially on high-dimensional datasets, thanks to its ability to isolate anomalies through random splitting. Unlike traditional methods such as k-means or hierarchical clustering, it does not need to compute distances between data points, so its computational cost remains small, which makes it highly scalable for real-time anomaly detection tasks.
  2. Tolerance for outliers: One of Isolation Forest’s most notable strengths is its tolerance for outliers. By design, the algorithm excels at isolating anomalies with only a few random splits, which makes it particularly effective when anomalies are few in number or show distinct differences from the norm. Furthermore, since Isolation Forest does not rely on distance-based measures, it is less susceptible to the distorting effects of extreme values, ensuring reliable anomaly detection performance across diverse datasets and fields.
  3. Ease of implementation and interpretation: Isolation Forest is quite straightforward to implement, thanks to its simple design and minimal overhead. The simplicity of the algorithm makes it accessible even to practitioners with limited machine learning expertise, allowing for rapid deployment in a variety of applications. Furthermore, the binary partitioning nature of Isolation Forest facilitates interpretability, as anomalies are identified based on their isolation paths within the constructed trees. This transparency enhances trust in the detection results and facilitates post-analysis interpretation for decision-making.
  4. Handling High-Dimensional Data: Isolation Forest excels in handling high-dimensional data, which poses challenges for many traditional anomaly detection techniques. By randomly selecting features for partitioning, the algorithm effectively mitigates the curse of dimensionality, maintaining robust performance even in datasets with numerous variables. This makes Isolation Forest well-suited for applications such as image processing, text mining, and sensor data analysis, where datasets often exhibit complex, multidimensional structures.

Limitations of Isolation Forest

Despite its advantages, the Isolation Forest algorithm has potential limitations, which are discussed below:

  • Prone to overfitting: While Isolation Forest is often robust to outliers, it can be prone to overfitting, especially on small or highly imbalanced datasets. In such cases, the algorithm may over-partition the data, producing trees that fail to generalize well to unseen data. Careful parameter tuning and cross-validation are necessary to mitigate this risk and ensure optimal performance (a minimal tuning sketch follows this list).
  • Limited sensitivity to global anomalies: Despite its efficiency in detecting local anomalies, Isolation Forest may struggle to detect global anomalies that span multiple regions of the dataset, because the algorithm isolates points based on their individual characteristics rather than the global data distribution. Capturing such patterns may require alternative anomaly detection methods or combining Isolation Forest with suitable preprocessing.
  • Effects of correlated features: Performance can degrade on datasets with highly correlated features. Splitting on random features in such cases may produce redundant partitions, reducing the algorithm’s ability to isolate anomalies successfully. Preliminary steps such as feature selection or dimensionality reduction can help alleviate this problem by reducing feature redundancy and improving the algorithm’s discriminative ability.
  • Problem with sequential data: Isolation Forest is inherently designed for datasets whose observations are independent, so it can face challenges when applied to ordinal, sequential, or time-series data. Sequential data often exhibit temporal dependencies and evolving patterns that require specialized anomaly detection approaches. While adaptations of Isolation Forest for sequential data exist, such as extending the algorithm to construct isolation trees along temporal sequences, addressing this limitation effectively remains an ongoing research area in anomaly detection.
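
To address the parameter-tuning point above, here is a minimal, hypothetical tuning sketch. It assumes labeled data are available for validation (the X and y prepared in the implementation section) and scores each candidate model with ROC-AUC on its anomaly scores; the candidate values are illustrative only.

Python3
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

# Try a few ensemble sizes and compare ROC-AUC of the anomaly scores
for n_estimators in (50, 100, 200):
    candidate = IsolationForest(n_estimators=n_estimators, random_state=42).fit(X)
    # decision_function is higher for normal points, so negate it to score anomalies
    anomaly_scores = -candidate.decision_function(X)
    print(n_estimators, roc_auc_score(y, anomaly_scores))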

Conclusion

We can conclude that Isolation Forest emerges as a powerful anomaly detection algorithm with notable advantages such as efficiency, scalability, and robustness to outliers.


