Novelty Detection with Local Outlier Factor (LOF) in Scikit Learn

Novelty detection is the task of identifying previously unseen data points as being different from the “normal” data points in a dataset. It is used in a variety of applications, such as fraud detection and error detection.

There are several different approaches to novelty detection, including:

  1. One-class classification: This approach involves training a classifier only on the normal data points in a dataset and then using it to predict whether a new data point is normal or a novelty (see the sketch after this list).
  2. Density-based methods: These methods calculate the local density of points around each data point and compare it to the densities of points around other data points. Data points with a low density relative to their neighbors are considered to be novelties.
  3. Distance-based methods: These methods calculate the distances between each data point and its nearest neighbors, and data points that are significantly far away from their nearest neighbors are considered to be novelties.
  4. Clustering-based methods: These methods use clustering algorithms to group the data points into clusters, and data points that do not belong to any of the clusters are considered to be novelties.
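
As a concrete illustration of the first approach, here is a minimal one-class classification sketch using scikit-learn's OneClassSVM (the random data and the nu value here are arbitrary placeholders):

Python3

import numpy as np
from sklearn.svm import OneClassSVM

# Train only on "normal" data
X_train = np.random.randn(200, 2)
clf = OneClassSVM(nu=0.05).fit(X_train)

# Predict on new points: 1 = normal, -1 = novelty
X_new = np.array([[0.0, 0.0], [6.0, 6.0]])
print(clf.predict(X_new))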

Novelty detection can be a useful tool for identifying previously unseen data points that are significantly different from the normal data points in a dataset. It can be used to detect fraud, errors, or other unusual patterns in a dataset.

Novelty Detection vs. Outlier Detection

Novelty detection and outlier detection are closely related but distinct concepts. Outlier detection refers to the task of identifying data points that are significantly different from the majority of the data points in a dataset. These data points are often referred to as “outliers.”

On the other hand, novelty detection refers to the task of identifying previously unseen data points as being different from the “normal” data points in a dataset. In other words, novelty detection is about identifying data points that are different from what the model has seen before.

Both tasks involve identifying data points that differ significantly from the majority of the data. The key difference lies in the training data: in outlier detection, the training data itself is assumed to contain outliers, and the goal is to flag them; in novelty detection, the training data is assumed to be clean, and the goal is to decide whether new, previously unseen observations differ from it.
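
In scikit-learn terms, the same LocalOutlierFactor class supports both modes, as the rest of this article shows in detail; a minimal sketch of the difference (with arbitrary random data):

Python3

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X_train = np.random.randn(100, 2)

# Outlier detection (default): label the training data itself
labels = LocalOutlierFactor().fit_predict(X_train)

# Novelty detection: fit on clean data, then score unseen points
detector = LocalOutlierFactor(novelty=True).fit(X_train)
print(detector.predict(np.array([[5.0, 5.0]])))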

LocalOutlierFactor and Reachability Distance

The Local Outlier Factor (LOF) is an algorithm for identifying anomalous data points in a dataset. It does this by measuring the local density of points around each data point and comparing it to the densities of points around other data points.

To calculate the local density of points around each data point, the LOF algorithm uses a measure called the reachability distance. The reachability distance of a data point is a measure of how “difficult” it is to reach that point from other points in the dataset.

To calculate the reachability distance, the LOF algorithm first identifies the k nearest neighbors of each data point, where k is a user-specified parameter. The reachability distance of a data point A from a neighbor B is then defined as the larger of the actual distance between A and B and the k-distance of B (the distance from B to its own k-th nearest neighbor). This definition smooths out the distances, so that points deep inside a dense region cannot have arbitrarily small reachability distances.

The reachability distance is used to calculate the local reachability density of a data point, which is defined as the inverse of the average reachability distance between the data point and its k nearest neighbors. The local reachability density is a measure of the local density of points around the data point: a high value means the point lies in a dense region, while a low value means its neighbors are comparatively far away.

Finally, the local outlier factor of a data point is calculated as the ratio of the average local reachability density of its k nearest neighbors to the local reachability density of the data point itself. A score close to 1 indicates that the data point has a density similar to its neighbors and is likely a normal point, while a score substantially greater than 1 indicates that the data point sits in a sparser region than its neighbors and is likely an outlier.
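
To make these definitions concrete, here is a minimal NumPy sketch of the computation on a toy one-dimensional dataset (the data and k are arbitrary; scikit-learn's LocalOutlierFactor performs an equivalent computation internally, up to sign):

Python3

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.0], [0.1], [0.2], [0.3], [5.0]])  # last point is isolated
k = 2

# Distances/indices of each point's k nearest neighbors (excluding itself)
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, idx = nn.kneighbors(X)
dists, idx = dists[:, 1:], idx[:, 1:]

k_dist = dists[:, -1]  # k-distance of every point

# Reachability distance of A from neighbor B: max(k-distance(B), d(A, B))
reach = np.maximum(k_dist[idx], dists)

# Local reachability density: inverse of the mean reachability distance
lrd = 1.0 / reach.mean(axis=1)

# LOF: average lrd of the neighbors divided by the point's own lrd
lof = lrd[idx].mean(axis=1) / lrd
print(lof)  # the isolated point gets a score well above 1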

The reachability distance and local reachability density are used by the LOF algorithm to identify anomalous data points, such as fraud or errors, in a dataset. The algorithm is useful for identifying data points that are significantly different from their neighbors, and it is often used as a preprocessing step for other machine learning algorithms, such as clustering or classification, as sketched below.
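
For example, here is a short sketch of using LOF to drop suspected outliers before passing the data on to another estimator (the random data and parameter choices are illustrative only):

Python3

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.randn(200, 4)

# Keep only the points LOF labels as inliers (+1)
mask = LocalOutlierFactor(n_neighbors=20).fit_predict(X) == 1
X_clean = X[mask]
print(X.shape, "->", X_clean.shape)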

Step-by-Step Implementation:

In scikit-learn, the LocalOutlierFactor class in the sklearn.neighbors module can be used to perform outlier detection and novelty detection with the local outlier factor (LOF) algorithm. LOF is a density-based method that calculates the local density of each sample in the dataset and identifies samples that have a significantly lower density than their neighbors. These samples are considered to be outliers or novelties.

To use the LocalOutlierFactor class, you need to create an instance of the class and fit it to the data using the fit() method. For example:

Python3
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
 
# Generate random data
X = np.random.randn(100, 10)
 
# Create a LocalOutlierFactor estimator
# and fit it to the data
estimator = LocalOutlierFactor()
estimator.fit(X)


Once the LocalOutlierFactor estimator is fitted to the data, you can obtain the outlier scores for each sample in the training set. The scores are based on the local density of each sample: values close to -1 correspond to inliers, while substantially lower (more negative) values indicate likely outliers.

To obtain the outlier scores, you can use the negative_outlier_factor_ attribute of the estimator. For example:

Python3
# Obtain the outlier scores for each sample
outlier_scores = estimator.negative_outlier_factor_
 
# Print the outlier scores for each sample
print(outlier_scores)


This code will print the outlier scores for each sample in the dataset. You can then use these scores to identify samples that are considered to be outliers or novelties.

You can also specify hyperparameters for the LocalOutlierFactor estimator, such as the number of neighbors to use for density estimation (the n_neighbors parameter) and the expected proportion of outliers in the dataset (the contamination parameter). For example:

Python3
# Create a LocalOutlierFactor estimator with
# hyperparameters and fit it to the data
estimator = LocalOutlierFactor(n_neighbors=5,
                               contamination=0.1)
estimator.fit(X)


This code creates a LocalOutlierFactor estimator with the specified n_neighbors and contamination values and fits it to the data. The contamination value sets the threshold used to decide which scores count as outliers. The optimal values for these hyperparameters depend on the specific dataset and should be determined through experimentation.
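
Because there is no labeled ground truth in this unsupervised setting, one illustrative (and deliberately informal) way to probe the sensitivity to n_neighbors is to refit with several values and inspect how the most extreme score changes:

Python3

# Illustrative sweep over n_neighbors (not a formal tuning procedure)
for k in (5, 10, 20, 35):
    est = LocalOutlierFactor(n_neighbors=k)
    est.fit(X)
    print(k, est.negative_outlier_factor_.min())

Putting the earlier pieces together, the complete program and its output look like this: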

Python3
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
 
# Generate random data
X = np.random.randn(100, 10)
 
# Create a LocalOutlierFactor estimator and fit it to the data
estimator = LocalOutlierFactor()
estimator.fit(X)
 
# Obtain the outlier scores for each sample
outlier_scores = estimator.negative_outlier_factor_
 
# Print the outlier scores for each sample
print(outlier_scores)


Output:

[-1.29336673 -0.98663101 -1.01328312 -0.98843551 -1.0340768  -1.00630881
-0.99046301 -1.01851411 -1.00941979 -1.02585983 -0.99454281 -1.03826622
-1.00920089 -1.08435498 -0.98485871 -0.99414    -1.02193122 -1.13255894
-0.98870854 -1.08340603 -1.03462261 -0.99815638 -1.06346218 -1.05982866
-1.15648965 -0.97513857 -0.99884846 -1.01392852 -1.00915394 -1.02404234
-1.02786408 -0.99580036 -1.03977835 -1.0856313  -1.0369034  -1.01757096
-0.98141263 -0.9666988  -0.99826695 -0.98593089 -1.02410345 -1.03045039
-1.01843609 -1.00225046 -0.99271876 -1.04562085 -1.04143942 -1.06242416
-1.24595953 -1.21899134 -1.06365838 -0.99014377 -1.00305435 -0.9863289
-0.96339396 -0.99409326 -1.0110496  -0.99468687 -0.99819612 -1.02407759
-1.05802008 -1.26005187 -1.00061505 -0.96921694 -0.97023558 -1.05295619
-1.01049517 -1.02283846 -0.985272   -0.99179016 -1.00560031 -1.0708834
-1.05491243 -1.00190921 -1.13925738 -1.04666919 -1.00216646 -0.99883435
-1.0091551  -0.98864925 -1.03776316 -1.12661428 -1.05180372 -1.20713398
-1.02207957 -1.00696503 -0.98899481 -1.04758736 -0.98664004 -0.97553829
-0.98835569 -1.19497038 -0.99148634 -1.00208273 -1.01195274 -1.06184659
-1.05820208 -0.99283114 -1.11214065 -0.97880798]

LOF for Outlier Detection

Here is a step-by-step explanation of the code example that demonstrates how to use the LocalOutlierFactor model for outlier detection and novelty detection in scikit-learn:

Python3
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import LocalOutlierFactor


Importing Dataset

These three lines import the necessary modules and functions from scikit-learn. The load_breast_cancer function is used to load the breast cancer dataset, the StandardScaler transformer is used to standardize the data, and the LocalOutlierFactor class is used to create the outlier detection and novelty detection model.

This line loads the breast cancer dataset and stores it in the variables X and y. The return_X_y parameter is set to True to return the data and the target values separately. 

Python3
# Load the dataset
X, y = load_breast_cancer(return_X_y=True)


Standardization of the Data

These two lines create a StandardScaler transformer and use it to standardize the data. Standardization of data refers to the process of scaling the data so that it has zero mean and unit variance. This is often done as a preprocessing step before applying machine learning algorithms, as it can help to stabilize the variance of the features and improve the model’s performance.

Python3
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


This line creates a LocalOutlierFactor model with n_neighbors=20 and the default value of novelty=False, which indicates that the model will be used for outlier detection rather than novelty detection.

Python3
# Create the LocalOutlierFactor
# model for outlier detection
lof_outlier = LocalOutlierFactor(n_neighbors=20)


This line fits the LocalOutlierFactor model to the standardized data and predicts a label for each data point. The fit_predict method returns an array of labels, with 1 representing inliers and -1 representing outliers.

Python3
# Fit the model to the data and predict
# a label for each data point
outlier_labels = lof_outlier.fit_predict(X_scaled)
 
# Identify the outlier data points (boolean mask)
outlier_indices = outlier_labels == -1
print("Outlier indices:", outlier_indices)


These lines identify the outlier data points by building a boolean mask that is True wherever the predicted label equals -1 and printing it. The next snippet creates a new LocalOutlierFactor model with n_neighbors=20 and novelty=True, which indicates that the model will be used for novelty detection rather than outlier detection: it is fitted on (assumed clean) training data and can then score new, unseen observations.

Python3
# Create the LocalOutlierFactor model for
# novelty detection (novelty=True lets the
# fitted model predict on new, unseen data)
lof_novelty = LocalOutlierFactor(n_neighbors=20,
                                 novelty=True)
 
lof_novelty.fit(X_scaled)
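
Once fitted with novelty=True, the estimator exposes predict, decision_function, and score_samples for new data (fit_predict is not available in this mode). A brief illustrative check on a single, arbitrary standardized sample:

Python3

import numpy as np

# Score one (arbitrary) new observation
sample = np.zeros((1, X_scaled.shape[1]))
print(lof_novelty.predict(sample))            # 1 = inlier, -1 = novelty
print(lof_novelty.decision_function(sample))  # negative means novelty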


Complete Code Implementation:

In the code below, all of the sub-steps are combined into a single block so that we can see the novelty detection model in action.

Python3
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import LocalOutlierFactor
 
# Load the dataset
X, y = load_breast_cancer(return_X_y=True)
 
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
 
# Create the LocalOutlierFactor model for outlier detection
lof_outlier = LocalOutlierFactor(n_neighbors=20)
# Fit the model to the data and predict
# a label for each data point
outlier_labels = lof_outlier.fit_predict(X_scaled)
 
# Identify the outlier data points (boolean mask)
outlier_indices = outlier_labels == -1
print("Outlier indices:", outlier_indices)
 
# Create and fit the model for novelty detection
lof_novelty = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof_novelty.fit(X_scaled)
 
# Use the model to predict whether new data points are novelties
new_data_point = [[2.0, 2.0, 2.0, 2.0, 2.0, 2.0,
                   2.0, 2.0, 2.0, 1.0, 3.0, 3.0, 3.0,
                   2.0, 2.0, 2.0, 1.0, 3.0, 3.0, 3.0,
                   2.0, 1.0, 3.0, 3.0, 3.0, 2.0, 1.0,
                   3.0, 3.0, 3.0]]
prediction = lof_novelty.predict(new_data_point)
print("Novelty detection for new data point:", prediction)


Output:

Outlier indices: [False False False  True False False False False False
  True False False
  True False False False False False False False False False False False
 False False False False False False False False False False False False
 False False  True False False False  True False False False False False
 False False False False False False False False False False False False
  True False False False False False False False  True False False  True
 False False False False False False  True False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False  True False False  True
 False False  True False False False False False False False False False
 False False False False False False  True False False False False False
 False False  True False False False False False  True False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False  True False
  True False False False False False False False False False False False
 False False False False False False False False  True  True False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
  True False  True False False False False False False False False False
 False False False False False False False False False False False False
 False False  True False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False  True False False False False False False
 False False False False False  True False False False False False False
 False False False False False False False False False  True False  True
 False False False False False False False False False False False False
  True  True False False False False False False False False False False
 False False False False  True False False False False False False False
 False False False False False False False False False False False  True
 False False False False False False False False False False False False
 False False False False False False False False False  True False False
 False False False False False]
 
Novelty detection for new data point: [1]


Last Updated : 09 Jan, 2023