Grid-Based Method For Distance-Based Outlier Detection in Data Mining

Outlier detection is widely regarded as a crucial data mining task with a variety of applications, including the detection of credit card fraud, criminal activity, and unusual trends in datasets. Its goal is to find unusual behaviour in a given data collection. The anomalies it finds can be used to forecast upcoming events, to clarify the consequences of a situation, or to improve the relevant system.

Distance-Based Outlier:

Statistics is one of the fields where outlier identification research is currently being done. An outlier can be characterised intuitively using Hawkins's definition (Hawkins-Outlier): an outlier is an observation that differs so significantly from other observations that it raises the possibility that it was produced by a different mechanism.

Grid-Based Method for Distance-Based Outlier Detection:

A grid-based outlier detection algorithm prunes away the portion of the dataset that is safe, that is, known to contain no outliers. The points that differ from the rest of the data can then be located at a later stage with a nearest-neighbour strategy, which lowers the overall cost of computation. Rather than applying the distance-based nearest-neighbour algorithm to all of the given data, this approach uses a straightforward grid-based structure to filter out the safe regions first.
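
To make the pruning idea concrete, here is a minimal Python sketch. It assumes points are tuples of numbers and that a cell counts as safe once it holds at least `density_threshold` points. The names `grid_cell`, `split_by_density`, `cell_width`, and `density_threshold` are illustrative and do not come from the original algorithm.

```python
from collections import defaultdict
from math import floor

def grid_cell(point, cell_width):
    """Map a point to the integer index of the grid cell that contains it."""
    return tuple(floor(coord / cell_width) for coord in point)

def split_by_density(points, cell_width, density_threshold):
    """Bucket points into cells and return (safe_points, candidate_points)."""
    cells = defaultdict(list)
    for p in points:
        cells[grid_cell(p, cell_width)].append(p)
    safe, candidates = [], []
    for members in cells.values():
        # Assumption: a cell with at least `density_threshold` points is safe.
        target = safe if len(members) >= density_threshold else candidates
        target.extend(members)
    return safe, candidates

points = [(0.1, 0.2), (0.3, 0.4), (0.2, 0.1), (0.4, 0.3), (5.5, 5.5)]
safe, candidates = split_by_density(points, cell_width=1.0, density_threshold=3)
```

Only the points left in `candidates` would then need the more expensive nearest-neighbour check.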

Algorithm for Grid-Based Stream Outlier Mining:

This algorithm can be broken down into three simple steps.

1. Divide the data space into grid cells of equal width and compute the cell statistics. Identify the candidate cells, then merge any dense cells that contain no candidate outliers. These dense cells are then pruned.
2. Apply a distance-based outlier detection technique to the candidate cells.
3. After the required number of stream chunks (L), declare each potential outlier a true outlier or an inlier, and give the final outliers an outlierness score.
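
The second step can be sketched as follows. The text does not fix a particular distance-based test, so this example assumes one common formulation: a point is an outlier when the distance to its k-th nearest neighbour exceeds a threshold. The function names and the `radius` parameter are hypothetical.

```python
from math import dist

def kth_neighbour_distance(index, data, k):
    """Distance from data[index] to its k-th nearest neighbour, excluding itself."""
    distances = sorted(
        dist(data[index], other) for j, other in enumerate(data) if j != index
    )
    return distances[k - 1]

def distance_based_outliers(data, candidate_indices, k, radius):
    """Return the candidate indices whose k-th nearest neighbour is farther than `radius`."""
    return [
        i for i in candidate_indices
        if kth_neighbour_distance(i, data, k) > radius
    ]

data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (1.2, 1.2), (6.0, 6.0)]
# Assumption: points 4 and 5 came out of the grid stage as candidates.
flagged = distance_based_outliers(data, candidate_indices=[4, 5], k=2, radius=2.0)
```

Here only index 5, the point (6.0, 6.0), is flagged. The point (1.2, 1.2) has two neighbours well inside the radius, so it is treated as an inlier.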

Requirements for algorithm:

Let N be the number of grid cells, and define the following:

L = Number of iterations over which a candidate point's outlierness degree is evaluated
W = w1, w2, …, wd: Data stream chunks
k = Number of closest neighbours
δ = Density threshold
c = The ith cell in the jth chunk of the stream
DS = N1 x N2 x … x Nd
x = Total number of data elements
sp = Number of data elements in c
u = Average of data elements in c
candout = candidate outliers
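
As an illustration of the per-cell quantities sp and u, the following sketch keeps a running count and a running mean for one cell. The `CellStats` class is a hypothetical helper, and treating u as a per-dimension mean is an assumption, since the definitions above do not say how the average is taken.

```python
from dataclasses import dataclass, field

@dataclass
class CellStats:
    """Running statistics for one grid cell (hypothetical helper)."""
    count: int = 0                            # sp: number of elements in the cell
    mean: list = field(default_factory=list)  # u: assumed per-dimension average

    def add(self, point):
        """Fold one point into the count and the running mean."""
        if not self.mean:
            self.mean = [0.0] * len(point)
        self.count += 1
        for d, value in enumerate(point):
            # Incremental mean update: no need to store the points themselves.
            self.mean[d] += (value - self.mean[d]) / self.count

stats = CellStats()
for p in [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]:
    stats.add(p)
```

Updating the mean incrementally suits a stream setting, because each chunk can be folded in without revisiting earlier data.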

Steps:

Step 1: Construct N grid cells.
Step 2: For each chunk wi in W,
- Assign each x to its appropriate cell c.
- Update the properties of cell c: sp and u (the average of x in c).
Step 3: Merge nearby dense cells, using the density threshold δ.
Step 4: If x > u, mark cell c as a candidate cell.
Step 5: Prune the safe regions.
Step 6: For each candidate cell c,
- Apply the DB-outlier (distance-based outlier) test to cell c.
- Keep candout and discard the rest of the data.
Step 7: Move candout to the next stream chunk.
Step 8: Using the density measure, the deviation value, and the nearby cells’ density values, assign an outlierness degree to each outlier found.
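
The steps above can be sketched in Python as follows. This is a simplified reading rather than the exact algorithm: cells holding at least `delta` points are treated as safe instead of merging dense cells and comparing against u, the DB-outlier test is assumed to be a k-th nearest-neighbour distance check with a hypothetical `radius` threshold, and the outlierness degree of Step 8 is left out because no formula is given for it. All function names are illustrative.

```python
from collections import defaultdict
from math import dist, floor

def sparse_cell_points(points, width, delta):
    """Steps 1-5, simplified: return points from cells with fewer than `delta` members."""
    cells = defaultdict(list)
    for p in points:
        cells[tuple(floor(c / width) for c in p)].append(p)
    return [p for members in cells.values() if len(members) < delta for p in members]

def is_db_outlier(point, data, k, radius):
    """Step 6 (assumed test): the k-th nearest neighbour is farther than `radius`.

    `point` must be an element of `data`; its zero self-distance sorts first,
    so index k is the k-th nearest other point.
    """
    distances = sorted(dist(point, other) for other in data)
    return len(distances) <= k or distances[k] > radius

def mine_stream(chunks, width, delta, k, radius, L):
    """Return the points flagged as outliers in L consecutive chunks."""
    pending = {}    # candout: candidate point -> number of chunks it was flagged in
    confirmed = []
    for chunk in chunks:
        # Step 7: carried-over candidates are re-examined with the new chunk.
        data = list(chunk) + list(pending)
        still_pending = {}
        for p in sparse_cell_points(data, width, delta):
            if is_db_outlier(p, data, k, radius):
                age = pending.get(p, 0) + 1
                if age >= L:
                    confirmed.append(p)
                else:
                    still_pending[p] = age
        pending = still_pending  # everything else is discarded as an inlier
    return confirmed

chunks = [
    [(0.1, 0.1), (0.2, 0.2), (0.3, 0.3), (9.0, 9.0), (4.5, 4.5)],
    [(0.4, 0.4), (0.5, 0.5), (0.6, 0.6), (4.6, 4.6), (4.7, 4.7)],
]
outliers = mine_stream(chunks, width=1.0, delta=3, k=2, radius=2.0, L=2)
```

In this toy stream, (9.0, 9.0) is flagged in both chunks and confirmed, while (4.5, 4.5) is flagged once and then dropped when the second chunk fills in its cell.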

Following these steps yields the detected outliers together with their outlierness degree.
