
Mining Collective Outliers Data Mining


A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are called outliers, and the investigation of outlier data is known as outlier mining. An outlier may be detected using statistical tests that assume a distribution or probability model for the data, or using distance measures, where objects that have only a small fraction of “close” neighbors in space are considered outliers. Rather than using statistical or distance measures, deviation-based techniques identify outliers by examining differences in the main characteristics of the objects in a group.

A group of data objects that, taken together, deviates significantly from the entire data set is called a collective outlier. Within a collective outlier, the individual objects need not be outliers on their own. Collective outlier detection is more difficult than conventional and contextual outlier detection because the structure of the data set, that is, the relationships between multiple data objects, has to be examined.
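
To make this concrete, here is a minimal Python sketch on a made-up signal. The data, the window width of 8, and the "under half the typical spread" rule are all illustrative assumptions rather than a standard method. The point is only that no single value looks unusual, while a stretch of values does.

```python
import math
import statistics

# Hypothetical signal: a regular oscillation with a flat stretch in the middle.
signal = [math.sin(i * math.pi / 4) for i in range(64)]
for i in range(24, 40):
    signal[i] = 0.0  # values of about 0.0 also occur in the normal oscillation

mean = statistics.fmean(signal)
std = statistics.pstdev(signal)

# Point-wise view: indices of values more than 3 standard deviations from the mean.
point_outliers = [i for i, x in enumerate(signal) if abs(x - mean) > 3 * std]

# Group-wise view: spread of each non-overlapping window of 8 samples.
width = 8  # assumed window width (one full oscillation period)
spreads = [statistics.pstdev(signal[s:s + width])
           for s in range(0, len(signal), width)]
typical = statistics.median(spreads)
# Assumed rule: a window is suspicious if its spread is under half the typical spread.
flat_windows = [k for k, s in enumerate(spreads) if s < 0.5 * typical]

print(point_outliers)  # indices of individual outliers
print(flat_windows)    # indices of suspicious windows
```

On this data the point-wise check flags nothing, while the two windows covering the flat stretch (windows 3 and 4) are flagged as a group.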

A collective outlier is a group of data points that deviate from the rest of the data in a similar way. Such groups are often found in datasets with multiple variables and can indicate a specific pattern or behavior that differs from the rest of the data. Data mining techniques, such as clustering or density-based algorithms, can be used to identify and analyze collective outliers in a dataset. Note that, while outliers may be interesting, they should be treated with caution because they may also be caused by errors or inaccuracies in the data.

Collective Outlier Data Mining:

Collective outlier mining depends heavily on the type of data structure. However, determining the structure of the data objects in advance is difficult and sometimes impossible. In temporal data, we explore the structures formed by segments of a time series or by subsequences. In spatial data, we explore local areas. In graph and network data, we explore subgraphs. Contextual outlier detection and collective outlier detection are similar in that both explore substructures and local areas. In contextual outlier detection, the context of the data objects is the main attribute used to detect the outliers.

In that case, the contextual information serves as the structural attribute. Collective outlier detection is more challenging because the structures of the data themselves must be explored in order to detect the outliers. It also depends on the type of application and the data objects. Because the mining process involves several sophisticated data mining and machine learning techniques, it has a high computational cost. Nevertheless, collective outlier detection is practically applicable in many situations.

Collective outlier detection methods fall into two categories. The first category reduces the problem to conventional outlier detection. It identifies the structural units of the data (for example, time-series segments, local areas, or subgraphs) and extracts significant features from each unit. The problem then becomes ordinary outlier detection on the extracted features: structural units whose features deviate from those of the other units are considered collective outliers, whereas normal units all exhibit a similar type of structural behavior.
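
The first category can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the structural units are fixed-width, non-overlapping time-series segments, the features are each segment's mean and range, and the conventional detector is a median/MAD score with a cutoff of 3.0. The function names, the features, and the cutoff are hypothetical choices, not prescribed by the method.

```python
import statistics

def segment_features(series, width):
    """Split the series into non-overlapping segments; return (mean, range) per segment."""
    feats = []
    for start in range(0, len(series) - width + 1, width):
        seg = series[start:start + width]
        feats.append((statistics.fmean(seg), max(seg) - min(seg)))
    return feats

def robust_scores(values):
    """Distance of each value from the median, in units of the median absolute deviation."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # Degenerate case: most values are identical, so any difference stands out.
        return [0.0 if v == med else float("inf") for v in values]
    return [abs(v - med) / mad for v in values]

def outlier_segments(series, width, cutoff=3.0):
    """Indices of segments whose mean or range scores above the (assumed) cutoff."""
    feats = segment_features(series, width)
    mean_scores = robust_scores([f[0] for f in feats])
    range_scores = robust_scores([f[1] for f in feats])
    return [k for k, (a, b) in enumerate(zip(mean_scores, range_scores))
            if max(a, b) > cutoff]

# Hypothetical data: a repeating 4-step pattern with a small drift per segment.
pattern = [100, 120, 150, 120]
series = [pattern[i % 4] + (i // 4) % 3 for i in range(40)]
series[24:28] = [120, 120, 120, 120]  # 120 is an ordinary value on its own

flagged = outlier_segments(series, width=4)
print(flagged)
```

Every value in the altered segment also appears in normal segments, so a point-wise detector has nothing to work with; the segment stands out only once it is summarized by its features. In practice the choice of segment width and features is application-dependent.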

The second category builds a model of the expected behavior of the structural units directly. For example, to detect collective outliers in spatial data, we can build a model from the features extracted from the structural units of the data. Structural units that deviate significantly from the model are identified as collective outliers.
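
As a rough illustration of the model-based category, the Python sketch below uses a symbolic sequence rather than spatial data, purely to keep the example short. It assumes a first-order Markov chain as the model of expected behavior, Laplace smoothing with `alpha=1.0`, fixed-width windows as structural units, and an arbitrary likelihood threshold. All function names and parameters are hypothetical.

```python
import math
from collections import Counter

def fit_markov(sequence, alphabet, alpha=1.0):
    """Estimate smoothed transition probabilities P(b | a) from a normal sequence."""
    pair_counts = Counter(zip(sequence, sequence[1:]))
    from_counts = Counter(sequence[:-1])
    k = len(alphabet)
    return {(a, b): (pair_counts[(a, b)] + alpha) / (from_counts[a] + alpha * k)
            for a in alphabet for b in alphabet}

def avg_log_likelihood(window, model):
    """Mean log-probability of the transitions inside a window."""
    steps = list(zip(window, window[1:]))
    return sum(math.log(model[s]) for s in steps) / len(steps)

def flag_windows(sequence, model, width, threshold):
    """Start positions of non-overlapping windows the model finds unlikely."""
    flagged = []
    for start in range(0, len(sequence) - width + 1, width):
        window = sequence[start:start + width]
        if avg_log_likelihood(window, model) < threshold:
            flagged.append(start)
    return flagged

# Hypothetical training data: behavior assumed to be normal.
model = fit_markov("abc" * 50, alphabet="abc")

# Test sequence: every symbol is familiar, but the run of 'c' is not.
test_sequence = "abcabc" + "cccccc" + "abcabc"
# Assumed threshold: average transition probability below 0.5.
flagged = flag_windows(test_sequence, model, width=6, threshold=math.log(0.5))
print(flagged)
```

The sketch assumes every symbol in the test sequence belongs to the training alphabet; an unseen symbol would raise a `KeyError`. The threshold would normally be calibrated on held-out normal data rather than fixed by hand.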

Collective Outlier Detection on Graph Data:

Collective outlier detection can be performed on a social network, which can be modeled as an unlabeled graph. Each possible subgraph of the network is treated as a structural unit. For each subgraph S, we consider two features: the number of vertices in S, and frequency(S), the number of subgraphs in the network that are isomorphic to S. In general, subgraphs with few vertices are expected to be frequent, and large subgraphs are expected to be infrequent. If, in practice, a subgraph contains multiple vertices and also has a high frequency compared with the other subgraphs, it is identified as a collective outlier in the social network.
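
The two features can be computed by brute force on a tiny graph, as in the Python sketch below. It makes several assumptions that the text does not fix: only connected, induced subgraphs of one given size are counted, vertices are integers, and isomorphism is decided by trying every vertex ordering, which is feasible only for very small subgraphs. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations, permutations

def canonical_form(nodes, edges):
    """Smallest relabelled edge list over all vertex orderings (brute force)."""
    best = None
    for order in permutations(nodes):
        label = {v: i for i, v in enumerate(order)}
        form = tuple(sorted(tuple(sorted((label[u], label[v]))) for u, v in edges))
        if best is None or form < best:
            best = form
    return best

def is_connected(nodes, adj):
    """Depth-first search restricted to the chosen vertices."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u] & nodes:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes

def subgraph_frequencies(edge_list, size):
    """Map each isomorphism class of connected induced subgraphs to its frequency."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    counts = Counter()
    for nodes in combinations(sorted(adj), size):
        if not is_connected(nodes, adj):
            continue
        chosen = set(nodes)
        edges = [(u, v) for u in nodes for v in adj[u] & chosen if u < v]
        counts[canonical_form(nodes, edges)] += 1
    return counts

# Hypothetical network: a chain 0-1-...-6 ending in a triangle 6-7-8.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 6)]
freq = subgraph_frequencies(edges, size=3)
for shape, count in freq.items():
    print(len(shape), "edges:", count, "occurrences")
```

Here the three-vertex path occurs 7 times and the triangle once. Which combinations of size and frequency count as surprising is left open; as noted above, that judgment is typically heuristic and application-dependent, and the enumeration itself becomes expensive quickly as the graph or the subgraph size grows.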

Collective outlier detection is subtle due to the challenge of exploring the structures in data. The exploration typically uses heuristics and thus may be application-dependent. The computational cost is often high due to the sophisticated mining process. While highly useful in practice, collective outlier detection remains a challenging direction that calls for further research and development.

Advantages and Disadvantages:

Advantages of mining collective outliers in data mining include:

  1. The ability to identify patterns and relationships that may not be apparent when analyzing individual data points.
  2. The potential to uncover previously undiscovered knowledge or insights.
  3. The ability to detect and analyze anomalies in large datasets.


Disadvantages of mining collective outliers in data mining include:

  1. The difficulty in defining what constitutes a collective outlier, as it can depend on the specific context and dataset being analyzed.
  2. The potential for false positives, where a group of data points that appear to be outliers may not actually be abnormal or significant.
  3. The computational cost of analyzing large datasets for collective outliers can be high.

Last Updated : 26 Jan, 2023