Mining Collective Outliers in Data Mining
A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers, and the analysis of outlier data is known as outlier mining. An outlier may be detected using statistical tests that assume a distribution or probability model for the data, or using distance measures, where objects with only a small fraction of “close” neighbors in space are considered outliers. Rather than using statistical or distance measures, deviation-based techniques identify outliers by examining differences in the main characteristics of the objects in a group.
A group of data objects that, taken together, deviates significantly from the entire data set is called a collective outlier. The individual objects in a collective outlier need not be outliers themselves. Collective outlier detection is more difficult than conventional and contextual outlier detection because the structure of the data set, that is, the relationships among multiple data objects, needs to be examined.
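The distance-based idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `distance_based_outliers`, the parameters `radius` and `min_fraction`, and the sample points are all hypothetical and do not come from any particular library.

```python
import math

def distance_based_outliers(points, radius, min_fraction):
    """Return indices of points that have too few close neighbors."""
    n = len(points)
    outliers = []
    for i, p in enumerate(points):
        # Count the other points that lie within `radius` of p.
        close = sum(
            1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius
        )
        # Flag p when the fraction of close neighbors is below the cutoff.
        if close / (n - 1) < min_fraction:
            outliers.append(i)
    return outliers

# Five points in a tight cluster and one point far away (made-up data).
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.8), (8.0, 8.0)]
outlier_indices = distance_based_outliers(points, radius=1.0, min_fraction=0.2)
print(outlier_indices)  # [5]
```

The pairwise loop is quadratic in the number of points, which is acceptable for a sketch but would likely need an index structure on larger data.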
Collective Outlier Data Mining:
Collective outlier mining depends heavily on the type of structure in the data. Predetermining the structure of the data objects is difficult and sometimes impossible, so the structures have to be explored. In temporal data, we explore the structures formed by time, such as segments of a time series or subsequences. In spatial data, we explore local areas. In graph and network data, we explore subgraphs. Contextual outlier detection is similar to collective outlier detection in that both explore substructures and local areas. In contextual outlier detection, the context of the data objects is the main attribute used to detect the outliers.
In collective outlier detection, the structure of the data plays the role of the contextual information. Collective outlier detection is challenging because the structures of the data must be explored to detect the outliers, and the exploration depends on the type of application and data objects. Because the mining process involves several sophisticated data mining and machine learning techniques, it has a high computational cost. Even so, collective outlier detection is applicable in many practical situations.
Collective outlier detection methods fall into two categories. Methods in the first category reduce the problem to conventional outlier detection. They identify the structural units of the data (a time-series segment, a local area, or a subgraph), treat each unit as a single data object, and extract significant features from it. The problem of collective outlier detection is thereby transformed into outlier detection on the set of extracted features. A structural unit whose features deviate significantly from those of the other units is considered a collective outlier, whereas normal units exhibit similar structural behavior.
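A minimal sketch of the first category on a time series follows. It assumes non-overlapping fixed-width segments as the structural units, the segment mean and spread as the extracted features, and a plain z-score rule as the conventional detector. These choices, the function names, and the threshold of 2.0 are all illustrative assumptions. In the sample series every value lies between 1 and 5, so no single value is unusual, but one segment is flat while the others oscillate.

```python
from statistics import mean, pstdev

def segment_features(series, width):
    """Split the series into non-overlapping windows; return (mean, spread) per window."""
    segments = [series[i:i + width] for i in range(0, len(series) - width + 1, width)]
    return [(mean(s), pstdev(s)) for s in segments]

def zscore_outliers(features, threshold=2.0):
    """Conventional detection: flag rows whose z-score exceeds the threshold on any feature."""
    flagged = set()
    for column in zip(*features):
        mu, sigma = mean(column), pstdev(column)
        if sigma == 0:
            continue  # a constant feature cannot separate the segments
        for idx, value in enumerate(column):
            if abs(value - mu) / sigma > threshold:
                flagged.add(idx)
    return sorted(flagged)

normal = [1, 5, 1, 5]   # typical oscillating segment (made-up data)
flat = [3, 3, 3, 3]     # each value is ordinary, the flat run is not
series = normal * 5 + flat + normal * 2

features = segment_features(series, width=4)
outlier_segments = zscore_outliers(features)
print(outlier_segments)  # [5]
```

The segment means are identical here, so only the spread feature separates the flat segment from the rest. This is why the choice of extracted features matters as much as the detector applied afterwards.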
Methods in the second category model the expected behavior of the structural units directly. For example, to detect collective outliers in spatial data, we can build a model from the features that characterize the behavior of the structural units. Structural units that deviate significantly from the model are identified as collective outliers.
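The second category can be sketched as follows. For compactness this example uses a symbolic sequence rather than spatial data, and it assumes a first-order Markov chain as the model of expected behavior. That is one possible modeling choice, not one prescribed by the text. The function names, the smoothing constant, and the threshold of -1.5 are hypothetical.

```python
import math
from collections import Counter, defaultdict

def fit_markov(sequences, alphabet, smoothing=1.0):
    """Estimate first-order transition probabilities with additive smoothing."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    model = {}
    for a in alphabet:
        total = sum(counts[a].values()) + smoothing * len(alphabet)
        model[a] = {b: (counts[a][b] + smoothing) / total for b in alphabet}
    return model

def avg_log_likelihood(model, window):
    """Average log-probability of the transitions inside one window."""
    pairs = list(zip(window, window[1:]))
    return sum(math.log(model[a][b]) for a, b in pairs) / len(pairs)

def deviating_windows(model, sequence, width, threshold):
    """Return start offsets of non-overlapping windows that the model finds unlikely."""
    flagged = []
    for start in range(0, len(sequence) - width + 1, width):
        window = sequence[start:start + width]
        if avg_log_likelihood(model, window) < threshold:
            flagged.append(start)
    return flagged

alphabet = "LMH"                      # low, medium, high (made-up symbols)
reference = ["LMHMLMHMLMHM" * 3]      # data assumed to show normal behavior
model = fit_markov(reference, alphabet)

# Every symbol in the middle window also occurs in normal data,
# but its transitions (H to L, L to H) were never seen in the reference.
observed = "LMHMLMHM" + "HLHLHLHL" + "LMHMLMHM"
flagged_starts = deviating_windows(model, observed, width=8, threshold=-1.5)
print(flagged_starts)  # [8]
```

Unlike the feature-extraction approach, nothing here compares windows with one another. Each window is scored only against the model learned from the reference data.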
Collective Outlier Detection on Graph Data:
Collective outlier detection can be performed on a social network, which can be modeled as an unlabeled graph. Each possible subgraph of the network is treated as a structural unit. For each subgraph S, we consider two features: the number of vertices in S and the frequency of S in the network. Here, frequency(S) is the number of subgraphs in the network that are isomorphic to S. In general, subgraphs with few vertices are expected to be frequent, and large subgraphs are expected to be infrequent. A subgraph that has many vertices and yet a high frequency compared to other subgraphs therefore deviates from this expectation and is identified as a collective outlier.
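The two features, number of vertices and frequency(S), can be computed by brute force on a very small graph, as in the sketch below. It makes several simplifying assumptions: only connected induced subgraphs of a fixed size are counted, and isomorphism is decided by trying every vertex ordering, which is feasible only for tiny sizes. All function names and the example graph are hypothetical. Deciding which (size, frequency) pairs count as outliers is left out, since that cutoff is application-dependent.

```python
from collections import Counter
from itertools import combinations, permutations

def induced_edges(edges, nodes):
    """Edges of the graph whose endpoints both lie in `nodes`."""
    node_set = set(nodes)
    return [(u, v) for u, v in edges if u in node_set and v in node_set]

def is_connected(nodes, sub_edges):
    """Depth-first search over the subgraph to check connectivity."""
    nodes = list(nodes)
    adjacency = {n: set() for n in nodes}
    for u, v in sub_edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        for nxt in adjacency[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return len(seen) == len(nodes)

def canonical_form(nodes, sub_edges):
    """Smallest relabeled edge list over all vertex orderings (brute force)."""
    best = None
    for order in permutations(nodes):
        label = {n: i for i, n in enumerate(order)}
        form = tuple(sorted(tuple(sorted((label[u], label[v]))) for u, v in sub_edges))
        if best is None or form < best:
            best = form
    return best

def subgraph_frequencies(vertices, edges, size):
    """Map each isomorphism class of connected induced subgraphs to its frequency."""
    freq = Counter()
    for nodes in combinations(vertices, size):
        sub_edges = induced_edges(edges, nodes)
        if is_connected(nodes, sub_edges):
            freq[canonical_form(nodes, sub_edges)] += 1
    return freq

# Two triangles joined by an edge, plus a short tail (made-up network).
vertices = range(8)
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3), (5, 6), (6, 7)]

freq3 = subgraph_frequencies(vertices, edges, size=3)
for form, count in sorted(freq3.items()):
    print(3, form, count)
```

In this graph the three-vertex path occurs 7 times and the triangle twice. The enumeration grows combinatorially with both the graph size and the subgraph size, which reflects the high computational cost of this kind of mining.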
Collective outlier detection is subtle due to the challenge of exploring the structures in data. The exploration typically uses heuristics and thus may be application-dependent. The computational cost is often high due to the sophisticated mining process. While highly useful in practice, collective outlier detection remains a challenging direction that calls for further research and development.