Proximity-Based Methods in Data Mining

Proximity-based methods are an important family of techniques in data mining. They find patterns in large databases by measuring how close items are to one another, for example how near certain keywords and phrases appear within documents. They are widely used because they do not require expensive hardware or much storage space, and they can scale to larger databases when combined with indexing or sampling.

Advantages of Proximity-Based Methods:

Proximity-based methods make use of machine learning techniques, in which algorithms are trained to respond to certain patterns.
Using a random sample of documents, the machine learning algorithm analyzes the keywords and phrases used in them and estimates the probability that these words appear together across all documents.
Proximity can be measured by computing a similarity score between two collections of training data and then comparing the scores. The algorithm then looks for the maximum similarity score between two distinct sets of training items.
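
As a rough illustration of a similarity score, the sketch below compares documents by the cosine similarity of their word counts. The choice of cosine similarity, the whitespace tokenizer, and the names term_counts and cosine_similarity are assumptions made for this example; the article does not prescribe a particular score.

```python
import math
from collections import Counter


def term_counts(text):
    """Lower-case the text and count whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two term-count mappings.

    Returns 0.0 when either mapping is empty.
    """
    shared = a.keys() & b.keys()
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, invented for illustration.
doc_a = term_counts("data mining finds patterns in data")
doc_b = term_counts("mining patterns in large data sets")
doc_c = term_counts("the weather is sunny today")

print(cosine_similarity(doc_a, doc_b))  # shared vocabulary, score well above 0
print(cosine_similarity(doc_a, doc_c))  # no shared words, score is 0.0
```

A score near 1 means the two documents use nearly the same words in the same proportions, and a score of 0 means they share no words at all.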

Disadvantages of Proximity-Based Methods:

Important words may not appear as close together as expected.
Documents may be over-segmented into phrases. To counter these problems, a lexical chain-based algorithm has been proposed.
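
The first disadvantage can be seen in a small sketch that measures how many tokens separate two keywords and checks them against a fixed window. The helper names min_keyword_distance and within_window, the window size, and the sample sentence are hypothetical and chosen only for illustration.

```python
def min_keyword_distance(text, first, second):
    """Smallest gap, in tokens, between any occurrences of two keywords.

    Returns None if either keyword does not occur in the text.
    """
    tokens = [tok.strip(".,;:!?") for tok in text.lower().split()]
    best = None
    last_first = None
    last_second = None
    for i, tok in enumerate(tokens):
        if tok == first:
            last_first = i
        if tok == second:
            last_second = i
        if last_first is not None and last_second is not None:
            gap = abs(last_first - last_second)
            if best is None or gap < best:
                best = gap
    return best


def within_window(text, first, second, window=5):
    """True if the two keywords occur within `window` tokens of each other."""
    gap = min_keyword_distance(text, first, second)
    return gap is not None and gap <= window


sentence = "the outlier that we saw in the sensor log was flagged by the detection step"

print(min_keyword_distance(sentence, "outlier", "detection"))
print(within_window(sentence, "outlier", "detection", window=5))
```

Here the two related words are separated by more tokens than the window allows, so a proximity search with a small window would miss the match even though the sentence is clearly relevant.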

Proximity-based methods perform very well at finding sets of documents that contain certain words, given suitable background knowledge. However, performance is limited when that background knowledge has not been pre-classified into categories.

To find sets of documents belonging to certain categories, one must first assign a categorical value to each document and then run proximity-based methods with these documents as training data, in the hope that they represent the categories accurately.

One way to identify outliers is to calculate their distance from the rest of the data set; this is known as proximity-based outlier detection.
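
A minimal sketch of this idea scores each point by the distance to its k-th nearest neighbour and flags points whose score exceeds a threshold. The function names, the value of k, the threshold, and the sample points are assumptions made for this example, not part of any particular library.

```python
import math


def manhattan(p, q):
    """Manhattan (city-block) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def knn_distance_scores(points, k=2, metric=math.dist):
    """Distance from each point to its k-th nearest neighbour.

    `metric` defaults to Euclidean distance (math.dist, Python 3.8+).
    """
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    scores = []
    for i, p in enumerate(points):
        dists = sorted(metric(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def flag_outliers(points, k=2, threshold=3.0, metric=math.dist):
    """Points whose k-th nearest neighbour is farther away than `threshold`."""
    scores = knn_distance_scores(points, k, metric)
    return [p for p, s in zip(points, scores) if s > threshold]


# Hypothetical data: four points close together and one far away.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (8.0, 8.0)]

print(flag_outliers(points, k=2, threshold=3.0))
print(flag_outliers(points, k=2, threshold=3.0, metric=manhattan))
```

The threshold plays the role of a configurable cut-off: lowering it flags more points, raising it flags fewer. This brute-force version compares every pair of points, so it suits small data sets only.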

Types of Proximity-Based Outlier Detection Methods:

Distance-based outlier detection methods: A distance-based outlier detection method is a statistical technique that measures the distance between each data point and the rest of its group, and treats points that lie unusually far away as outliers. Many approaches use a configurable threshold to decide when a point counts as an outlier. Many distance-based outlier detection methods have been developed; they rely on distance measures such as Euclidean, Manhattan, or Mahalanobis distance. The following three methods have been selected based on their performance, although none of them is a distance-based detector in the strict sense:
- WLSMV (weighted least squares with mean and variance adjustment), an estimator from structural equation modeling,
- SVM (Support Vector Machines), whose one-class variant is used for anomaly detection,
- RMSProp, a gradient-descent optimizer used when training models.
Density-based outlier detection methods: A density-based outlier detection method compares the density around an object with the density around its closest neighbors; an object in a much sparser region than its neighbors is treated as an outlier. Applications include malware detection, awareness, behavior analysis, and network intrusion detection. These methods have limitations. Points flagged as outliers may turn out not to be outliers at all but simply part of a much larger distribution of data. In addition, the density function must be defined and clearly understood before implementation, and its parameters must be set to appropriate values.
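
The density comparison can be sketched as follows. This is a simplified density-ratio score in the spirit of the Local Outlier Factor, not the full LOF algorithm (it omits reachability distances); the function names, the value of k, and the sample points are assumptions made for illustration.

```python
import math


def k_nearest(points, i, k):
    """(distance, index) pairs for the k nearest neighbours of points[i]."""
    dists = sorted(
        (math.dist(points[i], q), j) for j, q in enumerate(points) if j != i
    )
    return dists[:k]


def local_density(points, i, k):
    """Inverse of the mean distance to the k nearest neighbours."""
    mean_dist = sum(d for d, _ in k_nearest(points, i, k)) / k
    return 1.0 / (mean_dist + 1e-12)  # small constant avoids division by zero


def density_ratio_scores(points, k=2):
    """Average neighbour density divided by each point's own density.

    Scores near 1 mean the point is about as dense as its neighbours;
    scores much larger than 1 mean it sits in a sparser region.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    density = [local_density(points, i, k) for i in range(len(points))]
    scores = []
    for i in range(len(points)):
        neighbours = [j for _, j in k_nearest(points, i, k)]
        avg_neighbour_density = sum(density[j] for j in neighbours) / k
        scores.append(avg_neighbour_density / density[i])
    return scores


# Hypothetical data: four points close together and one far away.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (8.0, 8.0)]

for point, score in zip(points, density_ratio_scores(points, k=2)):
    print(point, round(score, 2))
```

The choice of k is the "proper value" that has to be set in advance: too small a k makes the scores noisy, while too large a k can hide small groups of genuine outliers.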
