
Proximity-Based Methods in Data Mining

Proximity-based methods are an important technique in data mining. They find patterns in large databases by scanning documents for certain keywords and phrases and measuring how close together those terms occur. They are widely used because they do not require expensive hardware or much storage space, and they scale efficiently as the size of the database increases.

Advantages of Proximity-Based Methods:

  1. Proximity-based methods make use of machine learning techniques, in which algorithms are trained to respond to certain patterns.
  2. Using a random sample of documents, the machine learning algorithm analyzes the keywords and phrases used in them and predicts the probability that these words appear together across all documents.
  3. Proximity can be measured by computing a similarity score between two collections of training data and then comparing these scores. The algorithm then looks for the two distinct sets of training items with the maximum similarity score.
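
The similarity-score step described above can be sketched in Python. This is only an illustration, not the article's own algorithm: it assumes Jaccard similarity over lowercase word sets as the score, and the helper names `jaccard_similarity` and `most_similar_pair` are hypothetical.

```python
from itertools import combinations


def jaccard_similarity(doc_a: str, doc_b: str) -> float:
    """Share of distinct words that two documents have in common (0.0 to 1.0)."""
    tokens_a = set(doc_a.lower().split())
    tokens_b = set(doc_b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        # Two empty documents: treat as having no measurable similarity.
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


def most_similar_pair(docs: list[str]) -> tuple[int, int]:
    """Return the indices of the two documents with the highest score."""
    return max(
        combinations(range(len(docs)), 2),
        key=lambda pair: jaccard_similarity(docs[pair[0]], docs[pair[1]]),
    )


docs = [
    "data mining methods",
    "data mining outliers",
    "cooking recipes",
]
print(jaccard_similarity(docs[0], docs[1]))
print(most_similar_pair(docs))
```

Comparing every pair costs time quadratic in the number of documents, so a real system would likely sample documents or index them first.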

Disadvantages of Proximity-Based Methods:

  1. Important words may not appear as close together as expected.
  2. Documents may be over-segmented into phrases. To counter these problems, a lexical chain-based algorithm has been proposed.

Proximity-based methods perform very well at finding sets of documents that contain certain words, given background knowledge. However, performance is limited when that background knowledge has not been pre-classified into categories.



To find sets of documents belonging to certain categories, one must first assign a categorical value to each document and then run proximity-based methods on these labeled documents as training data. The results are only as accurate as the labels represent the categories.

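One way to identify outliers is by calculating their distance from the rest of the data set, which is known as distance-based outlier detection. A related approach, density-based outlier detection, compares the density around a point with the density around its neighbors.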
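
As a minimal sketch of flagging outliers by their distance from the rest of the data, the Python example below scores each point by the Euclidean distance to its k-th nearest neighbor and flags points whose score exceeds a cutoff. The function names, the choice of `k`, and the `threshold` value are assumptions made for illustration and do not come from the article.

```python
import math


def knn_distance_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def find_outliers(points, k=2, threshold=3.0):
    """Return the points whose k-th nearest neighbor is farther than threshold."""
    scores = knn_distance_scores(points, k)
    return [p for p, score in zip(points, scores) if score > threshold]


# Four points form a tight cluster; the fifth lies far away from it.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(find_outliers(points))  # prints [(10, 10)]
```

This brute-force version compares every pair of points, so it only suits small data sets. The threshold must be tuned to the scale of the data.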
 



Types of Proximity-Based Outlier Detection Methods:
