Measures: Their Categorization and Computation in Data Mining

Data Mining can be defined as the process of sorting through large sets of data to discover patterns and relationships. These patterns and relationships can help solve business problems through data analysis. Data Mining tools and techniques have helped institutions and enterprises predict future trends and profit from them. Data Mining is an important part of Data Analysis and one of the core disciplines of Data Science (the use of advanced analytics techniques to find useful information in given data sets). It is also a key part of analytics initiatives in many organizations: the information it produces is mainly used in business intelligence and advanced analytics applications.

Data mining measures can be arranged into three categories: holistic, distributive, and algebraic. The classification is based on the type of aggregate function used to compute the measure.

Holistic:

An aggregate function is said to be holistic if there is no constant bound on the storage needed to describe a sub-aggregate. In other words, no algebraic function with a fixed number of arguments can characterize the computation, so the result cannot be assembled from a small, fixed-size summary of each partition.

For example, median(), rank(), and mode() are holistic. A measure is holistic if it is obtained by applying a holistic aggregate function. Most cube applications that work with large amounts of data need fast computation of distributive and algebraic measures; holistic measures, by contrast, are difficult to compute efficiently because they cannot be built from small partial results.
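To see why median() is holistic, the following minimal sketch compares the true median with a naive attempt to combine one median per partition. The data and the function names are hypothetical, chosen only for illustration.

```python
from statistics import median

# Hypothetical data, already split into three partitions.
partitions = [[1, 2, 100], [3, 4, 5], [6, 7, 8]]

def median_of_partition_medians(parts):
    """Try (incorrectly) to combine one sub-aggregate per partition."""
    return median(median(p) for p in parts)

def true_median(parts):
    """Compute the median over all values, which needs every value at once."""
    all_values = [v for p in parts for v in p]
    return median(all_values)

print(median_of_partition_medians(partitions))  # 4
print(true_median(partitions))                  # 5
```

The two results differ, so a single number per partition is not enough to recover the median; in general the whole data set (or a summary that grows with it) has to be kept.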

Distributive:

An aggregate function is said to be distributive if it can be computed in a distributed manner, as follows. Suppose the data is partitioned into n sets. Applying the function to each partition yields n aggregate values. If the result obtained by applying the function to these n aggregate values is identical to the result obtained by applying the function to the entire data set (without partitioning), the function can be computed in a distributed manner.

For example, count() for a data cube can be computed by partitioning the cube into a set of sub-cubes, computing count() for each sub-cube, and then adding the partial counts to get the total. The count() function is therefore a distributive aggregate function.

A measure is said to be distributive if it is obtained by applying a distributive aggregate function.

Examples: Sum(), Count(), Minimum(), Maximum()
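The partition-then-combine idea can be sketched in a few lines. This is only an illustration under assumed data; the helper name aggregate_by_partition and the sample partitions are hypothetical.

```python
# Hypothetical data, split into three partitions (for example, sub-cubes).
partitions = [[4, 8, 15], [16, 23], [42]]

def aggregate_by_partition(parts, partial, combine):
    """Apply `partial` to each partition, then `combine` the sub-aggregates."""
    sub_aggregates = [partial(p) for p in parts]
    return combine(sub_aggregates)

# count(): count each partition, then add the partial counts.
total_count = aggregate_by_partition(partitions, len, sum)
# sum(): sum each partition, then add the partial sums.
total_sum = aggregate_by_partition(partitions, sum, sum)
# min(): take each partition's minimum, then the minimum of those.
overall_min = aggregate_by_partition(partitions, min, min)

print(total_count, total_sum, overall_min)  # 6 108 4
```

Each partition contributes exactly one number, and the combined result matches what the same function would return on the unpartitioned data.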

Algebraic:

An aggregate function is said to be algebraic if it can be computed by an algebraic function with M arguments, where M is a bounded positive integer and each argument is obtained by applying a distributive aggregate function.

Consider the average function, avg(). It can be computed as sum() / count(). Both sum() and count() are distributive aggregate functions, and their quotient is an algebraic function. Similarly, the functions that return the N smallest or N largest values (MinN() and MaxN()) are algebraic. A measure is said to be algebraic if it is obtained by applying an algebraic aggregate function.

Examples: Average(), MaxN(), MinN(), CenterOfMass()
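As a rough sketch of the algebraic case, the average below is computed from a fixed-size state per partition (a sum and a count), and a top-N measure is computed from at most N values per partition. The function names and sample data are hypothetical.

```python
import heapq

# Hypothetical data, split into three partitions.
partitions = [[4, 8, 15], [16, 23], [42]]

def partial_state(part):
    """Fixed-size sub-aggregate for avg(): the pair (sum, count)."""
    return (sum(part), len(part))

def average(parts):
    """Combine the distributive pieces: avg = total sum / total count."""
    states = [partial_state(p) for p in parts]
    total = sum(s for s, _ in states)
    count = sum(c for _, c in states)
    return total / count

def max_n(parts, n):
    """Top-n values overall, keeping at most n values per partition."""
    candidates = [v for p in parts for v in heapq.nlargest(n, p)]
    return heapq.nlargest(n, candidates)

print(average(partitions))   # 18.0
print(max_n(partitions, 2))  # [42, 23]
```

Note that averaging the per-partition averages would generally give a different (wrong) answer when partitions have different sizes, which is why the sum and the count are carried separately.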

 

Advantages:

Helps evaluate algorithms: Measures provide a quantitative way to evaluate the performance of data mining algorithms, which can help in choosing the best algorithm for a given task.

Facilitates comparison: Measures make it possible to compare the performance of different algorithms and models on the same dataset, enabling researchers to choose the best model for a given task.

Provides insights: Measures can provide insights into the structure and characteristics of the data being analyzed, which can help in understanding the data and in developing more accurate models.

Enables optimization: Measures can be used to optimize data mining algorithms and models, which can lead to improved accuracy and efficiency.

Disadvantages:

Limited scope: Measures may not capture all aspects of the data being analyzed, and some important features or patterns may be overlooked.

Biased results: Measures can be biased towards certain types of patterns or models, which may not always be the most useful or relevant for a given task.

Limited applicability: Some measures may only be applicable to certain types of data or tasks, and may not be useful in other contexts.

Time-consuming: Computation of some measures can be time-consuming, especially for large datasets or complex models.


Last Updated : 25 Apr, 2023