
Quantiles in Machine Learning

Last Updated : 12 Mar, 2024

Quantiles offer valuable insight into how data is distributed and support many aspects of analysis. This article describes quantiles, shows how to calculate them, and explains why they matter in machine learning applications. It also covers the main challenges and limitations of working with quantiles. For anyone working with data in machine learning, a firm understanding of quantiles is essential.

What are Quantiles?

Quantiles are cut points that divide a dataset, sorted in increasing order, into equal-sized parts based on rank. Each quantile is the value below which a given fraction of the observations falls. Common quantiles include the median (50th percentile), quartiles (25th, 50th, and 75th percentiles), and percentiles (the 1st through 99th cut points of a 100-part split).

In machine learning and data science, quantiles play an important role in understanding the data, detecting outliers and evaluating model performance.

Types of Quantiles

  • Quartiles: Quartiles divide a dataset into four equal parts. The three cut points are the 25th, 50th (median), and 75th percentiles.
  • Quintiles: Quintiles divide a dataset into five equal parts, each representing 20% of the data.
  • Deciles: Deciles divide a dataset into ten equal parts, with each decile representing 10% of the data.
  • Percentiles: Percentiles divide a dataset into 100 equal parts, with each percentile representing 1% of the data.
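
The relationship between the number of parts and the number of cut points can be checked with a short sketch. It assumes Python's standard-library statistics.quantiles function (Python 3.8+) and an arbitrary sample; the names data, parts, and cut_points are illustrative only.

```python
import statistics

# Arbitrary sample data (illustrative only)
data = [12, 15, 18, 20, 22, 25, 28, 30, 32, 35, 38, 40]

# Number of equal parts for each named quantile type
parts = {"quartiles": 4, "quintiles": 5, "deciles": 10, "percentiles": 100}

cut_points = {}
for name, k in parts.items():
    # quantiles(data, n=k) returns the k - 1 cut points that split the data into k parts
    cut_points[name] = statistics.quantiles(data, n=k, method="inclusive")
    print(f"{name}: {k} parts, {len(cut_points[name])} cut points")
```

Splitting into k parts always needs k - 1 cut points, which is why there are three quartiles and nine deciles.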

Steps to Calculate Quantiles

The steps for calculating quantiles involve:

  1. Sorting the Data: Arrange the dataset in increasing order.
  2. Determine the Position: Calculate the position of the desired quantile using the formula “Position=(p×(n+1))/100”, where p is the desired percentile (for example, 25 for the first quartile) and n is the total number of observations. Positions are counted from 1.
  3. Interpolation (if needed): Interpolate between two adjacent values to find the quantile if the position is not an integer.
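
The three steps above can be sketched as a small function. This is a minimal illustration of the (n+1) position formula, not a reference implementation; the function name quantile_n_plus_1 and the sample scores are hypothetical, and clamping the position to the range 1..n is an added assumption for very low or very high percentiles.

```python
def quantile_n_plus_1(values, percent):
    """Quantile using Position = percent * (n + 1) / 100 with linear interpolation."""
    ordered = sorted(values)                  # Step 1: sort the data
    n = len(ordered)
    position = percent * (n + 1) / 100        # Step 2: 1-based position
    position = min(max(position, 1), n)       # Assumption: clamp to the valid range 1..n
    lower = int(position)                     # 1-based rank of the lower neighbour
    fraction = position - lower
    if fraction == 0:
        return float(ordered[lower - 1])      # Position is an integer: no interpolation
    low, high = ordered[lower - 1], ordered[lower]
    return low + fraction * (high - low)      # Step 3: interpolate between neighbours


# Hypothetical unsorted sample with n = 9, so Position = percent / 10
scores = [18, 2, 10, 6, 14, 4, 16, 8, 12]
for p in (25, 50, 75):
    print(f"{p}th percentile:", quantile_n_plus_1(scores, p))
```

For this sample the 25th percentile falls at position 2.5, halfway between the 2nd and 3rd sorted values (4 and 6), giving 5.0.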

Worked Example:

Let’s consider a dataset: [5, 10, 15, 20, 25, 30, 35, 40, 45, 50].

  1. Median (Q2): There are 10 observations, so the median position is (50×(10+1))/100=5.5. Since 5.5 is not an integer, we interpolate between the 5th and 6th observations: Median=25+0.5×(30−25)=27.5.
  2. First Quartile (Q1): The position is (25×(10+1))/100=2.75. Interpolating between the 2nd and 3rd observations: Q1=10+0.75×(15−10)=13.75.
  3. Third Quartile (Q3): The position is (75×(10+1))/100=8.25. Interpolating between the 8th and 9th observations: Q3=40+0.25×(45−40)=41.25.

Implementation: Calculating Quantiles using NumPy Library

Quintiles

This code uses NumPy to compute the quintiles (20th, 40th, 60th, and 80th percentiles) of a given dataset data. It then prints out these quintiles to the console. The np.percentile function calculates the desired percentiles, and the values are accessed from the resulting array quintiles using indexing.

Python3
import numpy as np

# Different sample data
data = np.array([12, 15, 18, 20, 22, 25, 28, 30, 32, 35, 38, 40])

# Compute the quintiles
quintiles = np.percentile(data, [20, 40, 60, 80])

print("20th percentile (quintile 1):", quintiles[0])
print("40th percentile (quintile 2):", quintiles[1])
print("60th percentile (quintile 3):", quintiles[2])
print("80th percentile (quintile 4):", quintiles[3])

Output:

20th percentile (quintile 1): 18.4
40th percentile (quintile 2): 23.200000000000003
60th percentile (quintile 3): 29.2
80th percentile (quintile 4): 34.400000000000006

Quartiles

This code uses NumPy’s quantile function to calculate the median, first quartile (Q1), and third quartile (Q3) of the given dataset. You can change the quantile values (0.5, 0.25, 0.75) to calculate quintiles, deciles, or any custom quantile. By default, np.quantile interpolates linearly at position q×(n−1), counted from 0, so its Q1 and Q3 differ from values obtained with the (n+1) position formula.

Python3
import numpy as np

# Sample data
data = np.array([5, 10, 15, 20, 25, 30, 35, 40, 45, 50])

# Calculating median (Q2)
median = np.quantile(data, 0.5)

# Calculating first quartile (Q1)
q1 = np.quantile(data, 0.25)

# Calculating third quartile (Q3)
q3 = np.quantile(data, 0.75)

print("Median (Q2):", median)
print("First Quartile (Q1):", q1)
print("Third Quartile (Q3):", q3)


Output:

Median (Q2): 27.5
First Quartile (Q1): 16.25
Third Quartile (Q3): 38.75
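
To see how the choice of position rule changes the quartiles for this same dataset, the sketch below uses Python's standard-library statistics.quantiles (Python 3.8+) rather than NumPy. Its "exclusive" method places cut points using n+1, and its "inclusive" method uses n−1, which is the convention that reproduces the NumPy output above. The variable names are illustrative.

```python
import statistics

data = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]

# "exclusive": cut points placed using (n + 1), as in the position formula
exclusive = statistics.quantiles(data, n=4, method="exclusive")

# "inclusive": cut points placed using (n - 1), treating min and max as the 0th/100th percentiles
inclusive = statistics.quantiles(data, n=4, method="inclusive")

print("(n + 1) rule:", exclusive)   # [13.75, 27.5, 41.25]
print("(n - 1) rule:", inclusive)   # [16.25, 27.5, 38.75]
```

Both conventions agree on the median here and differ only in Q1 and Q3. Neither is wrong, but results should be compared only when they were computed with the same rule.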

Percentiles

This code also utilizes NumPy to compute the 25th, 50th (median), and 75th percentiles of a given dataset data. The np.percentile function calculates the desired percentiles, and the resulting values are printed out to the console.

Python3
import numpy as np

# Sample data
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Compute the 25th, 50th, and 75th percentiles
percentiles = np.percentile(data, [25, 50, 75])

print("25th percentile:", percentiles[0])
print("50th percentile (median):", percentiles[1])
print("75th percentile:", percentiles[2])

Output:

25th percentile: 3.25
50th percentile (median): 5.5
75th percentile: 7.75

Deciles

This code utilizes NumPy to compute deciles (10th, 20th, …, 90th percentiles) of a given dataset data. The np.percentile function calculates the desired percentiles using an array of percentiles from 10 to 90 in increments of 10. The resulting decile values are then printed out to the console using a loop, with the enumerate function to iterate over the deciles and start=1 to start the enumeration from 1 instead of 0.

Python3
import numpy as np

# Sample data
data = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

# Compute the deciles
deciles = np.percentile(data, np.arange(10, 100, 10))

for i, decile in enumerate(deciles, start=1):
    print(f"{i}0th percentile (decile {i}):", decile)

Output:

10th percentile (decile 1): 19.0
20th percentile (decile 2): 28.0
30th percentile (decile 3): 37.0
40th percentile (decile 4): 46.0
50th percentile (decile 5): 55.0
60th percentile (decile 6): 63.99999999999999
70th percentile (decile 7): 73.0
80th percentile (decile 8): 82.0
90th percentile (decile 9): 91.0

To learn more about implementing quantile calculations in code, refer to the following link:

quantile-implementation

Uses of Quantiles in Machine Learning

Quantiles play a crucial role in various aspects of machine learning and data analysis. Here are some key uses:

  1. Descriptive Statistics: Quantiles help summarize the distribution of a dataset, providing insights into its spread and central tendency.
  2. Outlier Detection: Observations that fall far from certain quantiles may be considered outliers, aiding in anomaly detection.
  3. Probability Distributions: Quantiles are used to describe the distribution of random variables, facilitating the analysis of probability distributions in machine learning models.
  4. Comparative Analysis: By comparing quantiles across different datasets, analysts can make informed decisions about the relative standing and characteristics of the datasets.
  5. Risk Assessment: In finance and other fields, quantiles are used to assess the risk of investments by determining the potential for loss or gain based on the distribution of data.

Understanding these uses is essential for effectively utilizing quantiles in machine learning and data analysis tasks.
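
As an illustration of the outlier-detection use, the sketch below applies the widely used 1.5×IQR rule, where IQR is the interquartile range Q3−Q1. The rule itself and the 1.5 multiplier are a common convention assumed here, not something prescribed by this article, and the sample values are made up.

```python
import statistics

# Made-up sample with one suspiciously large value
values = [12, 15, 18, 20, 22, 25, 28, 30, 32, 35, 38, 95]

# Quartiles using the (n - 1) convention
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Assumption: the conventional 1.5 * IQR fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_fence or v > upper_fence]
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
```

With Q1=19.5 and Q3=32.75, the upper fence is 52.625, so only the value 95 is flagged.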

Challenges and Limitations of Quantiles

  1. Influence of Outliers: Central quantiles such as the median and quartiles are fairly robust, but extreme quantiles (for example, the 95th or 99th percentile) and quantiles estimated from small samples can be sensitive to outliers. In those cases a single outlier can shift the quantile noticeably and misrepresent the data’s spread.
  2. Skewed Distributions: Quantiles may not fully capture the characteristics of skewed distributions. For highly skewed datasets, the quantiles may not provide a complete picture of the data distribution, especially in the tails.
  3. Variability in Calculations: Different methods and software packages may use different algorithms for calculating quantiles, leading to variability in results. This can be a challenge when comparing quantiles across different datasets or when using quantiles for decision-making.
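
How much an outlier matters depends on which quantile is being computed. The sketch below is a rough check using the standard-library statistics module on made-up data; the helper name summarize is hypothetical. It compares the mean, the median, and the 90th percentile before and after the largest value is replaced by an extreme one.

```python
import statistics

clean = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
with_outlier = clean[:-1] + [1000]   # replace the largest value with an extreme one


def summarize(values):
    """Return the mean, median, and 90th percentile of values."""
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "p90": deciles[8],           # the 9th decile is the 90th percentile
    }


before = summarize(clean)
after = summarize(with_outlier)
for key in before:
    print(f"{key}: {before[key]} -> {after[key]}")
```

Here the median stays at 55, while the 90th percentile moves from 91 to 181 and the mean from 55 to 145.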

Conclusion

Quantiles are powerful statistical measures that provide valuable insights into the distribution of data. Understanding and utilizing quantiles effectively in machine learning and data science can enhance data analysis, model building, and decision-making processes. By calculating and interpreting quantiles, data scientists can gain more information about datasets and make informed decisions in various analytical tasks.


