
Skewness of Statistical Data


Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. In simpler terms, it indicates whether the data is concentrated more on one side of the mean compared to the other side.

Why is skewness important?

Understanding the skewness of data is crucial for several reasons:

  1. Modeling Assumptions: Linear models often assume approximately normally distributed (and therefore symmetric) variables or residuals. Skewness quantifies the degree of deviation from this assumption, allowing appropriate adjustments or transformations to be made to improve model performance.
  2. Feature Engineering: Skewed data can significantly impact the performance of machine learning models. Knowing the skewness helps in identifying which features may need transformation (e.g., log transformation) to make the data more suitable for modeling.
  3. Prediction Accuracy: Skewness provides insights into the distribution of data values, which can influence how well a model generalizes to new, unseen data. By understanding the skewness, one can anticipate how the model might perform on different segments of the data and make adjustments accordingly.
  4. Outlier Detection: Skewed distributions often indicate the presence of outliers, which can have a significant impact on model estimation and inference. Knowing the direction of skewness helps in identifying where outliers are more likely to occur, guiding the outlier detection and treatment process.
  5. Interpretation of Results: Skewness information aids in the interpretation of model results. For instance, in a model predicting mpg from car horsepower, knowing that the data is positively skewed suggests that the model may perform better for cars with lower horsepower, which informs decision-making and further analysis.

Central Limit Theorem

The Central Limit Theorem (CLT) is a fundamental concept in statistics with broad applications across various fields. It states that, under certain conditions, the distribution of the sample means of a sufficiently large sample drawn from any population will approximate a normal distribution, regardless of the shape of the original population distribution.
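As an illustrative sketch (not from the original article; the distribution, sample sizes, and seed are arbitrary choices), the CLT can be observed numerically with NumPy: draw many samples from a strongly right-skewed exponential population and note that the distribution of their means is far less skewed.

```python
import numpy as np

rng = np.random.default_rng(42)

def simple_skewness(x):
    """Moment-based skewness: mean of (X - mean)^3 divided by std^3."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

# Strongly right-skewed population: exponential (theoretical skewness = 2)
population = rng.exponential(scale=1.0, size=100_000)

# Means of 5000 samples of size 50 drawn from the same population
sample_means = rng.exponential(scale=1.0, size=(5000, 50)).mean(axis=1)

print("Population skewness:  ", round(simple_skewness(population), 3))
print("Sample-mean skewness: ", round(simple_skewness(sample_means), 3))
```

The sample-mean skewness shrinks roughly by a factor of the square root of the sample size, illustrating the convergence toward normality.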

Coefficient of Skewness

  • Pearson’s first coefficient of skewness subtracts the mode from the mean and divides the difference by the standard deviation. This method works well for data with a single prominent mode but may not be suitable for datasets with a weak mode or multiple modes. In such cases, Pearson’s second coefficient is preferable.
    First coefficient : \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}}
  • Pearson’s second coefficient of skewness subtracts the median from the mean, multiplies the difference by 3, and divides the product by the standard deviation.
    Second coefficient : \frac{3(\text{Mean} - \text{Median})}{\text{Standard Deviation}}
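Both coefficients can be computed with NumPy and the standard library, as in the sketch below (the data values are made up for illustration; the Counter-based mode only makes sense for data with repeated values):

```python
import numpy as np
from collections import Counter

# Illustrative data with a clear mode (4) and a longer right tail
data = np.array([2, 3, 3, 4, 4, 4, 5, 6, 9, 12], dtype=float)

mean = data.mean()
median = np.median(data)
mode = Counter(data.tolist()).most_common(1)[0][0]
std = data.std(ddof=1)  # sample standard deviation

first = (mean - mode) / std          # Pearson's first coefficient
second = 3 * (mean - median) / std   # Pearson's second coefficient

print("First coefficient: ", round(first, 3))
print("Second coefficient:", round(second, 3))
```

Both coefficients come out positive here, agreeing that the data is right-skewed, though their magnitudes differ because they use different measures of central tendency.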

Alternatively, the moment-based (adjusted Fisher-Pearson) sample skewness can be used:

Skewness = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{s} \right)^3

where n is the sample size and s is the sample standard deviation.

Interpretation of skewness

Skewness is a measure of the asymmetry of a distribution. It tells us about the extent to which the data deviates from a symmetric distribution. Here’s how to interpret skewness:

  1. Skewness value around 0: A skewness value close to zero (between -0.5 and 0.5) indicates that the distribution is approximately symmetric. This means that the data are evenly distributed around the mean, with roughly equal frequencies of values on both sides.
  2. Negative skewness: If the skewness is negative (less than -0.5), it indicates that the left tail of the distribution is longer or stretched out compared to the right tail. In other words, the majority of the data points are concentrated on the right side of the distribution, with a few extremely low values dragging the mean to the left.
  3. Positive skewness: Conversely, if the skewness is positive (greater than 0.5), it suggests that the right tail of the distribution is longer or stretched out relative to the left tail. In this case, most of the data points are clustered on the left side of the distribution, with a few extremely high values pulling the mean to the right.
  4. Magnitude of skewness: The magnitude of the skewness value indicates the degree of asymmetry. Larger positive or negative values imply greater asymmetry. Skewness values less than -1 or greater than 1 are considered highly skewed.
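The rule-of-thumb cutoffs above can be wrapped in a small helper (a sketch; the category labels are my own wording):

```python
def interpret_skewness(skew):
    """Classify a skewness value using the common
    -1 / -0.5 / 0.5 / 1 rule-of-thumb cutoffs."""
    if skew < -1:
        return "highly negatively skewed"
    elif skew < -0.5:
        return "moderately negatively skewed"
    elif skew <= 0.5:
        return "approximately symmetric"
    elif skew <= 1:
        return "moderately positively skewed"
    else:
        return "highly positively skewed"

for value in (-1.4, -0.7, 0.1, 0.8, 2.3):
    print(value, "->", interpret_skewness(value))
```

Note that these thresholds are conventions, not hard rules; what counts as "highly" skewed can depend on the field and the sample size.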

Types of Skewness

There are three main types of skewness:

  1. Positive Skewness (Right Skewness):
    • In a positively skewed distribution, the tail of the distribution extends towards the right.
    • The majority of the data points are concentrated on the left side of the distribution, while the right tail is longer.
    • The mean is typically greater than the median, and the mode is usually less than the median.
    • It’s also known as right skewness because the longer tail is on the right side when the distribution is represented graphically.
    • Example: Income distribution in a population where a few individuals have extremely high incomes.
  2. Negative Skewness (Left Skewness):
    • In a negatively skewed distribution, the tail of the distribution extends towards the left.
    • Most of the data points are concentrated on the right side of the distribution, while the left tail is longer.
    • The mean is typically less than the median, and the mode is usually greater than the median.
    • It’s called left skewness because the longer tail is on the left side when graphed.
    • Example: Exam scores of a class where most students perform well but a few perform poorly.
  3. Zero Skewness:
    • A distribution is considered to have zero skewness when it is perfectly symmetrical.
    • In a symmetric distribution, the mean, median, and mode are all equal.
    • There’s an equal probability of values occurring on either side of the mean.
    • Example: Many natural phenomena, such as human height distribution in a well-nourished population.
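The mean/median orderings described above can be checked numerically. The sketch below uses an exponential sample for the right skew and its mirror image for the left skew (the distribution and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Right-skewed sample: exponential, long right tail
right = rng.exponential(scale=2.0, size=10_000)
# Left-skewed sample: its mirror image, long left tail
left = -right

print("Right-skewed: mean > median?", bool(right.mean() > np.median(right)))
print("Left-skewed:  mean < median?", bool(left.mean() < np.median(left)))
```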


Implementation in Python

  • We define a function skewness that takes a numpy array data as input.
  • Inside the function, we calculate the mean, standard deviation, and length of the data array.
  • We then use the formula for skewness to calculate the skewness value.

Python3

import numpy as np
 
# Function to calculate the adjusted Fisher-Pearson skewness
def skewness(data):
    mean_value = np.mean(data)
    std_dev = np.std(data, ddof=1)  # sample standard deviation, as in the formula
    n = len(data)
    skew = (sum((x - mean_value) ** 3 for x in data) * n) / ((n - 1) * (n - 2) * std_dev ** 3)
    return skew
 
# Example data
data = np.array([2.5, 3.7, 6.6, 9.1, 9.5, 10.7, 11.9, 21.5, 22.6, 25.2])
 
# Calculate skewness
result = skewness(data)
print(f"Skewness: {result:.4f}")

Output:

Skewness: 0.5751
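One caveat worth noting: np.std defaults to the population standard deviation (ddof=0), while the adjusted Fisher-Pearson formula is conventionally stated with the sample standard deviation (ddof=1). The sketch below shows how much this choice matters on the example dataset:

```python
import numpy as np

data = np.array([2.5, 3.7, 6.6, 9.1, 9.5, 10.7, 11.9, 21.5, 22.6, 25.2])

def adjusted_skewness(x, ddof):
    """Adjusted Fisher-Pearson skewness; `ddof` selects the std convention."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = np.std(x, ddof=ddof)
    return n * np.sum((x - x.mean()) ** 3) / ((n - 1) * (n - 2) * s ** 3)

print("Population std (ddof=0):", round(adjusted_skewness(data, 0), 4))
print("Sample std     (ddof=1):", round(adjusted_skewness(data, 1), 4))
```

Both variants agree on the direction of the skew; only the magnitude changes, with the population-std version running slightly higher on this dataset.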



Last Updated : 11 Feb, 2024