
Kolmogorov-Smirnov Test (KS Test)

Last Updated : 01 Feb, 2024

The Kolmogorov-Smirnov (KS) test is a non-parametric method for comparing distributions: it can test a sample against a reference distribution or compare two samples with each other, and it is used across many fields.

In this article, we will look at this non-parametric test and how it can be used to determine whether two distributions have the same shape.

What is the Kolmogorov-Smirnov Test?

The Kolmogorov–Smirnov test is an efficient way to determine whether two samples differ significantly from each other. It is commonly used to check the uniformity of random numbers. Uniformity is one of the most important properties of any random number generator, and the Kolmogorov–Smirnov test can be used to verify it.

The Kolmogorov–Smirnov test is versatile and can be employed to evaluate whether two underlying one-dimensional probability distributions vary. It serves as an effective tool to determine the statistical significance of differences between two sets of data. This test is particularly valuable in various fields, including statistics, data analysis, and quality control, where the uniformity of random numbers or the distributional differences between datasets need to be rigorously examined.

Kolmogorov Distribution

The Kolmogorov distribution is the distribution of the test statistic, often denoted D: the maximum difference between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution.

The distribution does not have a simple closed form, so tables or statistical software are commonly used to obtain critical values for the test; the critical values depend on the sample size and on the significance level chosen. Its cumulative distribution function is:

F(x) = 1 - 2 \sum_{k = 1}^{\infty} (-1)^{k-1} e^{-2k^2 x^2}

where

  • n is the sample size (for large n, the scaled statistic \sqrt{n} D_n follows this distribution),
  • x is the normalized Kolmogorov-Smirnov statistic at which the CDF is evaluated,
  • k is the index of summation in the series (a quick numerical check of this series appears below).
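To make the series concrete, here is a minimal sketch (the helper name kolmogorov_cdf is ours) that sums its first terms and compares the result against scipy.stats.kstwobign, SciPy's implementation of this limiting distribution:

Python3

import numpy as np
from scipy.stats import kstwobign

def kolmogorov_cdf(x, terms=100):
    # Partial sum of F(x) = 1 - 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 x^2)
    k = np.arange(1, terms + 1)
    return 1 - 2 * np.sum((-1) ** (k - 1) * np.exp(-2 * k**2 * x**2))

for x in [0.5, 1.0, 1.5]:
    print(f"x={x}: series={kolmogorov_cdf(x):.6f}, scipy={kstwobign.cdf(x):.6f}")

Both columns should agree to many decimal places, which is why tables and software can evaluate the distribution despite the infinite sum.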

How does the Kolmogorov-Smirnov Test work?

Below are the steps for how the Kolmogorov-Smirnov test works:

  1. Hypotheses Formulation:
    • Null Hypothesis: The sample follows a specified distribution.
    • Alternative Hypothesis: The sample does not follow the specified distribution.
  2. Selection of a Reference Distribution:
    • A theoretical distribution (e.g., normal, exponential) is chosen against which the sample will be tested. This choice is usually based on theoretical expectations or prior knowledge.
  3. Calculation of the Test Statistic (D):
    • For a one-sample Kolmogorov-Smirnov test, the test statistic (D) represents the maximum vertical deviation between the empirical distribution function (EDF) of the sample and the cumulative distribution function (CDF) of the reference distribution.
    • For a two-sample Kolmogorov-Smirnov test, the test statistic compares the EDFs of two independent samples.
  4. Determination of Critical Value or P-value:
    • The test statistic (D) is compared to a critical value from the Kolmogorov-Smirnov distribution table or, more commonly, a p-value is calculated (a numerical sketch of the critical value follows this list).
    • If the p-value is less than the significance level (commonly 0.05), the null hypothesis is rejected, suggesting that the sample distribution does not match the specified distribution.
  5. Interpretation of Results:
    • If the null hypothesis is rejected, it indicates that there is evidence that the sample does not follow the specified distribution, favoring the alternative hypothesis.
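As a small numerical illustration of step 4, the large-sample approximation gives the two-sample critical value as c(α)·√((n+m)/(nm)) with c(α) = √(−ln(α/2)/2); this standard asymptotic formula is used here as an assumption, not as part of the original example:

Python3

import numpy as np

alpha = 0.05
n, m = 100, 120                                # sizes of the two samples
c_alpha = np.sqrt(-np.log(alpha / 2) / 2)      # approx. 1.358 for alpha = 0.05
critical_d = c_alpha * np.sqrt((n + m) / (n * m))
print(c_alpha, critical_d)

If the observed two-sample statistic D exceeds critical_d, the null hypothesis is rejected at level α.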

When to Use the Kolmogorov-Smirnov Test?

The main idea behind using the Kolmogorov-Smirnov test is to check whether two samples follow the same type of distribution, i.e., whether the shapes of their distributions match.

Let’s break down the scenarios where this test is applicable:

  1. Comparison of Probability Distributions: The test is used to evaluate whether two samples exhibit the same probability distribution.
  2. Compare the shape of the distributions: To compare the shapes of two samples’ distributions, the test assesses the maximum absolute difference between their cumulative probability distributions.
  3. Check Distributional Differences: The test quantifies the maximum difference between the cumulative probability distributions; a higher value indicates greater dissimilarity in the shape of the distributions.
  4. Hypothesis Testing Types: The assessment of the shape of sample data is typically done through hypothesis testing, which can be categorized into two types:
    1. Parametric Test
    2. Non-Parametric Test

One Sample Kolmogorov-Smirnov Test

The one-sample Kolmogorov-Smirnov (KS) test is used to determine whether a sample comes from a specific distribution. It is particularly useful when the assumption of normality is in question or when dealing with small sample sizes.

The test statistic, denoted as D_n, measures the maximum difference between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution.

Empirical Distribution Function

The empirical distribution function at the value x represents the proportion of data points that are less than or equal to x in the sample. The function can be defined as:

F_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{(-\infty, x]}(X_i)

where,

  • n is the number of observations in the sample
  • X_i represents the individual observations
  • \mathbb{1}_{(-\infty, x]}(X_i) is an indicator function that equals 1 if X_i ≤ x and 0 otherwise (illustrated in the sketch below).
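As a quick illustration of this definition, the following minimal sketch (the helper edf is our own, not a library function) evaluates F_n(x) on a toy sample:

Python3

import numpy as np

def edf(x, data):
    # Proportion of observations less than or equal to x
    return np.mean(np.asarray(data) <= x)

data = [3, 1, 4, 1, 5]
for x in [0, 1, 3.5, 5]:
    print(x, edf(x, data))   # prints 0.0, 0.4, 0.6, 1.0

The EDF is a step function: it jumps by 1/n at each observation and ranges from 0 to 1.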

Kolmogorov–Smirnov Statistic

The Kolmogorov–Smirnov statistic for a given cumulative distribution function F(x) is defined as:

D_n = \sup_x |F(x) - F_n(x)|

where,

  • sup stands for supremum, which means the largest value over all possible values of x,
  • F(x) is the theoretical cumulative distribution function,
  • F_n(x) is the empirical cumulative distribution function of the sample, calculated as described above (the sketch below computes D_n directly and checks it against SciPy).
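For a continuous F, the supremum is attained at (or just before) one of the sorted sample points, so D_n can be computed directly. The sketch below does so with a hypothetical helper ks_statistic and checks the result against scipy.stats.kstest:

Python3

import numpy as np
from scipy.stats import norm, kstest

def ks_statistic(sample, cdf):
    # D_n = sup_x |F(x) - F_n(x)|: check the gap on both sides of each EDF jump
    x = np.sort(sample)
    n = len(x)
    f = cdf(x)
    d_plus = np.max(np.arange(1, n + 1) / n - f)   # EDF above F at a jump
    d_minus = np.max(f - np.arange(0, n) / n)      # F above EDF just before a jump
    return max(d_plus, d_minus)

rng = np.random.default_rng(0)
sample = rng.normal(size=50)
print(ks_statistic(sample, norm.cdf))     # manual computation
print(kstest(sample, 'norm').statistic)   # should match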

Example

Let’s say you have a sample of n observations. You want to test whether this sample comes from a normal distribution with mean μ and standard deviation σ. The null hypothesis is that the sample follows the specified distribution. The steps of the test are:

  • Compute the Empirical Distribution Function
  • Specify the Reference Distribution
    • In this case, the cumulative distribution function of the normal distribution with mean μ and standard deviation σ is used.
  • Calculate the Kolmogorov–Smirnov Statistic
  • Compare the KS statistic with the critical value or p-value

Kolmogorov-Smirnov Test Python One-Sample

Python3

import numpy as np
from scipy.stats import norm, kstest

# Step 1: Generate a sample from a normal distribution
np.random.seed(42)
sample_size = 100
mean = 0
std_dev = 1
sample = np.random.normal(mean, std_dev, sample_size)

# Step 2: Compute the Empirical Distribution Function (EDF)
# (shown for illustration; kstest computes this internally)
def empirical_distribution_function(x, data):
    return np.sum(data <= x) / len(data)

edf_values = [empirical_distribution_function(x, sample) for x in sample]

# Step 3: Define the reference distribution (standard normal CDF)
reference_cdf = norm.cdf(sample)

# Step 4: Calculate the Kolmogorov–Smirnov statistic and p-value
ks_statistic, ks_p_value = kstest(sample, 'norm')

# Step 5: Compare with the critical value and significance level.
# For large n, the 5% critical value for D_n is approximately
# 1.36 / sqrt(n) (from the Kolmogorov-Smirnov table), not 1.36 itself.
alpha = 0.05
critical_value = 1.36 / np.sqrt(sample_size)

print(f"Kolmogorov-Smirnov Statistic: {ks_statistic}")
print(f"P-value: {ks_p_value}")

if ks_statistic > critical_value or ks_p_value < alpha:
    print("Reject the null hypothesis. The sample does not come from the specified distribution.")
else:
    print("Fail to reject the null hypothesis. The sample comes from the specified distribution.")


Output:

Kolmogorov-Smirnov Statistic: 0.10357070563896065
P-value: 0.21805553378516235
Fail to reject the null hypothesis. The sample comes from the specified distribution.

  • The statistic is relatively small (0.103), suggesting that the EDF and CDF are close.
  • Since the p-value (0.218) is greater than the chosen significance level (commonly 0.05), we fail to reject the null hypothesis.

Therefore, we cannot conclude that the sample does not come from the specified distribution (a normal distribution with mean 0 and standard deviation 1).

Two-Sample Kolmogorov–Smirnov Test

The two-sample Kolmogorov-Smirnov (KS) test is used to compare two independent samples to assess whether they come from the same distribution. It’s a distribution-free test that evaluates the maximum vertical difference between the empirical distribution functions (EDFs) of the two samples.

Empirical Distribution Function (EDF):

The empirical distribution function at the value x in each sample represents the proportion of observations less than or equal to x. Mathematically, the EDFs for the two samples are given by:

For Group 1:

F_1(x) = \frac{1}{n_1} \sum_{i=1}^{n_1} \mathbb{1}_{(-\infty, x]}(X_{1i})

For Group 2:

F_2(x) = \frac{1}{n_2} \sum_{j=1}^{n_2} \mathbb{1}_{(-\infty, x]}(X_{2j})

Where,

  • n_1 and n_2 are the sample sizes for the two groups
  • X_{1i} and X_{2j} represent individual observations in the respective samples,
  • \mathbb{1}_{(-\infty, x]}(X_{1i}) and \mathbb{1}_{(-\infty, x]}(X_{2j}) are the indicator functions.

Kolmogorov–Smirnov Statistic

D_{n_1, n_2} = \sup_{x} |F_1(x) - F_2(x)|

where,

  • sup denotes supremum, representing the largest value over all possible values of x,
  • F_1(x) and F_2(x) are the empirical cumulative distribution functions (ECDFs) of the two samples, respectively,
  • each ECDF represents the proportion of observations in the corresponding sample that are less than or equal to a particular value of x (the sketch below computes this statistic directly).
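Since both ECDFs are step functions, the supremum can be found by evaluating them at every point of the pooled sample. Here is a minimal sketch (the helper two_sample_ks is ours), checked against scipy.stats.ks_2samp:

Python3

import numpy as np
from scipy.stats import ks_2samp

def two_sample_ks(a, b):
    # Evaluate both ECDFs on the pooled sample and take the largest gap
    pooled = np.concatenate([a, b])
    f1 = np.searchsorted(np.sort(a), pooled, side='right') / len(a)
    f2 = np.searchsorted(np.sort(b), pooled, side='right') / len(b)
    return np.max(np.abs(f1 - f2))

rng = np.random.default_rng(1)
a = rng.normal(0, 1, 80)
b = rng.normal(0.5, 1.5, 90)
print(two_sample_ks(a, b))        # manual computation
print(ks_2samp(a, b).statistic)   # should match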

Example

Let’s perform the two-sample Kolmogorov–Smirnov test using the scipy.stats.ks_2samp function, which calculates the Kolmogorov–Smirnov statistic for two samples to determine whether they come from the same distribution.

Kolmogorov-Smirnov Test Python Two-Sample

  • The null hypothesis assumes that the two samples come from the same distribution.
  • The decision is based on comparing the p-value with a chosen significance level (e.g., 0.05). If the p-value is less than the significance level, reject the null hypothesis, indicating that the two samples come from different distributions.

Python3

import numpy as np
from scipy.stats import ks_2samp
np.random.seed(42)
sample1 = np.random.normal(0, 1, 100)
sample2 = np.random.normal(0.5, 1.5, 120)
 
ks_statistic, p_value = ks_2samp(sample1, sample2)
 
print(f"Kolmogorov–Smirnov Statistic: {ks_statistic}")
print(f"P-value: {p_value}")
 
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. The two samples come from different distributions.")
else:
    print("Fail to reject the null hypothesis. There is not enough evidence to suggest different distributions.")


Output:

Kolmogorov–Smirnov Statistic: 0.35833333333333334
P-value: 9.93895980740741e-07
Reject the null hypothesis. The two samples come from different distributions.

  • The statistic is relatively large (0.358), indicating a noticeable discrepancy between the two sample distributions.
  • The small p-value suggests strong evidence against the null hypothesis that the two samples come from the same distribution.

Therefore, the two samples come from different distributions.

One-Sample KS Test vs Two-Sample KS Test

| One-Sample KS Test | Two-Sample KS Test |
| --- | --- |
| Employed to assess whether a single sample of data conforms to a specific theoretical distribution. | Utilized to evaluate whether two independent samples originate from the same underlying distribution. |
| Compares the empirical distribution function (EDF) of the sample with the cumulative distribution function (CDF) of the theoretical distribution. | Compares the EDF of one sample with the EDF of the other sample. |
| The null hypothesis assumes that the sample is drawn from the specified distribution. | The null hypothesis posits that the two samples are drawn from identical distributions. |
| The test statistic represents the maximum vertical deviation between the EDF and the CDF. | The test statistic reflects the maximum difference between the two EDFs. |

Multidimensional Kolmogorov-Smirnov Testing

The Kolmogorov-Smirnov (KS) test, in its traditional form, is designed for one-dimensional data, where it assesses the similarity between the empirical distribution function (EDF) and a theoretical or another empirical distribution along a single axis. However, when dealing with data in more than one dimension, the extension of the KS test becomes more complex.

In the context of multidimensional data, the concept of the Kolmogorov-Smirnov statistic can be adapted to evaluate differences across multiple dimensions. This adaptation often involves considering the maximum distance or discrepancy in the cumulative distribution functions along each dimension. A generalization of the KS test to higher dimensions is known as the Kolmogorov-Smirnov n-dimensional test.

The Kolmogorov-Smirnov n-dimensional test aims to evaluate whether two samples in multiple dimensions follow the same distribution. The test statistic becomes a function of the maximum differences in cumulative distribution functions along each dimension.
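SciPy does not ship a multidimensional KS test, so the following is only a simplified two-sample sketch of the idea: it compares the joint ECDFs P(X ≤ x, Y ≤ y) of two 2-D samples at every observed point. The full Peacock and Fasano-Franceschini tests also scan the other quadrant orientations, and significance is usually assessed by permutation rather than by the one-dimensional Kolmogorov distribution.

Python3

import numpy as np

def ks_2d_simplified(a, b):
    # Compare joint ECDFs P(X <= x, Y <= y) at every observed point.
    # Only one quadrant orientation is checked here, for brevity.
    def ecdf(sample, p):
        return np.mean(np.all(sample <= p, axis=1))
    pts = np.vstack([a, b])
    return max(abs(ecdf(a, p) - ecdf(b, p)) for p in pts)

rng = np.random.default_rng(2)
a = rng.normal(0.0, 1.0, size=(100, 2))
b = rng.normal(0.3, 1.0, size=(100, 2))
print(ks_2d_simplified(a, b))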

Applications of the Kolmogorov-Smirnov Test

The main applications of the Kolmogorov-Smirnov test are:

Goodness-of-fit testing

The KS test can be used to evaluate how well a sample data set fits a hypothesized distribution. This can help determine whether a sample is likely to have been drawn from a particular distribution, such as a normal or exponential distribution. It is frequently used in fields such as finance, engineering, and the natural sciences to verify whether a data set conforms to an expected distribution, which can have implications for decision-making, model fitting, and prediction.

Two-sample comparison

The KS test is used to compare two data sets to decide whether they are drawn from the same underlying distribution. This can help assess whether there are statistically significant differences between two data sets, such as comparing the performance of two distinct groups in an experiment or comparing the distributions of two specific variables.

It is commonly used in fields such as the social sciences, medicine, and business to evaluate whether there are significant differences between groups or populations.

Hypothesis testing

The KS test can check specific hypotheses about the distributional properties of a data set. For instance, it can be used to test whether a data set is normally distributed or whether it follows some other theoretical distribution. This is useful for verifying assumptions made in statistical analyses or validating model assumptions.

Non-parametric alternative

The K-S test is a non-parametric test, which means it does not require assumptions about the form or parameters of the underlying distributions being compared. This makes it a useful alternative to parametric tests, such as the t-test or ANOVA, when data do not meet the assumptions of those tests, for example when data are not normally distributed, have unknown or unequal variances, or come from small samples.

Limitations of the Kolmogorov-Smirnov Test

  • Sensitivity to sample size: The K-S test may have limited power with small sample sizes, and may yield statistically significant results with large sample sizes even for small deviations.
  • Assumes independence: The K-S test assumes that the data sets being compared are independent, and may not be appropriate for dependent data.
  • Limited to continuous data: The K-S test is designed for continuous data and may not be suitable for discrete or categorical data without modifications.
  • Lack of sensitivity to specific distributional properties: The K-S test assesses general differences between distributions and may not be sensitive to differences in specific distributional properties.
  • Vulnerability to Type I error with multiple comparisons: Running multiple K-S tests, or using the K-S test within a larger hypothesis-testing framework, may increase the risk of Type I errors.

Conclusion

While versatile, the KS test demands caution in sample size considerations, assumptions, and interpretations to ensure robust and accurate analyses.

Kolmogorov-Smirnov Test - FAQs

Q. What is Kolmogorov-Smirnov test used for?

Used to assess whether a sample follows a specified distribution or to compare two samples’ distributions.

Q. What is the difference between T test and Kolmogorov-Smirnov test?

T-test compares means of two groups; KS test compares entire distributions for similarity or goodness-of-fit.

Q. How do you interpret Kolmogorov-Smirnov test for normality?

If the p-value is high (e.g., > 0.05), the data may follow a normal distribution; a low p-value suggests a departure from normality.
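One practical caveat, sketched below: scipy.stats.kstest compares against a fully specified distribution, so to test normality with unknown parameters the data are usually standardized first. Because the mean and standard deviation are then estimated from the same data, the resulting p-value is only approximate (a Lilliefors-type correction is more exact).

Python3

import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(3)
x = rng.normal(5, 2, 200)

# Standardize so the sample can be compared against the standard normal CDF
z = (x - x.mean()) / x.std(ddof=1)
print(kstest(z, 'norm'))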

Q. How do you interpret KS test p value?

If the p-value is below the chosen significance level (commonly 0.05), we reject the null hypothesis, indicating a significant difference; a large p-value (above the significance level) suggests no significant difference.

Q. Which normality test is best?

No one-size-fits-all. Anderson-Darling, Shapiro-Wilk, and KS test are commonly used; choice depends on data size and characteristics.


