Chi-square test in Machine Learning

Last Updated : 21 Dec, 2023

Chi-Square test is a statistical method crucial for analyzing associations in categorical data. Its applications span various fields, aiding researchers in understanding relationships between factors. This article elucidates Chi-Square types, steps for implementation, and its role in feature selection, exemplified through Python code on the Iris dataset.

Table of Content

What is Chi-Square test?
Types of Chi-Square test
Why do we use the Chi-Square Test?
Steps to perform Chi-square test
Chi-square Test for Feature Selection
Python Implementation of Chi-Square feature selection

What is Chi-Square test?

The chi-square test is a statistical test used to determine if there is a significant association between two categorical variables. It is a non-parametric test, meaning it does not assume that the data follow a particular distribution such as the normal distribution. The test is based on the comparison of observed and expected frequencies within a contingency table. In machine learning, it helps with feature selection by measuring the relationship between each feature and the target. It determines whether the association between two categorical variables observed in the sample is likely to reflect a real association in the population.

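The test statistic is computed from the observed and expected frequencies. Under the null hypothesis it approximately follows the chi-square distribution, which belongs to the family of continuous probability distributions and is defined as the distribution of the sum of the squares of k independent standard normal random variables. The statistic is given by: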

$\chi^2_c = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$ ………..eq(1)

where,

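$c$ is the degrees of freedom
$O_{ij}$ is the observed frequency in cell ${ij}$
$E_{ij}$ is the expected frequency in cell ${ij}$, calculated as $E_{ij} = \frac{R_i \times C_j}{N}$, where $R_i$ is the total of row i, $C_j$ is the total of column j, and N is the total number of observations.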
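As a minimal sketch of equation (1), the statistic can be computed directly with NumPy. The 2×2 table of counts below is hypothetical, and the function name `chi_square_statistic` is illustrative only; the expected frequencies are taken as row total × column total ÷ N.

```python
import numpy as np


def chi_square_statistic(observed):
    """Return (statistic, expected) for a 2-D table of observed counts."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1)
    col_totals = observed.sum(axis=0)
    n = observed.sum()
    # Expected count for cell (i, j): R_i * C_j / N
    expected = np.outer(row_totals, col_totals) / n
    # Sum of (O - E)^2 / E over all cells
    statistic = ((observed - expected) ** 2 / expected).sum()
    return statistic, expected


# Hypothetical 2x2 table of counts (not taken from the article)
observed = [[10, 20],
            [30, 40]]
statistic, expected = chi_square_statistic(observed)
print(expected)
print(round(statistic, 4))
```

For this hypothetical table the expected counts are 12, 18, 28 and 42, and the statistic is about 0.794.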

Chi-Square Distribution

The chi-square distribution is a continuous probability distribution that arises in statistics and is associated with the sum of the squares of independent standard normal random variables. It is often denoted as $\chi^2$ and is parameterized by the degrees of freedom k.

It is widely used in statistical analysis, particularly in hypothesis testing and in constructing confidence intervals. Tests based on it, such as the chi-square test described here, are commonly applied to categorical count data.
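The definition above can be checked with a small simulation. The sketch below assumes k = 3 and a fixed random seed, both chosen only for illustration, and compares the sample mean and variance with the theoretical values k and 2k.

```python
import numpy as np

k = 3  # degrees of freedom (chosen only for illustration)
rng = np.random.default_rng(seed=0)

# Each row holds k independent standard normal draws; squaring and summing
# them gives one draw from a chi-square distribution with k degrees of freedom.
z = rng.standard_normal(size=(100_000, k))
samples = (z ** 2).sum(axis=1)

print("sample mean:", samples.mean())      # theoretical mean is k
print("sample variance:", samples.var())   # theoretical variance is 2k
```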

Key terms used in Chi-Square test

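Degrees of freedom: The number of values that are free to vary when computing the statistic. For a contingency table with m rows and n columns it is $(m - 1) \times (n - 1)$.
Observed values: The actual counts collected in the data.
Expected values: The counts predicted under the null hypothesis, calculated as $E_{ij} = \frac{R_i \times C_j}{N}$
- where, $R_i$ : Total of row i
- $C_j$ : Total of column j
- N: Total number of observations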
Contingency table: A contingency table, also known as a cross-tabulation or two-way table, is a statistical table that displays the distribution of two categorical variables.
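A contingency table can be built from raw records with `pandas.crosstab`. The six records below are made up for illustration; `margins=True` adds the row and column totals under the label `All`.

```python
import pandas as pd

# Hypothetical raw records: one row per customer
records = pd.DataFrame({
    "preference": ["Electronics", "Clothing", "Books",
                   "Electronics", "Clothing", "Electronics"],
    "payment": ["Credit Card", "PayPal", "Debit Card",
                "Credit Card", "Credit Card", "PayPal"],
})

# Count how often each (preference, payment) pair occurs
table = pd.crosstab(records["preference"], records["payment"], margins=True)
print(table)
```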

Types of Chi-Square test

There are several types of chi-square tests, each designed to address specific research questions or scenarios. The two main types are the chi-square test for independence and the chi-square goodness-of-fit test.

Chi-Square Test for Independence: This test assesses whether there is a significant association or relationship between two categorical variables. It is used to determine whether the distribution of one variable is independent of the other. This test is applied when we have counts of values for two nominal or categorical variables. To conduct this test, two requirements must be met: independence of observations and a relatively large sample size (a common rule of thumb is an expected count of at least 5 in each cell).
For example, suppose we are interested in exploring whether there is a relationship between online shopping preferences and the payment methods people choose. The first variable is the type of online shopping preference (e.g., Electronics, Clothing, Books), and the second variable is the chosen payment method (e.g., Credit Card, Debit Card, PayPal).
The null hypothesis in this case would be that the choice of online shopping preference and the selected payment method are independent.
Chi-Square Goodness-of-Fit Test: The Chi-Square Goodness-of-Fit test is used in statistical hypothesis testing to ascertain whether a variable is likely from a given distribution or not. This test can be applied in situations when we have value counts for categorical variables. With the help of this test, we can determine whether the data values are a representative sample of the entire population or if they fit our hypothesis well.
For example, imagine you are testing the fairness of a six-sided die. The null hypothesis is that each face of the die should have an equal probability of landing face up. In other words, the die is unbiased, and the proportions of each number (1 through 6) occurring are expected to be equal.
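The die example can be sketched with `scipy.stats.chisquare`. The counts below are hypothetical (60 rolls); when no expected frequencies are passed, the function assumes all categories are equally likely, which matches the null hypothesis of a fair die.

```python
from scipy.stats import chisquare

# Hypothetical counts of faces 1..6 over 60 rolls
observed_counts = [8, 12, 9, 11, 10, 10]

# With no f_exp argument, chisquare assumes equal expected frequencies
statistic, p_value = chisquare(observed_counts)
print(statistic, p_value)

alpha = 0.05
if p_value < alpha:
    print("Reject H0: the die appears biased")
else:
    print("Fail to reject H0: no evidence that the die is biased")
```

With these made-up counts the statistic is 1.0 and the p-value is well above 0.05, so the null hypothesis of a fair die is not rejected.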

Why do we use the Chi-Square Test?

The chi-square test is widely used across diverse fields to analyze categorical data, offering valuable insights into associations or differences between categories.
Its primary application lies in testing the independence of two categorical variables, determining if changes in one variable relate to changes in another.
It is particularly useful for understanding relationships between factors, such as gender and preferences or product categories and purchasing behaviors.
Researchers appreciate its simplicity and ease of application to categorical data, making it a preferred choice for statistical analysis.
The test provides insights into patterns and associations within categorical data, aiding in the interpretation of relationships.
Its utility extends to various fields, including genetics, market research, quality control, and social sciences, showcasing its broad applicability.
The chi-square test helps assess the conformity of observed data to expected values, enhancing its role in statistical analysis.

Steps to perform Chi-square test

Define the hypotheses:
- Null Hypothesis (H0): There is no significant association between the two categorical variables.
- Alternative Hypothesis (H1): There is a significant association between the two categorical variables.
Create a contingency table that displays the frequency distribution of the two categorical variables.
Find the expected values using the formula:
$E_{ij} = \frac{R_i \times C_j}{N}$ ………..eq(2)
where,
- $R_i$ : Total of row i
- $C_j$ : Total of column j
- N: Total number of observations
Calculate the Chi-Square statistic using equation (1).
Find the degrees of freedom using the formula: $df = (m - 1) \times (n - 1)$ ………..eq(3)
- where, m corresponds to the number of categories in one categorical variable.
- n corresponds to the number of categories in the other categorical variable.
Reject or fail to reject the Null Hypothesis: Compare the calculated chi-square statistic to the critical value from the chi-square distribution table for the chosen significance level (e.g., 0.05) and the degrees of freedom.
- If $\chi^2$ is greater than the critical value, reject the null hypothesis, indicating a significant association between the variables.
- If $\chi^2$ is less than or equal to the critical value, fail to reject the null hypothesis; the data do not provide evidence of a significant association.
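The steps above can be sketched end to end as a single function. This is an illustrative outline rather than a reference implementation: the function name, the returned dictionary keys and the 2×2 table are assumptions, and no continuity correction is applied.

```python
import numpy as np
from scipy.stats import chi2


def chi_square_independence(observed, alpha=0.05):
    """Run the test of independence on a 2-D table of observed counts."""
    observed = np.asarray(observed, dtype=float)
    # Expected values: R_i * C_j / N
    expected = (np.outer(observed.sum(axis=1), observed.sum(axis=0))
                / observed.sum())
    # Chi-square statistic: sum of (O - E)^2 / E
    statistic = ((observed - expected) ** 2 / expected).sum()
    # Degrees of freedom: (m - 1) * (n - 1)
    m, n = observed.shape
    dof = (m - 1) * (n - 1)
    # Decision: compare with the critical value (the p-value is equivalent)
    critical_value = chi2.ppf(1 - alpha, dof)
    p_value = chi2.sf(statistic, dof)
    return {
        "statistic": statistic,
        "dof": dof,
        "critical_value": critical_value,
        "p_value": p_value,
        "reject_h0": statistic > critical_value,
    }


# Hypothetical 2x2 table of counts
result = chi_square_independence([[30, 10],
                                  [15, 25]])
print(result)
```

Comparing the statistic with the critical value and comparing the p-value with the significance level always lead to the same decision; the p-value is often more convenient because it does not require a lookup table.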

Chi-square Test for Feature Selection

Chi-square test is used for categorical features in a dataset. We calculate Chi-square between each feature and the target and select the desired number of features with best Chi-square scores. Features that show significant dependencies with the target variable are considered important for prediction and can be selected for further analysis.

Feature selection, also known as attribute selection, is the process of extracting the most relevant features from a dataset before applying machine learning algorithms, with the aim of improving model performance. A large number of irrelevant features increases training time and the risk of overfitting.
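For categorical features, the scoring described above can be sketched by running a test of independence between each feature and the target and ranking the features by their statistic. The dataset, the column names (`plan`, `region`, `subscribed`) and the counts below are invented for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical dataset: "plan" is built to depend on the target,
# "region" is built to be unrelated to it.
data = pd.DataFrame({
    "plan": ["premium"] * 16 + ["basic"] * 4 + ["premium"] * 4 + ["basic"] * 16,
    "region": ["north", "south"] * 20,
    "subscribed": ["yes"] * 20 + ["no"] * 20,
})

scores = {}
for feature in ["plan", "region"]:
    # Contingency table of the feature against the target
    table = pd.crosstab(data[feature], data["subscribed"])
    statistic, p_value, dof, _ = chi2_contingency(table.to_numpy(),
                                                  correction=False)
    scores[feature] = statistic
    print(f"{feature}: chi2 = {statistic:.2f}, p = {p_value:.4f}")

# Features with larger statistics show stronger dependence on the target
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Here `correction=False` turns off the continuity correction that SciPy applies by default to 2×2 tables, so the values match equation (1) exactly.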

Let’s examine a dataset with a feature, “income level” (low, medium, high), and a target, “subscription status” (subscribed, not subscribed), indicating whether a customer subscribed to a service. The goal is to determine whether income level is relevant for predicting subscription status.

Step 1: Define the hypotheses

Null Hypothesis: There is no significant association between income level and subscription status.

Alternative Hypothesis: There is a significant association between income level and subscription status.

Step 2: Contingency table

	Subscribed	Not subscribed	Row Total
Low Income	20	30	50
Medium Income	40	25	65
High Income	10	15	25
Column Total	70	70	140

Step 3: Now, calculate the expected frequencies.

For example, consider the expected frequency for “Low Income” and “Subscribed”. The row total $R_i$ for “Low Income” is 50, the column total $C_j$ for “Subscribed” is 70, and the total number of observations N is 140, so:

Low Income, Subscribed = $(50 \times 70) \div 140 = 25$ using equation (2)

Similarly, we can find the expected frequencies for the other cells. Both column totals are 70, so each expected value is half of its row total:

	Subscribed	Not Subscribed
Low Income	25	25
Medium Income	32.5	32.5
High Income	12.5	12.5

Step 4: Calculate the Chi-Square Statistic

Let’s summarize the observed and expected values into a table and calculate the Chi-Square value:

	Subscribed (O)	Not Subscribed (O)	Subscribed (E)	Not Subscribed (E)
Low Income	20	30	25	25
Medium Income	40	25	32.5	32.5
High Income	10	15	12.5	12.5

Now, using the formula specified in equation 1, we can get our chi-square statistic in the following manner:

$\chi^2= \frac{(20 - 25)^2}{25} + \frac{(30 - 25)^2}{25} + \frac{(40 - 32.5)^2}{32.5} + \frac{(25 - 32.5)^2}{32.5} + \frac{(10 - 12.5)^2}{12.5} + \frac{(15 - 12.5)^2}{12.5}$

$= 1 + 1 + 1.731 + 1.731 + 0.5 + 0.5\\ = 6.462$

Step 5: Degrees of Freedom

$\text{Degrees of Freedom (df)} = (3 - 1) \times (2 - 1) = 2$ , using equation 3

Step 6: Interpretations

Now, compare the calculated $\chi^2$ value from Step 4 with the critical value from the Chi-Square distribution table, or from any statistical software tool, for 2 degrees of freedom. If the $\chi^2$ value is greater than the critical value, you would reject the null hypothesis. That would suggest a significant association between “income level” and “subscription status,” making “income level” a relevant feature for predicting subscription status.

Let’s find out using python’s scipy library.

Python3

import scipy.stats as stats

# Degrees of freedom and significance level
df = 2
alpha = 0.05

# Critical value: the (1 - alpha) quantile of the Chi-Square distribution
critical_value = stats.chi2.ppf(1 - alpha, df)
print(critical_value)

Output:

5.991464547107979

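Now, the critical value for df = 2 and $α$ = 0.05 is 5.991.

With expected frequencies computed from equation (2) (25, 32.5 and 12.5 in each column), the chi-square statistic for the table above is 6.462. Since this is greater than the critical value, we reject the null hypothesis.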
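The hand calculation can be cross-checked with `scipy.stats.chi2_contingency`, which computes the expected frequencies, the statistic, the degrees of freedom and a p-value in one call. The sketch below uses the observed counts from the contingency table in Step 2; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the contingency table in Step 2
# rows: Low, Medium, High income; columns: Subscribed, Not subscribed
observed = np.array([[20, 30],
                     [40, 25],
                     [10, 15]])

statistic, p_value, dof, expected = chi2_contingency(observed)

print("Expected frequencies:\n", expected)
print("Chi-square statistic:", round(statistic, 3))
print("Degrees of freedom:", dof)
print("p-value:", round(p_value, 4))
```

For this table the function reports a statistic of about 6.46 with 2 degrees of freedom and a p-value of about 0.04, which is below 0.05 and therefore agrees with the critical-value comparison.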

Let’s visualize the chi-square distribution and Critical Region, with python code.

Python3

import numpy as np 
import matplotlib.pyplot as plt 
import scipy.stats as stats 
  
df = 2
alpha = 0.05
critical_value = stats.chi2.ppf(1 - alpha, df) 
# Statistic for the income/subscription table, with E = R_i * C_j / N
calculated_chi_square = 6.462
  
# Generate values for the chi-square distribution 
x = np.linspace(0, 10, 1000) 
y = stats.chi2.pdf(x, df) 
  
plt.plot(x, y, label='Chi-Square Distribution (df=2)') 
plt.fill_between(x, y, where=(x > critical_value), color='red', alpha=0.5, label='Critical Region') 
plt.axvline(calculated_chi_square, color='blue', linestyle='dashed', label='Calculated Chi-Square') 
plt.axvline(critical_value, color='green', linestyle='dashed', label='Critical Value') 
plt.title('Chi-Square Distribution and Critical Region') 
plt.xlabel('Chi-Square Value') 
plt.ylabel('Probability Density Function') 
plt.legend() 
plt.show() 

Output:

Chi-square Distribution

In this example, the green dashed line represents the critical value (5.991) for a significance level of 0.05 with 2 degrees of freedom, the threshold beyond which you would reject the null hypothesis.
The blue dashed line represents the calculated chi-square statistic.
The red shaded area to the right of the critical value represents the rejection region.

If the calculated Chi-Square statistic falls within this shaded area, you would reject the null hypothesis.

The calculated chi-square value for the income table (6.462) falls within the critical region, so we reject the null hypothesis.

Hence, there is a significant association between income level and subscription status at the 5% significance level.

Python Implementation of Chi-Square feature selection

Python3

import pandas as pd 
from sklearn.datasets import load_iris 
from sklearn.feature_selection import SelectKBest 
from sklearn.feature_selection import chi2 
  
# Load the dataset 
iris = load_iris() 
X = iris.data 
y = iris.target 
  
# Converting to DataFrame for better visualization 
column_names = [f'feature_{i}' for i in range(X.shape[1])] 
df = pd.DataFrame(X, columns=column_names) 
df['target'] = y 
  
print("Original Dataset:") 
print(df.head()) 
  
# Applying Chi-Square feature selection and 
# Selecting top k features 
k = 2 
chi2_selector = SelectKBest(chi2, k=k) 
X_new = chi2_selector.fit_transform(X, y) 
  
selected_features = df.columns[:-1][chi2_selector.get_support()] 
print("\nSelected Features:") 
print(selected_features)

Output:

Original Dataset:
   feature_0  feature_1  feature_2  feature_3  target
0        5.1        3.5        1.4        0.2       0
1        4.9        3.0        1.4        0.2       0
2        4.7        3.2        1.3        0.2       0
3        4.6        3.1        1.5        0.2       0
4        5.0        3.6        1.4        0.2       0
Selected Features:
Index(['feature_2', 'feature_3'], dtype='object')

The Chi-Square feature selection suggests that the most informative features for this task are ‘feature_2’ and ‘feature_3’. These features correspond to the petal length and petal width, respectively.
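To see why these two features were selected, the fitted selector's `scores_` and `pvalues_` attributes can be printed for each feature. This is a small sketch using the dataset's own feature names. Note that scikit-learn's `chi2` expects non-negative feature values, and the Iris measurements are continuous rather than categorical counts, so the scores here are best read as a ranking heuristic.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
selector = SelectKBest(chi2, k=2).fit(iris.data, iris.target)

# Chi-square score and p-value of each feature against the target
for name, score, p_value in zip(iris.feature_names,
                                selector.scores_,
                                selector.pvalues_):
    print(f"{name}: chi2 = {score:.2f}, p = {p_value:.3g}")

# Indices of the two highest-scoring features
top_two = sorted(int(i) for i in selector.scores_.argsort()[-2:])
print(top_two)
```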

Conclusion

The Chi-Square test stands as a versatile tool for exploring categorical data associations, offering valuable insights into dependencies between variables. Whether applied for independence or goodness-of-fit, its significance resonates across genetics, market research, and social sciences. Feature selection using Chi-Square enhances model efficiency, exemplified by the Python implementation on Iris dataset features.

Frequently Asked Questions (FAQs)

1. What are the advantages of chi-square test?

Simple calculation and interpretation: Makes it accessible to researchers with varying statistical expertise.

Non-parametric: Does not require assumptions about the underlying distribution of the data.

Versatile: Handles large datasets efficiently and analyzes contingency tables.

Wide range of applications: Used across various fields like social sciences, medicine, marketing, and engineering.

2. What is chi-square test and its purpose?

It is used to compare observed and expected frequencies of categorical variables. It assesses whether two categorical variables are independent or statistically associated.

3. What is the chi-square test in EDA?

It is used to identify potential relationships between categorical variables in a dataset.
It helps in understanding how the data are distributed across categories.

4. What is the application of chi-square distribution?

Testing goodness-of-fit: Determines if a dataset fits a specific theoretical distribution.
Comparing two or more populations: Based on their categorical variables.
Assessing independence between variables: Analyzing contingency tables.

5. What is chi-square value for?

It is used to determine whether to reject the null hypothesis of independence.

6. Is chi-square a correlation test?

No, chi-square is not a correlation test. It tests whether an association exists between categorical variables; it does not measure the strength or direction of a linear relationship.
