Chi-square test in Machine Learning

Last Updated : 21 Dec, 2023

Chi-Square test is a statistical method crucial for analyzing associations in categorical data. Its applications span various fields, aiding researchers in understanding relationships between factors. This article elucidates Chi-Square types, steps for implementation, and its role in feature selection, exemplified through Python code on the Iris dataset.

What is Chi-Square test?

The chi-square test is a statistical test used to determine if there is a significant association between two categorical variables. It is a non-parametric test, meaning it makes no assumptions about the distribution of the data. The test is based on the comparison of observed and expected frequencies within a contingency table. The chi-square test helps with feature selection problems by measuring the dependence between each candidate feature and the target variable. It determines whether the association between two categorical variables observed in the sample reflects a real association in the population.

The chi-square test statistic is the sum, over all cells of the contingency table, of the squared difference between observed and expected frequencies divided by the expected frequency:

χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ ………..eq(1)

where,

• Oᵢ is the observed frequency in cell i
• Eᵢ is the expected frequency in cell i, calculated as Eᵢ = (row total × column total) / N

Under the null hypothesis, this statistic approximately follows a chi-square distribution, whose degrees of freedom depend on the dimensions of the table.

Chi-Square Distribution

The chi-square distribution is a continuous probability distribution that arises in statistics and is associated with the sum of the squares of independent standard normal random variables. It is often denoted as χ²(k) and is parameterized by the degrees of freedom k: if Z₁, …, Z_k are independent standard normal random variables, then Q = Z₁² + Z₂² + … + Z_k² follows a χ²(k) distribution.

It is widely used in statistical analysis, particularly in hypothesis testing and calculating confidence intervals. It is often used with non-normally distributed data.
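The definition above can be checked numerically: the sum of squares of k independent standard normal draws should have mean k and variance 2k, matching the χ²(k) distribution. A small simulation sketch (the sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3  # degrees of freedom

# Sum of squares of k independent standard normal variables,
# repeated 100,000 times
samples = (rng.standard_normal((100_000, k)) ** 2).sum(axis=1)

# A chi-square(k) variable has mean k and variance 2k
print(samples.mean())  # close to 3
print(samples.var())   # close to 6
```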

Key terms used in Chi-Square test

• Degrees of freedom (df): the number of values in the analysis that are free to vary
• Observed values (Oᵢⱼ): actual data collected
• Expected values (Eᵢⱼ): counts predicted under the null hypothesis, calculated as Eᵢⱼ = (Rᵢ × Cⱼ) / N
where,
• Rᵢ: total of row i
• Cⱼ: total of column j
• N: total number of observations
• Contingency table: a contingency table, also known as a cross-tabulation or two-way table, is a statistical table that displays the joint distribution of two categorical variables.
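A contingency table can be built directly with pandas. The sketch below uses small hypothetical survey data (the column names and values are illustrative, not from this article); margins=True adds the row/column totals Rᵢ, Cⱼ, and N:

```python
import pandas as pd

# Hypothetical survey responses (illustrative data)
data = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M", "F", "M"],
    "preference": ["A", "B", "A", "A", "B", "B", "A", "A"],
})

# Cross-tabulation with row/column totals (margins)
table = pd.crosstab(data["gender"], data["preference"],
                    margins=True, margins_name="Total")
print(table)
```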

Types of Chi-Square test

There are several types of chi-square tests, each designed to address specific research questions or scenarios. The two main types are the chi-square test for independence and the chi-square goodness-of-fit test.

1. Chi-Square Test for Independence: This test assesses whether there is a significant association or relationship between two categorical variables. It is used to determine whether changes in one variable are independent of changes in another. This test is applied when we have counts of values for two nominal or categorical variables. To conduct this test, two requirements must be met:
independence of observations and a relatively large sample size.
For example, suppose we are interested in exploring whether there is a relationship between online shopping preferences and the payment methods people choose. The first variable is the type of online shopping preference (e.g., Electronics, Clothing, Books), and the second variable is the chosen payment method (e.g., Credit Card, Debit Card, PayPal).
The null hypothesis in this case would be that the choice of online shopping preference and the selected payment method are independent.
2. Chi-Square Goodness-of-Fit Test: The Chi-Square Goodness-of-Fit test is used in statistical hypothesis testing to ascertain whether a variable is likely from a given distribution or not. This test can be applied in situations when we have value counts for categorical variables. With the help of this test, we can determine whether the data values are a representative sample of the entire population or if they fit our hypothesis well.
For example, imagine you are testing the fairness of a six-sided die. The null hypothesis is that each face of the die should have an equal probability of landing face up. In other words, the die is unbiased, and the proportions of each number (1 through 6) occurring are expected to be equal.
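The die-fairness example above maps directly onto SciPy's `scipy.stats.chisquare`, which performs the goodness-of-fit test given observed counts and expected counts (the roll counts below are hypothetical):

```python
from scipy.stats import chisquare

# Hypothetical observed counts for 60 rolls of a die (one entry per face)
observed = [8, 9, 12, 11, 8, 12]  # sums to 60

# Under H0 (fair die), each face is expected 60 / 6 = 10 times
result = chisquare(f_obs=observed, f_exp=[10] * 6)

print(result.statistic)  # 1.8
print(result.pvalue)     # large p-value: no evidence the die is biased
```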

Why do we use the Chi-Square Test?

• The chi-square test is widely used across diverse fields to analyze categorical data, offering valuable insights into associations or differences between categories.
• Its primary application lies in testing the independence of two categorical variables, determining if changes in one variable relate to changes in another.
• It is particularly useful for understanding relationships between factors, such as gender and preferences or product categories and purchasing behaviors.
• Researchers appreciate its simplicity and ease of application to categorical data, making it a preferred choice for statistical analysis.
• The test provides insights into patterns and associations within categorical data, aiding in the interpretation of relationships.
• Its utility extends to various fields, including genetics, market research, quality control, and social sciences, showcasing its broad applicability.
• The chi-square test helps assess the conformity of observed data to expected values, enhancing its role in statistical analysis.

Steps to perform Chi-square test

1. Define
• Null Hypothesis (H0): There is no significant association between the two categorical variables.
• Alternative Hypothesis (H1): There is a significant association between the two categorical variables.
2. Create a contingency table that displays the frequency distribution of the two categorical variables.
3. Find the expected value for each cell using the formula:
Eᵢⱼ = (Rᵢ × Cⱼ) / N ………..eq(2)
where,
• Rᵢ: total of row i
• Cⱼ: total of column j
• N: total number of observations
4. Calculate the Chi-Square statistic using equation (1).
5. Compute the degrees of freedom using the formula:
df = (m − 1) × (n − 1) ………..eq(3)
• where m corresponds to the number of categories in one categorical variable, and
• n corresponds to the number of categories in the other categorical variable.
6. Accept or reject the null hypothesis: compare the calculated chi-square statistic to the critical value from the chi-square distribution table for the chosen significance level (e.g., 0.05).
• If χ² is greater than the critical value, reject the null hypothesis, indicating a significant association between the variables.
• If χ² is less than or equal to the critical value, fail to reject the null hypothesis, suggesting no significant association.
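The steps above can be condensed with SciPy's `chi2_contingency`, which returns the statistic, p-value, degrees of freedom, and expected-frequency table in one call. The 2×2 counts below are hypothetical; `correction=False` disables Yates' continuity correction so the result matches eq. (1) exactly:

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical 2x2 contingency table (e.g., group vs. outcome)
observed = np.array([[30, 10],
                     [15, 25]])

# Steps 3-5: statistic (eq. 1), p-value, degrees of freedom (eq. 3),
# and expected frequencies (eq. 2), all computed by SciPy
stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

# Step 6: compare against the critical value at alpha = 0.05
alpha = 0.05
critical_value = chi2.ppf(1 - alpha, dof)

print(f"chi2 = {stat:.3f}, dof = {dof}, p = {p_value:.4f}")
print("reject H0:", stat > critical_value)
```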

Chi-square Test for Feature Selection

Chi-square test is used for categorical features in a dataset. We calculate Chi-square between each feature and the target and select the desired number of features with best Chi-square scores. Features that show significant dependencies with the target variable are considered important for prediction and can be selected for further analysis.

Feature selection, also known as attribute selection, is the process of extracting the most relevant features from a dataset before applying machine learning algorithms, improving model performance. A large number of irrelevant features increases training time and the risk of overfitting.

Let's examine a dataset with two features: "income level" (low, medium, high) and "subscription status" (subscribed, not subscribed), indicating whether a customer subscribed to a service. The goal is to determine whether "income level" is relevant for predicting subscription status.

Step 1: Null hypothesis: No significant association between features

Alternate Hypothesis: There is a significant association between features.

Step 2: Contingency table

|               | Subscribed | Not Subscribed | Row Total |
|---------------|------------|----------------|-----------|
| Low Income    | 20         | 30             | 50        |
| Medium Income | 40         | 25             | 65        |
| High Income   | 10         | 15             | 25        |
| Column Total  | 70         | 70             | 140       |

Step 3: Now, calculate the expected frequencies.

For example, the expected frequency for "Low Income" and "Subscribed" is computed using equation (2). The row total for Low Income is 50, each column total is 70, and the total number of observations is 140:

E(Low Income, Subscribed) = (50 × 70) / 140 = 25

Similarly, we can find the expected frequencies for the other cells as well:

|               | Subscribed | Not Subscribed |
|---------------|------------|----------------|
| Low Income    | 25         | 25             |
| Medium Income | 32.5       | 32.5           |
| High Income   | 12.5       | 12.5           |

Step 4: Calculate the Chi-Square Statistic

Let's summarize the observed and expected values into a table and calculate the Chi-Square value:

|               | Subscribed (O) | Not Subscribed (O) | Subscribed (E) | Not Subscribed (E) |
|---------------|----------------|--------------------|----------------|--------------------|
| Low Income    | 20             | 30                 | 25             | 25                 |
| Medium Income | 40             | 25                 | 32.5           | 32.5               |
| High Income   | 10             | 15                 | 12.5           | 12.5               |

Now, using the formula specified in equation (1), we get our chi-square statistic:

χ² = (20 − 25)²/25 + (30 − 25)²/25 + (40 − 32.5)²/32.5 + (25 − 32.5)²/32.5 + (10 − 12.5)²/12.5 + (15 − 12.5)²/12.5
= 1 + 1 + 1.731 + 1.731 + 0.5 + 0.5
≈ 6.462
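The same arithmetic can be reproduced from the observed table alone, computing the expected counts from the margins (eq. 2) and then applying eq. (1). A short NumPy sketch:

```python
import numpy as np

# Observed contingency table from the worked example
observed = np.array([[20, 30],
                     [40, 25],
                     [10, 15]], dtype=float)

# eq. (2): E_ij = R_i * C_j / N, via an outer product of the margins
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# eq. (1): sum over all cells of (O - E)^2 / E
chi_square = ((observed - expected) ** 2 / expected).sum()
print(round(chi_square, 3))  # 6.462
```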

Step 5: Degrees of Freedom

df = (3 − 1) × (2 − 1) = 2, using equation (3)

Step 6: Interpretations

Now, you can compare the calculated value (6.462) with the critical value from the Chi-Square distribution table (or any statistical software tool) for 2 degrees of freedom. If the calculated value is greater than the critical value, you reject the null hypothesis, suggesting a significant association between "income level" and "subscription status," which would make "income level" a relevant feature for predicting subscription status.

Let’s find out using Python’s SciPy library.

Python3

import scipy.stats as stats

# Degrees of freedom and significance level
df = 2
alpha = 0.05

# Critical value from the Chi-Square distribution table
critical_value = stats.chi2.ppf(1 - alpha, df)
print(critical_value)

Output:

5.991464547107979

Now, the critical value for df = 2 and α = 0.05 is 5.991.

Since the calculated chi-square value (6.462) is greater than the critical value (5.991), we reject the null hypothesis.

Let’s visualize the chi-square distribution and Critical Region, with python code.

Python3

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

df = 2
alpha = 0.05
critical_value = stats.chi2.ppf(1 - alpha, df)
calculated_chi_square = 6.462

# Generate values for the chi-square distribution
x = np.linspace(0, 10, 1000)
y = stats.chi2.pdf(x, df)

plt.plot(x, y, label='Chi-Square Distribution (df=2)')
plt.fill_between(x, y, where=(x > critical_value), color='red', alpha=0.5, label='Critical Region')
plt.axvline(calculated_chi_square, color='blue', linestyle='dashed', label='Calculated Chi-Square')
plt.axvline(critical_value, color='green', linestyle='dashed', label='Critical Value')
plt.title('Chi-Square Distribution and Critical Region')
plt.xlabel('Chi-Square Value')
plt.ylabel('Probability Density Function')
plt.legend()
plt.show()

Output:

Chi-square Distribution

• In this example, the blue dashed line represents the calculated chi-square statistic (6.462).
• The green dashed line represents the critical value (5.991) for a significance level of 0.05 with 2 degrees of freedom, the threshold beyond which you would reject the null hypothesis.
• The red shaded area to the right of the critical value represents the rejection region.

If the calculated Chi-Square statistic falls within this shaded area, you reject the null hypothesis.

The calculated chi-square value (6.462) falls within the critical region, so we reject the null hypothesis.

Hence, there is a significant association between the two variables, and "income level" is a relevant feature for predicting subscription status.

Python3

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Load the dataset
iris = load_iris()
X = iris.data
y = iris.target

# Converting to DataFrame for better visualization
column_names = [f'feature_{i}' for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=column_names)
df['target'] = y

print("Original Dataset:")
print(df.head())

# Applying Chi-Square feature selection and
# selecting the top k features
k = 2
chi2_selector = SelectKBest(chi2, k=k)
X_new = chi2_selector.fit_transform(X, y)

selected_features = df.columns[:-1][chi2_selector.get_support()]
print("\nSelected Features:")
print(selected_features)

Output:

Original Dataset:
   feature_0  feature_1  feature_2  feature_3  target
0        5.1        3.5        1.4        0.2       0
1        4.9        3.0        1.4        0.2       0
2        4.7        3.2        1.3        0.2       0
3        4.6        3.1        1.5        0.2       0
4        5.0        3.6        1.4        0.2       0

Selected Features:
Index(['feature_2', 'feature_3'], dtype='object')

The Chi-Square feature selection suggests that the most informative features for this task are ‘feature_2’ and ‘feature_3’. These features correspond to the petal length and petal width, respectively.
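A quick way to see why those two features win is to inspect the raw per-feature chi-square scores that SelectKBest ranks by, via the same `chi2` scoring function:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

iris = load_iris()

# chi2 returns one (score, p-value) pair per feature
scores, p_values = chi2(iris.data, iris.target)

for name, score, p in zip(iris.feature_names, scores, p_values):
    print(f"{name}: chi2 = {score:.2f}, p = {p:.3g}")
```

The petal measurements (feature_2 and feature_3) receive by far the highest scores, which is why SelectKBest keeps them.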

Conclusion

The Chi-Square test stands as a versatile tool for exploring categorical data associations, offering valuable insights into dependencies between variables. Whether applied for independence or goodness-of-fit, its significance resonates across genetics, market research, and social sciences. Feature selection using Chi-Square enhances model efficiency, exemplified by the Python implementation on Iris dataset features.

Frequently Asked Questions (FAQs)

1. What are the advantages of the chi-square test?

Simple calculation and interpretation: Makes it accessible to researchers with varying statistical expertise.

Non-parametric: Does not require assumptions about the underlying distribution of the data.

Versatile: Handles large datasets efficiently and analyzes contingency tables.

Wide range of applications: Used across various fields like social sciences, medicine, marketing, and engineering.

2. What is chi-square test and its purpose?

It compares observed and expected frequencies of a categorical variable and assesses whether two categorical variables are independent or statistically associated.

3. What is the chi-square test in EDA?

• It is used to identify potential relationships between categorical variables in a dataset.
• Helps visualize and understand the data distribution.

4. What is the application of chi-square distribution?

• Testing goodness-of-fit: determines if a dataset fits a specific theoretical distribution.
• Comparing two or more populations: based on their categorical variables.
• Assessing independence between variables: analyzing contingency tables.

5. What is the chi-square value used for?

It is used to determine whether to reject the null hypothesis of independence.

6. Is chi-square a correlation test?

No, chi-square is not a correlation test. It measures association, not the strength or direction of a linear relationship.
