
Python – Pearson’s Chi-Square Test


In this article, we will perform Pearson’s Chi-Square test first by hand, using the underlying formulas, and then using Python’s SciPy module. It is an important statistical test in data science for selecting categorical columns. In data science projects, we generally keep only those columns that are important and are not strongly related to each other.

Pearson’s Chi-Square

Pearson’s Chi-Square is a statistical hypothesis test for independence between categorical variables.

Let us define a few terms before we work through the chi-square test.

The Contingency Table 

The Contingency table (also called crosstab) is used in statistics to summarise the relationship between several categorical variables. Here, we are taking a table that shows the number of men and women buying different types of pets.

         dog    cat    bird   total
men      207    282    241    730
women    234    242    232    708
total    441    524    473    1438
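As a minimal sketch, the same contingency table can be held in a NumPy array and its totals recomputed. The variable names (`observed`, `row_totals`, `col_totals`, `grand_total`) are illustrative choices, not part of the original article.

```python
import numpy as np

# Observed counts from the table above.
# Rows: men, women. Columns: dog, cat, bird.
observed = np.array([[207, 282, 241],
                     [234, 242, 232]])

row_totals = observed.sum(axis=1)   # one total per gender
col_totals = observed.sum(axis=0)   # one total per pet type
grand_total = observed.sum()        # total number of buyers

print("row totals:", row_totals)
print("column totals:", col_totals)
print("grand total:", grand_total)
```

The totals printed here should match the "total" row and column of the table.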

Null Hypothesis 

A null hypothesis is a general statistical statement or assumption about a population parameter that is assumed to be true until we have sufficient evidence to reject it.

It is generally denoted by H0.

Alternate Hypothesis

The Alternate Hypothesis is the statement that competes with the null hypothesis. It is generally denoted by H1. The goal of hypothesis testing is to decide whether the data provide enough evidence to reject the null hypothesis in favour of the alternate hypothesis.

 P-Value 

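A P-value is used as a measure of evidence against the null hypothesis: it is the probability, assuming the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed. If it is greater than our level of significance, we fail to reject the null hypothesis (informally, we "accept" it).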

Chi-Square Mathematical Approach

The aim of this chi-square test is to conclude whether the two variables (gender and choice of pet) are related to each other or not.

Null hypothesis: We start by defining our null hypothesis (H0) which states that there is no relation between the variables. 

Alternate hypothesis: It would state that there is a significant relationship between the two variables. 

We will verify our hypothesis using these methods:

Using p-value:

We will choose a significance level to determine whether the relation between the variables is of considerable significance. Generally, a significance level or alpha value of 0.05 is chosen. This alpha value denotes the probability of erroneously rejecting H0 when it is true. A lower alpha value is chosen when we want a smaller risk of such a false rejection. If the p-value for the test comes out to be strictly greater than the alpha value, we fail to reject H0; otherwise, we reject it.

Using chi-square value:

If our calculated value of chi-square is less than or equal to the tabular (also called critical) value of chi-square, we fail to reject H0; otherwise, we reject it.

Expected Values Table  

Next, we prepare a similar table of calculated(or expected) values. To do this we need to calculate each item in the new table as:

\frac{\text{row total} \times \text{column total}}{\text{grand total}}

The expected values table :

         dog             cat             bird            total
men      223.87343533    266.00834492    240.11821975    730
women    217.12656467    257.99165508    232.88178025    708
total    441             524             473             1438
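The expected values can be reproduced in a few lines. This is a sketch that assumes NumPy is available. It uses `np.outer` to multiply every row total by every column total, which is one convenient way to apply the formula to all cells at once.

```python
import numpy as np

# Observed counts (rows: men, women; columns: dog, cat, bird)
observed = np.array([[207, 282, 241],
                     [234, 242, 232]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()

# expected[i, j] = row_totals[i] * col_totals[j] / grand_total
expected = np.outer(row_totals, col_totals) / grand_total

print(expected)
```

Each row and column of `expected` sums to the same totals as the observed table, which is a quick sanity check on the calculation.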

Chi-Square Table: We prepare this table by calculating, for each cell, the value given by this formula.

\frac{(\text{observed value} - \text{calculated value})^2}{\text{calculated value}}

The chi-square table: 

observed (o)    calculated (c)    (o-c)^2 / c
207             223.87343533      1.2717579435607573
282             266.00834492      0.9613722161954465
241             240.11821975      0.003238139990850831
234             217.12656467      1.3112758457617977
242             257.99165508      0.991245364156322
232             232.88178025      0.0033387601600580606
Total                             4.542228269825232

From this table, we obtain the total of the last column, which gives us the calculated value of chi-square.  Here the calculated value of chi-square is 4.542228269825232
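The chi-square table and its total can be recomputed as follows. This is a sketch assuming NumPy. The names `contributions` and `chi_square` are illustrative.

```python
import numpy as np

# Observed counts (rows: men, women; columns: dog, cat, bird)
observed = np.array([[207, 282, 241],
                     [234, 242, 232]])

# Expected counts: row total * column total / grand total
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Per-cell value (o - c)^2 / c, then the sum over all six cells
contributions = (observed - expected) ** 2 / expected
chi_square = contributions.sum()

print(contributions)
print("calculated chi-square:", chi_square)
```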

Now, we need to find the critical value of the chi-square distribution. We can obtain this from the chi-square distribution table. To use this table, we need to know the degrees of freedom for the dataset.  

The degrees of freedom is defined as : (no. of rows – 1) * (no. of columns – 1). 

Hence, the degrees of freedom is (2-1) * (3-1) = 2 

Now, let us look at the table and find the value corresponding to 2 degrees of freedom and a 0.05 significance level.

chi-square distribution table

The tabular or critical value of chi-square here is  5.991 
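Instead of reading a printed table, the critical value can also be looked up programmatically. The sketch below assumes SciPy is installed and uses `scipy.stats.chi2.ppf`, the inverse of the cumulative distribution function, evaluated at 1 - alpha.

```python
from scipy.stats import chi2

n_rows, n_cols = 2, 3                 # men/women by dog/cat/bird
dof = (n_rows - 1) * (n_cols - 1)     # degrees of freedom
alpha = 0.05                          # significance level

# Value that a chi-square variable with `dof` degrees of freedom
# exceeds with probability alpha
critical_value = chi2.ppf(1 - alpha, dof)

print("degrees of freedom:", dof)
print("critical value:", round(critical_value, 3))
```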

Hence

\text{critical value of } \chi^2 \geq \text{calculated value of } \chi^2

So here, we fail to reject our null hypothesis H0; that is, the data do not show a significant relation between our variables.
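The p-value method described earlier can be applied to the hand-calculated statistic as well. This is a sketch assuming SciPy. `chi2.sf` is the survival function, that is, the probability of a value at least as large as the one observed.

```python
from scipy.stats import chi2

chi_square = 4.542228269825232   # calculated value from the chi-square table
dof = 2                          # (2 - 1) * (3 - 1)
alpha = 0.05                     # significance level

# Probability of a chi-square value at least this large if H0 is true
p_value = chi2.sf(chi_square, dof)

print("p-value:", p_value)
if p_value <= alpha:
    print("Reject H0")
else:
    print("Fail to reject H0")
```

Both methods lead to the same decision, because the p-value is at most alpha exactly when the calculated statistic is at least the critical value.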

Next, let us see how to perform this chi-square test in Python. 

Performing the test using Python (scipy.stats):

SciPy is an Open Source Python library, which is used in mathematics, engineering, scientific and technical computing.  

Installation: To install scipy in our notebook, we will use this command.

pip install scipy

The chi2_contingency() function of the scipy.stats module takes the contingency table as a 2D array and returns the test statistic, the p-value, the degrees of freedom, and the expected-values table (the one we created from the calculated values), in that order. Here, we need to compare the obtained p-value with an alpha value of 0.05.

python3

from scipy.stats import chi2_contingency
 
# defining the table
data = [[207, 282, 241], [234, 242, 232]]
stat, p, dof, expected = chi2_contingency(data)
 
# interpret p-value
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (H0 holds true)')

                    

Output : 

p value is 0.1031971404730939
Independent (H0 holds true)

Since the p-value (about 0.103) is greater than alpha (0.05), we fail to reject H0, which shows that our variables do not have a significant relation.
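As a cross-check, the other values returned by chi2_contingency() can be compared with the hand calculation. This sketch reuses the same data. For this table the degrees of freedom is 2, so SciPy's default continuity correction, which only applies when the degrees of freedom is 1, does not change the statistic.

```python
from scipy.stats import chi2_contingency

# Same contingency table as before (rows: men, women; columns: dog, cat, bird)
data = [[207, 282, 241], [234, 242, 232]]
stat, p, dof, expected = chi2_contingency(data)

print("test statistic:", stat)        # compare with the chi-square table total
print("degrees of freedom:", dof)     # compare with (2 - 1) * (3 - 1)
print("expected values:")
print(expected)                       # compare with the expected values table
```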



Last Updated : 26 Apr, 2023