# Python – Pearson’s Chi-Square Test

The **Pearsonâ€™s Chi-Square** statistical hypothesis is a test for independence between categorical variables. In this article, we will perform the test using a mathematical approach and then using Python’s **SciPy** module.

First, let us see the mathematical approach :

**The Contingency Table :**

A Contingency table (also called crosstab) is used in statistics to summarise the relationship between several categorical variables. Here, we take a table that shows the number of men and women buying different types of pets.

dog | cat | bird | total | |

men | 207 | 282 | 241 | 730 |

women | 234 | 242 | 232 | 708 |

total | 441 | 524 | 473 | 1438 |

The **aim** of the test is to conclude whether the two variables( gender and choice of pet ) are related to each other.

**Null hypothesis:**

We start by defining the **null** hypothesis (**H0**) which states that there is *no relation* between the variables. An **alternate** hypothesis would state that there is a *significant relation* between the two.

We can verify the hypothesis by these methods:

- Using
**p-value**:

We define a **significance factor** to determine whether the relation between the variables is of considerable significance. Generally a significance factor or **alpha value** of **0.05** is chosen. This *alpha value* denotes the probability of erroneously rejecting **H0** when it is true. A lower *alpha value* is chosen in cases where we expect more precision. If the **p-value** for the test comes out to be strictly greater than the alpha value, then H0 holds true.

- Using
**chi-square**value:

If our calculated value of chi-square is less or equal to the tabular(also called **critical**) value of chi-square, then **H0** holds true.

**Expected Values Table :**

Next, we prepare a similar table of calculated(or expected) values. To do this we need to calculate each item in the new table as :

The expected values table :

dog | cat | bird | total | |

men | 223.87343533 | 266.00834492 | 240.11821975 | 730 |

women | 217.12656467 | 257.99165508 | 232.88178025 | 708 |

total | 441 | 524 | 473 | 1438 |

**Chi-Square Table :**

We prepare this table by calculating for each item the following:

The chi-square table:

observed (o) | calculated (c) | (o-c)^2 / c | |

207 | 223.87343533 | 1.2717579435607573 | |

282 | 266.00834492 | 0.9613722161954465 | |

241 | 240.11821975 | 0.003238139990850831 | |

234 | 217.12656467 | 1.3112758457617977 | |

242 | 257.99165508 | 0.991245364156322 | |

232 | 232.88178025 | 0.0033387601600580606 | |

Total | 4.542228269825232 |

From this table, we obtain the total of the last column, which gives us the calculated value of chi-square. Hence the calculated value of chi-square is **4.542228269825232**

Now, we need to find the **critical** value of chi-square. We can obtain this from a table. To use this table, we need to know the **degrees of freedom** for the dataset. The degrees of freedom is defined as : **(no. of rows – 1) * (no. of columns – 1).**

Hence, the degrees of freedom is **(2-1) * (3-1) = 2**

Now, let us look at the table and find the value corresponding to **2 **degrees of freedom and **0.05** significance factor :

The tabular or critical value of chi-square here is

**5.991**

Hence,

Therefore,

**H0**is

**accepted**, that is, the variables

*have a significant relation.*

**do not**Next, let us see how to perform the test in Python.

**Performing the test using Python (scipy.stats) :**

**SciPy** is an Open Source Python library, which is used in mathematics, engineering, scientific and technical computing.

**Installation:**

pip install scipy

The **chi2_contingency()** function of **scipy.stats** module takes as input, the contingency table in 2d array format. It returns a tuple containing *test statistics*, the * p-value*,

*degrees of freedom*and

*expected table*(the one we created from the calculated values) in that order.

Hence, we need to compare the obtained **p-value** with **alpha** value of 0.05.

`from` `scipy.stats ` `import` `chi2_contingency` ` ` `# defining the table` `data ` `=` `[[` `207` `, ` `282` `, ` `241` `], [` `234` `, ` `242` `, ` `232` `]]` `stat, p, dof, expected ` `=` `chi2_contingency(data)` ` ` `# interpret p-value` `alpha ` `=` `0.05` `print` `(` `"p value is "` `+` `str` `(p))` `if` `p <` `=` `alpha:` ` ` `print` `(` `'Dependent (reject H0)'` `)` `else` `:` ` ` `print` `(` `'Independent (H0 holds true)'` `)` |

**Output : **

p value is 0.1031971404730939 Independent (H0 holds true)

Since,

p-value > alpha

Therefore, we **accept** **H0, **that is, the variables * do not* have a significant relation.

Attention reader! Don’t stop learning now. Get hold of all the important Machine Learning Concepts with the **Machine Learning Foundation Course** at a student-friendly price and become industry ready.