Chi-Square Test for Feature Selection – Mathematical Explanation

One of the primary tasks involved in any supervised Machine Learning venture is to select the best features from the given dataset to obtain the best results. One way to select these features is the Chi-Square Test.

Mathematically, a Chi-Square test of independence is performed on two categorical variables to determine whether they are related. Its null hypothesis assumes that the two variables are independent. The test can therefore be used for feature selection by finding the features on which the output class label depends most strongly. For each feature in the dataset, \chi ^{2} is calculated, and the features are then ranked in descending order of their \chi ^{2} values. The higher the value of \chi ^{2}, the stronger the evidence that the output label depends on the feature, and the more important the feature is in determining the output.

Let the feature in question have m attribute values and the output have k class labels. Then the value of \chi ^{2} is given by the following expression:-

\chi ^{2} = \sum _{i=1}^{m} \sum _{j=1}^{k}\frac{(O_{ij}-E_{ij})^{2}}{E_{ij}}

where

O_{ij} – Observed frequency

E_{ij} – Expected frequency
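The double sum above can be sketched in a few lines of Python. This is only an illustration: the function name chi_square and the 2x2 numbers are hypothetical and are not taken from the article's dataset.

```python
def chi_square(observed, expected):
    """Sum (O_ij - E_ij)^2 / E_ij over every cell of two equally shaped tables."""
    total = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for o, e in zip(obs_row, exp_row):
            total += (o - e) ** 2 / e
    return total


# Hypothetical 2x2 tables used only to exercise the function.
observed = [[10, 20], [30, 40]]
expected = [[12, 18], [28, 42]]
score = chi_square(observed, expected)
print(round(score, 4))
```

Each cell contributes its squared deviation scaled by the expected count, so cells that deviate strongly relative to what was expected dominate the score.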

For each feature, a contingency table is created with m rows and k columns. Each cell (i,j) holds the number of rows in the dataset for which the feature takes its i-th value and the class label is j. Thus each cell in this table denotes the observed frequency O_{ij}. To calculate the expected frequency E_{ij} for a cell, first the proportion of rows having the i-th feature value in the total dataset is calculated, and then it is multiplied by the total number of rows having class label j. This is the count that would be expected if the feature and the class label were independent.
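As a rough sketch of this step, the contingency table and the expected frequencies can be built from two parallel columns as follows. The helper names contingency_table and expected_frequencies and the toy columns are assumptions made for illustration only.

```python
from collections import Counter


def contingency_table(feature_values, class_labels):
    """Count how often each (feature value, class label) pair occurs."""
    counts = Counter(zip(feature_values, class_labels))
    row_names = sorted(set(feature_values))
    col_names = sorted(set(class_labels))
    observed = [[counts[(r, c)] for c in col_names] for r in row_names]
    return row_names, col_names, observed


def expected_frequencies(observed):
    """E_ij = (row total i / grand total) * column total j."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r / grand_total * c for c in col_totals] for r in row_totals]


# Hypothetical toy columns, not the article's dataset.
feature = ["a", "a", "b", "b", "b", "a"]
label = ["yes", "no", "yes", "yes", "no", "yes"]

rows, cols, observed = contingency_table(feature, label)
expected = expected_frequencies(observed)
print(rows, cols, observed, expected)
```

Note that the expected frequencies depend only on the row and column totals of the observed table, not on the individual cells.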

Solved Example:

Consider the following table:-

Here the output variable is the column named “PlayTennis” which determines whether tennis was played on the given day given the weather conditions.

The contingency table for the feature “Outlook” is constructed as below:-

Note that the expected value for each cell is given inside the parentheses.

The expected value for the cell (Sunny,Yes) is calculated as \frac{5}{14}\times 9 = 3.21 and similarly for others.
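The same arithmetic can be repeated for every cell with a short script. The observed counts below are the ones that appear in the Outlook calculation of this example; the row labels "Overcast" and "Rain" are assumptions, since only "Sunny" is named in the text.

```python
# Observed (Yes, No) counts per Outlook value, as used in the Outlook calculation.
# "Overcast" and "Rain" are assumed labels; only "Sunny" is named in the text.
outlook_observed = {"Sunny": (2, 3), "Overcast": (4, 0), "Rain": (3, 2)}

total_yes = sum(yes for yes, _ in outlook_observed.values())
total_no = sum(no for _, no in outlook_observed.values())
n = total_yes + total_no

outlook_expected = {}
for value, (yes, no) in outlook_observed.items():
    proportion = (yes + no) / n  # share of rows having this Outlook value
    outlook_expected[value] = (
        round(proportion * total_yes, 2),
        round(proportion * total_no, 2),
    )

for value, (e_yes, e_no) in outlook_expected.items():
    print(value, e_yes, e_no)
```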

The \chi ^{2}_{outlook} value is calculated as below:-

\chi ^{2}_{outlook} = \frac{(2-3.21)^{2}}{3.21}+\frac{(3-1.79)^{2}}{1.79}+\frac{(4-2.57)^{2}}{2.57}+\frac{(0-1.43)^{2}}{1.43}+\frac{(3-3.21)^{2}}{3.21}+\frac{(2-1.79)^{2}}{1.79}

\Rightarrow \chi ^{2}_{outlook} \approx 3.54

(With unrounded expected frequencies the value is 3.547.)

The contingency table for the feature “Wind” is constructed as below:-

The \chi ^{2}_{wind} value is calculated as below:-

\chi ^{2}_{wind} = \frac{(3-3.86)^{2}}{3.86}+\frac{(3-2.14)^{2}}{2.14}+\frac{(6-5.14)^{2}}{5.14}+\frac{(2-2.86)^{2}}{2.86}

\Rightarrow \chi ^{2}_{wind} \approx 0.94

On comparing the two scores, we can conclude that the feature “Outlook” is more important in determining the output than the feature “Wind”, since it has the higher \chi ^{2} score.
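The hand calculations can be checked with a short script. The sketch below recomputes both statistics directly from the observed counts that appear in the two formulas, deriving each expected frequency from the row and column totals rather than from rounded values. The function name chi_square_from_counts and the dictionary layout are assumptions made for illustration.

```python
def chi_square_from_counts(observed):
    """Chi-square statistic of a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    score = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected frequency E_ij
            score += (o - e) ** 2 / e
    return score


# Observed (Yes, No) counts per feature value, read off the formulas above.
tables = {
    "Outlook": [[2, 3], [4, 0], [3, 2]],
    "Wind": [[3, 3], [6, 2]],
}

scores = {name: chi_square_from_counts(table) for name, table in tables.items()}

# Rank the features in descending order of their chi-square score.
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(f"{name}: {scores[name]:.3f}")
```

The script prints each feature with its score, highest first, which is the ordering used for feature selection.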

This article demonstrates how to do feature selection using Chi-Square Test.


