Feature selection, also known as attribute selection, is the process of extracting the most relevant features from a dataset before applying machine learning algorithms, so that the model performs better. A large number of irrelevant features increases training time and raises the risk of overfitting.
Chi-square Test for Feature Selection:
The chi-square test is used for categorical features in a dataset. We calculate the chi-square statistic between each feature and the target, then select the desired number of features with the best chi-square scores. The test determines whether the association between two categorical variables observed in the sample reflects a real association in the population.
The chi-square score is given by:

    chi2 = sum over all classes of (Observed - Expected)^2 / Expected

where:
Observed frequency = number of observations of a class
Expected frequency = number of observations of a class expected if there were no relationship between the feature and the target
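The formula above can be checked by hand on a small contingency table. The sketch below uses made-up counts for a hypothetical two-valued feature and a two-class target; the expected frequencies are derived from the row and column totals under the assumption of independence:

```python
import numpy as np

# Hypothetical contingency table: rows = feature values (A, B),
# columns = target classes. The counts are invented for illustration.
observed = np.array([[20, 10],
                     [10, 20]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
total = observed.sum()

# Expected counts if the feature and the target were independent
expected = row_totals * col_totals / total

# Chi-square score: sum of (O - E)^2 / E over all cells
chi2_score = ((observed - expected) ** 2 / expected).sum()
print(round(chi2_score, 2))
```

A larger score means the observed counts deviate more from what independence would predict, i.e. the feature is more strongly associated with the target.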
Python Implementation of Chi-Square feature selection:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Load the iris dataset
iris_dataset = load_iris()

X = iris_dataset.data
y = iris_dataset.target

# Convert the features to integers (chi2 requires non-negative values)
X = X.astype(int)

# Select the two features with the highest chi-square scores
chi2_features = SelectKBest(chi2, k=2)
X_kbest_features = chi2_features.fit_transform(X, y)

print('Original feature number:', X.shape[1])
print('Reduced feature number:', X_kbest_features.shape[1])
Output:
Original feature number: 4
Reduced feature number: 2
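To see which features survived the selection rather than just how many, the fitted selector's get_support() method returns a boolean mask over the original columns. A short sketch, reusing the same iris setup as above:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris_dataset = load_iris()
X = iris_dataset.data
y = iris_dataset.target

# Fit the selector to score every feature against the target
selector = SelectKBest(chi2, k=2)
selector.fit(X, y)

# get_support() marks the k best-scoring columns with True
mask = selector.get_support()
selected = [name for name, keep in zip(iris_dataset.feature_names, mask)
            if keep]
print(selected)
```

On the iris data, the two petal measurements obtain the highest chi-square scores, so they are the columns retained.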