Pandas – Filling NaN in Categorical data
Real-world data is full of missing values. In order to work on them, we need to impute these missing values and draw meaningful conclusions from them. In this article, we will discuss how to fill NaN values in Categorical Data. In the case of categorical features, we cannot use statistical imputation methods.
Let’s first create a sample dataset to understand methods of filling missing values:
To fill missing values in Categorical features, we can follow either of the approaches mentioned below –
Method 1: Filling with most occurring class
One approach to fill these missing values can be to replace them with the most common or occurring class. We can do this by taking the index of the most common class which can be determined by using value_counts() method. Let’s see the example of how it works:
Method 2: Filling with unknown class
At times, the missing information is valuable itself, and to impute it with the most common class won’t be appropriate. In such a case, we can replace them with a value like “Unknown” or “Missing” using the fillna() method. Let’s look at an example of this –
Method 3: Using Categorical Imputer of sklearn-pandas library
We have scikit learn imputer, but it works only for numerical data. So we have sklearn_pandas with the transformer equivalent to that, which can work with string data. It replaces missing values with the most frequent ones in that column. Let’s see an example of replacing NaN values of “Color” column –
Attention geek! Strengthen your foundations with the Python Programming Foundation Course and learn the basics.
To begin with, your interview preparations Enhance your Data Structures concepts with the Python DS Course. And to begin with your Machine Learning Journey, join the Machine Learning – Basic Level Course