
Imputing Missing Values Before Building an Estimator in Scikit Learn

Missing values in a dataset can cause problems when building an estimator: many Scikit Learn estimators, including LinearRegression, raise an error when the input contains NaN. Scikit Learn provides several ways to handle missing data, one of which is imputation. Imputing means filling in each missing entry with an estimate derived from the other available data, for example the mean of the observed values in the same column.
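To make the idea of imputation concrete, here is a minimal sketch that fills one missing entry by hand using only NumPy. The array `ages` and its values are hypothetical and chosen only for illustration; this is one simple way to compute a mean-based estimate, not necessarily how any particular library does it internally.

```python
import numpy as np

# Hypothetical feature column with one missing entry (np.nan)
ages = np.array([20.0, np.nan, 30.0, 40.0])

# Estimate the missing value from the observed entries only
observed_mean = np.nanmean(ages)  # mean that ignores NaN entries

# Replace NaN entries with the estimate, keep observed entries as they are
ages_filled = np.where(np.isnan(ages), observed_mean, ages)

print(ages_filled)
```

The observed values 20, 30 and 40 average to 30, so the missing entry is replaced by 30 while the other entries stay unchanged.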


Steps needed:

The following steps are required for imputing missing values before building an estimator in Scikit Learn:



  1. Import the required libraries: First, import the required libraries, including Scikit Learn and NumPy.
  2. Load the dataset: Load the dataset that contains missing values.
  3. Identify missing values: Find out which entries in the dataset are missing.
  4. Impute missing values: Use a Scikit Learn imputer class, such as SimpleImputer, to fill in the missing values.
  5. Build the estimator: Fit an estimator on the imputed data. Here we use the linear regression algorithm.
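Step 3 is often more useful when the missing entries are summarized rather than printed as a full mask. The following is a small sketch under the assumption that missing values are encoded as `np.nan`; the array `data` is hypothetical.

```python
import numpy as np

# Hypothetical dataset: 4 samples, 3 features, NaN marks a missing entry
data = np.array([[1, 2, np.nan],
                 [3, np.nan, 4],
                 [5, 6, np.nan],
                 [7, 8, 9]])

# Boolean mask: True where an entry is missing
mask = np.isnan(data)

# Number of missing entries in each column (feature)
missing_per_column = mask.sum(axis=0)

# Which rows (samples) contain at least one missing entry
rows_with_missing = mask.any(axis=1)

print("Missing per column:", missing_per_column)
print("Rows with missing values:", rows_with_missing)
```

The per-column counts show which features need imputation, and the per-row flags show how many samples would be lost if incomplete rows were simply dropped instead.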

Examples 

Let’s consider an example of a dataset containing missing values. The following code imputes missing values in the dataset using Scikit Learn’s SimpleImputer class:




# Import the required libraries
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Load the dataset (np.nan marks the missing entries)
X = np.array([[1, 2, np.nan],
              [3, np.nan, 4],
              [5, 6, np.nan],
              [7, 8, 9]])
Y = np.array([14, 20, 29, 40])

# Identify missing values (True marks a missing entry)
print('Check Null values \n', np.isnan(X))

# Impute missing values with the mean of each column
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# Build the estimator
regressor = LinearRegression()
regressor.fit(X_imputed, Y)

print('\nCoefficient :', regressor.coef_)
print('Intercept :', regressor.intercept_)

# Prediction on the training data
# (equivalent to regressor.predict(X_imputed))
Y_pred = X_imputed @ regressor.coef_ + regressor.intercept_
print("Prediction :", Y_pred)

Output :

Check Null values 
 [[False False  True]
 [False  True False]
 [False False  True]
 [False False False]]

Coefficient : [2.25 1.5  1.4 ]
Intercept : -0.3499999999999943
Prediction : [14. 20. 29. 40.]

In the above example, we first loaded a dataset containing missing values. We then identified the missing entries using NumPy's isnan function. Next, we used Scikit Learn's SimpleImputer class to replace each missing entry with the mean of its column. Finally, we built a linear regression estimator on the imputed dataset. Because the model has four parameters (three coefficients and an intercept) and only four training samples, it reproduces the training targets exactly.
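The same steps can also be chained so that the imputer and the regressor are fitted together and new samples are imputed with the column means learned during training. The sketch below uses `make_pipeline` from `sklearn.pipeline`, which is not part of the example above; the new sample `X_new` is hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Same training data as in the example above
X = np.array([[1, 2, np.nan],
              [3, np.nan, 4],
              [5, 6, np.nan],
              [7, 8, 9]])
Y = np.array([14, 20, 29, 40])

# Chain imputation and regression into a single estimator
model = make_pipeline(SimpleImputer(strategy='mean'), LinearRegression())
model.fit(X, Y)

# Column means learned by the imputer during fit
# (make_pipeline names each step after its lowercased class name)
column_means = model.named_steps['simpleimputer'].statistics_
print("Learned column means:", column_means)

# Hypothetical new sample whose second feature is missing;
# the pipeline fills it with the training mean before predicting
X_new = np.array([[2, np.nan, 5]])
pred_new = model.predict(X_new)
print("Prediction for new sample:", pred_new)
```

Fitting the imputer inside the pipeline means its statistics come only from the data passed to `fit`, which helps keep information from validation or test data out of the imputed values.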

