
Imputing Missing Values Before Building an Estimator in Scikit Learn

Last Updated : 31 Jul, 2023

Missing values in a dataset can cause problems when building an estimator. Scikit Learn provides several ways to handle missing data, including imputation. Imputation fills in missing entries with estimated values derived from the other available data in the dataset.

Related concepts:

  • Missing Data: Missing data refers to the absence of values in a dataset. It can occur for several reasons, such as human error, technical error, or data corruption.
  • Imputation: Imputation is the process of filling in missing values with estimates based on the available data.
  • Scikit Learn: Scikit Learn is a popular machine learning library for Python that provides tools for data preprocessing, feature selection, and model building.
  • Estimator: In machine learning, an estimator is an algorithm or model that learns from the data and is used to make predictions on new data.
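
To make the idea of imputation concrete, here is a minimal sketch that fills missing entries with column means using only NumPy. The array values are hypothetical and chosen for illustration; this is one simple way to impute, not the only one.

```python
import numpy as np

# Hypothetical toy data; np.nan marks a missing entry
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])

# Column means computed while ignoring the NaN entries
col_means = np.nanmean(X, axis=0)

# Replace each NaN with the mean of its own column
X_filled = np.where(np.isnan(X), col_means, X)

print("Column means :", col_means)
print("Filled data :\n", X_filled)
```

Here `np.where` broadcasts the per-column means across the rows, so each missing entry receives the mean of the column it belongs to.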

Steps needed:

The following steps are required for imputing missing values before building an estimator in Scikit Learn:

  1. Import the required libraries: Import the required libraries, including Scikit Learn and NumPy.
  2. Load the dataset: Load the dataset that contains missing values.
  3. Identify missing values: Locate the missing values in the dataset.
  4. Impute missing values: Use Scikit Learn’s SimpleImputer class to fill in the missing values.
  5. Build the estimator: Fit an estimator on the imputed data. This article uses linear regression.
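
Steps 3 and 4 can be sketched in isolation. The snippet below counts the missing entries per column and then compares the fill value that SimpleImputer learns under three of its strategies. The single-column data, including the outlier, is hypothetical and only meant to show how the strategies differ.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one feature with a missing entry and an outlier
X = np.array([[1.0], [2.0], [2.0], [np.nan], [100.0]])

# Step 3: count the missing entries in each column
missing_per_column = np.isnan(X).sum(axis=0)
print("Missing per column :", missing_per_column)

# Step 4: fit one imputer per strategy and record the learned fill value
fill_values = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    imputer.fit(X)
    # statistics_ holds one fill value per column
    fill_values[strategy] = imputer.statistics_[0]
    print(strategy, "->", fill_values[strategy])
```

The mean is pulled toward the outlier, while the median and the most frequent value are not, so the choice of strategy is worth checking against how the data is distributed.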

Examples 

Let’s consider an example of a dataset containing missing values. The following code imputes missing values in the dataset using Scikit Learn’s SimpleImputer class:

Python




# Import the required libraries
from sklearn.impute import SimpleImputer
import numpy as np
 
# Load the dataset
X = np.array([[1, 2, np.nan],
              [3, np.nan, 4],
              [5, 6, np.nan],
              [7, 8, 9]])
Y = np.array([14, 20, 29, 40])
 
# Identify missing values
print('Check Null values \n',np.isnan(X))
 
# Impute missing values
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
 
# Build the estimator
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_imputed, Y)
 
print('\nCoefficient :',regressor.coef_)
print('Intercept :',regressor.intercept_)
 
# Prediction on the training data
Y_pred = regressor.predict(X_imputed)
print("Prediction :",Y_pred )


Output :

Check Null values 
 [[False False  True]
 [False  True False]
 [False False  True]
 [False False False]]

Coefficient : [2.25 1.5  1.4 ]
Intercept : -0.3499999999999943
Prediction : [14. 20. 29. 40.]

In the above example, we first loaded a dataset containing missing values. We then identified the missing values using NumPy’s isnan function and used Scikit Learn’s SimpleImputer class to replace each one with the mean of its column. Finally, we built a linear regression estimator on the imputed dataset. Because the model has four parameters (three coefficients and an intercept) and only four training samples, it fits the training data exactly, which is why the predictions reproduce Y.
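
The imputation and regression steps can also be chained with Scikit Learn’s `make_pipeline`, so that the fill values learned during fitting are reused automatically on new data. The sketch below reuses the arrays from the example; `X_new` is a hypothetical unseen row added for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Same data as in the example above
X = np.array([[1, 2, np.nan],
              [3, np.nan, 4],
              [5, 6, np.nan],
              [7, 8, 9]])
Y = np.array([14, 20, 29, 40])

# The pipeline imputes first, then fits the regression on the imputed data
model = make_pipeline(SimpleImputer(strategy='mean'), LinearRegression())
model.fit(X, Y)

# Column means learned during fit; these replace the NaN entries
fill_values = model.named_steps['simpleimputer'].statistics_
print('Fill values :', fill_values)

# X_new is a hypothetical unseen row with a missing second feature
X_new = np.array([[2, np.nan, 5]])
Y_new = model.predict(X_new)
print('Prediction :', Y_new)
```

With this arrangement the imputer is fitted only on the data passed to `fit`, which helps keep statistics from test data out of the training step when the pipeline is used with a train/test split or cross-validation.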


