Missing data imputation with fancyimpute

A real-world dataset will almost always have some missing data, usually because of how the data was collected. Missing data matters when building a predictive model, because many algorithms do not perform well on datasets with missing values.
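
As a small illustration of why missing values get in the way, the sketch below counts the gaps per column in a toy array and shows that an ordinary mean turns into NaN as soon as a column has a gap. The array and the variable names are made up for illustration; NumPy is assumed to be installed.

```python
import numpy as np

# Toy data: four columns, NaN marks a missing value.
data = np.array([[np.nan, 2, np.nan, 0],
                 [3, 4, np.nan, 1],
                 [np.nan, np.nan, np.nan, 5],
                 [np.nan, 3, np.nan, 4],
                 [5, 7, 8, 2],
                 [2, 5, 7, 9]])

# Number of missing values in each column.
missing_per_column = np.isnan(data).sum(axis=0)
print(missing_per_column)

# A plain mean propagates NaN for every column that has a gap.
plain_mean = data.mean(axis=0)
print(plain_mean)
```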

Fancyimpute

fancyimpute is a library of missing data imputation algorithms. It uses machine learning algorithms to impute missing values, and it draws on all the columns of the dataset rather than only the column being filled. This article covers two ways to impute missing data with fancyimpute:

  1. KNN or K-Nearest Neighbor
  2. MICE or Multiple Imputation by Chained Equations

K-Nearest Neighbor

To fill in a missing value, KNN finds the data points that are most similar to the incomplete one across the available features. It then takes the average of those neighbors' values to fill in the missing value.
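
The neighbor-averaging idea described above can be sketched by hand in plain NumPy. This is only a minimal illustration under stated assumptions, not how fancyimpute implements it: the helper name `knn_impute`, the choice of `k=2`, the mean squared difference over shared columns as the distance, and the unweighted average are all assumptions made for this sketch, and the library may differ in details such as neighbor weighting.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill NaNs with the mean of the k nearest rows (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)
    n_rows = X.shape[0]
    for i in range(n_rows):
        if not missing[i].any():
            continue
        # Distance from row i to every other row, measured only on the
        # columns that are observed in both rows.
        dists = np.full(n_rows, np.inf)
        for j in range(n_rows):
            if j == i:
                continue
            shared = ~missing[i] & ~missing[j]
            if shared.any():
                dists[j] = np.mean((X[i, shared] - X[j, shared]) ** 2)
        order = np.argsort(dists, kind="stable")
        for col in np.where(missing[i])[0]:
            # Keep the closest rows that actually have a value in this column.
            donors = [j for j in order
                      if np.isfinite(dists[j]) and not missing[j, col]][:k]
            if donors:
                filled[i, col] = X[donors, col].mean()
    return filled

X = [[np.nan, 2, np.nan, 0],
     [3, 4, np.nan, 1],
     [np.nan, np.nan, np.nan, 5],
     [np.nan, 3, np.nan, 4],
     [5, 7, 8, 2],
     [2, 5, 7, 9]]

filled = knn_impute(X, k=2)
print(filled)
```

Observed entries are left untouched; only the NaN positions are replaced by a neighbor average.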

import pandas as pd
import numpy as np
# importing the KNN imputer from the fancyimpute library
from fancyimpute import KNN

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4],
                   [5, 7, 8, 2],
                   [2, 5, 7, 9]],
                  columns=list('ABCD'))

# printing the dataframe with missing values
print(df)

# creating the KNN imputer
knn_imputer = KNN()
# imputing the missing values; fit_transform returns a NumPy array
df = knn_imputer.fit_transform(df)

# printing the imputed values
print(df)


Output:

    A    B    C  D
0  NaN  2.0  NaN  0
1  3.0  4.0  NaN  1
2  NaN  NaN  NaN  5
3  NaN  3.0  NaN  4
4  5.0  7.0  8.0  2
5  2.0  5.0  7.0  9
Imputing row 1/6 with 2 missing, elapsed time: 0.001
[[3.23556938 2.         7.75630267 0.]
 [3.         4.         7.825      1.]
 [3.67647071 3.46386587 7.64000033 5.]
 [3.35514006 3.         7.59183674 4.]
 [5.         7.         8.         2.]
 [2.         5.         7.         9.]]

Multiple Imputation by Chained Equations

MICE uses multiple imputation instead of single imputation, which lets it account for the statistical uncertainty in the imputed values. It runs several regressions over the sample data, modelling each incomplete column from the other columns, and averages the results.
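
The chaining idea can be sketched in plain NumPy. This is a simplified illustration under stated assumptions, not the library's algorithm: the helper name `chained_regression_impute` and the `n_iter` parameter are hypothetical, ordinary least squares stands in for whatever regressor the library uses, and the sketch runs a single chain, so it does not produce or average multiple imputations.

```python
import numpy as np

def chained_regression_impute(X, n_iter=10):
    """Single-chain regression imputation (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()
    # Step 1: start by filling every gap with its column mean.
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])
    # Step 2: repeatedly re-estimate each incomplete column from the others.
    for _ in range(n_iter):
        for col in range(X.shape[1]):
            miss = missing[:, col]
            if not miss.any():
                continue
            others = np.delete(filled, col, axis=1)
            design = np.column_stack([np.ones(len(others)), others])
            # Fit on rows where this column was observed ...
            coef, *_ = np.linalg.lstsq(design[~miss], filled[~miss, col],
                                       rcond=None)
            # ... and predict the rows where it was missing.
            filled[miss, col] = design[miss] @ coef
    return filled

X = [[np.nan, 2, np.nan, 0],
     [3, 4, np.nan, 1],
     [np.nan, np.nan, np.nan, 5],
     [np.nan, 3, np.nan, 4],
     [5, 7, 8, 2],
     [2, 5, 7, 9]]

imputed = chained_regression_impute(X)
print(imputed)
```

With so few complete rows the regressions here are poorly determined, so the numbers should be read as a demonstration of the mechanics rather than as good estimates.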

import pandas as pd
import numpy as np
# importing IterativeImputer, the MICE-style imputer, from the fancyimpute library
from fancyimpute import IterativeImputer

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4],
                   [5, 7, 8, 2],
                   [2, 5, 7, 9]],
                  columns=list('ABCD'))

# printing the dataframe with missing values
print(df)

# creating the MICE imputer
mice_imputer = IterativeImputer()
# imputing the missing values; fit_transform returns a NumPy array
df = mice_imputer.fit_transform(df)

# printing the imputed values
print(df)


Output:

    A    B    C   D
0  NaN  2.0  NaN  0
1  3.0  4.0  NaN  1
2  NaN  NaN  NaN  5
3  NaN  3.0  NaN  4
4  5.0  7.0  8.0  2
5  2.0  5.0  7.0  9
[[3.27262261 2.         7.9809332  0.]
 [3.         4.         7.9193547  1.]
 [2.91717117 4.35730239 7.47523962 5.]
 [2.77722048 3.         7.53760743 4.]
 [5.         7.         8.         2.]
 [2.         5.         7.         9.]]


