Missing data imputation with fancyimpute
Last Updated: 01 Aug, 2020
In a real-world dataset, some data will almost always be missing, usually because of how the data was collected. Missing data plays an important role when creating a predictive model, because some algorithms do not perform well on datasets with missing values.
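Before imputing anything, it usually helps to measure how much data is actually missing. The snippet below is a minimal sketch using pandas; the column names and values are made up for illustration and are not part of the original example.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the column names and values are hypothetical.
df = pd.DataFrame(
    {
        "age": [25, np.nan, 31, 47],
        "income": [50000, 62000, np.nan, np.nan],
        "city": ["Pune", "Delhi", None, "Mumbai"],
    }
)

# Number of missing entries in each column.
missing_per_column = df.isna().sum()

# Fraction of rows that contain at least one missing value.
rows_with_missing = df.isna().any(axis=1).mean()

print(missing_per_column)
print(f"Rows with at least one missing value: {rows_with_missing:.0%}")
```

Columns with only a few gaps are good candidates for imputation, while a column that is mostly empty may carry too little information to fill in reliably.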
Fancyimpute
fancyimpute is a library of missing data imputation algorithms. It uses machine learning algorithms to impute missing values, drawing on all the columns of the dataset rather than only the column that contains the gap. Two commonly used ways to impute missing data with fancyimpute are:
- KNN or K-Nearest Neighbor
- MICE or Multiple Imputation by Chained Equations
K-Nearest Neighbor
To fill in the missing values, KNN finds the data points that are most similar to the incomplete row, comparing them across all the features. It then takes the average of those neighboring points to fill in the missing values.
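The idea can be sketched in plain NumPy. This is a simplified illustration, not fancyimpute's implementation: the function name `knn_impute` is hypothetical, the neighbors are averaged without weights, and the distance between two rows is assumed to be the mean squared difference over the features both rows have observed. The library's own algorithm may differ in these details, for example in how neighbors are weighted.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill NaNs with the mean of the k nearest donor rows (simplified sketch)."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    observed = ~np.isnan(X)
    n_rows, _ = X.shape
    for i in range(n_rows):
        for j in np.where(~observed[i])[0]:
            candidates = []
            for r in range(n_rows):
                # A donor row must differ from row i and have column j observed.
                if r == i or not observed[r, j]:
                    continue
                shared = observed[i] & observed[r]
                if not shared.any():
                    continue
                # Mean squared difference over the features both rows have.
                dist = np.mean((X[i, shared] - X[r, shared]) ** 2)
                candidates.append((dist, X[r, j]))
            if candidates:
                candidates.sort(key=lambda pair: pair[0])
                nearest = [value for _, value in candidates[:k]]
                filled[i, j] = np.mean(nearest)
            # If no donor row exists, the entry is left as NaN.
    return filled

X = np.array([
    [1.0, 2.0, np.nan],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 2.8],
    [8.0, 9.0, 10.0],
])
result = knn_impute(X, k=2)
print(result)
```

In this toy array, the first row is much closer to the second and third rows than to the fourth, so its missing value is filled with the average of 3.0 and 2.8 rather than being pulled toward 10.0.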
Python3
import pandas as pd
import numpy as np
from fancyimpute import KNN

# Example data with missing values (NaN) in columns A, B and C
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4],
                   [5, 7, 8, 2],
                   [2, 5, 7, 9]],
                  columns=list('ABCD'))
print(df)

# Impute the missing values; fit_transform returns a NumPy array
knn_imputer = KNN()
df = knn_imputer.fit_transform(df)
print(df)
Output:
     A    B    C  D
0  NaN  2.0  NaN  0
1  3.0  4.0  NaN  1
2  NaN  NaN  NaN  5
3  NaN  3.0  NaN  4
4  5.0  7.0  8.0  2
5  2.0  5.0  7.0  9
Imputing row 1/6 with 2 missing, elapsed time: 0.001
[[3.23556938 2. 7.75630267 0.]
[3. 4. 7.825 1.]
[3.67647071 3.46386587 7.64000033 5.]
[3.35514006 3. 7.59183674 4.]
[5. 7. 8. 2.]
[2. 5. 7. 9.]]
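As the output above shows, the imputed result is a plain NumPy array, so the column labels of the original DataFrame are lost. A minimal sketch of restoring them follows. To keep it runnable without fancyimpute installed, the array is a stand-in copied from the output above, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

original = pd.DataFrame(
    [[np.nan, 2, np.nan, 0],
     [3, 4, np.nan, 1],
     [np.nan, np.nan, np.nan, 5],
     [np.nan, 3, np.nan, 4],
     [5, 7, 8, 2],
     [2, 5, 7, 9]],
    columns=list("ABCD"),
)

# Stand-in for the array returned by fit_transform
# (values copied from the printed output above).
imputed_array = np.array([
    [3.23556938, 2.0, 7.75630267, 0.0],
    [3.0, 4.0, 7.825, 1.0],
    [3.67647071, 3.46386587, 7.64000033, 5.0],
    [3.35514006, 3.0, 7.59183674, 4.0],
    [5.0, 7.0, 8.0, 2.0],
    [2.0, 5.0, 7.0, 9.0],
])

# Rebuild a DataFrame that reuses the original column names and index.
imputed_df = pd.DataFrame(
    imputed_array, columns=original.columns, index=original.index
)

# Sanity check: entries that were observed should not have changed.
mask = original.notna().to_numpy()
assert np.allclose(imputed_df.to_numpy()[mask], original.to_numpy()[mask])

print(imputed_df)
```

Keeping the imputed data in a separate variable, rather than overwriting the original DataFrame, also makes it easy to compare the values before and after imputation.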
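Multiple Imputation by Chained Equations
MICE uses multiple imputation instead of single imputation, which makes it possible to account for the statistical uncertainty of the imputed values. It fits regression models repeatedly over the sample data, predicting each column that has missing values from the other columns, and takes the average of the results.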
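The chained-regression step can be sketched in plain NumPy. This is a deliberately simplified, deterministic illustration and not the library's implementation: the function name `chained_regression_impute` is hypothetical, ordinary least squares is assumed as the regression model, every column is assumed to have at least one observed value, and only a single completed dataset is produced, whereas full MICE draws several imputed datasets and pools them.

```python
import numpy as np

def chained_regression_impute(X, n_iter=10):
    """Iteratively regress each incomplete column on the others (simplified sketch)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()

    # Step 1: start from a simple guess, the mean of each column.
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])

    # Step 2: cycle through the columns, refining the guesses each round.
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows_missing = missing[:, j]
            if not rows_missing.any() or rows_missing.all():
                continue
            # Predictors: an intercept plus the current values of all other columns.
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(others)), others])
            # Fit on rows where column j was observed.
            coef, *_ = np.linalg.lstsq(
                design[~rows_missing], filled[~rows_missing, j], rcond=None
            )
            # Overwrite only the entries that were originally missing.
            filled[rows_missing, j] = design[rows_missing] @ coef
    return filled

X = np.array([
    [1.0, 2.1, 2.0],
    [2.0, np.nan, 3.2],
    [3.0, 6.2, np.nan],
    [4.0, 7.9, 5.1],
    [np.nan, 10.1, 5.9],
    [6.0, 12.0, 7.1],
])
completed = chained_regression_impute(X)
print(completed)
```

Each pass uses the latest estimates of the other columns as predictors, which is where the "chained" in the name comes from. Observed entries are never modified; only the originally missing cells are updated.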
Python3
import pandas as pd
import numpy as np
from fancyimpute import IterativeImputer

# Example data with missing values (NaN) in columns A, B and C
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4],
                   [5, 7, 8, 2],
                   [2, 5, 7, 9]],
                  columns=list('ABCD'))
print(df)

# Impute the missing values; fit_transform returns a NumPy array
mice_imputer = IterativeImputer()
df = mice_imputer.fit_transform(df)
print(df)
Output:
     A    B    C  D
0  NaN  2.0  NaN  0
1  3.0  4.0  NaN  1
2  NaN  NaN  NaN  5
3  NaN  3.0  NaN  4
4  5.0  7.0  8.0  2
5  2.0  5.0  7.0  9
[[3.27262261 2. 7.9809332 0.]
[3. 4. 7.9193547 1.]
[2.91717117 4.35730239 7.47523962 5.]
[2.77722048 3. 7.53760743 4.]
[5. 7. 8. 2.]
[2. 5. 7. 9.]]