Working with Missing Data in Pandas

Missing Data can occur when no information is provided for one or more items or for a whole unit. Missing Data is a very big problem in real life scenario. Missing Data can also refer to as NA(Not Available) values in pandas. In DataFrame sometimes many datasets simply arrive with missing data, either because it exists and was not collected or it never existed. For Example, Suppose different user being surveyed may choose not to share their income, some user may choose not to share the address in this way many datasets went missing.

In Pandas missing data is represented by two value:

  • None: None is a Python singleton object that is often used for missing data in Python code.
  • NaN : NaN (an acronym for Not a Number), is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation

Pandas treat None and NaN as essentially interchangeable for indicating missing or null values. To facilitate this convention, there are several useful functions for detecting, removing, and replacing null values in Pandas DataFrame :



In this article we are using CSV file, to download the CSV file used, Click Here.

Checking for missing values using isnull() and notnull()

In order to check missing values in Pandas DataFrame, we use a function isnull() and notnull(). Both function help in checking whether a value is NaN or not. These function can also be used in Pandas Series in order to find null values in a series.

Checking for missing values using isnull()

In order to check null values in Pandas DataFrame, we use isnull() function this function return dataframe of Boolean values which are True for NaN values.

Code #1:

filter_none

edit
close

play_arrow

link
brightness_4
code

# importing pandas as pd
import pandas as pd
  
# importing numpy as np
import numpy as np
  
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score':[np.nan, 40, 80, 98]}
  
# creating a dataframe from list
df = pd.DataFrame(dict)
  
# using isnull() function  
df.isnull()

chevron_right


Output:

 
Code #2:

filter_none

edit
close

play_arrow

link
brightness_4
code

# importing pandas package 
import pandas as pd 
    
# making data frame from csv file 
data = pd.read_csv("employees.csv"
    
# creating bool series True for NaN values 
bool_series = pd.isnull(data["Gender"]) 
    
# filtering data 
# displaying data only with Gender = NaN 
data[bool_series] 

chevron_right


Output:
As shown in the output image, only the rows having Gender = NULL are displayed.

 

Checking for missing values using notnull()

In order to check null values in Pandas Dataframe, we use notnull() function this function return dataframe of Boolean values which are False for NaN values.

Code #3:

filter_none

edit
close

play_arrow

link
brightness_4
code

# importing pandas as pd
import pandas as pd
  
# importing numpy as np
import numpy as np
  
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score':[np.nan, 40, 80, 98]}
  
# creating a dataframe using dictionary
df = pd.DataFrame(dict)
  
# using notnull() function 
df.notnull()

chevron_right


Output:

 
Code #4:

filter_none

edit
close

play_arrow

link
brightness_4
code

# importing pandas package 
import pandas as pd 
    
# making data frame from csv file 
data = pd.read_csv("employees.csv"
    
# creating bool series True for NaN values 
bool_series = pd.notnull(data["Gender"]) 
    
# filtering data 
# displayind data only with Gender = Not NaN 
data[bool_series] 

chevron_right


Output:
As shown in the output image, only the rows having Gender = NOT NULL are displayed.

 


Filling missing values using fillna(), replace() and interpolate()

In order to fill null values in a datasets, we use fillna(), replace() and interpolate() function these function replace NaN values with some value of their own. All these function help in filling a null values in datasets of a DataFrame. Interpolate() function is basically used to fill NA values in the dataframe but it uses various interpolation technique to fill the missing values rather than hard-coding the value.

Code #1: Filling null values with a single value

filter_none

edit
close

play_arrow

link
brightness_4
code

# importing pandas as pd
import pandas as pd
  
# importing numpy as np
import numpy as np
  
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score':[np.nan, 40, 80, 98]}
  
# creating a dataframe from dictionary
df = pd.DataFrame(dict)
  
# filling missing value using fillna()  
df.fillna(0)

chevron_right


Output:

 
Code #2: Filling null values with the previous ones

filter_none

edit
close

play_arrow

link
brightness_4
code

# importing pandas as pd
import pandas as pd
  
# importing numpy as np
import numpy as np
  
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score':[np.nan, 40, 80, 98]}
  
# creating a dataframe from dictionary
df = pd.DataFrame(dict)
  
# filling a missing value with
# previous ones  
df.fillna(method ='pad')

chevron_right


Output:

 
Code #3: Filling null value with the next ones

filter_none

edit
close

play_arrow

link
brightness_4
code

# importing pandas as pd
import pandas as pd
  
# importing numpy as np
import numpy as np
  
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score':[np.nan, 40, 80, 98]}
  
# creating a dataframe from dictionary
df = pd.DataFrame(dict)
  
# filling  null value using fillna() function  
df.fillna(method ='bfill')

chevron_right


Output:

 
Code #4: Filling null values in CSV File

filter_none

edit
close

play_arrow

link
brightness_4
code

# importing pandas package 
import pandas as pd 
    
# making data frame from csv file 
data = pd.read_csv("employees.csv")
  
# Printing the first 10 to 24 rows of
# the data frame for visualization   
data[10:25]

chevron_right



Now we are going to fill all the null values in Gender column with “No Gender”

filter_none

edit
close

play_arrow

link
brightness_4
code

# importing pandas package 
import pandas as pd 
    
# making data frame from csv file 
data = pd.read_csv("employees.csv"
  
# filling a null values using fillna() 
data["Gender"].fillna("No Gender", inplace = True
  
data

chevron_right


Output:

Code #5: Filling a null values using replace() method

filter_none

edit
close

play_arrow

link
brightness_4
code

# importing pandas package 
import pandas as pd 
    
# making data frame from csv file 
data = pd.read_csv("employees.csv")
  
# Printing the first 10 to 24 rows of
# the data frame for visualization   
data[10:25]

chevron_right


Output:

Now we are going to replace the all Nan value in the data frame with -99 value.

filter_none

edit
close

play_arrow

link
brightness_4
code

# importing pandas package 
import pandas as pd 
    
# making data frame from csv file 
data = pd.read_csv("employees.csv"
    
# will replace  Nan value in dataframe with value -99  
data.replace(to_replace = np.nan, value = -99

chevron_right


Output:

 
Code #6: Using interpolate() function to fill the missing values using linear method.

filter_none

edit
close

play_arrow

link
brightness_4
code

# importing pandas as pd 
import pandas as pd 
    
# Creating the dataframe  
df = pd.DataFrame({"A":[12, 4, 5, None, 1], 
                   "B":[None, 2, 54, 3, None], 
                   "C":[20, 16, None, 3, 8], 
                   "D":[14, 3, None, None, 6]}) 
    
# Print the dataframe 
df 

chevron_right



Let’s interpolate the missing values using Linear method. Note that Linear method ignore the index and treat the values as equally spaced.


filter_none

edit
close

play_arrow

link
brightness_4
code

# to interpolate the missing values 
df.interpolate(method ='linear', limit_direction ='forward')

chevron_right


Output:

As we can see the output, values in the first row could not get filled as the direction of filling of values is forward and there is no previous value which could have been used in interpolation.
 

Dropping missing values using dropna()

In order to drop a null values from a dataframe, we used dropna() function this fuction drop Rows/Columns of datasets with Null values in different ways.

Code #1: Dropping rows with at least 1 null value.

filter_none

edit
close

play_arrow

link
brightness_4
code

# importing pandas as pd
import pandas as pd
  
# importing numpy as np
import numpy as np
  
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
        'Second Score': [30, np.nan, 45, 56],
        'Third Score':[52, 40, 80, 98],
        'Fourth Score':[np.nan, np.nan, np.nan, 65]}
  
# creating a dataframe from dictionary
df = pd.DataFrame(dict)
    
df

chevron_right



Now we drop rows with at least one Nan value (Null value)

filter_none

edit
close

play_arrow

link
brightness_4
code

# importing pandas as pd
import pandas as pd
  
# importing numpy as np
import numpy as np
  
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
        'Second Score': [30, np.nan, 45, 56],
        'Third Score':[52, 40, 80, 98],
        'Fourth Score':[np.nan, np.nan, np.nan, 65]}
  
# creating a dataframe from dictionary
df = pd.DataFrame(dict)
  
# using dropna() function  
df.dropna()

chevron_right


Output:

Code #2: Dropping rows if all values in that row are missing.

filter_none

edit
close

play_arrow

link
brightness_4
code

# importing pandas as pd
import pandas as pd
  
# importing numpy as np
import numpy as np
  
# dictionary of lists
dict = {'First Score':[100, np.nan, np.nan, 95],
        'Second Score': [30, np.nan, 45, 56],
        'Third Score':[52, np.nan, 80, 98],
        'Fourth Score':[np.nan, np.nan, np.nan, 65]}
  
# creating a dataframe from dictionary
df = pd.DataFrame(dict)
    
df

chevron_right



Now we drop a rows whose all data is missing or contain null values(NaN)

filter_none

edit
close

play_arrow

link
brightness_4
code

# importing pandas as pd
import pandas as pd
  
# importing numpy as np
import numpy as np
  
# dictionary of lists
dict = {'First Score':[100, np.nan, np.nan, 95],
        'Second Score': [30, np.nan, 45, 56],
        'Third Score':[52, np.nan, 80, 98],
        'Fourth Score':[np.nan, np.nan, np.nan, 65]}
  
df = pd.DataFrame(dict)
  
# using dropna() function    
df.dropna(how = 'all')

chevron_right


Output:

Code #3: Dropping columns with at least 1 null value.

filter_none

edit
close

play_arrow

link
brightness_4
code

# importing pandas as pd
import pandas as pd
   
# importing numpy as np
import numpy as np
   
# dictionary of lists
dict = {'First Score':[100, np.nan, np.nan, 95],
        'Second Score': [30, np.nan, 45, 56],
        'Third Score':[52, np.nan, 80, 98],
        'Fourth Score':[60, 67, 68, 65]}
  
# creating a dataframe from dictionary 
df = pd.DataFrame(dict)
     
df

chevron_right



Now we drop a columns which have at least 1 missing values

filter_none

edit
close

play_arrow

link
brightness_4
code

# importing pandas as pd
import pandas as pd
   
# importing numpy as np
import numpy as np
   
# dictionary of lists
dict = {'First Score':[100, np.nan, np.nan, 95],
        'Second Score': [30, np.nan, 45, 56],
        'Third Score':[52, np.nan, 80, 98],
        'Fourth Score':[60, 67, 68, 65]}
  
# creating a dataframe from dictionary  
df = pd.DataFrame(dict)
  
# using dropna() function     
df.dropna(axis = 1)

chevron_right


Output :

 
Code #4: Dropping Rows with at least 1 null value in CSV file

filter_none

edit
close

play_arrow

link
brightness_4
code

# importing pandas module 
import pandas as pd 
    
# making data frame from csv file 
data = pd.read_csv("employees.csv"
    
# making new data frame with dropped NA values 
new_data = data.dropna(axis = 0, how ='any'
    
new_data

chevron_right


Output:

Now we compare sizes of data frames so that we can come to know how many rows had at least 1 Null value

filter_none

edit
close

play_arrow

link
brightness_4
code

print("Old data frame length:", len(data))
print("New data frame length:", len(new_data)) 
print("Number of rows with at least 1 NA value: ", (len(data)-len(new_data)))

chevron_right


Output :

Old data frame length: 1000
New data frame length: 764
Number of rows with at least 1 NA value:  236

Since the difference is 236, there were 236 rows which had at least 1 Null value in any column.



My Personal Notes arrow_drop_up