Handling Missing Data in Julia

Last Updated : 28 Jul, 2020

A common problem when working with large data sets is dealing with missing values. Missing values can lead to serious prediction errors, which is costly from a business point of view. When we encounter missing data, we therefore have to apply appropriate techniques to handle it.

Missing Object

Julia's missing object is a lightweight and fast way to represent missing values, playing a role similar to NA in R or NULL in SQL. Unlike NaN, which exists only for floating-point numbers, it works with any element type, including custom types.

To provide a consistent representation of missing values across predefined and custom types, Julia introduced the missing object, an object with no fields that is the only instance of the Missing singleton type. A value that may be absent is therefore either of type T or missing, and its type is written Union{Missing, T}.

# Int values mixed with missing
[1, missing]

# Char values mixed with missing
['1', missing]

# Float64 values mixed with missing
[1.0, missing]

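Output: each expression returns a two-element vector whose element type is Union{Missing, T}, where T is Int64 (on 64-bit systems), Char and Float64 respectively.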
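To make the Union{Missing, T} idea concrete, here is a minimal sketch that uses only Base Julia (nonmissingtype assumes Julia 1.3 or later). The variable names v, w and mask are illustrative and not part of the original article.

```julia
# `missing` is the only instance of the singleton type `Missing`
@show typeof(missing)

# A vector literal that mixes Int values and `missing` gets a Union element type
v = [1, missing, 3]
@show eltype(v) == Union{Missing, Int}

# The same kind of element type can also be declared explicitly
w = Union{Missing, Float64}[1.0, missing]
@show eltype(w)

# `ismissing` tests a single value; broadcast it to test every element
mask = ismissing.(v)          # true only for the second element
@show mask

# `nonmissingtype` recovers T from Union{Missing, T}
@show nonmissingtype(eltype(v)) == Int
```

Declaring the element type up front, as with w, is useful when a vector starts out without missing entries but may receive them later.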

Julia's missing framework is generic and efficient. It is designed for safety: missing values should never be silently ignored or replaced with non-missing values. Instead, most mathematical operators and functions propagate missing, so an operation that involves a missing value returns missing rather than an error or a misleading number. This lets computations on data sets that contain missing values run without raising errors.

# Add a number to a missing value
1 + missing

# Subtract a missing value from a number
1 - missing

# Multiply a number by a missing value
2 * missing

# Round a missing value
round(missing)

# Take the cosine of a missing value
cos(missing)

Output: every expression above returns missing.

As the results show, arithmetic operators and mathematical functions applied to a missing object return missing instead of throwing an error or silently producing a number. Unlike NaN, which is a floating-point value, missing behaves this way for values of any type.
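Propagation also applies to comparisons, which is easy to trip over, so a short sketch may help. It uses only Base Julia; the variable name x is illustrative.

```julia
x = missing

# Arithmetic propagates: the result is `missing`, not an error
@show ismissing(x + 1)          # true

# Comparison with == also propagates, so it cannot be used to test for missing
@show ismissing(x == 1)         # true
@show ismissing(x == missing)   # true

# Use `ismissing` (or `isequal`) to test for a missing value
@show ismissing(x)              # true
@show isequal(x, missing)       # true

# NaN is an ordinary Float64 value, whereas `missing` has its own type
# and can sit next to values of any other type
@show typeof(NaN)               # Float64
@show eltype(["a", missing])
```

Because x == missing itself evaluates to missing, tests for missing values are normally written with ismissing.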

To work with the non-missing values only, we can use the convenience function skipmissing(). It returns an iterator that skips the missing values, so the remaining values of an array or of a data frame column can be used as usual.

# mean is provided by the Statistics standard library
using Statistics

# Sum the values of the array, ignoring missing
sum(skipmissing([1, missing, 5]))

# Mean of the values of the array, ignoring missing
mean(skipmissing([4, missing, 3]))

Output: the sum is 6 and the mean is 3.5.
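The following sketch, using only Base Julia and the Statistics standard library, illustrates that skipmissing() is a lazy wrapper and leaves the original array untouched. The names data, itr and clean are illustrative.

```julia
using Statistics  # standard library; provides `mean`

data = [1, missing, 5]   # illustrative data

# Reductions over the raw vector propagate `missing`
@show sum(data)                 # missing

# `skipmissing` wraps the vector lazily; nothing is copied or modified
itr = skipmissing(data)
@show sum(itr)                  # 6
@show mean(itr)                 # 3.0

# `collect` materialises the non-missing values into a new vector
clean = collect(itr)
@show clean                     # [1, 5]

# The original vector still contains its missing entry
@show count(ismissing, data)    # 1
```

Collecting is only needed when a concrete vector is required; reductions such as sum and mean can consume the iterator directly.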

Methods to handle missing data

There are many ways to handle missing values. Some of them are given below:

Drop missing values from the dataframe

The dropmissing() function removes every row that contains a missing value from the data frame. Dropping rows works well for data sets that are large enough that losing some rows does not affect the predictions. It is less suitable for small data sets, where discarding rows may leave too little data and lead to underfitted models.

# Install and load DataFrames (missing and skipmissing are part of Base Julia)
using Pkg
Pkg.add("DataFrames")
using DataFrames

# Define a DataFrame that contains missing values
df = DataFrame(i = 1:6,
               x = [5, missing, 4, missing, 2, 1],
               y = ["a", missing, missing, "c", "d", "e"])

# Drop every row that contains a missing value
gfg = dropmissing(df)

print(gfg)

Output: a data frame that contains only the three complete rows, i = 1, 5 and 6.
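To show what row-wise deletion amounts to, the sketch below reproduces the filtering on a plain vector of named tuples using only Base Julia. It illustrates the idea under that assumption and is not how DataFrames implements dropmissing(); the names rows, complete_rows and lost are hypothetical.

```julia
# The same data as above, stored row by row as named tuples
xs = [5, missing, 4, missing, 2, 1]
ys = ["a", missing, missing, "c", "d", "e"]
rows = [(i = i, x = xs[i], y = ys[i]) for i in 1:6]

# Keep a row only if none of its fields is missing
complete_rows = filter(r -> !any(ismissing, r), rows)

for r in complete_rows
    println(r)
end

# Fraction of rows lost: worth checking before dropping rows from a small data set
lost = 1 - length(complete_rows) / length(rows)
println("fraction of rows dropped: ", lost)   # 0.5
```

Here half of the rows are discarded, which is the kind of loss that makes dropping rows risky on small data sets.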

Skipping the missing values from the dataframe

The skipmissing() function lets us skip the missing values in a single column. This is often a better option than removing whole rows, because the other values in those rows stay in the data frame and remain available as useful data for building models.

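# Install and load DataFrames (missing and skipmissing are part of Base Julia)
using Pkg
Pkg.add("DataFrames")
using DataFrames

# Define a DataFrame that contains missing values
df = DataFrame(i = 1:6,
               x = [5, missing, 4, missing, 2, 1],
               y = ["a", missing, missing, "c", "d", "e"])

# Skip the missing values in column x (the second column)
gfg = skipmissing(df.x)

# The raw column still contains missing, so its maximum is missing
println(maximum(df.x))
# The maximum over the non-missing values only is 5
println(maximum(gfg))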
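The same pattern extends to several columns. The sketch below uses a plain NamedTuple of vectors in place of a DataFrame so that it runs with Base Julia alone; the name cols and the printed summary layout are illustrative assumptions.

```julia
# Column data copied from the example above
cols = (x = [5, missing, 4, missing, 2, 1],
        y = ["a", missing, missing, "c", "d", "e"])

for (name, col) in pairs(cols)
    # Count how many entries of this column are missing
    n_missing = count(ismissing, col)
    # maximum works for numbers and strings alike once missing is skipped
    col_max = maximum(skipmissing(col))
    println(name, ": ", n_missing, " missing, maximum = ", col_max)
end
```

Counting the missing entries per column first gives a quick sense of whether skipping values or dropping rows is the more reasonable choice for a given data set.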