
How to deal with missing values in a Timeseries in Python?

Last Updated : 26 Dec, 2023

It is common to come across missing values when working with real-world data. Time series data is different from traditional machine learning datasets because it is collected under varying conditions over time. As a result, different mechanisms can be responsible for missing records at different times. These mechanisms are known as missingness mechanisms. In this article, we will discuss how to handle missing values in time series data using Python.

What is Timeseries Data?

A time series is a sequence of observations recorded at regular time intervals. Time series analysis shows how a given asset, security, or economic variable changes over time. Two questions follow: why do missing values need to be handled, and why do they appear in the data in the first place?

  • The handling of missing data is very important during the preprocessing of the dataset as many machine learning algorithms do not support missing values.
  • Time series are subject to missing points due to problems in reading or recording the data.

Why can’t we simply replace the missing values with the global mean? Because time series data often has a trend or seasonality, the global mean can be far from the value expected at a particular point in time.

Conventional methods such as mean and mode imputation or deletion are often not good enough for time series, because they ignore the ordering of the observations and can bias the data. Estimating the missing values with a procedure that takes the neighbouring observations into account usually reduces this bias and leaves a complete dataset ready for the next step of analysis or data mining.
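
To make the point about the global mean concrete, here is a minimal sketch on a made-up series (the values 0 to 9 and the position of the gap are assumptions chosen for illustration, not part of the dataset used later). It hides one point of a pure upward trend and compares the error of mean imputation with that of linear interpolation.

```python
import numpy as np
import pandas as pd

# Hypothetical series: a pure upward trend 0, 1, 2, ..., 9
idx = pd.date_range("2020-01-01", periods=10, freq="D")
truth = pd.Series(np.arange(10, dtype=float), index=idx)

# Hide the observation whose true value is 8.0
observed = truth.copy()
observed.iloc[8] = np.nan

# Global mean of the remaining values: 37 / 9, roughly 4.11
mean_filled = observed.fillna(observed.mean())
# Linear interpolation uses the neighbours 7.0 and 9.0
interp_filled = observed.interpolate(method="linear")

mean_error = abs(mean_filled.iloc[8] - truth.iloc[8])
interp_error = abs(interp_filled.iloc[8] - truth.iloc[8])
print(f"mean imputation error: {mean_error:.2f}")   # 3.89
print(f"interpolation error:   {interp_error:.2f}")  # 0.00
```

On a trending series the global mean sits near the middle of the value range, so the error grows the further the gap is from the centre of the series.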

Types of Time Series Data

Let’s start by categorizing time series data based on its composition before delving into imputation methods. If we break the series down with an additive decomposition model, it can be represented as:

Y_{t}=m_{t}+s_{t}+\epsilon_{t}

Here,

  • m_{t} represents the trend,
  • s_{t} represents seasonality, and
  • \epsilon_{t} represents the random (noise) component.

Based on the presence or absence of these components, we can identify four types of time series data:

1. No trend or seasonality (Constant): Data remains relatively constant over time, with neither trend nor seasonal fluctuations.

Y_{t}=\epsilon_{t}

2. Trend, but no seasonality (Trendy): Data exhibits a clear long-term trend (increasing or decreasing) but no regular seasonal patterns.

Y_{t}=m_{t}+\epsilon_{t}

3. Seasonality, but no trend (Seasonal): Data shows recurring fluctuations within a specific period (e.g., monthly sales cycles) but no overall trend over time.

Y_{t}=s_{t}+\epsilon_{t}

4. Both trend and seasonality (Trend-seasonal): Data exhibits both a long-term trend and recurring seasonal patterns. This is the most complex type of time series data.

Y_{t}=m_{t}+s_{t}+\epsilon_{t}

Types of Missing Data

Missing data is a common challenge in time series analysis, impacting the accuracy and reliability of your results. Understanding the different types of missing data is crucial for choosing the right imputation strategy to address them effectively. Here’s a breakdown of the main types:

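  1. Missing Completely at Random (MCAR): Data points are missing randomly and independently of any other variables or observations. This is the most benign case, because the observed values remain a representative sample of the data. The choice of imputation method still matters for a time series, since a method that ignores trend or seasonality can distort the series.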
  2. Missing at Random (MAR): Data points are missing depending on observed values in other variables, but not on the missing values themselves. This is a more complex scenario, but imputation using observed data can still be effective.
  3. Missing Not at Random (MNAR): Data points are missing depending on the missing values themselves, making them difficult to predict accurately. This is the most challenging case, as traditional imputation methods can introduce bias and distort your analysis.
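
The three mechanisms can be simulated to see how each affects a simple statistic. This is a sketch under stated assumptions: the variables temperature and reading, the missingness probabilities, and the 90th-percentile cutoff are all hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000
# Hypothetical data: a sensor reading that depends on temperature
temperature = rng.normal(20, 5, n)
reading = 2.0 * temperature + rng.normal(0, 3, n)
df_sim = pd.DataFrame({"temperature": temperature, "reading": reading})

# MCAR: every reading has the same 10% chance of being missing
mcar_mask = rng.random(n) < 0.10
# MAR: readings go missing more often when the observed temperature is high
mar_mask = rng.random(n) < np.where(df_sim["temperature"] > 25, 0.40, 0.05)
# MNAR: readings go missing when the reading itself is large
mnar_mask = df_sim["reading"] > df_sim["reading"].quantile(0.90)

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed_mean = df_sim.loc[~mask, "reading"].mean()
    print(f"{name}: {mask.mean():.1%} missing, observed mean = {observed_mean:.2f}")
print(f"true mean = {df_sim['reading'].mean():.2f}")
```

Under MNAR the mean of the observed readings is necessarily lower than the true mean here, because the largest values are exactly the ones that were dropped.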

Handle Missing Values in Time Series in Python

Here’s a step-by-step guide to a Python implementation for handling missing values in a time series dataset:

Step 1: Importing the Libraries

Here we are importing all the necessary libraries:

Python3

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

                    

Step 2: Importing the Dataset

  1. Importing data: It imports pandas library (pd) and reads the data from the CSV file using pd.read_csv, assuming the first row doesn’t contain column names (header=None).
  2. Naming columns: It assigns “Date” and “Customers” as names for the two columns using df.columns.
  3. Converting date format: It converts the “Date” column into a proper datetime format with year, month, and day order using pd.to_datetime and specifying the original format string (%Y-%m).
  4. Setting Date index: It sets the “Date” column as the index of the DataFrame using df.set_index, making it the reference point for time-based operations.
  5. Checking data shape and preview: It checks the final data shape with df.shape and displays the first few rows with df.head().

Python3

# import the data (the file has no header row)
df = pd.read_csv('/Time-Series.csv', header=None)

# name the columns
df.columns = ['Date', 'Customers']

# parse the Date column as datetime; the raw values are in year-month format
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m')

# set the Date column as the index of our dataset
df = df.set_index('Date')

# check the data shape and preview the first rows
print(df.shape)
print(df.head())

                    

Output:

(144, 1)
                    Customers
Date                 
1949-01-01      114.0
1949-02-01      120.0
1949-03-01      134.0
1949-04-01       67.0
1949-05-01      123.0
  1. Identifying missing values:
    • nul_data = pd.isnull(df['Customers']): This line uses the pd.isnull function from pandas to create a new Boolean Series (nul_data) containing True for every missing value in the “Customers” column of the DataFrame df and False otherwise.
  2. Filtering and printing data:
    • df[nul_data]: This line uses Boolean indexing to filter the original DataFrame df based on the nul_data Series. It essentially selects only the rows where the “Customers” value is missing (i.e., True in the corresponding nul_data series).

Python3

nul_data = pd.isnull(df['Customers'])
     
# print only the data, Customers = NaN
df[nul_data]

                    

Output:

                Customers
Date    
1951-06-01    NaN
1951-07-01    NaN
1954-06-01    NaN
1960-03-01    NaN
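
Besides listing the NaN rows, it can help to count them and to check for timestamps that are absent from the index altogether. The sketch below uses a small made-up frame (its values and the missing June row are assumptions for illustration) and the pandas asfreq method to expose such implicit gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly data: two NaN values and one absent row (1949-06)
idx = pd.to_datetime(["1949-01-01", "1949-02-01", "1949-03-01",
                      "1949-04-01", "1949-05-01", "1949-07-01"])
demo = pd.DataFrame({"Customers": [114.0, np.nan, np.nan, 67.0, 123.0, 130.0]},
                    index=idx)

# Explicit gaps: values recorded as NaN
n_nan = int(demo["Customers"].isna().sum())

# Implicit gaps: asfreq("MS") reindexes to every month start,
# inserting a NaN row for each timestamp that was absent
full = demo.asfreq("MS")
n_total_missing = int(full["Customers"].isna().sum())

# Length of each run of consecutive missing values
is_na = full["Customers"].isna()
run_id = (is_na != is_na.shift()).cumsum()
gap_lengths = is_na.astype(int).groupby(run_id).sum()
gap_lengths = gap_lengths[gap_lengths > 0].tolist()

print(n_nan, n_total_missing, gap_lengths)  # 2 3 [2, 1]
```

Knowing the gap lengths is useful when choosing a method, since carrying a value forward or interpolating across a long gap is riskier than across a single missing point.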

Plot the Graph

This creates a line plot of the data in the DataFrame df. It automatically uses the index (assumed to be the date) as the x-axis and the “Customers” column as the y-axis. 

Python3

plt.rcParams['figure.figsize']=(15,7)
 
# plots our series
plt.plot(df, color='green')
 
plt.title('Customers visited shop since 1949')
 
plt.show()

                    

Output:

Timeseries in Python

Step 3: Imputing the Missing Values

Here is the explanation of the techniques mentioned for handling missing values in time series data:

  1. Mean Imputation: Replaces missing values with the average of the entire column. Simple and fast, but may not capture trends or local variations.
  2. Median Imputation: Replaces missing values with the median of the entire column. Less sensitive to outliers than mean, but still lacks local context.
  3. Last Observation Carried Forward (LOCF): Replaces missing values with the last known value. Works well when the series changes slowly between observations, but it flattens the series across the gap and lags behind any trend.
  4. Next Observation Carried Backward (NOCB): Replaces missing values with the next known value. It mirrors LOCF, filling from the other side of the gap, which means it uses future information. Both LOCF and NOCB can introduce artificial jumps or dips.
  5. Linear Interpolation: Estimates missing values by drawing a straight line between the two nearest known data points. Good for capturing linear trends, but less accurate for complex patterns.
  6. Spline Interpolation: Estimates missing values by fitting a flexible, curved line through the data points. More accurate for capturing complex trends and subtle changes than linear interpolation, but computationally more expensive.

1. Mean imputation

It performs mean imputation on the “Customers” column of the DataFrame. It creates a new column named “FillMean” containing the original values where available and the average value of the “Customers” column where missing.

Python3

plt.rcParams['figure.figsize']=(15,7)
 
# fill the missing data using the mean of the present observations
df = df.assign(FillMean=df.Customers.fillna(df.Customers.mean()))
 
# plot the original and the mean-imputed series in green
plt.plot(df, color='green')
 
plt.title('Mean Imputation')
plt.show()

                    

Output:

Mean imputation

2. Median imputation

It performs median imputation on the dataset. It copies all existing columns and adds a new column named FillMedian. This new column fills in missing values in the Customers column using the median value of that column (df.Customers.median()).

Python3

plt.rcParams['figure.figsize']=(15,7)
 
# fill the missing data using the median of the present observations
dataset = df.assign(FillMedian=df.Customers.fillna(df.Customers.median()))
 
# plot the original and the median-imputed series in green
plt.plot(dataset, color='green')
 
plt.title('Median Imputation')
plt.show()

                    

Output:

Median imputation

3. Last Observation Carried Forward (LOCF)

This step imputes missing values in the “Customers” column using the Last Observation Carried Forward (LOCF) technique, which copies the most recent previous value into each gap, and then plots the resulting time series.

Python3

plt.rcParams['figure.figsize']=(15,7)
 
# On the Customers column, impute the missing values with LOCF (forward fill)
df['Customers_locf']= df['Customers'].ffill()
 
# plot our time series with imputed values
plt.plot(df['Customers_locf'], color='green')
 
plt.title('Last Observation Carried Forward')
plt.show()

                    

Output:

Last Observation Carried Forward

4. Next Observation Carried Backward (NOCB)

This step uses the Next Observation Carried Backward (NOCB) technique instead: each missing value in the “Customers” column is filled with the next available observation, and the resulting time series is then plotted.

Python3

plt.rcParams['figure.figsize']=(15,7)
 
# On the Customers column, impute the missing values with NOCB (backward fill)
df['Customers_nocb']= df['Customers'].bfill()
 
# plot our time series with imputed values
plt.plot(df['Customers_nocb'], color='green')
 
plt.title('Next Observation Carried Backward')
plt.show()

                    

Output:

Next Observation Carried Backward
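
The difference between the two fills, and their behaviour at the edges of a series, is easiest to see on a tiny example. The series below is made up for illustration; ffill, bfill, and the limit parameter are standard pandas.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 10.0, np.nan, np.nan, 40.0, np.nan])

locf = s.ffill()                 # carry the last observation forward
nocb = s.bfill()                 # carry the next observation backward
locf_limited = s.ffill(limit=1)  # fill at most one consecutive NaN per gap

print(locf.tolist())          # [nan, 10.0, 10.0, 10.0, 40.0, 40.0]
print(nocb.tolist())          # [10.0, 10.0, 40.0, 40.0, 40.0, nan]
print(locf_limited.tolist())  # [nan, 10.0, 10.0, nan, 40.0, 40.0]
```

LOCF cannot fill a gap at the very start of a series and NOCB cannot fill one at the very end, so a leading or trailing NaN may need a second pass with the other method.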

5. Linear Interpolation

This step estimates the missing values in the “Customers” column with linear interpolation, which places each missing point on the straight line joining its nearest known neighbours.

Python3

plt.rcParams['figure.figsize']=(15,7)
 
# on our data, impute the missing values using linear interpolation
df['Customers_L']= df['Customers'].interpolate(method='linear')
 
# plot the complete dataset
plt.plot(df['Customers_L'], color='green')
 
plt.title('Linear interpolation')
plt.show()

                    

Output:

Linear interpolation
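
One detail worth knowing: in pandas, method='linear' treats the observations as equally spaced and ignores the index, while method='time' weights by the actual time elapsed. The sketch below uses a made-up, irregularly spaced series to show the difference; for evenly spaced data the two give the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: 1 day before the gap, 9 days after it
idx = pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-11"])
s = pd.Series([0.0, np.nan, 10.0], index=idx)

by_position = s.interpolate(method="linear")  # assumes equal spacing
by_time = s.interpolate(method="time")        # uses the DatetimeIndex

print(by_position.iloc[1])  # 5.0
print(by_time.iloc[1])      # 1.0
```

For monthly data the effect is small, since month lengths differ by only a few days, but it can matter for sensor logs or transaction data with uneven timestamps.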

6. Spline Interpolation

This step estimates the missing values in the “Customers” column with spline interpolation, which fits a smooth piecewise-polynomial curve through the known points.

Python3

plt.rcParams['figure.figsize']=(15,7)
 
# impute the missing values with spline interpolation (requires SciPy);
# order=3 (a cubic spline) is an assumed choice, not specified in the original
df['Customers_Spline']= df['Customers'].interpolate(method='spline', order=3)
 
# plot the complete dataset
plt.plot(df['Customers_Spline'], color='green')
 
plt.title('Spline Interpolation')
plt.show()

                    

Output:

Spline interpolation
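
A plot alone does not show which method is most accurate. One common way to compare methods is to hide values that are actually known, impute them, and measure the error. The sketch below does this on a synthetic trend-seasonal series; the series, the number of hidden points, and the RMSE metric are assumptions for illustration, and spline interpolation is left out so that the example needs only pandas and NumPy.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 120
idx = pd.date_range("2010-01-01", periods=n, freq="MS")
t = np.arange(n)
# Hypothetical trend-seasonal series with mild noise
truth = pd.Series(100 + 2 * t + 20 * np.sin(2 * np.pi * t / 12)
                  + rng.normal(0, 2, n), index=idx)

# Hide 10 interior points so every method has data on both sides
hidden = rng.choice(np.arange(1, n - 1), size=10, replace=False)
observed = truth.copy()
observed.iloc[hidden] = np.nan

candidates = {
    "mean": observed.fillna(observed.mean()),
    "median": observed.fillna(observed.median()),
    "locf": observed.ffill(),
    "nocb": observed.bfill(),
    "linear": observed.interpolate(method="linear"),
}

def rmse(filled):
    """Root mean squared error on the hidden points only."""
    err = filled.iloc[hidden] - truth.iloc[hidden]
    return float(np.sqrt((err ** 2).mean()))

scores = pd.Series({name: rmse(f) for name, f in candidates.items()}).sort_values()
print(scores.round(2))
```

The ranking depends on the data: on a series with a strong trend, as here, mean and median imputation tend to score worst, but that should be checked on your own data rather than assumed.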

Conclusion

Dealing with missing values in your Python time series can be a frustrating experience. However, with careful analysis and the right imputation technique, you can transform fragmented data into a smooth and reliable flow for more accurate analysis. It’s important to note that there is no one-size-fits-all approach to imputation, so it’s essential to assess your data, understand the patterns of missingness, and choose the technique that best preserves the integrity and meaning of your time series. By embracing the power of imputation and bridging the gaps with confidence, you can take your time series analysis to new heights!

Frequently Asked Questions (FAQs)

1. What are common reasons for missing values in time series data?

Missing values in time series data can occur due to sensor malfunctions, data transmission errors, or simply gaps in data collection.

2. How do you identify missing values in a time series using Python?

Use functions like isnull() or info() in libraries like Pandas to detect missing values in time series data.

3. What is mean imputation, and when is it suitable for handling missing values in time series?

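Mean imputation involves replacing missing values with the mean of the available data. It is suitable when the values are missing at random and the series has no pronounced trend or seasonality, so that the overall mean is a reasonable estimate at any point in time.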

4. Can you use interpolation techniques to fill missing values in a time series?

Yes, interpolation techniques like linear or spline interpolation can be used to estimate missing values based on existing data points.

5. How does backward fill (NOCB) work in filling missing values in a time series?

Backward fill replaces missing values with the next available observation, filling gaps by carrying values backward.



