
Univariate Time Series Analysis and Forecasting

Time series forecasting is one of the more challenging tasks in machine learning and in real-world data problems, because the observations depend not only on physical factors but, to a large extent, on the chronological order in which they occur. A target value in a time series can be forecast from a single feature (univariate forecasting) or from two or more features (bivariate or multivariate forecasting). In this article, we will learn how to perform a univariate forecast on a Rainfall dataset taken from Kaggle.

What is Univariate Forecasting?

Univariate forecasting is commonly used when you want to make prediction values of a single variable, especially when there are historical data points available for that variable. It’s a fundamental and widely applicable technique in fields like economics, finance, weather forecasting, and demand forecasting in supply chain management.



For more complex forecasting tasks where multiple variables or external factors may have an impact, multivariate forecasting techniques are used. These models take into account multiple variables and their interactions for making predictions.
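To make the distinction concrete, here is a minimal sketch (with made-up numbers, not the Kaggle dataset used below) of what univariate versus multivariate time series data looks like in pandas:

import pandas as pd

# Hypothetical monthly index, purely for illustration
idx = pd.date_range('2020-01-01', periods=6, freq='MS')

# Univariate: a single variable observed over time
univariate = pd.Series([10.2, 12.5, 9.8, 11.1, 13.0, 12.2], index=idx, name='rainfall')

# Multivariate: several related variables observed over the same time index
multivariate = pd.DataFrame({
    'rainfall': univariate,
    'temperature': [23.9, 25.8, 26.7, 22.5, 19.1, 18.4],
}, index=idx)

print(univariate.head())
print(multivariate.head())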

Key Concepts of Univariate Forecasting

Techniques of Univariate Forecasting

Several methods are used in univariate time series analysis to comprehend, model, and predict the behavior of a single variable over time. In univariate time series analysis, the following methods are frequently employed:



Implementation of Univariate Forecasting

Importing Libraries




import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Imports used later for decomposition, ACF/PACF plots, lag plots
# and the stationarity test
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller
from pandas.plotting import lag_plot

import warnings
warnings.filterwarnings('ignore')

Python libraries make it easy to handle the data and to perform both routine and complex tasks with a single line of code.

Loading Dataset




df = pd.read_csv('Rainfall_data.csv')
df.head()

Output:

   Year  Month  Day  Specific Humidity  Relative Humidity  Temperature  Precipitation
0  2000      1    1               8.06              48.25        23.93           0.00
1  2000      2    1               8.73              50.81        25.83           0.11
2  2000      3    1               8.48              42.88        26.68           0.01
3  2000      4    1              13.79              55.69        22.49           0.02
4  2000      5    1              17.40              70.88        19.07         271.14

Here we load the dataset into a pandas DataFrame so that we can explore its different aspects.

Shape of the dataframe




df.shape

Output:

(252, 7)

This code returns the row and column counts of the DataFrame df.

Data Information




df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Year               252 non-null    int64
 1   Month              252 non-null    int64
 2   Day                252 non-null    int64
 3   Specific Humidity  252 non-null    float64
 4   Relative Humidity  252 non-null    float64
 5   Temperature        252 non-null    float64
 6   Precipitation      252 non-null    float64
dtypes: float64(4), int64(3)
memory usage: 13.9 KB

The df.info() function shows the data type of each column along with its count of non-null values, so we can quickly see whether any values are missing.

Describing the data




print(df.describe().T)

Output:

                   count         mean         std      min        25%       50%       75%      max
Year               252.0  2010.000000    6.067351  2000.00  2005.0000  2010.000  2015.000  2020.00
Month              252.0     6.500000    3.458922     1.00     3.7500     6.500     9.250    12.00
Day                252.0     1.000000    0.000000     1.00     1.0000     1.000     1.000     1.00
Specific Humidity  252.0    14.416746    4.382599     5.74    10.0100    15.200    18.875    20.57
Relative Humidity  252.0    67.259524   17.307101    34.69    51.8450    66.655    84.610    92.31
Temperature        252.0    16.317262    6.584842     4.73    10.8650    16.915    22.115    29.34
Precipitation      252.0   206.798929  318.093091     0.00     0.4025    11.495   353.200  1307.43

The df.describe() function gives a statistical summary of the DataFrame df. It includes key statistics such as the count, mean, standard deviation, minimum, and maximum of each numerical column, providing a preliminary understanding of the data's central tendencies and spread.

Exploratory Data Analysis

EDA is an approach to analyzing data using visual techniques. It is used to discover trends and patterns, or to check assumptions, with the help of statistical summaries and graphical representations. While performing EDA on this dataset, we will look at how the features relate to one another, that is, how one affects the other.




# Feature engineering
df['Date'] = pd.to_datetime(df['Year'].astype(str) + '-' + df['Month'].astype(str) + '-' + df['Day'].astype(str),
                            format='%Y-%m-%d')
df['DayOfWeek'] = df['Date'].dt.dayofweek
df['Month'] = df['Date'].dt.month
df['Quarter'] = df['Date'].dt.quarter
df['Year'] = df['Date'].dt.year
 
# EDA
cat_cols = ['DayOfWeek', 'Month', 'Quarter', 'Year']
 
for col in cat_cols:
    df[['Precipitation', col]].groupby(col).mean().plot.bar()
    plt.title(f'Mean Precipitation by {col}')
    plt.show()

Output:

Bar charts of mean precipitation grouped by DayOfWeek, Month, Quarter, and Year.

This code performs feature engineering on the rainfall time series. It builds a 'Date' column and derives additional temporal features from it: 'DayOfWeek', 'Month', 'Quarter', and 'Year'. Exploratory data analysis is then carried out, focusing on the mean precipitation for each unique value of the newly engineered temporal features. The code iterates through the specified categorical columns and creates bar charts, so the average precipitation patterns over the week, month, quarter, and year can be evaluated visually. By helping to uncover temporal trends and patterns in the rainfall data, these visualizations give additional insight into the temporal characteristics of the dataset.

Seasonal Decomposition

A statistical technique used in time series analysis to separate the constituent parts of a dataset is called seasonal decomposition. Three fundamental components of the time series are identified: trend, seasonality, and residuals. The long-term movement or direction is represented by the trend, repeating patterns at regular intervals are captured by seasonality, and random fluctuations are captured by residuals. By separating the effects of seasonality from broader trends and anomalies, decomposing a time series helps to comprehend the specific contributions of various components, enabling more accurate analysis and predictions.




# Seasonal decomposition
ts = df.set_index('Date')['Precipitation'] + 0.01  # Add a small constant
result = seasonal_decompose(ts, model='multiplicative', period=12)
result.plot()
plt.show()

Output:

Decomposition plot showing the observed series together with its trend, seasonal, and residual components.

Using the time series 'ts', which represents the precipitation data, the code performs seasonal decomposition. A small constant is added to avoid problems with zero values under the multiplicative model. The decomposition separates the series into three parts, trend, seasonality, and residuals, using a multiplicative model with a 12-month seasonal period. The resulting plot provides a visual representation of these components and helps identify long-term trends and recurring patterns in the precipitation data.
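If you want to work with the individual components rather than the combined plot, the DecomposeResult object returned by seasonal_decompose exposes them as attributes. A minimal sketch, reusing the result object from the snippet above:

# Each component of the decomposition is available as its own series
trend = result.trend          # long-term movement (NaN at the window edges)
seasonal = result.seasonal    # repeating 12-month pattern
residual = result.resid       # what remains after removing trend and seasonality

print(trend.dropna().head())
print(seasonal.head(12))      # one full seasonal cycle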

Autocorrelation and Partial Autocorrelation Plots

Autocorrelation: Autocorrelation measures the association between a time series and its lagged values. The series is compared with each of its lags, and peaks in an autocorrelation plot show high correlation at particular lags. By revealing recurring patterns or seasonality in the time series data, this aids in understanding its temporal structure and supports the choice of suitable model parameters for time series analysis.

Partial Autocorrelation: Partial autocorrelation measures a variable's direct correlation with its lags while removing the influence of the intermediate lags. Significant peaks in a Partial Autocorrelation Function (PACF) plot indicate that a particular lag has a direct impact on the current observation. It helps determine the appropriate order of the autoregressive component in time series modeling by capturing the distinct contribution of each lag.




# Autocorrelation and Partial Autocorrelation
plot_acf(ts, lags=30)
plot_pacf(ts, lags=30)
plt.show()

Output:

Autocorrelation

Interpretation:
The ACF measures the correlation between a time series and its lagged values at different time intervals.

Partial Autocorrelation

Interpretation:
The PACF measures the correlation between a time series and its lagged values, controlling for the effects of other lags.

For the time series 'ts', which represents the precipitation data, the code above creates plots of the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF). These charts show the correlation between the time series and its lagged values, restricted to 30 lags. Peaks in the ACF plot suggest strong correlations at particular lags that may be seasonal. The PACF plot, which shows the direct impact of each lag on the current observation, makes it easier to find appropriate autoregressive terms for time series modeling. Together, these plots serve as a general reference for selecting model parameters and identifying temporal patterns.
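The same information can be read off numerically with the acf and pacf functions from statsmodels.tsa.stattools. This is a small sketch; the 0.2 threshold below is just an illustrative cut-off, not a formal significance test.

from statsmodels.tsa.stattools import acf, pacf

# Numeric counterparts of the plots above: correlation values at each lag
acf_values = acf(ts, nlags=30)
pacf_values = pacf(ts, nlags=30)

# Lags whose absolute autocorrelation exceeds an illustrative 0.2 threshold
strong_lags = [lag for lag, val in enumerate(acf_values) if lag > 0 and abs(val) > 0.2]
print('Lags with |ACF| > 0.2:', strong_lags)
print('PACF at lag 12:', pacf_values[12])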

Lag Plot

In time series analysis, a lag plot is a graphical tool that shows the relationship between a variable and its lagged values. It helps find patterns, trends, or unpredictability in the data by comparing each data point to its prior observation. The plot may show the presence of autocorrelation if there is a substantial correlation between a point and its lag. This can help with understanding the temporal dependencies and direct the selection of the best model for time series analysis.




# Lag Plots
for i in range(1, 3):
    lag_plot(ts, lag=i)
    plt.title(f'Lag Plot with lag = {i}')
    plt.show()

Output:

Lag plots for lag = 1 and lag = 2.

With lags set to 1 and 2, the code creates lag plots for the time series 'ts'. Each iteration plots every data point against its preceding observation at the specified lag. By highlighting autocorrelation patterns, these plots make it easier to spot possible trends and temporal dependencies in the data. The titles include the lag values so that each plot can be read on its own when judging the strength of the autocorrelation.
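As a quick numeric complement to the scatter plots, the lag-wise correlation can also be computed directly with the pandas Series.autocorr method; a minimal sketch:

# Quantify what the lag plots show: the linear correlation between the
# series and a lagged copy of itself
for i in range(1, 3):
    print(f'Correlation with lag {i}: {ts.autocorr(lag=i):.3f}')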

Stationarity Check

To make sure that a dataset's statistical characteristics do not change over time, a stationarity check is an essential step in time series analysis. Model building is easier when a time series is stationary, since its mean, variance, and autocorrelation remain constant. Common approaches are visual inspections, such as rolling-statistics plots, and formal statistical tests, such as the Augmented Dickey-Fuller test. By reducing the influence of non-constant patterns, stationarity ensures dependable modeling and supports more precise forecasting and trend analysis of time series data.




# Stationarity check
adf_result = adfuller(ts)
print('ADF Statistic:', adf_result[0])
print('p-value:', adf_result[1])

Output:

ADF Statistic: -2.46632490177327
p-value: 0.12388427626757847

The code uses the Augmented Dickey-Fuller test to check the time series 'ts' for stationarity, and prints the ADF Statistic and the p-value. The null hypothesis of the test is that the series contains a unit root, i.e., is non-stationary. A more negative ADF Statistic and a low p-value (typically below 0.05) provide evidence against the unit root and therefore in favour of stationarity. Here the p-value is about 0.12, so we cannot reject non-stationarity, which suggests that differencing may be necessary.
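Since the test does not reject the unit-root null at the usual 5% level, a common follow-up is to difference the series once and run the test again. A minimal sketch (first-order differencing shown; seasonal differencing is another option):

# Difference the series once and re-check stationarity
ts_diff = ts.diff().dropna()
adf_diff = adfuller(ts_diff)
print('ADF Statistic (differenced):', adf_diff[0])
print('p-value (differenced):', adf_diff[1])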

Rolling and Aggregations

Rolling: Rolling is a statistical method for time series analysis that computes summary statistics, such as moving averages, over successive subsets of a dataset. A fixed-size window traverses the data in the context of rolling statistics, and a new value is computed based on the observations within that window at each step. This facilitates the visualization and comprehension of underlying dynamics by reducing volatility and highlighting trends or patterns in the time series.

Aggregation: Aggregation is a common technique in time series analysis for identifying broad trends. It combines and summarizes several data points into a single value, for example by computing statistical measures such as the mean or the total within particular time intervals. Aggregation streamlines the interpretation of complicated time series data by condensing related information, which makes it easier to spot trends, patterns, or anomalies and supports more efficient forecasting and decision-making (a small aggregation example follows the rolling snippet below).




# Rolling and Aggregations
rolling_mean = ts.rolling(window=12).mean()
rolling_std = ts.rolling(window=12).std()
 
plt.plot(ts, label='Actual Data')
plt.plot(rolling_mean, label='Rolling Mean')
plt.plot(rolling_std, label='Rolling Std')
plt.legend()
plt.show()

Output:

Plot of the original time series (blue) together with its rolling mean and rolling standard deviation.

The code computes the rolling mean and rolling standard deviation of the time series 'ts' with a window size of 12. The rolling mean, rolling standard deviation, and actual data are superimposed on one plot to help visualize trends and variability. This smooths out noise, making the underlying patterns more visible and offering insight into the temporal behaviour of the data. The legend distinguishes the rolling mean, the rolling standard deviation, and the original data.
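The aggregation side is not shown in the snippet above. As a minimal sketch, assuming the same monthly 'ts' with a DatetimeIndex, the series can be rolled up to yearly summaries by grouping on the year of the index (resample would work equally well):

# Aggregation: collapse the monthly series into yearly summaries
yearly_total = ts.groupby(ts.index.year).sum()   # total precipitation per calendar year
yearly_mean = ts.groupby(ts.index.year).mean()   # average monthly precipitation per year
print(yearly_total.head())

yearly_total.plot.bar(title='Total precipitation per year')
plt.show()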

Model Development

We will train a SARIMA model for the univariate forecast, using the date column as the feature for the predictions. For that we first have to create a date column in the dataset in pandas datetime format, which we do with the pd.to_datetime function provided by pandas.




# Combine 'Year', 'Month' and 'Day' columns to create a 'Date' column
df['Date'] = pd.to_datetime(df['Year'].astype(str) +\
                            '-' + df['Month'].astype(str) +\
                            '-' + df['Day'].astype(str),
                            format='%Y-%m-%d')
df.head()

Output:

   Year  Month  Day  Specific Humidity  Relative Humidity  Temperature  Precipitation       Date
0  2000      1    1               8.06              48.25        23.93           0.00 2000-01-01
1  2000      2    1               8.73              50.81        25.83           0.11 2000-02-01
2  2000      3    1               8.48              42.88        26.68           0.01 2000-03-01
3  2000      4    1              13.79              55.69        22.49           0.02 2000-04-01
4  2000      5    1              17.40              70.88        19.07         271.14 2000-05-01

Now let's set the Date column as the index and separate the target column, Precipitation, from the complete dataset.




ts = df.set_index('Date')['Precipitation']
ts

Output:

Date
2000-01-01 0.00
2000-02-01 0.11
2000-03-01 0.01
2000-04-01 0.02
2000-05-01 271.14
...
2020-08-01 1203.09
2020-09-01 361.30
2020-10-01 180.18
2020-11-01 0.49
2020-12-01 12.23
Name: Precipitation, Length: 252, dtype: float64

Training the SARIMA Model

Now let’s train a SARIMA model on the dataset at hand.




# Fit a SARIMA model
# We need to specify the order (p, d, q) and seasonal order (P, D, Q, S)
# Example order and seasonal order values
p, d, q = 1, 1, 1
 
# For monthly data with a yearly seasonality
P, D, Q, S = 1, 1, 1, 12 
 
model = sm.tsa.SARIMAX(ts, order=(p, d, q), seasonal_order=(P, D, Q, S))
results = model.fit()

Now that the model has been trained, we can use it to predict the rainfall for the next year and plot the forecast along with the original data to get a feel for whether the predictions follow the previous trend.

This code fits a SARIMA model to the time series ts. The model's order is defined by the order=(p, d, q) argument, where p denotes the number of autoregressive terms, d the degree of differencing, and q the number of moving-average terms. The seasonal component is specified by the seasonal_order=(P, D, Q, S) argument, where P denotes the number of seasonal autoregressive terms, D the degree of seasonal differencing, Q the number of seasonal moving-average terms, and S the length of the seasonal period.

A SARIMA model object is created by the code line model = sm.tsa.SARIMAX(ts, order=(p, d, q), seasonal_order=(P, D, Q, S)). The time series that needs to be modeled is the ts argument. The model’s order and seasonal order are specified by the order=(p, d, q) and seasonal_order=(P, D, Q, S) arguments, respectively.
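The order values above are example settings rather than tuned ones. One quick way to judge a fitted configuration and compare it against alternatives is to inspect the coefficient table and information criteria of the results object; a small sketch:

# Inspect the fitted model: coefficient estimates plus information criteria,
# which are commonly used to compare alternative (p, d, q)(P, D, Q, S) choices
print(results.summary())
print('AIC:', results.aic)
print('BIC:', results.bic)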

Predictions




# Make predictions
forecast = results.get_forecast(steps=12)  # Forecast the next 12 periods
forecast_mean = forecast.predicted_mean
 
# Plot the actual data and the forecast
plt.figure(figsize=(12, 6))
plt.plot(ts, label='Actual Data')
plt.plot(forecast_mean, label='SARIMA Forecast')
plt.legend()
plt.show()

Output:

This code makes predictions with the fitted SARIMA model and plots them alongside the actual data. First, the get_forecast() method creates a forecast object containing the predicted values for the upcoming 12 periods. The forecast object's predicted_mean attribute is then used to extract the predicted mean values. Next, a figure is created and the plot() function is used to plot both the forecast and the actual data, with a label on each line and a legend.
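If you also want to convey forecast uncertainty, the object returned by get_forecast() exposes confidence intervals through its conf_int() method; a minimal sketch extending the plot above:

# Shade the default 95% confidence interval around the forecast
conf_int = forecast.conf_int()

plt.figure(figsize=(12, 6))
plt.plot(ts, label='Actual Data')
plt.plot(forecast_mean, label='SARIMA Forecast')
plt.fill_between(conf_int.index,
                 conf_int.iloc[:, 0],   # lower bound
                 conf_int.iloc[:, 1],   # upper bound
                 alpha=0.3, label='95% Confidence Interval')
plt.legend()
plt.show()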

