Open In App

SARIMA (Seasonal Autoregressive Integrated Moving Average)

Time series data is all around us, from stock prices and weather patterns to demand forecasting and seasonal trends in sales. To make sense of this data and predict future values, we turn to powerful models like the Seasonal Autoregressive Integrated Moving Average, or SARIMA. In this article, we will unravel the mysteries of SARIMA models, to forecast the monthly sales.

Understanding SARIMA

SARIMA, which stands for Seasonal Autoregressive Integrated Moving Average, is a versatile and widely used time series forecasting model. It’s an extension of the non-seasonal ARIMA model, designed to handle data with seasonal patterns. SARIMA captures both short-term and long-term dependencies within the data, making it a robust tool for forecasting. It combines the concepts of autoregressive (AR), integrated (I), and moving average (MA) models with seasonal components.



The Components of SARIMA

To grasp SARIMA, let’s break down its components:

  1. Seasonal Component: The “S” in SARIMA represents seasonality, which refers to repeating patterns in the data. This could be daily, monthly, yearly, or any other regular interval. Identifying and modelling the seasonal component is a key strength of SARIMA.
  2. Autoregressive (AR) Component: The “AR” in SARIMA signifies the autoregressive component, which models the relationship between the current data point and its past values. It captures the data’s autocorrelation, meaning how correlated the data is with itself over time.
  3. Integrated (I) Component: The “I” in SARIMA indicates differencing, which transforms non-stationary data into stationary data. Stationarity is crucial for time series modelling. The integrated component measures how many differences are required to achieve stationarity.
  4. Moving Average (MA) Component: The “MA” in SARIMA represents the moving average component, which models the dependency between the current data point and past prediction errors. It helps capture short-term noise in the data.

Seasonal Differencing

Before we jump into SARIMA, it’s essential to understand seasonal differencing. Seasonal differencing is the process of subtracting the time series data by a lag that equals the seasonality. This helps remove the seasonal component and makes the data stationary, allowing for more straightforward modeling. Seasonal differencing is often denoted as “D” in SARIMA.



The SARIMA Notation

SARIMA(p, d, q)(P, D, Q, s):
AR(p): Autoregressive component of order p
MA(q): Moving average component of order q
I(d): Integrated component of order d
Seasonal AR(P): Seasonal autoregressive component of order P
MA(Q): Seasonal moving average component of order Q
Seasonal I(D): Seasonal integrated component of order D
s: Seasonal period

The mathematical representation of SARIMA is as follows:

Where,

Lеt’s go through each component of the equation:

Thе SARIMA model is estimated by fitting the model to the historical data and then using it to forecast future values.

Use Cases for SARIMA

SARIMA models find applications in various domains, including:

Implementation of SARIMA for Time Series Forecasting:

Step 1: Import Libraries

First, import the necessary libraries for time series analysis.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from sklearn.metrics import mean_absolute_error, mean_squared_error

                    

Step 2: Load Dataset

let’s use this time retail dataset of a global superstore for 4 years, to predict the superstore’s monthly sales.

df=pd.read_csv('Dataset- Superstore (2015-2018).csv')
sales_data=df[['Order Date','Sales']]
sales_data=pd.DataFrame(sales_data)
sales_data['Order Date']=pd.to_datetime(sales_data['Order Date'])
print(sales_data.head())

                    

Output:

  Order Date     Sales
0 2016-11-08 261.9600
1 2016-11-08 731.9400
2 2016-06-12 14.6200
3 2015-10-11 957.5775
4 2015-10-11 22.3680

Step 3: Extracting monthly sales

Let’s extract monthly sales, as day by day sales analysis won’t be of any use.

df1 = sales_data.set_index('Order Date')
 
monthly_sales = df1.resample('M').mean()
monthly_sales.head()

                    

Output:

                 Sales
Order Date
2014-01-31 180.213861
2014-02-28 98.258522
2014-03-31 354.719803
2014-04-30 209.595148
2014-05-31 193.838418

Step 5: Plotting the Monhly Sales

plt.figure(figsize=(12, 6))
plt.plot(monthly_sales['Sales'], linewidth=3,c='cyan')
plt.title("Monthly sales")
plt.xlabel("Date")
plt.ylabel("Sales")
plt.show()

                    

Output:

From the visualization, we can analyze that the company has a seasonal pattern in sales. The company has a positive correlation between its sales and the month of the year and a negative correlation between its sales and the month of the quarter.

Step 6: Check Stationarity

Before applying SARIMA, check if your time series data is stationary because SARIMA assumes that the time series data is stationary. Stationarity refers to the statistical properties of a time series remaining constant over time, such as constant mean, constant variance, and constant autocovariance. You can use the Dickey-Fuller test for this.

def check_stationarity(timeseries):
    # Perform the Dickey-Fuller test
    result = adfuller(timeseries, autolag='AIC')
    p_value = result[1]
    print(f'ADF Statistic: {result[0]}')
    print(f'p-value: {p_value}')
    print('Stationary' if p_value < 0.05 else 'Non-Stationary')
 
 
check_stationarity(monthly_sales['Sales'])

                    

Output:

ADF Statistic: -3.2865668298704227
p-value: 0.015489720191097641
Stationary

Now, as the data is stationary, we can proceed further to build a model.

Step 7: Identify Model Parameters

Identify the SARIMA model parameters (p, d, q, P, D, Q, s) using ACF and PACF plots.

# Plot ACF and PACF
plot_acf(monthly_sales)
plot_pacf(monthly_sales)
plt.show()

                    

Output:

Here, both plots have a horizontal axis that shows the lags, and a vertical axis representing correlation coefficients ranging from -1 to 1 where -1 means perfect negative correlation, 0 means no correlation and 1 means perfect positive correlation.

The blue shaded area represents confidence interval for the correlation coefficients. If the correlation coefficient at a certain lag is outside the confidence interval, it means that the correlation coefficient is statistically significant and not due to chance.

Step 8: Fit the SARIMA Model

Now, fit the SARIMA model using the identified parameters.

# Define SARIMA parameters
p, d, q = 1, 1, 1
P, D, Q, s = 1, 1, 1, 12  # Assuming monthly seasonality
 
# Fit the SARIMA model
model = SARIMAX(monthly_sales, order=(p, d, q), seasonal_order=(P, D, Q, s))
results = model.fit()
model

                    

Output:

RUNNING THE L-BFGS-B CODE

* * *

Machine precision = 2.220D-16
N = 5 M = 10

At X0 0 variables are exactly at the bounds

At iterate 0 f= 4.15152D+00 |proj g|= 2.74080D-01

At iterate 5 f= 4.06571D+00 |proj g|= 1.57031D-02

At iterate 10 f= 4.05982D+00 |proj g|= 4.46554D-03

At iterate 15 f= 4.03337D+00 |proj g|= 1.65496D-02

At iterate 20 f= 4.02336D+00 |proj g|= 3.56570D-02

At iterate 25 f= 4.01887D+00 |proj g|= 6.31166D-04

* * *

Tit = total number of iterations
Tnf = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip = number of BFGS updates skipped
Nact = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F = final function value

* * *

N Tit Tnf Tnint Skip Nact Projg F
5 28 36 1 0 0 5.795D-06 4.019D+00
F = 4.0188723995389912

CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL

Step 9: Generate Forecasts

With the model fitted, generate forecasts for future time periods.

# Forecast future values
forecast_periods = 12  # Forecast the next 12 months
forecast = results.get_forecast(steps=forecast_periods)
forecast_mean = forecast.predicted_mean
forecast_ci = forecast.conf_int()
 
# Plot the forecast
plt.figure(figsize=(12, 6))
plt.plot(monthly_sales, label='Observed')
plt.plot(forecast_mean, label='Forecast', color='red')
plt.fill_between(forecast_ci.index, forecast_ci.iloc[:, 0], forecast_ci.iloc[:, 1], color='pink')
plt.title("Sales Forecast")
plt.xlabel("Date")
plt.ylabel("Sales")
plt.legend()
plt.show()

                    

Output:


The observed sales data and the forecasted values in red. The pink shaded area represents the confidence interval around the forecast. This visualization helps us understand the expected future sales trends based on our SARIMA model.

Step 10: Evaluate the Model

Let’s evaluate the forecasted sales values by comparing them to the observed sales data using two common metrics for this evaluation: Mean Absolute Error (MAE) and Mean Squared Error (MSE).

observed = monthly_sales[-forecast_periods:]
mae = mean_absolute_error(observed, forecast_mean)
mse = mean_squared_error(observed, forecast_mean)
print(f'MAE: {mae}')
print(f'MSE: {mse}')

                    

Output:

MAE: 30.815697783063925
MSE: 1302.515608316509

Conclusion

SARIMA models are a potent tool in the realm of time series forecasting. They can capture complex seasonality, long-term trends, and short-term fluctuations in your data. By understanding the components, notation, and best practices of SARIMA modeling, you can make accurate predictions in various domains. So, the next time you encounter time series data, remember the power of SARIMA to unravel its hidden patterns and forecast the future with confidence.



Article Tags :