Model Selection for ARIMA

Time series data analysis plays a pivotal role in various fields such as finance, economics, weather forecasting, and more. The Autoregressive Integrated Moving Average (ARIMA) model stands as one of the fundamental tools for forecasting future values based on historical patterns within time series data. However, selecting the appropriate parameters for an ARIMA model is crucial to ensure accurate predictions.

What is ARIMA?

ARIMA, standing for Autoregressive Integrated Moving Average, is a widely used statistical method for time series forecasting. It combines three key components to model data:

Autoregression (AR): This component relates the present value to its past values through a regression equation.
Differencing (I for Integrated): The series is differenced to remove trends and make it stationary, so that statistical properties such as the mean stay roughly constant over time.
Moving Average (MA): This component models the present value as a linear combination of past forecast errors (residuals).

Components of ARIMA

1. Autoregression (AR):

The autoregressive part (AR) of an ARIMA model is represented by the parameter p. It signifies the dependence of the current observation on its previous p values. Mathematically, an AR(p) model can be represented as:

Y_t = c + ϕ₁Y_{t-1} + ϕ₂Y_{t-2} + … + ϕ_pY_{t-p} + ϵ_t

Here, Y_t is the current observation, c is a constant, ϕ₁ to ϕ_p are the autoregressive parameters, and ϵ_t represents the error term at time t.
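As a minimal sketch of the AR(p) recursion, the following base-R snippet simulates an AR(2) series term by term and checks the result against stats::filter(). The constant, the coefficients 0.5 and 0.3, the series length, and the zero start-up values are illustrative assumptions, not values taken from any dataset.

```r
# Simulate an AR(2) process directly from its defining equation:
#   Y_t = c + phi_1 * Y_{t-1} + phi_2 * Y_{t-2} + eps_t
set.seed(42)
n     <- 200
const <- 1.0            # the constant c (illustrative value)
phi   <- c(0.5, 0.3)    # AR coefficients (illustrative, stationary)
eps   <- rnorm(n)       # white-noise errors

y <- numeric(n)
for (t in seq_len(n)) {
  # Values before the start of the series are taken to be zero
  lag1 <- if (t > 1) y[t - 1] else 0
  lag2 <- if (t > 2) y[t - 2] else 0
  y[t] <- const + phi[1] * lag1 + phi[2] * lag2 + eps[t]
}

# The same recursion computed by stats::filter() in recursive mode
y_filter <- as.numeric(stats::filter(const + eps, filter = phi,
                                     method = "recursive"))
all.equal(y, y_filter)
```

The explicit loop mirrors the equation one term at a time; stats::filter() is the vectorised equivalent and is usually preferable for long series.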

2. Differencing (I):

The differencing part of ARIMA is represented by the parameter d. It involves transforming a non-stationary time series into a stationary one by differencing consecutive observations. The differencing operation can be applied multiple times until stationarity is achieved. The formula for differencing is straightforward:

Y'_t = Y_t − Y_{t-1}

Here:

Y'_t is the differenced series at time t.
Y_t is the original series at time t.
Y_{t-1} is the value of the series at the previous time step.

If one round of differencing does not give a stationary series, the differenced series is differenced again; in practice d is rarely larger than 2. The notation I(d) indicates the order of differencing required for stationarity.
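The differencing step can be sketched in base R with diff(). The example below uses the built-in AirPassengers series purely as convenient sample data and checks that diff() matches the formula written out by hand, and that second-order differencing is simply differencing applied twice.

```r
data("AirPassengers")
y <- AirPassengers

# First difference: Y'_t = Y_t - Y_{t-1}
d1 <- diff(y)

# The same quantity computed by hand from the formula
d1_manual <- y[-1] - y[-length(y)]
all.equal(as.numeric(d1), d1_manual)

# Second-order differencing (d = 2) is the first difference of the first difference
d2 <- diff(y, differences = 2)
all.equal(as.numeric(d2), as.numeric(diff(diff(y))))

# Each round of differencing drops one observation
c(original = length(y), d1 = length(d1), d2 = length(d2))
```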

3. Moving Average (MA):

The moving average part (MA) of an ARIMA model is represented by the parameter q. It indicates the dependence of the current observation on the previous q forecast errors. Mathematically, an MA(q) model can be represented as:

Y_t = c + ϵ_t + θ₁ϵ_{t-1} + θ₂ϵ_{t-2} + … + θ_qϵ_{t-q}

Here, Y_t is the current observation, c is a constant, ϵ_t is the error at time t, and θ₁ to θ_q are the moving average parameters.
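A small sketch of the MA(q) equation, under the same kind of assumptions as any toy simulation: the constant and the two θ values below are made up for illustration. The loop builds an MA(2) series from its errors and compares it with a one-sided convolution filter.

```r
# Build an MA(2) process from its defining equation:
#   Y_t = c + eps_t + theta_1 * eps_{t-1} + theta_2 * eps_{t-2}
set.seed(1)
n     <- 200
const <- 2.0             # the constant c (illustrative value)
theta <- c(0.6, -0.4)    # MA coefficients (illustrative values)
q     <- length(theta)
eps   <- rnorm(n)        # white-noise errors

# The first q values are left as NA because they would need errors
# from before the start of the sample
y <- rep(NA_real_, n)
for (t in (q + 1):n) {
  y[t] <- const + eps[t] + theta[1] * eps[t - 1] + theta[2] * eps[t - 2]
}

# Equivalent one-sided convolution of the errors with (1, theta_1, theta_2)
y_filter <- const + as.numeric(stats::filter(eps, filter = c(1, theta),
                                             method = "convolution", sides = 1))
all.equal(y, y_filter)
```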

Final Formula of ARIMA:

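The general formula for a non-seasonal ARIMA model is represented as ARIMA(p,d,q):

Y'_t = c + ϕ₁Y'_{t-1} + … + ϕ_pY'_{t-p} + θ₁ϵ_{t-1} + … + θ_qϵ_{t-q} + ϵ_t

Here:

Y'_t is the time series at time t after differencing d times, assumed to be stationary.
c is a constant (related to, but in general not equal to, the mean of the differenced series).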
ϕ₁,ϕ₂,…,ϕ_p are autoregressive parameters representing the dependence on past values.
ϵ_t is the white noise error term at time t.
θ₁,θ₂,…,θ_q are moving average parameters representing the dependence on past forecast errors.

The terms p,d,q in ARIMA(p,d,q) indicate:

p: The order of autoregression.
d: The order of differencing.
q: The order of the moving average.

The ARIMA model aims to capture the temporal dependencies and patterns in the time series data, making it suitable for forecasting future values.
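To make the ARIMA(p,d,q) notation concrete, here is a hedged sketch using only base R: it simulates an ARIMA(1,1,1) series with arima.sim() and fits the same order with arima(). The true coefficients (0.5 and 0.3), the sample size, and the seed are arbitrary illustrative choices.

```r
# Simulate an ARIMA(1,1,1) series with known (illustrative) coefficients
set.seed(123)
y <- arima.sim(model = list(order = c(1, 1, 1), ar = 0.5, ma = 0.3), n = 500)

# Fit ARIMA(p = 1, d = 1, q = 1); arima() differences the series internally
fit <- arima(y, order = c(1, 1, 1), method = "ML")
coef(fit)   # estimates of phi_1 (ar1) and theta_1 (ma1)

# Fitting ARMA(1,1) to the manually differenced series should agree closely,
# which is what the "I" in ARIMA expresses
fit_diff <- arima(diff(y), order = c(1, 0, 1), include.mean = FALSE,
                  method = "ML")
coef(fit_diff)
```

With d = 1 no constant is estimated by arima(), so only the ar1 and ma1 coefficients are returned.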

Working Principles

Identifying Stationarity: ARIMA models require the time series data to be stationary. Stationarity implies that the statistical properties of the time series (like mean and variance) remain constant over time.
Order Identification: The order d is the number of differences needed to reach stationarity, while p and q are suggested by the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots of the differenced series. The ACF helps determine the MA order (q), while the PACF aids in determining the AR order (p).
Model Fitting: Once the orders are determined, the ARIMA model is fitted to the data. This involves minimizing the error (often using methods like maximum likelihood estimation) to obtain the most suitable coefficients for the autoregressive and moving average terms.
Forecasting: After fitting the model, it can be used to forecast future values recursively, with each forecast feeding into the next step.
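The four steps above can be sketched end to end in base R. This is only one possible walk-through: the log transform, the seasonal differencing, and the ARIMA(0,1,1)(0,1,1)[12] order are assumptions made for this illustration on the AirPassengers data, not a model chosen by the procedure described in this article.

```r
data("AirPassengers")
y <- log(AirPassengers)   # assumption: log scale to stabilise the growing variance

# 1. Stationarity: one regular and one seasonal (lag 12) difference
y_diff <- diff(diff(y, lag = 12))

# 2. Order identification: ACF and PACF of the differenced series
acf_vals  <- acf(y_diff, lag.max = 24, plot = FALSE)
pacf_vals <- pacf(y_diff, lag.max = 24, plot = FALSE)

# 3. Model fitting by maximum likelihood (orders assumed for this sketch)
fit <- arima(y, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

# 4. Forecasting: 12 months ahead, back-transformed to the original scale
fc <- predict(fit, n.ahead = 12)
forecast_passengers <- exp(fc$pred)
forecast_passengers
```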

Mathematical Aspects

Maximum Likelihood Estimation (MLE): ARIMA parameters are often estimated using MLE, a statistical method that finds the parameters maximizing the likelihood of the observed data.
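Backshift Operator (B): This operator, represented as B, shifts a series back by one time step, so that B Y_t = Y_{t-1} and B^k Y_t = Y_{t-k}. It gives a compact notation for lagged values; for example, differencing d times can be written as (1 − B)^d Y_t.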
Residual Analysis: After fitting the model, the residuals (errors) are analyzed to ensure they are independent, normally distributed, and have constant variance.
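A possible way to carry out the residual checks with base R is shown below. The fitted model is again an assumed ARIMA(0,1,1)(0,1,1)[12] on the log of AirPassengers, and the choice of 24 lags for the Ljung-Box test is a convention rather than a rule.

```r
data("AirPassengers")
fit <- arima(log(AirPassengers), order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
res <- residuals(fit)

# Independence: Ljung-Box test; fitdf removes one degree of freedom
# per estimated ARMA coefficient
lb <- Box.test(res, lag = 24, type = "Ljung-Box", fitdf = length(coef(fit)))

# Normality: Shapiro-Wilk test on the residuals
sw <- shapiro.test(as.numeric(res))

# Large p-values mean no evidence against the respective assumption
c(ljung_box_p = lb$p.value, shapiro_p = sw$p.value)
```

Plotting the residuals over time (for example with plot(res)) remains the simplest check for non-constant variance.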

Model Parameters in ARIMA

The ARIMA model is defined by three main parameters: p, d, and q.

p (AR order): The number of autoregressive terms, that is, the number of past observations that directly influence the current value.
d (Differencing order): The number of times the series is differenced (consecutive observations subtracted) to make it stationary.
q (MA order): The number of lagged forecast errors in the prediction equation.

Selecting the appropriate values for these parameters significantly impacts the model’s forecasting capability. However, determining the right values is often a challenging task.

Model Selection Methods for ARIMA

1. Visual Inspection:

Time Series Plots: Visualizing the data to identify trends, seasonality, and irregularities helps in understanding the data’s characteristics.
Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF): These plots help identify potential values for p and q by showing correlations between observations at different lags. An ACF that cuts off sharply after lag q while the PACF decays gradually suggests an MA(q) term; a PACF that cuts off after lag p while the ACF decays gradually suggests an AR(p) term.
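The cut-off behaviour used when reading these plots can be illustrated with theoretical values from stats::ARMAacf(). The coefficients below are arbitrary illustrative choices; real sample ACF and PACF plots are noisier, so the cut-offs are rarely this clean.

```r
# Theoretical ACF of an MA(1) process with theta_1 = 0.6 (lags 0 to 5):
# non-zero at lag 1, then zero from lag 2 onwards
acf_ma1 <- ARMAacf(ma = 0.6, lag.max = 5)
round(acf_ma1, 3)

# Theoretical PACF of an AR(2) process with phi = (0.5, 0.3) (lags 1 to 5):
# non-zero at lags 1 and 2, then zero from lag 3 onwards
pacf_ar2 <- ARMAacf(ar = c(0.5, 0.3), lag.max = 5, pacf = TRUE)
round(pacf_ar2, 3)
```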

2. Parameter Grid Search:

Grid Search: This involves systematically evaluating different combinations of p, d, and q values to find the set that optimizes a chosen evaluation metric, such as AIC (Akaike Information Criterion) or BIC (Bayesian Information Criterion).
Iterative Search: Starting with a range of possible values for p, d, and q, this method iteratively tests combinations to determine the best fit.
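A hand-rolled grid search might look like the sketch below. It is deliberately simplified: d is fixed at 1 (information criteria are not comparable across different orders of differencing), seasonal terms are ignored, and the ranges 0 to 2 for p and q are arbitrary assumptions.

```r
data("AirPassengers")
y <- log(AirPassengers)
d <- 1   # fixed in advance for this sketch

# All (p, q) combinations to try
grid <- expand.grid(p = 0:2, q = 0:2)
grid$aic <- NA_real_

for (i in seq_len(nrow(grid))) {
  # Some orders may fail to converge; record NA for those and move on
  fit <- tryCatch(
    arima(y, order = c(grid$p[i], d, grid$q[i]), method = "ML"),
    error = function(e) NULL
  )
  if (!is.null(fit)) grid$aic[i] <- AIC(fit)
}

# Candidates ranked by AIC, best first
grid[order(grid$aic), ]
best <- grid[which.min(grid$aic), ]
best
```

Swapping AIC(fit) for BIC(fit) ranks the same candidates with a stronger penalty on the number of parameters.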

3. Automated Techniques:

Auto-ARIMA: Several software packages and libraries offer automated ARIMA model selection algorithms (e.g., auto_arima in Python's pmdarima or auto.arima in R's forecast package) that determine the parameters based on statistical measures such as unit-root tests and information criteria.

4. Cross-validation:

Time Series Cross-Validation: Dividing the data into training and validation sets and testing the model’s performance across various splits helps in assessing its robustness and accuracy. Methods like rolling origin forecasting or walk-forward validation can be employed to evaluate forecasting accuracy.
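Rolling-origin evaluation can be written as a plain loop in base R. In this sketch the first training window of 120 months and the ARIMA(0,1,1)(0,1,1)[12] order are assumptions chosen for illustration; the model is re-estimated at every origin and scored on the next observation only.

```r
data("AirPassengers")
y <- log(AirPassengers)
n <- length(y)
first_origin <- 120   # assumed size of the first training window

errors <- rep(NA_real_, n - first_origin)
for (i in seq_along(errors)) {
  origin <- first_origin + i - 1

  # Training data: everything up to and including the forecast origin
  train <- ts(y[1:origin], start = start(y), frequency = frequency(y))
  fit <- arima(train, order = c(0, 1, 1),
               seasonal = list(order = c(0, 1, 1), period = 12))

  # One-step-ahead forecast compared with the held-out observation
  pred <- predict(fit, n.ahead = 1)$pred
  errors[i] <- y[origin + 1] - pred[1]
}

# Out-of-sample RMSE on the log scale
rmse <- sqrt(mean(errors^2))
rmse
```

Running the same loop for several candidate orders and comparing the resulting RMSE values gives a selection criterion based on forecasting accuracy rather than in-sample fit.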

5. Information Criteria:

AIC, BIC: These statistical measures help in comparing models by penalizing complexity, encouraging the selection of models that balance goodness of fit with simplicity. Lower AIC or BIC values indicate a better trade-off. The values are only comparable between models fitted to the same data with the same order of differencing.
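Both criteria can be reproduced by hand from the log-likelihood, which makes the complexity penalty explicit. The sketch below assumes an ARIMA(0,1,1)(0,1,1)[12] fit on the log of AirPassengers purely as an example model.

```r
data("AirPassengers")
fit <- arima(log(AirPassengers), order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

# Number of estimated parameters: the coefficients plus the innovation variance
k <- length(coef(fit)) + 1
# Number of observations actually used after differencing
n_used <- fit$nobs

aic_manual <- -2 * fit$loglik + 2 * k
bic_manual <- -2 * fit$loglik + log(n_used) * k

# Both should match the built-in extractors
all.equal(aic_manual, AIC(fit))
all.equal(bic_manual, BIC(fit))
```

Because log(n_used) exceeds 2 for any realistic sample size, BIC penalizes each extra parameter more heavily than AIC and tends to pick smaller models.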

6. Model Comparison:

Comparing Candidate Models: Fit multiple candidate models with different parameter combinations and compare their performance using statistical measures, visual inspection, and diagnostic tests for residuals.

7. Stepwise Methods:

Stepwise Search: Implement stepwise methods, such as stepwise AIC or stepwise BIC, to iteratively add or remove parameters from the model, improving the fit.

ARIMA Model Selection in R

Loading Libraries:

library(forecast) and library(tseries) load the necessary R packages (forecast and tseries) for time series analysis and forecasting.

Loading Dataset:

data("AirPassengers") loads the built-in AirPassengers dataset in R, which contains the monthly airline passenger data.
str(AirPassengers) displays the structure of the dataset, showing information about the type of data and its structure.

# Load necessary libraries
library(forecast)
library(tseries)

# Load the AirPassengers dataset (built-in in R) and inspect its structure
data("AirPassengers")
str(AirPassengers)

Convert to Time Series:

passengers_ts <- ts(AirPassengers, start = c(1949, 1), frequency = 12) stores the data as a monthly time series object. AirPassengers is already a ts object, so this step is optional; the start date is passed explicitly so that the calendar time axis (January 1949 onward) is preserved.

Visual Inspection:

plot(passengers_ts) plots the time series, and acf() and pacf() plot the Autocorrelation Function and Partial Autocorrelation Function, to visually inspect patterns in the data and identify potential ARIMA parameters.

# Convert the dataset to a time series object
passengers_ts <- ts(AirPassengers, start = c(1949, 1), frequency = 12)

# Visual inspection
# Time series plot
plot(passengers_ts, main = "International Airline Passengers")

Output: time series plot titled "International Airline Passengers" (figure not reproduced here).

ACF and PACF Plots:

# Plot the sample ACF and PACF to help choose candidate lags
acf(passengers_ts, main = "ACF Plot")
pacf(passengers_ts, main = "PACF Plot")

Output: ACF plot and PACF plot of passengers_ts (figures not reproduced here).

Parameter Grid Search:

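auto_model <- auto.arima(passengers_ts, …) uses auto.arima with stepwise = FALSE and approximation = FALSE to search over all combinations of p and q within the function's default limits, restricted to non-seasonal models (seasonal = FALSE). The order d is chosen by a unit-root test, and the model with the lowest AIC is selected.

Automated Model Selection:

auto_arima_model <- auto.arima(passengers_ts) uses auto.arima with its defaults (a stepwise search, seasonal terms allowed, candidates ranked by AICc) to select a model automatically, without specifying additional parameters.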

# Parameter grid search
# Grid search using auto.arima with a range of possible values for p, d, and q
auto_model <- auto.arima(passengers_ts, seasonal = FALSE, stepwise = FALSE,
                         approximation = FALSE,
                         ic = "aic")

# Automated technique - Auto-ARIMA
# Using the auto.arima function for automated model selection
auto_arima_model <- auto.arima(passengers_ts)

Cross-validation:

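Performs time series cross-validation with tsCV, which repeatedly forecasts one step ahead from a rolling origin and returns the forecast errors. The forecast function passed to tsCV must accept the training subset x and the horizon h and return a forecast object; here the model selected above is re-applied to each subset with Arima(x, model = ...), without re-estimating its coefficients. Subsets too short for the model produce NA errors.

Choosing the Best Model:

best_model <- auto.arima(passengers_ts, ic = "aic") selects the best model based on the AIC information criterion.

# Cross-validation
# Time series cross-validation (rolling origin, one-step-ahead errors)
fc_fun <- function(x, h) forecast(Arima(x, model = auto_arima_model), h = h)
cv <- tsCV(passengers_ts, fc_fun, h = 1)
sqrt(mean(cv^2, na.rm = TRUE))  # cross-validated RMSE

# Choosing the best model based on information criteria
best_model <- auto.arima(passengers_ts, ic = "aic")  # or "bic" for BIC
best_model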

Output:

Series: passengers_ts 
ARIMA(0,1,1)(2,1,0)[12] 

Coefficients:
          ma1     sar1    sar2
      -0.3634  -0.1239  0.1911
s.e.   0.0899   0.0934  0.1036

sigma^2 = 133.5:  log likelihood = -505.59
AIC=1019.18   AICc=1019.5   BIC=1030.68

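This output describes the model selected by AIC for the AirPassengers dataset: ARIMA(0,1,1)(2,1,0)[12], that is, one regular and one seasonal difference (period 12), a non-seasonal MA(1) term (ma1), and two seasonal AR terms (sar1, sar2). No drift term is included.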

Loading Libraries and Dataset

# Load necessary libraries
library(forecast)
library(tseries)

# Load the Johnson & Johnson quarterly earnings dataset and inspect its structure
data("JohnsonJohnson")
str(JohnsonJohnson)

library(forecast) and library(tseries) load the required R packages for time series analysis and forecasting.
data("JohnsonJohnson") loads the JohnsonJohnson dataset in R, containing quarterly earnings data for Johnson & Johnson.
str(JohnsonJohnson) displays the structure of the dataset, showing information about the type of data and its structure.

Convert Data to Time Series

# Convert the dataset to a time series object
jj_ts <- ts(JohnsonJohnson, start = c(1960, 1), frequency = 4)

# Plot the time series data
plot(jj_ts, main = "Johnson & Johnson Quarterly Earnings per Share")

Output: time series plot titled "Johnson & Johnson Quarterly Earnings per Share" (figure not reproduced here).

ts(JohnsonJohnson, start = c(1960, 1), frequency = 4) converts the dataset into a time series object, specifying a start year and frequency of 4 for quarterly data.
plot(jj_ts, main = "Johnson & Johnson Quarterly Earnings per Share") plots the time series data for visualization.

Visual Inspection and Parameter Identification:

# ACF and PACF plots for identifying parameters
acf(jj_ts, main = "ACF Plot")
pacf(jj_ts, main = "PACF Plot")

Output: ACF plot and PACF plot of jj_ts (figures not reproduced here).

acf(jj_ts, main = "ACF Plot") and pacf(jj_ts, main = "PACF Plot") generate Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots to identify potential values for the ARIMA model's parameters.
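The Johnson & Johnson example stops at the ACF and PACF plots. One possible next step, which is not part of the walkthrough above, is to let auto.arima choose a model and then forecast from it, mirroring the AirPassengers example. The choice of AIC as the criterion and the eight-quarter horizon below are assumptions for this sketch, and no particular model order is claimed as the result.

```r
library(forecast)

data("JohnsonJohnson")

# Automated selection, ranked by AIC as in the AirPassengers example
jj_model <- auto.arima(JohnsonJohnson, ic = "aic")
jj_model

# Forecast the next 8 quarters (two years) from the selected model
jj_fc <- forecast(jj_model, h = 8)
jj_fc$mean
```

Because the earnings series grows in both level and variability, fitting the model to log(JohnsonJohnson) instead is a variation worth comparing.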

Conclusion

The selection of ARIMA model parameters is a critical aspect of time series forecasting. Employing a combination of visual inspection, systematic search methods, automated techniques, and cross-validation aids in identifying the most appropriate values for p, d, and q. Nonetheless, understanding the data’s characteristics and the trade-offs between model complexity and accuracy remains essential for effective model selection in ARIMA-based forecasting.
