
Box-Jenkins Methodology for ARIMA Models

Last Updated : 06 Jan, 2024

Time series data records observations at successive points in time. Analyzing such data helps us recognize patterns, make predictions and draw informative insights. The Box-Jenkins method is a systematic approach to building ARIMA models that forecast a time series over a specified period of time.

In this article, we take a close look at the Box-Jenkins method for ARIMA modelling and how it helps us analyze and forecast time series data.

Let us first review what an ARIMA model is, so that the rest of the process is easier to follow.

ARIMA Modelling

ARIMA (Autoregressive Integrated Moving Average) is a time series analysis and forecasting method. An ARIMA model combines autoregression, differencing and a moving average to model a time series. Let’s break it down and discuss the components one by one:

  • Autoregressive (AR) Component: The autoregressive component models the relationship between an observation and several lagged observations (previously observed points). In other words, the current value of the time series is related to the previous values of the series. The term “autoregressive” signifies that the model regresses the variable on its own past values. The order of the AR component, i.e. the number of lagged values used, is denoted by p, and the process can be expressed as:

X_t = c + \phi_1 X_{t-1} + \phi_2 X_{t-2} + \dots + \phi_p X_{t-p} + \epsilon_t

Where:

  • X_t is the value of the time series at time t.
  • c is a constant value.
  • \phi_1, \phi_2, \dots, \phi_p are the autoregressive coefficients.
  • \epsilon_t is the error at time t.

  • Integrated (I) Component: The integrated component makes the time series stationary by differencing. A stationary series is one whose statistical properties, such as the mean and variance, do not change over time. Differencing helps stabilize the mean and remove trends from the time series. The number of times the series is differenced is denoted by d. First-order differencing is \nabla Y_t = Y_t - Y_{t-1}; applying it again gives second-order differencing, \nabla^2 Y_t = \nabla Y_t - \nabla Y_{t-1}, and so on for higher orders.
  • Moving Average (MA) Component: This component represents the effect of past error terms on the current value of the time series. The order of the MA component, i.e. the number of lagged error terms used, is denoted by q. The moving average process can be represented as:

X_t = c + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_q \epsilon_{t-q}

Where:

  • X_t is the value of the time series at time t.
  • c is a constant.
  • \epsilon_t, \epsilon_{t-1}, \dots, \epsilon_{t-q} are the noise terms or error terms.
  • \theta_1, \theta_2, \dots, \theta_q are the moving average coefficients.
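To make the differencing step of the integrated component concrete, here is a minimal sketch in plain Python. The helper name `difference` and the two toy series are assumptions made for illustration only. It shows that one pass of differencing removes a linear trend, while a quadratic trend needs two passes.

```python
def difference(series, d=1):
    """Apply first-order differencing d times: y_t -> y_t - y_{t-1}."""
    out = list(series)
    for _ in range(d):
        # Each pass shortens the series by one observation
        out = [curr - prev for prev, curr in zip(out, out[1:])]
    return out


# Toy series (illustrative values, not real data)
linear_trend = [2 * t + 5 for t in range(6)]   # 5, 7, 9, 11, 13, 15
quadratic_trend = [t * t for t in range(6)]    # 0, 1, 4, 9, 16, 25

first_diff = difference(linear_trend, d=1)     # constant after one pass
second_diff = difference(quadratic_trend, d=2) # constant after two passes

print(first_diff)
print(second_diff)
```

Each differencing pass costs one observation, which is one reason to keep d as small as stationarity allows.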

ARIMA(p,d,q):

An ARIMA(p, d, q) model combines the AR, I and MA components described above. Writing X_t for the series after it has been differenced d times, the general form is:

X_t = c + \phi_1 X_{t-1} + \phi_2 X_{t-2} + \dots + \phi_p X_{t-p} + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_q \epsilon_{t-q}

The general ARIMA forecasting process involves selecting appropriate values for p, d, and q, estimating the model parameters, and using the model to make predictions. The Box-Jenkins methodology is often used for identifying and fitting ARIMA models to time series data.
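As a rough illustration of how the three components fit together, the sketch below simulates an ARMA(1, 1) series, integrates it once to obtain an ARIMA(1, 1, 1) series, and then checks that first differencing recovers the ARMA part. The function names, the coefficient values (0.5 and 0.3) and the seed are all assumptions chosen for the example.

```python
import random


def simulate_arma11(n, phi, theta, c=0.0, seed=0):
    """Simulate X_t = c + phi*X_{t-1} + eps_t + theta*eps_{t-1} with Gaussian noise."""
    rng = random.Random(seed)
    x_prev, eps_prev = 0.0, 0.0
    xs = []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        x = c + phi * x_prev + eps + theta * eps_prev
        xs.append(x)
        x_prev, eps_prev = x, eps
    return xs


def integrate(increments, start=0.0):
    """Cumulative sum: the inverse operation of first differencing."""
    level, ys = start, [start]
    for inc in increments:
        level += inc
        ys.append(level)
    return ys


def first_difference(series):
    return [b - a for a, b in zip(series, series[1:])]


arma = simulate_arma11(n=200, phi=0.5, theta=0.3)  # stationary ARMA(1, 1)
arima_111 = integrate(arma)                        # non-stationary ARIMA(1, 1, 1)
recovered = first_difference(arima_111)            # differencing undoes the integration

print(len(arma), len(arima_111), len(recovered))
```

This is why d = 1 is the natural choice for a series like `arima_111`: after one difference, what remains is a stationary ARMA process.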

Let’s discuss the Box-Jenkins method in detail now.

Box-Jenkins Method

The Box-Jenkins method is a methodology for analyzing and forecasting time series data. It is classically described in three stages: identification, estimation and diagnostic checking. In practice these are followed by model refinement and, once the model is adequate, forecasting. The method is iterative: the steps from identification to model refinement are often repeated until a suitable and well-diagnosed model is obtained. It is important to note that the method assumes that the series, after any differencing, is generated by a stationary and linear process. The different stages of the Box-Jenkins method are:

Identification:

Identification is the first step of the Box-Jenkins method. It determines the orders of the autoregressive (AR), differencing (I) and moving average (MA) components that are appropriate for a given time series, i.e. the values of p, d and q. Let’s see the key stages involved in this phase:

  • Stationarity Check: This check is performed before fitting the ARIMA model. It verifies that statistical properties of the time series, such as the mean and variance, do not change over time. If the data is not stationary, it is differenced until it becomes stationary. Stationarity can be assessed visually and through statistical tests, such as the Augmented Dickey-Fuller (ADF) test.
  • Autocorrelation and Partial Autocorrelation Analysis: Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots are the main tools for identifying the orders of the AR and MA components. The ACF plot shows the correlation between the current observation and its past observations at various lags. The PACF plot shows the same correlation after removing the effects of the intermediate observations. As a rule of thumb, a PACF that cuts off after lag p suggests an AR(p) component, while an ACF that cuts off after lag q suggests an MA(q) component.
  • Seasonality Check: If the time series data has seasonality, it is important to account for it in the model. Seasonality can be identified through visual inspection of the time series plot or by using seasonal decomposition techniques.
  • Differencing Order: Differencing is often required to make the time series stationary. The order of differencing (d) is determined based on the number of differences needed to achieve stationarity.
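The ACF and PACF used in the identification stage can be computed by hand, which helps when reading the plots. The following is a minimal sketch in plain Python, assuming a simulated AR(1) series with coefficient 0.7; the helper names `sample_acf` and `sample_pacf` are hypothetical, and the PACF is obtained with the Durbin-Levinson recursion on the sample autocorrelations (library implementations may use other estimators).

```python
import random


def sample_acf(x, max_lag):
    """Sample autocorrelations r_0..r_max_lag (r_0 is always 1)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(max_lag + 1)]


def sample_pacf(x, max_lag):
    """Partial autocorrelations via the Durbin-Levinson recursion."""
    r = sample_acf(x, max_lag)
    pacf = [1.0]
    phi_prev = []  # coefficients of the AR(k-1) fit from the previous step
    for k in range(1, max_lag + 1):
        num = r[k] - sum(phi_prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        phi_curr = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)]
        phi_curr.append(phi_kk)
        pacf.append(phi_kk)
        phi_prev = phi_curr
    return pacf


# Simulated AR(1) series with an assumed coefficient of 0.7
rng = random.Random(42)
series, prev = [], 0.0
for _ in range(3000):
    prev = 0.7 * prev + rng.gauss(0.0, 1.0)
    series.append(prev)

acf = sample_acf(series, 5)
pacf = sample_pacf(series, 5)
print([round(v, 2) for v in acf])
print([round(v, 2) for v in pacf])
```

For an AR(1) process the ACF decays gradually while the PACF is close to zero beyond lag 1, which is the pattern that points to p = 1. Values inside roughly ±1.96/√n are usually treated as not significantly different from zero.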

Estimation:

Estimation is the second stage in the Box-Jenkins methodology for ARIMA modeling. In this stage, the coefficients of the identified ARIMA model are estimated from historical time series data. The primary goal is to fit the chosen ARIMA model to the observed data. Let’s see the key stages involved in this phase:

  • Model Selection: After identifying candidate orders (p, d, q), the next step is to select the exact model. This involves choosing the autoregressive (AR) and moving average (MA) lags based on the patterns seen in the autocorrelation function (ACF) and partial autocorrelation function (PACF) during the identification phase. Because these plots rarely point to a single unambiguous choice, we can compare several candidate models using criteria such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) and choose the model with the lowest value, which balances goodness of fit against model complexity.
  • Parameter Estimation: Once the ARIMA model is specified, the next step is to estimate its parameters, typically by maximum likelihood or least squares. The estimation finds the values of the autoregressive coefficients (\phi_1, \phi_2, \dots, \phi_p), the moving average coefficients (\theta_1, \theta_2, \dots, \theta_q), and any other parameters in the model, such as the constant c and the error variance.
  • Model Fitting: With the parameter estimates in hand, the ARIMA model is fitted to the historical data. The model is used to generate predicted values, and the fit is assessed by comparing these predictions to the actual observed values.
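The estimation stage can be sketched on the simplest case, an AR(1) model fitted by least squares. This is a simplified stand-in, not the exact likelihood-based estimator that statistical libraries use; the function names, the true coefficient of 0.6 and the seed are assumptions. AIC and BIC are computed from the residual sum of squares under a Gaussian error assumption, with additive constants dropped.

```python
import math
import random


def fit_ar1_least_squares(x):
    """Regress x_t on x_{t-1} with an intercept; return (c_hat, phi_hat, rss, n_obs)."""
    y, z = x[1:], x[:-1]
    m = len(y)
    mean_y, mean_z = sum(y) / m, sum(z) / m
    sxy = sum((a - mean_z) * (b - mean_y) for a, b in zip(z, y))
    sxx = sum((a - mean_z) ** 2 for a in z)
    phi_hat = sxy / sxx
    c_hat = mean_y - phi_hat * mean_z
    rss = sum((b - c_hat - phi_hat * a) ** 2 for a, b in zip(z, y))
    return c_hat, phi_hat, rss, m


def gaussian_aic_bic(rss, n_obs, n_params):
    """AIC and BIC up to an additive constant, assuming Gaussian errors."""
    fit_term = n_obs * math.log(rss / n_obs)
    return fit_term + 2 * n_params, fit_term + n_params * math.log(n_obs)


# Simulated AR(1) data with an assumed true coefficient of 0.6
rng = random.Random(1)
data, prev = [], 0.0
for _ in range(2000):
    prev = 0.6 * prev + rng.gauss(0.0, 1.0)
    data.append(prev)

c_hat, phi_hat, rss_ar1, n_obs = fit_ar1_least_squares(data)

# Baseline: constant-mean model fitted to the same observations
targets = data[1:]
mean_target = sum(targets) / len(targets)
rss_mean = sum((v - mean_target) ** 2 for v in targets)

aic_ar1, bic_ar1 = gaussian_aic_bic(rss_ar1, n_obs, n_params=2)
aic_mean, bic_mean = gaussian_aic_bic(rss_mean, n_obs, n_params=1)
print(round(phi_hat, 3), aic_ar1 < aic_mean, bic_ar1 < bic_mean)
```

Both criteria reward a smaller residual sum of squares and penalize extra parameters; BIC penalizes them more heavily once the sample has more than about eight observations, so it tends to prefer smaller models.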

Diagnostic Checking:

Diagnostic checking is the third stage in the Box-Jenkins methodology for ARIMA modeling. It involves evaluating the adequacy of the fitted ARIMA model by examining the residuals, which are the differences between the observed and predicted values. The goal is to ensure that the residuals are random and do not contain any patterns or structure. Now, let’s discuss the key aspects of diagnostic checking in Box-Jenkins:

  • Residual Analysis: Residuals are the differences between the actual observations and the values predicted by the ARIMA model. Analyzing the residuals helps identify any remaining patterns or systematic errors in the model.
  • Ljung-Box Test: The Ljung-Box test helps us check whether the errors or residuals in our model have any patterns or correlations. The null hypothesis it assesses is that there are no significant correlations among the residuals. In simpler terms, it tests if the leftover errors after modeling are random and don’t follow a specific pattern.
  • Mean and Variance Check: We have to ensure that the residuals have a mean close to zero and a constant variance. A mean that is significantly different from zero suggests that the model is biased, and a variance that changes over time suggests that the model’s errors are not equally reliable across the series.
  • Iterative Refinement: Diagnostic checking is often an iterative process. If the initial diagnostic checks reveal issues, such as autocorrelation, non-constant variance, or outliers, the model may need to be refined.
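To show what the Ljung-Box test measures, here is a minimal sketch that computes the statistic Q = n(n+2) Σ r_k² / (n − k) by hand for two simulated series: white noise, and an AR(1) series with an assumed coefficient of 0.8 that stands in for the residuals of a poorly specified model. The function name `ljung_box_q` and the simulation settings are assumptions for illustration; the p-value computation is left out because it needs a chi-square distribution function.

```python
import random


def ljung_box_q(residuals, max_lag):
    """Ljung-Box statistic: Q = n(n+2) * sum_{k=1..h} r_k^2 / (n - k)."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [r - mean for r in residuals]
    c0 = sum(d * d for d in dev)
    total = 0.0
    for k in range(1, max_lag + 1):
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        total += r_k * r_k / (n - k)
    return n * (n + 2) * total


rng = random.Random(7)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(500)]

# Autocorrelated series built from the same shocks (assumed AR coefficient 0.8)
ar_series, prev = [], 0.0
for eps in white_noise:
    prev = 0.8 * prev + eps
    ar_series.append(prev)

q_white = ljung_box_q(white_noise, max_lag=10)
q_ar = ljung_box_q(ar_series, max_lag=10)
print(round(q_white, 2), round(q_ar, 2))
```

Under the null hypothesis of no autocorrelation, Q approximately follows a chi-square distribution with as many degrees of freedom as lags tested (reduced by the number of fitted ARMA parameters when the input is model residuals). With 10 degrees of freedom the 5% critical value is about 18.31, so a much larger Q signals structure that the model has not captured.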

Model Refinement:

The model refinement stage in the Box-Jenkins method involves a thorough evaluation of the estimated ARIMA model to ensure that it meets the required statistical assumptions and adequately captures the patterns in the time series data. If the diagnostics reveal problems, the model is refined by changing the orders p, d and q or by including factors that were not considered earlier, such as seasonality. The diagnostic checks are then performed again on the refined model.

Once a satisfactory model is identified and validated, it can be used to forecast future values of the time series. Now let’s discuss the application of the Box-Jenkins method.

Application of Box-Jenkins Methodology

Here we use Apple stock data downloaded with yfinance and apply the Box-Jenkins method to analyze it. The step-by-step code, with explanations, follows:

Importing Libraries:

The code imports the necessary libraries: yfinance for downloading stock price data, pandas for data manipulation, matplotlib.pyplot for plotting, statsmodels for time series analysis and ARIMA modeling, and warnings to suppress warnings during execution.

Python3

import yfinance as yf
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.stats.diagnostic import acorr_ljungbox
 
import warnings
warnings.filterwarnings('ignore')

                    

Function Definitions:

Next, we define two helper functions: one that checks stationarity using the Augmented Dickey-Fuller (ADF) test, and one that plots the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF).

Python3

# Function to check stationarity using Augmented Dickey-Fuller test
def check_stationarity(ts):
    result = adfuller(ts)
    print(f'ADF Statistic: {result[0]}')
    print(f'p-value: {result[1]}')
    print(f'Critical Values: {result[4]}')
 
# Function to plot ACF and PACF
def plot_acf_pacf(ts):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    plot_acf(ts, ax=ax1, lags=20)
    plot_pacf(ts, ax=ax2, lags=20)
    plt.show()

                    


Data Loading and Preprocessing:

Stock price data for Apple Inc. (AAPL) is downloaded using yfinance. The data is collected from the start of 2015 to the start of 2023. Log returns are calculated to stabilize variance and make the time series more suitable for modeling.

Python3

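import numpy as np  # pd.np is not available in current pandas versions

# Load stock data
stock_symbol = "AAPL"
start_date = "2015-01-01"
end_date = "2023-01-01"
stock_data = yf.download(stock_symbol, start=start_date, end=end_date)['Close']
# Newer yfinance versions may return a one-column DataFrame here; squeeze() yields a Series
stock_data = stock_data.squeeze()

# Log returns: log(1 + simple return), i.e. log(P_t / P_{t-1})
log_returns = np.log1p(stock_data.pct_change().dropna())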
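The log-return transformation can be checked on a handful of numbers. The sketch below uses four made-up closing prices (an assumption for illustration, not AAPL data) and shows that log(1 + simple return) is the same as log(P_t / P_{t-1}), and that log returns add up across periods.

```python
import math

# Hypothetical closing prices, for illustration only
prices = [100.0, 102.0, 99.0, 105.0]

simple_returns = [(p1 - p0) / p0 for p0, p1 in zip(prices, prices[1:])]

# Two equivalent ways of computing log returns
log_returns_from_pct = [math.log(1 + r) for r in simple_returns]
log_returns_from_ratio = [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]

# Log returns are additive: their sum is the log return over the whole period
total_log_return = sum(log_returns_from_ratio)

print(log_returns_from_pct)
print(total_log_return, math.log(prices[-1] / prices[0]))
```

This additivity is one reason log returns are often preferred over simple returns when modeling price series.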

                    


Stationarity Check and Differencing:

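The stationarity of the log returns is checked with the ADF test, then the series is differenced once and tested again for comparison. ACF and PACF plots are created for the differenced series to help determine the ARIMA orders. Note that if the first test already rejects a unit root (a p-value below 0.05), the series can be treated as stationary, differencing is not required, and d = 0 is appropriate.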

Python3

# Check stationarity
check_stationarity(log_returns)
 
# Differencing to make the series stationary
log_returns_diff = log_returns.diff().dropna()
 
# Check stationarity after differencing
check_stationarity(log_returns_diff)
 
# Plot ACF and PACF after differencing
plot_acf_pacf(log_returns_diff)

                    

Output:

ADF Statistic: -13.869148958528394
p-value: 6.51329302121344e-26
Critical Values: {'1%': -3.4336173133865064, '5%': -2.86298332472282, '10%': -2.5675383641200633}
ADF Statistic: -14.058039719328459
p-value: 3.091971442666415e-26
Critical Values: {'1%': -3.433648628001351, '5%': -2.8629971502062155, '10%': -2.5675457254979093}



ACF and PACF Plots



Model Order Selection with AIC and BIC

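The code iterates through different values of p, d, and q and keeps a candidate only when it lowers both the AIC and the BIC of the best model found so far. Because both criteria must improve together, the result depends on the order in which candidates are visited and is not guaranteed to have the lowest AIC or the lowest BIC overall. Information criteria are also only comparable between models fitted to the same data, so comparisons across different values of d should be treated with caution.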

Python3

# Find optimal values for p, d, q based on AIC and BIC
best_aic = float('inf')
best_bic = float('inf')
best_order = None
 
for p in range(3):  # Choose a range for p
    for d in range(2):  # Choose a range for d
        for q in range(3):  # Choose a range for q
            arima_model = ARIMA(log_returns, order=(p, d, q))
            arima_results = arima_model.fit()
             
            # Calculate AIC and BIC
            current_aic = arima_results.aic
            current_bic = arima_results.bic
             
            # Update best values
            if current_aic < best_aic and current_bic < best_bic:
                best_aic = current_aic
                best_bic = current_bic
                best_order = (p, d, q)
 
print(f'Best AIC: {best_aic}, Best BIC: {best_bic}, Best Order: {best_order}')

                    

Output:

Best AIC: -10277.232291010881, Best BIC: -10260.410146733962, Best Order: (0, 0, 1)
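The effect of requiring both criteria to improve at once can be seen on a small table of scores. The numbers below are invented for illustration and do not come from the AAPL fit; `select_joint` mimics the condition used in the loop above, and `select_by` is a hypothetical helper that minimizes a single criterion.

```python
# Hypothetical (AIC, BIC) scores keyed by (p, d, q); the values are invented
candidate_scores = {
    (0, 0, 0): (100.0, 110.0),
    (1, 0, 0): (95.0, 112.0),
    (0, 0, 1): (90.0, 111.0),
}


def select_joint(scores):
    """Accept a candidate only if AIC and BIC both improve on the current best."""
    best_aic = best_bic = float("inf")
    best_order = None
    for order, (aic, bic) in scores.items():
        if aic < best_aic and bic < best_bic:
            best_aic, best_bic, best_order = aic, bic, order
    return best_order


def select_by(scores, index):
    """Pick the order that minimizes one criterion (0 = AIC, 1 = BIC)."""
    return min(scores, key=lambda order: scores[order][index])


joint_choice = select_joint(candidate_scores)
aic_choice = select_by(candidate_scores, 0)
bic_choice = select_by(candidate_scores, 1)
print(joint_choice, aic_choice, bic_choice)
```

In this made-up table the joint rule stays with the first candidate even though another one has a clearly lower AIC. Choosing a single criterion up front, and reporting the other alongside it, gives a selection that does not depend on iteration order.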

ARIMA Model Fitting and Diagnostics:

The ARIMA model is fitted using the optimal orders obtained from the AIC and BIC selection process. Diagnostics are performed on the residuals, including checking for stationarity. The Ljung-Box test is conducted to assess the autocorrelation in residuals.

Python3

# Fit ARIMA model with the best order
arima_model = ARIMA(log_returns, order=best_order)
arima_results = arima_model.fit()
 
# Diagnostics
residuals = arima_results.resid
check_stationarity(residuals)
 
# Ljung-Box test for autocorrelation in residuals
# acorr_ljungbox returns a DataFrame with columns lb_stat and lb_pvalue, one row per lag
lb_results = acorr_ljungbox(residuals, lags=20, return_df=True)
print(lb_results)


Output:

ADF Statistic: -13.478138873971695
p-value: 3.2812344010002946e-25
Critical Values: {'1%': -3.4336189466940414, '5%': -2.8629840458358933, '10%': -2.5675387480760885}

The Ljung-Box call additionally prints a table with one row per lag, containing the test statistic (lb_stat) and its p-value (lb_pvalue). A p-value above 0.05 at a given lag means that the null hypothesis of no residual autocorrelation up to that lag is not rejected.

Plotting Results:

Finally, the observed log returns and the fitted values from the ARIMA model are plotted to visualize the model’s performance.

Python3

# Plotting the predicted vs. actual values
plt.figure(figsize=(12, 6))
plt.plot(log_returns, label='Observed')  # the model was fitted on log_returns, not the differenced series
plt.plot(arima_results.fittedvalues, color='red', label='Fitted', alpha=0.7)
plt.legend()
plt.title(f'ARIMA{best_order} Model for {stock_symbol} Stock Returns')
plt.show()

                    

Output:


Observed vs fitted model with best order

The code above walks through an example of applying the Box-Jenkins methodology to stock returns, including stationarity checks, differencing, order selection, model fitting, diagnostics, and visualization of the results. Adjustments to the model orders and parameters may be necessary based on the diagnostic results.


