Open In App

Gaussian Process Regression (GPR) on Mauna Loa CO2 data

In article explores the application of Gaussian Process Regression (GPR) on the Mauna Loa CO2 dataset using Scikit-Learn.

What is Gaussian Process Regression (GPR)?

Gaussian Process Regression is the process of making predictions about an unknown function and dealing with regression problems with the help of statistical methods. GPR allows for flexibility it can model complex and non-linear relationships without explicitly defining a formula for the underlying function. Here in GPR, the predictions are treated as probability distributions over possible functions. Before getting the data GPR makes a wild guess about what the unknown function might look like, as the data component is accumulated the beliefs about the unknown function change and the model adapts to the data and the predictions start becoming more certain. After incorporation of data with the GPR model, it gives a posterior distribution which is a close estimate of the underlying function. When we make predictions for new, unseen data points, GPR provides a range of possible values with associated confidence intervals.



Kernel Function:

The kernel function is the most important component of a Gaussian Process Regression, it defines the measure of similarity between the data points and influences the shape and behaviour of the resulting Gaussian process. The choice of kernel influences the flexibility of the model, different kernel functions allow the GPR model to capture different relationships between data points. The kernel function also gives us rough knowledge about the smoothness, periodicity and other properties of the model before training the model. Therefore, the consideration of kernel function becomes an important part of the Gaussian Process in any case. Let’s discuss some of the most common kernel functions that we can use in modelling data in the Gaussian Process.

1. Radial Basis Function (RBF): The RBF kernel is often used when the function is expected to be smooth and continuous. It implies that the data points close to each other in the input space will have similar output values. With the appropriate choice of hyperparameter, RBF can capture the significance of any function, given the input data is enough. RBF kernel makes the function infinitely differentiable; this means the function is differentiable at any point in the input space it has been given. We can represent the RBF kernel as:



.

Where,

Here ‘l’ is the length scale parameter, which determines the width of the kernel and tells us about how quickly the similarity between the points decreases with distance. The choice of ‘l’ is often determined through hyperparameter optimization during the training process.

2. Matérn Kernel:

The Matérn kernel is more flexible than RBF kernel as it allows more flexibility in the smoothness of the function. In the Matérn kernel we use the shape parameter () which determines how quickly the correlation between points decay with distance. The higher the value of shape parameter () the higher the smoothness the function will have, as tends to infinity the Matérn kernel converges to the RBF kernel. We can represent the Matérn kernel as:

Here,

3. Linear Kernel: Linear kernel is useful in capturing the linear relationship between the input and output, it helps in capturing the linear trends in the data. It is represented as:

4. Periodic Kernel: Periodic kernel is useful in modelling periodic relationships in data, it helps in capturing the periodic trends in the dataset. Periodic kernel is represented as:

5. Constant Kernel: Constant kernel states that there is a constant relationship between the data points which does not vary. Constant kernel can be represented as:

6. Rational Quadratic Kernel: This kernel could be seen as a scale mixture of RBF kernel with different length-scales. This kernel is especially useful when the data exhibits behavior at multiple length-scales. The Rational Quadratic kernel function for two points x and x’ could be defined as:

Where, is the scale mixture parameter, which controls the weight of large-scale versus small-scale variations in the data.

7. White Kernel: This kernel plays a special role in modeling the noise aspect of the data. Unlike other kernels that define the shape or structure of the underlying function, the White Kernel specifically addresses the noise properties of the observations. White kernel could also be represented as:

Where, is the noise level or variance.

is 1 if x=x’ otherwise it is 0.

Gaussian Process Regression on Mauna Loa CO2 data

Now let’s get to the point of applying GPR on the Mauna Loa CO2 data. Moana Loa is a volcano present in the US state of Hawaii. Mauna Loa is considered as Earth’s largest active volcano in terms of volume and the area covered. Mauna Loa, due to its remote location is known for the atmospheric research which measures the continuous monitoring of carbon dioxide (CO2) levels on its surface. The dataset observing the increase in the CO2 concerns the Global Warming aspect and the involvement on human activities in it.

The dataset that we are going to be considering will be the daily increase in the level of CO2 on the Mauna Loa surface. We will try to visualize the Gaussian Process Regressor’s prediction with the actual observed data of CO2 concentration. In the modelling process we will be considering the Month, Year and CO2 concentration as the columns, where Month, Year will be our main feature and CO2 concentration will be the label value which we will be predicting.

Here’s the step-by-step code of GPR on Mauna Loa CO2 data with explanation:

Importing Libraries:

First, we will be importing important libraries which will be useful in this method. Here we will be importing pandas for data handling and manipulation, numpy for mathematical operations on arrays, matplotlib for the visualization of prediction, datetime for date-time manipulation, and from the sklearn library’s gaussian_process we will be importing GaussianProcessRegressor which will be our model on which data will be fed and fetch_openml to get the Mauna Loa CO2 dataset.

# Importing Necessary Libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.gaussian_process import GaussianProcessRegressor
import datetime
import numpy as np

                    

Fetch and Display Dataset:

The CO2 dataset of Mauna Loa is fetched from openml using its data ID which is 41187. The parameter as_frame is set to True to make sure that the data is loaded as a pandas dataframe. The parser parameter is set to pandas so that the data parser used in other code is pandas. Finally, the first five rows of the dataset are printed using head().

data = fetch_openml(data_id=41187).frame
data.head()

                    

Output:

   year  month  day  weight  flag station    co2
0 1958 3 29 4 0 MLO 316.1
1 1958 4 5 6 0 MLO 317.3
2 1958 4 12 4 0 MLO 317.6
3 1958 4 19 6 0 MLO 317.5
4 1958 4 26 2 0 MLO 316.4

Convert to Datetime and Set as Index:

The year, month, and day columns are combined into a single “date” column using pandas to_datetime method. The ‘date’ column is set to index of the pandas dataframe and the columns date and co2 are filtered such that all the other columns are removed. At last, the first five rows of the new dataframe are printed.

data["date"] = pd.to_datetime(data[["year", "month", "day"]])
data = data[["date", "co2"]].set_index("date")
data.head()

                    

Output:

              co2
date
1958-03-29 316.1
1958-04-05 317.3
1958-04-12 317.6
1958-04-19 317.5
1958-04-26 316.4

Resampling and Plotting Monthly Trend:

The data is resampled to a monthly frequency, calculating the mean CO2 concentration for each month. dropna() removes any rows with missing values (NaNs). After resampling the monthly CO2 average data is plotted with respect to each month.

# Resampling by month
co2_weekly = data.resample("W").mean().dropna()
 
# Ploting monthly data
plt.plot(co2_weekly['co2'])
plt.ylabel("Weekly average of CO$_2$ concentration (ppm)")
plt.title("Weekly average of air samples measurements")
plt.show()

                    

Output:

Weekly average of air samples measurements

Preparing Data for Gaussian Process Regression:

After proper visualization of the dataset, data is be prepared so that it could be fed to the gaussian process regression model, variables input – ‘X’ and output – ‘y’ are prepared in this process. ‘X’ is created by combining the year and the fractional part of the month, effectively creating a continuous time variable. Whereas ‘y’ is the CO2 concentration levels.

X = (data.index.year + data.index.month / 12).to_numpy().reshape(-1, 1)
y = data["co2"].to_numpy()

                    

Split the train and Test dataset

X_train = X[:int(X.shape[0]*0.7)]
y_train = y[:int(X.shape[0]*0.7)]
 
X_test = X[int(X.shape[0]*0.7):]
y_test = y[int(X.shape[0]*0.7):]
 
print(f"Training data shape :{X_train.shape},{y_train.shape}")
print(f"Testing data shape :{X_test.shape},{y_test.shape}")

                    

Output:

Training data shape :(1557, 1),(1557,)
Testing data shape :(668, 1),(668,)

Defining Gaussian Process Kernels:

Different kernels are defined which captures different aspects of the data used in this example. The kernels include RBF kernel, periodic kernel, white kernel and RationalQuadratic kernel. Each kernel carries different aspects of the data some are mentioned here: the long_term_trend_kernel captures the long-term trends in the data, the seasonal_kernel models seasonal variation in data, the irregularities_kernel marks the irregularities in the dataset and the noise_kernel models the noise component of the data.

The final combined_kernel is the combination of all the kernels mentioned.

from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, RationalQuadratic, WhiteKernel
 
long_term_trend_kernel = 40.0**2 * RBF(length_scale=40.0)
 
seasonal_kernel = (2.0**2 * RBF(length_scale=100.0) * ExpSineSquared(
    length_scale=1.0, periodicity=1.0, periodicity_bounds="fixed"))
 
irregularities_kernel = 0.5**2 * RationalQuadratic(length_scale=1.0, alpha=1.5)
 
noise_kernel = 0.1**2 * RBF(length_scale=0.1) + \
    WhiteKernel(noise_level=0.1**2, noise_level_bounds=(1e-5, 1e5))
 
combined_kernel = long_term_trend_kernel + \
    seasonal_kernel + irregularities_kernel + noise_kernel

                    

Training the Gaussian Process Regressor:

Here, first the mean of the CO2 level value is computed and then GaussianProcessRegressor is instantiated with the help of combined_kernel defined in the previous step and normalize_y parameter is set to False since y value will be centered by the substraction of ‘y.mean()’ value manually and the gaussian_process model will be trained on this data.

gaussian_process = GaussianProcessRegressor(
    kernel=combined_kernel, normalize_y=False)
gaussian_process.fit(X_train, y_train - y_train.mean())

                    

Making Predictions:

mean_y_pred, std_y_pred = gaussian_process.predict(X_test, return_std=True)
mean_y_pred += y_train.mean()
 
plt.plot(X_train, y_train, color="gray", label="Training data")
plt.plot(X_test, mean_y_pred, color="orange", alpha=0.4,
         label="Gaussian process Prediction")
plt.plot(X_test, y_test, color="green", alpha=0.4,
         label="Actual values")
 
plt.legend()
plt.xlabel("Year")
plt.ylabel("Monthly average of CO$_2$ concentration in ppm")
 
plt.title(
    "Monthly average of air samples measurements from the Mauna Loa Observatory"
)
plt.show()

                    

Output:

Monthly average of air samples measurements

Forecasting

First, we will be determining the current month and year to create a future timeline for prediction. ‘forecast_x’ is an array of time points (from 2001 to the current month) for which we want to predict CO2 levels. The ‘predict’ method of the Gaussian Process Regressor is used to get predictions (‘mean_y_pred’) and their standard deviations (‘std_y_pred’). return_std=True enables the return of standard deviation estimates for the predictions. The mean predictions are adjusted by adding back the ‘y_mean’ to reverse the earlier centering.

today = datetime.datetime.now()
current_month = today.year + today.month / 12
forecast_x = np.linspace(start=2001, stop=current_month, num=365).reshape(-1, 1)
 
mean_y_forecast, std_y_forecast = gaussian_process.predict(forecast_x, return_std=True)
mean_y_forecast += y_train.mean()

                    

Plotting the Results:

The predictions are shown as a line plot with a shaded area representing the uncertainty (± standard deviation). The ‘fill_between’ method creates a shaded area around the prediction line, representing the uncertainty in the predictions. The plot is then labeled with a title, legend, and axis labels to complete the visualization.

plt.plot(X_train[1000:], y_train[1000:], color="gray", label="Training data")
plt.plot(X_test, mean_y_pred, color="orange", alpha=0.4,
         label="Gaussian process Prediction")
plt.plot(X_test, y_test, color="green", alpha=0.4,
         label="Actual values")
 
plt.plot(forecast_x, mean_y_forecast, color="red", alpha=0.3,
         label="Forecasting values")
 
plt.fill_between(
    forecast_x.ravel(),
    mean_y_forecast - std_y_forecast,
    mean_y_forecast + std_y_forecast,
    color="red",
    alpha=0.1,
)
plt.legend()
plt.xlabel("Year")
plt.ylabel("Monthly average of CO$_2$ concentration in ppm")
 
plt.title("Forcasting Mauna Loa Observatory")
plt.show()

                    

Output:

Forcasting Mauna Loa Observatory



Hence, through the output visualization, we can see that the prediction matches the original observation in a pretty good way, and the Gaussian Process Regression is effective in monitoring and predicting the CO2 rise level on Mauna Loa volcano.


Article Tags :