Quick Start to Gaussian Process Regression

Gaussian Processes, often abbreviated as GPs, are powerful and flexible machine-learning techniques primarily used for regression and probabilistic modelling. They excel at modelling intricate relationships between input variables and their corresponding output values. GPs offer methods to estimate both the mean and uncertainty (variance) of predictions, making them particularly valuable for uncertainty quantification.

In this article, we’ll walk through how Gaussian Process Regression works in scikit-learn, comparing the noise-free and noisy cases.



Gaussian Process Regression in scikit-learn

Gaussian Process Regression in scikit-learn, provided by the `GaussianProcessRegressor` class, excels at modelling complex relationships between input variables and outputs. Using kernels such as the Radial Basis Function (RBF), it estimates both the predictive mean and its uncertainty, which is crucial for uncertainty quantification. It also supports sampling from the prior, automatic hyperparameter selection, and handling of observation noise, making GPR a powerful and flexible tool for regression tasks. A typical workflow is: prepare the input features and output values, select a kernel, initialize the model, train it on the prepared data, make predictions with mean and uncertainty, and visualize the results for comprehensive insights.
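As a quick orientation before the step-by-step walkthrough, here is a minimal end-to-end sketch of that workflow. The variable names and toy data here are illustrative only, not part of the example that follows:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy 1-D dataset: five points on a sine curve
X_demo = np.array([[0.5], [1.5], [2.5], [3.5], [4.5]])
y_demo = np.sin(X_demo).ravel()

# Kernel -> model -> fit -> predict with uncertainty
gp_demo = GaussianProcessRegressor(kernel=1.0 * RBF(length_scale=1.0))
gp_demo.fit(X_demo, y_demo)
mean, std = gp_demo.predict(np.array([[2.0]]), return_std=True)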

Example:

Let’s generate synthetic data with both noise-free and noisy versions, fit Gaussian Process models to both datasets, and visualize the results to showcase the predictions along with the associated uncertainty for each case. Comparing the noise-free and noisy Gaussian Process Regression (GPR) fits highlights the impact of observation noise on the fitted model and on the width of its uncertainty estimates.



Now, let’s delve deeper and explore the steps required to perform Gaussian Process Regression in scikit-learn. We will provide code examples and explanations to ensure a clear understanding of the process.

Step 1: Importing Required Libraries

To perform Gaussian Process Regression, the first step is to import the necessary libraries. Besides scikit-learn itself, we need two more libraries: NumPy and Matplotlib. They handle data manipulation, mathematical operations, and visualization of the GPR results.




import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
import matplotlib.pyplot as plt

Step 2: Data Preparation

As we already know, the initial step involves preparing our data to ensure it’s in the right format for our model. This entails organizing the input features and their corresponding output values appropriately. To illustrate this process, let’s generate synthetic data.




# Reproducible random number generator
rng = np.random.default_rng(seed=42)

# 20 sorted inputs in [0, 5], shaped (n_samples, 1) for scikit-learn
X = np.sort(rng.uniform(0, 5, 20))[:, np.newaxis]

# Noise-free targets, plus a noisy version (Gaussian noise, std 0.1)
y_noise_free = np.sin(X).ravel()
y_noisy = y_noise_free + rng.normal(0, 0.1, len(X))
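A quick optional sanity check: scikit-learn expects X as a 2-D array of shape (n_samples, n_features) and y as a 1-D array, which is why the code above reshapes with [:, np.newaxis] and flattens with ravel().

print(X.shape)        # (20, 1): 20 samples, 1 feature
print(y_noisy.shape)  # (20,): one target value per sample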

Step 3: Choosing a Kernel

In this step, we need to select an appropriate kernel function that accurately models the relationship between the input features and output values. To begin, we’ll define the kernel function, specifically the Radial Basis Function (RBF). The choice of kernel function is critical, as it determines how the Gaussian Process Regression model captures the underlying patterns in the data. In both cases, the kernel’s hyperparameters are estimated during fitting by maximizing the log marginal likelihood.




# RBF kernel scaled by a constant (signal-variance) factor
kernel = 1.0 * RBF(length_scale=1.0)
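If you expect noisy observations, one common variant (not used in the example below) is to add a WhiteKernel term so that an observation-noise level is learned alongside the RBF parameters; kernel_with_noise is an illustrative name:

from sklearn.gaussian_process.kernels import WhiteKernel

# Optional alternative: learn an observation-noise level as well
kernel_with_noise = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)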

Step 4: Creating the GP Model

Now, let’s proceed to initialize a GaussianProcessRegressor with the previously selected kernel and any relevant hyperparameters for our model. This step is crucial in setting up the GPR model with the chosen kernel and configuring it for the specific regression task.




# Same kernel for both models; hyperparameters are re-optimized per fit,
# with 10 optimizer restarts to avoid poor local optima
gp_noise_free = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)
gp_noisy = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)
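Note that both models use scikit-learn's default alpha (a tiny jitter), so the noisy model will try to interpolate the noisy points. Since the noise standard deviation is known here (0.1 from Step 2), an alternative sketch is to pass its variance via alpha; gp_noisy_alpha is an illustrative name and not used below:

# Alternative for noisy data: supply the known observation-noise
# variance (0.1 ** 2 matches the noise added in Step 2)
gp_noisy_alpha = GaussianProcessRegressor(kernel=kernel, alpha=0.1 ** 2,
                                          n_restarts_optimizer=10)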

Step 5: Training the GP Model

In this phase, we will train our Gaussian Process models using the prepared data. This involves fitting each model to its training data, allowing it to learn the underlying patterns and relationships between the input features and output values. Training is a pivotal step in building an accurate GP model. Let’s first train on the noise-free data.




gp_noise_free.fit(X, y_noise_free)

Output:

GaussianProcessRegressor(kernel=1**2 * RBF(length_scale=1),
                         n_restarts_optimizer=10)

Now let’s train on the noisy data.




gp_noisy.fit(X, y_noisy)

Output:

GaussianProcessRegressor(kernel=1**2 * RBF(length_scale=1),
                         n_restarts_optimizer=10)
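After fitting, you can optionally inspect the hyperparameters that maximized the log marginal likelihood via the fitted model's kernel_ and log_marginal_likelihood_value_ attributes:

# Kernel with optimized hyperparameters, and the likelihood it achieved
print(gp_noisy.kernel_)
print(gp_noisy.log_marginal_likelihood_value_)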

Step 6: Making Predictions

With the models fully trained, we can now leverage them to make predictions on new data points. The process begins by generating test data for evaluation. Each model then provides predictions that include both the mean and standard deviation, allowing us to assess not only the expected values but also the associated uncertainty in those predictions. This dual provision of mean and variance is a distinctive feature of Gaussian Process Regression.




# Step 6: Making Predictions
# Generate test data for evaluation
X_pred = np.linspace(0, 5, 1000)[:, np.newaxis]
 
# Predictions for noise-free model
y_pred_noise_free, sigma_noise_free = gp_noise_free.predict(X_pred, return_std=True)
 
# Predictions for noisy model
y_pred_noisy, sigma_noisy = gp_noisy.predict(X_pred, return_std=True)
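Beyond the mean and standard deviation, a fitted GP defines a full posterior distribution over functions. As a small optional illustration, sample_y draws sample functions from that posterior:

# Optional: draw a few posterior sample functions from the noisy model
samples = gp_noisy.sample_y(X_pred, n_samples=3, random_state=0)
print(samples.shape)  # (1000, 3): one column per sampled function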

Step 7: Visualizing Regression Results

In the final step, we will visualize our regression results. Through visualization, we will be able to observe both the predicted mean and the associated uncertainty. This graphical representation is essential for gaining insights into the model’s performance and understanding the reliability of our predictions.




# Step 7: Visualizing Regression Results
# Plotting noise-free results
plt.figure(figsize=(12, 6))
 
plt.subplot(1, 2, 1)
plt.scatter(X, y_noise_free, c='r', marker='.', label='Observations (noise-free)')
plt.plot(X_pred, y_pred_noise_free, 'b', label='Prediction')
plt.fill_between(X_pred.flatten(), y_pred_noise_free - 1.96 * sigma_noise_free, y_pred_noise_free + 1.96 * sigma_noise_free, alpha=0.2, color='blue', label='95% Confidence Interval')
plt.title('Gaussian Process Regression (Noise-Free)')
plt.xlabel('Input')
plt.ylabel('Output')
plt.legend()
 
# Plotting noisy results
plt.subplot(1, 2, 2)
plt.scatter(X, y_noisy, c='r', marker='.', label='Observations (Noisy)')
plt.plot(X_pred, y_pred_noisy, 'b', label='Prediction')
plt.fill_between(X_pred.flatten(), y_pred_noisy - 1.96 * sigma_noisy, y_pred_noisy + 1.96 * sigma_noisy, alpha=0.2, color='blue', label='95% Confidence Interval')
plt.title('Gaussian Process Regression (Noisy)')
plt.xlabel('Input')
plt.ylabel('Output')
plt.legend()
 
plt.tight_layout()
plt.show()

Output:

[Plot: side-by-side GPR predictions with 95% confidence bands for the noise-free (left) and noisy (right) datasets]

The plots showcase the predictions along with the associated uncertainty for each case. The noise in the training data is evident in the wider confidence intervals in the second subplot, where the model is trained on noisy data.
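Because the synthetic data comes from a known sine function, one optional way to quantify the effect of noise is to compare each model's predictions against the true function on the test grid; a minimal sketch:

from sklearn.metrics import mean_squared_error

# Compare predictions against the true underlying function sin(x)
y_true = np.sin(X_pred).ravel()
print(mean_squared_error(y_true, y_pred_noise_free))
print(mean_squared_error(y_true, y_pred_noisy))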

Conclusion

In summary, we explored Gaussian Process Regression and saw that it is a powerful tool for modelling nonlinear relationships between variables while also quantifying the uncertainty of its predictions.

