
Gaussian Process Classification (GPC) on Iris Dataset

Gaussian process classification (GPC) is a powerful machine learning approach for classification problems. It is built on Gaussian processes, which can be used for both regression and classification. A Gaussian process is a probabilistic model that defines a distribution over functions. Given a set of input data, this distribution can be used to predict a function's output.

In classification, GPC predicts a new data point's class label from its features. It does so by using the Gaussian process to model the probability of each class label for the data point. The class label with the highest probability is then taken as the prediction.



Gaussian Process Classification

GPC is the extension of Gaussian processes to classification problems. It models a probability distribution over possible functions, which gives a probabilistic approach to class label prediction. GPC can be helpful for problems with imbalanced datasets or complex decision boundaries. It not only makes predictions but also quantifies the uncertainty around them, which matters in applications where the confidence or reliability of the model's output is important. Because it can incorporate prior knowledge through the choice of kernel and adapt to different kinds of data, GPC is a flexible tool for classification tasks.

A Gaussian process (GP) is a stochastic process in which every finite collection of its random variables has a multivariate normal distribution. Equivalently, every finite linear combination of those random variables is normally distributed. GPs have many applications, including statistics, machine learning, and Bayesian inference.
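To make the definition concrete, here is a minimal sketch that draws random functions from a GP prior. It assumes a zero mean function and an RBF kernel; the input grid, the jitter value, and the weight vector are illustrative choices, not part of the original walkthrough.

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# 20 input locations on a line (illustrative grid)
x = np.linspace(0.0, 5.0, 20).reshape(-1, 1)

# Covariance matrix of the function values at those inputs
kernel = RBF(length_scale=1.0)
K = kernel(x)

# A small jitter on the diagonal keeps the matrix numerically positive definite
K_jittered = K + 1e-6 * np.eye(len(x))

# Any finite set of function values is jointly Gaussian,
# so sampling a "function" is sampling from a multivariate normal
samples = rng.multivariate_normal(np.zeros(len(x)), K_jittered, size=3)
print(samples.shape)

# A linear combination w @ f is again Gaussian with variance w @ K @ w
w = np.ones(len(x)) / len(x)
print("Variance of the average function value:", w @ K @ w)
```

Each row of samples is one function evaluated at the 20 inputs. Nearby inputs get similar values because the kernel assigns them a high covariance.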



Concepts of Gaussian Process Classification

Implementation of Gaussian process classification (GPC) on Iris dataset

Importing Libraries




# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

Load Iris Dataset




# Load the iris dataset
iris = load_iris()
X = iris.data[:, :2]  # Using only the first two features for visualization
y = iris.target

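This loads the Iris dataset from Scikit-Learn's sklearn.datasets module. The loaded dataset contains the features (iris flower measurements) and the target labels (iris species). Only the first two features, sepal length and sepal width, are kept so that the decision boundaries can be plotted in two dimensions.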

Split Data into Training and Test Sets




# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=50)

This code uses sklearn.model_selection.train_test_split() to divide the data into training and test sets. With test_size=0.2, the test set (20%) will be used to assess the GPC model's performance, while the training set (80%) will be used to fit it. Fixing the random_state parameter makes the split reproducible.
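A quick check of the resulting split sizes can catch a mismatch between the test_size argument and what the text expects. This is a small sketch that repeats the loading and splitting steps so that it runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Same data and split settings as in the walkthrough
X, y = load_iris(return_X_y=True)
X = X[:, :2]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=50)

# 150 samples with test_size=0.2 give 120 training and 30 test samples
print("Train:", X_train.shape, "Test:", X_test.shape)
```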

Standardize Features




# Standardize the features
scaler = StandardScaler()
X_train_standardized = scaler.fit_transform(X_train)
X_test_standardized = scaler.transform(X_test)

This code standardizes the features using sklearn.preprocessing.StandardScaler(). Each feature is scaled to zero mean and unit standard deviation, with the scaling statistics computed on the training set only and then applied to the test set. This can improve the GPC model's performance.
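The effect of standardization can be verified on a tiny example. The arrays below are made-up measurements used only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test measurements (illustrative values only)
X_train_toy = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [5.9, 3.2]])
X_test_toy = np.array([[5.5, 3.1]])

scaler_toy = StandardScaler()
X_train_toy_std = scaler_toy.fit_transform(X_train_toy)  # learns mean and scale
X_test_toy_std = scaler_toy.transform(X_test_toy)        # reuses them

print("Train means:", X_train_toy_std.mean(axis=0))
print("Train stds: ", X_train_toy_std.std(axis=0))

# The test point is shifted and scaled with the training statistics
manual = (X_test_toy - scaler_toy.mean_) / scaler_toy.scale_
print(np.allclose(manual, X_test_toy_std))
```

Note that the standardized test data will generally not have exactly zero mean or unit variance, because the statistics come from the training set.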

Define Kernel




# Define the kernel
kernel = 1.0 * RBF(length_scale=1.0)  # RBF kernel with default parameters

In GPC, the kernel function determines how the similarity between data points is measured. The Radial Basis Function (RBF) kernel and the Matérn kernel are popular choices.
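To see what "similarity" means numerically, the sketch below evaluates an RBF kernel and a Matérn kernel on two points that are one unit apart. The points and the choice nu=1.5 for the Matérn kernel are assumptions made for illustration.

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF, Matern

# Two points at Euclidean distance d = 1 (illustrative)
a = np.array([[0.0, 0.0]])
b = np.array([[1.0, 0.0]])

rbf = RBF(length_scale=1.0)
matern = Matern(length_scale=1.0, nu=1.5)

k_rbf = rbf(a, b)[0, 0]        # exp(-d^2 / (2 * l^2))
k_matern = matern(a, b)[0, 0]  # (1 + sqrt(3) d / l) * exp(-sqrt(3) d / l)

print("RBF similarity:   ", k_rbf)
print("Matern similarity:", k_matern)
print("Self-similarity:  ", rbf(a, a)[0, 0])
```

Both kernels return 1 for identical points and decay toward 0 as the distance grows. The length scale controls how quickly that decay happens.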

Fit the Model to the Training Data




# Create the Gaussian process classifier
gp = GaussianProcessClassifier(kernel=kernel)
 
# Fit the model to the training data
gp.fit(X_train_standardized, y_train)

Using gp.fit(X_train_standardized, y_train), this code fits the GPC model to the standardized training data. Fitting builds the underlying probabilistic model and optimizes the kernel hyperparameters.
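After fitting, the optimized kernel can be inspected. The sketch below repeats the earlier steps so that it runs on its own. It relies on scikit-learn's default one-vs-rest handling of the three Iris classes, which fits one binary model (and one kernel) per class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = X[:, :2]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=50)
X_train_standardized = StandardScaler().fit_transform(X_train)

gp = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
gp.fit(X_train_standardized, y_train)

# The kernel passed in is left unchanged; the fitted one lives in kernel_
print("Initial kernel:", gp.kernel)
for label, fitted in zip(gp.classes_, gp.kernel_.kernels):
    print("Class", label, "fitted kernel:", fitted)

# Mean log marginal likelihood of the fitted binary models
lml = gp.log_marginal_likelihood_value_
print("Log marginal likelihood:", lml)
```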

Make Predictions on the Test Data




# Make predictions on the test data
y_pred = gp.predict(X_test_standardized)

Using gp.predict(X_test_standardized), this code makes predictions on the test data. The fitted GPC model is applied to the standardized test features and returns the predicted class labels. No refitting takes place at this step.
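Besides hard labels, the classifier can also return class probabilities through predict_proba, which is where the uncertainty information comes from. The following is a self-contained sketch under the same settings as the walkthrough; the variable names proba, labels_from_proba, and confidence are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = X[:, :2]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=50)
scaler = StandardScaler()
X_train_standardized = scaler.fit_transform(X_train)
X_test_standardized = scaler.transform(X_test)

gp = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
gp.fit(X_train_standardized, y_train)

# One row per test sample, one column per class
proba = gp.predict_proba(X_test_standardized)

# Picking the most probable class reproduces the hard prediction
labels_from_proba = gp.classes_[np.argmax(proba, axis=1)]
confidence = proba.max(axis=1)

print("First five probability rows:\n", proba[:5].round(3))
print("Agreement with predict():",
      np.mean(labels_from_proba == gp.predict(X_test_standardized)))
print("Least confident prediction:", confidence.min())
```

Test points with a low maximum probability lie near a decision boundary, so their predicted labels deserve less trust.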

Evaluate the Model




# Evaluate the model
# accuracy_score was already imported at the top
 
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Output:

Accuracy: 0.9

This code uses accuracy to assess the model's performance. The accuracy score is the fraction of test samples whose predicted class label (y_pred) matches the true class label (y_test).
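The accuracy computation is easy to reproduce by hand. The label arrays below are made up for illustration and are not the model's actual output. A confusion matrix is included because it shows which classes get mixed up, something a single accuracy number hides.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels (illustrative only)
y_true_toy = np.array([0, 1, 2, 2, 1, 0, 1, 2, 0, 1])
y_pred_toy = np.array([0, 1, 2, 1, 1, 0, 2, 2, 0, 1])

# Accuracy is the fraction of positions where the labels agree
manual_accuracy = np.mean(y_true_toy == y_pred_toy)
library_accuracy = accuracy_score(y_true_toy, y_pred_toy)
print(manual_accuracy, library_accuracy)  # 0.8 0.8

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)
```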

Mesh Grid Visualization

Create a Mesh Grid




# Create a mesh grid for visualization
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
 
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

The mesh grid created by this code covers the range of the two selected features, padded by 0.5 on each side. The bounds (x_min, x_max, y_min, y_max) are computed from the data, and np.meshgrid() creates a grid of points with a step of 0.1 along each axis.
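A very small grid shows what np.meshgrid() returns. The ranges here are chosen only to keep the output readable.

```python
import numpy as np

# Three x values and two y values (illustrative ranges)
xx_toy, yy_toy = np.meshgrid(np.arange(0, 3, 1), np.arange(0, 2, 1))

# Both arrays have shape (number of y values, number of x values)
print(xx_toy)  # x coordinate of every grid point
print(yy_toy)  # y coordinate of every grid point
print(xx_toy.shape)
```

Reading the two arrays position by position gives the coordinates of each grid point, for example (xx_toy[1, 2], yy_toy[1, 2]) is the point (2, 1).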

Predict on the Mesh Grid




# Predict on the mesh grid
X_grid = np.c_[xx.ravel(), yy.ravel()]
X_grid_standardized = scaler.transform(X_grid)
y_pred_grid = gp.predict(X_grid_standardized)
y_pred_grid = y_pred_grid.reshape(xx.shape)

This code predicts a class label for every point of the mesh grid defined by xx and yy. The coordinate matrices are flattened with ravel() and joined column-wise with np.c_ into a two-column array of grid points (X_grid). These points are standardized with the same scaler that was fitted on the training data. The fitted classifier (gp) then predicts a class label for each standardized point, and the predictions are reshaped to the shape of the mesh grid so that they can be drawn as a surface.

Plotting the Mesh Grid Visualization
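The flatten, predict, reshape round trip can be sketched with a stand-in for the classifier. The function toy_predict below is hypothetical and simply labels a point 1 when its x coordinate exceeds its y coordinate.

```python
import numpy as np

def toy_predict(points):
    """Hypothetical stand-in for gp.predict: label 1 where x > y, else 0."""
    return (points[:, 0] > points[:, 1]).astype(int)

xx_small, yy_small = np.meshgrid(np.arange(0, 3, 1), np.arange(0, 2, 1))

# Flatten both coordinate matrices and pair them up: one row per grid point
grid_points = np.c_[xx_small.ravel(), yy_small.ravel()]
print(grid_points.shape)  # (6, 2)

# Predict on the flat list of points, then restore the grid layout
labels_flat = toy_predict(grid_points)
labels_grid = labels_flat.reshape(xx_small.shape)
print(labels_grid)
```

Because ravel() and reshape() use the same element order, labels_grid[i, j] is the label of the point (xx_small[i, j], yy_small[i, j]), which is what a contour plot expects.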




# Plot the mesh grid visualization
plt.contourf(xx, yy, y_pred_grid, cmap='coolwarm', alpha=0.8)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train,
            edgecolors='k', cmap=plt.cm.Paired)
# The grid and the scatter points are in original units, not standardized
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.title('Gaussian Process Classification on Iris Dataset')
plt.show()

Output:

Gaussian process classification

Running Gaussian process classification on the Iris dataset produces an accuracy score that summarizes the model's performance on the test set. In addition, the mesh grid visualization shows the decision boundaries and illustrates how the model classifies different regions of the feature space (sepal length and sepal width). The model's predictions for every point on the grid are drawn as filled contours, with the training points plotted on top. Together, the outputs give a quantitative measure of the model's accuracy and a qualitative view of its decision boundaries.

