
Decision Boundary of Label Propagation Vs SVM on the Iris Dataset

In machine learning, understanding decision boundaries is crucial for classification tasks. The decision boundary separates different classes in a dataset. Here, we’ll explore and compare decision boundaries generated by two popular classification algorithms – Label Propagation and Support Vector Machines (SVM) – using the famous Iris dataset in Python’s Scikit-Learn library.

Decision Boundary of Label Propagation vs SVM on the Iris Dataset

Label Propagation and Support Vector Machines (SVM) produce noticeably different decision boundaries on the Iris dataset. Label propagation exploits similarity between data points to capture complex patterns, while SVM, which aims for the maximum margin, tends to draw cleaner, more distinct boundaries. The choice between them depends on the type of data and on how complexity and interpretability are traded off.



The Iris dataset features 150 samples of three iris flower species: setosa, versicolor, and virginica. Each sample has four features: sepal length, sepal width, petal length, and petal width. The goal is to predict the species of a new sample from these four measurements.

There are several machine learning techniques available for this problem, including k-nearest neighbours, decision trees, logistic regression, and support vector machines (SVM). Some of these algorithms, however, require every sample to have a known label, which is not always feasible. For instance, suppose we have a large dataset with only a few labelled samples and the rest unlabeled: how can we use the information from the unlabeled samples to improve our predictions?

Concepts of Decision Boundary of Label Propagation vs SVM

Before we start coding, let us review some key concepts related to the topic:

Label Propagation

Label propagation is a semi-supervised learning algorithm that assigns labels to unlabeled samples based on the labels of their neighbors. The algorithm works as follows:

  1. Initialize the label matrix Y with the known labels for the labeled samples, and -1 for the unlabeled samples.
  2. Construct a weighted graph G, where each node represents a sample, and each edge represents the similarity between two samples. The similarity can be computed by various methods, such as the Euclidean distance, the Gaussian kernel, or the k-nearest neighbors graph.
  3. Propagate the labels from the labeled nodes to the unlabeled nodes by updating the label matrix Y iteratively. The update rule is:

    Y_i ← (Σ_j W_ij · Y_j) / (Σ_j W_ij)

    where Y_i is the i-th sample’s label, W_ij is the weight of the edge between the i-th and j-th samples, and each sum runs over all samples. The update rule can be thought of as taking the weighted average of the neighbours’ labels (see the sketch after this list).
  4. Step 3 should be repeated until convergence is attained or a maximum number of iterations is reached.
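
Putting these four steps together, below is a minimal NumPy sketch of the algorithm. It is an illustrative toy version under simplifying assumptions (a dense Gaussian-kernel graph and hard clamping of the known labels), not scikit-learn’s implementation; propagate_labels and sigma are hypothetical names.

import numpy as np

def propagate_labels(X, y, n_classes, sigma=1.0, max_iter=1000, tol=1e-6):
    # y is an integer array holding class indices for labeled samples
    # and -1 for unlabeled ones
    n = len(X)
    # Step 2: Gaussian-kernel similarity between every pair of samples
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    # Step 1: one-hot label matrix; unlabeled rows start as all zeros
    labeled = y != -1
    Y = np.zeros((n, n_classes))
    Y[labeled, y[labeled]] = 1.0
    Y_known = Y[labeled].copy()
    for _ in range(max_iter):  # Step 4: iterate until convergence
        # Step 3: each row becomes the weighted average of the neighbours' labels
        Y_next = W @ Y / W.sum(axis=1, keepdims=True)
        Y_next[labeled] = Y_known  # clamp the known labels after each pass
        if np.abs(Y_next - Y).max() < tol:
            Y = Y_next
            break
        Y = Y_next
    return Y.argmax(axis=1)  # hard label for every sample

In practice, prefer scikit-learn’s LabelPropagation (used in the implementation below), which supports both RBF and k-nearest-neighbours kernels.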

Label spreading is a closely related variant of label propagation that normalizes the graph and softly clamps the labels through a regularization parameter, which makes it more robust to noise in the initial labels. Scikit-learn implements both algorithms (LabelPropagation and LabelSpreading), and either can be used here.
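
For reference, swapping in LabelSpreading would be a one-line change; the alpha value below is an illustrative choice, not a tuned one:

from sklearn.semi_supervised import LabelSpreading

# alpha is the clamping factor: values close to 0 keep the initial labels
# almost fixed, larger values let the propagated information replace them
ls_model = LabelSpreading(kernel="rbf", alpha=0.2)
# ls_model would then be fit and used exactly like LabelPropagation below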

Support Vector Machines

Support vector machines (SVM) are supervised classifiers that search for the hyperplane separating the classes with the maximum margin, that is, the largest possible distance to the nearest training samples; those nearest samples are called the support vectors. Non-linear boundaries can be obtained through kernel functions such as the RBF kernel, the default in scikit-learn’s SVC. Unlike label propagation, an SVM learns only from the labeled samples.
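
As a quick, self-contained illustration of the margin idea (the linear kernel and C value here are arbitrary choices for demonstration):

from sklearn.datasets import load_iris
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)
clf = SVC(kernel="linear", C=1.0).fit(X_demo, y_demo)
# The support vectors are the training samples that define the margin
print(clf.support_vectors_.shape)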

Implementation of Decision Boundary of Label Propagation vs SVM on the Iris Dataset

Let’s walk through the implementation of the decision boundaries of label propagation versus SVM on the Iris dataset, step by step.

Import the necessary libraries and modules

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.semi_supervised import LabelPropagation
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score


Within our code, we’ll make use of the following modules and libraries:

  * numpy for numerical operations
  * matplotlib.pyplot for plotting
  * sklearn.datasets to load the Iris dataset
  * LabelPropagation from sklearn.semi_supervised for the semi-supervised model
  * SVC from sklearn.svm for the support vector classifier
  * accuracy_score from sklearn.metrics to evaluate the predictions

Loading Dataset

# Load Iris dataset
iris = datasets.load_iris()
X = iris.data[:, :2]  # Using the first two features for visualization
y = iris.target


This Python code uses the datasets module of scikit-learn to load the Iris dataset. To keep the decision boundaries easy to visualize in 2D, it extracts only the first two features (X = iris.data[:, :2]) along with their matching target labels (y = iris.target).

Creating a Mask for Labeled and Unlabeled Samples

# Create a mask for labeled and unlabeled samples
np.random.seed(42)
ratio = 0.1  # Ratio of labeled samples
n_samples = len(y)
n_labeled = int(ratio * n_samples)
indices = np.random.choice(n_samples, n_labeled, replace=False)
mask = np.zeros(n_samples, dtype=bool)
mask[indices] = True
y_unlabeled = np.copy(y)
y_unlabeled[~mask] = -1  # Assign -1 to unlabeled samples


This code builds a binary mask (mask) that marks a chosen ratio (10%, giving n_labeled samples) as labeled and the rest as unlabeled. It then copies the original target labels into y_unlabeled and assigns -1 to the unlabeled samples. The random seed is fixed at 42 to ensure reproducibility.

Training and Prediction of the Models

# Train Label Propagation and SVM models on labeled samples
lp_model = LabelPropagation()
# Fit Label Propagation on labeled samples
lp_model.fit(X[mask], y_unlabeled[mask])
 
svm_model = SVC()
svm_model.fit(X[mask], y[mask])  # Fit SVM on labeled samples
 
# Predict labels of unlabeled samples using both models
# Predict labels using Label Propagation
y_pred_lp = lp_model.predict(X[~mask])
y_pred_svm = svm_model.predict(X[~mask])  # Predict labels using SVM


This code trains two models on the labeled samples: Label Propagation (lp_model) and a Support Vector Machine (svm_model), each fit with its fit method. Both models then predict the labels of the unlabeled samples (X[~mask]) with the predict method; the predictions are stored in y_pred_lp and y_pred_svm, respectively.
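
As a side note, label propagation is usually fit on all of the samples at once, with -1 marking the unlabeled ones, so that the graph can also exploit the unlabeled points. A variant of the fit above might look like this (lp_semi and y_pred_semi are illustrative names, and this would generally change the accuracies reported below):

# Semi-supervised variant: pass every sample; -1 marks the unlabeled ones
lp_semi = LabelPropagation()
lp_semi.fit(X, y_unlabeled)
y_pred_semi = lp_semi.predict(X[~mask])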

Evaluation of the Models

# Evaluate accuracy of both models on unlabeled samples
# Calculate accuracy of Label Propagation
acc_lp = accuracy_score(y[~mask], y_pred_lp)
acc_svm = accuracy_score(y[~mask], y_pred_svm)  # Calculate accuracy of SVM
print(f"Accuracy of label propagation: {acc_lp:.2f}")
print(f"Accuracy of SVM: {acc_svm:.2f}")


Output:

Accuracy of label propagation: 0.76
Accuracy of SVM: 0.61

This code assesses the accuracy of the Label Propagation and SVM models on the unlabeled samples. The accuracy_score function from scikit-learn compares the predicted labels (y_pred_lp and y_pred_svm) with the true labels (y[~mask]), and each model’s accuracy on the unlabeled samples is then printed.

Creating a Meshgrid

# Create a meshgrid for decision boundary visualization
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                     np.linspace(y_min, y_max, 100))
# Create a mesh of points for contour plotting
mesh_input = np.c_[xx.ravel(), yy.ravel()]


This code defines a meshgrid (xx and yy) covering the range of the dataset’s first two features, extended by one unit on each side. The flattened meshgrid coordinates (xx.ravel() and yy.ravel()) are then stacked with np.c_ into a matrix (mesh_input) of points spanning the 2D feature space, ready for contour plotting of the decision boundaries.

Visualization of the Decision Boundaries

# Plot decision boundaries
plt.figure(figsize=(12, 6))
 
plt.subplot(1, 2, 1)
plt.title("Label propagation")
# Predict and reshape for contour plotting
Z_lp = lp_model.predict(mesh_input).reshape(xx.shape)
# Plot decision boundary of Label Propagation
plt.contourf(xx, yy, Z_lp, cmap="rainbow", alpha=0.2)
plt.scatter(X[:, 0], X[:, 1], c=y_unlabeled, cmap="rainbow",
            edgecolor="k")  # Scatter plot of samples
plt.xlabel("Sepal length (cm)")
plt.ylabel("Sepal width (cm)")
 
plt.subplot(1, 2, 2)
plt.title("SVM")
Z_svm = svm_model.predict(mesh_input).reshape(
    xx.shape)  # Predict and reshape for contour plotting
# Plot decision boundary of SVM
plt.contourf(xx, yy, Z_svm, cmap="rainbow", alpha=0.2)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="rainbow",
            edgecolor="k")  # Scatter plot of samples
plt.xlabel("Sepal length (cm)")
plt.ylabel("Sepal width (cm)")
 
plt.show()  # Display the plot


Output:

This code plots the decision boundaries of the Label Propagation and SVM models side by side. In each subplot, it predicts labels for the meshgrid points (mesh_input), reshapes the results for contour plotting, and draws the decision boundary with contourf. Scatter plots of the samples are overlaid, with colors denoting the classes (in the left panel, the unlabeled samples carry the -1 placeholder label). Finally, plt.show() displays the plot.

Conclusion

Using scikit-learn, clear differences emerge between the decision boundaries of Label Propagation and SVM on the Iris dataset. Label propagation exploits similarity between data points and produces flexible boundaries that can capture complex patterns. SVM, on the other hand, aims for the maximum margin and tends to produce smoother, more clear-cut boundaries. The choice between the two depends on the nature of the data and on the balance between interpretability and complexity: SVM offers clear-cut decision boundaries where simplicity and clarity are crucial, whereas label propagation excels at capturing subtle relationships, especially when labels are scarce. The visualization highlights the different ways in which the two models approach classification on this popular dataset.

