Contrastive Divergence in Restricted Boltzmann Machines

Last Updated : 19 Jan, 2024

Contrastive Divergence (CD) is a fundamental technique in the realm of machine learning, particularly in the field of unsupervised learning and specifically in training Restricted Boltzmann Machines (RBMs). It serves as a crucial component in the learning process by approximating the gradient needed to update the weights in these models. Initially introduced by Geoffrey Hinton in the context of training RBMs, Contrastive Divergence (CD) has since become a cornerstone in various deep-learning algorithms. This article delves into the depths of Contrastive Divergence, from its foundational concepts to its practical applications.

Restricted Boltzmann Machines (RBMs)

Stochastic Artificial Neural Networks (SANNs) refer to neural network models that incorporate stochasticity in their computations. Stochastic neural networks introduce randomness in different aspects of the network’s operation, often to improve learning dynamics, enable better generalization, or address computational challenges associated with certain types of models.

Boltzmann Machines, which include Restricted Boltzmann Machines (RBMs), are types of stochastic neural networks. These models use Gibbs sampling, a Markov Chain Monte Carlo method, to generate samples from the joint distribution of the visible and hidden units.

RBMs consist of two layers – a visible layer and a hidden layer – each with binary units. Unlike traditional neural networks, there are no connections within the visible or hidden layers. Every visible unit is connected to every hidden unit, and vice versa, forming a fully connected bipartite graph.

The RBM’s energy function measures how compatible a joint configuration of visible and hidden units is. In matrix notation:

E(v, h) = -a^T v - b^T h - v^T W h

  • E is the energy of the configuration
  • v is the visible layer’s state vector
  • h is the hidden layer’s state vector
  • a and b are the biases for the visible and hidden layers, respectively
  • W is the weight matrix connecting visible and hidden units.

The energy function assigns an energy value to each possible configuration of visible and hidden unit states. The negative signs in the expression mean that configurations with lower energy are more probable (more desirable) under the model.
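
As a quick numeric check, the energy of a small configuration can be computed directly with NumPy. The vectors and matrix below are illustrative, made-up values rather than parameters of a trained model:

Python3

import numpy as np

# Illustrative RBM with 3 visible units and 2 hidden units (made-up parameters)
v = np.array([1.0, 0.0, 1.0])          # visible state
h = np.array([1.0, 1.0])               # hidden state
a = np.array([0.10, -0.20, 0.05])      # visible biases
b = np.array([0.30, -0.10])            # hidden biases
W = np.array([[ 0.5, -0.3],
              [ 0.2,  0.4],
              [-0.1,  0.6]])           # weight matrix (3 x 2)

# E(v, h) = -a^T v - b^T h - v^T W h
E = -a @ v - b @ h - v @ W @ h
print(E)  # approximately -1.05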

Energy Based Models

Contrastive Divergence (CD) is intimately connected with Energy-Based Models (EBMs). EBMs are a class of models used in probabilistic machine learning. They define a probability distribution over a set of configurations using an energy function. RBMs are a specific type of EBM, known for their applicability in unsupervised learning tasks such as dimensionality reduction and feature learning.
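
For an RBM with the energy function defined above, this distribution takes the standard Boltzmann (Gibbs) form:

P(v, h) = exp(-E(v, h)) / Z

where Z = Σ_{v, h} exp(-E(v, h)) is the partition function, obtained by summing over every possible configuration of visible and hidden units. Because Z involves exponentially many terms, it cannot be computed exactly for models of realistic size, which is precisely why approximate training methods such as Contrastive Divergence are needed.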

Markov Chain Monte Carlo (MCMC) Methods

CD relies on concepts from MCMC methods. MCMC techniques, including Gibbs Sampling, approximate complex probability distributions by drawing samples from them rather than computing them exactly.

Gibbs Sampling is the key sampling routine inside CD. It is an iterative algorithm that samples from a joint probability distribution by sampling each variable from its conditional distribution while holding the other variables fixed. In an RBM, the bipartite structure makes this especially convenient: all hidden units can be sampled in parallel given the visible units, and all visible units can be sampled in parallel given the hidden units.
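
Concretely, a single Gibbs step in an RBM alternates between the two conditional distributions p(h_j = 1 | v) = sigmoid(b_j + Σ_i v_i W_ij) and p(v_i = 1 | h) = sigmoid(a_i + Σ_j W_ij h_j). A minimal NumPy sketch of one such step, using small made-up parameters, might look like this:

Python3

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, a, b, rng):
    # Sample hidden units given the visible units: p(h = 1 | v) = sigmoid(W^T v + b)
    h_prob = sigmoid(v @ W + b)
    h = (rng.random(h_prob.shape) < h_prob).astype(float)
    # Sample visible units given the hidden units: p(v = 1 | h) = sigmoid(W h + a)
    v_prob = sigmoid(h @ W.T + a)
    v_new = (rng.random(v_prob.shape) < v_prob).astype(float)
    return v_new, h

# Illustrative example: 3 visible units, 2 hidden units
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 2))
a = np.zeros(3)
b = np.zeros(2)
v = np.array([1.0, 0.0, 1.0])
v_next, h_sample = gibbs_step(v, W, a, b, rng)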

What is Contrastive Divergence?

At its core, Contrastive Divergence is an iterative algorithm employed in training RBMs, which are a type of probabilistic graphical model used for dimensionality reduction, feature learning, and collaborative filtering. The primary objective of CD is to estimate the gradient of the log-likelihood function associated with the RBM.

To comprehend CD, it’s essential to grasp RBMs briefly. RBMs consist of visible and hidden layers in which every visible unit is connected to every hidden unit, but there are no connections between units within the same layer. CD operates by updating the weights of these connections to minimize the difference between the observed data and the reconstructed data generated by the RBM.
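
For the RBM energy function given earlier, the gradient of the log-likelihood with respect to a weight has a particularly simple form:

∂ log P(v) / ∂ W_ij = <v_i h_j>_data - <v_i h_j>_model

The first term (the positive phase) is easy to estimate from the training data, but the second term (the negative phase) is an expectation under the model’s own distribution and is intractable because of the partition function. CD-k approximates the negative phase by running only k steps of Gibbs sampling starting from a training example and using the resulting reconstruction in place of a sample from the model’s equilibrium distribution.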

Contrastive Divergence Algorithm in Restricted Boltzmann Machines

CD operates through several steps, beginning with initializing the RBM with random weights. It then performs Gibbs Sampling to estimate the gradient needed for weight updates by contrasting statistics computed from the observed data with statistics computed from the model’s reconstructions.

  • Initialize Model Parameters:
    • The algorithm starts by initializing the RBM’s parameters, which determine the relationships between visible and hidden units: the weight matrix (W), the visible biases (a), and the hidden biases (b). The weights are typically initialized randomly; the biases are often set to zero, as in the code below.
  • Compute Probabilities of Hidden Units:
    • Given a training example (a vector of visible units), compute the probabilities of the hidden units being activated. This involves calculating the conditional probabilities of the hidden units given the visible units using the current model parameters.
  • Sample Hidden Configuration:
    • Sample a binary hidden unit configuration based on the computed probabilities: each hidden unit is switched on with a probability equal to its computed activation probability. This sampled hidden configuration is then used to reconstruct the visible units.
  • Reconstruct Visible Units:
    • Given the sampled hidden configuration, compute the probabilities of the visible units being activated. Again, this involves calculating the conditional probabilities of the visible units given the sampled hidden units.
  • Update Model Parameters:
    • Update the model parameters based on the difference between the outer products of the original training example and the reconstructed sample. The update is performed to maximize the likelihood of the training data. Specifically, the weight matrix and bias vectors are adjusted to reduce the difference between the original input and the reconstructed input.
    • The weight update for the connection between visible unit i and hidden unit j is proportional to the difference between the product of the original visible and hidden activities (v_i * h_j) and the product of the reconstructed visible and hidden activities (v’_i * h’_j). The learning rate and other hyperparameters scale this update; the full rule is written out in symbols after this list.
    • The goal is to make the model more likely to generate the training examples and less likely to generate samples that do not resemble the training data.
  • Repeat Steps for Multiple Iterations:
    • Iterate through steps 2-5 for multiple training examples and/or epochs to refine the model parameters further. This iterative process helps the model learn the underlying patterns and relationships in the training data.
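
Written out in symbols, for a single training vector v with hidden probabilities h = p(h = 1 | v), reconstruction v’, and reconstructed hidden probabilities h’ = p(h = 1 | v’), one CD-1 update with learning rate η is:

W ← W + η (v h^T - v’ h’^T)
a ← a + η (v - v’)
b ← b + η (h - h’)

This is the update, applied in batched form over many training vectors, that the implementation below performs.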

Contrastive Divergence Algorithm Implementations in Python

Implementing Contrastive Divergence in Python involves setting up an RBM and performing the CD steps.

The code provides an example of using the RBM class:

  1. np.random.seed(42): Sets the random seed for reproducibility.
  2. num_visible and num_hidden: Define the number of visible and hidden units.
  3. data: Represents the input dataset used for training (random data is generated here for demonstration purposes).
  4. rbm = RBM(num_visible, num_hidden): Creates an RBM instance.
  5. rbm.contrastive_divergence(data): Trains the RBM using the Contrastive Divergence algorithm on the provided dataset.

Python3

import numpy as np
 
class RBM:
    def __init__(self, num_visible, num_hidden):
        self.num_visible = num_visible
        self.num_hidden = num_hidden
        self.weights = np.random.randn(num_visible, num_hidden)
        self.visible_bias = np.zeros(num_visible)
        self.hidden_bias = np.zeros(num_hidden)
 
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))
 
    def gibbs_sampling(self, visible_data, k=1):
        # Run k alternating steps: sample hidden units given the visible units,
        # then resample the visible units given the sampled hidden units
        for _ in range(k):
            hidden_probs = self.sigmoid(np.dot(visible_data, self.weights) + self.hidden_bias)
            hidden_states = np.random.rand(len(visible_data), self.num_hidden) < hidden_probs
            visible_probs = self.sigmoid(np.dot(hidden_states, self.weights.T) + self.visible_bias)
            visible_data = np.random.rand(len(visible_data), self.num_visible) < visible_probs
        # Recompute the hidden probabilities from the final reconstruction so the
        # negative phase uses statistics of the reconstructed visible units
        hidden_probs = self.sigmoid(np.dot(visible_data, self.weights) + self.hidden_bias)
        return visible_data, hidden_probs
 
    def contrastive_divergence(self, data, learning_rate=0.1, k=1, epochs=10):
        for _ in range(epochs):
            # Positive phase: hidden probabilities and associations driven by the data
            positive_hidden_probs = self.sigmoid(np.dot(data, self.weights) + self.hidden_bias)
            positive_associations = np.dot(data.T, positive_hidden_probs)

            # Negative phase: reconstruct the data with k steps of Gibbs sampling
            recon_data, recon_hidden_probs = self.gibbs_sampling(data, k)
            negative_associations = np.dot(recon_data.T, recon_hidden_probs)

            # Update parameters by contrasting data-driven and reconstruction-driven statistics
            self.weights += learning_rate * (positive_associations - negative_associations)
            self.visible_bias += learning_rate * np.mean(data - recon_data, axis=0)
            self.hidden_bias += learning_rate * np.mean(positive_hidden_probs - recon_hidden_probs, axis=0)
 
# Example usage
np.random.seed(42)  # For reproducibility
num_visible = 6
num_hidden = 3
data = np.random.rand(100, num_visible)  # Random sample data in [0, 1], treated as visible-unit activation probabilities
 
rbm = RBM(num_visible, num_hidden)
rbm.contrastive_divergence(data)


RBM Class

The RBM class is used to define the Restricted Boltzmann Machine.

1. __init__() method:

  • Initializes the RBM with random weights and zero-valued visible and hidden biases.
  • num_visible and num_hidden represent the number of visible and hidden units in the RBM, respectively.

2. sigmoid() method:

  • Implements the sigmoid activation function, used to calculate probabilities in the RBM.

3. gibbs_sampling() method:

  • Performs Gibbs Sampling, an iterative process used to generate reconstructions (samples) from the RBM.
  • k denotes the number of Gibbs sampling steps; the method returns the reconstructed visible units together with the hidden probabilities computed from them.

4. contrastive_divergence() method:

  • Implements the Contrastive Divergence algorithm to train the RBM.
  • data is the input dataset used for training.
  • learning_rate is the learning rate for updating weights during training.
  • k is the number of Gibbs sampling steps used in the negative phase (CD-k).
  • epochs is the number of training iterations over the dataset.
  • Within this method, positive and negative phases are computed, and the weights and biases are updated based on the CD algorithm.
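
After training, a simple sanity check, not included in the class above and shown here only as an assumed add-on, is to measure how well the RBM reconstructs the training data after one Gibbs step:

Python3

# Continuing the example usage: rbm is the trained model, data is the training set
recon, _ = rbm.gibbs_sampling(data, k=1)
mse = np.mean((data - recon.astype(float)) ** 2)
print(f"Reconstruction MSE: {mse:.4f}")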

Advantages and Challenges of Contrastive Divergence

Contrastive Divergence offers several advantages in training RBMs:

  • Efficiency: It provides a computationally efficient way to approximate the gradient without running the Markov chain to equilibrium, making it suitable for large datasets.
  • Convergence: Despite being an approximation, CD often converges effectively in practice, aiding in the training of RBMs.

However, CD also presents some challenges:

  • Approximation: CD yields a biased estimate of the true log-likelihood gradient, which can affect the accuracy of learning and the convergence of the training process.
  • Parameter Sensitivity: CD’s performance can be sensitive to the choice of hyperparameters, such as the number of iterations or the learning rate.

Applications of Contrastive Divergence

  • Collaborative Filtering and Recommendation Systems: CD is extensively used in collaborative filtering, a technique employed in recommendation systems. By analyzing user-item interactions, RBMs trained using CD can uncover latent patterns and preferences within datasets. This allows systems to provide personalized recommendations, improving user experience in domains like e-commerce, streaming platforms, and content curation.
  • Dimensionality Reduction: In high-dimensional data, CD aids in dimensionality reduction by extracting essential features. RBMs trained through CD can learn compact representations that capture important characteristics while discarding noise. These learned representations facilitate visualization, clustering, and efficient processing of large datasets in fields such as image and speech recognition (see the sketch after this list).
  • Feature Learning in Deep Neural Networks: CD plays a vital role in pre-training deep neural networks by initializing their layers with RBMs. This pre-training strategy enables the network to learn hierarchical representations of data, leading to improved performance in subsequent supervised learning tasks such as image classification, natural language processing, and speech recognition.
  • Generative Modeling and Data Generation: RBMs trained using CD can generate new data samples similar to the training dataset. This ability to generate realistic samples makes CD useful in generating synthetic data for various purposes like data augmentation, creating training datasets, and simulating data for testing models.
  • Unsupervised Learning and Clustering: By uncovering underlying structures in data, RBMs trained with CD assist in unsupervised learning tasks. They can aid in clustering similar data points together, identifying anomalies, and discovering patterns in data without labeled information. This is particularly beneficial in fields such as anomaly detection in cybersecurity and identifying patterns in healthcare data.
  • Probabilistic Modeling and Density Estimation: CD facilitates learning the probability distribution of data. RBMs trained using CD help in estimating the underlying probability distribution of observed data. This is valuable in modeling complex data distributions, analyzing uncertainties, and performing density estimation tasks.
  • Natural Language Processing (NLP): In NLP, CD-assisted RBMs are employed for various tasks such as language modeling, sentiment analysis, and text generation. RBMs learn representations of words or sentences that capture semantic information, enabling better understanding and generation of text.
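
As an illustration of the dimensionality-reduction and feature-learning use cases above, the hidden-unit activation probabilities of a trained RBM can serve directly as a compact representation of the input. Continuing the example usage from the implementation section:

Python3

# Hidden-unit activation probabilities as learned features
features = rbm.sigmoid(np.dot(data, rbm.weights) + rbm.hidden_bias)
print(features.shape)  # (100, 3): each 6-dimensional sample is compressed to 3 features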

Conclusion

Contrastive Divergence stands as a cornerstone algorithm in training Restricted Boltzmann Machines. Despite its approximative nature, it serves as a practical and computationally efficient method for estimating gradients in RBM training. Its iterative steps enable the model to learn and optimize the weights efficiently, contributing significantly to the field of unsupervised learning and deep neural networks. Further research continues to explore variations and improvements to this fundamental technique, aiming to enhance its accuracy and applicability across diverse domains.


