Chain Rule Derivative in Machine Learning

Last Updated : 03 Apr, 2024

In machine learning, understanding the chain rule and its application in computing derivatives is essential. The chain rule allows us to find the derivative of composite functions, which frequently arise in machine learning models due to their layered architecture. These models often involve multiple nested functions, and the chain rule helps us compute gradients efficiently for optimization algorithms like gradient descent.

What is the Chain Rule?

The chain rule is a fundamental concept in calculus that allows us to find the derivative of composite functions. It states that if we have a function y = f(g(x)), where g is a function of x and f is a function of g, then the derivative of y with respect to x is given by:

\frac{dy}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}

This means that to find the derivative of a composite function, we first find the derivative of the outer function with respect to its input (treating the inner function as a variable), then multiply it by the derivative of the inner function with respect to its input.
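As a quick worked example (my own sketch, not part of the original walkthrough): take y = sin(x²). The outer function is f(g) = sin(g) and the inner function is g(x) = x², so the chain rule gives dy/dx = cos(x²) · 2x. A few lines of NumPy confirm this against a numerical estimate:

Python
import numpy as np

def f(x):
    # Composite function y = f(g(x)) with f = sin and g(x) = x**2
    return np.sin(x ** 2)

def f_prime(x):
    # Chain rule: df/dg * dg/dx = cos(x**2) * 2x
    return np.cos(x ** 2) * 2 * x

x = 1.5
h = 1e-6
numerical = (f(x + h) - f(x - h)) / (2 * h)  # central-difference estimate
print(f_prime(x), numerical)  # the two values agree to several decimal places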

Application of Chain Rule in Machine Learning

The chain rule is extensively used in various aspects of machine learning, especially in training and optimizing models. Here are some key applications:

  1. Backpropagation: In neural networks, backpropagation is used to update the weights of the network by calculating the gradient of the loss function with respect to the weights. This process relies heavily on the chain rule to propagate the error backwards through the network layer by layer, efficiently calculating gradients for weight updates.
  2. Gradient Descent Optimization: In optimization algorithms like gradient descent, the chain rule is used to calculate the gradient of the loss function with respect to the model parameters. This gradient is then used to update the parameters in the direction that minimizes the loss.
  3. Automatic Differentiation: Many machine learning frameworks, such as TensorFlow and PyTorch, use automatic differentiation to compute gradients. Automatic differentiation relies on the chain rule to decompose complex functions into simpler functions and compute their derivatives (a minimal example is sketched after this list).
  4. Recurrent Neural Networks (RNNs): In RNNs, which are used for sequence modeling tasks, the chain rule is used to propagate gradients through time. This allows the network to learn from sequences of data by updating the weights based on the error calculated at each time step.
  5. Convolutional Neural Networks (CNNs): In CNNs, which are widely used for image recognition and other tasks involving grid-like data, the chain rule is used to calculate gradients for the convolutional layers. This allows the network to learn spatial hierarchies of features.
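As promised above, here is a minimal automatic-differentiation sketch (assuming PyTorch is available; the function and values are illustrative, not from the original article). PyTorch records each operation during the forward computation and then applies the chain rule in reverse when backward() is called:

Python
import torch

# Same composite function as before: y = sin(x^2)
x = torch.tensor(1.5, requires_grad=True)
y = torch.sin(x ** 2)
y.backward()   # autodiff walks the graph backwards, applying the chain rule
print(x.grad)  # equals cos(1.5**2) * 2 * 1.5, as the chain rule predicts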

Steps to Implement Chain Rule Derivative with Mathematical Notation

Let’s consider a simple example where we have a neural network with two layers. The forward pass of this network can be represented as:

z = W_2 \cdot \sigma(W_1 \cdot x + b_1) + b_2
where:

  • x is the input
  • W1 and W2 are the weight matrices of the first and second layers, respectively
  • b1 and b2 are the biases
  • σ is the activation function

To compute the gradient of the loss function L with respect to the weights W1 and W2 using backpropagation, we apply the chain rule step by step:

  • Compute the derivative of the loss with respect to the output: \frac{\partial L}{\partial z}
  • Compute the derivative of the output with respect to each weight and bias, applying the chain rule at each step: \frac{\partial z}{\partial W_2}, \frac{\partial z}{\partial b_2}, \frac{\partial z}{\partial W_1}, \frac{\partial z}{\partial b_1}
  • Multiply the factors along each path, for example \frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial W_2}, and update the weights and biases using gradient descent or another optimization algorithm.

Let's consider a specific example where we have a neural network with one input layer, one hidden layer, and one output layer. We'll use the sigmoid activation function.
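Written out for this network (a worked sketch in the notation of the code below, where z_1 = W_1 x + b_1, a_1 = \sigma(z_1), z_2 = W_2 a_1 + b_2, a_2 = \sigma(z_2), and L is any differentiable loss), the output-layer gradient expands to:

\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_2} \cdot \frac{\partial z_2}{\partial W_2} = \frac{\partial L}{\partial a_2} \cdot \sigma'(z_2) \cdot a_1

The gradient with respect to W_1 follows the same pattern, extended by two more chain-rule factors for the hidden layer.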

Python Implementation

Here’s a step-by-step explanation:

  1. Define the sigmoid activation function: The sigmoid function takes an input x and returns the sigmoid activation applied to x.
  2. Define the forward pass function: The forward_pass function takes an input x, weights W1 and W2, biases b1 and b2, and performs the forward pass through the neural network. It calculates the output of the hidden layer (a1) and the output layer (a2) using the sigmoid activation function.
  3. Define the input: The input x is a NumPy array representing the features.
  4. Define weights and biases: W1 is a 2×2 matrix representing the weights of the connections between the input and the hidden layer. b1 is a 1×2 vector representing the biases of the hidden layer. W2 is a 1×2 vector representing the weights of the connections between the hidden layer and the output layer. b2 is a scalar representing the bias of the output layer.
  5. Perform the forward pass: The forward_pass function is called with the input x, weights W1 and W2, biases b1 and b2, and it calculates the output of the neural network.
  6. Print the output: The calculated output of the neural network is printed.
Python
import numpy as np

# Define the sigmoid activation function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Forward pass through the two-layer network
def forward_pass(x, W1, b1, W2, b2):
    z1 = np.dot(W1, x) + b1   # pre-activation of the hidden layer
    a1 = sigmoid(z1)          # hidden-layer activation
    z2 = np.dot(W2, a1) + b2  # pre-activation of the output layer
    a2 = sigmoid(z2)          # network output
    return a2

# Define input
x = np.array([0.5, 0.3])

# Define weights and biases
W1 = np.array([[0.1, 0.2], [0.3, 0.4]])
b1 = np.array([0.5, 0.6])
W2 = np.array([0.7, 0.8])
b2 = 0.9

# Perform the forward pass and print the result
output = forward_pass(x, W1, b1, W2, b2)
print("Output:", output)

Output:

Output: 0.871843204787514
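The script above implements only the forward pass. As a complementary sketch (my own addition, continuing the script above and assuming a squared-error loss L = ½(a2 − t)² with a hypothetical target t), here is how the chain-rule steps translate into a backward pass for the same network:

Python
def backward_pass(x, W1, b1, W2, b2, t):
    # Forward pass, keeping the intermediates needed by the chain rule
    z1 = np.dot(W1, x) + b1
    a1 = sigmoid(z1)
    z2 = np.dot(W2, a1) + b2
    a2 = sigmoid(z2)

    # Backward pass: apply the chain rule layer by layer
    dL_da2 = a2 - t                         # dL/da2 for L = 0.5 * (a2 - t)**2
    delta2 = dL_da2 * a2 * (1 - a2)         # dL/dz2 = dL/da2 * sigmoid'(z2)
    dW2 = delta2 * a1                       # dL/dW2 = dL/dz2 * dz2/dW2
    db2 = delta2                            # dL/db2 = dL/dz2
    delta1 = (delta2 * W2) * a1 * (1 - a1)  # propagate the error to the hidden layer
    dW1 = np.outer(delta1, x)               # dL/dW1 = dL/dz1 * dz1/dW1
    db1 = delta1                            # dL/db1 = dL/dz1
    return dW1, db1, dW2, db2

# Gradients for the network defined above, with an illustrative target t = 1.0
dW1, db1, dW2, db2 = backward_pass(x, W1, b1, W2, b2, t=1.0)
print("dL/dW2:", dW2)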

In conclusion, the forward pass is a fundamental step in the operation of a neural network. It involves calculating the output of the network for a given input by propagating the input through the network's layers, applying weights and biases, and using activation functions to introduce non-linearity. The forward pass is essential for making predictions with a neural network and is the building block for training and optimization, where the chain rule is applied in reverse to the same computations to obtain the gradients.


