Kaiming Initialization in Deep Learning

Last Updated : 27 Dec, 2023

Kaiming initialization is a weight initialization technique in deep learning that adjusts the initial weights of neural network layers to facilitate efficient training by addressing the vanishing and exploding gradient problems. This article explores the fundamentals of Kaiming initialization and its implementation.

What is Kaiming Initialization?

The Kaiming initialization method, also known as Kaiming He initialization or He normal initialization, is a technique for initializing the weights of artificial neural networks. This method was introduced in the paper titled “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification” by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.

The motivation behind Kaiming initialization is to address the vanishing or exploding gradient problem that can occur during the training of deep neural networks. In deep networks, especially those using rectified linear unit (ReLU) activation functions, traditional weight initialization methods such as random normal or Xavier initialization may lead to gradients that vanish or explode as they are propagated through the layers during backpropagation.

With Kaiming initialization, each weight is drawn from a Gaussian probability distribution with a mean of 0.0 and a standard deviation of sqrt(2/n), where n is the number of inputs to the node.

W \sim \mathcal{N}\left(0, \frac{2}{n}\right)

The factor of 2 in the variance is specific to the ReLU activation function. For other activation functions, such as sigmoid or hyperbolic tangent, different initialization strategies might be more appropriate.
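As a quick illustration of this rule, the weights of a single layer can be drawn by hand from the distribution above. The sketch below is a minimal, illustrative version (the fan-in of 512 and fan-out of 256 are arbitrary example sizes) of what built-in initializers do.

Python3

import math
import torch

fan_in, fan_out = 512, 256           # assumed layer sizes, for illustration only
std = math.sqrt(2.0 / fan_in)        # Kaiming/He standard deviation sqrt(2/n)

# Each weight is drawn from N(0, 2/fan_in); biases are simply zeroed.
W = torch.randn(fan_out, fan_in) * std
b = torch.zeros(fan_out)

print(W.std())                       # close to sqrt(2/512) ≈ 0.0625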

Derivation of Kaiming Initialization

We will use the assumptions below to derive the Kaiming initialization.

  • The output of a neural network layer is given as y = Wx + b. Here
    • W is the weight matrix
    • X is the input
    • b is the bias
  • All elements in W share the same distribution and are independent of each other.
  • All elements in X are mutually independent and share the same distribution.
  • The weights in each layer have zero mean.
  • Independence of weights and inputs: the weights and inputs are assumed to be independent of each other.
  • Bias initialization: the derivation does not explicitly consider biases. In practice, biases are often initialized to zero or small constant values separately from the weights.

We will use the following properties of variance (property 1 holds for independent X and Y); a quick numerical sanity check follows the list.

  1. Var(XY) = Var(X)Var(Y) + (E[X])^2Var(Y) +  (E[Y])^2Var(X)
  2. Var(X) = E(X^2) - [E(X)]^2
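These identities can be sanity-checked numerically. The sketch below is an illustrative check only (the chosen means and standard deviations are arbitrary): it samples two independent variables and compares both sides of property 1.

Python3

import torch

torch.manual_seed(0)
n = 1_000_000

# Two independent random variables with arbitrary means and spreads.
X = torch.randn(n) * 1.5 + 0.7
Y = torch.randn(n) * 0.8 - 0.3

lhs = (X * Y).var()
rhs = X.var() * Y.var() + X.mean()**2 * Y.var() + Y.mean()**2 * X.var()

print(lhs.item(), rhs.item())        # the two values should agree closely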

Let us consider two layers, 'k-1' and 'k'. The output of layer 'k-1' is fed into layer 'k'. The output of the ith neuron in layer k can then be written as

y_i = ∑W_{ij}X_{j} + b_i

Here

  • Xj is the jth input to the neuron in layer k, coming from layer k-1. The summation runs over the number of neurons in layer k-1.
  • Wij is the weight connecting the jth neuron of layer k-1 to the ith neuron of layer k. The corresponding row of the weight matrix has size 1 × (number of neurons in layer k-1).

Now we can write:

Var(y_i) = Var(∑W_{ij}X_j)

Since each element of the weights and of the inputs is independent and identically distributed, and the weights and inputs are independent of each other, we can simplify the above equation as

Var(y_i) = n * Var(W_{ij}X_j)

where n is the number of neurons in layer k-1.

Using the first property,

Var(y_i) = n *(Var(W_{ij})Var(X_j)  + (E[W_{ij}])^2Var(X_j) + Var(W_{ij})(E[X_j])^2)

Assuming zero mean for weight,

Var(y_i) = n *(Var(W_{ij})Var(X_j)  + 0*Var(X_j) + Var(W_{ij})(E[X_j])^2)

Var(y_i) = n *Var(W_{ij}) *(Var(X_j)  +(E[X_j])^2)

Using property 2, we get,

Var(y_i) = n *Var(W_{ij}) * E[X_j^2] \; ...(1)



For a continuous random variable, the expected value is the integral of each possible value weighted by its probability density:

E[X^2] = ∫_{-\infty}^\infty x^2p(x)dx

Since the input to the current layer is the ReLU of the output of the previous layer y_{l-1}, i.e. x = max(0, y_{l-1}), we have

E[X^2] = ∫_0^\infty y_{l-1}^2p(y_{l-1})dy_{l-1}   (the integration starts at 0 instead of -∞ because ReLU is zero for negative inputs)

Since y_{l-1} has a distribution that is symmetric about zero (the weights have zero mean), the integral over (0, ∞) is half the integral over the whole real line:

E[X^2] = 0.5*∫_{-\infty}^\infty y_{l-1}^2p(y_{l-1})dy_{l-1}

E[X^2] = 0.5 * Var(y_{l-1})   (using property 2 and the fact that y_{l-1} has zero mean)
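Before substituting this value into Eq (1), note that the halving can be verified empirically: for a zero-mean, symmetric pre-activation, the second moment of its ReLU is half of its variance. The sketch below is an illustrative check only (the scale of 2.0 is arbitrary).

Python3

import torch

torch.manual_seed(0)
y_prev = torch.randn(1_000_000) * 2.0    # zero-mean, symmetric pre-activation y_{l-1}
x = torch.relu(y_prev)                   # ReLU output fed into the next layer

print((x ** 2).mean().item())            # E[X^2]
print(0.5 * y_prev.var().item())         # 0.5 * Var(y_{l-1}); should match closely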

Substituting this value in Eq (1)

Var(y_i) = 0.5 * n * Var(W_{ij}) * Var(y_{l-1})


We drop the indices and write the above equation for a whole layer, since each element of the weight matrix and of the input is independent and comes from the same distribution as per our assumptions.

Var(y^l) = 0.5 * n * Var(W^l) * Var(y^{l-1})

Now, considering L layers (from 1 to L), we can write this as

Var(y^L) = Var(y^1)\prod_{l=2}^{L} \frac{1}{2}n^l Var(W^l)


This product is the key to the initialization design. A proper initialization method should avoid reducing or magnifying the magnitudes of input signals exponentially. So, we expect the above product to take a proper scalar (e.g., 1). A sufficient condition is:

\frac{n^l}{2}Var(W^l) = 1

This gives us

W \sim \mathcal{N}\left(0, \frac{2}{n}\right)

Note that:

  • We derived the initialization using only the forward pass. The same result can be obtained by carrying out the analysis through the backward (backpropagation) pass instead.
  • The formula holds only when ReLU is the activation function in each layer. If a different activation function is used, the initialization can be derived by substituting that activation into the integral for E[X^2].
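To see why this condition matters in practice, the sketch below (illustrative only; the width, depth, and batch size are arbitrary) pushes a random batch through a stack of ReLU layers and prints the activation standard deviation at a few depths, once with Kaiming scaling and once with a naive fixed standard deviation.

Python3

import math
import torch

torch.manual_seed(0)
width, depth, batch = 512, 20, 1024      # assumed sizes, for illustration only

def forward_std(std_fn):
    # Push a random batch through `depth` ReLU layers and record activation stds.
    x = torch.randn(batch, width)
    stds = []
    for _ in range(depth):
        W = torch.randn(width, width) * std_fn(width)
        x = torch.relu(x @ W.t())
        stds.append(x.std().item())
    return stds

kaiming = forward_std(lambda fan_in: math.sqrt(2.0 / fan_in))   # He/Kaiming normal
naive = forward_std(lambda fan_in: 0.01)                        # fixed small std

print("Kaiming:", [round(s, 3) for s in kaiming[::5]])   # stays at a stable scale
print("Naive:  ", [round(s, 3) for s in naive[::5]])     # collapses toward zero

With the Kaiming scaling, the activation scale stays roughly constant with depth, whereas the naive initialization shrinks the signal toward zero layer after layer.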

Advantages

Through a meticulous analysis of the properties of ReLU and the challenges posed by vanishing and exploding gradients, Kaiming He and his colleagues devised an initialization method that has become a cornerstone for initializing weights in deep neural networks. The advantages of Kaiming initialization technique are:

  • Mitigation of Vanishing and Exploding Gradients: Kaiming He initialization helps mitigate the vanishing and exploding gradient problems that can impede the training of deep networks.
  • Preservation of Variance: The initialization method is designed to preserve the variance of weights, particularly during the forward pass. This helps maintain an appropriate scale of activations across layers, preventing issues like overly small or large values.
  • Efficient Training of Deeper Networks: Kaiming He initialization has been shown to contribute to more efficient training of deeper networks. Deep neural networks with many layers can benefit from this initialization method to ensure that information and gradients are propagated effectively through the network.
  • Adaptability to ReLU Activation: Kaiming He initialization is tailored for ReLU activation functions, which are widely used in deep learning. It takes into account the characteristics of ReLU, such as non-saturation for positive inputs, to set appropriate initial weights.
  • Empirical Success Across Tasks: Empirical studies have consistently demonstrated the effectiveness of Kaiming He initialization across various computer vision and natural language processing tasks. It has become a go-to initialization strategy for practitioners working with deep neural networks.

Implementation

torch.nn.init.kaiming_uniform_ is a PyTorch initialization method designed for initializing weights in a neural network. It follows the Kaiming He initialization strategy, which is specifically tailored for the rectified linear unit (ReLU) activation functions.

Python3

import torch
import torch.nn.init as init
 
weight_tensor = torch.empty(3, 3)  # Example 3x3 weight tensor (values uninitialized)
init.kaiming_uniform_(weight_tensor, mode='fan_in', nonlinearity='relu')

Parameters:

  • tensor: The PyTorch tensor represents the weights to be initialized.
  • mode: Specifies the mode for computing the fan. It can be ‘fan_in’ or ‘fan_out’.
    • ‘fan_in’:
      • Choosing ‘fan_in’ as the mode for computing the fan in weight initialization means that the scaling factor is based on the number of input units (fan-in) to a neuron in the current layer.
      • This choice preserves the magnitude of the variance of the weights during the forward pass. In other words, it ensures that the variance of the weights entering a neuron in the current layer is maintained.
    • ‘fan_out’:
      • Choosing ‘fan_out’ as the mode means that the scaling factor is based on the number of output units (fan out) from a neuron in the current layer.
      • This choice is motivated by the desire to preserve the magnitudes during the backward pass, specifically during backpropagation. It helps in maintaining the variance of the gradients as they are propagated backward through the network.
  • nonlinearity: Specifies the nonlinearity function. For ReLU, it should be set to ‘relu’.
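As a usage sketch (the layer sizes here are arbitrary), the initializer is typically applied in place to a layer's weight tensor after the layer has been constructed, for example on an nn.Linear:

Python3

import torch.nn as nn
import torch.nn.init as init

layer = nn.Linear(128, 64)    # fan_in = 128, fan_out = 64 (example sizes)

# Re-initialize the layer's weights in place with Kaiming uniform, scaling by
# the fan-in so the forward-pass variance is preserved for ReLU activations.
init.kaiming_uniform_(layer.weight, mode='fan_in', nonlinearity='relu')
init.zeros_(layer.bias)       # biases are commonly just set to zero

print(layer.weight.std())     # roughly sqrt(2/128) ≈ 0.125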

Conclusion

The core principle behind Kaiming initialization is to set the initial weights in a way that facilitates stable and efficient training. In this article, we covered the mathematical derivation behind the technique, its advantages, and how it can be implemented in Python.


