The Lottery Ticket Hypothesis

Last Updated : 20 Dec, 2022

The Lottery Ticket Hypothesis was introduced in a research paper by Jonathan Frankle and Michael Carbin of MIT, presented at ICLR 2019. The paper received a Best Paper Award at ICLR 2019.

Background: Network Pruning
Pruning means shrinking a neural network by removing superfluous and unwanted parts. Network pruning is a commonly used practice for reducing the size, storage footprint, and computational cost of a neural network, for example so that an entire network fits on your phone. The idea of network pruning originated in the 1990s and was popularized again in 2015.

How do you “prune” a neural network?
We can summarize the process of pruning in 4 major steps:

1. Train the network.
2. Remove superfluous structures.
3. Fine-tune the network.
4. Optionally, repeat Steps 2 and 3 iteratively.
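When Steps 2 and 3 are repeated, the surviving fraction of the network shrinks geometrically. The short sketch below works through that arithmetic. The per-round pruning fraction of 20% and the function name are assumptions chosen for illustration, not values from the text above.

```python
def remaining_after_rounds(prune_fraction, rounds):
    """Fraction of weights left after pruning `prune_fraction` of the
    surviving weights in each of `rounds` pruning rounds."""
    return (1.0 - prune_fraction) ** rounds

# Assumed schedule: prune 20% of the surviving weights per round.
for k in (1, 5, 10):
    left = remaining_after_rounds(0.2, k)
    print(f"after {k:2d} rounds: {left:.1%} of the weights remain")
```

Under this assumed schedule, ten rounds leave roughly a tenth of the original weights, which is why iterative pruning can reach high sparsity while removing only a little at each step.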

Before moving ahead, a few points are worth knowing:

Usually, pruning is done after a neural network is trained on data.
The superfluous structures can be weights, neurons, filters, or channels. Here, however, we consider “sparse pruning”, which means pruning individual weights.
A heuristic is needed to decide whether a structure is superfluous or not. Common heuristics are based on magnitudes, gradients, or activations. Here, we choose magnitudes: we prune the weights with the lowest magnitudes.
Removing parts of a neural network somewhat damages the function it has learned. Hence, we train the model a bit more. This is known as fine-tuning.
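Magnitude-based weight pruning can be sketched in a few lines. This is a minimal illustration in plain Python, assuming the weights are given as a flat list; the helper names `magnitude_mask` and `apply_mask` and the example values are hypothetical.

```python
def magnitude_mask(weights, fraction):
    """Return a 0/1 mask that drops the `fraction` of weights
    with the smallest absolute value."""
    n_prune = int(len(weights) * fraction)
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]

def apply_mask(weights, mask):
    """Zero out every weight whose mask entry is 0."""
    return [w if m else 0.0 for w, m in zip(weights, mask)]

weights = [0.8, -0.05, 0.3, -1.2, 0.01, 0.6]  # made-up example values
mask = magnitude_mask(weights, fraction=0.5)
pruned = apply_mask(weights, mask)
print(mask)    # [1, 0, 0, 1, 0, 1]
print(pruned)  # [0.8, 0.0, 0.0, -1.2, 0.0, 0.6]
```

The mask is kept separate from the weights so that the same pruning pattern can be reapplied during fine-tuning, keeping pruned weights at zero.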

Pruning of this kind reportedly reduces network size by about 9x to 12x.

Can’t we randomly initialize a pruned network and train to convergence?

Earlier work on pruning reported that training a pruned model from scratch performs worse than retraining a pruned model, which may indicate the difficulty of training a network with a small capacity.

How to train pruned networks?

1. Randomly initialize the full network.
2. Train it and prune the superfluous structures.
3. Reset each remaining weight to the value it had after Step 1, that is, to its original random initialization.

This suggests that there exists a subnetwork inside a randomly initialized deep neural network which, when trained in isolation, can match or even outperform the accuracy of the original network.
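The three steps above, followed by training the reset subnetwork in isolation, can be sketched on a toy problem. This is only an illustration under stated assumptions: it uses a small linear model rather than a deep network, and the task, the hyperparameters, and the helper names `train` and `magnitude_mask` are all hypothetical.

```python
import random

def train(weights, mask, data, lr=0.05, epochs=200):
    """Stochastic gradient descent on squared error for a linear model.
    Weights whose mask entry is 0 are held at zero throughout."""
    w = [wi if mi else 0.0 for wi, mi in zip(weights, mask)]
    for _ in range(epochs):
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            # The mask factor blocks updates to pruned weights.
            w = [wi - lr * err * xi * mi for wi, xi, mi in zip(w, x, mask)]
    return w

def magnitude_mask(weights, fraction):
    """0/1 mask dropping the `fraction` of smallest-magnitude weights."""
    n_prune = int(len(weights) * fraction)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]

random.seed(0)
n = 8
# Assumed toy task: only the first two of eight inputs matter.
true_w = [2.0, -3.0] + [0.0] * (n - 2)
data = []
for _ in range(50):
    x = [random.uniform(-1, 1) for _ in range(n)]
    data.append((x, sum(t * xi for t, xi in zip(true_w, x))))

# Step 1: randomly initialize the full model and remember these values.
init_w = [random.gauss(0, 0.1) for _ in range(n)]
# Step 2: train the full model, then prune by magnitude.
trained_w = train(init_w, [1] * n, data)
mask = magnitude_mask(trained_w, fraction=0.75)
# Step 3: reset each surviving weight to its value from Step 1.
ticket_init = [wi if mi else 0.0 for wi, mi in zip(init_w, mask)]
# Train the sparse subnetwork in isolation from the reset weights.
ticket_w = train(ticket_init, mask, data)
print(mask)
print([round(w, 3) for w in ticket_w])
```

The key detail is that `ticket_init` reuses the original random values from Step 1 for the surviving weights instead of drawing a fresh random initialization.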

Advantages of Trained Pruned Networks

A fully-connected neural network trained on MNIST, with more than 600K parameters, is reportedly reduced to a subnetwork of 21K parameters that has the same accuracy as the original network.
Retention of the original features: the pruned network still works with dropout, weight decay, batchnorm, ResNets, your favourite optimizer, etc.
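To put the parameter counts quoted above in perspective, the quick calculation below converts them into a remaining fraction and a compression ratio. It takes 600K as the full size, although the text says "more than 600K", so the true ratio would be at least this large.

```python
full_params = 600_000    # lower bound quoted above for the full network
ticket_params = 21_000   # size quoted above for the subnetwork

remaining = ticket_params / full_params
compression = full_params / ticket_params
print(f"{remaining:.1%} of the parameters remain")   # 3.5% of the parameters remain
print(f"about {compression:.1f}x fewer parameters")  # about 28.6x fewer parameters
```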

Further Scope of Research