Capsule Neural Networks (Capsnets) are a type of artificial neural network (ANN) whose main objective is to mimic biological neural organization more closely in order to improve segmentation and recognition. The word capsule here refers to a nested layer: a group of neurons inside a layer of the network. Capsules determine the parameters of the features in an object. When identifying a face, capsules not only determine whether facial features are present or absent but also take into account the parameters that describe how those features are arranged. This means the system will detect a face only if the features detected by the capsules are in the correct spatial arrangement.
The job of capsules is to perform inverse image rendering: given an image, they recover the instantiation parameters of an object, such as its angle, scale, and position, by analyzing it against the object samples seen in the training set.
Working of Capsules –
- First, each capsule multiplies its input vectors by weight matrices. These matrices encode the spatial relationship between low-level features and high-level features, and the products are the capsule's predictions for the higher-level capsules.
- Then the capsules decide on their parent capsule. The parent capsule is selected by dynamic routing.
- After the parent capsules have been decided, the prediction vectors are combined in a weighted sum, and the result is squashed so that its length lies between 0 and 1 while its direction is retained. The norm of the squashed vector serves as the existence probability, and the cosine of the angle between vectors serves as the measure of agreement.
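The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the weight matrices `W`, the inputs `u`, and the uniform coupling coefficients `c` are made-up values, and the `squash` formula is the one commonly given for dynamic-routing capsule networks, which this article does not state explicitly.

```python
import math

def matvec(W, u):
    """Multiply matrix W (a list of rows) by vector u."""
    return [sum(w * x for w, x in zip(row, u)) for row in W]

def length(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def squash(s, eps=1e-9):
    """Scale s so its length falls in [0, 1) while its direction is kept."""
    norm_sq = sum(x * x for x in s)
    norm = math.sqrt(norm_sq)
    scale = norm_sq / (1.0 + norm_sq) / (norm + eps)
    return [scale * x for x in s]

# Outputs of two lower-level capsules (2-D) and hypothetical 3x2 weight matrices.
u = [[1.0, 0.0], [0.0, 2.0]]
W = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [0.0, 1.0], [1.0, 0.0]],
]

# Step 1: prediction vectors, one per lower-level capsule.
u_hat = [matvec(W_i, u_i) for W_i, u_i in zip(W, u)]

# Step 2: coupling coefficients. Uniform here (an assumption);
# dynamic routing would adjust them iteratively.
c = [0.5, 0.5]

# Step 3: weighted sum of the predictions, then squash.
s = [sum(c_i * uh[k] for c_i, uh in zip(c, u_hat)) for k in range(3)]
v = squash(s)

print("weighted sum:", s)
print("squashed output:", v, "length:", length(v))
```

The length of `v` can be read as the probability that the entity represented by the parent capsule is present, while the direction of `v` carries its instantiation parameters.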
What is Dynamic Routing?
During dynamic routing, the lower-level capsules send their output to the most suitable higher-level capsule. The capsule that receives the output of the lower-level capsules is called the parent capsule. Routing follows an agreement-and-assignment mechanism, which can be based on the dot product, on expectation-maximization, or on mixture models. In the dot-product variant, the capsule with the largest dot product is chosen as the parent capsule. This dot product is taken between the prediction vector computed by the lower-level capsule (its output multiplied by the weight matrix) and the output vector of the candidate parent capsule.
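A rough sketch of the dot-product variant of routing is given below. It follows the commonly described iterative procedure (softmax over routing logits, weighted sum, squash, then an agreement update). The function name `route`, the three iterations, and the example prediction vectors are all assumptions made for illustration. Note that the assignment here is soft: each lower-level capsule spreads its output over all parents, and the parent it agrees with most ends up with the largest share.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def length(v):
    return math.sqrt(dot(v, v))

def squash(s, eps=1e-9):
    norm_sq = dot(s, s)
    scale = norm_sq / (1.0 + norm_sq) / (math.sqrt(norm_sq) + eps)
    return [scale * x for x in s]

def route(u_hat, iterations=3):
    """u_hat[i][j] is the prediction of lower capsule i for parent j."""
    n_lower, n_parent, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_parent for _ in range(n_lower)]  # routing logits
    for _ in range(iterations):
        # Each lower capsule distributes its output over the parents.
        c = [softmax(row) for row in b]
        v = []
        for j in range(n_parent):
            s_j = [sum(c[i][j] * u_hat[i][j][k] for i in range(n_lower))
                   for k in range(dim)]
            v.append(squash(s_j))
        # Agreement step: raise the logit where prediction and parent output align.
        for i in range(n_lower):
            for j in range(n_parent):
                b[i][j] += dot(u_hat[i][j], v[j])
    return c, v

# Hypothetical predictions: all three lower capsules roughly agree about
# parent 0, but contradict each other about parent 1.
u_hat = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.1], [-1.0, 0.0]],
    [[0.9, 0.0], [0.0, 1.0]],
]
c, v = route(u_hat)
for i, row in enumerate(c):
    print("lower capsule", i, "coupling:", [round(x, 3) for x in row])
print("parent lengths:", [round(length(p), 3) for p in v])
```

In this toy input every lower-level capsule ends up coupled more strongly to parent 0, and parent 0's output vector is the longer of the two.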
Dynamic routing can also be explained through the following example:
Suppose we give a system an image and ask it to recognize what the image shows.
The picture shows a house from four different viewpoints. A CNN can easily recognize the front view of the house, which it saw during training, but it will have serious trouble identifying the house from the top view. This is where capsules come into play.
Capsules detect the roof and the walls very easily, but not every roof belongs to a house. So they analyze the part of the image that stays constant, i.e. the co-ordinate frame of the house capsule with respect to both the roof and the walls. Both the roof and the walls make a prediction about the house so as to decide whether the object is a house or not. These predictions are then sent to the mid-level capsule. Only if the prediction from the roof and the prediction from the walls match is the object said to be a house. This process is called routing by agreement.
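The house example can be mimicked with a deliberately simplified toy in which a pose is just an (x, y) position and each part "predicts" where the house should be by adding a fixed offset. The offsets, the tolerance, and the function names are hypothetical. A real capsule network learns these relationships as weight matrices and measures agreement between vectors rather than comparing coordinates.

```python
def predict_house(part_position, offset_to_house):
    """Where the house would be if this part really belongs to a house."""
    return (part_position[0] + offset_to_house[0],
            part_position[1] + offset_to_house[1])

def agree(p, q, tol=0.5):
    """Two predictions agree if they land close to each other."""
    return abs(p[0] - q[0]) <= tol and abs(p[1] - q[1]) <= tol

# Hypothetical part-to-whole offsets: the roof sits above the house centre,
# the walls sit slightly below it.
ROOF_TO_HOUSE = (0.0, -2.0)
WALLS_TO_HOUSE = (0.0, 1.0)

def is_house(roof_position, walls_position):
    from_roof = predict_house(roof_position, ROOF_TO_HOUSE)
    from_walls = predict_house(walls_position, WALLS_TO_HOUSE)
    return agree(from_roof, from_walls)

print(is_house((5.0, 8.0), (5.0, 5.0)))  # roof directly above the walls
print(is_house((9.0, 2.0), (5.0, 5.0)))  # roof off to the side and below
```

In the first call both parts predict the same house position, so they agree; in the second the predictions land far apart, so no house is reported even though a roof and walls are both present.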
General architecture of Capsule networks –
Encoder – It takes the image as input and represents it as a vector that contains all the instantiation parameters needed to render the image. The encoder consists of:
- Convolutional layer – It detects basic features in the image.
- PrimaryCaps layer – It produces combinations of the basic features detected by the convolutional layer.
- DigitCaps layer – This is the highest level capsule layer that contains all the instantiation parameters.
Decoder – Its job is to decode the 16-dimensional vector from DigitCaps back into an image. It tries to recreate the input image from that vector alone, which forces the capsules to learn features that are useful for reconstructing the image. The decoder consists of three fully connected (dense) layers.
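To make the architecture more concrete, the sketch below works out the layer sizes for the MNIST configuration usually quoted for the original dynamic-routing CapsNet. The kernel sizes, strides, channel counts, and decoder widths are taken from that commonly cited setup and are assumptions as far as this article is concerned, since only the 16-dimensional DigitCaps vector and the three dense decoder layers are stated here.

```python
def conv_out(size, kernel, stride):
    """Output width of an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1

IMAGE_SIDE = 28                      # MNIST images are 28x28 pixels

# Encoder
conv1_side = conv_out(IMAGE_SIDE, kernel=9, stride=1)     # convolutional layer
primary_side = conv_out(conv1_side, kernel=9, stride=2)   # PrimaryCaps grid
PRIMARY_CHANNELS, PRIMARY_DIM = 32, 8                     # assumed values
n_primary_caps = primary_side * primary_side * PRIMARY_CHANNELS
N_CLASSES, DIGIT_DIM = 10, 16                             # DigitCaps: one capsule per digit

# One weight matrix per (primary capsule, digit capsule) pair.
routing_weights = n_primary_caps * N_CLASSES * PRIMARY_DIM * DIGIT_DIM

# Decoder: three dense layers ending in one unit per pixel.
decoder_layers = [512, 1024, IMAGE_SIDE * IMAGE_SIDE]

print("conv output:", conv1_side, "x", conv1_side)
print("primary capsules:", n_primary_caps, "of dimension", PRIMARY_DIM)
print("digit capsules:", N_CLASSES, "of dimension", DIGIT_DIM)
print("routing weights:", routing_weights)
print("decoder layer widths:", decoder_layers)
```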
CNN and Capsnets –
The main idea behind the introduction of Capsnets was to reduce the training set size, which is usually very large in the case of a CNN (Convolutional Neural Network). A CNN is also a type of neural network, but the quality of its output depends on the volume of the training set. In the case of CNNs, the training and testing set sizes can range from 10M to 60M. CNNs also have a major drawback: they are not able to adjust to changes in viewpoint.
If a particular image is inverted, a CNN may not be able to identify the picture. Capsnets exploit the fact that viewpoint changes have a nonlinear effect at the pixel level but a linear effect at the object level. Capsnets are able to adjust to viewpoint changes because they learn the linear manifold between an object and its pose as a matrix of weights.
Here, linear manifold refers to a linear relation between various object vectors in n-dimensional Euclidean space.
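The claim that a viewpoint change acts linearly at the object level can be checked with small pose matrices. The sketch below uses 2-D poses in homogeneous coordinates (3x3 matrices). The helper `pose`, the roof pose, and the part-to-whole matrix `W_ROOF_TO_HOUSE` are hypothetical values chosen for illustration. The point is that the same fixed matrix `W_ROOF_TO_HOUSE` gives a correct house pose before and after the viewpoint changes.

```python
import math

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def pose(angle_deg, tx, ty):
    """2-D rotation plus translation as a 3x3 homogeneous matrix."""
    a = math.radians(angle_deg)
    return [[math.cos(a), -math.sin(a), tx],
            [math.sin(a), math.cos(a), ty],
            [0.0, 0.0, 1.0]]

roof_pose = pose(0, 5, 8)             # where the roof is seen in the image
W_ROOF_TO_HOUSE = pose(0, 0, -2)      # fixed relation, independent of viewpoint

# Prediction of the house pose from the roof pose.
house_pose = matmul(roof_pose, W_ROOF_TO_HOUSE)

# Change the viewpoint: rotate the whole scene by 90 degrees.
V = pose(90, 0, 0)
house_from_rotated_roof = matmul(matmul(V, roof_pose), W_ROOF_TO_HOUSE)
rotated_house = matmul(V, house_pose)

same = all(abs(house_from_rotated_roof[i][j] - rotated_house[i][j]) < 1e-9
           for i in range(3) for j in range(3))
print("prediction follows the viewpoint change:", same)
```

Because matrix multiplication is associative, transforming the part and then predicting the whole gives the same result as predicting the whole and then transforming it, so no new weights are needed for the new viewpoint.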
CNNs use max pooling, while capsules maintain a weighted sum of the features of the previous layer, which is more suitable for detecting overlapping features. This property of Capsnets is really helpful in identifying overlapping digits in handwriting. CNNs identify objects using many stacked layers, which slows down the recognition process; Capsnets instead avoid very deep stacks by nesting layers within a single layer.
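The difference between the two ways of combining features can be shown with a toy case of two overlapping features. The feature vectors and the weights below are invented for illustration; in a capsule network the weights would come from routing rather than being fixed by hand.

```python
import math

def max_pool(vectors):
    """Keep only the strongest feature, as max pooling keeps the largest activation."""
    return max(vectors, key=lambda v: math.hypot(*v))

def weighted_sum(vectors, weights):
    """Combine all features, each scaled by its weight."""
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]

# Two overlapping strokes, each described by a 2-D feature vector.
stroke_a = [0.7, 0.0]
stroke_b = [0.0, 0.9]

pooled = max_pool([stroke_a, stroke_b])
combined = weighted_sum([stroke_a, stroke_b], [0.4, 0.6])

print("max pooling keeps:", pooled)      # stroke_a is discarded entirely
print("weighted sum keeps:", combined)   # both strokes still contribute
```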
Capsnets have so far mainly been tested on MNIST (a large database of handwritten digits that is commonly used for training various image processing systems), and they struggle on more complex data such as that found in ImageNet. Capsules also take longer to train. Despite these drawbacks, they may still have a long way to go in the future.