# VGG-16 | CNN model

**ImageNet Large Scale Visual Recognition Challenge** (ILSVRC) is an annual computer vision competition. Each year, teams compete on several tasks, including object detection (finding objects from *200* classes within an image), object localization (predicting where the labeled object is in the image), and image classification (assigning each image one of *1000* categories). VGG-16 was proposed by Karen Simonyan and Andrew Zisserman of the Visual Geometry Group at Oxford University in 2014 in the paper “Very Deep Convolutional Networks for Large-Scale Image Recognition”. This model won first place in localization and second place in classification in the 2014 ILSVRC challenge.

This model achieves *92.7% top-5* test accuracy on ImageNet. The full ImageNet dataset contains more than *14* million images, and the ILSVRC subset used for this benchmark covers *1000* classes.

**Objective:** The input to the network is a fixed-size *224 × 224* RGB image, so we have a tensor of shape *(224, 224, 3)* as our input. The model processes the input image and outputs a vector of *1000* values, where each value is the predicted probability of the corresponding class:

$$\hat{y} = [\hat{y}_0, \hat{y}_1, \ldots, \hat{y}_{999}]^T$$

Suppose we have a model that predicts that the image belongs to *class 0* with probability *0.1*, *class 1* with probability *0.05*, *class 2* with probability *0.05*, *class 3* with probability *0.03*, *class 780* with probability *0.72*, *class 999* with probability *0.05*, and all other classes with probability *0*. The classification vector for this example is then:

$$\hat{y} = [0.1,\ 0.05,\ 0.05,\ 0.03,\ 0,\ \ldots,\ 0,\ 0.72,\ 0,\ \ldots,\ 0,\ 0.05]^T$$

where *0.72* sits at index *780* and the final entry is index *999*. To make sure these probabilities add up to *1*, we use the softmax function.

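The softmax function maps the raw class scores $z_0, \ldots, z_{999}$ to probabilities and is defined as follows:

$$P(y = j \mid z) = \frac{e^{z_j}}{\sum_{k=0}^{999} e^{z_k}}$$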
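
As a rough illustration of softmax, the sketch below turns a short list of raw scores into probabilities. The function name `softmax` and the four example scores are assumptions chosen for illustration; they do not come from the VGG paper.

```python
import math

def softmax(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow and does not change the result.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for a 4-class toy problem.
logits = [2.0, 1.0, 0.1, -1.0]
probs = softmax(logits)
print(probs)       # the largest score receives the largest probability
print(sum(probs))  # approximately 1.0
```

In VGG-16 the same operation would be applied to the 1000 scores produced by the last fully connected layer.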

After this, we take the *5* most probable candidates as the prediction vector $C = [c_1, c_2, c_3, c_4, c_5]$, which in this example contains classes *780*, *0*, *1*, *2* and *999*. The ground-truth vector holds the $n$ true labels of the image:

$$G = [G_1, G_2, \ldots, G_n]$$

Then we define our error function as follows:

$$E = \frac{1}{n} \sum_{k} \min_{i} d(c_i, G_k)$$

where $d = 0$ if $c_i = G_k$, else $d = 1$.

In this example, all the categories in the ground truth are in the predicted top-5 vector, so every $\min_{i} d(c_i, G_k)$ term is $0$ and the loss becomes:

$$E = 0$$
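
The top-5 selection and the error function can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the helper names `top_k_classes` and `top_k_error` are hypothetical, class 0 is given probability 0.1 so that the example probabilities sum to 1, and the two ground-truth lists are made up to show one hit and one miss.

```python
def top_k_classes(probs, k=5):
    """Return the indices of the k largest probabilities, most probable first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

def top_k_error(predicted, ground_truth):
    """E = (1/n) * sum over k of min over i of d(c_i, G_k); d is 0 on a match, else 1."""
    n = len(ground_truth)
    return sum(min(0 if c == g else 1 for c in predicted) for g in ground_truth) / n

# Example probabilities from the text (class 0 assumed to be 0.1).
probs = [0.0] * 1000
probs[0], probs[1], probs[2], probs[3] = 0.1, 0.05, 0.05, 0.03
probs[780], probs[999] = 0.72, 0.05

top5 = top_k_classes(probs)
print(top5)

# Hypothetical ground-truth labels, for illustration only.
print(top_k_error(top5, [780]))  # label is inside the top 5
print(top_k_error(top5, [3]))    # label is outside the top 5
```

Class 3 has the sixth-highest probability, so a ground truth of class 3 counts as an error even though the model gave it a non-zero score.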

**VGG Architecture:** The input to the network is an image of dimensions *(224, 224, 3)*. The first two layers are convolution layers with *64* filters of size *3 × 3* and same padding. After a max-pooling layer of stride *(2, 2)*, there are two convolution layers with *128* filters of size *(3, 3)*, followed by another max-pooling layer of stride *(2, 2)*. Then there are *3* convolution layers with *256* filters of size *(3, 3)* and a max-pooling layer. After that, there are *2* sets of *3* convolution layers, each set followed by a max-pooling layer; every one of these convolution layers has *512* filters of size *(3, 3)* with same padding. Throughout this stack the filters are *3 × 3*, instead of the *11 × 11* filters in AlexNet and the *7 × 7* filters in ZF-Net. Some configurations also use *1 × 1* convolutions, which act as a linear transformation of the input channels followed by a non-linearity. The input to each *3 × 3* convolution layer is padded by *1 pixel* (same padding) so that the spatial resolution of the feature map is preserved.
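
To make the layer sequence concrete, the sketch below traces the feature-map shape through the five convolution blocks. It assumes configuration D of the paper (2, 2, 3, 3 and 3 convolution layers per block) and 2 × 2 max pooling with stride 2; `BLOCKS` and `trace_conv_stack` are hypothetical names used only for this illustration.

```python
# (convolution layers in the block, filters per layer), assuming configuration D.
BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

def trace_conv_stack(height=224, width=224):
    """Return the output shape after each block and the number of conv layers."""
    shapes = []
    conv_layers = 0
    for num_convs, filters in BLOCKS:
        # 3x3 convolutions with stride 1 and 1-pixel padding keep height and width.
        conv_layers += num_convs
        # 2x2 max pooling with stride 2 halves height and width.
        height, width = height // 2, width // 2
        shapes.append((height, width, filters))
    return shapes, conv_layers

shapes, conv_layers = trace_conv_stack()
for shape in shapes:
    print(shape)
print(conv_layers)
```

The 13 convolution layers plus the 3 fully connected layers give the 16 weight layers in the model's name.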

After the stack of convolution and max-pooling layers, we get a *(7, 7, 512)* feature map. We flatten this output into a *(1, 25088)* feature vector. After this there are *3* fully connected layers: the first takes the flattened feature vector as input and outputs a *(1, 4096)* vector, the second also outputs a *(1, 4096)* vector, and the third outputs *1000* values, one for each of the *1000* classes of the ILSVRC challenge. A softmax function is applied to the output of the third fully connected layer to produce the class probabilities. All hidden layers use ReLU as their activation function. ReLU is cheap to compute, tends to speed up training, and reduces the likelihood of vanishing gradients.
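
The size of the flattened vector and of the three fully connected layers follows from simple arithmetic, sketched below. The helper `dense_params` is a hypothetical name, and the count assumes each layer has one weight per input-output pair plus one bias per output.

```python
def dense_params(inputs, outputs):
    """Weights plus biases of one fully connected layer."""
    return inputs * outputs + outputs

def relu(x):
    """ReLU activation: negative inputs become 0, positive inputs pass through."""
    return max(0.0, x)

flat_size = 7 * 7 * 512
layer_sizes = [flat_size, 4096, 4096, 1000]
fc_params = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(flat_size)        # length of the flattened feature vector
print(fc_params)        # parameters in each fully connected layer
print(sum(fc_params))   # parameters in the whole classifier head
print(relu(-2.0), relu(3.0))
```

Most of the network's parameters sit in the first fully connected layer, because it connects all 25088 flattened features to 4096 units.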

**Configuration:** The configuration table in the paper lists the different VGG architectures (A to E). There are 2 versions of VGG-16 (C and D). They differ only in that, in some convolution layers, D uses *(3, 3)* filters where C uses *(1, 1)* filters. The two contain *134* million and *138* million parameters respectively.
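
The two parameter counts can be checked with a short calculation. The sketch below assumes that configuration C replaces the third convolution of the 256-filter block and of both 512-filter blocks with a 1 × 1 convolution, as in the paper's configuration table, and that every layer has biases. `CONFIG_C`, `CONFIG_D`, `FC_SIZES` and `count_params` are hypothetical names.

```python
# Each block lists (kernel size, filters) for its convolution layers.
CONFIG_D = [
    [(3, 64), (3, 64)],
    [(3, 128), (3, 128)],
    [(3, 256), (3, 256), (3, 256)],
    [(3, 512), (3, 512), (3, 512)],
    [(3, 512), (3, 512), (3, 512)],
]
CONFIG_C = [
    [(3, 64), (3, 64)],
    [(3, 128), (3, 128)],
    [(3, 256), (3, 256), (1, 256)],
    [(3, 512), (3, 512), (1, 512)],
    [(3, 512), (3, 512), (1, 512)],
]
FC_SIZES = [7 * 7 * 512, 4096, 4096, 1000]

def count_params(config, in_channels=3):
    """Total weights and biases of the conv stack plus the fully connected layers."""
    total = 0
    channels = in_channels
    for block in config:
        for kernel, filters in block:
            total += kernel * kernel * channels * filters + filters
            channels = filters
    for inputs, outputs in zip(FC_SIZES, FC_SIZES[1:]):
        total += inputs * outputs + outputs
    return total

print(count_params(CONFIG_C))  # roughly 134 million
print(count_params(CONFIG_D))  # roughly 138 million
```

The gap of about 4.7 million parameters comes entirely from the three layers where a 3 × 3 kernel has nine times as many weights as a 1 × 1 kernel.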

**Object Localization in Image:** To perform localization, we replace the class scores with bounding-box coordinates. A bounding box is represented by a 4-D vector (center coordinates (x, y), height, width). There are two versions of the localization architecture: in one, the bounding box is shared among all classes (the output is a *4*-parameter vector), and in the other, the bounding box is class-specific (the output is a *4000*-parameter vector). The paper experimented with both approaches on the VGG-16 (D) architecture. We also need to change the loss from a classification loss to a regression loss (such as MSE) that penalizes the deviation of the predicted bounding box from the ground truth.
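
A minimal sketch of the two output layouts and of an MSE regression loss is shown below. The box ordering (x, y, height, width), the contiguous four-values-per-class layout of the 4000-parameter output, and the names `mse_loss` and `select_class_box` are assumptions made for illustration; the paper does not prescribe this layout.

```python
NUM_CLASSES = 1000
SHARED_OUTPUTS = 4                        # one box shared by all classes
CLASS_SPECIFIC_OUTPUTS = 4 * NUM_CLASSES  # one box per class

def mse_loss(predicted_box, true_box):
    """Mean squared error between two (x, y, height, width) boxes."""
    squared = [(p - t) ** 2 for p, t in zip(predicted_box, true_box)]
    return sum(squared) / len(squared)

def select_class_box(outputs, class_index):
    """Pick the 4 values for one class, assuming 4 contiguous values per class."""
    start = 4 * class_index
    return outputs[start:start + 4]

# Hypothetical boxes in normalized image coordinates.
true_box = [0.50, 0.50, 0.40, 0.30]
predicted_box = [0.60, 0.50, 0.40, 0.30]
print(mse_loss(predicted_box, true_box))

# Class-specific case: fill in the slot for class 780 and read it back.
outputs = [0.0] * CLASS_SPECIFIC_OUTPUTS
outputs[4 * 780:4 * 780 + 4] = predicted_box
print(select_class_box(outputs, 780))
```

The loss is zero only when the predicted box matches the ground truth exactly, and it grows with the squared distance between the two.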

**Results:** VGG-16 was one of the best-performing architectures in the ILSVRC 2014 challenge. It was the runner-up in the classification task with a top-5 classification error of *7.32%* (behind only GoogLeNet, with a classification error of *6.66%*). It also won the localization task with a localization error of *25.32%*.

**Limitations Of VGG 16:**

- It is very slow to train (the original VGG model was trained on four NVIDIA Titan Black GPUs for 2-3 weeks).
- The size of the VGG-16 weights trained on ImageNet is *528* MB, so the model takes a lot of disk space and bandwidth, which makes it inefficient to store and distribute.
- Its *138* million parameters can lead to the exploding gradients problem.
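
The quoted weight size is consistent with a back-of-the-envelope estimate. The sketch below assumes the weights are stored as 32-bit floats (4 bytes each) with no compression or file overhead, and uses the exact parameter count of configuration D; the variable names are illustrative.

```python
PARAMS = 138_357_544      # total parameters of VGG-16 (configuration D)
BYTES_PER_FLOAT32 = 4     # assumption: weights stored as 32-bit floats

size_bytes = PARAMS * BYTES_PER_FLOAT32
size_mib = size_bytes / 2**20  # 1 MiB = 1,048,576 bytes

print(size_bytes)
print(round(size_mib))  # close to the 528 MB quoted above
```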

**Further advancements:** ResNets were later introduced to address the gradient problems that occur when training very deep networks such as VGG-16.