VGG-16 | CNN model

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is an annual computer vision competition. Each year, teams compete on two tasks. The first is to detect objects within an image coming from 200 classes, which is called object localization. The second is to classify images, each labeled with one of 1000 categories, which is called image classification. VGG-16 was proposed by Karen Simonyan and Andrew Zisserman of the Visual Geometry Group Lab of Oxford University in 2014 in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition". The model won 1st place in the localization task and 2nd place in the classification task in the 2014 ILSVRC challenge.

VGG-16 architecture

This model achieves 92.7% top-5 test accuracy on the ImageNet dataset, which contains 14 million images belonging to 1000 classes.

Objective:
The ImageNet dataset contains images of fixed size 224×224 with RGB channels, so we have a tensor of shape (224, 224, 3) as our input. The model processes the input image and outputs a vector of 1000 values:


\kern 6pc \hat{y} = \begin{bmatrix} \hat{y}_{0} \\ \hat{y}_{1} \\ \hat{y}_{2} \\ \hat{y}_{3} \\ \vdots \\ \hat{y}_{999} \end{bmatrix}

This vector represents the predicted probability for the corresponding class. Suppose the model predicts that an image belongs to class 0 with probability 0.1, class 1 with probability 0.05, class 2 with probability 0.05, class 3 with probability 0.03, class 780 with probability 0.72, class 999 with probability 0.05, and all other classes with probability 0. The classification vector for this example will be:


\kern 6pc \hat{y} = \begin{bmatrix} \hat{y}_{0} = 0.1 \\ 0.05 \\ 0.05 \\ 0.03 \\ \vdots \\ \hat{y}_{780} = 0.72 \\ \vdots \\ \hat{y}_{999} = 0.05 \end{bmatrix}

To make sure these probabilities add up to 1, we use the softmax function, defined as:

\kern 6pc \hat{y}_{i} = \frac{e^{z_{i}}}{\sum_{j=0}^{999} e^{z_{j}}}

where z_{i} is the network's raw output (logit) for class i.
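As a quick illustration, here is a minimal NumPy sketch of the softmax function (the toy logits below are made up for illustration):

import numpy as np

def softmax(z):
    # Subtract the max logit for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical raw scores for a 4-class toy example.
z = np.array([2.0, 1.0, 0.1, -1.0])
p = softmax(z)
print(p, p.sum())  # probabilities that sum to 1.0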
After this, we take the 5 most probable candidates and collect them into a vector:


C =\begin{bmatrix} 780\\  0\\  1\\  2\\  999 \end{bmatrix}

and our ground truth vector is defined as follows:

G = \begin{bmatrix} G_{0}\\  G_{1}\\  G_{2} \end{bmatrix}=\begin{bmatrix} 780\\  2\\  999 \end{bmatrix}

Then we define our Error function as follows:


\kern 6pc E = \frac{1}{n}\sum_{k}\min_{i} d(c_{i}, G_{k})
\kern 6pc \text{where } d(c_{i}, G_{k}) = 0 \text{ if } c_{i} = G_{k} \text{, else } d(c_{i}, G_{k}) = 1

So, the error for this example is:

\kern 6pc E = \frac{1}{3}\left( \min_{i} d(c_{i}, G_{0}) + \min_{i} d(c_{i}, G_{1}) + \min_{i} d(c_{i}, G_{2}) \right)

So,
\kern 6pc E = \frac{1}{3}(0 + 0 + 0)
\kern 6pc E = 0
Since all three ground-truth categories appear in the predicted top-5 vector, the error is 0.
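To make this concrete, here is a small NumPy sketch that computes this top-5 error for the example above (the function and variable names are my own):

import numpy as np

def top5_error(C, G):
    # min_i d(c_i, G_k) is 0 if any top-5 candidate equals G_k, else 1;
    # the error averages these minima over the ground-truth labels.
    return np.mean([0.0 if g in C else 1.0 for g in G])

C = [780, 0, 1, 2, 999]   # predicted top-5 candidates
G = [780, 2, 999]         # ground-truth labels
print(top5_error(C, G))   # 0.0, since every label appears in the top-5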

 
Architecture:
The input to the network is an image of dimensions (224, 224, 3). The first two layers have 64 channels with a 3×3 filter size and same padding. Then, after a max-pooling layer of stride (2, 2), come two convolution layers with 128 filters of size (3, 3), followed by another max-pooling layer of stride (2, 2). Next there are 3 convolution layers with 256 filters of size (3, 3) and another max-pooling layer. After that come 2 sets of 3 convolution layers, each set followed by a max-pooling layer; every convolution layer in these sets has 512 filters of size (3, 3) with same padding. Throughout the network, the filters are of size 3×3, instead of 11×11 as in AlexNet or 7×7 as in ZF-Net. Some configurations also use 1×1 convolutions to manipulate the number of input channels. A padding of 1 pixel (same padding) is applied in each convolution layer to preserve the spatial resolution of the image.

VGG-16 architecture map

After the stack of convolution and max-pooling layers, we get a (7, 7, 512) feature map, which we flatten into a (1, 25088) feature vector. This is followed by 3 fully connected layers: the first takes the feature vector as input and outputs a (1, 4096) vector, the second also outputs a vector of size (1, 4096), and the third outputs 1000 channels for the 1000 classes of the ILSVRC challenge. The output of the third fully connected layer is passed to a softmax layer to normalize the classification vector, and the top-5 categories are then taken from the classification vector for evaluation. All the hidden layers use ReLU as the activation function. ReLU is more computationally efficient, results in faster learning, and decreases the likelihood of the vanishing gradient problem.
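Putting the pieces together, here is a minimal sketch of the VGG-16 (configuration D) layer stack in Keras; the framework choice is mine, not the paper's:

from tensorflow.keras import layers, models

def build_vgg16(num_classes=1000):
    model = models.Sequential()
    model.add(layers.Input(shape=(224, 224, 3)))
    # Block configuration for VGG-16 (D): (number of conv layers, filters).
    for n_convs, filters in [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]:
        for _ in range(n_convs):
            model.add(layers.Conv2D(filters, (3, 3), padding="same",
                                    activation="relu"))
        model.add(layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    model.add(layers.Flatten())  # (7, 7, 512) feature map -> 25088 vector
    model.add(layers.Dense(4096, activation="relu"))
    model.add(layers.Dense(4096, activation="relu"))
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model

model = build_vgg16()
model.summary()  # roughly 138 million parameters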

Configuration:
The table below lists the different VGG configurations. We can see that there are 2 versions of VGG-16 (C and D). There is not much difference between them, except that configuration D uses (3, 3) convolutions in place of some of the (1, 1) convolutions in configuration C. These two contain 134 million and 138 million parameters, respectively.


Different VGG Configuration

Object Localization In Image:
To perform localization, we replace the class scores with bounding box location candidates. A bounding box location is represented by a 4-D vector (center coordinates, height, width). There are two versions of the localization architecture: in one, the bounding box is shared among all classes (the output is a 4-parameter vector), and in the other, the bounding box is class-specific (the output is a 4000-parameter vector). The paper experimented with both approaches on the VGG-16 (D) architecture. We also need to change the loss from a classification loss to a regression loss (such as MSE) that penalizes the deviation of the predicted bounding box from the ground truth, as sketched below.
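For instance, the regression loss on a 4-D bounding box vector could be a simple MSE, sketched here in NumPy (a simplified illustration; the paper's exact parameterization is not reproduced):

import numpy as np

def bbox_mse(pred, gt):
    # pred and gt are 4-D vectors: (center_x, center_y, height, width).
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    return np.mean((pred - gt) ** 2)

# Hypothetical predicted and ground-truth boxes.
print(bbox_mse([112.0, 108.0, 64.0, 48.0], [110.0, 112.0, 60.0, 50.0]))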

Results:
VGG-16 was one of the best performing architectures in the ILSVRC 2014 challenge. It was the runner-up in the classification task, with a top-5 classification error of 7.32% (behind only GoogLeNet, with a classification error of 6.66%). It was also the winner of the localization task, with a 25.32% localization error.
Challenges Of VGG 16:

  • It is very slow to train (the original VGG model was trained on Nvidia Titan Black GPUs for 2-3 weeks).
  • The VGG-16 ImageNet-trained weights are 528 MB in size, so the model takes up a lot of disk space and bandwidth, which makes it inefficient to deploy.
References:

  • Karen Simonyan and Andrew Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition", arXiv:1409.1556, 2014.



