VGG-16 | CNN model

Last Updated : 21 Mar, 2024

A Convolutional Neural Network (CNN) architecture is a deep learning model designed for processing structured grid-like data, such as images. It consists of multiple layers, including convolutional, pooling, and fully connected layers. CNNs are highly effective for tasks like image classification, object detection, and image segmentation due to their hierarchical feature extraction capabilities.

VGG-16

The VGG-16 model is a convolutional neural network (CNN) architecture that was proposed by the Visual Geometry Group (VGG) at the University of Oxford. It is characterized by its depth, consisting of 16 layers, including 13 convolutional layers and 3 fully connected layers. VGG-16 is renowned for its simplicity and effectiveness, as well as its ability to achieve strong performance on various computer vision tasks, including image classification and object recognition. The model’s architecture features a stack of convolutional layers followed by max-pooling layers, with progressively increasing depth. This design enables the model to learn intricate hierarchical representations of visual features, leading to robust and accurate predictions. Despite its simplicity compared to more recent architectures, VGG-16 remains a popular choice for many deep learning applications due to its versatility and excellent performance.

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is an annual competition in computer vision where teams tackle tasks including object localization and image classification. VGG-16, proposed by Karen Simonyan and Andrew Zisserman in 2014, achieved top ranks in both tasks at ILSVRC 2014, where images are classified into 1000 categories.

VGG-16 architecture

This model achieves 92.7% top-5 test accuracy on ImageNet. The full ImageNet dataset contains over 14 million images; the ILSVRC classification subset used for this result covers 1000 classes.

VGG-16 Model Objective:

ImageNet images come in varying sizes, so each one is resized and cropped to a fixed size of 224×224 with RGB channels. The input is therefore a tensor of shape (224, 224, 3). The model processes the input image and outputs a vector of 1000 values:

[Tex]\hat{y} =\begin{bmatrix} \hat{y}_{0}\\ \hat{y}_{1} \\ \hat{y}_{2} \\ \vdots \\ \hat{y}_{999} \end{bmatrix} [/Tex]

This vector holds the classification probability for each class. Suppose the model predicts that the image belongs to class 0 with probability 0.1, class 1 with probability 0.05, class 2 with probability 0.05, class 3 with probability 0.03, class 780 with probability 0.72, class 999 with probability 0.05, and every other class with probability 0.

So, the classification vector for this example will be:

[Tex]\hat{y}=\begin{bmatrix} \hat{y}_{0} = 0.1\\ 0.05\\ 0.05\\ 0.03\\ \vdots\\ \hat{y}_{780} = 0.72\\ \vdots\\ \hat{y}_{999} = 0.05 \end{bmatrix} [/Tex]

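To make sure these probabilities add up to 1, we apply the softmax function to the raw scores [Tex]z_i[/Tex] produced by the last fully connected layer.

The softmax function over n = 1000 classes is defined as follows:

[Tex]\hat{y}_i = \frac{e^{z_i}}{\sum_{j=0}^{n-1} e^{z_j}}[/Tex]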
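The softmax formula can be tried out on a handful of scores. The following is a minimal sketch in plain Python; the function name `softmax` and the four example scores are illustrative assumptions, not part of VGG-16 itself (which would produce 1000 scores).

```python
import math

def softmax(logits):
    """Map raw scores (logits) to probabilities that sum to 1."""
    # Subtracting the maximum leaves the result unchanged
    # but keeps exp() from overflowing on large scores.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for a 4-class toy problem.
probs = softmax([2.0, 1.0, 0.1, -1.0])
print(probs)       # largest score gets the largest probability
print(sum(probs))  # the probabilities add up to 1 (up to rounding)
```

Note that softmax preserves the ordering of the scores, so the most probable classes can be read off either from the scores or from the probabilities.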

After this, we collect the 5 most probable classes into a candidate vector:

[Tex]C =\begin{bmatrix} 780\\ 0\\ 1\\ 2\\ 999 \end{bmatrix}[/Tex]

Suppose our ground-truth vector is defined as follows:

[Tex]G = \begin{bmatrix} G_{0}\\ G_{1}\\ G_{2} \end{bmatrix}=\begin{bmatrix} 780\\ 2\\ 999 \end{bmatrix} [/Tex]

Then we define our error function as follows:

[Tex]E = \frac{1}{n}\sum_{k}\min_{i} d(c_{i}, G_{k}) [/Tex]

Here n is the number of ground-truth labels (3 in this example). For each ground-truth class, the function takes the minimum distance to the predicted candidates, where the distance function d is defined as:

[Tex]d(c_i, G_k) = \begin{cases} 0 & \text{if } c_i = G_k \\ 1 & \text{otherwise} \end{cases}[/Tex]

So, the error for this example is:

[Tex]\begin{aligned} E &=\frac{1}{3}\left ( \min_{i}d(c_{i}, G_{0}) +\min_{i}d(c_{i}, G_{1})+\min_{i}d(c_{i}, G_{2}) \right ) \\ &= \frac{1}{3}(0 + 0 +0) \\&=0 \end{aligned}[/Tex]

Since all the categories in the ground truth appear among the predicted top-5 candidates, the error is 0.
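The worked example can be reproduced in a few lines. This is a minimal sketch, assuming the probabilities from the classification vector above; the helper names `top_k_classes` and `top_k_error` are hypothetical and chosen only for illustration.

```python
def top_k_classes(probs, k=5):
    """Indices of the k largest probabilities, most probable first."""
    # sorted() is stable, so tied classes keep the lower index first.
    return sorted(range(len(probs)), key=lambda i: -probs[i])[:k]

def top_k_error(candidates, ground_truth):
    """E = (1/n) * sum_k min_i d(c_i, G_k), with d = 0 on a match, else 1."""
    misses = [min(0 if c == g else 1 for c in candidates)
              for g in ground_truth]
    return sum(misses) / len(ground_truth)

# Probabilities from the example: everything not listed is 0.
probs = [0.0] * 1000
probs[0], probs[1], probs[2], probs[3] = 0.1, 0.05, 0.05, 0.03
probs[780], probs[999] = 0.72, 0.05

C = top_k_classes(probs)   # candidate vector
G = [780, 2, 999]          # ground-truth vector
print(C)                   # [780, 0, 1, 2, 999]
print(top_k_error(C, G))   # 0.0
```

If one of the ground-truth classes were missing from the candidates, its term would contribute 1 and the error would rise to 1/3.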

VGG Architecture:

The VGG-16 architecture is a deep convolutional neural network (CNN) designed for image classification tasks. It was introduced by the Visual Geometry Group at the University of Oxford. VGG-16 is characterized by its simplicity and uniform architecture, making it easy to understand and implement.

The VGG-16 configuration typically consists of 16 layers, including 13 convolutional layers and 3 fully connected layers. These layers are organized into blocks, with each block containing multiple convolutional layers followed by a max-pooling layer for downsampling.

VGG-16 architecture Map

Here is a layer-by-layer breakdown of the VGG-16 architecture:

Input Layer:
- Input dimensions: (224, 224, 3)
Convolutional Layers (64 filters, 3×3 filters, same padding):
- Two consecutive convolutional layers with 64 filters each and a filter size of 3×3.
- Same padding is applied to maintain spatial dimensions.
Max Pooling Layer (2×2, stride 2):
- Max-pooling layer with a pool size of 2×2 and a stride of 2.
Convolutional Layers (128 filters, 3×3 filters, same padding):
- Two consecutive convolutional layers with 128 filters each and a filter size of 3×3.
Max Pooling Layer (2×2, stride 2):
- Max-pooling layer with a pool size of 2×2 and a stride of 2.
Convolutional Layers (256 filters, 3×3 filters, same padding):
- Three consecutive convolutional layers with 256 filters each and a filter size of 3×3.
Max Pooling Layer (2×2, stride 2):
- Max-pooling layer with a pool size of 2×2 and a stride of 2.
Convolutional Layers (512 filters, 3×3 filters, same padding):
- Three consecutive convolutional layers with 512 filters each and a filter size of 3×3.
Max Pooling Layer (2×2, stride 2):
- Max-pooling layer with a pool size of 2×2 and a stride of 2.
Convolutional Layers (512 filters, 3×3 filters, same padding):
- Three more consecutive convolutional layers with 512 filters each and a filter size of 3×3.
Max Pooling Layer (2×2, stride 2):
- Final max-pooling layer with a pool size of 2×2 and a stride of 2, which reduces the feature map to 7×7×512.
Flattening:
- Flatten the output feature map (7x7x512) into a vector of size 25088.
Fully Connected Layers:
- Three fully connected layers; the first two use ReLU activation.
- First layer with input size 25088 and output size 4096.
- Second layer with input size 4096 and output size 4096.
- Third layer with input size 4096 and output size 1000, corresponding to the 1000 classes in the ILSVRC challenge.
- Softmax activation is applied to the output of the third fully connected layer for classification.

All convolutional layers and the first two fully connected layers use the ReLU activation function, and the final fully connected layer outputs probabilities for 1000 classes through softmax activation.
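The layer counts and sizes above can be checked with some simple arithmetic. The sketch below assumes the standard VGG-16 layout of 2, 2, 3, 3 and 3 convolutional layers per block, each block followed by a 2×2 max-pool, and counts one bias per output channel or unit. The names `CFG`, `FC` and `summarize` are illustrative, not taken from any library.

```python
# Numbers are conv output channels; "M" is a 2x2 max-pool with stride 2.
CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
       512, 512, 512, "M", 512, 512, 512, "M"]
FC = [4096, 4096, 1000]  # output sizes of the fully connected layers

def summarize(cfg=CFG, fc=FC, size=224, channels=3):
    """Return (#conv layers, final feature map, flattened size, #parameters)."""
    params = 0
    conv_layers = 0
    for item in cfg:
        if item == "M":
            size //= 2  # pooling halves height and width
        else:
            # 3x3 kernel weights plus one bias per filter;
            # same padding keeps the spatial size unchanged.
            params += 3 * 3 * channels * item + item
            channels = item
            conv_layers += 1
    flat = size * size * channels
    features = flat
    for units in fc:
        params += features * units + units  # weights plus biases
        features = units
    return conv_layers, (size, size, channels), flat, params

conv_layers, fmap, flat, params = summarize()
print(conv_layers)  # 13
print(fmap)         # (7, 7, 512)
print(flat)         # 25088
print(params)       # 138357544
```

Most of the parameters sit in the first fully connected layer (25088 × 4096 weights), not in the convolutional stack.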

VGG-16 Configuration:

The VGG paper defines several configurations, labelled A to E. Configurations C and D both have 16 weight layers, and the main difference between them lies in the filter sizes of some convolutional layers. Configuration D uses 3×3 filters throughout, while configuration C uses 1×1 filters in three of its convolutional layers. As a result, D has slightly more parameters than C (roughly 138 million versus 134 million). Both follow the same overall design principles, and configuration D is the one commonly referred to as VGG-16.
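To get a feel for how much the choice between 1×1 and 3×3 filters matters, the sketch below compares the parameter count of the two filter sizes at equal channel widths. The three layer positions (one 256-channel layer and two 512-channel layers) are an assumption based on the paper's configuration table, and `conv_params` is a hypothetical helper.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus biases of a single convolutional layer."""
    return kernel * kernel * c_in * c_out + c_out

# Assumed positions of the layers whose filter size differs:
# one layer with 256 channels and two layers with 512 channels.
layers = [(256, 256), (512, 512), (512, 512)]

# Extra parameters needed when those layers use 3x3 instead of 1x1 filters.
difference = sum(conv_params(3, c_in, c_out) - conv_params(1, c_in, c_out)
                 for c_in, c_out in layers)
print(difference)  # 4718592
```

A difference of about 4.7 million parameters is small next to the roughly 138 million total, which is why the two configurations are so close in size.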

Different VGG Configuration

Object Localization In Image:

To perform localization, we replace the class scores with bounding box coordinates. A bounding box is represented by a 4-D vector (center coordinates (x, y), height, width). There are two versions of the localization architecture: in one, a single bounding box prediction is shared across all classes (the output is a 4-parameter vector); in the other, the bounding box is class-specific (the output is a 4000-parameter vector, 4 values for each of the 1000 classes). The paper experimented with both approaches on the VGG-16 (D) architecture. We also need to change the loss from a classification loss to a regression loss (such as MSE) that penalizes the deviation of the predicted bounding box from the ground truth.
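The two output layouts and the regression loss can be sketched as follows. This is only an illustration under stated assumptions: the box values are made up, the helper names are hypothetical, and storing each class's 4 values contiguously in the 4000-value vector is an assumed layout rather than something the text specifies.

```python
NUM_CLASSES = 1000
shared_outputs = 4                       # one box shared by all classes
class_specific_outputs = 4 * NUM_CLASSES # one box per class

def mse(pred, target):
    """Mean squared error between a predicted and a ground-truth box."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def box_for_class(outputs, class_index):
    """Pick one class's 4 box values from a class-specific output vector.

    Assumes the values of class k sit at positions 4k .. 4k+3.
    """
    start = 4 * class_index
    return outputs[start:start + 4]

# Hypothetical boxes as (center x, center y, height, width), normalized to [0, 1].
truth = [0.50, 0.50, 0.40, 0.30]
pred = [0.55, 0.45, 0.40, 0.30]
loss = mse(pred, truth)
print(loss)  # small positive value: only the center is slightly off
```

In the class-specific variant, the loss would be computed only on the 4 values selected for the ground-truth class.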

Results: VGG-16 was one of the best-performing architectures in the ILSVRC 2014 challenge. It was the runner-up in the classification task with a top-5 classification error of 7.32% (behind only GoogLeNet, with a classification error of 6.66%). It also won the localization task with a localization error of 25.32%.

Limitations Of VGG 16:

It is very slow to train (the original VGG models were trained on four NVIDIA Titan Black GPUs for 2-3 weeks).
The VGG-16 weights trained on ImageNet take up 528 MB, so the model needs a lot of disk space and bandwidth, which makes it inefficient to deploy.
With 138 million parameters, the model is expensive in memory and computation, and a plain stack of this many layers is prone to vanishing or exploding gradients during training.
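The weight file size follows almost directly from the parameter count. The short calculation below assumes 32-bit floating-point weights (4 bytes each) and the exact count of 138,357,544 parameters usually quoted for the standard configuration; both are assumptions, since the text only gives the rounded figures.

```python
params = 138_357_544     # assumed exact parameter count (about 138 million)
bytes_per_param = 4      # assumes float32 weights

size_bytes = params * bytes_per_param
size_mib = size_bytes / 2**20  # 1 MiB = 2**20 bytes

print(size_bytes)          # 553430176
print(round(size_mib, 1))  # 527.8
```

The result is within rounding of the 528 MB quoted above; the small remainder of a real weight file would be metadata.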

Further advancements: ResNets were later introduced with residual (skip) connections, which ease the gradient-flow problems that make very deep plain networks such as VGG-16 hard to train.
