
Deep Face Recognition

DeepFace is the facial recognition system used by Facebook for tagging images. It was proposed by researchers at Facebook AI Research (FAIR) at the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

In modern face recognition there are 4 steps: 

  1. Detect
  2. Align
  3. Represent
  4. Classify

This approach focuses on the alignment and representation of facial images. We will discuss these two parts in detail.
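A minimal, purely illustrative sketch of this four-step flow is shown below; the stage functions (detect_face, align_face, represent, classify) are hypothetical stand-ins, not part of DeepFace or any library.

```python
import numpy as np

# Hypothetical placeholder stages: real detectors, aligners, and models
# would be plugged in here; only the overall flow mirrors the four steps.
def detect_face(image):
    h, w = image.shape[:2]
    return (0, 0, w, h)                        # pretend the whole image is the face

def align_face(image, box):
    x0, y0, x1, y1 = box
    return image[y0:y1, x0:x1]                 # crop only; no real frontalization

def represent(aligned):
    return aligned.mean(axis=(0, 1))           # stand-in for a deep feature vector

def classify(embedding, gallery):
    # pick the nearest identity in a gallery of {name: embedding}
    return min(gallery, key=lambda name: np.linalg.norm(gallery[name] - embedding))

def recognize(image, gallery):
    box = detect_face(image)                   # 1. Detect
    aligned = align_face(image, box)           # 2. Align
    embedding = represent(aligned)             # 3. Represent
    return classify(embedding, gallery)        # 4. Classify
```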

Alignment: 
The goal of the alignment step is to generate a frontalized face from an input image that may contain faces in different poses and at different angles. The method proposed in the paper uses 3D frontalization based on fiducial points (facial feature points) to extract the frontal face. The whole alignment process is carried out in the following steps: 



  1. Detect the face and locate 6 initial fiducial points.
  2. Generate a 2D-cropped (2D-aligned) face from these points.
  3. Localize 67 fiducial points on the 2D crop and apply Delaunay triangulation.
  4. Fit a generic 3D shape model to the aligned 2D crop.
  5. Compute a visibility map of the 2D shape in 3D (darker triangles are less visible than lighter ones).
  6. Map the 67 fiducial points onto the 2D-3D affine face.
  7. Produce the final frontalization.
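The full 3D frontalization above is involved; as a much simpler illustration, the sketch below (assuming OpenCV and NumPy) performs only a 2D alignment, warping detected fiducial points onto a fixed template with a similarity transform. The five-point landmark set and the template coordinates are illustrative assumptions, not values from the paper.

```python
import cv2
import numpy as np

# Canonical landmark positions in a 152x152 crop (made-up illustrative values).
TEMPLATE = np.float32([
    [54, 58],   # left eye
    [98, 58],   # right eye
    [76, 85],   # nose tip
    [60, 112],  # left mouth corner
    [92, 112],  # right mouth corner
])

def align_2d(image, fiducials, size=152):
    """Simplified 2D alignment: fiducials is a (5, 2) array of detected landmarks."""
    pts = np.float32(fiducials)
    # Least-squares similarity transform (rotation + scale + translation).
    M, _ = cv2.estimateAffinePartial2D(pts, TEMPLATE)
    return cv2.warpAffine(image, M, (size, size))
```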


Representation and Classification Architecture: 
DeepFace is trained for multi-class face recognition, i.e., to classify the images of multiple people based on their identities. 
It takes as input a 3D-aligned RGB image of size 152×152. This image is passed through a convolution layer with 32 filters of size 11×11×3, followed by a 3×3 max-pooling layer with a stride of 2, and then another convolution layer with 16 filters of size 9×9×16. The purpose of these layers is to extract low-level features from the image, such as edges and textures. 
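As a sketch, this convolution and pooling front-end might look like the following in PyTorch (the framework and the lack of padding are assumptions; the paper does not specify an implementation):

```python
import torch
import torch.nn as nn

frontend = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=11),        # C1: 32 filters of size 11x11x3
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),   # M2: 3x3 max-pooling with stride 2
    nn.Conv2d(32, 16, kernel_size=9),        # C3: 16 filters of size 9x9, applied over the 32 maps
    nn.ReLU(),
)

x = torch.randn(1, 3, 152, 152)              # a 3D-aligned 152x152 RGB input
low_level_features = frontend(x)             # low-level edge/texture feature maps
```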

The next three layers are locally connected layers. These behave like convolutional layers, except that every position in the feature map learns its own set of filters (there is no weight sharing). This helps the model because different regions of the face have different discriminative ability, so it is better to learn region-specific filters to distinguish these facial regions. 
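PyTorch does not provide a built-in 2D locally connected layer, so below is a minimal sketch of one, implemented by extracting patches with unfold and giving every output location its own filter bank. The layer sizes in the usage line are illustrative, not the paper's exact dimensions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocallyConnected2d(nn.Module):
    """Convolution-like layer with untied weights: every output position has
    its own filters. Minimal sketch assuming square inputs and no padding."""
    def __init__(self, in_channels, out_channels, in_size, kernel_size, stride=1):
        super().__init__()
        self.kernel_size, self.stride = kernel_size, stride
        self.out_size = (in_size - kernel_size) // stride + 1
        n_locations = self.out_size ** 2
        # one (out_channels x in_channels*k*k) filter bank per output location
        self.weight = nn.Parameter(
            0.01 * torch.randn(n_locations, out_channels, in_channels * kernel_size ** 2))
        self.bias = nn.Parameter(torch.zeros(out_channels, self.out_size, self.out_size))

    def forward(self, x):
        # (N, C*k*k, L) -> (L, N, C*k*k): one flattened patch per output location
        patches = F.unfold(x, self.kernel_size, stride=self.stride).permute(2, 0, 1)
        out = torch.bmm(patches, self.weight.transpose(1, 2))      # (L, N, out_channels)
        out = out.permute(1, 2, 0).reshape(x.size(0), -1, self.out_size, self.out_size)
        return out + self.bias

layer = LocallyConnected2d(16, 16, in_size=62, kernel_size=9)      # sizes are illustrative
print(layer(torch.randn(1, 16, 62, 62)).shape)                     # torch.Size([1, 16, 54, 54])
```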

DeepFace full architecture


The last two layers of the model are fully connected layers. These layers can capture correlations between distant parts of the face, for example between the position and shape of the eyes and the position and shape of the mouth. The output of the second-to-last fully connected layer is used as the face representation, and the output of the last layer is fed to a K-way softmax (where K is the number of identities) to classify the face. 
The network has approximately 120 million parameters, with most of them (~95%) coming from the final fully connected layers. An interesting property of this network is that the feature vector it generates is remarkably sparse: for example, about 75% of the values in the topmost layers are 0. This is likely because the network uses the ReLU activation, max(0, x), after every convolution layer. The network also uses dropout regularization, which contributes to this sparsity; however, dropout is only applied to the first fully connected layer. 
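A sketch of this fully connected head in PyTorch is shown below. The 4096-dimensional representation and the roughly 4,000 identities follow the original paper; the flattened input size and the exact placement of dropout are assumptions.

```python
import torch
import torch.nn as nn

K = 4030                  # number of identities (illustrative; the paper trains on ~4,000 people)
FEAT_IN = 16 * 8 * 8      # flattened size of the last locally connected output (assumed)

head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(FEAT_IN, 4096),   # F7: the face representation used for verification
    nn.ReLU(),                  # ReLU keeps the representation sparse
    nn.Dropout(p=0.5),          # dropout only around the first fully connected layer
    nn.Linear(4096, K),         # F8: one logit per identity (K-way softmax in the loss)
)

feats = torch.randn(2, 16, 8, 8)   # dummy locally connected output
logits = head(feats)               # shape (2, K)
```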
In the final stage of the network, the features are normalized to be between 0 and 1, which reduces the effect of illumination changes, and the resulting feature vector is then L2-normalized. 
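A minimal NumPy sketch of this final normalization, assuming per-component maxima computed over the training set:

```python
import numpy as np

def normalize_features(f, train_max, eps=1e-10):
    # scale each component into [0, 1] using its maximum over the training set
    f = f / (train_max + eps)
    # then L2-normalize the whole vector
    return f / (np.linalg.norm(f) + eps)

train_feats = np.abs(np.random.randn(1000, 4096))   # stand-in features (ReLU outputs are non-negative)
train_max = train_feats.max(axis=0)                  # per-component maxima over the training set
g = normalize_features(train_feats[0], train_max)
```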

Verification Metric: 
We need a metric that measures whether two input images belong to the same identity or not. There are two kinds of metrics, supervised and unsupervised, and supervised metrics generally achieve better accuracy: by training on the particular target dataset, one can fine-tune the model to it. However, this can also introduce bias. For example, about 75% of the faces in the Labeled Faces in the Wild (LFW) dataset are male, so fitting a metric to LFW may hurt generalization when testing on other face recognition datasets; training on a small target dataset has a similar effect. In such cases, an unsupervised similarity metric is preferable. The paper uses the inner product of the two feature vectors produced by the representation architecture as its unsupervised similarity. It also uses two supervised verification metrics: a weighted χ² similarity and a Siamese network.
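As a sketch, the unsupervised inner-product metric amounts to comparing two normalized feature vectors and thresholding the score (the threshold below is illustrative):

```python
import numpy as np

def same_person(f1, f2, threshold=0.8):
    # inner product equals cosine similarity for unit-norm feature vectors
    return float(np.dot(f1, f2)) >= threshold

f1 = np.random.rand(4096); f1 /= np.linalg.norm(f1)
f2 = np.random.rand(4096); f2 /= np.linalg.norm(f2)
print(same_person(f1, f2))
```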

Training and Results 
DeepFace is trained and evaluated on three datasets: the Social Face Classification (SFC) dataset (used for training), the Labeled Faces in the Wild (LFW) dataset, and the YouTube Faces (YTF) dataset. 

On LFW, DeepFace-ensemble, a combination of different DeepFace-single models that use the different verification metrics discussed above, reaches a maximum accuracy of 97.35%, which is very close to the human-level accuracy of 97.53%.

Results on YTF

Note that, because of motion blur and other factors, the quality of video frames is generally worse than that of still-image datasets. Nevertheless, 91.4% on YTF was state-of-the-art accuracy at the time and reduced the error rate of the previous best method by more than 50%.

In terms of runtime, DeepFace takes 0.33 seconds per image on a single-core 2.2 GHz Intel processor. This includes 0.05 seconds for alignment and 0.18 seconds for feature extraction.

Conclusion: 
At the time of its publication, DeepFace was one of the best face recognition models; since then, models such as Google's FaceNet have reached accuracies of up to 99.6% on the LFW dataset. The main problem DeepFace solves is building a representation that is invariant to illumination, pose, facial expression, etc., which is why it was used in most of Facebook's face recognition tasks. Its novel 3D alignment approach also contributed to the increase in accuracy. 

Reference: 

Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "DeepFace: Closing the Gap to Human-Level Performance in Face Verification," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.


 

