New Face Recognition Model "Part fViT" Combining Vision Transformer With Landmark CNN
3 main points
✔️ Vision Transformer (ViT) applied to face recognition
✔️ End-to-end model that adds a Landmark CNN to ViT to further improve accuracy
✔️ Higher performance than previous methods on many benchmark data sets
Part-based Face Recognition with Vision Transformers
written by Zhonglin Sun, Georgios Tzimiropoulos
(Submitted on 30 Nov 2022)
Comments: Accepted to BMVC 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
The images used in this article are from the paper, the introductory slides, or were created based on them.
In the past few years, face recognition has been deployed in many applications, such as immigration control and surveillance cameras. Since deep learning began attracting attention, research on face recognition has been dominated by models that combine (a) a CNN-based architecture that processes the whole face image to extract features and (b) a margin-based loss function. Recent research in particular has focused on (b), the design of effective margin-based loss functions.
This paper instead focuses on (a), a new architecture for effective feature extraction: the Vision Transformer (hereafter ViT), which was announced in 2020 and has attracted much attention for achieving performance comparable to or better than CNNs in image recognition. The paper therefore builds a face recognition model using ViT in place of the mainstream CNN and examines its performance.
Two ViT-based face recognition models are constructed in the paper: "fViT," which applies ViT directly to face recognition, and "Part fViT," which introduces a Landmark CNN as a stage preceding ViT. Since ViT uses patches as input, the authors investigate whether a more effective face recognition model can be built by using the Landmark CNN to extract characteristic facial parts as patches and feeding them to ViT. The results show that both models perform as well as or better than state-of-the-art face recognition models.
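The difference between the two kinds of input patches can be sketched in a few lines of plain Python. This is only a minimal illustration, not the authors' implementation: the 8x8 toy image, the 4x4 patch size, the landmark coordinates, and the helper names `bilinear`, `extract_patch`, and `grid_centers` are all assumptions made for the example. Patches are read with bilinear interpolation, in the spirit of the grid sampling used in Spatial Transformer Networks, so that a patch can be centred on a fractional coordinate.

```python
import math

def bilinear(image, x, y):
    """Sample a 2D grayscale image (list of rows) at a fractional (x, y) position."""
    h, w = len(image), len(image[0])
    # Clamp so that out-of-range coordinates read the border pixel.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

def extract_patch(image, cx, cy, size):
    """Return a size x size patch centred on the (possibly fractional) point (cx, cy)."""
    half = (size - 1) / 2.0
    return [[bilinear(image, cx - half + j, cy - half + i) for j in range(size)]
            for i in range(size)]

def grid_centers(height, width, size):
    """Centres of non-overlapping patches on a regular grid (fViT-style input)."""
    half = (size - 1) / 2.0
    return [(c + half, r + half)
            for r in range(0, height - size + 1, size)
            for c in range(0, width - size + 1, size)]

# Toy 8x8 "image" whose pixel value encodes its position: value = 10 * row + column.
image = [[10 * r + c for c in range(8)] for r in range(8)]

# fViT-style: a fixed grid of 4x4 patches cut directly from the image.
grid_patches = [extract_patch(image, cx, cy, 4) for cx, cy in grid_centers(8, 8, 4)]

# Part fViT-style: patches centred on landmark coordinates.
# These coordinates are made up; in the paper they are predicted by the Landmark CNN.
landmarks = [(2.5, 3.0), (5.25, 3.0), (4.0, 6.5)]
part_patches = [extract_patch(image, cx, cy, 4) for cx, cy in landmarks]
```

Because the sampled values change smoothly as a patch centre moves, a sampling step of this kind can in principle pass gradients back to the network that predicts the centres, which is what makes end-to-end training of the landmark stage possible.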
What is "Part fViT"?
The pipeline of Part fViT, a model that introduces a Landmark CNN into ViT, is shown in the figure below. First, the face image is processed by the Landmark CNN (MobileNetV3), and the grid sampling of Spatial Transformer Networks (STN) is applied to extract discriminative facial parts as patches. These patches are then input to ViT, together with the facial landmark coordinates, for feature extraction and recognition. The whole model is trained end-to-end with the CosFace loss function. fViT, the model on which Part fViT is based, instead creates patches directly from the face image and inputs them to ViT.
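To make the training objective more concrete, here is a minimal sketch of the CosFace (large-margin cosine) loss for a single sample, written with the Python standard library only. The scale `s` and margin `m` are illustrative values, not settings reported in this article, and the toy weights and embedding are invented for the example.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosface_loss(feature, class_weights, label, s=64.0, m=0.35):
    """CosFace loss for one sample.

    feature:       embedding vector produced by the backbone
    class_weights: one weight vector per identity
    label:         index of the ground-truth identity
    s, m:          scale and cosine margin (illustrative values, not from the article)
    """
    f = l2_normalize(feature)
    # Cosine similarity between the embedding and every class weight vector.
    cosines = [sum(a * b for a, b in zip(f, l2_normalize(w))) for w in class_weights]
    # Subtract the margin from the target-class cosine only, then scale all logits.
    logits = [s * (c - m if j == label else c) for j, c in enumerate(cosines)]
    # Numerically stable softmax cross-entropy.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]

# Toy setup: three identities in a 2D embedding space.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
embedding = [0.8, 0.6]  # closest to identity 0, but not far from identity 1

loss_with_margin = cosface_loss(embedding, weights, label=0)
loss_without_margin = cosface_loss(embedding, weights, label=0, m=0.0)
```

With these toy values the sample is already classified correctly when no margin is applied, so `loss_without_margin` is close to zero, while `loss_with_margin` stays large. That gap is the point of a margin-based loss: it keeps pushing embeddings of the same identity closer together even after plain softmax would be satisfied.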
Performance comparison with the latest face recognition models
The table below compares the models trained on MS1MV3 with existing face recognition models. The test datasets are LFW (Labeled Faces in the Wild), CFP-FP (Celebrities in Frontal-Profile in the Wild), AgeDB-30, IJB-B (IARPA Janus Benchmark-B face challenge), IJB-C (IARPA Janus Benchmark-C face challenge), and MegaFace.
On LFW, both fViT and Part fViT achieve top-level accuracy, on par with conventional face recognition models. On CFP-FP, a dataset that evaluates robustness to face orientation, Part fViT-B achieves an accuracy of 99.21%; it is described as the only model to reach 99.8% accuracy and as outperforming other state-of-the-art methods such as Variational Prototype Learning (VPL) and ArcFace-challenge.
Similar results are seen on IJB-B and IJB-C, where fViT shows the second-best overall performance, with 95.97% and 97.21%, respectively. On MegaFace/id, Part fViT shows the highest performance, and fViT also achieves top-level accuracy, on par with conventional face recognition models.
However, on AgeDB-30, a dataset that evaluates robustness to aging, Part fViT and fViT achieve accuracies of 98.29% and 98.13%, respectively: not the highest, but still competitive.
The figure below compares the attention maps generated by fViT and Part fViT. Rows 1 and 2 are the attention maps generated by fViT, and rows 3 and 4 are those generated by Part fViT.
Both fViT and Part fViT respond well to face orientation: they correctly focus on the corresponding regions in both the frontal and the profile image. In the sixth and seventh attention maps of fViT (rows 1 and 2), however, the focus is not on any specific region of the face. Another contrast is that only one attention map in fViT (the 10th) focuses on the eye region, which is well known as the most discriminative region for face recognition, whereas several do in Part fViT. This may affect face recognition accuracy.
The figure below shows the 49 landmarks trained end-to-end in Part fViT. It can be seen that there is some robustness to face orientation.
Impact of different Landmark CNNs
The paper also examines how face recognition accuracy changes when the Landmark CNN is added or replaced. In addition to MobileNetV3, which is used in the main model, the larger ResNet50 is also evaluated. The results are shown in the table below.
For LFW, there is little difference, since accuracy is already saturated; for CFP-FP, AgeDB, and IJB-C, Part fViT shows higher accuracy on average. However, when a larger Landmark CNN (ResNet50) is used, accuracy decreases in some cases, such as on CFP-FP and IJB-C. From these results, the paper concludes that using a larger Landmark CNN does not necessarily improve accuracy.
Impact of different data augmentation
The paper also examines how much the data augmentation applied to the training data affects accuracy. As the table below shows, accuracy improves as more data augmentation methods are applied.
This paper proposes new face recognition models based on the Vision Transformer (ViT), which has attracted much attention for achieving accuracy as high as or higher than that of CNNs in image recognition. fViT applies ViT directly to face recognition, while Part fViT trains a Landmark CNN and ViT end-to-end. Both models achieve accuracy comparable to or better than conventional face recognition models, with Part fViT achieving particularly high accuracy. The paper also examines the effect of the number of patches on accuracy. If you are interested, please also read the Ablation Studies section of the paper.