What Is The Power Of MaskInv, A Robust Face Recognition Model For Masks That Uses Distillation?

3 main points
✔️ Uses knowledge distillation to train the Embedding Layer so that the Embeddings of "Masked" and "Non-Masked" face images become closer, and employs ElasticFace-Arc as the loss function.
✔️ Achieves SOTA on the MFRC-21 Challenge in both the "Masked vs Masked" and "Masked vs Non-Masked" cases, and improves accuracy on MFR2.
✔️ Achieves the same performance as conventional face recognition models for "Non-Masked vs Non-Masked".

Mask-invariant Face Recognition through Template-level Knowledge Distillation
written by Marco Huber, Fadi Boutros, Florian Kirchbuchner, Naser Damer
(Submitted on 10 Dec 2021)
Comments: Accepted at the 16th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2021

Subjects: Computer Vision and Pattern Recognition (cs.CV)


The images used in this article are from the paper, the introductory slides, or were created based on them.


Facial recognition technology started to take off around 2019. Overseas, it is widely used for everything from store payments to criminal investigations. In Japan, it is also being used at airport entry and exit gates.

Recently, with the spread of coronavirus infection, devices with face recognition functions have been installed in all kinds of facilities, including commercial facilities, accommodation facilities, and hospitals, to check body temperatures. As a result, we encounter face recognition devices more and more often, and they are becoming more familiar.

In addition, demonstration experiments using facial recognition are being conducted at municipalities, public transportation facilities, commercial facilities, and other locations throughout Japan. Initiatives of all sorts are underway, including in Toyama City, the Fuji Five Lakes, Akaigawa Village International Resort, Osaka Dotonbori Shopping Street, JR, airports, and fast food restaurants. All of these initiatives have achieved a high level of satisfaction, as people can experience the usefulness of facial recognition without the need for tickets, keys, wallets, etc., and can go sightseeing with nothing in their hands.

With more and more examples like this, the use of face recognition is steadily spreading. The day may soon come when you too will use face recognition without thinking about it.

But as 2019 drew to a close and the coronavirus spread across the globe, a new challenge for facial recognition emerged: the "mask".

As has already been verified by the US National Institute of Standards and Technology (NIST) and research institutes around the world, traditional facial recognition systems are significantly less accurate when used on people wearing masks. Therefore, as face recognition continues to spread, recognizing people who are wearing masks has become an active research issue worldwide.

In this paper, we apply "Knowledge Distillation" and "ElasticFace-Arc" to propose a method called "MaskInv", which shows high performance regardless of whether a mask is worn.

What is MaskInv?

MaskInv (see below) utilizes knowledge distillation to train the Embedding Layer: unmasked face images are input to the Teacher Network, and both masked and unmasked face images are input to the Student Network. Through this learning process, the Embedding Layer of the Student Network is trained so that its Embeddings for faces with and without masks become closer to the Teacher Network's Embeddings for unmasked faces.

At the same time as knowledge distillation, the Student Network learns the Classification Layer so that the masked and unmasked faces are classified as the same person. Here, we use the loss function "ElasticFace-Arc" introduced in ElasticFace, a face recognition model that achieved SOTA.

By learning these two layers at the same time, we can construct a face recognition model that can accurately classify masked and unmasked faces as the same person.
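The data flow described above can be sketched in a few lines. This is only a toy illustration: `mask_lower_half` is a hypothetical stand-in for real mask synthesis, and `mask_prob` is an assumed parameter that the article does not specify.

```python
import random

def mask_lower_half(face):
    """Toy stand-in for mask synthesis: blank out the lower half of the rows."""
    half = len(face) // 2
    return [row[:] for row in face[:half]] + [[0] * len(row) for row in face[half:]]

def make_training_pair(face, mask_prob=0.5, rng=random):
    """Return (teacher_input, student_input) for one training sample."""
    teacher_input = face  # the teacher only ever sees the unmasked face
    if rng.random() < mask_prob:
        student_input = mask_lower_half(face)  # student sees a masked version
    else:
        student_input = face  # or the same unmasked face
    return teacher_input, student_input

# A 4x2 "image" given as rows of pixel values.
face = [[1, 2], [3, 4], [5, 6], [7, 8]]
teacher_in, student_in = make_training_pair(face, mask_prob=1.0)
```

The point of the sketch is that the Teacher Network's input never changes, so its Embedding acts as a fixed target that the Student Network has to reach whether or not its own input is masked.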

The Teacher Network is built on ElasticFace, which has achieved SOTA. We use a ResNet-100-based face recognition model pre-trained on MS1MV2 (dataset) with ElasticFace-Arc (loss function). This model is published by the authors of ElasticFace. The Student Network also uses a ResNet-100-based architecture and is trained on MS1MV2 (dataset). ResNet-100 is a commonly used architecture in face recognition models that have achieved SOTA in recent years.

The face images with masks input to the Student Network are created by synthesizing pre-generated masks. The masks are synthesized using facial feature points, and the colors and shapes are randomly generated.

As we have seen, MaskInv simultaneously trains the Embedding Layer through knowledge distillation from the Teacher Network and the Classification Layer through ElasticFace-Arc. Therefore, the loss function of MaskInv, Ltotal, is defined as the sum of LKD for learning by knowledge distillation, weighted by a coefficient λ, and LElasticArc for learning by ElasticFace-Arc.

ElasticFace-Arc is a loss function that relaxes the fixed margin constraint used in the loss functions of other high-accuracy face recognition models. Thus, it allows a more flexible classification of embeddings (see the figure below, taken from "ElasticFace: Elastic Margin Loss for Deep Face Recognition").
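To make the idea of a relaxed margin concrete, here is a minimal sketch for a single sample. It assumes the elastic margin is drawn from a normal distribution and added to the angle of the ground-truth class, as we understand the ElasticFace paper; the function name and the default values of `s`, `m`, and `sigma` are illustrative assumptions, not values taken from this article.

```python
import math
import random

def elastic_arc_loss(cosines, target, s=64.0, m=0.5, sigma=0.05, rng=random):
    """Cross-entropy over scaled cosines with a randomly drawn angular margin.

    cosines: cosine similarity between one embedding and each class centre.
    target:  index of the ground-truth class.
    """
    margin = rng.gauss(m, sigma)  # elastic margin, drawn anew for each sample
    logits = []
    for j, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # keep acos in its valid domain
        if j == target:
            theta = math.acos(c)
            logits.append(s * math.cos(theta + margin))  # penalised target logit
        else:
            logits.append(s * c)
    # Numerically stable negative log-softmax of the target class.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]

# sigma=0 reduces the elastic margin to a fixed margin m.
fixed_margin_loss = elastic_arc_loss([0.3, 0.2], target=0, sigma=0.0)
```

With `sigma=0` every sample receives the same margin, which corresponds to the fixed-margin case; a positive `sigma` lets the margin vary from sample to sample.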

LElasticArc is represented by the following equation
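For reference, ElasticFace-Arc is usually written in the following form. The notation is our reading of the ElasticFace paper rather than something stated in this article, and should be checked against that paper: N is the batch size, c the number of classes, s a scale factor, θ the angle between an Embedding and a class centre, and E(m, σ) a margin drawn from a normal distribution with mean m and standard deviation σ.

```latex
L_{ElasticArc} = \frac{1}{N} \sum_{i \in N} -\log
  \frac{e^{s \cos(\theta_{y_i} + E(m, \sigma))}}
       {e^{s \cos(\theta_{y_i} + E(m, \sigma))} + \sum_{j=1, j \neq y_i}^{c} e^{s \cos(\theta_j)}}
```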

Also, LKD is expressed by the following equation, applied to the Embeddings of the Teacher Network and the Student Network. It is designed to minimize the Embedding misalignment caused by wearing a mask.
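As a rough sketch of how the two terms fit together, the code below assumes that LKD is a mean squared error between Teacher and Student Embeddings (a common choice for embedding-level distillation, but an assumption here) and that Ltotal adds it to LElasticArc with the weight λ. The function names are hypothetical.

```python
def kd_loss(teacher_embeddings, student_embeddings):
    """Mean squared error between teacher and student embeddings over a batch."""
    total = 0.0
    for t, s in zip(teacher_embeddings, student_embeddings):
        # Average the squared differences over the embedding dimensions.
        total += sum((ti - si) ** 2 for ti, si in zip(t, s)) / len(t)
    return total / len(teacher_embeddings)

def total_loss(elastic_arc, kd, lam):
    """L_total = L_ElasticArc + lambda * L_KD (assumed form)."""
    return elastic_arc + lam * kd

# Teacher sees the unmasked face, student the masked one (toy 2-D embeddings).
teacher = [[1.0, 0.0]]
student = [[0.0, 0.0]]
loss = total_loss(elastic_arc=2.0, kd=kd_loss(teacher, student), lam=4.0)
```

Setting `lam` to 0 removes the distillation term entirely, while a larger `lam` pulls the Student's Embeddings more strongly toward the Teacher's.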


In the performance evaluation, we use MFRC-21 and MFR2, which are datasets containing masked face images. We also use "LFW", "CFP-FP", "AgeDB-30", "CALFW" and "CPLFW", which are datasets commonly used in the performance evaluation of face recognition models. Since these datasets do not contain face images with masks, we prepare face images with masks by synthesizing the generated masks.

In addition, six models are prepared for comparison and verification.
(1) ArcFace, which has recently achieved SOTA
(2) MagFace, which has also recently achieved SOTA
(3) ElasticFace-Arc (baseline), used as the Teacher Network
(4) ElasticFace-Arc-Aug, in which knowledge distillation is introduced into the baseline but λ=0 in Ltotal, eliminating the optimization by knowledge distillation
(5) MaskInv-LG, in which the value of λ in Ltotal is reduced and the optimization by knowledge distillation is weakened
(6) MaskInv-HG, in which the value of λ in Ltotal is increased and the optimization by knowledge distillation is strengthened

The table below shows the results of evaluating the 1:1 face recognition performance on MFRC-21 for masked vs. unmasked images. FMR100 represents the FNMR (False Non-Match Rate, the rate at which the same person is wrongly rejected) at an FMR (False Match Rate, the rate at which a different person is wrongly accepted) of 1.0%, and FMR1000 represents the FNMR at an FMR of 0.1%. FDR stands for Fisher Discriminant Ratio. In particular, FMR1000 is also a requirement for immigration automation and is a high-security standard.
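These metrics can be computed from two lists of comparison scores. The sketch below assumes the usual definitions: FMR100 and FMR1000 are the lowest FNMR at which FMR stays at or below 1.0% and 0.1% respectively, and FDR is the squared gap between the genuine and impostor score means divided by the sum of their variances. Function names are hypothetical, and the use of population variance is an assumption.

```python
import statistics

def fnmr_at_fmr(genuine, impostor, target_fmr):
    """FNMR at the most permissive threshold whose FMR does not exceed target_fmr.

    Scores are similarities: a pair is accepted when score >= threshold.
    """
    candidates = sorted(set(genuine) | set(impostor))
    candidates.append(max(candidates) + 1.0)  # a threshold that rejects everything
    for threshold in candidates:  # ascending, so the first hit has the lowest FNMR
        fmr = sum(s >= threshold for s in impostor) / len(impostor)
        if fmr <= target_fmr:
            return sum(s < threshold for s in genuine) / len(genuine)

def fisher_discriminant_ratio(genuine, impostor):
    """FDR = (mean_G - mean_I)^2 / (var_G + var_I); higher means better separation."""
    gap = statistics.mean(genuine) - statistics.mean(impostor)
    return gap ** 2 / (statistics.pvariance(genuine) + statistics.pvariance(impostor))

genuine_scores = [0.9, 0.8, 0.7, 0.4]   # same-person pairs
impostor_scores = [0.5, 0.3, 0.2, 0.1]  # different-person pairs
fmr100 = fnmr_at_fmr(genuine_scores, impostor_scores, target_fmr=0.01)
fmr1000 = fnmr_at_fmr(genuine_scores, impostor_scores, target_fmr=0.001)
```

For FMR100 and FMR1000 lower is better, because they report how many genuine pairs are rejected at a fixed security level; for FDR higher is better.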

Looking at the evaluation performance for the highest criterion, FMR1000, we can see that the models with knowledge distillation applied (MaskInv-LG and MaskInv-HG) have improved performance compared to ElasticFace-Arc (baseline). In other words, we can see that knowledge distillation improves face recognition performance with masks.

The table below shows the results of evaluating the 1:1 face recognition performance on MFRC-21 for masked vs. masked images. Similarly, we can see that knowledge distillation improves the face recognition performance on masked images.

The table below shows the results of evaluating the 1:1 face recognition performance on MFR2 for masked vs. unmasked images. TPR stands for True Positive Rate.

At FAR2000, performance improves from 83.25% to 91.98% with MaskInv-LG and from 83.25% to 92.21% with MaskInv-HG. MaskInv-HG thus performs better than both ElasticFace-Arc (baseline) and MaskInv-LG.

The above tables also show that MagFace and ArcFace, face recognition models that achieved SOTA without assuming masks, perform somewhat worse on MFRC-21 and MFR2, which are datasets with masks.

The table below shows the Accuracy (%) on "LFW", "CFP-FP", "AgeDB-30", "CALFW", and "CPLFW", which are datasets often used to evaluate the performance of face recognition models. Here, we evaluate the performance of three cases.

The first is the dataset without masks, and the performance is evaluated for No Masks (Non-Masked vs Non-Masked). The second is a dataset in which a mask is synthesized on one image of each pair, and the performance is evaluated for Masked vs Non-Masked. The third is a dataset in which a mask is synthesized on both images of each pair, and the performance is evaluated for Masked vs Masked.
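The three cases can be derived from one list of verification pairs. The sketch below is illustrative only: `build_protocols` and `add_mask` are hypothetical names, and masking the first image of each pair in the Masked vs Non-Masked case is an arbitrary choice made here.

```python
def build_protocols(pairs, add_mask):
    """Derive the three evaluation settings from (image_a, image_b, same_person) pairs."""
    return {
        # Original benchmark pairs, untouched.
        "non_masked_vs_non_masked": [(a, b, y) for a, b, y in pairs],
        # A mask is synthesized on one image of each pair.
        "masked_vs_non_masked": [(add_mask(a), b, y) for a, b, y in pairs],
        # A mask is synthesized on both images of each pair.
        "masked_vs_masked": [(add_mask(a), add_mask(b), y) for a, b, y in pairs],
    }

# Images are represented by file names here; add_mask just tags the name.
pairs = [("anna_1.jpg", "anna_2.jpg", True), ("anna_1.jpg", "ben_1.jpg", False)]
protocols = build_protocols(pairs, add_mask=lambda name: name + "+mask")
```

Because all three settings reuse the same pairs and labels, any difference in Accuracy between them can be attributed to the masks rather than to a change in identities.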

From the above table, it can be seen that all three of our models achieve higher accuracy in 1:1 face recognition without masks than with masks and that, when masks are worn, they perform better than the conventional face recognition models.

The Accuracy of MaskInv-LG and MaskInv-HG goes up and down depending on the dataset, but overall it is higher than that of ElasticFace-Arc (baseline) and ElasticFace-Arc-Aug. We believe that Accuracy is improved by introducing the knowledge distillation optimization.

However, compared with Masked vs. Non-Masked, the difference in Accuracy from ElasticFace-Arc (baseline) and ElasticFace-Arc-Aug is smaller in Masked vs. Masked.

This is because MaskInv uses knowledge distillation with the goal of learning close Embeddings for faces with and without a mask. Masked vs. Non-Masked can be considered to benefit directly from knowledge distillation, whereas for Masked vs. Masked we believe that MaskInv's knowledge distillation is less effective, because it only requires similarity between the masked and the unmasked.

We also found that, when masks are worn, the relatively low performance of MagFace, ArcFace, and ElasticFace-Arc (baseline) still indicates the need for a model optimized and tuned for masked faces.


These results show that MaskInv improves accuracy by learning close Embeddings for masked and unmasked face images through knowledge distillation. In particular, this learning approach proves useful in the Masked vs. Non-Masked case. This is expected to be useful in practical situations such as immigration inspection, where the face image registered in a passport is compared with the person's face.

In addition, the method is also effective under changes in the orientation of the face (Cross-Pose) and changes over time (Cross-Age). There is a possibility that this method can be applied to the construction of highly versatile face recognition models.

In addition, we find that the performance of the conventional face recognition model without the assumption of wearing a mask is inferior to that of the face recognition model adjusted to face recognition with a mask. This indicates that corresponding optimization and adjustment are necessary to obtain high accuracy under the condition of masks.

As of April 2022, the coronavirus is still widespread, with no end in sight. On March 15, 2022, Apple also enabled facial recognition while wearing a mask. Support for masks will be an indispensable feature for future facial recognition services.

The MaskInv feature learning proposed in this paper is not limited to situations with or without a mask, but may also be useful in situations where a part of the face is shielded, making it a versatile method with wide applicability.
