
AI And Ethics: More Accurately Analyze The Effects Of Bias In Face Recognition Algorithms On A Dataset Of Composite Facial Images!

3 main points
✔️ Bias problem in face recognition technology: face recognition algorithms are subject to bias due to demographic attributes, which may disadvantage certain races and genders.
✔️ Limitations of conventional bias evaluation methods: conventional methods can show that the performance of a face recognition model correlates with certain attributes, but they cannot establish causal relationships.
✔️ Proposal of a new evaluation method: using a face generation tool, only certain attributes are changed while other attributes are kept constant, thus clearly showing the impact of certain attributes on the performance of face recognition models.

Benchmarking Algorithmic Bias in Face Recognition: An Experimental Approach Using Synthetic Faces and Human Evaluation
written by Hao Liang, Pietro Perona, Guha Balakrishnan
(Submitted on 10 Aug 2023)
Comments: Accepted to ICCV2023
Subjects: Computer Vision and Pattern Recognition (cs.CV)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Face recognition technology has benefited from deep learning and is now used in practice in a variety of fields, such as information security and criminal investigation. However, bias due to demographic attributes such as age, race, and gender is a major problem. For example, when face recognition is used in criminal investigations, people of a particular race may be put at a disadvantage. It is therefore very important in face recognition research to properly assess and understand the impact of bias in face recognition models.

This paper proposes a new method that evaluates bias in face recognition models more accurately than before and in a way that supports causal conclusions. Conventional evaluation methods rely on datasets of face images collected in natural environments (in the wild). Although these datasets contain labels for attributes such as race and gender, they can only reveal correlations between those attributes and model performance. In other words, the labels do not reveal how a particular attribute causally affects the performance of a face recognition model. For example, a conventional evaluation can produce a result such as "Model A shows different accuracy for female and male faces in dataset X," but it cannot clarify whether the attribute in question (in this case gender) actually caused the difference in accuracy. The result could be due to gender bias, or to other factors (e.g., higher quality or more diverse images of women in the dataset).

The method proposed in this paper uses a neural-network-based face generation tool to generate face images. This allows only the attributes whose effects are to be examined (e.g., race and gender) to be changed, while all other attributes (e.g., age and facial expression) are held constant, making it possible to assess separately and unambiguously whether a particular attribute affects the performance of a face recognition model. As a result, more specific and causal conclusions can be drawn, such as "the accuracy of model A is affected by gender and skin color."

Proposed Method

The method proposed in this paper consists of seven steps to measure the bias of a face recognition system. In Step 1, the latent space of the GAN is sampled to generate random seed face images. These are the base face images from which the test images for the face recognition models are derived. In Step 2, the features of the GAN's latent space are controlled to generate a prototype face image for each combination of race and gender, as shown below.

Note that "WM," "WF," "BM," "BF," "AM," and "AF" stand for White Male, White Female, Black Male, Black Female, East Asian Male, and East Asian Female. In Step 3, as shown in the figure below, changes are made to each prototype face image regarding facial orientation, age, expression, and lighting.
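The attribute changes in Steps 2 and 3 can be pictured as moving a latent vector along a direction associated with one attribute while leaving everything else untouched. The following is only a minimal sketch under that assumption, not the authors' implementation: the article does not say which GAN is used or how the directions are found, so `edit_latent`, the toy 4-dimensional latent, and `pose_direction` are all hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]

def edit_latent(latent, direction, strength):
    """Move `latent` by `strength` along the unit-normalized `direction`."""
    unit = normalize(direction)
    return [w + strength * d for w, d in zip(latent, unit)]

# Toy values (assumptions): a 4-D prototype latent and a "pose" direction.
prototype_latent = [0.5, -1.0, 0.25, 2.0]
pose_direction = [0.0, 3.0, 0.0, 4.0]

# One variant per strength; strength 0 reproduces the prototype itself.
variants = [edit_latent(prototype_latent, pose_direction, s) for s in (-1.0, 0.0, 1.0)]
```

In a real pipeline each edited latent would be passed through the generator to render an image; that step is omitted here because it depends on the specific GAN.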

In Step 4, image pairs are created from the generated face images. In Steps 5 and 6, human annotation is performed: attributes are rated for each image, and identity similarity is rated for each image pair. Annotations are collected from nine people per item using Amazon SageMaker Ground Truth, and their average is used as the annotation result. For each composite face image, annotators rate skin type, gender, facial expression, age, and fakeness on a 5-point scale. For the attribute annotation, 123,000 annotations were collected from 2,214 annotators. In addition, to determine whether each face pair shows the same person or different people from an ordinary observer's point of view, annotators labeled each image pair as 'likely same', 'possibly same', 'not sure', 'possibly different', or 'likely different'. For the pair annotation, 432,000 annotations were collected from 1,905 annotators.

Annotation is performed with the interface shown in the figure below. In this paper, human evaluations (annotations) of image pairs are called "Human Consensus Identity Confidences" (HCICs).
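To make the averaging step concrete, the sketch below turns the answers of nine annotators for one image pair into a single consensus score. It is an illustration under stated assumptions: the five answer options are assumed to run from 'likely same' to 'likely different', and the numeric values in `LABEL_TO_SCORE` are hypothetical, since the article does not give the mapping used in the paper.

```python
from statistics import mean

# Hypothetical mapping from answer options to confidences in [0, 1];
# the article does not state the numeric values used in the paper.
LABEL_TO_SCORE = {
    "likely same": 1.0,
    "possibly same": 0.75,
    "not sure": 0.5,
    "possibly different": 0.25,
    "likely different": 0.0,
}

def consensus_confidence(labels):
    """Average the annotators' answers for one image pair into a single score."""
    if not labels:
        raise ValueError("at least one annotation is required")
    return mean(LABEL_TO_SCORE[label] for label in labels)

# Nine annotators judging one pair (made-up answers).
answers = ["likely same"] * 6 + ["possibly same"] * 2 + ["not sure"]
hcic = consensus_confidence(answers)
```

A score near 1 means the annotators broadly agree that the pair shows the same person, and a score near 0 means they agree it shows different people.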

Finally, in Step 7, the synthesized pairs of face images are input into the face recognition model and biases are evaluated using HCIC.


Experiments

In this paper, the bias of face recognition algorithms is analyzed using three face recognition models: ResNet-34 trained on Glint360k, ResNet-34 trained on MS1MV3, and SFNet-20 trained on VGGFace2. All of these models were trained on large datasets collected in natural environments (in the wild) and achieve high accuracy on their respective test datasets. The models are evaluated by feeding them image pairs and computing the cosine similarity of their outputs for the two images.
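As a reminder of what the similarity score measures, here is a minimal sketch of cosine similarity between two feature vectors. The 3-dimensional vectors are hypothetical stand-ins; real face recognition models produce much longer embeddings, and how they are extracted is not covered here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)

# Hypothetical embeddings standing in for a model's real output.
emb_1 = [1.0, 2.0, 2.0]
emb_2 = [2.0, 4.0, 4.0]   # points in the same direction as emb_1
emb_3 = [2.0, -1.0, 0.0]  # orthogonal to emb_1

same_direction = cosine_similarity(emb_1, emb_2)
orthogonal = cosine_similarity(emb_1, emb_3)
```

Because the score depends only on direction, scaling an embedding does not change it; a pair is typically treated as a match when the score exceeds a chosen threshold.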

The dataset is created using the method described above and contains 10,200 composite face images covering 600 different identities. From this dataset, 12,000 identical (same-identity) face image pairs and 36,000 non-identical (different-identity) face image pairs are generated.

In this paper, only image pairs with a fakeness score of less than 0.8 are used in the final analysis, which leaves 11,682 identical face image pairs and 35,406 non-identical face image pairs. Examples of face images with a fakeness score of 0.8 or higher, which are excluded, are shown in the figure below.
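The filtering step might look like the sketch below. Only the 0.8 threshold comes from the article; the rule that a pair is kept when both of its images score below the threshold, the record layout, and the example scores are assumptions made for illustration.

```python
SCORE_THRESHOLD = 0.8  # threshold stated in the article

def keep_pair(score_a, score_b, threshold=SCORE_THRESHOLD):
    """Keep a pair only if both images score strictly below the threshold.

    Requiring both images to pass is an assumed rule, not one given in the article.
    """
    return score_a < threshold and score_b < threshold

# Hypothetical records: (pair id, score of image A, score of image B).
pairs = [("p1", 0.10, 0.30), ("p2", 0.85, 0.20), ("p3", 0.79, 0.80)]
kept = [pair_id for pair_id, a, b in pairs if keep_pair(a, b)]
```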

First, we examine how changes in facial features (attributes) affect the face recognition model's prediction of similarity for face-image pairs. The figure below shows the results for the ResNet34 (MS1MV3) model when the face orientation is changed.

As expected, the face recognition model shows the highest similarity between face images of the same prototype (same attributes, same seed image) and the lowest similarity between face images of different prototypes. This indicates that the face recognition model is discriminating between groups of face images. Furthermore, the similarity decreases as the angle of the face moves away from 0 degrees, which implies that the orientation of the face affects the similarity decision. A comparison of the second and third items in the above figure also shows that the face recognition model uses demographic attributes (e.g., race and gender) as important information when identifying faces. Such an analysis helps us understand how face recognition models respond to different attributes and is important for revealing their biases and limitations.

The figure below shows the bias of the face recognition models across demographic attributes, evaluated with FNMR (False Non-Match Rate) and FMR (False Match Rate). Model 1 represents SFNet trained with VGGFace2, Model 2 represents ResNet34 trained with MS1MV3, and Model 3 represents ResNet34 trained with Glint360k. FNMR is the percentage of face image pairs that show the same person but that the model erroneously judges to be different people. FMR is the percentage of face image pairs that show different people but that the model incorrectly judges to be the same person. It is calculated by dividing the number of incorrectly matched pairs by the total number of pairs that actually show different people.
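The two error rates defined above can be computed per group as in the following sketch. The decision rule (a pair counts as a match when its similarity is at or above a threshold), the threshold of 0.5, the group names, and all scores are assumptions made up for illustration; they are not results from the paper.

```python
def error_rates(pairs, threshold):
    """Return (FNMR, FMR) for (similarity, is_same_person) pairs.

    A pair is predicted "same person" when similarity >= threshold.
    """
    genuine = [s for s, same in pairs if same]
    impostor = [s for s, same in pairs if not same]
    if not genuine or not impostor:
        raise ValueError("need at least one same-person and one different-person pair")
    false_non_matches = sum(1 for s in genuine if s < threshold)
    false_matches = sum(1 for s in impostor if s >= threshold)
    return false_non_matches / len(genuine), false_matches / len(impostor)

# Made-up similarity scores for two hypothetical groups.
scores_by_group = {
    "group_a": [(0.9, True), (0.8, True), (0.7, True), (0.6, True), (0.2, False), (0.1, False)],
    "group_b": [(0.9, True), (0.4, True), (0.3, True), (0.6, True), (0.7, False), (0.1, False)],
}
rates = {group: error_rates(p, threshold=0.5) for group, p in scores_by_group.items()}
```

Comparing the resulting (FNMR, FMR) tuples across groups at the same threshold is what exposes a gap in error rates between them.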

The figure above shows that all face recognition models have the lowest error rates for white males and white females. This indicates that these face recognition models recognize white faces most accurately. On the other hand, Model 3 (ResNet34 trained on Glint360k) in particular performs very poorly on Black women. This suggests that a racial bias is present.

Looking at face orientation (pose), all face recognition models also show a significant racial bias. This means that faces of particular races become harder to recognize accurately when the face orientation changes.

Looking at Lighting, Model 2 (ResNet34 trained on MS1MV3) shows lower performance, especially for Black and Asian women. This indicates that under certain lighting conditions, accurate identification becomes more difficult for these races and genders. For changes in facial expression, all of the face recognition models perform best for white males, meaning that a bias exists. The same bias is evident for age and gender. As can be seen from the above, there is a clear bias in the face recognition models used in this study.


Summary

This paper proposes a novel experimental approach to measuring the bias of face recognition algorithms: it generates composite face images whose attributes can be varied independently, and it uses an identity ground truth obtained by averaging the judgments of multiple human annotators. The synthetic test dataset is constructed by generating balanced "prototypes" representing two genders and three races, from which attributes such as face orientation, lighting, facial expression, and age are systematically modified. Finally, a dataset of 12,000 identical face pairs and 36,000 non-identical face pairs is constructed.

The paper also evaluates the validity of the method using three representative face recognition models. The results show that biases exist in all of them, with higher accuracy for white men and women and, in certain models, lower accuracy for Black women. Furthermore, the face recognition algorithms are more sensitive to changes in face orientation and facial expression, whereas the effects of age and lighting, while present, are relatively small.
