Can AI Fairly Understand Your Facial Expressions? Examining Racial Bias In Emotion Recognition
3 main points
✔️ To assess racial bias in facial emotion recognition technology, we examine how training data of different racial compositions affect model fairness
✔️ Through simulations using training data of multiple racial compositions, we observe that racially balanced training data does not necessarily improve prediction accuracy (F1 scores) or fairness
✔️ For large datasets in particular, the results highlight the need for a broad range of interventions, not just data balancing, to address fairness issues in facial emotion recognition technology
Addressing Racial Bias in Facial Emotion Recognition
written by Alex Fan, Xingshuo Xiao, Peter Washington
(Submitted on 9 Aug 2023)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
Since the recent breakthroughs in deep learning, the performance of Facial Emotion Recognition (FER) has also improved rapidly. It is used in many fields, including marketing analysis, robotics, and health information analysis. However, racial bias remains a challenge in this field as well.
This paper examines the effects of racial bias using datasets with varying racial compositions. With a small dataset, racially balancing the training data improves both fairness and emotion-recognition accuracy, with F1 scores rising by 27.2 percentage points and demographic parity by 15.7 points on average. With a large dataset, however, simply balancing the races in the training data does little to improve fairness. In other words, for large datasets, race-balancing the training data alone is not sufficient, and other measures are also needed to equalize emotion-recognition accuracy across racial groups.
This paper uses two datasets to examine racial bias. The first is the Child Affective Facial Expression (CAFE) dataset, a collection of images of children expressing specific emotions; the second is AffectNet, a widely used large-scale dataset for general facial emotion recognition. To make the two datasets consistent, the AffectNet data is filtered to keep only the emotion labels shared with CAFE (neutral, sadness, happiness, surprise, anger, disgust, and fear). Additional preprocessing is also applied, such as excluding grayscale images to allow more accurate race estimation.
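The harmonization step above can be sketched as a simple filter. This is a minimal illustration, not the authors' code: the record fields (`label`, `mode`) are hypothetical stand-ins for an AffectNet image's annotation and color mode.

```python
# Minimal sketch of the dataset harmonization step: keep only the seven
# emotion labels shared with CAFE, and drop grayscale images.
# Field names are illustrative, not from the original pipeline.

CAFE_LABELS = {"neutral", "sadness", "happiness", "surprise", "anger", "disgust", "fear"}

def harmonize(records):
    """Keep images whose emotion label exists in CAFE and which are not grayscale."""
    return [r for r in records
            if r["label"] in CAFE_LABELS and r["mode"] != "L"]  # "L" = grayscale mode

sample = [
    {"label": "happiness", "mode": "RGB"},
    {"label": "contempt",  "mode": "RGB"},  # not a CAFE label -> dropped
    {"label": "fear",      "mode": "L"},    # grayscale -> dropped
]
print(len(harmonize(sample)))  # 1
```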
After this filtering, AffectNet contains 259,280 images for training, 1,700 for validation, and 1,484 for testing; the CAFE dataset contains 713, 227, and 222 images, respectively.
Racial labels are also needed. In CAFE, the children's self-reported race (e.g., European American, African American) is used as the label. AffectNet, on the other hand, contains no racial information, so a model trained on the FairFace dataset, which is designed to be racially balanced, is used to predict and label the race of each AffectNet image.
The table below shows the racial distribution included in CAFE.
The table below shows the racial distribution included in AffectNet.
As expected, European-American faces make up the bulk of the distribution of the CAFE and AffectNet training data, comprising 40.4% and 67.3% in the respective data sets. AffectNet also includes data for Middle Easterners and Southeast Asians, which are not included in CAFE. We included these additional racial categories in our experiment because of their potential impact on model learning.
To examine how racial bias affects emotion recognition, a specific race (the "simulated race") is selected and its percentage in the training set is varied. The resulting images are used to fine-tune ResNet-50; the trained model's performance is checked on the validation set, and the checkpoint that performed best during validation is used for the final test.
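The subset construction above can be sketched as follows. This is a hedged illustration of the general idea, not the paper's actual sampling code: the function name and the policy of splitting the remaining slots evenly across the other races are assumptions.

```python
import random

def build_training_set(images_by_race, simulated_race, target_frac, total_size, seed=0):
    """Compose a training set in which `simulated_race` makes up `target_frac`
    of the samples; remaining slots are filled evenly from the other races.
    (Integer division may leave the total slightly short of `total_size`.)"""
    rng = random.Random(seed)
    n_sim = round(total_size * target_frac)
    subset = rng.sample(images_by_race[simulated_race], n_sim)
    others = [r for r in images_by_race if r != simulated_race]
    per_other = (total_size - n_sim) // len(others)
    for r in others:
        subset += rng.sample(images_by_race[r], per_other)
    rng.shuffle(subset)
    return subset

# Synthetic pools standing in for image lists per racial group.
pools = {race: [(race, i) for i in range(200)] for race in ["A", "B", "C"]}
train = build_training_set(pools, "A", 0.5, 100)
print(len(train), sum(1 for race, _ in train if race == "A"))  # 100 50
```

Varying `target_frac` over a grid for each simulated race, then fine-tuning on each subset, reproduces the shape of the experiment described above.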
We also use two metrics to measure how fairly the model recognizes emotions. The first is "demographic parity," which assesses whether the model predicts each emotion at the same rate across races; the closer the ratio is to 1, the fairer the model. The second is "equalized odds," which assesses whether the model's correct and incorrect recognition rates are the same across races.
Through these tests, we are examining the impact on model fairness when AI models are trained on racially balanced data.
Simulations on the CAFE dataset show results in line with expectations on several measures. The figure below shows that F1 scores (red line) and demographic parity (green line) increase on average by +27.2 and +15.7 percentage points, respectively, as the racial composition of the dataset becomes more balanced, stabilizing as the percentage of the simulated race increases. The equalized odds ratio (purple line), on the other hand, does not stabilize, showing an upward trend for the Latino simulation but random or downward trends for the other races.
In addition, the figure below shows F1 scores for each race and emotion label. Neutral, Sad, and Fear show significant improvements in F1 score, while Surprise and Disgust prove difficult to predict, showing random or only limited trends.
The figure below shows the results of a simulation on a small dataset drawn from AffectNet, conducted to investigate the effect of dataset size. On average, the simulation achieves an F1 score of 15.2% and a demographic parity of 0.286, clearly lower than in the CAFE simulation. The limited training data size and the large variability in the emotion distribution of AffectNet's in-the-wild images may have contributed to this gap. Overall, model performance does not change significantly as the dataset becomes more racially balanced.
The figure below shows the results of a simulation on a larger AffectNet subset, again examining the effect of dataset size. Here, too, racially balancing the dataset did not increase F1 scores or fairness, and no clear trend emerges between racial balance and test performance. This suggests there is no direct evidence that improving the racial balance of a large dataset translates into better performance at test time.
Racial discrimination is a global problem. In some cases, the use of face recognition technology has been suspended over concerns that recognition accuracy varying by race could disadvantage people of certain races. Likewise, facial emotion recognition must be equally accurate across races so that no group is disadvantaged, yet fairness in this technology remains a challenge.
This paper uses the CAFE and AffectNet datasets to investigate how the racial distribution of the training data affects the model's recognition performance and its fairness across races. Training datasets with various racial compositions were created and emotion-recognition accuracy (F1 score) was evaluated for specific races, but the improvement was insufficient: simply balancing the racial composition of a dataset does not necessarily improve the model's performance or fairness. The paper suggests additional measures, such as excluding groups with particularly inaccurate race estimates during preprocessing. Racial bias in facial emotion recognition remains a problem, and further improvements are needed.