Multi-tasking Face (MTF), A New Facial Image Dataset That Respects Privacy And Can Be Used For Multiple Tasks
3 main points
✔️ Proposed a new facial image dataset that is GDPR compliant and can be used for multiple tasks of facial recognition, race, gender, and age classification.
✔️ Rigorous filtering and labeling to ensure high quality data.
✔️ The processed dataset shows high performance and will be extended to new tasks such as facial anonymization in the future.
Multi-Task Faces (MTF) Data Set: A Legally and Ethically Compliant Collection of Face Images for Various Classification Tasks
written by Rami Haffar, David Sánchez, Josep Domingo-Ferrer
(Submitted on 20 Nov 2023)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
Facial images are highly useful data that can be used for a variety of classification tasks, including facial recognition, age estimation, gender identification, sentiment analysis, and racial classification. On the other hand, they are highly sensitive personal information, and privacy regulations such as GDPR restrict the collection and use of facial images for research purposes. As a result, large datasets of previously publicly available facial images have been made private.
Therefore, datasets based on synthetic face images have been attracting attention in recent years. However, it is difficult for synthetic images to match the data distribution of real face images, and models trained on them perform worse than models trained on real face images. In addition, most traditional datasets are labeled for a single specific task, which limits their use.
To address these issues, this paper proposes the Multi Task Face (MTF) Dataset, a dataset of real face images designed to be used for various classification tasks, including face recognition, race, gender, and age classification, while clearing legal restrictions.
This paper introduces the dataset and describes the data collection and processing procedures. It also evaluates the performance of the MTF dataset when used in various classification tasks. In addition, the MTF dataset is available at https://github.com/RamiHaf/MTF_data_set.
What is a multi-tasking face (MTF) data set?
The MTF dataset was collected under the special exception in Article 9 of the GDPR (General Data Protection Regulation). This exception permits the processing of personal data that the data subjects (i.e., the data owners) have manifestly made public themselves. For this reason, the dataset focuses on publicly known individuals, which allows it to be published legally and securely while avoiding privacy issues. In addition, approval for the creation and use of this dataset was obtained from the Committee for Ethical and Legal Evaluation of SoBigData++ (BOEL).
For data collection, we use the IMDB Website to select publicly known individuals (celebrities). To increase diversity and comprehensiveness, we include the four racial categories used by the United States Census Bureau: Asian (Chinese/Korean), Asian (Indian), Black, and White. For gender, we include equal numbers of men and women to reduce gender bias. Similarly, we include equal numbers of young and older celebrities, with ages 18-49 defined as "young" and ages 50 and older defined as "older." An equal number of celebrities (40 older men, 40 older women, 40 younger men, and 40 younger women each) were selected from each racial group for a total of 640 IDs.
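The balanced selection described above is a simple grid over race, gender, and age group. The following is a minimal sketch of that arithmetic; the names (`RACES`, `PER_CELL`, `age_group`) are illustrative and not taken from the authors' code, and rejecting ages under 18 is an assumption based on the stated 18-49 range.

```python
from itertools import product

RACES = ["Asian (Chinese/Korean)", "Asian (Indian)", "Black", "White"]
GENDERS = ["male", "female"]
AGE_GROUPS = ["young", "older"]  # young: 18-49, older: 50 and above
PER_CELL = 40  # celebrities selected per (race, gender, age group) cell


def age_group(age: int) -> str:
    """Map an age in years to one of the two groups used in the article."""
    if age < 18:
        # Assumption: the article only defines groups from age 18 upward.
        raise ValueError("age groups are only defined for ages 18 and older")
    return "young" if age <= 49 else "older"


# Every combination of race, gender, and age group is one selection cell.
cells = list(product(RACES, GENDERS, AGE_GROUPS))
total_ids = len(cells) * PER_CELL

print(len(cells), "cells x", PER_CELL, "celebrities =", total_ids, "IDs")
```

The 16 cells with 40 celebrities each give the 640 IDs targeted at the collection stage.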
No limit was placed on the number of images downloaded per celebrity, and crawling continued until there were no more images available, eventually collecting 117,114 images. The data was first processed using Haar Cascade to automatically detect and crop facial regions in the images.
In addition, each cropped image was visually checked by three evaluators to ensure that it contained the face of the intended celebrity. Because the original images were obtained from public-domain and Creative Commons sources, they also included many unrelated images of artwork and designs, in which Haar Cascade incorrectly detected facial regions. Removing these false detections reduced the dataset to 42,575 images. Face images that did not belong to the correct ID were then excluded, reducing the dataset to 6,453 images.
Also excluded were images in which parts of the face are hidden (e.g., by sunglasses or by hands covering the mouth or eyes) and images that were hand-drawn, artificially altered, or generated by AI algorithms. Images in which the face looks unnatural due to make-up were removed as well. This reduced the dataset to 5,984 images.
In addition, to reduce the risk of data leakage and to avoid unnecessary additional costs for training the AI model, duplicate or similar images (e.g., sequential shots) were excluded from the dataset, bringing the dataset size to 5,763 images. From here, further images that did not meet the criteria of the assumed task were also excluded by the experts.
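The article does not say how duplicate or similar images were identified. As an illustration only, the sketch below uses average hashing, a common heuristic for flagging near-duplicates such as sequential shots. It assumes each image has already been reduced to an 8x8 grayscale thumbnail (64 values); the function names and the distance threshold are hypothetical.

```python
def average_hash(thumb):
    """Build a 64-bit hash from a flat sequence of 64 grayscale values."""
    mean = sum(thumb) / len(thumb)
    bits = 0
    for value in thumb:
        # Each pixel contributes one bit: 1 if brighter than the mean.
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def drop_near_duplicates(thumbs, max_distance=5):
    """Keep the first image of every group of near-duplicates.

    thumbs maps an image name to its 8x8 grayscale thumbnail.
    """
    kept = []  # list of (name, hash) pairs for the images retained so far
    for name, thumb in thumbs.items():
        h = average_hash(thumb)
        if all(hamming(h, kept_hash) > max_distance for _, kept_hash in kept):
            kept.append((name, h))
    return [name for name, _ in kept]


# Two almost identical thumbnails and one clearly different thumbnail.
shot_1 = [0] * 32 + [255] * 32
shot_2 = [0] * 32 + [250] * 32
other = [255] * 32 + [0] * 32
print(drop_near_duplicates({"shot_1": shot_1, "shot_2": shot_2, "other": other}))
```

A heuristic of this kind only proposes candidates; a manual check like the one described in the article would still decide what is actually removed.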
These filters ultimately reduced the data set significantly from 117,114 to 5,246 images (only 4.47% of the original data). All remaining face images were resized to a uniform resolution of 1024 x 1024 pixels.
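The filtering steps above form a funnel, and the numbers reported in the article can be tabulated to see where most images were lost. This is a small sketch; the stage names are paraphrases of the article's description, and only the counts come from the source.

```python
STAGES = [
    ("crawled images", 117_114),
    ("after removing false face detections", 42_575),
    ("after removing wrong identities", 6_453),
    ("after removing occluded or altered faces", 5_984),
    ("after removing duplicates and similar shots", 5_763),
    ("after task-specific expert filtering", 5_246),
]


def funnel(stages):
    """Return (name, count, removed_in_this_step, share_of_original) rows."""
    original = stages[0][1]
    previous = original
    rows = []
    for name, count in stages:
        rows.append((name, count, previous - count, count / original))
        previous = count
    return rows


for name, count, removed, kept in funnel(STAGES):
    print(f"{name:<45} {count:>8,}  removed {removed:>7,}  kept {kept:.2%}")
```

The largest drops come from the first two manual checks. The final ratio is 5,246 / 117,114, roughly 4.479%, so it appears as 4.48% when rounded; the 4.47% quoted above seems to be the same ratio truncated rather than rounded.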
After the face images have been cropped and filtered down to those of sufficient quality, they are labeled. For face recognition, each image is labeled with one of 240 celebrity identities. For race classification, each image is labeled with one of four categories: Asian (Chinese/Korean), Asian (Indian), Black, and White. For gender classification, each image is labeled as male or female. For age classification, each image is placed into one of two categories: young or older.
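Because every image carries one label per task, a single record can serve all four classification tasks. The sketch below shows one possible way to represent such a record; the class name `FaceLabels`, its field names, the task keys, and the example identity are hypothetical and not taken from the published dataset.

```python
from dataclasses import dataclass

RACES = ("Asian (Chinese/Korean)", "Asian (Indian)", "Black", "White")
GENDERS = ("male", "female")
AGE_GROUPS = ("young", "older")


@dataclass(frozen=True)
class FaceLabels:
    """All four task labels attached to one face image."""
    identity: str   # celebrity name, one of 240 classes in the dataset
    race: str
    gender: str
    age_group: str

    def __post_init__(self):
        # Reject labels outside the categories listed in the article.
        if self.race not in RACES:
            raise ValueError(f"unknown race label: {self.race!r}")
        if self.gender not in GENDERS:
            raise ValueError(f"unknown gender label: {self.gender!r}")
        if self.age_group not in AGE_GROUPS:
            raise ValueError(f"unknown age label: {self.age_group!r}")

    def target(self, task: str) -> str:
        """Return the label that a model for the given task should predict."""
        return {
            "face": self.identity,
            "race": self.race,
            "gender": self.gender,
            "age": self.age_group,
        }[task]


example = FaceLabels("Example Person", "Black", "female", "young")
print([example.target(t) for t in ("face", "race", "gender", "age")])
```

Keeping the four labels together means the same image file can be reused for each task simply by choosing a different target.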
The experts go through a two-part verification process to ensure that these labels are correctly applied. First, each expert checks the labeling of the entire dataset individually, then the experts work together to validate the entire dataset. This process ensures that each image in the dataset is properly labeled for the facial recognition, race, gender, and age tasks.
The figure below shows the sequence of steps, cropping the face from the original image collected and the final label assigned.
Finally, the MTF data set is composed as shown in the table below.
Face Recognition is a classification task that identifies each celebrity by name; it includes 240 celebrities and covers all images in the dataset. Race Classification is a classification task with four labels, in which the Asian (Chinese/Korean) and White groups form the majority and the Asian (Indian) and Black groups the minority. Gender Classification is a binary male/female classification task that is relatively balanced, with only slightly more male than female celebrities. Age Classification, a binary classification task between young and older, has a very unbalanced data distribution, in contrast to the gender classification task. The "Young" category contains many celebrities and images, whereas the "Older" category covers only 50 celebrities and contains only 514 images.
The imbalance in the distribution of these tasks is due to differences in the frequency with which celebrities around the world publish images, differences in copyright licenses, differences in the frequency with which younger and older celebrities publish images, and the tendency for older celebrities to have more images from their youth. Therefore, the MTF dataset does not have an equal number of images across all tasks and labels, which was our initial goal, but this unbalanced distribution reflects the actual state of the data available online.
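To give a sense of how strong the age imbalance is, the figures quoted in the article can be turned into class shares. This is a back-of-the-envelope sketch that assumes every one of the 5,246 images carries an age label, so that all images not labeled "Older" are "Young"; the variable names are illustrative.

```python
TOTAL_IMAGES = 5_246
OLDER_IMAGES = 514

# Assumption: all remaining images belong to the "Young" category.
young_images = TOTAL_IMAGES - OLDER_IMAGES

older_share = OLDER_IMAGES / TOTAL_IMAGES
young_share = young_images / TOTAL_IMAGES

# Accuracy of a trivial model that always predicts the larger class.
majority_baseline = max(young_images, OLDER_IMAGES) / TOTAL_IMAGES

print(f"young: {young_images:,} images ({young_share:.1%})")
print(f"older: {OLDER_IMAGES:,} images ({older_share:.1%})")
print(f"always predicting the majority class: {majority_baseline:.1%} accuracy")
```

Under that assumption, fewer than one image in ten is labeled "Older", so plain accuracy on the age task should be read with the majority-class baseline in mind.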
Experiment
Here are the results of the performance evaluation of the face recognition task. The results are shown in the table below. This task contains 240 labels.
As expected, all pre-trained models perform better than Random Guess. All pre-trained models also perform better than the models trained from scratch, with ConvNeXT showing the best performance.
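For context, the Random Guess baseline can be estimated from the number of labels in each task. The sketch below assumes that Random Guess means choosing a label uniformly at random, which the article does not spell out; the dictionary keys are illustrative.

```python
# Number of labels per task, as described in the article.
TASK_CLASSES = {
    "face recognition": 240,
    "race classification": 4,
    "gender classification": 2,
    "age classification": 2,
}


def random_guess_accuracy(num_classes: int) -> float:
    """Expected accuracy when a label is picked uniformly at random."""
    return 1.0 / num_classes


for task, k in TASK_CLASSES.items():
    print(f"{task:<22} {k:>3} labels  expected accuracy {random_guess_accuracy(k):.2%}")
```

With 240 labels, a uniform guess is right less than 0.5% of the time, which is why face recognition is the task where a trained model stands out most clearly from the baseline.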
Next, we examine the validity of the data processing used to build the dataset. The MTF dataset was manually processed to remove low-quality or inappropriate images. To validate the effect of this processing, the same deep learning model (ConvNeXT) was trained on the Unprocessed dataset (the raw mass of images collected from the Internet) and on the MTF dataset (the manually processed version), and performance was compared on the four tasks. The results are shown in the table below.
As can be seen from the table, models trained on the processed MTF data set perform much better than models trained on the raw data set. For example, in the face recognition task, models trained on the processed dataset achieve about 80% accuracy, compared to only 10% on the raw dataset. The experiment also shows that high quality data is more important than large amounts of data. Training on a large amount of data will result in poor performance if that data contains a lot of noise.
These results underline the importance of data quality, especially of data that has undergone careful manual processing, in training machine learning models. Good data can significantly improve the performance of a model, making quality more important than quantity.
Summary
This paper proposes the Multi Task Face (MTF) Image Dataset, a single dataset of face images that can be used for four tasks: face recognition, race classification, gender classification, and age classification. The dataset is characterized by its privacy considerations and its compliance with legal requirements (in particular, the GDPR). It contains images of celebrity faces that are either in the public domain or under a license that allows modification and commercial use.
We also evaluate the performance of five deep learning models on the MTF dataset. Models with pre-trained weights show better results than models trained from scratch. Among them, a recently proposed model called the ConvNeXT model performs best on all four tasks.
The importance of how the dataset is processed is also demonstrated. Models trained on the processed dataset perform much better than models trained on the unprocessed raw data.
Finally, the authors state that they plan to extend this dataset to other tasks in the future, such as facial anonymization. It is hoped that this will lead to an even more useful dataset that remains compliant with privacy regulations.