How Do Duplicate Images Affect Face Recognition Performance? The Importance Of De-duplication In Face Image Datasets

Face Recognition 13/02/2024

3 main points
✔️ Duplicate detection method: proposes an efficient hashing-based method to detect exact and near-duplicate face images
✔️ Impact analysis of de-duplication: shows that de-duplication has only a small impact on face recognition performance
✔️ Importance of dataset cleaning: proposes implementing a duplicate filter to improve the quality of future face image datasets

Double Trouble? Impact and Detection of Duplicates in Face Image Datasets
written by Torsten Schlett, Christian Rathgeb, Juan Tapia, Christoph Busch
(Submitted on 25 Jan 2024)
Comments: Accepted at the 13th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2024)
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

Face recognition research often uses datasets of face images collected from the web, but these datasets may contain duplicate face images. To address this problem, methods are needed to detect duplicates in face image datasets and improve their quality. This paper presents a method for detecting duplicates in a dataset of face images and applies it to five datasets of face images collected from the web (LFW, TinyFace, Adience (aligned), CASIA-WebFace, and C-MS-Celeb (aligned)). The de-duplicated datasets are publicly available. Experiments with a face recognition model have also been conducted to study the impact of de-duplication.

Duplicate Detection Methods

The paper examines detection methods for two kinds of duplicates: "Exact Duplicates", whose file data match exactly, and "Near Duplicates", images that are not exact matches but should still be considered duplicates.

An "Exact Duplicate" can be identified by comparing file data: two files are exact duplicates when their data match exactly. For computational efficiency, candidate duplicate sets are first collected by hashing each file with BLAKE3. Since the same data always generates the same hash value, hashing cannot cause false negatives, but false positives (hash collisions) are possible. Therefore, an additional step checks for an exact match of the file data within each candidate set found by hashing. It is a very simple check, yet it detected "Exact Duplicates" in all datasets, which suggests that no such check was performed when the five datasets were created.
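The two-step check described above (hash first, then confirm by comparing file data) can be sketched as follows. This is a minimal illustration rather than the paper's implementation: BLAKE2b from the Python standard library stands in for BLAKE3 so that the example has no third-party dependency, and the function name `find_exact_duplicates` is hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_exact_duplicates(paths):
    """Group files whose bytes are identical.

    Step 1 buckets files by hash. Step 2 confirms each bucket with a
    byte-for-byte comparison, so a hash collision cannot cause a false positive.
    """
    buckets = defaultdict(list)
    for path in sorted(paths):
        # Assumption: BLAKE2b (standard library) stands in for BLAKE3 here.
        digest = hashlib.blake2b(Path(path).read_bytes()).hexdigest()
        buckets[digest].append(path)

    duplicate_sets = []
    for candidates in buckets.values():
        # Split each hash bucket by actual file content to rule out collisions.
        by_content = defaultdict(list)
        for path in candidates:
            by_content[Path(path).read_bytes()].append(path)
        duplicate_sets.extend(g for g in by_content.values() if len(g) > 1)
    return duplicate_sets
```

Reading each file a second time keeps the sketch short; for large datasets it would likely be preferable to compare files only within buckets that contain more than one candidate.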

Next is "Near Duplicate," which is a similar image that is slightly different but should be considered a duplicate in, for example, face recognition research. The figure below is a sample.

Definitions of this "Near Duplicate" vary. This paper uses two image hashing methods, pHash (perceptual hashing) and crop-resistant hashing, based on the default settings of a Python package called "ImageHash".

pHash detected more duplicates than crop-resistant hashing, suggesting that crop-resistant hashing may not work well in certain cases. Because the sets of duplicate images detected by the image hashing functions may overlap, correcting false positives and merging (removing) sets of duplicate images are important parts of the final de-duplication process.

Face recognition and face image quality assessment models rely on face image preprocessing: the original face image is cropped and aligned based on detected facial landmarks. The duplicate detection method presented in this paper also works for such preprocessed face images.

However, detection on preprocessed images should be done as an additional step after duplicate detection on the unaltered original images. This is because the facial landmark detection required for preprocessing may fail for some images, and because differences between the preprocessed and original images may yield duplicate sets in one version that are not found in the other.
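Merging overlapping duplicate sets can be illustrated with a small union-find sketch. The input format (one collection of image paths per detected set) and the function name `merge_overlapping_sets` are assumptions made for this example; the article does not specify how the paper implements the merge.

```python
def merge_overlapping_sets(duplicate_sets):
    """Merge duplicate sets that share at least one image (union-find)."""
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    for group in duplicate_sets:
        members = list(group)
        if not members:
            continue
        root = find(members[0])
        for other in members[1:]:
            parent[find(other)] = root  # link every member to the same root

    merged = {}
    for item in list(parent):
        merged.setdefault(find(item), set()).add(item)
    # Sorted output keeps the result reproducible.
    return sorted(sorted(group) for group in merged.values())


# Hypothetical sets, e.g. one list from pHash and one from crop-resistant hashing.
phash_sets = [{"a.jpg", "b.jpg"}]
crop_sets = [{"b.jpg", "c.jpg"}, {"x.jpg", "y.jpg"}]
merged_sets = merge_overlapping_sets(phash_sets + crop_sets)
```

Here the two sets that share `b.jpg` collapse into one set of three images, while the unrelated pair stays separate.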

In this paper, the similarity transform used in ArcFace is used to preprocess face images. For images with multiple detected faces, the primary face is selected based on the width and height of the detection bounding box, proximity to the image center, and the detector's confidence score. All preprocessed images have the same size of 112 x 112 pixels. The figure below shows a sample of the preprocessing.

The number of face images that failed landmark detection was 0 for LFW, 859 for TinyFace (only 4 of which were duplicates based on detection on the original images), 79 for Adience (9 duplicates), 129 for CASIA-WebFace (2 duplicates), and 6,179 for C-MS-Celeb (351 duplicates). These images are not considered in the additional duplicate detection step, which is one reason why detection on preprocessed images is recommended only as an additional step after detection on the original images.

The table below provides an overview of the datasets examined, showing the total number of images and duplicates. "Intra" represents intra-subject duplicates (all images in the duplicate set belong to only one subject), "Subjects-w.-intra" represents subjects with at least one intra-subject duplicate, "Inter" represents inter-subject duplicates (the images in the duplicate set belong to multiple subjects), and "Subjects-w.-inter" represents subjects with at least one inter-subject duplicate.

LFW, TinyFace, and Adience each contain fewer than 20,000 face images, CASIA-WebFace contains 494,414, and C-MS-Celeb contains 6,464,016. Only LFW has a very low number of duplicates, while CASIA-WebFace and C-MS-Celeb have a very high number. TinyFace also contains 153,428 non-face images, which are not considered in this paper; only the remaining 15,975 face images are considered. For Adience, the aligned version of the face images is used. The absolute number of duplicates in CASIA-WebFace is higher than in the smaller datasets, but its ratio of duplicates to the total number of face images is relatively low (excluding LFW).

C-MS-Celeb is a subset of MS-Celeb1M with clean subject labels. The aligned version is used in this paper. Of all the datasets examined, it has the highest absolute number of duplicates and the highest ratio of duplicates to total face images. In this dataset, 33,918 inter-subject duplicates are also intra-subject duplicates, so the total number of images that are part of some duplicate set is 885,476. This overlap between intra-subject and inter-subject duplicates does not occur in any other dataset; for the others, the total is simply the sum of the intra-subject and inter-subject numbers.

The duplicate detection approach can be applied to identify duplicates within a single dataset as well as between multiple datasets. This makes it possible to eliminate unintentional duplicates between datasets collected from different sources. When using preprocessed images, new duplicates were found between datasets, and these were manually confirmed as true positives (mainly between CASIA-WebFace and C-MS-Celeb, but also between others).

Deduplication Methods

This section describes how to efficiently identify and remove duplicate images from a dataset. First, a representative image is selected and kept from each set of duplicate images. This eliminates unnecessary duplicates while retaining the important information in the dataset. For sets of completely identical images, reproducibility is ensured by sorting the images in lexicographic order and selecting the first one rather than choosing randomly.

If a duplicate set spans different subjects, the de-duplicated image is reassigned to the appropriate subject through a more complex procedure. It is compared with the non-duplicate images of all relevant subjects and assigned to the subject with the highest similarity. However, if the average similarity score is low or the score difference between candidate subjects is small, the image is excluded from the dataset to avoid an incorrect assignment.

False positives are also corrected. The image hashing techniques used to detect similar images may erroneously flag different images as duplicates. To correct such false positives, face recognition is used to identify images that are actually different so that they are not removed. This step reduces false positives by filtering out image pairs whose similarity scores are below a certain threshold.
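The reassignment rule above can be sketched as a small decision function. Everything specific here is an assumption for illustration: the function name `assign_subject`, the input format, and the thresholds `min_mean` and `min_margin` are hypothetical and are not values from the paper.

```python
from statistics import mean


def assign_subject(similarities, min_mean=0.4, min_margin=0.1):
    """Pick the best-matching subject for a de-duplicated image, or None.

    similarities maps each candidate subject to the similarity scores between
    the image and that subject's remaining (non-duplicate) images.
    min_mean and min_margin are hypothetical thresholds.
    """
    ranked = sorted(
        ((mean(scores), subject) for subject, scores in similarities.items() if scores),
        reverse=True,
    )
    if not ranked or ranked[0][0] < min_mean:
        return None  # no candidate is similar enough
    if len(ranked) > 1 and ranked[0][0] - ranked[1][0] < min_margin:
        return None  # the top two candidates are too close to call
    return ranked[0][1]
```

Returning `None` corresponds to excluding the image from the dataset instead of risking an incorrect label.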

After correcting any false positives, the images in each duplicate set are sorted by their quality scores, and the image with the highest quality score is selected as the representative of the set. Images for which a quality score cannot be calculated are placed at the bottom of the list, so images with calculated scores are given priority.
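A minimal sketch of this selection step is shown below. The function name `pick_representative` and the score mapping are hypothetical, and breaking ties by lexicographic path order is an assumption added to keep the result reproducible.

```python
def pick_representative(paths, quality_scores):
    """Return the image to keep from one duplicate set.

    quality_scores maps path -> float. Images without a score are either
    missing from the mapping or mapped to None, and they are ranked last.
    """
    def sort_key(path):
        score = quality_scores.get(path)
        # Scored images first (False < True), higher score first, then path order.
        return (score is None, -(score or 0.0), path)

    return sorted(paths, key=sort_key)[0]


# Hypothetical duplicate set: c.jpg has no quality score.
keep = pick_representative(
    ["b.jpg", "a.jpg", "c.jpg"],
    {"a.jpg": 0.5, "b.jpg": 0.9, "c.jpg": None},
)
```

When no image in the set has a score, the key falls back to path order alone, which matches the lexicographic rule used for completely identical images.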

Experiment

Here we investigate how de-duplication affects face recognition performance. Evaluating face recognition requires selecting matched and non-matched pairs of face images.

For matched pairs, one method is to select all possible pairs for each subject. However, this can significantly increase the number of matched pairs for a subject with a relatively large number of images in the dataset: a subject with N images yields N(N-1)/2 matched pairs, with each image involved in N-1 of them.

However, since the purpose of these experiments is to compare results on a dataset with and without duplicates, a balanced number of matched pairs per image is desirable. Therefore, in this paper, matched pairs are selected "circularly" for each subject in the dataset: the image at index i forms a matched pair with the image at the next index i+1, and the last image forms a pair with the first image if the subject has more than two images.

Thus, a subject with more than two images has as many matched pairs as images, while a subject with exactly two images has one matched pair. Images are ordered by ascending lexicographic order of their paths. This "circular" method of selecting matched pairs requires relatively few computational resources.

Below is the number of matched pairs per dataset and the number of subjects excluded because they have only a single image.
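The "circular" selection can be sketched in a few lines. The function name `circular_matched_pairs` is hypothetical; the behavior follows the description above, including the special case of a subject with exactly two images.

```python
def circular_matched_pairs(image_paths):
    """Pair each image of one subject with the next one, wrapping around."""
    ordered = sorted(image_paths)  # ascending lexicographic order of paths
    n = len(ordered)
    if n < 2:
        return []  # subjects with a single image yield no matched pair
    if n == 2:
        return [(ordered[0], ordered[1])]  # wrapping would repeat the same pair
    return [(ordered[i], ordered[(i + 1) % n]) for i in range(n)]


pairs = circular_matched_pairs(["s1/c.jpg", "s1/a.jpg", "s1/b.jpg"])
```

For a subject with 100 images this gives 100 matched pairs, whereas selecting all possible pairs would give 100 * 99 / 2 = 4,950.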

TinyFace: 11,881 (153 excluded subjects)
Adience: 18,093 (815 excluded subjects)
CASIA-WebFace: 494,284 (0 excluded subjects)
C-MS-Celeb: 6,457,562 (123 excluded subjects)

For each dataset, a number of non-matched pairs equal to the number of matched pairs is randomly selected. The MagFace model is used for face recognition. The resulting similarity scores are then used to evaluate face recognition performance in terms of false non-match rate (FNMR), false match rate (FMR), and equal error rate (EER).
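To make the three metrics concrete, here is a minimal sketch of how they can be computed from similarity scores. It assumes that a higher score means greater similarity and that a comparison is accepted when its score reaches the threshold. The function names are hypothetical, and the EER is only approximated at the observed scores; the interpolation used in the paper is not described in this article.

```python
def fnmr_fmr(matched_scores, non_matched_scores, threshold):
    """Error rates when a comparison is accepted if score >= threshold."""
    # FNMR: share of matched pairs that are wrongly rejected.
    fnmr = sum(s < threshold for s in matched_scores) / len(matched_scores)
    # FMR: share of non-matched pairs that are wrongly accepted.
    fmr = sum(s >= threshold for s in non_matched_scores) / len(non_matched_scores)
    return fnmr, fmr


def approximate_eer(matched_scores, non_matched_scores):
    """Approximate the EER at the observed threshold where FNMR and FMR are closest."""
    thresholds = sorted(set(matched_scores) | set(non_matched_scores))
    best = min(
        (fnmr_fmr(matched_scores, non_matched_scores, t) for t in thresholds),
        key=lambda rates: abs(rates[0] - rates[1]),
    )
    return (best[0] + best[1]) / 2


# Toy scores for illustration only, not results from the paper.
matched = [0.9, 0.8, 0.7, 0.3]
non_matched = [0.1, 0.2, 0.4, 0.6]
rates_at_half = fnmr_fmr(matched, non_matched, 0.5)
eer = approximate_eer(matched, non_matched)
```

Running the same computation on a dataset with and without duplicates gives the kind of comparison reported in the paper's results table.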

The table below shows the results. We see that there are mainly small differences between the variants with and without duplicates, and that the removal of duplicates can increase or decrease the error rate in different cases.

Summary

This paper presents a method for duplicate face image detection based on hash functions. It also proposes how to use this process on preprocessed face images as well as the original images. With these methods, duplicate images are detected for five representative face image datasets collected through web scraping.

With the exception of the LFW dataset, more than 1% of all images in each dataset are considered duplicates, with counts ranging from hundreds to hundreds of thousands. Most duplicate images belong to a single subject (intra-subject duplicates), but some, especially in the C-MS-Celeb dataset, belong to multiple subjects (inter-subject duplicates).

The paper also shows how to efficiently identify and remove duplicate images from a dataset. It proposes reducing false positive duplicates, selecting the highest-quality face image per duplicate set, and assigning de-duplicated face images to the best-matching subject within an inter-subject duplicate set (or leaving them unassigned if uncertain). It also examines the impact of de-duplication on face recognition performance.

The fact that typical face image datasets used for face recognition contain substantial numbers of duplicates, as shown in this paper, suggests that the risk of duplicates is high for web-scraped face image datasets and that duplicate filters should be considered when building datasets in the future.

Note that the deduplicated datasets from the five datasets covered in this paper are available on GitHub.
https://github.com/dasec/dataset-duplicates


Takumu: I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.
