SF-MASK: Benchmark Dataset For Classifying Masked Faces In Low-resolution Surveillance Camera Video

Face Recognition

3 main points
✔️ Discover missing data in existing public datasets
✔️ Fill in missing data and build a new dataset for surveillance camera use
✔️ Achieve higher accuracy than existing public datasets

A Masked Face Classification Benchmark
written by Federico Cunico, Andrea Toaiari, Marco Cristani
(Submitted on 23 Nov 2022)
Accepted at T-CAP workshop @ ICPR 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)


The images used in this article are from the paper, the introductory slides, or were created based on them.


With the outbreak of COVID-19, the WHO issued guidelines recommending the use of masks around the world. Research from universities and research institutes in various countries has also shown that masks are an easy and effective way to prevent infection. In Europe and the United States, masks are generally no longer required now that vaccination has spread. However, they are still recommended in some countries and regions, including Japan, and in limited environments such as indoors and on public transportation. Some reports estimated that 52 billion disposable masks would be produced in 2020 and predict that the market size will reach 280 billion yen by 2030. In any case, masks are still required in many situations, and mask-wearing check services are available to prevent people from forgetting to wear them where they are required.

In response to this situation, technology is being developed to distinguish faces that are wearing a mask from those that are not. In particular, classifying masked and unmasked faces in surveillance camera images is one of the most difficult tasks (this is only a technical verification, and the issues of privacy and ethics in the real world are not discussed here). As shown in the figure below, the difficulties include small face sizes, partially hidden faces, various face orientations (facing front, facing down, etc.), and various mask types.

This paper addresses these issues by constructing a new dataset, SF-MASK, from a collection of mask-related datasets that are already publicly available. It also analyzes what data is missing from the existing datasets and supplements it to make the dataset more complete. From here, we will look at the details of SF-MASK and its usefulness.

What is SF-MASK?

SF-MASK is built from existing datasets of face images that include mask-wearing images. The table below shows the datasets used and the composition of the images in each. The Wrong-Mask class covers images in which a mask is worn but not correctly, for example with the nose sticking out or with the mask pulled down to the chin. In addition, since use with a surveillance camera is assumed, low-resolution images (64 × 64 pixels or less) are used, and those images are designated as Small.

As you can see from the table, SF-MASK draws on a variety of datasets to cover every case: "Face Mask Label Dataset (FMLD)", a large dataset with many images; "Moxa3K", a dataset with many group images in which each face is small; "Medical Mask"; "Face Mask Dataset (FMD)"; "Medical Mask Dataset (MMD)"; and so on.

SF-MASK first consolidates all the datasets into one, giving 49,146 Mask, approximately 47,500 No-Mask, and 1,747 Wrong-Mask images. Structural Similarity (SSIM) is then applied to remove potential duplicates, and only images that are 64 × 64 pixels or smaller are kept. The final result is 9,055 Mask, 12,620 No-Mask, and 1,701 Wrong-Mask images. The distribution of image sizes in the dataset is as follows.

A sample of the SF-MASK dataset is shown in the figure below. The size of each color-coded area indicates its composition ratio in the dataset.

Also, as you can see from the above figure, SF-MASK includes a test dataset called the SF-MASK Test Set. Since use with a surveillance camera is assumed, the test dataset was created by acquiring 1,077 images from video sequences captured by multiple surveillance cameras in the ICE Lab at the University of Verona, Italy. It includes 584 Mask, 270 No-Mask, and 223 Wrong-Mask images.
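The consolidation of the training data described above (merge the datasets, remove near-duplicates with SSIM, keep only images of 64 × 64 pixels or less) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it computes SSIM over the whole image as a single window, compares only images of identical dimensions, filters by size before deduplicating, and uses a hypothetical similarity threshold of 0.95. The function names and the `(width, height, pixels)` tuple layout are assumptions made for this example.

```python
def global_ssim(a, b, data_range=255.0):
    """SSIM computed over the whole image as a single window.

    a, b: equally sized grayscale images given as flat sequences of pixel values.
    """
    if len(a) != len(b) or not a:
        raise ValueError("images must be non-empty and the same size")
    n = len(a)
    c1 = (0.01 * data_range) ** 2  # standard SSIM stabilizing constants
    c2 = (0.03 * data_range) ** 2
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((p - mu_a) ** 2 for p in a) / n
    var_b = sum((q - mu_b) ** 2 for q in b) / n
    cov = sum((p - mu_a) * (q - mu_b) for p, q in zip(a, b)) / n
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )


def deduplicate(images, threshold=0.95, max_side=64):
    """Keep small images that are not near-duplicates of an already kept image.

    images: list of (width, height, pixels) tuples (layout assumed for this sketch).
    threshold: hypothetical SSIM value above which two images count as duplicates.
    """
    kept = []
    for width, height, pixels in images:
        # Size filter: discard anything larger than max_side in either dimension.
        if width > max_side or height > max_side:
            continue
        # Only images with identical dimensions are compared in this sketch.
        is_duplicate = any(
            (w, h) == (width, height) and global_ssim(p, pixels) >= threshold
            for w, h, p in kept
        )
        if not is_duplicate:
            kept.append((width, height, pixels))
    return kept


# Toy 4x4 images: an original, an exact copy, an inverted version, and a large image.
original = [10 * (i % 8) for i in range(16)]
copy_of_original = list(original)
inverted = [255 - v for v in original]
large = [0] * (128 * 128)

unique = deduplicate([
    (4, 4, original),
    (4, 4, copy_of_original),
    (4, 4, inverted),
    (128, 128, large),
])
print(len(unique), "images kept")
```

In practice a windowed SSIM from an image-processing library (for example scikit-image's `structural_similarity`) after resizing to a common size would be the more usual choice; the pairwise comparison here is quadratic in the number of images and is only meant to show the idea.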

Furthermore, when we analyzed SF-MASK with Counting Grid, we found that it contains almost no images taken from above, a viewpoint that is common under surveillance camera shooting conditions. Therefore, in this paper, we synthesize the missing images and add them to the dataset so that it works well under surveillance camera conditions. In the figure above, these images are labeled "Synthetic". The figure below shows a sample of the synthesized images. You can see that the diversity of mask types and races is also ensured.

These images were created with MakeHuman and Blender: over 12,000 synthetic human bodies of various ages, races, genders, and clothing were generated, and the resulting images were adjusted to the same image size as SF-MASK.

The usefulness of the synthesized data is also analyzed. We randomly sampled three groups: SF-MASK training data without synthetic data (no synth), synthetic data (synthetic), and SF-MASK test data (test). We then extracted features with ResNet-50 and applied kernel PCA. The figure below shows the visualization.

In (a), the synthetic data (green) covers regions that the non-synthetic data (blue) does not, compensating for the lack of diversity in the data. In (b), adding the synthetic data (green) increases the overlap with the test data (red), indicating that the training data covers the test distribution better.

How does it compare to traditional datasets?

We conduct two experiments. The first evaluates the usefulness of the traditional datasets and SF-MASK (w/ synth, w/o synth) as training data. Specifically, we train four models (ResNet-50, VGG19, MobileNetV2, EfficientNet) on each of seven datasets (MMD, FMD, Medical Mask, FMLD, Moxa3K, Ours (no synth.), Ours) and evaluate them on the test dataset acquired at the ICE Lab. Ours (no synth.) is the dataset without the synthesized images, and Ours is the dataset that includes them.

As mentioned above, the test dataset acquired at the ICE Lab consists of 1,077 images (Mask: 584, No-Mask: 270, Wrong-Mask: 223) taken from video sequences captured by several surveillance cameras at the ICE Lab, University of Verona, Italy. This matches the purpose of the paper, which is use with surveillance cameras.
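As a quick sanity check on the figures quoted in this article, the class balance of the filtered training images and of the ICE Lab test set can be computed as below. The counts are the ones reported above; the helper name `class_shares` is only an illustrative choice, and whether the training counts include the synthetic images is not stated here, so treat the comparison as indicative.

```python
# Image counts as quoted in this article.
train_counts = {"Mask": 9055, "No-Mask": 12620, "Wrong-Mask": 1701}  # after SSIM and size filtering
test_counts = {"Mask": 584, "No-Mask": 270, "Wrong-Mask": 223}       # SF-MASK Test Set (ICE Lab)


def class_shares(counts):
    """Return each class's fraction of the total number of images."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


for name, counts in (("training (filtered)", train_counts), ("test (ICE Lab)", test_counts)):
    shares = class_shares(counts)
    formatted = ", ".join(f"{label}: {share:.1%}" for label, share in shares.items())
    print(f"{name}: {sum(counts.values())} images ({formatted})")
```

The two splits are balanced differently: No-Mask is the largest class in the filtered training images, while Mask is the largest class in the test set, which is worth keeping in mind when reading plain accuracy figures.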

The results are shown in the table below. The dataset Ours, which includes synthetic data, shows higher accuracy than the conventional datasets for all models. For MobileNet and EfficientNet, the dataset Ours (no synth.), which does not contain synthetic data, also shows better accuracy than the conventional datasets.

The paper attributes the higher accuracy of Ours to the positive impact of supplementing images with angles of view unique to surveillance cameras, such as "images taken from above", which were not included in the previous datasets. A model trained only on RMFRD was not tested because RMFRD does not contain any No-Mask images and cannot be fairly compared with the other datasets.

In the second experiment, we use a leave-one-out method to investigate which dataset has the greatest impact on accuracy. Specifically, we train on SF-MASK with one of its component datasets left out and evaluate performance on the left-out dataset and on the SF-MASK Test Set. All the models are based on ResNet-50. The results are shown in the table below, where Dataset Left Out is the dataset left out.

From the table, we can see that the accuracy decreases the most when FMLD is left out. This is because FMLD is the largest dataset and thus has the largest impact. Note that RMFRD is also excluded from this experiment because it does not contain No-Mask images and cannot be fairly compared with other datasets.
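The leave-one-out protocol of the second experiment can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's code: `leave_one_out`, `train_majority`, and `accuracy` are hypothetical names, the "model" is a trivial majority-class predictor standing in for ResNet-50, and the datasets are made-up `(sample, label)` pairs.

```python
from collections import Counter


def leave_one_out(datasets, test_set, train_fn, accuracy_fn):
    """For each component dataset, train on all the others and score twice.

    datasets: dict mapping dataset name to a list of (sample, label) pairs.
    Returns {left_out_name: {"left_out_acc": ..., "test_acc": ...}}.
    """
    results = {}
    for left_out, held_out in datasets.items():
        # Training data is the union of every dataset except the one left out.
        train_samples = [
            pair
            for name, data in datasets.items()
            if name != left_out
            for pair in data
        ]
        model = train_fn(train_samples)
        results[left_out] = {
            "left_out_acc": accuracy_fn(model, held_out),
            "test_acc": accuracy_fn(model, test_set),
        }
    return results


def train_majority(samples):
    """Stand-in for real training: always predict the most frequent label."""
    majority = Counter(label for _, label in samples).most_common(1)[0][0]
    return lambda sample: majority


def accuracy(model, samples):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(model(x) == y for x, y in samples) / len(samples)


# Made-up toy data; the dataset names A, B, C are placeholders.
toy_datasets = {
    "A": [(0, "Mask"), (1, "Mask"), (2, "No-Mask")],
    "B": [(3, "No-Mask"), (4, "No-Mask")],
    "C": [(5, "Mask")],
}
toy_test = [(6, "Mask"), (7, "No-Mask")]

scores = leave_one_out(toy_datasets, toy_test, train_majority, accuracy)
for name, result in scores.items():
    print(name, result)
```

The dataset whose removal causes the largest drop in `test_acc` is the one contributing most to performance on the target domain, which is how the table above is read.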


COVID-19 has been spreading since 2019. Since vaccines were developed, however, going without a mask has become more common in Europe and the United States, and in Japan it was announced that in principle it is not necessary to wear masks outdoors (Ministry of Health, Labour and Welfare). Even so, the number of infected people still rises periodically, and since wearing masks is an easy and effective way to prevent infection, certain countries, regions, and facilities require masks indoors and in certain other environments. Some of them have also introduced systems that detect when a mask is not being worn, to prevent people from forgetting.
In this paper, we present a new dataset, SF-MASK, which is useful for building a model to detect the wearing of masks from surveillance camera images for such cases.
The dataset achieves higher generalizability by combining conventional datasets, covering various mask colors and shapes, analyzing what data is missing, and synthesizing it. Because the assumed use case is surveillance camera video and the conventional datasets contain few images taken from above, such images were synthesized and added to the dataset.
Although COVID-19 is the motivation for the research in this paper, new infectious diseases may spread in the future, and the development of such datasets and research will be meaningful for future society.
