
Deciphering The Intent Of Face Recognition Systems: New Algorithm "S-RISE" And Its Evaluation Index


3 main points
✔️ Propose a new definition for explainable face recognition (XFR) based on saliency maps
✔️ Propose a new method to quantitatively evaluate XFR based on saliency maps
✔️ Propose a map generation algorithm called "S-RISE" based on image-pair similarity, and demonstrate that it can visually explain how faces are identified

Explanation of Face Recognition via Saliency Maps
written by Yuhang Lu, Touradj Ebrahimi
(Submitted on 12 Apr 2023)
Comments: Published on arXiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)


The images used in this article are from the paper, the introductory slides, or were created based on them.


Over the past decade, breakthroughs in deep learning have led to dramatic advances in image recognition tasks such as image classification, object detection, and face recognition. Face recognition in particular has improved remarkably and is now applied in many fields, such as immigration control and security cameras, attracting worldwide attention. You may have recently used it yourself for online identity verification (eKYC). However, such biometrics can put privacy and data protection rights at risk, causing great social concern. Another concern is that deep learning-based systems lack interpretability because of the "black box" nature of their output process. Given these concerns, understanding and being able to explain the decisions of face recognition systems is essential for their social acceptance.

Various techniques related to Explainable Artificial Intelligence (XAI) have been proposed in an attempt to solve the "black box" of deep learning. In particular, various saliency map algorithms have been introduced for image recognition-related tasks to highlight internal CNN layers and important pixels that are relevant to model decisions. However, many algorithms, while showing excellent utility in classification tasks, are not directly applicable to other image recognition tasks with different internal model structures and output formats. Face recognition-related tasks are one of them. Face recognition requires more than just generating saliency maps; it also requires interpretation and explanation of how the face recognition model identifies pairs of face images and why it determines that certain pairs of images are more likely to be the same person than others.

This paper proposes a new definition of Explainable Face Recognition (XFR) that is also applicable to face recognition. It also proposes an algorithm called "S-RISE" that follows this new definition and uses the similarity of image pairs to generate saliency maps.

Proposed method "S-RISE"

In this paper, we propose a new definition for building saliency map-based explainable face recognition (XFR). As mentioned earlier, face recognition predicts whether a pair of face images are of the same person. Therefore, an explainable face recognition system (XFR) must be able to visually interpret why a model "considers" a person to be the same or not the same.

Previous work has approached interpretability from a similar perspective, using a triplet of images, namely a Probe (the image to be matched), a Mate (an image of the same person as the Probe), and a Non-mate (an image of a different person), and focusing on the relative importance of specific facial regions. In that work, Explainable Face Recognition (XFR) is defined as a method that maximizes the similarity between Probe and Mate in a given region while minimizing the similarity between Probe and Non-mate in the same region. However, the region in which Probe and Mate are most similar is not necessarily the region in which Probe and Non-mate are least similar. In fact, the decision for each image pair within the triplet is made independently: a face recognition system makes its decision by comparing the similarity score of two images against a predefined threshold, not by comparing all three images at once. In other words, this definition does not make the decision making of face recognition explainable.

Therefore, this paper proposes a more rigorous definition that builds on the idea of the image triplet but clearly distinguishes between matching and non-matching pairs: given a [Probe, Mate, Non-mate] triplet as input, the face recognition system should generate one saliency map for the [Probe, Mate] pair and another for the [Probe, Non-mate] pair, and then answer the following questions.

  • In which regions does the face recognition system consider the [Probe, Mate] image pair most similar?
  • In which regions does the face recognition system consider the [Probe, Non-mate] image pair most similar?
  • Why did the face recognition system determine that the [Probe, Mate] pair was a better match than the [Probe, Non-mate] pair?

Traditional saliency maps, while useful, cannot be applied directly to face recognition tasks. For example, Randomized Input Sampling for Explanation (RISE) explains a classification model by using the classifier's output class probabilities as weights and aggregating randomly masked inputs into a final saliency map. However, the decision-making process of a face recognition system involves extracting facial features and comparing the similarity of two or more images, not predicting class probabilities.

To address this problem, this paper proposes the Similarity-based RISE algorithm (S-RISE), which uses similarity scores as mask weights and provides saliency maps without accessing the internal architecture or gradients of the face recognition system (see figure below).

Given a pair of images {𝑖𝑚𝑔𝐴, 𝑖𝑚𝑔𝐵}, the mask generator randomly generates a fixed number of masks. Each mask is applied to one input image (e.g., 𝑖𝑚𝑔𝐴), and the masked 𝑖𝑚𝑔𝐴 and the unmasked 𝑖𝑚𝑔𝐵 are fed into the face recognition model to extract their respective facial features. The cosine similarity between the two feature vectors is then computed and used as the weight for the corresponding mask. After repeating the same process for all masks, the final saliency map for 𝑖𝑚𝑔𝐴 is obtained as the weighted combination of the generated masks.
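The loop above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the `model` interface (a callable returning an L2-normalised feature vector), the mask grid size, and the keep probability `p` are all assumptions.

```python
import numpy as np

def s_rise(img_a, img_b, model, n_masks=1000, grid=7, p=0.5, rng=None):
    """Sketch of an S-RISE-style saliency map for img_a against img_b.

    `model(img)` is assumed to return an L2-normalised feature vector,
    so the dot product of two feature vectors is their cosine similarity.
    """
    rng = rng or np.random.default_rng(0)
    h, w = img_a.shape[:2]
    saliency = np.zeros((h, w))
    feat_b = model(img_b)  # features of the unmasked reference image
    for _ in range(n_masks):
        # Random coarse binary mask, upsampled to the image resolution
        coarse = (rng.random((grid, grid)) < p).astype(float)
        mask = np.kron(coarse, np.ones((h // grid + 1, w // grid + 1)))[:h, :w]
        # Features of the masked probe image
        feat_a = model(img_a * mask[..., None])
        # Cosine similarity to the reference acts as the mask's weight
        saliency += float(feat_a @ feat_b) * mask
    return saliency / n_masks
```

In the actual method the mask generation follows RISE (low-resolution masks, bilinearly upsampled with random shifts); the nearest-neighbour upsampling here just keeps the sketch short.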

In addition, the accuracy of the saliency map itself should be evaluated. In image classification and image retrieval tasks, some methods "insert" or "remove" salient pixels from the input image and measure the change in the output classification probability. In this paper, we adapt these methods to the face recognition framework and evaluate whether the saliency map highlights the regions the model considers most important using the fewest possible pixels.

The pixel "insertion" and "deletion" methods add or remove pixels, respectively, and measure how quickly the similarity between two face images crosses a threshold. More specifically, the deletion process starts with the original image: pixels are deleted in descending order of saliency value and replaced by a constant value. After each deletion, the similarity score is recalculated, and the process continues until the score falls below a predefined threshold. Conversely, the insertion process starts with a plain image of constant value: the most important pixels according to the saliency map are added back one by one, the similarity score is recalculated after each insertion, and the process continues until the score rises above the threshold. The number of pixels removed from or added to the image is accumulated until the recognition model changes its decision. Performance is evaluated using the following metrics:

  • #Removed pixels / #All pixels
  • #Added pixels / #All pixels

In practice, removing pixels from an image may change the original distribution and ultimately affect the recognition results. Therefore, the constant value mentioned above is set to the mean of the particular image.
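The deletion metric described above can be sketched as follows. As before, `model` is assumed to return an L2-normalised feature vector; the function name, the coarse scanning `step`, and the default mean fill are illustrative choices, not details from the paper.

```python
import numpy as np

def deletion_score(img_a, img_b, saliency, model, threshold, fill=None, step=50):
    """Sketch of the pixel-deletion metric: remove the most salient pixels
    of img_a (replacing them with the image mean) until its similarity to
    img_b drops below `threshold`; return #removed pixels / #all pixels."""
    h, w = saliency.shape
    # Per the paper, deleted pixels are replaced by the image mean
    fill = img_a.mean(axis=(0, 1)) if fill is None else fill
    order = np.argsort(saliency.ravel())[::-1]  # most salient first
    feat_b = model(img_b)
    for n in range(0, order.size + 1, step):
        ys, xs = np.unravel_index(order[:n], (h, w))
        work = img_a.copy()
        work[ys, xs] = fill
        if float(model(work) @ feat_b) < threshold:
            return n / order.size  # fraction of pixels removed
    return 1.0  # similarity never dropped below the threshold
```

The insertion metric is the mirror image: start from an all-`fill` image, add pixels in the same order, and stop once the similarity rises above the threshold.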

Experimental results

In recent years, saliency-map methods have come under scrutiny. It has been pointed out that the generated maps may actually be unrelated to the model's decision-making process or the data-generating mechanism, so it is questionable whether they provide a reliable explanation. A method called "model parameter randomization testing" has therefore been proposed: the weights of a trained deep learning model are randomized before the saliency map is generated. This makes it possible to evaluate whether the saliency-map method really provides an explanation grounded in the model's decision mechanism. This paper uses a similar approach to evaluate the effectiveness of its saliency maps. Specifically, the authors tested S-RISE with parameters from an unrelated network (ResNet) optimized for other visual tasks. If a meaningful heat map is generated from such random or irrelevant parameters, then the saliency map is unrelated to the model's decision-making process and data-generating mechanism and cannot be trusted.
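As a minimal sketch, the randomization test amounts to replacing every weight tensor with noise and checking that the saliency map produced by the randomized model no longer resembles the original one. The dict-of-arrays parameter format and the correlation check are assumptions for illustration only.

```python
import numpy as np

def randomize_params(params, rng=None):
    """Replace every weight tensor with Gaussian noise of the same shape."""
    rng = rng or np.random.default_rng(0)
    return {name: rng.normal(size=w.shape) for name, w in params.items()}

def map_correlation(map_a, map_b):
    """Pearson correlation between two saliency maps. For a method that
    truly depends on the learned weights, the map from the randomized
    model should correlate weakly with the original map."""
    return float(np.corrcoef(map_a.ravel(), map_b.ravel())[0, 1])
```

A high correlation after randomization would indicate that the saliency method ignores the model's learned parameters and therefore cannot be trusted as an explanation.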

The figure below shows the results of this test on saliency maps generated by the S-RISE algorithm. The second row shows the saliency maps generated for a CNN model with randomized parameters, while the third row shows those generated for a regular face recognition system.

As can be seen from the second row of the figure, random parameters produce a meaningless saliency map, indicating that the proposed S-RISE algorithm generates meaningful interpretations based on the learned face recognition model.

The figure below shows examples of saliency maps generated by the S-RISE algorithm. The two left columns show saliency maps for image pairs that the face recognition model correctly predicts to be the same person with high confidence, while the two right columns show saliency maps for non-matching image pairs.

As this figure shows, for image pairs that the face recognition model judges to be similar, the similar regions are appropriately emphasized. For image pairs that the model judges to be dissimilar, similar regions still appear, but the emphasis is weak. This result explains why the face recognition model determined that one pair shows the same person while the other does not.

Furthermore, the figure below shows the results of a study of cases in which the face recognition model misidentifies different persons as the same person.

While the face recognition system recognizes [Probe, Mate] pairs with high confidence, it also assigns relatively high confidence to facial regions such as the eyes and mouth for [Probe, Non-mate] pairs. In other words, this explains why the face recognition model incorrectly judges non-matching persons as matches.

It has also been shown that current face recognition models can identify partially occluded faces, albeit with lower confidence. In this case, an ideal saliency map should show low saliency values for the occluded pixels and high values for the other similar regions. In the previous figure, we can also see that when part of the face is masked by sunglasses, the decision focuses on the mouth and nose regions rather than the eyes.

Finally, the results of the quantitative evaluation of S-RISE are presented (see table below). Results are reported for the pixel "insertion" and "deletion" metrics. Experiments were conducted on a small subset of the LFW dataset. These metrics measure the percentage of pixels that must be modified to change the judgment of the face recognition model; the smaller the percentage, the more accurate the saliency-map explanation is judged to be. The table below evaluates S-RISE at different numbers of iterations, showing that fewer iterations result in poorer explanatory performance.

On the other hand, these indicators gradually converge at approximately 1,000 iterations, indicating that the saliency map is stable and accurate (see figure below).



This paper proposes a new framework for explainable face recognition (XFR). The proposed S-RISE algorithm creates saliency maps that detail how a face recognition system decides whether two images show the same person, and the paper also proposes new criteria for evaluating how accurate these maps are. It is hoped that this work will establish a standard method for assessing the reliability of deep learning models used in face recognition, lead to a better understanding of face recognition systems, and consequently enable their more secure use.
