Improved Accuracy And Transparency Of Face Recognition With ChatGPT, New Developments In Soft Biometrics

Large Language Models 08/04/2024

3 main points
✔️ Applicability of large-scale language models in face recognition: investigate the possibility of using ChatGPT and GPT-4 for face recognition. Evaluated performance under different conditions and compared with public benchmarks, and published code on GitHub.
✔️ Soft Biometric Attribute Estimation and Analysis: evaluated ChatGPT's ability to estimate attributes such as gender, age, and ethnicity. Investigated ways to enhance the explainability and transparency of AI through dialogue.
✔️ Applications and Evolution of Interactive AI: Provides insight into the future direction of AI technology and human-centered AI design from the use of ChatGPT in face recognition and soft biometric estimation.

How Good is ChatGPT at Face Biometrics? A First Look into Recognition, Soft Biometrics, and Explainability
written by Ivan DeAndres-Tame, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez, Javier Ortega-Garcia
(Submitted on 24 Jan 2024 (v1), last revised 27 Feb 2024 (this version, v2))
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)

code：

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

There is no doubt that ChatGPT is one of the most talked about keywords in today's society. Developed by OpenAI, this AI chatbot is capable of interacting with humans in a conversational manner. Since its launch in November 2022, ChatGPT has grown at a fast pace, setting a historic record of over 100 million monthly users in just two months of its launch. In fact, ChatGPT has already set a historical record with over 100 million monthly users in just two months. In fact, ChatGPT has already been successful in many practical applications.

Behind ChatGPT's unparalleled success, however, are the rapid advances in large-scale language models in recent years. These advances provide impressive capabilities in a wide range of fields, from medicine to education to coding, as well as evolving techniques to fine-tune models for better human interaction.

Starting with GPT-1, OpenAI's introduction of the transformer architecture has opened up new possibilities to transcend traditional techniques in handling long-term dependencies. During this evolution, GPT-3, with its 175 billion parameters, demonstrated that scaling up models contributes to task-independent performance gains, and the integration of GPT-3 models into ChatGPT has pushed the potential of this technology even further. However, exploration in this area is not limited to OpenAI; other leading companies such as Google and Meta AI have also introduced their own large-scale language models, including PaLM and LLaMA.

However, these models are primarily text-based, and chatbots like Google Bard have several limitations. In particular, there are limitations regarding the handling of facial images and the difficulty of conducting experiments using Python.

This paper examines the utility of ChatGPT in tasks related to facial biometrics, such as face recognition and estimation of soft biometric attributes. This is a very challenging area, with significant challenges due to factors such as pose, age, lighting, and facial expression. In addition, this study shares the details and results of experiments using ChatGPT, along with the scientific underpinnings behind the evolution of these techniques, contributing to the transparency and accountability of the technology.

The figure below outlines the research conducted in this paper, focusing on the ability of ChatGPT to perform tasks such as face recognition, soft biometric estimation, and result accountability.

ChatGPT setup in the experiment and its main features

OpenAI provides access to ChatGPT in two primary ways: one through an interactive chatbot interface and the other through an API. Both have similar functionality, but the API provides a simple interface that allows for easy execution of extensive experiments using Python. As such, this paper was conducted using the API, but also utilizes the chatbot interface in the early stages to quickly explore appropriate settings. A premium subscription is currently required to take advantage of the latest large-scale language model (GPT-4), which allows handling of images and other file formats, as well as access to OpenAI's other products. The maximum number of tokens is set at "1,000 tokens". The image detail level is also set to "High".

In addition, several configurations are being tested to optimize the use of ChatGPT to reduce cost and time while improving the performance of facial biometrics.

The first is the image composition; two options are considered. First, a configuration that merges the two face images to be compared into a single image (see figure below (left)), and then a configuration that consolidates them into a 4x3 matrix (see figure below (right)) are considered.

The second is the prompt structure. This is the most important aspect to analyze. First, we focus on the first configuration case of the image, i.e., comparing a pair of faces in the image, to design a prompt for the face recognition task. First, as shown in the figure below, we follow OpenAI's recommendation and create a detailed prompt asking the user to identify if the two face images are of the same person. However, since ChatGPT does not officially provide facial recognition capabilities, the answer is rejected as shown below (blue is the input prompt and black is the ChatGPT answer).

Therefore, this paper assumes that these responses may be implemented due to privacy concerns about real-life identities, and modifies the initial prompts to indicate that these are AI-generated persons, as shown next.

Using the modified prompt above, we have been able to obtain a positive response from ChatGPT. This indicates whether the facial images are from the same person and also provides accountability for the decision.

However, although "they appear to be different persons, judging by facial structure, hairstyle, and other visible features," it also states that "conclusions as to whether the two images depict the same person are speculative." This can be taken as preventing the use of the system as a face recognition task for the output results.

The paper then also attempts to reduce the amount of information provided as input and prevent the system from recognizing that it is performing a face recognition task. However, ChatGPT detected this and responded in the negative.

We also attempt to restrict the output of ChatGPT. In particular, we are trying to restrict the answers to be "yes" or "no" and to output the confidence level as well.

By using the modified prompt, ChatGPT is able to provide clear and concise answers to questions. This prompt was used in the face recognition experiment. Based on this configuration, we have created another prompt for the matrix strategy. This prompt also specifies the location of the comparisons in the matrix and how to reference each cell.

This paper also explores potential applications of ChatGPT to other facial biometric tasks. These include soft biometrics estimation and explainability of results. To achieve this objective, multiple prompts were considered. For the estimation of facial soft biometrics, we started with a general prompt to see how well ChatGPT could perform this task with the level of accuracy and variability of the attributes. The figure below shows the prompts considered and the results provided by ChatGPT for different face images.

In general, we find that ChatGPT is capable of providing a wide variety of soft biometrics with accurate results; to quantitatively evaluate ChatGPT's performance, we propose prompts that include facial attributes considered in the popular MAADFace database. This allows for a direct comparison with state-of-the-art approaches. Next, we provide a proposed prompt to evaluate ChatGPT's ability to estimate facial soft biometrics.

Finally, with respect to the accountability of the decisions made using ChatGPT, we consider the same prompts used for the face recognition task and add a final question that assesses why ChatGPT makes the decisions it does.

Experimental results

This paper compares three models, ArcFace, AdaFace, and ChatGPT, to measure the accuracy of face recognition technologies. In particular, the performance of ChatGPT is examined using two methods that evaluate images en bloc (4x3) and individually (1x1). For comparisons between these models, we use the Cosine distance to measure similarity and calculate the Equal Error Rate (EER). In the case of ChatGPT, the EER is obtained using the confidence level obtained directly from its output as a custom metric.

It is presented in two main groups covering different face recognition scenarios. One is application scenarios including controlled environments (LFW), surveillance scenarios (QUIS-CAMPI), and extreme conditions (TinyFaces). The other is a scenario highlighting common challenges of face recognition, including racial bias (BUPT), pose variation (CFP-FP), age differences (AgeDB), and shielding (ROF).

Accuracy for ChatGPT and the major face verification systems in the Face Verification task is shown in the table below. ChatGPT 4x3" refers to an image setup that includes 12 face comparisons within the same prompt, while "ChatGPT 1x1" represents the case of a single face comparison per prompt.

The table below also shows the equal error rate (%) achieved by ChatGPT and popular face recognition systems in the literature for the face verification (Face Verification) task.

In general, state-of-the-art models such as ArcFace (average accuracy 95.44%, EER 6.19%) and AdaFace (average accuracy 95.80%, EER 5.59%) show better overall performance. On the other hand, ChatGPT was developed for more general tasks and tends to perform poorly in face recognition tasks. In particular, when images are presented in matrix form, the average accuracy drops to 66.23% with an EER of 34.96%, and when compared individually, the average accuracy is 80.19% with an EER of 21.19%.

Performance analysis on various databases shows that ChatGPT's performance is highly dependent on image quality, pose variation, and domain differences between comparisons. For example, in the LFW database, ChatGPT achieves performance close to the state-of-the-art model (accuracy 93.50%, EER 8.60%) due to good image quality and consistent pose. However, in surveillance scenarios and under extremely low quality conditions, ChatGPT's performance is significantly degraded.

Similar performance declines are also seen in databases that address issues such as racial bias, pose, age, and shielding. This also reveals that ChatGPT exhibits significant bias across different demographic groups. For example, as can be seen in the table below, very different performance was observed for different ethnicity and gender in the BUPT database assessment, with an EER of 14.94% for the white female group versus an EER of 30.88% for the Indian female group.

These results show that while specialized face recognition models such as ArcFace and AdaFace have high accuracy, ChatGPT's performance varies widely depending on image quality and task complexity. In addition, the ChatGPT bias issue is an important consideration in the application of face recognition technology.

We also analyze how ChatGPT improves the explainability of the results in the face recognition task. The figure below shows the output provided by ChatGPT for the proposed prompts and some of the examples from different face recognition databases; ChatGPT's responses are divided into correct (left column) and incorrect (right column).

Both correct and incorrect answers demonstrate ChatGPT's ability to rationalize decisions based on image features. For example, in most cases, ChatGPT output scores for face recognition tasks are related to soft biometric attributes such as facial hair and skin color. It also indicates the ability to focus on more detailed attributes such as eye color, face shape, or nose shape, indicating proficiency in handling both coarse and fine details.

Noteworthy is the fact that even though ChatGPT takes facial expressions into account in its predictions, this is a variable attribute that should not be considered. Furthermore, the model recognizes temporal differences between images and incorporates this information into its predictions.

For wrong answers, we find that even if the predictions are wrong, some of the descriptions provided by ChatGPT accurately describe the person in the image.

In addition, it shows the results achieved for the soft biometrics estimation tasks in the LFW and MAAD-Face databases. The table below shows the Accuracy (%) achieved by ChatGPT for the soft biometric gender, age, and ethnicity estimation in the LFW database.

The table below shows the Accuracy (%) achieved by ChatGPT on the MAAD-Face database in estimating the 47 soft biometric attributes considered in the database.

The figure below also shows some examples of the output provided by ChatGPT in the proposed prompt.

Analysis of the results achieved in the LFW database shows that ChatGPT underperforms FairFace in gender classification (94.05% vs. 98.23%), but outperforms FairFace in age classification (72.87% vs. 67.88%) and ethnic classification (88.25% vs. 87.48%). These results demonstrate the potential of ChatGPT for specific facial attribute classifications.

For a more extensive evaluation, we consider the MAAD-Face dataset annotated with 47 different attributes. The custom model (ResNet-50) performs well on the majority of attributes (87.28% average accuracy). On the other hand, ChatGPT has a lower average performance (76.98% average accuracy) but excels on several face attributes.

Some of the most prominent soft biometric attributes where ChatGPT performs better are in gender classification (96.30% accuracy), some ethnicities (white - 83.90% accuracy, black - 97.50% accuracy), and accessories such as wearing a hat. While models trained for this specific task generally achieve better results, ChatGPT shows promising results and usefulness for tasks with no prior training.

Summary

In this paper, we thoroughly tested ChatGPT's performance in face biometrics tasks such as face recognition and feature estimation. Through experiments on a variety of databases, it was confirmed that ChatGPT exhibits a certain level of accuracy in these tasks compared to professionally trained models. In particular, its potential as an initial evaluation tool in zero-training conditions emerges. For example, it achieved results of about 94% for face recognition in the LFW database, 96% for gender estimation in the MAAD-Face database, and an impressive 73% and 88% for age and ethnicity estimation in LFW.

In addition, ChatGPT can provide a text output explaining the results, contributing to transparency and better understanding of the analysis. This study shows that ChatGPT is an effective tool that can be used immediately in facial biometrics tasks under certain conditions.

Future research will examine how ChatGPT as well as other popular chatbots perform in the area of facial biometrics. The evolution and potential applications of AI in this area are still expanding and will continue to attract attention.

The code is available on Github.

Categories related to this article

Takumu: I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.

Improved Accuracy And Transparency Of Face Recognition With ChatGPT, New Developments In Soft Biometrics

Summary

ChatGPT setup in the experiment and its main features

Experimental results

Summary

Libra] A New Multimodal Design Of Large Language Models Using Separate Vision Systems

Libra] A New Multimodal Design Of Large Language Models Using Separate Vision Systems

Construction And Analysis Of The "TruthEval" Dataset To Expose LLM Weaknesses

Construction And Analysis Of The "TruthEval" Dataset To Expose LLM Weaknesses

SportQA, A New Dataset That Measures The Comprehension Of Sports In Large Language Models

SportQA, A New Dataset That Measures The Comprehension Of Sports In Large Language Models

Proposal For A New Evaluation Method For AI Assistants Based On Human Preferences

Proposal For A New Evaluation Method For AI Assistants Based On Human Preferences

The Future Of Music Education, Flute X GPT And LAUI's Potential To Change Large-Scale Language Models

The Future Of Music Education, Flute X GPT And LAUI's Potential To Change Large-Scale Language Model ...

Prediction Of Handball Results For The 2024 Paris Olympics And Explanation Of The Basis For The Prediction Using LLM

Prediction Of Handball Results For The 2024 Paris Olympics And Explanation Of The Basis For The Pred ...