DFER-CLIP: Dynamic Facial Expression Recognition With An Innovative Visual Language Model

3 main points
✔️ Introducing DFER-CLIP: A new approach, "DFER-CLIP", is proposed for dynamic facial expression recognition (DFER).
✔️ Technical innovation: Using a CLIP-based image encoder and multiple Transformers, the temporal features of facial expressions and associated text are learned.

✔️ Outstanding results: DFER-CLIP outperforms existing DFER methods on three major benchmarks (DFEW, FERV39k, and MAFW).

Prompting Visual-Language Models for Dynamic Facial Expression Recognition
written by Zengqun Zhao, Ioannis Patras
(Submitted on 25 Aug 2023 (v1), last revised 14 Oct 2023 (this version, v2))
Comments: Accepted at BMVC 2023 (Camera ready)
Subjects: Computer Vision and Pattern Recognition (cs.CV)


The images used in this article are from the paper, the introductory slides, or were created based on them.


Facial expressions are an essential component of everyday human communication. Recognizing them has applications in fields as diverse as human-computer interaction, driver assistance systems, and mental health assessment. Against this background, facial expression recognition (FER) has attracted researchers from disciplines as diverse as psychology, computer science, linguistics, and neuroscience. The goal of FER is to classify human facial expressions into basic emotional categories such as happiness, sadness, surprise, fear, disgust, and anger. However, conventional recognition methods have difficulty capturing dynamic changes in facial expressions. Dynamic facial expression recognition (DFER) is therefore being investigated for more precise emotion understanding.

Early DFER research focused primarily on facial expressions in controlled environments, but in the real world, facial expressions are affected by a variety of factors, including lighting changes and occlusions. DFER research has therefore recently shifted to more realistic conditions, with an emphasis on learning robust feature representations that identify emotions accurately from facial movements.

Here, vision-language pre-training (V-LP) models have emerged as a technology that opens up new possibilities. These models acquire powerful visual representations by learning semantic relationships between images and text. Applying V-LP models to the DFER task may therefore improve the accuracy of facial expression recognition. However, challenges remain in capturing subtle differences between facial expressions and in learning dynamic facial features.

To address these challenges, this paper proposes a new approach called DFER-CLIP. This model integrates dynamic facial features with textual descriptions related to facial expressions to achieve more precise facial expression recognition. The figure below outlines the differences between the traditional method, CLIP, and DFER-CLIP.

Also shown below is an overview of the structure of DFER-CLIP. cos() denotes the cosine similarity, M denotes the number of learnable context tokens, and C denotes the facial expression classes.
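As a rough illustration of how cos() turns features into a prediction, the sketch below scores one video-level feature against one text feature per class and picks the most similar class. The function names and the three-dimensional toy embeddings are assumptions made for this example; in the actual model the features come from the CLIP-based encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict(video_feature, text_features):
    """Return (best class, per-class similarities) for one video feature.

    text_features maps each class name to its text embedding.
    """
    sims = {name: cosine(video_feature, emb) for name, emb in text_features.items()}
    best = max(sims, key=sims.get)
    return best, sims


# Toy embeddings, invented purely for illustration.
text_features = {
    "happiness": [1.0, 0.0, 0.0],
    "sadness": [0.0, 1.0, 0.0],
    "surprise": [0.0, 0.0, 1.0],
}
video_feature = [0.9, 0.1, 0.2]
label, sims = predict(video_feature, text_features)
```

Because the class is chosen by similarity to text features, no separate classification layer is needed in this sketch.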

Furthermore, experimental results show that DFER-CLIP outperforms existing DFER methods by using dynamic features and learnable text prompts. This is expected to improve the accuracy of facial expression recognition in more natural environments and to enhance mutual understanding between humans and computers.


DFER-CLIP takes a new approach to reading human facial expressions that uses both images and text. It consists of two main parts: a visual part and a textual part. The visual part builds on the CLIP image encoder and adds a temporal model made of multiple Transformer encoders to capture how facial features change over time. A learnable class token is then used to extract the video-level facial feature. On the text side, descriptions of facial behavior are used instead of generic class names. In addition, learnable prompts are introduced so that the model can learn the appropriate context information for each facial expression during training.
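To make the role of the class token more concrete, here is a heavily simplified sketch: a single scaled dot-product attention step in which the class token attends over itself and the per-frame features. This is an assumption-laden stand-in, not the paper's temporal model, which stacks several Transformer encoder layers with learned projections. The names `attend` and `video_feature` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]


def video_feature(frame_features, class_token):
    """Pool per-frame features into one video-level vector via a class token.

    Simplification: identity projections and one attention step stand in
    for the stacked Transformer encoders described in the article.
    """
    tokens = [class_token] + frame_features
    return attend(class_token, tokens, tokens)


# Two toy frame features and an all-zero class token (illustrative values).
pooled = video_feature([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

With an all-zero class token every attention score is zero, so the weights are uniform and the result is the mean of all tokens. During training the class token would be learned, which lets the model weight frames unevenly.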

In addition, human facial expressions share common features but also have unique characteristics. For example, expressions of happiness and surprise share the action of raising the eyebrows, while expressions of sadness and anger share the action of lowering the eyebrows and wrinkling the forehead. Given these similarities and differences, DFER-CLIP uses descriptions of facial actions as input to the text encoder. Specifically, a large language model is used to automatically generate contextual descriptions of facial expressions. This yields a comprehensive description of the detailed visual features of each facial expression class.

The language model is prompted with the following input:

  • Q: What visual features are useful for {class name}'s facial expressions?
  • A: Useful visual features for {class name} facial expressions include:...

The generated descriptors for each facial expression class are combined to form a comprehensive description.
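The prompt template and the combining step can be sketched as follows. The template strings follow the Q/A text quoted above; the helper names, the comma-based joining rule, and the sample descriptors are assumptions for illustration, and no call to a language model is made here.

```python
QUESTION = "Q: What visual features are useful for {class_name}'s facial expressions?"
ANSWER_PREFIX = "A: Useful visual features for {class_name} facial expressions include:"


def build_llm_prompt(class_name):
    """Fill the Q/A template for one facial expression class."""
    return (
        QUESTION.format(class_name=class_name)
        + "\n"
        + ANSWER_PREFIX.format(class_name=class_name)
    )


def combine_descriptors(descriptors):
    """Join individual descriptors into a single description string.

    Assumption: descriptors are joined with commas; the article does not
    specify the exact joining rule.
    """
    cleaned = [d.strip().rstrip(".") for d in descriptors if d.strip()]
    return ", ".join(cleaned) + "."


prompt = build_llm_prompt("surprise")
# Hypothetical descriptors, not taken from the paper.
description = combine_descriptors(["raised eyebrows", "widened eyes", "open mouth"])
```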


The study uses three primary datasets to evaluate the accuracy of facial expression recognition. These datasets are rich in emotional expressions from a variety of real-world situations, allowing for extensive validation of the effectiveness of DFER-CLIP.

The DFEW dataset contains 11,697 video clips collected from over 1,500 films worldwide. These are classified into seven basic facial expressions (happiness, sadness, neutral, anger, surprise, disgust, and fear) by 10 annotators under the guidance of experts. The videos include many challenging conditions such as extreme lighting, occlusion, and varied head poses. The dataset is divided into five equal-sized parts and evaluated with 5-fold cross-validation.
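For readers unfamiliar with the protocol, the sketch below shows the general shape of 5-fold cross-validation: each part serves once as the test set while the other four are used for training. The contiguous index-based folds are an assumption made for illustration; the benchmark's actual fold assignment may differ.

```python
def k_fold_splits(n_items, k=5):
    """Yield (train_indices, test_indices) for each of k folds.

    Folds are contiguous index ranges whose sizes differ by at most one.
    """
    indices = list(range(n_items))
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 11,697 is the DFEW clip count quoted in the article.
splits = list(k_fold_splits(11697, k=5))
```

The reported score is then the average of the five per-fold test scores.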

The FERV39k dataset contains 38,935 video clips and is currently the largest in-the-wild DFER dataset. The clips are collected from four scenarios, which are further divided into 22 fine-grained scenes such as crime, daily life, speech, and war, and are annotated with basic facial expressions by 30 annotators. The videos are randomly shuffled and divided into a training set (80%) and a test set (20%).

The MAFW dataset, which contains 10,045 video clips, is the first large-scale multimodal, multi-label emotion database, with 11 single-expression and 32 multiple-expression categories as well as emotional descriptive text. This dataset is also evaluated using 5-fold cross-validation.

These datasets make it possible to examine how the study addresses the challenges of emotion recognition under complex real-world conditions.

Experimental results

An ablation analysis was conducted for DFER-CLIP using the three benchmark datasets described above. This analysis is intended to reveal how each component of the model affects overall performance.

Learning temporal features of faces is important for video-based facial expression recognition. The analysis shows that introducing the temporal model significantly improved performance on each of the DFEW, FERV39k, and MAFW datasets. The results are shown in the table below.

However, increasing the model depth and the number of learnable context tokens does not necessarily improve results and increases the risk of overfitting. These results indicate that a properly balanced configuration is critical to achieving optimal performance.

The DFER-CLIP model also employs a text-based (classifier-free) training strategy as opposed to the traditional classifier-based approach. The analysis results show that the proposed method performs better on all datasets compared to Linear Probe and Fully Fine-tuning methods. The results are shown in the table below.

In particular, even without the temporal model, the proposed approach outperforms classifier-based methods, and it achieves remarkable results even in a zero-shot setting.

V-LP models can use prompts to build classifier-free prediction models, which makes prompt engineering very important. Compared to manually designed prompts such as "a photo of [Class]" and "an expression of [Class]", the proposed method performed better on the DFEW and FERV39k datasets and obtained slightly inferior but competitive results on the MAFW dataset. The results are shown in the table below. Overall, this suggests that a learning-based context yields results that are better than or comparable to those of hand-written prompts.

In addition, DFER-CLIP places the descriptor at the end of the prompt and uses a class-specific, learnable context. Different placement and context-sharing strategies were tested, and placing the descriptor at the end with class-specific context yielded the best results. The results are shown in the table below.
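The placement and sharing options can be pictured with a small sketch that lays out prompts as token lists. The placeholder strings such as `<ctx:happiness:0>` stand in for learnable context vectors, and the function name, parameters, default context length, and sample descriptors are all assumptions for illustration.

```python
def build_prompt_tokens(class_names, descriptors, num_context=8,
                        class_specific=True, descriptor_at_end=True):
    """Lay out one prompt per class as a list of token strings.

    Context placeholders represent learnable vectors. With class_specific
    set, every class owns its own context; otherwise one context is shared.
    """
    prompts = {}
    for name in class_names:
        owner = name if class_specific else "shared"
        context = [f"<ctx:{owner}:{i}>" for i in range(num_context)]
        desc = descriptors[name].split()
        prompts[name] = context + desc if descriptor_at_end else desc + context
    return prompts


# Hypothetical descriptors, not taken from the paper.
descs = {"happiness": "smiling mouth", "sadness": "lowered eyebrows"}
best_layout = build_prompt_tokens(["happiness", "sadness"], descs, num_context=2)
shared_front = build_prompt_tokens(["happiness", "sadness"], descs, num_context=1,
                                   class_specific=False, descriptor_at_end=False)
```

Here `best_layout` corresponds to the configuration the article reports as strongest (descriptor last, class-specific context), while `shared_front` shows one of the alternative layouts.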

This analysis provides important insights for maximizing the accuracy and efficiency of the DFER-CLIP model. Emotion recognition from video plays an important role in a wide variety of applications ranging from day-to-day communication to security, and this research aims to further advance the technology in this area.

In addition, the performance of DFER-CLIP was compared with state-of-the-art methods on three major benchmarks: DFEW, FERV39k, and MAFW. Each benchmark poses different challenges and is an important measure of the accuracy and versatility of facial expression recognition techniques.

Experiments on DFEW and MAFW were conducted using 5-fold cross-validation as in previous studies; on FERV39k, the training set and test set were used. To increase the reliability and reproducibility of the results, the model was trained three times with different random seeds and the average over the three runs was used as the final result.
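Averaging over runs with different random seeds can be sketched as below. The `fake_evaluate` function is a hypothetical stand-in for a complete train-and-test run, and the scores it returns are arbitrary placeholders rather than results from the paper.

```python
import random
from statistics import mean


def average_over_seeds(evaluate, seeds=(0, 1, 2)):
    """Run evaluate(seed) once per seed and return the mean score."""
    return mean(evaluate(seed) for seed in seeds)


def fake_evaluate(seed):
    """Hypothetical stand-in for training and testing with a given seed."""
    rng = random.Random(seed)
    return 70.0 + rng.random()  # placeholder score, not a real result


final_score = average_over_seeds(fake_evaluate)
```

Reporting the mean rather than the best run reduces the chance that a single lucky seed drives the headline number.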

The results are shown in Table 5 below.

DFER-CLIP outperformed existing methods in both UAR (unweighted average recall) and WAR (weighted average recall). Specifically, on DFEW it improved UAR by 2.05% and WAR by 0.41%; on FERV39k, UAR by 0.04% and WAR by 0.31%; and on MAFW, UAR by 4.09% and WAR by 4.37%. These results are especially noteworthy given that FERV39k is currently the largest DFER benchmark, with 38,935 video clips. Achieving significant improvements on large datasets is a very difficult task.
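The two metrics can be computed as in the sketch below, assuming their standard definitions: UAR is the mean of the per-class recalls, so every class counts equally, while WAR weights each class by its number of samples and therefore equals overall accuracy. The labels in the example are toy values chosen to show how the two can diverge on imbalanced data.

```python
from collections import Counter


def war(y_true, y_pred):
    """Weighted average recall: the overall fraction of correct predictions."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def uar(y_true, y_pred):
    """Unweighted average recall: the mean of the per-class recalls."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy imbalanced example: the rare class "fear" is always missed.
y_true = ["happy", "happy", "happy", "happy", "fear"]
y_pred = ["happy", "happy", "happy", "happy", "happy"]
war_score = war(y_true, y_pred)  # 4 of 5 correct
uar_score = uar(y_true, y_pred)  # recall 1.0 for "happy", 0.0 for "fear"
```

In this example WAR is 0.8 while UAR is only 0.5, which is why both are usually reported for datasets with rare expression classes.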

This comparative analysis confirms that DFER-CLIP sets a new state of the art on these facial expression recognition benchmarks. The improved performance, especially on large datasets, suggests promising progress for future research.


This paper proposes a new visual-language model, DFER-CLIP, for in-the-wild dynamic facial expression recognition.

In the visual part, a temporal model consisting of multiple Transformer encoders is added on top of the CLIP image encoder to model facial expression features over time. In the text part, facial expression descriptors related to facial behaviors are employed, and these descriptors are generated by large language models such as ChatGPT. A learnable context for these descriptors is also designed to help the model learn the relevant context information for each facial expression during training.

Extensive experiments have demonstrated the effectiveness of each component of DFER-CLIP. Furthermore, the proposed method achieves state-of-the-art results in three benchmarks.
