

AVI-Talking" Generates Natural 3D Talking Faces From Audio


3 main points
✔️ Proposes AVI-Talking, a new system that generates expressive talking faces from audio using an intermediate visual guide
✔️ Leverages a large-scale language model to capture the speaker's speech state, naturally synchronizing lip movements with facial expressions and conveying fine nuances of emotion
✔️ Effectively bridges the gap between audio and visuals, simplifying the generation process

AVI-Talking: Learning Audio-Visual Instructions for Expressive 3D Talking Face Generation
written by Yasheng Sun, Wenqing Chu, Hang Zhou, Kaisiyuan Wang, Hideki Koike
(Submitted on 25 Feb 2024)
Comments: Published on arxiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV)

code: 

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

Realistic 3D animation of the human face is essential in the entertainment industry, whether for digital human animation, visual dubbing for movies, or creating virtual avatars. Previous research has attempted to model the correlation between dynamic head pose and audio rhythm, or to use emotion labels or video clips as style references, but these methods have limited expressive power and fail to capture fine nuances of emotion. They also require the user to manually select the style source, which tends to produce unnatural results.

In this paper, we propose a more natural approach: leveraging the style information carried in human speech to generate expressive talking faces that directly reflect the emotion and speaking style of the person talking. Synthesizing diverse and realistic facial movements from audio while maintaining accurate lip-sync is a complex and challenging task. To address this problem, we develop a new system called AVI-Talking, which enables expressive talking face generation through an audio-visual instruction system.

AVI-Talking effectively bridges the audio-visual gap by using intermediate visual instruction representations rather than learning directly from audio. Specifically, the framework divides the generation process into two stages, each with clear objectives, which greatly reduces optimization complexity. Furthermore, by presenting visual instructions as intermediate outputs, it improves the interpretability of the model and gives users the flexibility to issue instructions and make modifications according to their own intentions.

This technology is expected to open up new horizons in entertainment technology.

AVI-Talking Overview

AVI-Talking aims to generate 3D animated faces with synchronized lip movements and consistent facial expressions from speech clips. Rather than synthesizing talking faces directly from speech, AVI-Talking utilizes a large-scale language model to effectively guide the generation process.

The figure below shows an overview of the AVI-Talking pipeline. The system consists of two main stages: the first, "Audio-Visual Instruction via LLMs," derives the necessary guidance from the input speech and bridges to the next stage; the second, the "Talking Face Instruction System," synthesizes 3D facial movements in real time based on that guide. The goal is to generate a time series of 3D parametric coefficients from the input speech.
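To make the two-stage split concrete, the sketch below shows, in PyTorch, what such an interface might look like: a stage-1 module that condenses speech features into an instruction embedding, and a stage-2 module that consumes the speech features together with that embedding and outputs a time series of 3D parametric coefficients. All class names, layer choices, and dimensions here are illustrative assumptions, not the paper's actual architecture (in AVI-Talking, stage 1 is driven by a large-scale language model).

```python
import torch
import torch.nn as nn

class AudioVisualInstructor(nn.Module):
    """Stage 1 (illustrative): condense speech features into an instruction embedding.
    In the actual system this role is played by an LLM-based module."""
    def __init__(self, audio_dim=768, instr_dim=512):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(audio_dim, instr_dim), nn.GELU(),
                                  nn.Linear(instr_dim, instr_dim))

    def forward(self, audio_feats):             # (B, T, audio_dim)
        return self.proj(audio_feats.mean(1))   # (B, instr_dim), pooled over time

class TalkingFaceGenerator(nn.Module):
    """Stage 2 (illustrative): map speech + instruction to 3D parametric coefficients."""
    def __init__(self, audio_dim=768, instr_dim=512, coeff_dim=56):
        super().__init__()
        self.fuse = nn.GRU(audio_dim + instr_dim, 256, batch_first=True)
        self.head = nn.Linear(256, coeff_dim)

    def forward(self, audio_feats, instruction):
        T = audio_feats.size(1)
        instr = instruction.unsqueeze(1).expand(-1, T, -1)   # broadcast over time
        h, _ = self.fuse(torch.cat([audio_feats, instr], dim=-1))
        return self.head(h)                                  # (B, T, coeff_dim)

# Toy usage: speech features in (e.g., from a pretrained speech encoder), coefficients out.
audio = torch.randn(2, 100, 768)               # 2 clips, 100 frames each
instruction = AudioVisualInstructor()(audio)
coeffs = TalkingFaceGenerator()(audio, instruction)
print(coeffs.shape)                            # torch.Size([2, 100, 56])
```

The point of the split is visible in the interface: the instruction embedding is an explicit intermediate product that can be inspected or edited before the second stage consumes it.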

This approach realistically reproduces the speaker's natural facial expressions and mouth movements, providing a more realistic visual experience for the viewer.

Experiments and Results

The quality of the generated instructions and talking faces is evaluated quantitatively. The evaluation is divided into two categories: the first is "Audio-Visual Instruction Prediction" and the second is "3D Talking Face Synthesis." Facial fidelity is evaluated with the GAN metrics FID and KID, and the diversity of facial expressions for a given speech clip is measured with a diversity score, quantified as the distance between style features generated under different noise conditions. Lip-synchronization accuracy is measured with LSE-D.
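For context, FID fits a Gaussian to the real and generated feature sets and measures the Fréchet distance between them, and a diversity score of this kind can be taken as the mean pairwise distance between style features sampled under different noise. The NumPy/SciPy sketch below is a generic illustration with placeholder features, not the paper's evaluation code (KID and LSE-D are omitted, and the choice of feature extractor is left open).

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    """FID-style Frechet distance between two sets of feature vectors, each (N, D)."""
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    covmean = covmean.real                      # drop tiny imaginary parts from sqrtm
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2 * covmean))

def diversity_score(style_feats):
    """Mean pairwise L2 distance among style features sampled with different noise."""
    n = len(style_feats)
    dists = [np.linalg.norm(style_feats[i] - style_feats[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))

# Placeholder features; in practice they would come from a pretrained face-feature network.
real = np.random.randn(500, 64)
fake = np.random.randn(500, 64) + 0.1
print(frechet_distance(real, fake))
print(diversity_score(np.random.randn(8, 64)))
```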

The results for "3D Talking Face Synthesis" on the MeadText and RAVDESS datasets are shown in the table below. AVI-Talking performs remarkably well on many evaluation metrics. However, it slightly underperforms the other methods in lip-sync accuracy, mainly because SyncNet was pre-trained on expressionless videos, which biases the metric toward neutral facial expressions.

AVI-Talking emphasizes richness of facial expressions, which is one factor affecting this score. Nevertheless, it achieves an LSE-D score close to that of the reference video, indicating that it can still generate precisely lip-synced videos.

The paper also provides a qualitative evaluation, since subjective assessment is essential to validate the performance of models on generative tasks. The figure below shows a comparison between AVI-Talking and conventional techniques in three different cases. The results show that AVI-Talking generates reliable Audio-Visual Instructions and expressive facial details that reflect the speaker's state.


Regarding lip-synchronization performance, other methods such as CodeTalker and FaceFormer can produce more natural articulation when no facial expression is involved. In scenarios involving emotion, however, slight distortions in their lip movements can be observed. This observation is consistent with the LSE-D scores in the quantitative results of the table above.

In addition, the paper includes a user study with 15 participants, who rated a total of 30 videos generated by AVI-Talking and three competing methods. These videos were generated from 20 randomly selected speech clips from the MeadText test set and 10 clips from RAVDESS.

The MOS (Mean Opinion Score), widely used in the industry, is used for the evaluation. Participants rate each video on a scale of 1 to 5 along three dimensions (a simple aggregation of such ratings is sketched after the list below):

  • Lip-sync quality: evaluates mouth movements synchronized with the spoken content
  • Expressiveness of movement: evaluates the richness of facial detail
  • Consistency of facial expressions: evaluates whether facial movements match the speaker's expressions
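MOS here is simply the mean of the 1-to-5 ratings collected per criterion for each method. The snippet below sketches that aggregation with made-up ratings, purely to make the metric concrete; the numbers are not from the study.

```python
import statistics

# Hypothetical ratings for one method: criterion -> list of 1-5 scores from participants.
ratings = {
    "lip_sync_quality":       [4, 5, 4, 3, 4],
    "expressiveness":         [5, 4, 4, 5, 4],
    "expression_consistency": [4, 4, 5, 4, 5],
}

mos = {criterion: statistics.mean(scores) for criterion, scores in ratings.items()}
for criterion, score in mos.items():
    print(f"{criterion}: MOS = {score:.2f}")
```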

 

The results are shown in the table below. MeshTalk scores the lowest in all aspects due to its simple UNet architecture, while EmoTalk and CodeTalker, both of which use transformer blocks, achieve higher scores for lip-sync quality.

In terms of expressiveness of movement and consistency of facial expressions, AVI-Talking significantly outperforms the other methods. Overall, AVI-Talking surpasses the other models in expressive synthesis, clearly demonstrating the effectiveness of the approach.

Summary

This paper proposes AVI-Talking, a novel system for generating expressive 3D talking faces from speech. The system decomposes the direct speech-to-face generation task into two distinct learning stages and uses an intermediate visual guide to facilitate speech-driven talking face generation. It also introduces a novel soft-prompting strategy that exploits the contextual knowledge of large-scale language models to capture the speaker's speech state. Furthermore, we construct a pre-training procedure to integrate lip-sync with the Audio-Visual Instructions, and finally leverage a diffusion prior network to effectively map the Audio-Visual Instructions to a latent space for high-quality generation.
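The soft-prompting idea mentioned above can be pictured as prepending a small set of learnable embedding vectors, together with projected speech features, to the input of a frozen language model, so that only the prompt and the projection layer are trained. The PyTorch sketch below illustrates this generic pattern; the DummyLLM stand-in, the dimensions, and the inputs_embeds-style interface are assumptions for illustration, not the paper's actual model.

```python
import torch
import torch.nn as nn

class SoftPromptedLLM(nn.Module):
    """Generic soft-prompting pattern: learnable prompt tokens plus projected audio
    tokens are fed to a frozen language model; only the prompt and projector train."""
    def __init__(self, llm, hidden_dim, audio_dim, n_prompt_tokens=8):
        super().__init__()
        self.llm = llm.eval()
        for p in self.llm.parameters():           # keep the language model frozen
            p.requires_grad_(False)
        self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, hidden_dim) * 0.02)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)

    def forward(self, audio_feats):               # (B, T, audio_dim)
        B = audio_feats.size(0)
        prompt = self.prompt.unsqueeze(0).expand(B, -1, -1)
        tokens = torch.cat([prompt, self.audio_proj(audio_feats)], dim=1)
        return self.llm(inputs_embeds=tokens)     # assumes an embeddings-in interface

# Minimal stand-in for a language model that accepts `inputs_embeds`.
class DummyLLM(nn.Module):
    def __init__(self, hidden_dim=256):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(hidden_dim, nhead=4, batch_first=True)
    def forward(self, inputs_embeds):
        return self.block(inputs_embeds)

model = SoftPromptedLLM(DummyLLM(), hidden_dim=256, audio_dim=768)
out = model(torch.randn(2, 50, 768))
print(out.shape)   # torch.Size([2, 58, 256]): 8 prompt tokens + 50 audio tokens
```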

However, several limitations remain. Low sensitivity to specific speaking states and the talking face synthesis network's reliance on limited visual instructions are cited as challenges. The authors attribute this to the heterogeneity of the dataset and to the fact that speakers' speech states are not always well identified.

Further fine-tuning and knowledge infusion using Retrieval-Augmented Generation (RAG) techniques are also being considered for future research. This would allow large-scale language models to be specialized for specific cross-modal audio-visual generation tasks, with the goal of generating even more expressive talking faces. More general and competitive results are also expected through the use of robust visual tokenizers and fine-tuning of general-purpose visual foundation models. These developments are expected to be important steps toward the future of talking face generation technology.

Takumu
I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.

If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.
