FreeMo, A Model That Automatically Generates Upper Body Gestures In Response To Speech, Is Here!
3 main points
 ✔️ Proposed FreeMo, a model that automatically generates upper body gestures in response to speech
 ✔️ Proposed a generation method using Pose mode branch and Rhythmic motion branch, which is different from previous gesture generation models
✔️ Demonstrated performance over existing baselines in terms of diversity, quality, and synchronization
Freeform Body Motion Generation from Speech
written by Jing Xu, Wei Zhang, Yalong Bai, Qibin Sun, Tao Mei
(Submitted on 4 Mar 2022)
Comments: Published on arxiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
Good speakers use gestures in conjunction with their speech to effectively convey information, and these gestures have become essential in enabling applications such as digital avatars and social robots.
However, while generating lip movements to match speech has been widely studied, translating speech into gestures has remained underexplored because the mapping is highly uncertain.
Specifically, the same person giving the same speech twice in a row does not always make the same gestures, the pose mode may occasionally switch during a long speech, and generating gestures for long speeches is difficult.
FreeMo (Free form Motion generation model), introduced in this paper, addresses these problems by decomposing gesture generation into two modules, Pose mode and Rhythmic motion, and succeeds in automatically generating upper body gestures in response to speech.
FreeMo Overview
The figure below shows the model outline of FreeMo (Free form Motion generation model).

Speech-driven gesture generation produces a sequence of motions corresponding to the input speech, which requires a mapping from speech to gesture.
However, this mapping is highly non-deterministic and multimodal, which has been a challenge in existing research.
To solve this problem, the paper proposes an approach that decomposes gesture generation into two complementary mappings: the Pose mode branch and the Rhythmic motion branch.
The Pose mode branch is responsible for generating various upper body poses by conditional sampling in the VAE latent space, and the Rhythmic motion branch is responsible for synchronizing the generated poses with the prosody of the speech.
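The two-branch idea can be illustrated with a toy sketch. This is not the architecture from the paper: the fixed linear "decoder", the skeleton size, the latent size, and every function name below are hypothetical stand-ins chosen only to show how a sampled pose and an audio-driven offset can be combined.

```python
import random

LATENT_DIM = 2  # hypothetical latent size, not taken from the paper

# Hypothetical stand-in for a trained VAE decoder: one weight row per joint.
DECODER_WEIGHTS = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.3], [0.2, 0.1]]


def sample_pose_mode(initial_pose, rng):
    """Toy Pose mode branch: draw z ~ N(0, I) and decode it, conditioned on the initial pose."""
    z = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    return [
        joint + sum(w * zi for w, zi in zip(row, z))
        for joint, row in zip(initial_pose, DECODER_WEIGHTS)
    ]


def rhythmic_offsets(audio_energy, amplitude=0.05):
    """Toy Rhythmic motion branch: per-frame offsets that follow the audio energy."""
    peak = max(audio_energy) or 1.0  # avoid dividing by zero on silent audio
    return [amplitude * energy / peak for energy in audio_energy]


def generate_gesture(initial_pose, audio_energy, seed):
    """Combine both branches: one sampled pose mode plus an audio-driven offset per frame."""
    rng = random.Random(seed)
    pose_mode = sample_pose_mode(initial_pose, rng)
    offsets = rhythmic_offsets(audio_energy)
    return [[joint + off for joint in pose_mode] for off in offsets]


# Placeholder inputs: 4 joint values and a 4-frame energy envelope.
initial_pose = [0.0, 0.1, 0.2, 0.3]
audio_energy = [0.0, 0.5, 1.0, 0.5]
gesture_a = generate_gesture(initial_pose, audio_energy, seed=1)
gesture_b = generate_gesture(initial_pose, audio_energy, seed=2)
```

In this sketch, different seeds give different pose modes for the same audio and initial pose, while the frame-to-frame motion driven by the audio stays the same. That mirrors the division of labor described above: diversity comes from sampling, synchronization comes from the audio.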
Comparison and validation with existing gesture generation models
In the experiments, FreeMo was compared with the following five models.
- Audio to Body Dynamics (Audio2Body): employs an RNN to convert speech to gestures
- Speech2Gesture (S2G): employs a CNN to generate gestures from speech
- Speech Drives Template (Tmpt): learns a gesture template to resolve ambiguity in the mapping from speech to body movements
- Trimodal-Context (TriCon): employs an RNN to learn from three inputs: speech, text, and speaker ID
- Mix-StAGE: a generative model that learns a unique style embedding per speaker
In the Speech2Gesture dataset, most of the videos are from TV programs, so there is a lot of interference from the environment, such as noise from the audience and the venue. In addition, the speaker is often seated on a chair or leaning against a desk, which limits the gestures.

Therefore, the evaluation used lecture videos from the TEDGesture dataset and videos collected from YouTube. The figure above shows samples from the Speech2Gesture dataset and the TEDGesture dataset.
Qualitative Results
The figure below shows a qualitative comparison between FreeMo and the existing methods.

From these results, we can see the following:
- In the existing methods, the generated gestures include hand deformations (circled areas in the figure), but in FreeMo such deformations are hardly observed
- Gestures generated by S2G and TriCon are often small actions with little expressive power
- As a result, the existing methods are not able to generate a clear pose change (red box in the figure) as seen in the Ground Truth data
- Compared to these existing methods, FreeMo generates more natural and expressive gestures
Next, to verify the gesture diversity of FreeMo, multiple gestures were generated from the same initial pose for the same speech. (The red boxes show the transition of Pose mode between the generated gestures and the ground truth gestures.)

It is noteworthy that the Pose mode branch enables the generation of various gestures from arbitrary initial poses, and that the gestures generated by the Rhythmic motion branch are sufficiently synchronized with the voice.
Subjective Evaluation
The paper also reports a user study comparing FreeMo with the baselines under the following conditions:
- From each dataset, 50 test audio clips of 10 to 30 seconds each were randomly selected
- Ten participants took part, and each watched the videos for 10 randomly selected audio clips
- Participants were asked to rate the videos generated by the different models on a 6-point scale from 1 to 6 (1 being the worst and 6 being the best)
The chart below shows the average of the 10 scores.

On both datasets, FreeMo scored the highest, and many users rated it as generating more natural and expressive gestures.
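As a small illustration of how such a chart is produced, the sketch below averages 6-point ratings per model. The ratings and model names are placeholders invented for this example, not results from the paper, and `average_scores` is a hypothetical helper.

```python
from statistics import mean

# Placeholder ratings, NOT results from the paper:
# one score (1 to 6) per participant for each model.
ratings = {
    "baseline_model": [3, 4, 2, 3, 4, 3, 3, 2, 4, 3],
    "proposed_model": [5, 4, 5, 6, 4, 5, 5, 4, 6, 5],
}


def average_scores(ratings_by_model):
    """Return the mean rating per model, rejecting scores outside the 1 to 6 scale."""
    for model, scores in ratings_by_model.items():
        if any(not 1 <= score <= 6 for score in scores):
            raise ValueError(f"rating out of range for {model}")
    return {model: mean(scores) for model, scores in ratings_by_model.items()}


averages = average_scores(ratings)
best_model = max(averages, key=averages.get)
```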
Summary
In this article, we looked at FreeMo (Free form Motion generation model), a model that automatically generates upper body gestures in response to speech.
The results of this work are exciting as they lead to the construction of virtual agents, which are essential for applications such as social robots employed in the field of robotics and digital avatars popularized in the Metaverse.
On the other hand, there is a risk that these technologies could be misused to generate fake videos, so caution is needed.
If you are interested, the details of the architecture of the model presented here and the generated gestures can be found in this paper.