
New Developments In Multi-person Conversation Video Generation! MIT Dataset And Baseline Model "CovOG"
3 main points
✔️ New "Multi-human Interactive Talking Dataset" for multi-person conversations
✔️ Proposed baseline model "CovOG" to generate natural conversational video with pause integration and voice control
✔️ Outperformed conventional methods in both quantitative evaluation and a user study, demonstrating its effectiveness in generating multi-person conversations
Multi-human Interactive Talking Dataset
written by Zeyu Zhu, Weijia Wu, Mike Zheng Shou
(Submitted on 5 Aug 2025)
Comments: 9 pages, 4 figures, 4 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
This paper proposes a new task, Multi-Human Talking Video Generation, which targets natural conversations among multiple people; conventional voice-driven video generation research has been limited to a single speaker or to the face region alone.
The centerpiece of the work is a 12-hour high-resolution dataset, the Multi-Human Interactive Talking Dataset (MIT).
This dataset collects conversational videos involving two to four people and automatically assigns pose estimation and speech state scores to comprehensively capture the interactions of speech, listening, and gesture that accompany multi-person conversations.
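To make the annotation structure concrete, here is a minimal sketch of what one annotated frame in such a dataset could look like. The field names (`pose_keypoints`, `speaking_score`, etc.) are illustrative assumptions for this article, not the dataset's actual schema.

```python
# Hypothetical per-frame record for a multi-person talking dataset.
# All field names are assumptions; the real MIT schema may differ.
def make_frame_record(frame_idx, people):
    """people: list of (person_id, keypoints, speaking_score) tuples."""
    return {
        "frame": frame_idx,
        "persons": [
            {
                "id": pid,
                "pose_keypoints": kps,    # 2D body keypoints from a pose estimator
                "speaking_score": score,  # 0.0 = silent listener, 1.0 = active speaker
            }
            for pid, kps, score in people
        ],
    }

# A two-person frame: "A" is speaking, "B" is listening.
record = make_frame_record(0, [("A", [[0.5, 0.2]], 0.9),
                               ("B", [[0.7, 0.3]], 0.1)])
```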
In addition, the authors have developed a baseline model, CovOG, to address this new challenge.
CovOG incorporates a Multi-Human Pose Encoder (MPE), which integrates per-person pose features, and an Interactive Audio Driver (IAD), which controls facial movements based on speech features, enabling natural reproduction of the role alternation between speaking and listening.
This enables video generation that mimics realistic scenarios such as interviews and talk shows, and presents an important foundation for future research development.
Proposed Methodology
The core of the proposed method is the baseline model "CovOG," an extension of "AnimateAnyone," an existing generative model for a single person.
First, the MPE (Multi-Human Pose Encoder) has a mechanism to process the poses cut out for each person separately in a convolutional network and then integrate them.
This allows the system to flexibly respond to changes in the number of people and generate an overall conversation scene while maintaining each person's independent body movements.
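The key property described above, per-person encoding with shared weights followed by integration, can be sketched in a few lines. This is a toy illustration, not the paper's architecture: a hand-rolled 2D correlation stands in for the convolutional network, and summation stands in for the integration step, which is permutation-invariant and works for any number of people.

```python
import numpy as np

def encode_pose(pose_map, kernel):
    # Toy stand-in for the conv network: a valid 2D correlation
    # over one person's pose heatmap.
    h, w = pose_map.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(pose_map[i:i + kh, j:j + kw] * kernel)
    return out

def multi_human_pose_encode(pose_maps, kernel):
    # Encode each person with the SAME weights, then sum the features:
    # the sum is order-independent and handles any number of people.
    feats = [encode_pose(p, kernel) for p in pose_maps]
    return np.sum(feats, axis=0)

# Two people with 4x4 pose maps fused into one 3x3 feature map.
pose_maps = [np.ones((4, 4)), 2 * np.ones((4, 4))]
fused = multi_human_pose_encode(pose_maps, np.ones((2, 2)))
```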
Next, the IAD (Interactive Audio Driver) takes the speech features and "speaking score" of each speaker as input and controls facial movements and expressions.
This mechanism synchronizes mouth movements with speech while a person is speaking, and produces natural listener responses, such as nodding and smiling, when that person is listening.
In addition, CovOG uses reference images to maintain the person's identity, and combines pose and voice conditions to generate video.
These innovations enable the generation of conversational videos of multiple people, which was not possible with conventional "face only" or "single speaker" models, greatly expanding the possibilities for video synthesis that includes natural interaction.
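One simple way to picture the conditioning described above is feature concatenation: identity features from the reference image, fused pose features, and audio features stacked into a single conditioning vector. This is a hedged sketch of the general pattern; the paper's actual fusion mechanism may differ.

```python
import numpy as np

def build_condition(identity_feat, pose_feat, audio_feat):
    # Concatenate the three conditioning signals into one vector.
    # Illustrative only; CovOG's real fusion may use cross-attention
    # or other learned mechanisms rather than plain concatenation.
    return np.concatenate([identity_feat, pose_feat, audio_feat], axis=0)

cond = build_condition(np.zeros(4), np.ones(3), np.ones(2))
```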
Experiments
In the experiments, the authors tested CovOG's performance on the MIT dataset and compared it to conventional methods.
SSIM and PSNR, which indicate image quality, and FVD, which measures motion consistency, were used for evaluation.
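Of the metrics named above, PSNR is the simplest to state precisely; the sketch below computes it from the mean squared error between a reference frame and a generated frame (SSIM and FVD involve structural statistics and a video feature network, respectively, and are omitted here).

```python
import numpy as np

def psnr(ref, gen, max_val=255.0):
    # Peak Signal-to-Noise Ratio: 10 * log10(MAX^2 / MSE),
    # in decibels; higher means the frames are closer.
    mse = np.mean((ref.astype(np.float64) - gen.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

# Frames differing by a constant 10 gray levels: MSE = 100.
val = psnr(np.zeros((2, 2)), np.full((2, 2), 10.0))
```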
The results showed that CovOG consistently outperformed representative methods such as AnimateAnyone and ControlSVD, and showed stable quality, especially in multi-person conversational scenes.
Ablation experiments confirmed that removing MPE resulted in a loss of overall posture control and removing IAD resulted in unnatural facial movements, quantitatively demonstrating the effectiveness of both modules.
In the user study, CovOG also received high marks for character consistency, synchronization with audio, and overall video naturalness.
Furthermore, in a "cross-modal experiment" combining identities, poses, and audio from different videos, CovOG maintained temporal smoothness and spatial consistency, demonstrating its high versatility.
These results demonstrate that the proposed model is suitable for reproducing realistic multi-speaker dialogues.