
New Developments In Multi-person Conversation Video Generation! MIT Dataset And Baseline Model "CovOG"
3 main points
✔️ New "Multi-human Interactive Talking Dataset" for multi-person conversations
✔️ Proposed baseline model "CovOG" to generate natural conversational video with pause integration and voice control
✔️ Outperformed conventional methods in both quantitative evaluation and a user study, demonstrating its effectiveness in generating multi-person conversations
Multi-human Interactive Talking Dataset
written by Zeyu Zhu, Weijia Wu, Mike Zheng Shou
(Submitted on 5 Aug 2025)
Comments: 9 pages, 4 figures, 4 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
This paper proposes a new task, Multi-Human Talking Video Generation, which targets natural conversations among multiple people; conventional voice-driven video generation research has been limited to a single speaker or to the face region alone.
The centerpiece of the work is a 12-hour high-resolution dataset, the Multi-Human Interactive Talking Dataset (MIT).
This dataset collects conversational videos involving two to four people and automatically assigns pose estimation and speech state scores to comprehensively capture the interactions of speech, listening, and gesture that accompany multi-person conversations.
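To make the annotation structure concrete, here is a minimal sketch of what one annotated frame in such a dataset could look like. The field names (`pose_keypoints`, `speaking_score`, etc.) are illustrative assumptions for this article, not the dataset's actual schema.

```python
# Hypothetical per-frame record for a multi-person talking dataset.
# All field names are assumptions; the real MIT schema may differ.
def make_frame_record(frame_idx, people):
    """people: list of (person_id, keypoints, speaking_score) tuples."""
    return {
        "frame": frame_idx,
        "persons": [
            {
                "id": pid,
                "pose_keypoints": kps,    # 2D body keypoints from a pose estimator
                "speaking_score": score,  # 0.0 = silent listener, 1.0 = active speaker
            }
            for pid, kps, score in people
        ],
    }

# A two-person frame: "A" is speaking, "B" is listening.
record = make_frame_record(0, [("A", [[0.5, 0.2]], 0.9),
                               ("B", [[0.7, 0.3]], 0.1)])
```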
In addition, the authors have developed a baseline model, CovOG, to address this new challenge.
CovOG incorporates a Multi-Human Pose Encoder (MPE), which integrates per-person pose features, and an Interactive Audio Driver (IAD), which controls facial movements based on speech features, enabling natural reproduction of the role alternation between speaking and listening.
This enables video generation that mimics realistic scenarios such as interviews and talk shows, and presents an important foundation for future research development.
Proposed Methodology
The core of the proposed method is the baseline model "CovOG," an extension of "AnimateAnyone," an existing generative model for a single person.
First, the MPE (Multi-Human Pose Encoder) has a mechanism to process the poses cut out for each person separately in a convolutional network and then integrate them.
This allows the system to flexibly respond to changes in the number of people and generate an overall conversation scene while maintaining each person's independent body movements.
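The key property described above, per-person encoding with shared weights followed by integration, can be sketched in a few lines. This is a toy illustration, not the paper's architecture: a hand-rolled 2D correlation stands in for the convolutional network, and summation stands in for the integration step, which is permutation-invariant and works for any number of people.

```python
import numpy as np

def encode_pose(pose_map, kernel):
    # Toy stand-in for the conv network: a valid 2D correlation
    # over one person's pose heatmap.
    h, w = pose_map.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(pose_map[i:i + kh, j:j + kw] * kernel)
    return out

def multi_human_pose_encode(pose_maps, kernel):
    # Encode each person with the SAME weights, then sum the features:
    # the sum is order-independent and handles any number of people.
    feats = [encode_pose(p, kernel) for p in pose_maps]
    return np.sum(feats, axis=0)

# Two people with 4x4 pose maps fused into one 3x3 feature map.
pose_maps = [np.ones((4, 4)), 2 * np.ones((4, 4))]
fused = multi_human_pose_encode(pose_maps, np.ones((2, 2)))
```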
Next, the IAD (Interactive Audio Driver) takes the speech features and "speaking score" of each speaker as input and controls facial movements and expressions.
This mechanism synchronizes mouth movements with speech while a person is speaking, and produces natural listener responses, such as nodding and smiling, when that person is listening.
In addition, CovOG uses reference images to maintain the person's identity, and combines pose and voice conditions to generate video.
These innovations enable the generation of conversational videos of multiple people, which was not possible with conventional "face only" or "single speaker" models, greatly expanding the possibilities for video synthesis that includes natural interaction.
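One simple way to picture the conditioning described above is feature concatenation: identity features from the reference image, fused pose features, and audio features stacked into a single conditioning vector. This is a hedged sketch of the general pattern; the paper's actual fusion mechanism may differ.

```python
import numpy as np

def build_condition(identity_feat, pose_feat, audio_feat):
    # Concatenate the three conditioning signals into one vector.
    # Illustrative only; CovOG's real fusion may use cross-attention
    # or other learned mechanisms rather than plain concatenation.
    return np.concatenate([identity_feat, pose_feat, audio_feat], axis=0)

cond = build_condition(np.zeros(4), np.ones(3), np.ones(2))
```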
Experiments
In the experiments, the authors tested CovOG's performance on the MIT dataset and compared it to conventional methods.
SSIM and PSNR, which indicate image quality, and FVD, which measures motion consistency, were used for evaluation.
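Of the metrics named above, PSNR is the simplest to state precisely; the sketch below computes it from the mean squared error between a reference frame and a generated frame (SSIM and FVD involve structural statistics and a video feature network, respectively, and are omitted here).

```python
import numpy as np

def psnr(ref, gen, max_val=255.0):
    # Peak Signal-to-Noise Ratio: 10 * log10(MAX^2 / MSE),
    # in decibels; higher means the frames are closer.
    mse = np.mean((ref.astype(np.float64) - gen.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

# Frames differing by a constant 10 gray levels: MSE = 100.
val = psnr(np.zeros((2, 2)), np.full((2, 2), 10.0))
```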
The results showed that CovOG consistently outperformed representative methods such as AnimateAnyone and ControlSVD, and showed stable quality, especially in multi-person conversational scenes.
Ablation experiments confirmed that removing MPE resulted in a loss of overall posture control and removing IAD resulted in unnatural facial movements, quantitatively demonstrating the effectiveness of both modules.
In the user study, CovOG also received high marks for character consistency, synchronization with audio, and overall video naturalness.
Furthermore, in a "cross-modal experiment" combining identities, poses, and audio from different videos, CovOG maintained temporal smoothness and spatial consistency, demonstrating its high versatility.
These results demonstrate that the proposed model is suitable for reproducing realistic multi-speaker dialogues.