
LongVie: A New Era of 1-Minute Ultra-High-Quality Video Generation Realized by Multimodal Control
3 main points
✔️ LongVie is a generation framework for videos longer than 1 minute that achieves both temporal consistency and high image quality
✔️ Introduces unified noise initialization, global control-signal normalization, multimodal control, and degradation-aware training
✔️ Evaluated on the newly constructed LongVGenBench, demonstrating consistency and quality beyond existing methods
LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation
written by Jianxiong Gao, Zhaoxi Chen, Xian Liu, Jianfeng Feng, Chenyang Si, Yanwei Fu, Yu Qiao, Ziwei Liu
(Submitted on 5 Aug 2025)
Comments: Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
This paper proposes LongVie, a new framework for high-quality, controllable generation of ultra-long videos that span more than one minute.
Recent advances in diffusion modeling have led to the rapid development of techniques for generating short videos from text and images, but the generation of long videos presents significant challenges such as lack of temporal consistency and image quality degradation.
Conventional methods use an auto-regressive approach that generates short clips sequentially.
However, this approach is prone to unnatural transitions and flickering at clip boundaries, and visual quality degrades over time.
This study identifies independent noise initialization across clips, per-clip normalization of control signals, and the limitations of single-modality control as the causes of these problems.
By introducing unified noise initialization, global control-signal normalization, multimodal control, and degradation-aware training to address these issues, LongVie achieves unprecedentedly long, smooth, and high-quality video generation.
Proposed Method
The proposed LongVie is based on an auto-regressive generation framework and combines several novel techniques to enable long video generation.
First, "unified noise initialization" ensures that each clip is generated from the same latent noise, thus maintaining consistent motion and appearance across clips.
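The idea can be sketched in a few lines: instead of drawing fresh latent noise for every clip, one noise tensor is sampled once and reused. The function and shapes below are hypothetical, not the paper's implementation.

```python
import numpy as np

def make_shared_noise(latent_shape, seed=0):
    """Sample latent noise once and reuse it for every clip,
    instead of drawing a fresh sample per clip (hedged sketch)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(latent_shape)

shared = make_shared_noise((4, 8, 32, 32))
# Each clip in the autoregressive loop starts denoising from a copy
# of the same tensor, which helps keep motion and appearance
# consistent across clip boundaries.
clip_inits = [shared.copy() for _ in range(3)]
```

The key point is that the per-clip initializations are identical, removing one source of clip-to-clip variation.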
Second, "Global Control Signal Normalization" unifies the scale of control signals (e.g., depth maps) throughout the entire video to prevent inconsistencies between scenes.
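Using depth maps as the example: a per-clip min–max normalization rescales each clip independently, so the same physical depth can map to different control values in different clips. A global normalization fixes one scale for the whole video. This is a minimal illustrative sketch; the paper's exact normalization may differ.

```python
import numpy as np

def normalize_globally(depth_clips, eps=1e-8):
    """Min-max normalize all depth clips with one shared global
    range, so the same depth maps to the same control value in
    every clip (illustrative sketch)."""
    lo = min(float(c.min()) for c in depth_clips)
    hi = max(float(c.max()) for c in depth_clips)
    return [(c - lo) / (hi - lo + eps) for c in depth_clips]

clips = [np.array([[0.0, 5.0]]), np.array([[2.0, 10.0]])]
norm = normalize_globally(clips)
# A depth of 5.0 maps to ~0.5 under the shared scale, whereas
# per-clip normalization would map it to 1.0 in the first clip.
```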
Furthermore, "multimodal control" is introduced to integrate dense control signals (depth maps) and sparse control signals (keypoints) to achieve both structural accuracy and semantic consistency.
However, since dense signals tend to dominate, LongVie uses a degradation-aware training strategy that intentionally weakens or degrades the dense signals during training to keep them balanced with the sparse signals.
This enables time-smooth, high-quality, and controllable video generation.
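One way to realize such degradation-aware training is to randomly perturb the dense depth branch at training time so the model cannot rely on it alone. The degradations below (additive noise plus random pixel dropout) are assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np

def degrade_dense_signal(depth, rng, p_drop=0.3, noise_std=0.1):
    """Weaken the dense control signal during training by adding
    Gaussian noise and randomly zeroing pixels, pushing the model
    to also exploit the sparse keypoint branch (hedged sketch)."""
    noisy = depth + rng.standard_normal(depth.shape) * noise_std
    keep = rng.random(depth.shape) > p_drop  # keep ~70% of pixels
    return noisy * keep

rng = np.random.default_rng(0)
depth = np.ones((8, 8))
degraded = degrade_dense_signal(depth, rng)
```

At inference time the clean dense signal would be used; the degradation only rebalances the two branches during training.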
This framework can also be extended to applied tasks such as video editing, scene transfer, and video generation from 3D meshes.
Experiments
In the experiments, an evaluation benchmark, LongVGenBench, was first constructed.
This is a dataset consisting of 100 high-resolution videos, including both real-world and synthetic environments, all longer than one minute.
The benchmark was used to compare LongVie with representative existing video generation models (CogVideoX, StreamingT2V, VideoComposer, etc.).
Evaluation metrics used included subject/background consistency, temporal style, flicker suppression, and image quality ratings (SSIM and LPIPS).
The results showed that LongVie outperformed existing methods on almost all metrics, especially temporal consistency and visual quality.
Furthermore, in the user study, LongVie received the highest ratings in terms of visual quality, consistency with prompts, and temporal smoothness.
In addition, ablation experiments confirmed the effectiveness of unified noise initialization, global normalization, and degradation-aware learning, respectively.
Overall, the proposed method sets a new standard in long video generation.