
LongVie: A New Era Of 1-minute Ultra-High Quality Video Generation Realized By Multimodal Control

3 main points
✔️ LongVie is a generation framework for videos longer than 1 minute that achieves both temporal consistency and high image quality
✔️ Introduces unified noise initialization, global control-signal normalization, multimodal control, and degradation-aware learning
✔️ Evaluated on the newly constructed LongVGenBench, demonstrating consistency and quality beyond existing methods

LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation
written by Jianxiong Gao, Zhaoxi Chen, Xian Liu, Jianfeng Feng, Chenyang Si, Yanwei Fu, Yu Qiao, Ziwei Liu
(Submitted on 5 Aug 2025)
Comments: Project page: this https URL

Subjects:  Computer Vision and Pattern Recognition (cs.CV)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

This paper proposes LongVie, a new framework for high-quality, controllable generation of ultra-long videos that span more than one minute.

Recent advances in diffusion models have rapidly improved the generation of short videos from text and images, but generating long videos still faces significant challenges, such as loss of temporal consistency and degradation of image quality.
Traditional methods have used an auto-regressive approach that generates short clips sequentially.
However, this approach is prone to unnatural transitions and flickering at clip boundaries, and image quality degrades as the video progresses.

This study identifies three causes of these problems: independent noise initialization across clips, per-clip normalization of control signals, and the limitations of single-modality control.
Then, by introducing unified noise initialization, global control signal normalization, multimodal control, and degradation-aware learning as solutions to these problems, we have achieved unprecedentedly long, smooth, and high-quality video generation.

Proposed Method

The proposed LongVie is based on an auto-regressive generation framework and combines several novel techniques to enable long video generation.
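To make the auto-regressive setup concrete, here is a minimal sketch of clip-by-clip generation in which each new clip is conditioned on the last frame of the previous one. The `model` callable and its parameters are placeholders for illustration, not the paper's actual interface.

```python
def generate_long_video(model, prompt, n_clips, frames_per_clip):
    # Hypothetical auto-regressive loop: generate short clips sequentially,
    # conditioning each clip on the final frame of the previous clip so
    # that motion continues across clip boundaries.
    video, last_frame = [], None
    for _ in range(n_clips):
        clip = model(prompt, first_frame=last_frame, length=frames_per_clip)
        video.extend(clip)
        last_frame = clip[-1]  # hand-off frame for the next clip
    return video
```

This conditioning alone does not prevent the flicker and drift described above, which is what the techniques below address.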

First, "unified noise initialization" ensures that each clip is generated from the same latent noise, thus maintaining consistent motion and appearance across clips.
Second, "global control signal normalization" unifies the scale of control signals (e.g., depth maps) throughout the entire video to prevent inconsistencies between scenes.
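The two ideas can be sketched as follows, assuming simplified list-based "latents" and "depth maps"; the function names and shapes are illustrative, not the paper's implementation.

```python
import random

def make_unified_noise(n_rows, n_cols, seed=42):
    # Unified noise initialization (sketch): every clip is generated from
    # the SAME fixed latent noise rather than freshly sampled noise per
    # clip, which the paper identifies as a source of inter-clip flicker.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_cols)]
            for _ in range(n_rows)]

def normalize_globally(depth_clips):
    # Global control-signal normalization (sketch): rescale depth values
    # using min/max statistics pooled over the WHOLE video, not per clip,
    # so the control scale stays consistent across clip boundaries.
    vals = [v for clip in depth_clips for frame in clip for v in frame]
    lo, hi = min(vals), max(vals)
    span = (hi - lo) or 1.0
    return [[[(v - lo) / span for v in frame] for frame in clip]
            for clip in depth_clips]
```

Per-clip normalization would instead map each clip to its own [0, 1] range, so the same absolute depth could receive different control values in adjacent clips.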

Furthermore, "multimodal control" is introduced to integrate dense control signals (depth maps) and sparse control signals (keypoints) to achieve both structural accuracy and semantic consistency.
However, since dense signals tend to be dominant, LongVie uses a "degradation-aware learning strategy" to intentionally weaken or degrade the dense signals to maintain balance with the sparse signals.
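A minimal sketch of this balancing idea, assuming list-based features; the dropout-style degradation and the weighted fusion are hypothetical stand-ins for the paper's training strategy and control injection.

```python
import random

def degrade_dense_signal(depth_map, drop_prob=0.5, seed=None):
    # Degradation-aware training (sketch): randomly zero out parts of the
    # dense depth signal during training so the model cannot rely on it
    # exclusively and must also attend to the sparse keypoint signal.
    rng = random.Random(seed)
    return [[0.0 if rng.random() < drop_prob else v for v in row]
            for row in depth_map]

def fuse_controls(dense_feat, sparse_feat, w_dense=0.5, w_sparse=0.5):
    # Illustrative weighted fusion of dense and sparse control features;
    # the actual model injects both control branches into the backbone.
    return [d * w_dense + s * w_sparse
            for d, s in zip(dense_feat, sparse_feat)]
```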

This enables time-smooth, high-quality, and controllable video generation.
This framework can also be extended to applied tasks such as video editing, scene transfer, and video generation from 3D meshes.

Experiments

In the experiments, an evaluation benchmark, LongVGenBench, was first constructed.
This is a dataset consisting of 100 high-resolution videos, including both real-world and synthetic environments, all longer than one minute.

The benchmark was used to compare the results with typical existing video generation models (CogVideoX, StreamingT2V, VideoComposer, etc.).
Evaluation metrics included subject/background consistency, temporal style, flicker suppression, and image-quality measures (SSIM and LPIPS).
The results showed that LongVie outperformed existing methods on almost all metrics, especially temporal consistency and visual quality.
Furthermore, in the user study, LongVie received the highest ratings in terms of visual quality, consistency with prompts, and temporal smoothness.

In addition, ablation experiments confirmed the effectiveness of unified noise initialization, global normalization, and degradation-aware learning, respectively.
Overall, the proposed method sets a new standard in long video generation.
