
Innovation in Feature-Length Video Generation with Mixture of Contexts! Efficient Context Preservation and High-Precision Generation

3 main points
✔️ Reformulates long video generation as an internal "information retrieval" problem and proposes an efficient context-preservation method
✔️ Mixture of Contexts reduces computation by dynamically referencing only the relevant contexts
✔️ In experiments, maintains high accuracy while cutting computation to roughly one-seventh of conventional methods, enabling generation of videos on the scale of several minutes

Mixture of Contexts for Long Video Generation
written by Shengqu Cai, Ceyuan Yang, Lvmin Zhang, Yuwei Guo, Junfei Xiao, Ziyan Yang, Yinghao Xu, Zhenheng Yang, Alan Yuille, Leonidas Guibas, Maneesh Agrawala, Lu Jiang, Gordon Wetzstein
(Submitted on 28 Aug 2025)
Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Overview

This research addresses the biggest challenge in long-duration video generation: preserving long-term context.

Conventional Diffusion Transformers are built on full self-attention, whose computational cost grows quadratically with sequence length, making it difficult to generate videos on the scale of several minutes.
Previous methods compressed the history or applied fixed sparsification, but they suffered from lost detail and dropped important context.

The authors therefore reformulated video generation as an "internal information retrieval" problem and proposed a framework that dynamically references only the relevant history for each query.
In this framework, the video is divided into frames or shots, and each query selects the most meaningful context.

Furthermore, caption and local shot information are always used as mandatory reference points, guaranteeing narrative continuity and subject consistency.
As a result, the authors show that the method maintains high accuracy and consistency even for feature-length videos spanning several minutes, while significantly reducing computation.

Proposed Method

The proposed method, Mixture of Contexts (MoC), replaces full self-attention with a dynamic context-selection mechanism.

First, the sequence is divided into semantically coherent chunks such as frames, shots, and caption segments.
Each chunk is summarized by a mean-pooled feature vector; every query then computes inner products with these chunk descriptors and performs attention only over the top-k most relevant chunks.
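
As a rough illustration of this routing step, the following PyTorch sketch (single attention head; the function name moc_route, the shapes, and the default top_k are our assumptions, not the authors' released code) mean-pools each chunk's keys into a descriptor, scores every query against those descriptors, and restricts attention to the top-k selected chunks.

from itertools import accumulate
import torch
import torch.nn.functional as F

def moc_route(q, k, v, chunk_sizes, top_k=2):
    """q, k, v: (L, d) query/key/value tokens for one attention head.
    chunk_sizes: lengths of the frame/shot/caption chunks, summing to L."""
    L, d = q.shape
    bounds = [0] + list(accumulate(chunk_sizes))
    # Describe each chunk by the mean of its key vectors (mean pooling).
    chunk_desc = torch.stack([k[bounds[i]:bounds[i + 1]].mean(dim=0)
                              for i in range(len(chunk_sizes))])      # (n, d)
    # Score every query against every chunk descriptor and keep the top-k chunks.
    scores = q @ chunk_desc.T                                         # (L, n)
    top_chunks = scores.topk(top_k, dim=-1).indices                   # (L, top_k)
    # Build a token-level mask that exposes only the selected chunks to each query.
    allow = torch.zeros(L, L, dtype=torch.bool)
    for i in range(len(chunk_sizes)):
        routed = (top_chunks == i).any(dim=-1)                        # queries routed to chunk i
        allow[routed, bounds[i]:bounds[i + 1]] = True
    attn = (q @ k.T) / d ** 0.5
    attn = attn.masked_fill(~allow, float("-inf"))
    return F.softmax(attn, dim=-1) @ v

In the actual method the selection would run inside every attention layer and head of the diffusion transformer, and unselected chunks would simply never be computed rather than masked out; the dense mask above only serves to make the selection logic explicit.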

Furthermore, the design always connects every query to all caption tokens and to all tokens within the same shot as mandatory links, ensuring local fidelity while focusing computational resources on important long-range dependencies.
Causality is also enforced along the time axis so that the routing cannot form loops and the generative process does not break down.
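
These two rules can be viewed as extra entries in the routing mask. The sketch below layers them on top of the top-k mask from the previous snippet; the tensor layout (a caption mask, per-token shot indices, and chunk indices that increase with time) is our assumption for illustration.

import torch

def add_mandatory_links(allow, caption_mask, shot_ids, chunk_ids):
    """allow:        (L, L) bool mask produced by the top-k routing step.
    caption_mask: (L,) bool, True for caption tokens (assumed to form the earliest chunks).
    shot_ids:     (L,) long, shot index of each token.
    chunk_ids:    (L,) long, chunk index of each token, increasing along time."""
    # 1) Every query may always attend to all caption tokens.
    allow = allow | caption_mask.unsqueeze(0)
    # 2) Every query may always attend to the tokens of its own shot (local fidelity).
    allow = allow | (shot_ids.unsqueeze(1) == shot_ids.unsqueeze(0))
    # 3) Chunk-level causality: never attend to chunks later in time,
    #    which prevents routing loops from forming.
    return allow & (chunk_ids.unsqueeze(1) >= chunk_ids.unsqueeze(0))

The resulting mask would replace allow in the previous snippet before the softmax.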

This mechanism prunes more than 85% of wasteful attention computation while maintaining subject consistency and motion continuity.
Unlike conventional compression or fixed sparsification, the context selection is flexible and learnable.
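
A back-of-the-envelope count makes the source of this saving visible. With sequence length L, model width d, n chunks of roughly L/n tokens each, and top-k chunk selection (ignoring the mandatory caption and intra-shot links), the attention cost changes roughly as follows; the expressions are our own illustration, not figures reported in the paper.

\[
\underbrace{\mathcal{O}\!\left(L^{2} d\right)}_{\text{dense self-attention}}
\;\longrightarrow\;
\underbrace{\mathcal{O}\!\left(L \cdot k \cdot \tfrac{L}{n} \cdot d\right)}_{\text{top-}k\text{ chunk routing}},
\qquad k \ll n .
\]

For instance, with an illustrative setting of k = 2 chunks selected out of n = 16, each query touches roughly k/n ≈ 12.5% of the keys, in line with the reported pruning of more than 85% of the attention computation.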

Experiments

The authors conducted experiments on both single-shot and multi-shot video generation to confirm the effectiveness of the proposed method MoC.

LCT, an existing long-context video generation method, was used as the base model, and its self-attention layers were replaced with MoC for comparison.
VBench was used for evaluation, with subject consistency, background consistency, motion smoothness, and dynamic degree as metrics.

The results show that MoC matches or exceeds the accuracy of dense self-attention on short videos, while for long videos computation drops to less than one-seventh and generation speed improves by a factor of 2.2.
In particular, improvements were seen in motion diversity and scene consistency, overcoming the degradation from information compression that conventional methods suffered.

Furthermore, MoC showed high stability in zero-shot experiments, confirming its applicability to other diffusion models.
These results demonstrate that MoC can achieve both efficiency and expressiveness in long-form video generation.

If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.
