CogVideo, An Open Source Model Capable Of Generating Video From Text, Is Now Available!

Video Generation 11/10/2022

3 main points
✔️ Proposes CogVideo, the largest and first open-source model for text-to-video generation
✔️ Inherits the pre-trained text-to-image generation model CogView2 in the text-to-video generation model for efficient learning
✔️ Proposes a multi-frame-rate hierarchical training method to better align text and clip pairs

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
written by Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, Jie Tang
(Submitted on 29 May 2022)
Comments: Published on arxiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

code：

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

In recent years, many large-scale pre-trained transformer models have been proposed, including milestone models for many tasks, such as GPT-3 for text generation and DALL-E for text-to-image generation.

However, these large-scale models have rarely been applied to video generation, because text-to-video datasets are small and the models struggle to understand the meaning of complex actions.

This article introduces CogVideo, the largest and first open-source text-to-video generation model, which builds on the existing text-to-image generation model CogView2.

CogVideo is a large text-to-video generation model with 9.4 billion parameters, trained on 5.4 million text-video pairs, and it is released as an open-source model that anyone can use.

Existing Challenges in Video Generation

One of the major challenges in text-to-video generation is that the generated video frames tend to gradually deviate from the input text. Existing models can generate regular motion (e.g., a car driving straight ahead) or random motion (e.g., random lip movements in a talking video), but they fail on text such as "a lion is drinking water".

The difference between these two kinds of examples is that in the former, the first frame provides sufficient information for the subsequent changes, whereas in the latter, the video must show the following sequence of actions:

The lion brings its lips to the glass
The lion drinks the water
The lion puts down the glass

To generate this sequence correctly, the model must correctly understand the action "drink".

The authors of this paper hypothesized that the reason these behaviors are difficult to understand lies in the dataset and how it is utilized.

Specifically, while it is possible to collect billions of high-quality text-image pairs from the Internet, it is difficult to do so for text-video pairs: VATEX, the largest annotated text-video dataset currently available, contains only 41,250 videos. In addition, although video durations vary widely, existing models split each video into many clips with a fixed number of frames for training, which breaks the temporal correspondence between the text and the video. In the example above, if a video with the single text "drink" is split into four separate clips showing "hold a glass", "lift", "drink", and "put down", it is difficult for the model to learn the exact meaning of the action "drink".
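To make the problem concrete, the following is a minimal sketch of fixed-length clip splitting. The per-frame labels, the 16-frame length, and the function name split_into_fixed_clips are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def split_into_fixed_clips(frame_labels, clip_len):
    """Cut a frame sequence into consecutive clips of clip_len frames."""
    return [frame_labels[i:i + clip_len]
            for i in range(0, len(frame_labels), clip_len)]


# Hypothetical per-frame annotation of one 16-frame video captioned "drink".
frames = ["hold"] * 4 + ["lift"] * 4 + ["drink"] * 4 + ["put down"] * 4

clips = split_into_fixed_clips(frames, clip_len=4)
for clip in clips:
    # Every clip inherits the caption "drink" but shows only one sub-action.
    print("caption='drink' ->", sorted(set(clip)))
```

In this toy case, only one of the four clips actually shows drinking, although all four would be paired with the same text during training.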

CogVideo addresses this problem with the following two techniques:

Efficient learning by inheriting the pre-trained text-to-image generation model CogView2 in the text-to-video generation model
A multi-frame-rate hierarchical training method for better alignment of text-clip pairs

With these techniques, CogVideo generates high-resolution videos that look natural, as shown in the sample below.

Note that the actual text input is in Chinese, and each sample is generated as a 4-second clip of 32 frames, from which 9 frames are sampled uniformly for display. (You can try out the video generation here)
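As a quick check of these numbers, 32 frames over 4 seconds corresponds to 8 frames per second. The sketch below also picks 9 evenly spaced frame indices. The exact sampling rule used for the displayed samples is not stated in this article, so the rounding scheme and the function name uniform_indices are assumptions.

```python
def uniform_indices(num_frames, num_samples):
    """Return evenly spaced frame indices, including the first and last frame."""
    if num_samples == 1:
        return [0]
    step = (num_frames - 1) / (num_samples - 1)
    # ASSUMPTION: round to the nearest frame; the authors' rule may differ.
    return [round(i * step) for i in range(num_samples)]


clip_frames = 32
clip_seconds = 4

frames_per_second = clip_frames / clip_seconds
display_indices = uniform_indices(clip_frames, 9)
print(frames_per_second, display_indices)
```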

CogVideo Overview

CogVideo introduces two components: Multi-frame-rate Hierarchical Training, a hierarchical training method that uses multiple frame rates to align text with actions naturally, and Dual-channel Attention, which lets the video generation model inherit knowledge from the pre-trained text-to-image generation model CogView2.

Let's take a closer look at each of them.

Multi-frame-rate Hierarchical Training

This method broadly follows the VQVAE framework, but it is characterized by two stages: the Sequential Generation stage and the Recursive Interpolation stage. (See the figure below)

In the Sequential Generation stage in the figure, keyframes are generated sequentially, conditioned on the frame rate and the input text. In the Recursive Interpolation stage, the generated frames are re-entered as bidirectional attention regions, and the frames between them are interpolated recursively. (In the figure, unidirectional attention regions are shown in green, and bidirectional attention regions in blue.)

This ensures that the text and the generated frames are learned to match as closely as possible.
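The two stages can be pictured as a schedule of frame timestamps. The sketch below is only an illustration under stated assumptions: the keyframe count (5), the frame rate (1 frame per second), the number of interpolation rounds (2), and the midpoint-insertion rule are all hypothetical values, not the paper's settings, and no actual frames are generated.

```python
def keyframe_times(num_frames, frame_rate):
    """Sequential Generation stage: timestamps of keyframes at a low frame rate."""
    return [i / frame_rate for i in range(num_frames)]


def interpolate_once(times):
    """One Recursive Interpolation round: add a frame between each adjacent pair."""
    if len(times) < 2:
        return list(times)
    result = []
    for left, right in zip(times, times[1:]):
        # ASSUMPTION: the new frame sits at the temporal midpoint.
        result.extend([left, (left + right) / 2])
    result.append(times[-1])
    return result


times = keyframe_times(num_frames=5, frame_rate=1)
for _ in range(2):
    times = interpolate_once(times)

print(len(times), times)
```

Because the keyframes span the whole clip before any interpolation happens, the text is matched against the complete action rather than against an arbitrary fragment of it.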

Dual-channel Attention

Large-scale pre-training typically requires large datasets, and open-domain text-video generation requires a large enough dataset for the model to infer the correlation between text and video. However, collecting high-quality text-video pairs is impractical given the cost and time involved.

Existing work such as Video Diffusion Models and NUWA achieves good results by adding text-image pairs to the training of text-video generation, but adding image data significantly increases the training cost, especially for large-scale pre-training.

Therefore, unlike existing research, the paper proposes to use a pre-trained image generation model, CogView2, instead of image data, and to extend it with a newly added Attention-plus layer. (See the figure below)

Specifically, the Dual-channel Attention mechanism simply adds a new attention layer (the Attention-plus layer in the above figure) to the pre-trained CogView2 in each transformer layer, so that each layer has a Spatial Channel and a Temporal Channel. All the parameters of CogView2 are frozen at training time, and only the parameters of the newly added attention layer are trainable.
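The following is a minimal sketch of how two attention channels might be combined while one of them stays frozen. It rests on assumptions: this article does not say how the channel outputs are mixed, so the per-feature sigmoid gate is one plausible choice, and all parameter names (attention_base, attention_plus, ffn) are hypothetical. Plain lists of floats stand in for the attention outputs.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dual_channel_mix(base_out, plus_out, gate_logits):
    """Blend the frozen channel and the trainable channel feature by feature.

    ASSUMPTION: a sigmoid gate does the mixing; the article does not
    specify how the two channels are combined.
    """
    mixed = []
    for base, plus, logit in zip(base_out, plus_out, gate_logits):
        alpha = sigmoid(logit)
        mixed.append(alpha * base + (1.0 - alpha) * plus)
    return mixed


def trainable_parameters(param_names):
    """Keep only parameters of the newly added layer; the rest stay frozen."""
    return [name for name in param_names if name.startswith("attention_plus.")]


# Hypothetical parameter names for a single transformer layer.
layer_params = [
    "attention_base.qkv", "attention_base.out", "ffn.w1",
    "attention_plus.qkv", "attention_plus.out", "attention_plus.gate",
]
print(trainable_parameters(layer_params))

# A strongly positive gate reproduces the frozen channel's output.
print(dual_channel_mix([1.0, 2.0], [3.0, 4.0], [50.0, 50.0]))
```

With a gate of this kind, a layer can start out behaving like the frozen image model and shift weight toward the new channel only as training requires, which is consistent with keeping the image generation ability intact.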

Machine Evaluation

The paper evaluates CogVideo on two representative benchmarks for video generation, UCF101 and Kinetics-600, using Fréchet Video Distance (FVD) and Inception Score (IS) as evaluation metrics.
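FVD is based on the Fréchet distance between two Gaussian distributions fitted to features of real and generated videos, where a lower value means the two sets are more similar. The sketch below computes that distance in a deliberately simplified form: it assumes diagonal covariances, so each feature dimension is handled independently, and it works on small hand-written feature vectors rather than features from a video network. The function name and the sample values are hypothetical.

```python
from statistics import fmean, pstdev


def frechet_distance_diagonal(feats_a, feats_b):
    """Fréchet distance between two Gaussians fitted per feature dimension.

    SIMPLIFICATION: covariances are assumed diagonal, which reduces the
    distance to a sum of squared mean gaps and squared std gaps.
    """
    total = 0.0
    for dim in range(len(feats_a[0])):
        col_a = [row[dim] for row in feats_a]
        col_b = [row[dim] for row in feats_b]
        mean_gap = fmean(col_a) - fmean(col_b)
        std_gap = pstdev(col_a) - pstdev(col_b)
        total += mean_gap ** 2 + std_gap ** 2
    return total


# Hypothetical 2-dimensional features for two "real" and two "generated" videos.
real_feats = [[0.0, 0.0], [2.0, 2.0]]
fake_feats = [[1.0, 0.0], [3.0, 2.0]]

print(frechet_distance_diagonal(real_feats, real_feats))
print(frechet_distance_diagonal(real_feats, fake_feats))
```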

The table below shows the results on UCF101 (left) and Kinetics-600 (right). (* indicates that the model was trained only on the UCF101 training data, while ** indicates that the tokenizer reconstruction results were used as the ground truth in FVD testing)

As the table shows, CogVideo scores very well on both metrics.

Human Evaluation

To further evaluate CogVideo, the authors conducted a user survey in which 90 anonymous individuals compared CogVideo with open-source baselines such as TGANv2, a GAN-based model, and VideoGPT, a GPT-based model.

The table below shows the results of each model evaluated from various aspects using randomly selected text from the 30 classes of UCF101 as input.

The table shows that 49.53% of the raters selected CogVideo as the best method, while VideoGPT and TGANv2 were supported by only 15.42% and 5.6% respectively, demonstrating the effectiveness of CogVideo.

Summary

In this article, we described CogVideo, the largest and first open-source pre-trained transformer model for text-to-video generation.

CogVideo is the first model to successfully use a pre-trained text-to-image model for text-to-video generation without compromising its image generation capabilities. Its success in generating more natural videos than existing models shows a new direction for research in video generation.

However, there are still some issues to be solved, such as the large size of the model and the limited length of the input sequence due to GPU memory limitations, which are expected to be improved in future research.

The architecture of the model is presented here and samples of the generated videos can be found in this paper if you are interested.


田中侑李