
New Video Generation GAN With INRs Applied!

GAN (Generative Adversarial Network)

3 main points
✔️ Implicit Neural Representations (INRs) applied to video generation
✔️ Generates longer and higher-quality videos than existing video generation models
✔️ Exhibits other interesting properties, such as video interpolation and extrapolation and diverse motion sampling

Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks
written by Sihyun Yu, Jihoon Tack, Sangwoo Mo, Hyunsu Kim, Junho Kim, Jung-Woo Ha, Jinwoo Shin
(Submitted on 21 Feb 2022)
Comments: ICLR 2022

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

code:  

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

In recent years, starting with the paper NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, Implicit Neural Representations (INRs) have attracted attention as a way to solve long-standing problems in computer graphics and computer vision.

INRs, also called neural fields or coordinate-based neural networks, are neural networks whose inputs are coordinates and whose outputs are vectors (e.g., color values); a minimal code sketch of such a network is shown after the list below. Some of the advantages of using INRs in this way are:

  • Realization of continuous and differentiable models
  • Hardware acceleration
  • Data structures that do not strongly depend on the dimension of the input
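To make the definition concrete, the following is a minimal PyTorch sketch of an INR for a single 2D image, written purely for illustration (it is not the paper's code, and all names are our own): a small MLP that maps an (x, y) coordinate to an RGB value.

```python
import torch
import torch.nn as nn

class ImageINR(nn.Module):
    """Toy coordinate-based representation of a single image:
    maps a 2D coordinate (x, y) in [-1, 1]^2 to an RGB value."""
    def __init__(self, hidden=256, layers=4):
        super().__init__()
        dims = [2] + [hidden] * layers + [3]
        blocks = []
        for i in range(len(dims) - 1):
            blocks.append(nn.Linear(dims[i], dims[i + 1]))
            if i < len(dims) - 2:
                blocks.append(nn.ReLU())  # SIREN-style sine activations are also common
        self.net = nn.Sequential(*blocks)

    def forward(self, coords):                   # coords: (N, 2)
        return torch.sigmoid(self.net(coords))   # RGB values in [0, 1]

# Because the representation is continuous in its inputs, it can be queried on
# any grid, e.g. to render the image at an arbitrary resolution.
ys, xs = torch.meshgrid(torch.linspace(-1, 1, 64),
                        torch.linspace(-1, 1, 64), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
rgb = ImageINR()(coords).reshape(64, 64, 3)
```

NeRF follows the same pattern, with 3D coordinates and viewing directions as inputs and color and density as outputs.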

The ability to easily handle high-dimensional inputs is a decisive difference from conventional methods, especially in computer graphics applications such as NeRF. (Conventional graphics methods have to rely on discrete representations such as three-dimensional grids combined with spherical functions.)

The use of INRs for generative modeling has been gradually increasing. In particular, INR-GAN, which applied INRs to image generation last year and produced images at higher resolutions than existing methods, attracted a lot of attention.

The DIGAN (dynamics-aware implicit generative adversarial network) presented in this paper is a new INR-based GAN for video generation built on INR-GAN. It not only generates longer, higher-quality videos than existing video generation models, but also exhibits various interesting properties such as video interpolation, extrapolation, and diverse motion sampling.

DIGAN Overview

The model overview of DIGAN (dynamics-aware implicit generative adversarial network) is shown in the figure below.

In this model, the generator produces a video INR, i.e., a mapping from coordinates to the video, based on two kinds of features: content (what each frame of the video looks like) and motion (how objects in the video move).

Also, by randomly sampling the motion vector conditioned on the content vector, DIGAN can generate a variety of videos that share the same initial frame.
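To make this concrete, here is a rough PyTorch sketch of the generator's interface (our own simplification, with hypothetical names): a video INR that maps an (x, y, t) coordinate, together with a content latent and a motion latent, to an RGB value. In DIGAN itself the latents modulate the INR's weights following INR-GAN rather than being concatenated with the coordinates, so this only illustrates the input/output interface.

```python
import torch
import torch.nn as nn

class VideoINRGenerator(nn.Module):
    """Simplified video INR: maps (x, y, t) plus content/motion latents to RGB.
    The latents are simply concatenated with the coordinates here; this only
    illustrates the interface, not the actual DIGAN architecture."""
    def __init__(self, content_dim=128, motion_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + content_dim + motion_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, coords, z_content, z_motion):
        # coords: (N, 3) with columns (x, y, t); latents are broadcast to all coordinates
        z = torch.cat([z_content, z_motion], dim=-1).expand(coords.shape[0], -1)
        return self.net(torch.cat([coords, z], dim=-1))

G = VideoINRGenerator()
z_c = torch.randn(1, 128)   # content: what the scene looks like
z_m = torch.randn(1, 128)   # motion: how the scene moves
ys, xs = torch.meshgrid(torch.linspace(-1, 1, 32),
                        torch.linspace(-1, 1, 32), indexing="ij")
xy = torch.stack([xs, ys], dim=-1).reshape(-1, 2)

def frame_at(t):
    """Render the 32x32 frame at time t by querying the INR at every pixel."""
    coords = torch.cat([xy, torch.full((xy.shape[0], 1), float(t))], dim=-1)
    return G(coords, z_c, z_m).reshape(32, 32, 3)

video = torch.stack([frame_at(t) for t in torch.linspace(0, 1, 16)])  # (16, 32, 32, 3)
```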

Two types of discriminators are used: an image discriminator (D_I) and a motion discriminator (D_M). Given two frames rendered by the generator from the coordinate grid (2D grid) at two time values, the image discriminator judges whether each frame looks natural, while the motion discriminator judges, from the pair of frames and their time difference, whether the motion of objects between them is natural.

In previous studies, computationally expensive 3D convolutional neural networks (3D CNNs) that process the entire video at once were used as the discriminators of video generation GANs. DIGAN instead uses only 2D convolutional networks, which significantly reduces the computational cost.
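The two discriminators can likewise be sketched as plain 2D convolutional networks (again a simplification with hypothetical names; the actual networks are deeper): D_I scores a single frame, while D_M scores a pair of frames. One simple way to incorporate the time difference, used here for illustration, is to append it as an extra input channel.

```python
import torch
import torch.nn as nn

def conv_backbone(in_ch):
    # A tiny 2D-conv feature extractor; DIGAN's actual discriminators are deeper.
    return nn.Sequential(
        nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )

class ImageDiscriminator(nn.Module):
    """D_I: judges whether a single frame looks real."""
    def __init__(self):
        super().__init__()
        self.features = conv_backbone(3)
        self.head = nn.Linear(128, 1)

    def forward(self, frame):                    # frame: (B, 3, H, W)
        return self.head(self.features(frame))

class MotionDiscriminator(nn.Module):
    """D_M: judges whether the motion between two frames, given their time gap,
    looks natural. The time difference is appended as an extra input channel."""
    def __init__(self):
        super().__init__()
        self.features = conv_backbone(3 + 3 + 1)
        self.head = nn.Linear(128, 1)

    def forward(self, frame_a, frame_b, dt):     # dt: (B,) time difference
        B, _, H, W = frame_a.shape
        dt_map = dt.view(B, 1, 1, 1).expand(B, 1, H, W)
        x = torch.cat([frame_a, frame_b, dt_map], dim=1)
        return self.head(self.features(x))

# Only 2D convolutions are involved, so the cost stays close to an image GAN's.
d_i, d_m = ImageDiscriminator(), MotionDiscriminator()
f0, f1 = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
scores = d_i(f0), d_m(f0, f1, torch.tensor([0.1, 0.3]))
```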

Comparison and validation with existing video generation models

In this paper, comparative experiments were conducted under the following conditions.

  • UCF-101, Tai-Chi-HD, Sky Time-lapse, and Kinetics-600 (food classes only) datasets
  • Evaluated with Inception score (IS), Fréchet video distance (FVD), and Kernel video distance (KVD), following previous studies
  • All models are trained on 16-frame videos with a resolution of 128x128 unless otherwise specified
  • Existing video generation models VGAN, TGAN, MoCoGAN, ProgressiveVGAN, VideoGPT, TGANv2, DVD-GAN, and MoCoGAN-HD were compared with DIGAN (values were taken from the literature)

Below are sample videos generated by DIGAN trained on the UCF-101 and Kinetics-food datasets.

As you can see, DIGAN generates very high-quality videos. In addition, the following table compares DIGAN with existing video generation models on the evaluation metrics.

From this table, we can see that DIGAN significantly outperforms existing video generation models on all datasets. These results demonstrate the advantages of using INRs for video generation.

In addition, these experiments revealed interesting properties of DIGAN that are not found in existing video generation models:

  1. Smooth video interpolation and extrapolation
  2. Non-autoregressive generation
  3. Various motion sampling

We'll look at them one at a time.

1. Smooth video interpolation and extrapolation

DIGAN can easily interpolate (fill in intermediate frames) or extrapolate (generate frames beyond the original time range) videos simply by controlling the input coordinates of the generator. Furthermore, because INRs model the video continuously, videos interpolated or extrapolated by DIGAN look much more natural than those produced by discrete generative models.
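Concretely, interpolation and extrapolation reduce to querying the learned video INR at time coordinates that were never seen during training. The sketch below uses a hypothetical stand-in function for the trained INR, just to show the mechanism:

```python
import torch

# Stand-in for a trained video INR: any function mapping (x, y, t) -> RGB.
W = torch.randn(3, 3)
video_inr = lambda coords: torch.sigmoid(coords @ W)

def render_frame(t, res=64):
    """Render one frame by querying the INR at every pixel coordinate for time t."""
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, res),
                            torch.linspace(-1, 1, res), indexing="ij")
    coords = torch.stack([xs, ys, torch.full_like(xs, t)], dim=-1).reshape(-1, 3)
    return video_inr(coords).reshape(res, res, 3)

# Suppose training covered 16 frames at t = 0, 1/15, ..., 1.
interpolated = render_frame(0.5 / 15)   # a time between two training frames
extrapolated = render_frame(1.25)       # a time beyond the training range
```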

The figure below shows extrapolation results of MoCoGAN-HD and DIGAN on the Sky Time-lapse dataset.

The upper row is generated by MoCoGAN-HD and the lower row by DIGAN; the yellow frame marks the extrapolated part. You can see that MoCoGAN-HD fails to extrapolate the video and produces blurry results, whereas DIGAN produces a clear video.

2. Non-autoregressive generation

Unlike existing video generation models, which autoregressively sample the next frame conditioned on the previous frame, DIGAN can generate frames at arbitrary times by controlling the input coordinates. This allows DIGAN to predict past (or intermediate) frames from future frames, or to compute the entire video at once in parallel.
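In code, this means the whole clip can be produced by a single batched query over the full (x, y, t) grid rather than a frame-by-frame loop, again shown with a hypothetical stand-in INR:

```python
import torch

# Stand-in for a trained video INR (any function mapping (x, y, t) -> RGB).
W = torch.randn(3, 3)
video_inr = lambda coords: torch.sigmoid(coords @ W)

T, H, W_px = 16, 64, 64
ts, ys, xs = torch.meshgrid(torch.linspace(0, 1, T),
                            torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W_px), indexing="ij")
coords = torch.stack([xs, ys, ts], dim=-1).reshape(-1, 3)  # every pixel of every frame
video = video_inr(coords).reshape(T, H, W_px, 3)           # one parallel query, no frame loop
```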

The figure below shows the prediction results for the past and future frames of DIGAN on the TaiChi dataset.

In this experiment, conditioned on the frames at t = {6, 7, 8} (indicated by the yellow box), the frames at t = {3, ..., 11} are predicted. We can confirm that DIGAN can predict both past and future frames, such as the person slowly sitting up.

3. Various motion sampling

As explained in the DIGAN overview, DIGAN can perform diverse motion sampling from the same initial frame by controlling the motion vector.
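In terms of the hypothetical generator interface sketched earlier (again our own stand-in, not the paper's code), diverse motion sampling simply means fixing the content latent and redrawing the motion latent:

```python
import torch

# Stand-in for a trained DIGAN-style generator: (coords, z_content, z_motion) -> RGB.
W = torch.randn(3 + 128 + 128, 3)
def generator(coords, z_content, z_motion):
    z = torch.cat([z_content, z_motion], dim=-1).expand(coords.shape[0], -1)
    return torch.sigmoid(torch.cat([coords, z], dim=-1) @ W)

# One (x, y, t) coordinate grid for a 16-frame, 64x64 video.
ts, ys, xs = torch.meshgrid(torch.linspace(0, 1, 16),
                            torch.linspace(-1, 1, 64),
                            torch.linspace(-1, 1, 64), indexing="ij")
coords = torch.stack([xs, ys, ts], dim=-1).reshape(-1, 3)

z_content = torch.randn(1, 128)     # fixed: what the scene looks like
videos = []
for _ in range(4):                  # four different motions for the same content
    z_motion = torch.randn(1, 128)
    videos.append(generator(coords, z_content, z_motion).reshape(16, 64, 64, 3))
```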

The following figure shows videos generated from two random motion vectors on the Sky Time-lapse dataset.

It is noteworthy that the drifting clouds move differently in the two videos, while the tree in the lower left stays in place. This shows that diverse motions can be sampled while the main content of the scene is preserved.

How much the sampled motion can vary when conditioned on the initial frame also depends on the dataset.

Summary

In this article, we introduced DIGAN (dynamics-aware implicit generative adversarial network), a new attempt to use implicit neural representations (INRs) for video generation.

Given the results achieved by DIGAN, we expect efforts to solve the problems of existing methods with INRs to gain further momentum, and we look forward to seeing what kinds of models emerge in this research field. Details of the DIGAN architecture and the generated videos introduced in this article can be found in the paper, so please refer to it if you are interested.
