At Last, ViT Has Come To The Field Of Video Recognition!


3 main points
✔️ The authors propose the first video classification model built purely on Transformers, extending ViT to video.
✔️ To improve computational efficiency, they propose four different architectures and conduct detailed ablation experiments.
✔️ ViViT achieves SOTA on five benchmarks.

ViViT: A Video Vision Transformer
written by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid
(Submitted on 29 Mar 2021 (v1), last revised 1 Nov 2021 (this version, v2))
ICCV 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)



The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

The Transformer has achieved SOTA performance on many natural language processing tasks. With ViT (Vision Transformer) [Vision Transformer, a new image recognition model to the next stage], it has also expanded into image processing and is now a superstar of the machine learning field. On the other hand, because the algorithm computes the relationship between every pair of elements in the input sequence, its computational complexity is large. Much research has gone into solving this computational problem, and we expect the Transformer to play an active role in even more fields in the future.

In this article, we introduce ViViT (A Video Vision Transformer), which extends the Transformer to video classification. Unlike previous work on video classification that combines a Transformer with a 3D CNN, ViViT uses only a Transformer. Because the computational complexity problem mentioned above becomes even more pronounced for video, the authors propose four different models designed to reduce the computational cost. They also conduct extensive ablation experiments to analyse ViViT and show that it achieves SOTA on five well-known benchmarks.

Video Vision Transformers

In this chapter, we first give an overview of the Vision Transformer (ViT), then describe how to extract tokens from video, and finally explain the proposed models for video classification. Figure 1 shows an overview of the proposed architecture.

Overview of Vision Transformers (ViT)

ViT adapts the Transformer architecture to 2D images with minimal changes. It divides a 2D image into N patches x_i and applies a linear projection E to each patch to obtain a 1D token z_i. Prepending the classification token z_cls to the N tokens z_i and adding the positional embedding p, which represents location information, gives the input sequence z of the Transformer Encoder (Equation 1).
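As a rough illustration of how the input sequence is assembled, the NumPy sketch below cuts an image into patches, projects them, prepends the class token, and adds the positional embedding. The function name, the image and patch sizes, the embedding width d, and the random weights are all assumptions made for illustration; they are not values from the paper.

```python
import numpy as np

def vit_input_sequence(image, patch, E, z_cls, p):
    """Build z = [z_cls, E x_1, ..., E x_N] + p for one image (illustrative)."""
    H, W, C = image.shape
    nh, nw = H // patch, W // patch
    # Cut the image into nh * nw non-overlapping patches and flatten each one.
    x = image[:nh * patch, :nw * patch].reshape(nh, patch, nw, patch, C)
    x = x.transpose(0, 2, 1, 3, 4).reshape(nh * nw, patch * patch * C)
    tokens = x @ E                                        # (N, d) linear projection
    z = np.concatenate([z_cls[None, :], tokens], axis=0)  # prepend the class token
    return z + p                                          # add positional embedding

rng = np.random.default_rng(0)
H = W = 32          # hypothetical image size
patch, C, d = 16, 3, 8
N = (H // patch) * (W // patch)
image = rng.normal(size=(H, W, C))
E = rng.normal(size=(patch * patch * C, d))
z_cls = rng.normal(size=d)
p = rng.normal(size=(N + 1, d))
z = vit_input_sequence(image, patch, E, z_cls, p)
print(z.shape)  # (5, 8)
```

The sequence has N + 1 rows: one class token followed by one token per patch.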

The sequence z is fed into a Transformer of L layers. Each layer consists of Multi-Head Self-Attention (MSA), Layer Normalisation (LN), and an MLP block, as shown in Equations (2) and (3).
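The structure of one such layer can be sketched in NumPy as follows. This assumes the pre-normalisation form used by ViT, y = MSA(LN(z)) + z followed by MLP(LN(y)) + y. It is simplified: the learned LN scale and bias and all bias terms are omitted, and the parameter layout (`params`) is a hypothetical one chosen for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def msa(z, Wq, Wk, Wv, Wo, heads):
    n, d = z.shape
    dh = d // heads
    # Project and split into heads: each array has shape (heads, n, dh).
    q, k, v = [(z @ W).reshape(n, heads, dh).transpose(1, 0, 2) for W in (Wq, Wk, Wv)]
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))   # (heads, n, n)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)        # merge heads
    return out @ Wo

def encoder_layer(z, params, heads):
    y = msa(layer_norm(z), *params["attn"], heads) + z       # attention + residual
    W1, W2 = params["mlp"]
    return gelu(layer_norm(y) @ W1) @ W2 + y                 # MLP + residual

rng = np.random.default_rng(0)
n, d, heads = 5, 8, 2   # illustrative sizes
params = {
    "attn": [rng.normal(size=(d, d)) * 0.1 for _ in range(4)],
    "mlp": [rng.normal(size=(d, 4 * d)) * 0.1, rng.normal(size=(4 * d, d)) * 0.1],
}
z = rng.normal(size=(n, d))
out = encoder_layer(z, params, heads)
print(out.shape)  # (5, 8)
```

Because both sub-blocks are residual, the layer keeps the sequence shape, so L such layers can be stacked directly.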

The classification token z_cls output by the final layer is passed to a classifier suited to the task.

Embedding video clips

Here we show two ways to obtain the patches x that make up the input tokens z of the Transformer Encoder from a video, which has one more dimension (time) than a 2D image.

Uniform frame sampling

The simplest method, uniform frame sampling, is shown in Figure 2. Each sampled frame is divided into patches x as in ViT, and the patches from all frames are arranged in temporal order.

Tubelet embedding

Figure 3 shows tubelet embedding, which extracts tubelets: non-overlapping patches that extend along both the spatial and temporal axes of the video.
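To make the tubelet idea concrete, here is a minimal NumPy sketch that cuts a video into non-overlapping t x h x w tubelets and projects each one linearly to a token. The function name, the clip size, the tubelet size, and the random projection E are assumptions for illustration only.

```python
import numpy as np

def tubelet_tokens(video, t, h, w, E):
    """Split a (T, H, W, C) video into t x h x w tubelets and project each one."""
    T, H, W, C = video.shape
    nt, nh, nw = T // t, H // h, W // w
    v = video[:nt * t, :nh * h, :nw * w]
    # Group the axes so that each tubelet becomes one contiguous block.
    v = v.reshape(nt, t, nh, h, nw, w, C).transpose(0, 2, 4, 1, 3, 5, 6)
    x = v.reshape(nt * nh * nw, t * h * w * C)   # one flattened tubelet per row
    return x @ E                                 # (nt * nh * nw, d)

rng = np.random.default_rng(0)
T, H, W, C = 8, 32, 32, 3      # hypothetical clip size
t, h, w, d = 2, 16, 16, 8      # hypothetical tubelet size and embedding width
video = rng.normal(size=(T, H, W, C))
E = rng.normal(size=(t * h * w * C, d))
z = tubelet_tokens(video, t, h, w, E)
print(z.shape)  # (16, 8)
```

With t = 1 the same function reduces to cutting each frame into 2D patches, which is the patch extraction used in uniform frame sampling.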

Transformer Models for Video

In this section, we describe in turn the four architectures proposed in the paper.

We start with the simplest one, which feeds a sequence of tokens capturing spatio-temporal information directly into the Transformer.

Next, we introduce three designs that handle temporal information and spatial information separately. Each one factorises a different part of the model: the Encoder, the Self-Attention, and the dot-product calculation of Attention.

Model 1: Spatio-temporal attention

Model 1 extracts tokens z containing spatio-temporal information from a video and inputs them to the Transformer Encoder, which exchanges information through pairwise interactions between all tokens in the MSA layer. However, because every token attends to every other token, the computational cost grows quadratically with the number of tokens, which is large for video.
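A quick back-of-the-envelope calculation shows how fast this grows. The helper below is hypothetical and the clip and tubelet sizes are illustrative assumptions, not the paper's configuration; it only counts token pairs in one attention map.

```python
def joint_attention_pairs(T, H, W, t, h, w):
    """Token count and pairwise interactions when all tokens attend to all tokens."""
    n_t, n_h, n_w = T // t, H // h, W // w
    n_tokens = n_t * n_h * n_w
    return n_tokens, n_tokens ** 2

# Hypothetical clip: 32 frames of 224 x 224 pixels, tubelets of 2 x 16 x 16.
tokens, pairs = joint_attention_pairs(32, 224, 224, 2, 16, 16)
print(tokens, pairs)  # 3136 9834496
```

Doubling the number of frames doubles the token count and therefore quadruples the number of pairwise interactions.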

Model 2: Factorised encoder

Model 2, shown in Figure 4, has a Factorised Encoder, i.e., two Encoders with different roles.

The first is an encoder that captures spatial information (Spatial Transformer Encoder). It only lets tokens extracted from the same temporal index interact with each other. The representation of each temporal index is the classification token z_cls after the L-layer Transformer if one was prepended to the input, or otherwise the average of the output tokens. The representations of all temporal indices are concatenated and input to the second encoder, the Temporal Transformer Encoder. The classification token z_cls output by the temporal encoder is used for the final classifier.
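The data flow through the two encoders can be sketched as below. This is only a shape-level illustration: `toy_encoder` is a hypothetical, parameter-free stand-in for a real multi-layer Transformer encoder, so the spatial and temporal encoders here share the same function, whereas in a real model they would have separate weights.

```python
import numpy as np

def toy_encoder(z):
    """Stand-in for an L-layer Transformer encoder: one residual attention step."""
    scores = z @ z.T / np.sqrt(z.shape[-1])
    a = np.exp(scores - scores.max(-1, keepdims=True))
    a /= a.sum(-1, keepdims=True)
    return z + a @ z

def factorised_encoder(tokens, cls_s, cls_t, pool="cls"):
    """tokens: (n_t, n_s, d), the spatial tokens grouped by temporal index."""
    frame_reprs = []
    for frame in tokens:                       # spatial encoder, one index at a time
        if pool == "cls":
            # Prepend a class token and read it back out as the representation.
            h_i = toy_encoder(np.vstack([cls_s, frame]))[0]
        else:
            # No class token: average the output tokens instead.
            h_i = toy_encoder(frame).mean(axis=0)
        frame_reprs.append(h_i)
    h = np.stack(frame_reprs)                  # (n_t, d)
    # Temporal encoder over the per-index representations.
    return toy_encoder(np.vstack([cls_t, h]))[0]

rng = np.random.default_rng(0)
n_t, n_s, d = 4, 6, 8   # illustrative sizes
tokens = rng.normal(size=(n_t, n_s, d))
cls_s, cls_t = rng.normal(size=d), rng.normal(size=d)
video_repr = factorised_encoder(tokens, cls_s, cls_t)
print(video_repr.shape)  # (8,)
```

The returned vector plays the role of the classification token that is passed to the final classifier.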

Model 2 uses more Transformer layers than Model 1, but because it processes spatial and temporal information separately, it requires less computation (FLOPs).
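The saving can be illustrated by counting token pairs in the attention maps. This is a simplified count under stated assumptions: it covers a single attention layer of each kind, ignores class tokens and the MLP blocks, and uses hypothetical sizes (16 temporal indices, 196 spatial tokens each), so it is not the FLOPs figure reported in the paper.

```python
def joint_pairs(n_t, n_s):
    # One attention map over all n_t * n_s tokens.
    return (n_t * n_s) ** 2

def factorised_encoder_pairs(n_t, n_s):
    # n_t spatial maps of size n_s x n_s, then one temporal map of size n_t x n_t.
    return n_t * n_s ** 2 + n_t ** 2

n_t, n_s = 16, 196
joint = joint_pairs(n_t, n_s)
factorised = factorised_encoder_pairs(n_t, n_s)
print(joint)       # 9834496
print(factorised)  # 614912
```

Under these assumptions the factorised count is roughly 16 times smaller, which is why the extra layers of Model 2 can still cost less overall.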

Model 3: Factorised self-attention

Model 3 applies the separation of spatial and temporal information, which Model 2 performs at the Encoder level, inside the Self-Attention mechanism. Specifically, whereas Model 1 computes Self-Attention over all spatio-temporal tokens at once, Model 3 first computes Attention only among tokens from the same temporal index (spatial attention), and then only among tokens from the same spatial index (temporal attention). This is formulated in Equations (4), (5), and (6).

The architecture of Model 3 is shown in Figure 5, and its computational complexity is comparable to that of Model 2.
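The two-step attention of Model 3 amounts to attending along one axis of the token grid at a time. The NumPy sketch below assumes tokens arranged as (n_t, n_s, d) and uses a parameter-free single-head attention for brevity; the learned projections, Layer Normalisation, and the MLP block of a real layer are omitted, and the function names are hypothetical.

```python
import numpy as np

def self_attention(z):
    """Parameter-free single-head attention over the second-to-last axis."""
    scores = z @ np.swapaxes(z, -1, -2) / np.sqrt(z.shape[-1])
    a = np.exp(scores - scores.max(-1, keepdims=True))
    a /= a.sum(-1, keepdims=True)
    return a @ z

def factorised_self_attention(z):
    """z: (n_t, n_s, d). Spatial attention first, then temporal attention."""
    # Spatial step: tokens sharing a temporal index attend to each other.
    y_s = z + self_attention(z)                 # batched over n_t
    # Temporal step: tokens sharing a spatial index attend to each other.
    y_t = np.swapaxes(y_s, 0, 1)                # (n_s, n_t, d)
    y_t = y_t + self_attention(y_t)             # batched over n_s
    return np.swapaxes(y_t, 0, 1)               # back to (n_t, n_s, d)

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 6, 8))   # illustrative: n_t = 4, n_s = 6, d = 8
out = factorised_self_attention(z)
print(out.shape)  # (4, 6, 8)
```

Each step builds attention maps of size n_s x n_s or n_t x n_t only, never the full (n_t * n_s) x (n_t * n_s) map of Model 1.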

Model 4: Factorised dot-product attention

The last model, Model 4, matches the computational cost of Models 2 and 3 by factorising the spatio-temporal computation inside the dot product of Attention.

As shown in Figure 6, half of the attention heads compute the dot product over tokens from the same temporal index (spatial attention), and the other half compute it over tokens from the same spatial index (temporal attention).
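A minimal NumPy sketch of this head split is given below. It assumes tokens arranged as (n_t, n_s, d) and, for brevity, uses identity projections so that each head simply sees its own slice of channels; the learned query, key, value, and output projections of a real layer are omitted, and the function names are hypothetical.

```python
import numpy as np

def attend(q, k, v):
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    a = np.exp(scores - scores.max(-1, keepdims=True))
    a /= a.sum(-1, keepdims=True)
    return a @ v

def factorised_dot_product_attention(z, heads=4):
    """z: (n_t, n_s, d). Half the heads attend over space, half over time."""
    n_t, n_s, d = z.shape
    dh = d // heads
    half = heads // 2
    x = z.reshape(n_t, n_s, heads, dh)            # split channels into heads
    # Spatial heads: keys and values come from the same temporal index.
    xs = x[:, :, :half].transpose(0, 2, 1, 3)     # (n_t, half, n_s, dh)
    ys = attend(xs, xs, xs).transpose(0, 2, 1, 3)
    # Temporal heads: keys and values come from the same spatial index.
    xt = x[:, :, half:].transpose(1, 2, 0, 3)     # (n_s, heads - half, n_t, dh)
    yt = attend(xt, xt, xt).transpose(2, 0, 1, 3)
    # Concatenate the head outputs back into d channels per token.
    return np.concatenate([ys, yt], axis=2).reshape(n_t, n_s, d)

rng = np.random.default_rng(0)
z = rng.normal(size=(3, 4, 8))   # illustrative: n_t = 3, n_s = 4, d = 8
out = factorised_dot_product_attention(z, heads=4)
print(out.shape)  # (3, 4, 8)
```

Unlike Model 3, the spatial and temporal attention here run in parallel within one layer, and their outputs are concatenated rather than applied one after the other.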


Experiments

In this section, we describe the ablation experiments and the evaluation on five datasets. For details of the experiments and implementation, we refer the reader to the original paper.

Ablation experiments

First, we compare the ways of sampling the patches x from the video presented in "Embedding video clips". The results in Table 1 show that tubelet embedding with central frame initialisation has the best accuracy, so central frame initialisation is used in the following experiments.

Central frame is one way of initialising the 3D filter used in tubelet embedding from a 2D filter. Filter inflation (Equation 8) uses the 2D filter values at every temporal position of the 3D filter, whereas central frame (Equation 9) places the 2D filter values only at the central temporal position and sets the other positions to zero.
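The two initialisation schemes can be sketched in NumPy as follows. This assumes the usual inflation convention in which the replicated 2D filter is divided by t, so that a static clip produces the same response as the 2D filter; the function names, filter shape, and random values are illustrative assumptions.

```python
import numpy as np

def filter_inflation(E_image, t):
    """Replicate the 2D filter at all t temporal positions and average."""
    return np.repeat(E_image[None], t, axis=0) / t

def central_frame(E_image, t):
    """Zeros everywhere except the central temporal position."""
    E = np.zeros((t,) + E_image.shape, dtype=E_image.dtype)
    E[t // 2] = E_image
    return E

rng = np.random.default_rng(0)
t, h, w, C, d = 3, 4, 4, 3, 8            # hypothetical tubelet and embedding sizes
E_image = rng.normal(size=(h, w, C, d))  # stands in for a pretrained 2D filter
frame = rng.normal(size=(h, w, C))
clip = rng.normal(size=(t, h, w, C))

E_inflated = filter_inflation(E_image, t)
E_central = central_frame(E_image, t)
print(E_inflated.shape, E_central.shape)  # (3, 4, 4, 3, 8) (3, 4, 4, 3, 8)
```

With central frame initialisation, the 3D filter initially responds only to the middle frame of each tubelet, so at the start of training it behaves like the 2D filter applied to that frame.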

Next, Table 2 shows the results of the four models on the Kinetics 400 (K400) and Epic Kitchens (EK) datasets.

The results show that Model 1 has the best accuracy on K400, but Model 2 is superior on EK. This suggests that Model 1 overfits on the smaller dataset, EK.

The authors also add a baseline that removes the Transformer layers capturing temporal information from Model 2 and uses the frame-level representations directly for classification. Accuracy drops with this change, especially on EK, which shows the important role of temporal information.

Furthermore, the authors apply regularisation and data augmentation methods commonly used for CNN architectures to ViViT. Table 4 shows the results of adding these methods progressively.

Finally, Figure 8 shows that the smaller the tubelet size used in tubelet embedding, the higher the accuracy.

Comparison with SOTA

Based on the above ablation experiments, the comparison with prior work uses Model 2, the Factorised Encoder (FE). Table 6 shows that it achieves SOTA on five datasets.

Table 6 (a) and (b) show the results for Kinetics 400 and 600. ViViT achieves better accuracy with far fewer views than existing CNN-based methods. We can also see that accuracy improves further when the pretrained ViT model was trained on JFT.

Table 6 (c) shows that ViViT achieves SOTA on the Moments in Time dataset. Accuracy on this dataset is lower than on the others because it contains more noise.

As shown in Table 6 (d), ViViT significantly outperforms existing methods on the Epic Kitchens 100 dataset. Notably, it only needs two prediction heads for the different types of predictions, such as Verb and Noun, and does not need a dedicated loss function for each.

Finally, Table 6 (e) shows the results for Something-Something v2. Here too, ViViT substantially outperforms TimeSformer-HR, another pure Transformer-based method. Since TimeSformer-HR uses neither an encoder that factorises spatio-temporal information nor the regularisation shown in Table 4, the result demonstrates the effectiveness of the proposed method.


Summary

In this article, we introduced ViViT, a video classification method that uses only the Transformer structure, together with the four models it proposes for handling temporal information in video with ViT. Through detailed ablation experiments, the authors selected the Factorised Encoder model, which is well balanced in terms of accuracy and computational complexity, and compared it with previous studies.

By achieving SOTA on all five datasets, ViViT shows that even a pure Transformer structure can efficiently handle spatio-temporal information in video. We hope that this work will draw as much attention to video research as ViT did to image processing.
