TimeSformer: Transformer That Captures Moving Images Beyond 3DCNN

Image Recognition 14/11/2022

3 main points
✔️ Proposes four Spatio-temporal Self-Attention schemes for video.
✔️ Faster training and more efficient testing than 3DCNN models.
✔️ 3DCNN models can process only a few seconds of video, whereas TimeSformer can be applied to videos several minutes long.

Is Space-Time Attention All You Need for Video Understanding?
written by Gedas Bertasius, Heng Wang, Lorenzo Torresani
(Submitted on 9 Feb 2021 (v1), last revised 9 Jun 2021 (this version, v4))
Comments: Accepted to ICML 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

The Transformer, which has come to dominate natural language processing by achieving SOTA results on a wide range of tasks, has also been applied to image recognition and speech recognition. In this article, we describe TimeSformer, a Transformer model for video recognition presented at ICML 2021.

There are three motivations for applying the Transformer model to video recognition.

(1) Until now, 3DCNN-based models such as 3D ResNet, I3D, and SlowFast have served as baselines in video recognition. However, convolutions capture only local relationships, in the temporal dimension as well as the spatial one, so these models can only take a few seconds of video as input. Applying the Transformer to video recognition is expected to capture global relationships.
(2) A 3DCNN can achieve high performance on small datasets thanks to its strong inductive bias, but that same bias can overly limit the expressive power of the model when sufficient data is available. With large datasets, a model that is less constrained by inductive bias can represent a wider variety of functions and further improve accuracy.
(3) Recent work on still images has shown that Transformers can be faster than CNNs in training and inference.

Based on these motivations, the paper proposes TimeSformer, which applies the Transformer to video recognition by extending Self-Attention from space (images) to space-time (videos).

TimeSformer

Input

A video clip of F frames, each of size H×W×3, is divided into N = HW/P² non-overlapping patches of size P×P per frame. Each patch is flattened and linearly transformed, and a position embedding is added to it to form the input.

Here p = 1, ..., N indexes the patches and t = 1, ..., F indexes the frames. The position embedding is a trainable parameter.
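As a rough illustration of the input step, the following NumPy sketch splits each frame into P×P patches, flattens them, applies a linear transformation, and adds a position embedding. The function name `embed_patches`, the embedding size `D`, and the random weights are assumptions made for this example. The extra classification token used in the paper is left out for brevity.

```python
import numpy as np

def embed_patches(video, P, E, pos):
    """video: (F, H, W, 3); E: (D, 3*P*P); pos: (F, N, D). Returns (F, N, D)."""
    F, H, W, C = video.shape
    assert H % P == 0 and W % P == 0, "frame size must be divisible by the patch size"
    n_h, n_w = H // P, W // P
    # Split each frame into an n_h x n_w grid of P x P patches.
    x = video.reshape(F, n_h, P, n_w, P, C)
    x = x.transpose(0, 1, 3, 2, 4, 5)           # (F, n_h, n_w, P, P, C)
    # Flatten each patch into a vector of length 3*P*P.
    x = x.reshape(F, n_h * n_w, P * P * C)      # (F, N, 3*P*P)
    # Linear transformation plus position embedding.
    return x @ E.T + pos

rng = np.random.default_rng(0)
F, H, W, P, D = 8, 224, 224, 16, 64             # D = 64 is an arbitrary choice
video = rng.standard_normal((F, H, W, 3))
E = rng.standard_normal((D, 3 * P * P)) * 0.02  # stand-in for the learned projection
pos = np.zeros((F, (H // P) * (W // P), D))     # stand-in for the learned position embedding
z0 = embed_patches(video, P, E, pos)
print(z0.shape)  # (8, 196, 64)
```

With 224×224 frames and P = 16, each frame yields N = 14 × 14 = 196 patches.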

Self-Attention

The query, key, and value vectors are each obtained by applying layer normalization followed by a linear transformation to the input. l and a index the layer and the attention head, respectively.

Following scaled dot-product attention, the attention weights α for a query patch (p, t) are obtained by applying a softmax, denoted SM(), to the scaled dot products between the query and the keys. To reduce the computational complexity, a single Attention computation can be restricted to either space or time. Spatial Attention is computed over the patches p' = 1, ..., N of the same frame, and temporal Attention is computed over the frames t' = 1, ..., F at the same patch position.
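For reference, the attention weights described here can be written out roughly as follows. This is a reconstruction from the definitions in this section, not a copy of the paper's typesetting. The symbol D_h for the per-head dimension is an assumption, and the classification-token key is omitted.

```latex
% Joint space-time attention: keys range over all patches and all frames
\alpha^{(l,a)}_{(p,t)} = \mathrm{SM}\!\left(
  \frac{{q^{(l,a)}_{(p,t)}}^{\top}}{\sqrt{D_h}} \cdot
  \left\{ k^{(l,a)}_{(p',t')} \right\}_{p'=1,\dots,N;\; t'=1,\dots,F}
\right)

% Spatial-only attention: keys restricted to the same frame t
\alpha^{(l,a)\,\mathrm{space}}_{(p,t)} = \mathrm{SM}\!\left(
  \frac{{q^{(l,a)}_{(p,t)}}^{\top}}{\sqrt{D_h}} \cdot
  \left\{ k^{(l,a)}_{(p',t)} \right\}_{p'=1,\dots,N}
\right)

% Temporal-only attention: keys restricted to the same patch position p
\alpha^{(l,a)\,\mathrm{time}}_{(p,t)} = \mathrm{SM}\!\left(
  \frac{{q^{(l,a)}_{(p,t)}}^{\top}}{\sqrt{D_h}} \cdot
  \left\{ k^{(l,a)}_{(p,t')} \right\}_{t'=1,\dots,F}
\right)
```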

Embedding

The encoding at each layer is obtained in the same way as in a conventional Transformer. First, s is obtained for each attention head as the attention-weighted sum of the value vectors (αv).

Next, the outputs s of all attention heads are concatenated and linearly transformed, and the output of the previous layer is added as a residual connection to give z'. Finally, z' is passed through layer normalization and an MLP, and z' is added to the result as a second residual connection.
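The two residual steps above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the names `block_output`, `W_o`, `W1`, and `W2` are hypothetical, the ReLU activation is a simplification chosen for brevity, and the learned scale and shift of layer normalization are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the feature axis (learned scale/shift omitted).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def block_output(z_prev, alphas, values, W_o, W1, W2):
    """z_prev: (T, D); alphas: (A, T, T); values: (A, T, Dh); W_o: (A*Dh, D)."""
    # s for each head: attention-weighted sum of the value vectors.
    s = alphas @ values                               # (A, T, Dh)
    # Concatenate the heads, project, and add the previous layer's output.
    concat = np.concatenate(list(s), axis=-1)         # (T, A*Dh)
    z_mid = concat @ W_o + z_prev                     # z'
    # LayerNorm -> MLP, then add z' back as the second residual.
    hidden = np.maximum(layer_norm(z_mid) @ W1, 0.0)  # ReLU is an assumption
    return hidden @ W2 + z_mid

rng = np.random.default_rng(0)
T, A, Dh, D = 4, 2, 3, 6
z_prev = rng.standard_normal((T, D))
alphas = np.full((A, T, T), 1.0 / T)                  # uniform weights, for illustration
values = rng.standard_normal((A, T, Dh))
W_o = rng.standard_normal((A * Dh, D))
W1 = rng.standard_normal((D, 4 * D))
W2 = rng.standard_normal((4 * D, D))
z = block_output(z_prev, alphas, values, W_o, W1, W2)
print(z.shape)  # (4, 6)
```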

Space-Time Self-Attention Models

As Equation (5) suggests, when Attention is computed over space or time alone, the model cannot fully account for both. In particular, spatial-only Attention, as in Equation (6), loses accuracy on tasks where temporal information matters more.

Therefore, in this paper, we propose an efficient architecture for Spatio-temporal Attention.

The baselines are Space Attention, which attends only spatially within each frame, and Joint Space-Time Attention, which attends over all patches across space and time.

Divided Space-Time Attention applies temporal Attention and then spatial Attention, in that order. In temporal Attention, each patch attends to the patches at the same spatial position in the other frames. In spatial Attention, each patch attends to all the patches in the same frame t. After temporal Attention, only the residual connection is applied, and the result is passed to spatial Attention rather than to the MLP. The output of spatial Attention is then passed to the MLP with a residual connection. Joint Space-Time Attention requires NF+1 comparisons per patch, whereas Divided Space-Time Attention requires only N+F+2, which makes it an efficient space-time decomposition.
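The order of operations in Divided Space-Time Attention can be sketched as below. This is a simplified single-head version under several assumptions: the classification token, layer normalization, the output projections, and the MLP are left out, and all function and weight names are hypothetical. The helper `comparisons_per_patch` just evaluates the two counts quoted above.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(x, Wq, Wk, Wv):
    """Single-head self-attention over the second-to-last axis of x (..., T, D)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def divided_space_time(z, time_w, space_w):
    """z: (F, N, D). Temporal attention, then spatial attention, each with a residual."""
    # Temporal: each patch position attends across the F frames.
    zt = np.swapaxes(z, 0, 1)                          # (N, F, D)
    z = z + np.swapaxes(attend(zt, *time_w), 0, 1)
    # Spatial: within each frame, every patch attends to the N patches of that frame.
    return z + attend(z, *space_w)

def comparisons_per_patch(N, F):
    return {"joint": N * F + 1, "divided": N + F + 2}

rng = np.random.default_rng(0)
F, N, D = 8, 196, 16
z = rng.standard_normal((F, N, D))
time_w = [rng.standard_normal((D, D)) * 0.1 for _ in range(3)]
space_w = [rng.standard_normal((D, D)) * 0.1 for _ in range(3)]
out = divided_space_time(z, time_w, space_w)
print(out.shape)                     # (8, 196, 16)
print(comparisons_per_patch(N, F))   # {'joint': 1569, 'divided': 206}
```

For 8 frames of 196 patches, the divided scheme needs 206 comparisons per patch instead of 1569.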

Next, we describe Sparse Local Global Attention and Axial Attention. For each patch (p, t), Sparse Local Global Attention first computes local Attention over the neighboring F × H/2 × W/2 patches, and then computes sparse global Attention over the entire clip using a stride of 2 patches along the temporal and spatial dimensions. This can be viewed as a fast approximation of full spatio-temporal Attention using a local-global decomposition and a sparsity pattern.

Axial Attention decomposes the Attention computation into three separate steps: over time (T), width (W), and height (H).
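A minimal sketch of this three-step decomposition is shown below, assuming single-head attention and a residual connection after each step. The residual placement, the weight shapes, and the function names are assumptions made for illustration only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend_along(z, axis, Wq, Wk, Wv):
    """Self-attention restricted to one axis of z, which has shape (T, H, W, D)."""
    x = np.moveaxis(z, axis, -2)                  # bring the chosen axis next to D
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    w = softmax(q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1]))
    return np.moveaxis(w @ v, -2, axis)           # restore the original layout

def axial_attention(z, weights):
    # weights: three (Wq, Wk, Wv) tuples, used for time (axis 0),
    # width (axis 2), and height (axis 1), in that order.
    for axis, w in zip((0, 2, 1), weights):
        z = z + attend_along(z, axis, *w)
    return z

rng = np.random.default_rng(0)
T, H, W, D = 4, 3, 5, 8
z = rng.standard_normal((T, H, W, D))
weights = [tuple(rng.standard_normal((D, D)) * 0.1 for _ in range(3)) for _ in range(3)]
out = axial_attention(z, weights)
print(out.shape)  # (4, 3, 5, 8)
```

Each patch is compared only with the patches that share two of its three coordinates, so the number of comparisons per step is T, W, or H rather than T×H×W.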

Experiments

Accuracy of various Self-Attention

The table below shows the number of parameters and the accuracy of the five Self-Attention schemes above on each action recognition dataset. On K400, spatial information matters more than temporal information for high accuracy, which is consistent with previous studies. For SSv2, on the other hand, temporal information is also important. Divided Space-Time achieves the highest accuracy on both datasets. In the following experiments, we compare the top two schemes, Divided Space-Time and Joint Space-Time.

Computational cost of Self-Attention

The table below shows how the computational cost grows with spatial resolution (left) and with video length (right). Joint Space-Time is practically inapplicable once the resolution reaches 448 pixels or the clip length reaches 32 frames, because it causes GPU memory overflow. Divided Space-Time, on the other hand, is computationally more efficient and can handle higher spatial resolutions and longer videos, despite having more parameters than Joint Space-Time.

Comparison with 3DCNN

We perform experiments aimed at understanding the distinctive properties of TimeSformer compared with 3D convolutional architectures, the main approach to video understanding in recent years. The comparison focuses on two 3DCNN models: SlowFast, the state of the art in video classification, and I3D, which can make use of image-based pre-training.

Table 2 shows that TimeSformer has a large learning capacity (121.4M parameters) but a low inference cost (0.59 TFLOPs). In contrast, SlowFast has a high inference cost of 1.97 TFLOPs despite having only 34.6M parameters. Similarly, I3D has a higher inference cost (1.11 TFLOPs) despite having fewer parameters (28.0M). This suggests that TimeSformer is better suited to large-scale learning settings, whereas the high computational cost of modern 3DCNNs makes it difficult to increase model capacity further while maintaining efficiency.

One major advantage of ImageNet pre-training is that it enables very efficient training of TimeSformer on video data. In contrast, state-of-the-art 3DCNNs remain considerably more expensive to train even when pre-trained on image data. Table 2 compares TimeSformer's video training time on Kinetics-400 (in Tesla V100 GPU hours) with that of SlowFast and I3D. Starting from a ResNet50 pre-trained on ImageNet-1K, SlowFast reaches 75.6% accuracy on Kinetics-400, and I3D trained with a similar setup requires 1,440 Tesla V100 GPU hours to reach 73.4% accuracy. TimeSformer, also pre-trained on ImageNet-1K, achieves a higher accuracy of 75.8% using only 416 Tesla V100 GPU hours. Furthermore, when SlowFast is constrained to a computational budget roughly similar to TimeSformer's (448 GPU hours), its accuracy drops to 70.0%. Similarly, when I3D is trained with a similar budget (444 GPU hours), its accuracy drops to 71.0%. This highlights that some state-of-the-art 3DCNNs require very long optimization schedules to achieve good performance, even with ImageNet pre-training.

On the other hand, TimeSformer is difficult to train from scratch because of its large number of parameters. Therefore, before training TimeSformer on video data, we initialize it with weights learned on ImageNet. SlowFast, by contrast, can be trained on video data from scratch, although at a very high training cost (see Table 2).

We also attempted to train TimeSformer directly on Kinetics-400 without ImageNet pre-training. With a longer training schedule and more data augmentation, it is possible to train the model from scratch, although the video-level accuracy is much lower, at 64.8%. In Table 3, we examine the effect of pre-training on ImageNet-1K and ImageNet-21K for K400 and SSv2.

(1) TimeSformer is the default version of our model, which works with 8×224×224 video clips.

(2) TimeSformer-HR is a high spatial resolution version, which works with 16×448×448 video clips.

(3) TimeSformer-L is a long-range configuration of our model, which works with 96×224×224 video clips whose frames are sampled at a rate of 1/4.

Table 3 shows that ImageNet-21K pre-training is effective on K400 and gives consistently higher accuracy than ImageNet-1K pre-training. For SSv2, on the other hand, ImageNet-1K and ImageNet-21K pre-training give the same level of accuracy. This is plausibly because SSv2 requires complex spatio-temporal reasoning, whereas K400 is biased toward spatial scene information and can therefore benefit more from features learned on a larger pre-training dataset.

To understand the impact of video data scale on performance, we trained TimeSformer on subsets of K400 and SSv2 containing {25%, 50%, 75%, 100%} of the full training data. The results in Figure 4 show that TimeSformer outperforms the other models on all training subsets for K400. However, SSv2 shows a different trend: TimeSformer is the strongest model only when trained on 75% or 100% of the data. This may be explained by the fact that, compared with K400, SSv2 requires learning more complex temporal patterns, and TimeSformer needs more examples to learn those patterns effectively.

Effect of the number of tokens

The scalability of TimeSformer allows it to operate with higher spatial resolution and longer videos compared to many 3DCNNs. We note that these aspects affect the length of the sequence of tokens fed to the Transformer. Specifically, increasing the spatial resolution increases the number of patches (N) per frame. Increasing the number of frames also increases the number of input tokens. To investigate this effect, we performed an empirical study in which we increased the number of tokens along these two axes separately.
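To make the growth in sequence length concrete, the sketch below counts the input tokens for the three clip sizes listed earlier (8×224×224, 16×448×448, and 96×224×224). The patch size of 16 and the single extra classification token are assumptions of this example, as is the function name `num_tokens`.

```python
def num_tokens(frames, height, width, patch=16):
    # N patches per frame, times F frames, plus one classification token (assumed).
    patches_per_frame = (height // patch) * (width // patch)
    return patches_per_frame * frames + 1

for name, (f, h, w) in {
    "TimeSformer": (8, 224, 224),
    "TimeSformer-HR": (16, 448, 448),
    "TimeSformer-L": (96, 224, 224),
}.items():
    print(name, num_tokens(f, h, w))
# TimeSformer 1569
# TimeSformer-HR 12545
# TimeSformer-L 18817
```

Doubling the resolution quadruples N, while adding frames grows the sequence linearly.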

The results are shown in Figure 5. Increasing the spatial resolution improves performance up to a certain point. Similarly, increasing the input clip length consistently improves accuracy. GPU memory constraints do not allow us to test clips longer than 96 frames, but even 96-frame clips are far beyond current convolutional models, which are typically limited to inputs of 8 to 32 frames.

Importance of Position Embedding

To investigate the importance of Spatio-temporal position embedding, we compared several versions of TimeSformer:

(1) No position embedding

(2) Position embedding in space only

(3) Space-time position embedding

The experiments were conducted on Kinetics-400 and Something-Something-V2, and the results are shown in Table 4. Our model with Spatio-temporal position embedding produces the best accuracy on both datasets. Interestingly, the model with spatial-only position embedding produces good results on Kinetics-400 but much worse results on Something-Something-V2. This makes sense, since Kinetics-400 is more spatially biased, whereas Something-Something-V2 requires complex temporal reasoning.

Comparison with SOTA

In Table 5, we show the results on the K400 validation set. In these experiments, we use a TimeSformer pre-trained on ImageNet-21K. In addition to accuracy, we also report the inference cost in TFLOPs. Whereas most previous methods use 10 temporal clips and 3 spatial crops (30 spatio-temporal views in total) during inference, TimeSformer achieves solid accuracy with only 3 views (3 spatial crops), which reduces the inference cost. The long-range variant, TimeSformer-L, achieves a top-1 accuracy of 80.7%. Furthermore, our default TimeSformer has the lowest inference cost among recent state-of-the-art models, yet it still achieves 78.0% accuracy, outperforming many more costly models. We also measured the actual inference runtime on 20,000 validation videos of Kinetics-400 (using eight Tesla V100 GPUs): SlowFast takes 14.88 hours to complete inference, whereas TimeSformer takes 36 minutes, TimeSformer-HR 1.06 hours, and TimeSformer-L 2.6 hours. Thus, even though SlowFast and TimeSformer-L have comparable costs in terms of TFLOPs, all versions of TimeSformer have much lower runtimes in practice.

In Table 6, we also show the results on Kinetics-600, where, as on Kinetics-400, TimeSformer performs well, outperforming all prior methods on this benchmark.

Finally, in Figure 6, we study the effect of using multiple temporal clips (each with one spatial crop) during inference. We plot the accuracy obtained when testing with K ∈ {1, 3, 5, 10} temporal clips. SlowFast requires multiple (≥5) clips to approach its highest accuracy. Conversely, the long-range variant, TimeSformer-L, does not require multiple clips to achieve its best performance, as a single clip can span about 12 seconds of Kinetics video.
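The temporal span of a clip follows from the number of frames and the sampling rate. The sketch below assumes a source video at 30 frames per second, which is an assumption of this example and not stated in the article; the function name is also hypothetical.

```python
def clip_span_seconds(num_frames, sampling_stride, fps=30.0):
    # A clip of num_frames frames, keeping one frame out of every sampling_stride.
    return num_frames * sampling_stride / fps

print(round(clip_span_seconds(96, 4), 1))   # 12.8  (96 frames at a 1/4 rate)
print(round(clip_span_seconds(8, 32), 1))   # 8.5   (8 frames at a 1/32 rate)
```

Under that assumption, a 96-frame clip sampled at a rate of 1/4 covers roughly 12 to 13 seconds, in line with the figure quoted above.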

In Table 7, we also evaluate the models on SSv2 and Diving-48. For SSv2, we use the TimeSformer pre-trained on ImageNet-1K, since pre-training on ImageNet-21K does not improve the accuracy (see Table 3). This also lets us apply the same pre-training to all other models in the comparison, using a ResNet pre-trained on ImageNet-1K. Our results show that TimeSformer achieves lower accuracy than the best models on this dataset. However, given that our model uses a completely different design, we believe these results suggest that TimeSformer is a promising approach even for challenging, temporally heavy datasets such as SSv2. Table 7 also presents results on another temporally heavy dataset, Diving-48. Because a problem was recently discovered in an earlier version of the Diving-48 labels, we compare our method only with a reproduced SlowFast 16×8 R101 model. We find that TimeSformer significantly outperforms SlowFast.

Long-term video modeling

We evaluate TimeSformer on a long-term video modeling task using HowTo100M, an instructional video dataset containing approximately 1M instructional web videos that show humans performing over 23K different tasks such as cooking, repairing, and creating art. The average duration of these videos is about 7 minutes, an order of magnitude longer than the videos in standard action recognition benchmarks. Each HowTo100M video has a label indicating the task demonstrated in the video (one of the 23K classes), which can be used for supervised learning. This makes it a suitable benchmark for assessing a model's ability to recognize activities that unfold over a very long time span. In this evaluation, we only consider categories with at least 100 video examples. This yields a subset of HowTo100M with 120K videos across 1059 task categories. We randomly split this collection into 85K training videos and 35K test videos.

Table 8 shows the results. As a baseline, we use four versions of SlowFast R101, all working with video clips sampled at a frame rate of 1/32 but with different numbers of frames (8, 32, 64, 96). For TimeSformer, we start from a ViT pre-trained on ImageNet-21K and train four variants with the same configurations. All models in this comparison are pre-trained on Kinetics-400 before fine-tuning on HowTo100M. During inference, for each method, we sample as many non-overlapping temporal clips as are needed to cover the full temporal range of the video. For example, if a single clip spans 8.5 seconds, we sample 48 test clips to cover 410 seconds of video. Video-level classification is done by averaging the clip predictions. From the results in Table 8, we first notice that TimeSformer outperforms the corresponding SlowFast by a large margin of 8-11% when both cover the same single-clip span. We also observe that longer-range TimeSformer variants do better: our longest-range variant achieves the best video-level classification accuracy. These results suggest that our model is very well suited to tasks that require long-range video modeling. We also experimented with fine-tuning TimeSformer directly from ViTs pre-trained on ImageNet-1K and ImageNet-21K (without Kinetics-400 training). With ImageNet-21K pre-training, TimeSformer achieved top-1 accuracies of 56.0, 59.2, 60.2, and 62.1 for 8, 32, 64, and 96 frame inputs, respectively. These results show that our model can effectively exploit long-range temporal dependencies regardless of the pre-training dataset used.
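The video-level classification step described above, averaging the predictions of the sampled clips, can be sketched as follows. The per-clip probabilities are made-up numbers for illustration, and the function name is hypothetical.

```python
import numpy as np

def video_level_prediction(clip_probs):
    """clip_probs: (num_clips, num_classes) class probabilities, one row per clip."""
    mean_probs = np.asarray(clip_probs, dtype=float).mean(axis=0)
    return int(mean_probs.argmax()), mean_probs

# Three non-overlapping clips from one video, three classes (illustrative values).
clip_probs = [[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.6, 0.3]]
label, mean_probs = video_level_prediction(clip_probs)
print(label)  # 1
```

Although the first clip on its own favors class 0, the averaged prediction favors class 1, which is the point of aggregating over clips that cover the whole video.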

Summary

We introduced TimeSformer, a fundamentally different approach to video modeling compared with the established paradigm of convolution-based video networks. We showed that it is possible to design effective and scalable video architectures based on spatio-temporal self-attention. Our method (1) is conceptually simple, (2) achieves state-of-the-art results on major action recognition benchmarks, (3) has low training and inference costs, and (4) can be applied to clips longer than 1 minute, thus enabling long-term video modeling. In the future, we plan to extend our method to other video analysis tasks such as action localization, video captioning, and question answering.