
Spatio-temporal Data Augmentation Dedicated to Video Recognition!

Data Augmentation

3 main points
✔️ Extends Data Augmentation used in image recognition to video recognition
✔️ Extends RandAugment and Cutmix in the time direction
✔️ Achieves results competitive with SoTA on small-dataset tasks such as the 1st Visual Inductive Priors (1stVIPriors) challenge

Learning Temporally Invariant and Localizable Features via Data Augmentation for Video Recognition
written by Taeoh Kim, Hyeongmin Lee, MyeongAh Cho, Ho Seong Lee, Dong Heon Cho, Sangyoun Lee
(Submitted on 13 Aug 2020)
Comments: European Conference on Computer Vision (ECCV) 2020, 1st Visual Inductive Priors for Data-Efficient Deep Learning Workshop (Oral)
Subjects: Computer Vision and Pattern Recognition (cs.CV)


Introduction

In recent years, Data Augmentation (DA) has become an indispensable part of image recognition. In image recognition competitions in particular, various DA methods have been proposed to improve classification accuracy by bringing the distribution of the training data as close as possible to that of the test data. Beyond rotating, flipping, and changing the color of an image (invert, grayscale, colorize), mixing two images (mixup) and cutting and pasting between images (Cutmix) are becoming commonplace. However, all of these have been studied for image recognition, and little attention has been paid to video recognition, where a class label is inferred for a whole video.

The paper introduced in this article extends DA for image recognition in the temporal direction and studies which DA method works best for video recognition. The key point in applying DA to video recognition is to think from the perspective of the video: images are two-dimensional (vertical and horizontal), whereas videos are three-dimensional, with time as the additional dimension.

Specifically, a video changes geometrically and optically over time. The figure in the paper gives two examples. The left example is a skydiving video in which the position of an object in the frame changes as the camera rotates. The right example is a basketball game in which a camera flash changes the contrast. Videos often contain such changes (perturbations) across consecutive frames. In this article, we explain the two methods proposed to incorporate such temporal changes into DA.

  • RandAugment-T: Extends RandAugment, which uses grid search to find the best DA, to video.
  • Extensions of the Cutmix family: Extends Cutout, mixup, Cutmix, and Cutmixup to video.


RandAugment has two parameters: N is the number of DA operations (transforms) and M is the magnitude of the augmentation. It was designed for image recognition, so for video it has to be applied frame by frame with the same magnitude M. That is unsatisfying, because it cannot reproduce the temporal changes described above. RandAugment-T instead uses three parameters (N, M1, M2).

  • N: Number of DA operations (transforms)
  • M1: Augmentation magnitude for the first frame
  • M2: Augmentation magnitude for the last frame

M1 and M2 are the augmentation magnitudes at the temporal endpoints (the first and last frames), respectively. In the Python pseudo-code given in the paper, np.linspace(...) interpolates between M1 and M2 over the number of frames T, so the magnitude changes continuously from frame to frame. The larger the difference between M1 and M2, the steeper the change in magnitude.

By adjusting rotate, shear-x, shear-y, translate-x, and translate-y, we can model geometric changes caused by camera motion, as shown in (a) of the figure in the paper. By adjusting solarize, color, posterize, contrast, and brightness, we can model the brightness and contrast changes made by the automatic shooting mode of a high-performance camera, as shown in (b).
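The interpolation described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the paper's code: the two operations (`brightness`, `translate_x`), the `max_magnitude` scale of 30, and the function names are hypothetical stand-ins for the full RandAugment operation set.

```python
import numpy as np

def temporal_magnitudes(m1, m2, num_frames):
    """Linearly interpolate the magnitude from m1 (first frame) to m2 (last frame)."""
    return np.linspace(m1, m2, num_frames)

def brightness(frame, magnitude, max_magnitude=30):
    """Hypothetical photometric op: magnitude 0 leaves the frame unchanged."""
    factor = 1.0 + magnitude / max_magnitude
    return np.clip(frame * factor, 0.0, 1.0)

def translate_x(frame, magnitude, max_magnitude=30):
    """Hypothetical geometric op: horizontal shift (np.roll wraps around the border)."""
    shift = int(round(frame.shape[1] * 0.33 * magnitude / max_magnitude))
    return np.roll(frame, shift, axis=1)

OPS = [brightness, translate_x]  # stand-in for the full operation list

def rand_augment_t(video, n, m1, m2, rng):
    """Apply n randomly chosen ops to a (T, H, W, C) video with values in [0, 1].

    The same ops are used for every frame; only the magnitude varies over time.
    """
    num_frames = video.shape[0]
    mags = temporal_magnitudes(m1, m2, num_frames)
    out = video.copy()
    for op_idx in rng.choice(len(OPS), size=n):
        op = OPS[op_idx]
        out = np.stack([op(out[t], mags[t]) for t in range(num_frames)])
    return out

rng = np.random.default_rng(0)
video = np.full((5, 4, 6, 3), 0.5)
augmented = rand_augment_t(video, n=2, m1=0, m2=8, rng=rng)
```

With M1 = 0 the first frame is left untouched and the effect grows toward the last frame; setting M1 = M2 falls back to the frame-wise behaviour of the original RandAugment.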

Extension of the Cutmix family

Next, the paper extends the Cutmix family, which is widely used in image recognition, to video. The figure in the paper shows that many variations are possible along the time direction.

In the captions of (a), (b), and (d), Frame-(DA) switches the content at a specific time, like changing TV channels, and Cube-(DA) occupies a specific part of the frame for a certain period, like a picture-in-picture inset on TV. By replacing (DA) with Cutmix or Cutout, we can model the various temporal perturbations shown in the figure. As for (e) FadeMixup, it performs DA as a scene change while suppressing the sudden luminance change at the boundary.
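As a rough illustration of the difference between the Frame- and Cube- variants, the sketch below implements both for Cutmix with NumPy. It assumes videos stored as (T, H, W, C) arrays and mixes labels in proportion to the replaced volume, following the usual Cutmix convention; the function names and the explicit box arguments are hypothetical and not taken from the paper.

```python
import numpy as np

def frame_cutmix(video_a, video_b, t_start, t_end):
    """Replace whole frames [t_start, t_end) of video_a with those of video_b."""
    mixed = video_a.copy()
    mixed[t_start:t_end] = video_b[t_start:t_end]
    # lam is the share of video_a that survives (here: a fraction of the frames)
    lam = 1.0 - (t_end - t_start) / video_a.shape[0]
    return mixed, lam

def cube_cutmix(video_a, video_b, t_start, t_end, top, bottom, left, right):
    """Replace a spatio-temporal cube of video_a with the same cube of video_b."""
    mixed = video_a.copy()
    mixed[t_start:t_end, top:bottom, left:right] = \
        video_b[t_start:t_end, top:bottom, left:right]
    num_frames, height, width = video_a.shape[:3]
    cube = (t_end - t_start) * (bottom - top) * (right - left)
    lam = 1.0 - cube / (num_frames * height * width)
    return mixed, lam

def mix_labels(label_a, label_b, lam):
    """Blend one-hot labels with the same ratio as the video content."""
    return lam * label_a + (1.0 - lam) * label_b

video_a = np.zeros((8, 4, 4, 3))
video_b = np.ones((8, 4, 4, 3))
frame_mixed, frame_lam = frame_cutmix(video_a, video_b, 2, 6)
cube_mixed, cube_lam = cube_cutmix(video_a, video_b, 0, 4, 0, 2, 0, 2)
soft_label = mix_labels(np.array([1.0, 0.0]), np.array([0.0, 1.0]), frame_lam)
```

Replacing the pasted region with zeros instead of `video_b` turns the same two functions into Frame-Cutout and Cube-Cutout; in practice the time interval and box would be sampled at random for each training sample.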

  • Cutout: Prevents the model from fixating on specific parts of the image.
  • Frame-Cutout: Prevents the model from fixating on a specific time interval (frame interval).
  • Cube-Cutout: Hybrid of the two.
  • Cutmix: Cuts and pastes image regions so that the model learns to find the spatial locations of features.
  • Frame-Cutmix: Cuts and pastes frames so that the model learns to find the temporal locations of features.
  • Cube-Cutmix: Hybrid of the two.
    (The Cutmixup family is a version that softens the boundary change of Cutmix.)
  • FadeMixup: Mitigates sudden luminance changes more than the Cutmix family.
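FadeMixup can be pictured as mixup whose blending weight changes gradually over the frames, so one clip fades into the other without an abrupt boundary. The sketch below is a minimal illustration under assumptions: the weight is interpolated linearly between two endpoint values, and the label weight is taken as the mean of the per-frame weights. Neither the function name nor these choices are confirmed by the article.

```python
import numpy as np

def fade_mixup(video_a, video_b, lam_start, lam_end):
    """Blend two (T, H, W, C) videos with a weight that moves from lam_start to lam_end."""
    num_frames = video_a.shape[0]
    lam = np.linspace(lam_start, lam_end, num_frames)   # one weight per frame
    weight = lam.reshape(num_frames, 1, 1, 1)           # broadcast over H, W, C
    mixed = weight * video_a + (1.0 - weight) * video_b
    # Assumption: the label weight for video_a is the average frame weight
    return mixed, float(lam.mean())

video_a = np.zeros((5, 2, 2, 3))
video_b = np.ones((5, 2, 2, 3))
faded, label_weight = fade_mixup(video_a, video_b, lam_start=1.0, lam_end=0.0)
```

Setting `lam_start` equal to `lam_end` gives ordinary mixup applied identically to every frame, which makes the temporal part of the method easy to ablate.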

Results of three experiments

We now look at the three experiments in the paper that evaluate the accuracy of the two augmentation methods above. The SlowFast network is used as the backbone.

  1. Ablation with UCF101
  2. Re-experiment with HMDB-51 data set
  3. Comparison with SoTA at 1stVIPrior

(In addition, for UCF101, the data split of the ECCV 2020 competition called 1stVIPrior is used.)

The computing environment is a GTX1080Ti (implemented in PyTorch). When the author of this article ran it in almost the same environment, training finished in about one day.

Which DA is the best? (Ablation with UCF101)


This is the result for RandAugment-T, where Baseline is no DA, Spatial is the original RandAugment, Temporal+ is RandAugment-T, and Mix is a mix of Spatial and Temporal+. Looking at the bolded scores, Temporal+ has the highest accuracy for both Top1-acc and Top5-acc. Although Spatial also improves accuracy, Temporal+, which models temporal changes, is more accurate.
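For readers unfamiliar with the two metrics, Top1-acc counts a prediction as correct only if the highest-scoring class is the true class, while Top5-acc counts it as correct if the true class is among the five highest-scoring classes. A small NumPy sketch with made-up scores (not results from the paper) shows the computation; the function name is illustrative.

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    top_k = np.argsort(scores, axis=1)[:, -k:]        # indices of the k largest scores
    hits = (top_k == labels[:, None]).any(axis=1)     # is the true label among them?
    return float(hits.mean())

# Toy example: 3 clips, 3 classes
scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.3, 0.5]])
labels = np.array([1, 1, 0])
top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

Here only the first clip is classified correctly at k = 1, while the second clip also counts as a hit at k = 2.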

Extension of the Cutmix family

Table.3~5 show the evaluation of Cutout, Cutmix, and Mixup, in that order.

The accuracy of Cutout (Table.3) decreases across the board. On the other hand, the Frame-(DA) variants score highest for Cutmix (Table.4) and Mixup (Table.5), which suggests that temporal DA leads to higher scores in video recognition. Although all Cutout variants lose accuracy, Frame-Cutout and Cube-Cutout, which are temporal DA, still do better than plain Cutout. Furthermore, Mixup (Table.5), which blends the data, scores notably higher than Cutmix (Table.4). According to the authors, this suggests that cutting in the spatial domain is particularly harmful for video, because the region in which objects move is small.

What about other data sets? (Re-experiment with HMDB-51 data set)

Since HMDB-51 is a smaller dataset than UCF101, the overall accuracy of video recognition is lower. Even so, the rows with bolded scores show that temporal DAs such as RandAug-T and Cube-(DA) achieve higher accuracy. The trend is therefore generally similar to that on UCF101.

How does it compare to SoTA in competition? (Comparison with SoTA at 1stVIPrior)

Finally, the method is compared with SoTA on 1stVIPrior, a competition held at ECCV 2020.

The table shows that this method scores competitively with the third-place team. Although it does not surpass the top scores, the first- through third-place teams used a large two-stream network as their backbone, whereas the proposed method (Ours) used only slowfast50 with DA (RandAug-T & All methods) and a data-level ensemble. That such a simple network remains competitive shows the versatility of the proposed DA.

Two discussion points

1. Can't temporal DA be pushed further?

We can see from the results above that methods modeling temporal changes achieve high accuracy. However, simple spatial DAs such as Spatial and Cutmix are also reasonably accurate. The paper gives three possible reasons why temporal DA did not pull clearly ahead of spatial DA.

  1. The amount of training data is small.
  2. Temporal changes are not well represented in the validation dataset.
  3. The datasets are trimmed in a well-behaved manner.

In particular, UCF101 and HMDB-51 have been trimmed so that geometric and optical changes, such as changes in camera position, are reduced (to make them easier to study). Therefore, spatial DA alone can already be expected to bring some improvement. In other words, the difference between spatial DA and temporal DA may be more pronounced if the dataset contains more poorly behaved videos. As future work, the authors plan to test their method on large-scale video datasets such as Kinetics.

2. Verification by CAM

Visualization with CAM is useful for seeing differences in what has been learned. The figure in the paper shows that the temporal DA, FadeMixup (right), localizes frames in time better than the spatial DA, Mixup (left): the color gradient switches more noticeably on the right. The paper's results suggest the same for the Cutmix family.


In this article, we introduced research on extending DA, which is widely used in image recognition tasks, to video recognition tasks. The result is that DA which models temporal changes is useful for video recognition. Video recognition is only one of the many video tasks in the CV world. In the future, it will be necessary to think about how to extend temporal DA to tasks such as object detection, segmentation, and spatio-temporal localization in videos. The research style introduced in this article, which adds temporal modeling to spatial DA from the image world, is expected to become more and more popular.
