VideoMix: Testing CutMix With Video Tasks!

Data Augmentation 19/03/2021

3 main points
✔️ Comparative verification of CutMix in three different video tasks
✔️ Proposed VideoMix, an extension of CutMix in the spatio-temporal direction
✔️ Verified the versatility of VideoMix on action recognition, localization, and object detection tasks

VideoMix: Rethinking Data Augmentation for Video Classification
written by Taeoh Kim, Hyeongmin Lee, MyeongAh Cho, Ho Seong Lee, Dong Heon Cho, Sangyoun Lee
(Submitted on 7 Dec 2020)
Comments: Accepted to arXiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV)


Introduction

In recent years, there has been growing interest in data augmentation (DA) for image tasks. CutMix is a method that creates a new training sample by cutting a patch from one image and pasting it onto another, combining the ideas of Cutout and Mixup. Adding such mixed images to the training data increases the diversity of the training distribution and has a regularizing effect on the model, which in turn improves generalization. The method has had a large impact and has contributed to image recognition, object detection, and segmentation. More recently, CutMixup, a variant that suppresses the abrupt change at the patch boundary, has been proposed on top of it.

Although CutMix is widely used in image tasks, it has not been studied much in video tasks such as video recognition (as of March 2021, at the time of writing), or it has been studied only on limited datasets. The paper introduced in this article, "VideoMix", verifies the effectiveness of CutMix across various video tasks and various backbone models. It also examines which extension of CutMix to video is most effective: spatial, temporal, or spatio-temporal. The three types are outlined in the following figure.

The video is represented as a cuboid (height x width x frames). Note that the channels are omitted. From left to right, S-VideoMix is like a picture-in-picture wipe on TV, T-VideoMix is like switching channels (or a commercial interrupting a program), and ST-VideoMix is like a wipe that stays on screen only for a certain time. Because of the added time axis (frame axis), data augmentation (DA) for video is more complex, and there are more variants to verify. The paper therefore compares these variants of CutMix for video and proposes the resulting method as VideoMix.

In this article, we focus on the following three points.

Technique: About VideoMix
Results: Validation of VideoMix in key video tasks
Discussion: Where does VideoMix look and learn?

Technique: About VideoMix

Mixed image generation

VideoMix creates a mixed video and its label from two videos using a binary mask, as follows.

$\hat{x}=M \odot x_A + (1-M) \odot x_B$

$\hat{y}=\lambda_M y_A + (1-\lambda_M)y_B$

The formulas are almost the same as in the original CutMix, except for the added time axis. Here $x \in \mathbb{R}^{T \times H \times W}$ is a video (number of frames, height, width), $y$ is the one-hot label vector, and $M \in \{0, 1\}^{T \times H \times W}$ is a binary mask. Note that the RGB channels are omitted for simplicity. VideoMix creates a mixture $\hat{x}$ of two videos $x_A$ and $x_B$ and assigns it the mixed label $\hat{y}$. The element-wise product $\odot$ with the binary mask expresses that a part of one video is cut out and pasted into the other. The mask $M$ is determined by a cuboid region of the video.
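
A plausible form of the mask and of the mixing ratio, assuming the usual CutMix-style construction in which $M$ is 1 inside a cuboid with coordinates $(t_1, t_2, w_1, w_2, h_1, h_2)$ and $\lambda_M$ is the fraction of the video covered by the mask, is the following. This is a reconstruction under those assumptions, not a verbatim copy from the paper.

$M_{t,h,w} = \begin{cases} 1 & \text{if } t_1 \le t \le t_2,\ w_1 \le w \le w_2,\ h_1 \le h \le h_2 \\ 0 & \text{otherwise} \end{cases}$

$\lambda_M = \frac{1}{THW} \sum_{t,h,w} M_{t,h,w}$

With this reading, $\lambda_M$ is the share of $\hat{x}$ that comes from $x_A$, which matches the weight given to $y_A$ in the label equation.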

The three types of VideoMix can be expressed by how the cuboid coordinates $C=(t_1, t_2, w_1, w_2, h_1, h_2)$ are set.

S-VideoMix: $(t_1, t_2)=(0, T)$, and $(w_1, w_2, h_1, h_2)$ is sampled randomly
T-VideoMix: $(w_1, w_2, h_1, h_2)=(0, W, 0, H)$, and $(t_1, t_2)$ is sampled randomly
ST-VideoMix: $(t_1, t_2, w_1, w_2, h_1, h_2)$ is sampled randomly
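
The three settings above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names `sample_box` and `videomix` are hypothetical, the cut points are drawn uniformly (the article does not say how the paper samples them), and the mask is assumed to be 1 inside the cuboid so that $\lambda_M$ is the fraction of the video taken from $x_A$.

```python
import numpy as np

def sample_box(T, H, W, mode="S", rng=None):
    """Sample C = (t1, t2, w1, w2, h1, h2) for S-, T- or ST-VideoMix."""
    rng = np.random.default_rng() if rng is None else rng

    def interval(size):
        # Two distinct, sorted cut points in [0, size].
        # Uniform sampling is an assumption made for this sketch.
        a, b = sorted(rng.choice(size + 1, size=2, replace=False))
        return int(a), int(b)

    # S-VideoMix keeps the full time range; T-VideoMix keeps the full frame.
    t1, t2 = (0, T) if mode == "S" else interval(T)
    if mode == "T":
        (w1, w2), (h1, h2) = (0, W), (0, H)
    else:
        (w1, w2), (h1, h2) = interval(W), interval(H)
    return t1, t2, w1, w2, h1, h2

def videomix(x_a, y_a, x_b, y_b, box):
    """Mix two videos of shape (T, H, W, ...) and their one-hot labels."""
    t1, t2, w1, w2, h1, h2 = box
    T, H, W = x_a.shape[:3]
    mask = np.zeros((T, H, W), dtype=bool)
    mask[t1:t2, h1:h2, w1:w2] = True      # assumed: M = 1 inside the cuboid
    lam = float(mask.mean())              # lambda_M: share of voxels from x_a
    # Add trailing singleton axes so the mask broadcasts over channels.
    m = mask.reshape(mask.shape + (1,) * (x_a.ndim - 3))
    x_hat = np.where(m, x_a, x_b)         # M * x_A + (1 - M) * x_B
    y_hat = lam * y_a + (1.0 - lam) * y_b
    return x_hat, y_hat, lam
```

For example, calling `sample_box(8, 16, 16, mode="S")` always returns `t1, t2 = 0, 8`, so the pasted patch stays at the same position in every frame, while `mode="T"` returns the full frame and only a random frame range.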

Which type is best (S? T? ST?)

So how should the variable $C$ be chosen? Let's look at the ablation results on Mini-Kinetics with a SlowOnly-34 backbone.

In conclusion, the paper shows that the spatial variant, S-VideoMix, gives the best top-1 and top-5 accuracy. The likely reason is that in T- and ST-VideoMix each of the mixed videos lasts only a short time: a limited number of frames does not carry enough of the video's semantic information, which hurts the classification model. S-VideoMix is therefore used as the default VideoMix setting in all remaining experiments in the paper.

Verification of the regularization effect

Now let's look at the regularization effect of VideoMix.

The paper shows validation results on Mini-Kinetics, training SlowOnly-34 and comparing the validation scores. The red line is the baseline and the blue line is the model trained with VideoMix; after 200 epochs, the validation accuracy with VideoMix is higher than the baseline.

Results: Validation of VideoMix in key video tasks

Now let's look at the experiments that measure the accuracy of VideoMix. There are many results in the original paper, but in this article we will focus on the following three tasks.

Action recognition (Kinetics-400)
Weakly supervised temporal action localization (WSTAL)
AVA Object Detection

The Mini-Kinetics and Something-Something-V2 results mainly serve to reinforce the same claims, so we omit them here.

Action recognition

Kinetics-400 is a large-scale video dataset. As in basic image classification, the model infers an action label for the entire video sequence. Here, we look at how much the score improves over the baseline when VideoMix is applied to SlowOnly-50 and SlowFast-50.

Although the top-1 and top-5 accuracies do not always exceed those of I3D, the inference cost measured in GFLOPs × views shows that the scores hold up well despite the much lower computational cost. In particular, SlowFast + VideoMix reaches a top-1 accuracy of 76.6, which is quite competitive. Personally, I'd like to see what happens when VideoMix is applied to I3D models as well.

Weakly supervised temporal action localization (WSTAL)

This task is to detect the time interval in which an action occurs (e.g., the "running" class spans frames 14-20). However, the model is not trained with frame-by-frame annotations, only with labels for the whole video. In other words, a model trained only on video-level class labels must predict the class of the input video and also estimate the time interval of the action. This is why the task is called weakly supervised. To succeed, it is important to identify the main action in the video without being confused by background classes or small irrelevant actions.
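
To make the output of this task concrete, the sketch below turns a sequence of per-frame scores for one class into frame intervals by simple thresholding. This is a generic post-processing step assumed for illustration, not the procedure used in the paper; the function name `scores_to_intervals` and the threshold value are hypothetical.

```python
def scores_to_intervals(scores, threshold=0.5):
    """Return [start, end) frame intervals where the class score exceeds threshold."""
    intervals = []
    start = None
    for i, s in enumerate(scores):
        if s > threshold and start is None:
            start = i                      # an interval opens at this frame
        elif s <= threshold and start is not None:
            intervals.append((start, i))   # the interval closes before frame i
            start = None
    if start is not None:                  # interval still open at the last frame
        intervals.append((start, len(scores)))
    return intervals

# Hypothetical per-frame scores for a single class such as "running".
frame_scores = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.6]
print(scores_to_intervals(frame_scores))   # [(2, 5), (6, 7)]
```

A model that is easily distracted by background frames would produce high scores outside the true action, and the extracted intervals would be too long or fragmented.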

The results are evaluated on the THUMOS14 dataset with an I3D backbone. VideoMix achieves a higher mAP than the other data augmentation methods used for the same task.

AVA Object Detection

Finally, we evaluate VideoMix on detection in video. For each frame, the model infers bounding-box (BB) positions and action labels. Pre-training with VideoMix improves the validation mAP.

Discussion: Where does VideoMix look and learn?

We have confirmed that VideoMix improves the accuracy of various tasks, but how does VideoMix affect the model in the end?

The paper suggests that a model trained with VideoMix learns to recognize the two actions in a mixed video simultaneously. The figure below shows a visualization using CAM (more precisely, ST-CAM, an extension of CAM in the temporal direction proposed in the paper).

A mixed video of "Playing harmonica" and "Passing American football" was generated, and CAM was computed for each class. Looking at the bright (white) regions in the CAM images on the right, the "harmonica" CAM focuses on the player's mouth and hands (near the upper left) as the features of playing the harmonica. The "football" CAM focuses on the ball and the child's hand (near the middle) as the features of passing the ball. This is especially noticeable in the "football" CAM: in frames of the mixed video where the child is not holding the ball, the whole map becomes correspondingly darker. This suggests that VideoMix may improve generalization and reduce overfitting by hiding characteristic actions and preventing the model from concentrating on specific locations.
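
As a rough illustration of how a class activation map over space and time can be computed, the sketch below applies the standard CAM recipe to a 3D feature volume: the final-layer features are weighted by the classifier weights of one class, without pooling over time or space. The article does not give the details of the paper's ST-CAM, so this is an assumption-based sketch; the name `st_cam`, the tensor shapes, and the min-max normalization are all hypothetical choices.

```python
import numpy as np

def st_cam(features, fc_weights, class_idx):
    """Spatio-temporal class activation map.

    features:   (C, T, H, W) activations of the last convolutional layer
    fc_weights: (num_classes, C) weights of a linear classifier that follows
                global average pooling (assumed architecture)
    Returns a (T, H, W) map scaled to [0, 1].
    """
    # Weighted sum over the channel axis with the weights of the chosen class.
    cam = np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))
    cam = cam - cam.min()
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Hypothetical shapes: 4 channels, 2 frames, 3x3 feature maps, 5 classes.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 2, 3, 3))
fc_weights = rng.normal(size=(5, 4))
cam = st_cam(features, fc_weights, class_idx=1)
print(cam.shape)  # (2, 3, 3)
```

Computing one such map per class on the same mixed video is what allows the "harmonica" and "football" maps to be compared frame by frame.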

Summary

In this article, we examined VideoMix, an extension of CutMix to video tasks. We confirmed that the spatial variant, S-VideoMix, has the highest accuracy and improves scores when applied to various video tasks, while the paper suggests that T-VideoMix and ST-VideoMix are slightly less accurate because the temporally cut mixed videos lack sufficient semantic content. However, depending on the dataset, it seems that the semantic content can be retained even with fewer frames. I have a feeling that research on DA for video tasks that takes such temporal dynamics into account will continue to grow, starting with VideoMix.