
This Model Makes Efficient Real-Time Video Object Segmentation Possible For The First Time!


Video Object Segmentation

3 main points
✔️ An efficient real-time video object segmentation model
✔️ Two new concepts, Pixel-Adaptive Memory and a Light Aggregation Encoder, to solve the problems faced by previous SOTA models
✔️ SOTA performance on two video object segmentation datasets

SwiftNet: Real-time Video Object Segmentation
written by Haochen Wang, Xiaolong Jiang, Haibing Ren, Yao Hu, Song Bai
(Submitted on 9 Feb 2021)
Comments: Accepted to arXiv.

Subjects: Computer Vision and Pattern Recognition (cs.CV)


In real-time video object segmentation (VOS), a model aims to segment the target objects in every frame of a video, given the annotations for the first frame. Despite constant endeavors, efficient real-time VOS has remained out of reach. In the diagram below, only the models to the right of the red line reach the frame rate required for real-time VOS: they are very few, and not particularly accurate.

Most of the above models focus on improving segmentation accuracy, which comes at the price of speed. Some memory-based models make use of all historical frames and non-local reference-query matching. Despite being very accurate, these approaches slow segmentation down. Several methods have been proposed to improve segmentation speed, but they do not yet meet the needs of real-time VOS. There therefore appears to be a tradeoff between efficiency and accuracy.

This paper identifies spatiotemporal redundancy as the bottleneck slowing down real-time VOS. To solve this problem, we introduce Pixel-Adaptive Memory (PAM), which is composed of variation-aware triggers together with pixel-wise update and matching. With this, our model SwiftNet breaks the real-time VOS barrier, setting records and showing outstanding performance on several benchmarks.

Some Background Information

The task of one-shot VOS consists of two parts: object segmentation, and matching the segmented objects across frames. Object segmentation is more or less similar across modules; modules differ in their reference-modeling and reference-query matching strategies. In last-frame reference modeling, the last and/or the first frame is used as a reference to locate the corresponding objects in the current frame. Although this approach is faster due to lower segmentation costs, it cannot cope effectively with objects appearing in or disappearing from the frames (object variations). In the all-frame approach, data from all previous frames are used, which makes it more accurate but slow. Another way of using all frames is to carry relevant information forward through the network, as in the memory-based STM model. Models like this propagate temporal information, which is quite effective against object variations. Reference-query matching strategies usually measure the similarity among objects using CNNs, cross-correlations, or non-local computations.


For a video sequence with frames V = [x1, x2, ..., xn] containing the objects O = [o1, o2, ..., on], let the current frame be xt, annotated with mask yt. The historical information from all previous frames [x1, ..., xt-1] and their masks [y1, ..., yt-1] is used to establish the model Mt-1 up to frame t-1 as follows:
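In the notation just defined, this modeling step can plausibly be written as follows (a sketched reconstruction; the indicator-based frame selection is inferred from the surrounding text):

```latex
M_{t-1} = \varphi\left(\left\{\, \mathrm{En}_R(x_i, y_i) \;\middle|\; \mathbb{I}(i) = 1,\; 1 \le i \le t-1 \,\right\}\right)
```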

Here, I is an indicator function denoting whether a frame is used in modeling, EnR is the reference encoder that extracts features, and φ is the object modeling process. Next, we generate the object localization map It as follows:
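A plausible form of this step, following the definitions above:

```latex
I_t = \gamma\!\left(\mathrm{En}_Q(x_t),\, M_{t-1}\right)
```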

EnQ is the frame query encoder and γ denotes a pixel-wise query matching function that searches Mt-1 within the encodings of xt.  

As shown in the above diagram, xt is first passed through the query encoder. The encodings are matched with the current model to generate a localization map It. The localization map and the query encodings are passed through the decoder to obtain the mask yt. Once the mask is obtained, xt, yt, xt−1, and yt−1 are passed through the variation-aware trigger. If it fires, i.e., if there are variations between the images, they are passed on to the Light Aggregation Encoder (LAE) for a pixel-wise memory update. Further details of the process are described in the next section.

Pixel Adaptive Memory (PAM)

The PAM consists of three parts each of which is described below:

1) Variation-Aware Trigger(VAT)

We want to include historical information to make the model more robust to object variations, while also compressing temporal redundancy. The VAT module evaluates the inter-frame variation for each pair of consecutive frames, and a memory update is triggered when the accumulated variation reaches a certain threshold. We calculate the variation of the mask and the image at each pixel i as follows:
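A plausible per-pixel form, based on the thresholds introduced below (the exact comparison is an assumption):

```latex
d_f^{\,i} = \mathbb{1}\!\left[\,\left|x_t^{\,i} - x_{t-1}^{\,i}\right| > th_f\,\right],
\qquad
d_m^{\,i} = \mathbb{1}\!\left[\,\left|y_t^{\,i} - y_{t-1}^{\,i}\right| > th_m\,\right]
```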

Then, at each pixel, we update the overall running variation P as,
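One plausible accumulation rule, consistent with the per-pixel variations above (the normalization by HW is an assumption):

```latex
P \;\leftarrow\; P + \frac{1}{HW}\sum_{i} \max\!\left(d_f^{\,i},\, d_m^{\,i}\right)
```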

The thresholds thf and thm are hyperparameters. When P exceeds the threshold Pth, a memory update is triggered for the frame.
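The trigger logic can be sketched as below. This is a minimal NumPy illustration, not the authors' implementation; the function name, threshold values, and the reset-after-trigger behavior are assumptions.

```python
import numpy as np

def variation_trigger(prev_frame, frame, prev_mask, mask,
                      th_f=0.1, th_m=0.5, p_running=0.0, p_th=0.2):
    """Sketch of a Variation-Aware Trigger (hypothetical thresholds).

    A pixel counts as "varied" when its frame-intensity change exceeds
    th_f or its mask change exceeds th_m. The fraction of varied pixels
    is accumulated into a running score P; a memory update fires once
    P exceeds p_th.
    """
    frame_var = np.abs(frame - prev_frame).mean(axis=-1) > th_f  # (H, W)
    mask_var = np.abs(mask - prev_mask) > th_m                   # (H, W)
    varied = np.logical_or(frame_var, mask_var)
    p_running += varied.mean()      # accumulate normalized variation
    triggered = p_running > p_th
    if triggered:
        p_running = 0.0             # reset after the update fires
    return triggered, p_running
```

With identical consecutive frames the score stays at zero and nothing fires; a large appearance change fires the trigger immediately.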

2) Pixel-Wise Memory Update

Whenever a frame xt triggers an update, the pixels with significant variation from the memory Bt are identified first. EnR encodes xt into key features KQ,t of size H×W×C/8 and value features VQ,t of size H×W×C/2. The shallower key features make matching efficient. Likewise, the memory Bt, containing kt pixels, is encoded into KR,t of size kt×C/8 and VR,t of size kt×C/2. Next, KQ,t is flattened and the cosine similarity is computed as follows:
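In the notation above, the similarity matrix can be sketched as:

```latex
S_{ij} = \frac{K_{Q,t}^{\,i} \cdot K_{R,t}^{\,j}}{\left\lVert K_{Q,t}^{\,i}\right\rVert \left\lVert K_{R,t}^{\,j}\right\rVert},
\qquad S \in \mathbb{R}^{HW \times k_t}
```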

The pixel-similarity vector Vp,t is computed by taking the largest similarity value in each row i of the matrix S:
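That is, plausibly:

```latex
V_{p,t}^{\,i} = \max_{j}\, S_{ij}
```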

Vp,t is sorted, and the β (≈10%) fraction of pixels exhibiting the most variation from the features in memory, i.e., the lowest similarity, is selected. The corresponding entries of KQ,t and VQ,t are then added directly to the memory Bt.
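The update step above can be sketched in NumPy as follows. This is an illustrative re-implementation under stated assumptions: the function name and the flattened (HW, C) layout are mine, and ties are broken arbitrarily by the sort.

```python
import numpy as np

def pixelwise_memory_update(k_query, v_query, k_mem, v_mem, beta=0.1):
    """Sketch of the pixel-wise memory update (names are illustrative).

    k_query: (HW, Ck) query keys,  v_query: (HW, Cv) query values,
    k_mem:   (K,  Ck) memory keys, v_mem:   (K,  Cv) memory values.
    Computes cosine similarity between every query pixel and the memory,
    keeps the beta fraction of query pixels least similar to the memory
    (i.e., with the most variation), and appends them to the memory.
    """
    qn = k_query / (np.linalg.norm(k_query, axis=1, keepdims=True) + 1e-8)
    mn = k_mem / (np.linalg.norm(k_mem, axis=1, keepdims=True) + 1e-8)
    sim = qn @ mn.T                       # (HW, K) cosine similarities
    per_pixel = sim.max(axis=1)           # best memory match per pixel
    n_new = max(1, int(beta * len(per_pixel)))
    novel = np.argsort(per_pixel)[:n_new]  # least similar = most novel
    k_mem = np.concatenate([k_mem, k_query[novel]], axis=0)
    v_mem = np.concatenate([v_mem, v_query[novel]], axis=0)
    return k_mem, v_mem
```

Storing only these novel pixels is what keeps the memory compact compared with appending entire frames.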

3) Pixel-Wise Memory Match

The localization map It and the query value VQ,t are decoded to obtain the mask of a frame. As shown in the diagram above, to generate the localization map, KQ,t and KR,t are reshaped to HW×C/8 and C/8×kt respectively, followed by a dot-product similarity to calculate It. The dot product is passed through a softmax function and multiplied with the memory value VR,t. The resulting HW×C/2 matrix is concatenated with VQ,t to obtain the activated feature VD, which is then passed to the decoder.

This method eliminates the redundant pixels and reduces the size of It to HW×kt, compared with HW×HWT if all historical frames and pixels had been used. This makes SwiftNet faster than other recent models.
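The attention-style readout just described can be sketched as below. This is a minimal NumPy version under assumptions: the function name is mine, tensors are pre-flattened to (HW, C), and the scaling of the dot product is omitted as in the description above.

```python
import numpy as np

def pixelwise_memory_match(k_query, v_query, k_mem, v_mem):
    """Sketch of pixel-wise memory matching (attention-style readout).

    k_query: (HW, Ck), v_query: (HW, Cv), k_mem: (K, Ck), v_mem: (K, Cv).
    Each query pixel attends over the K memory pixels; the retrieved
    memory value is concatenated with the query value to form the
    decoder input of width 2*Cv.
    """
    logits = k_query @ k_mem.T                    # (HW, K) localization map
    logits -= logits.max(axis=1, keepdims=True)   # numerically stable softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    retrieved = weights @ v_mem                   # (HW, Cv) memory readout
    return np.concatenate([retrieved, v_query], axis=1)  # (HW, 2*Cv)
```

Because the attention is over kt stored pixels rather than every pixel of every past frame, the cost of this step grows with the memory size, not with the video length.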

Light Aggregation Encoder

Both EnQ and EnR use ResNets for feature extraction. In addition to extracting features from the input image, EnR also performs frame-mask aggregation. The image and mask could simply be concatenated and then encoded, but this would require two passes over the image frame, once each through EnQ and EnR. Unlike previous approaches, where xt was encoded separately by EnQ and EnR, in SwiftNet the feature maps generated by EnQ for the image are reused by EnR. This makes SwiftNet quite efficient.

For frame-mask aggregation, we use a novel light-aggregation encoder, as shown in the diagram above. The upper blue cuboids represent the EnQ feature-map buffers and the lower green ones are the feature maps of the input mask. Vertically aligned features have the same size and are concatenated together. For the feature transformation of the input mask, we also use reversed sub-pixel (RSP) downsampling and 1x1 convolutions for channel adjustment. RSP allows downsampling without significant loss of information.
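Reversed sub-pixel downsampling is the inverse of the familiar sub-pixel (pixel-shuffle) upsampling: it folds each r×r spatial block into channels, so no values are discarded. A minimal NumPy sketch (the function name and (H, W, C) layout are mine):

```python
import numpy as np

def reversed_subpixel_downsample(x, r=2):
    """Reversed sub-pixel (space-to-depth) downsampling sketch.

    x: (H, W, C) feature map with H and W divisible by r. Every
    non-overlapping r x r spatial block is moved into the channel
    dimension, giving (H/r, W/r, C*r*r). Unlike strided pooling,
    all input values survive the downsampling.
    """
    h, w, c = x.shape
    assert h % r == 0 and w % r == 0
    x = x.reshape(h // r, r, w // r, r, c)
    x = x.transpose(0, 2, 1, 3, 4)      # (H/r, W/r, r, r, C)
    return x.reshape(h // r, w // r, c * r * r)
```

This is why RSP preserves information where pooling would not: the output is just a lossless rearrangement of the input tensor.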

Reversed Sub-Pixel Downsampling 


Segmentation results produced by SwiftNet on DAVIS-17

The SwiftNet model with ResNet-18 and ResNet-50 backbones was tested on the DAVIS 2017 and YouTube-VOS datasets. The metrics used are the Jaccard index (J) and the mean boundary F-score (F), along with their mean (J&F), for segmentation accuracy, and frames per second (FPS) for segmentation speed.

The above table shows the results on the DAVIS-2017 dataset. 

The above table shows the results on the YouTube-VOS dataset. Here, s and u represent seen and unseen classes, G is the overall score, and OL denotes online learning. On both datasets, SwiftNet is highly accurate and faster than other SOTA models; it lags behind only in the J-score for unseen classes on YouTube-VOS.

For more information on the experimental setup and ablation studies, refer to the paper.


This paper has introduced two novel methods, Pixel-Adaptive Memory and the Light Aggregation Encoder, to solve the speed-accuracy tradeoff of previous VOS models. PAM addresses the spatiotemporal redundancy in matching-based VOS that had been the bottleneck for real-time models. In doing so, this paper has created a strong baseline for future research in VOS. SwiftNet is highly efficient and, thanks to its speed, well suited to real-world applications.

Thapa Samrat
I am a second-year international student from Nepal, currently studying in the Department of Electronic and Information Engineering at Osaka University. I am interested in machine learning and deep learning, so I write articles about them in my spare time.

If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.
