
This Model Makes Efficient Real-Time Video Object Segmentation Possible For The First Time!


Video Object Segmentation

3 main points
✔️ An efficient real-time video object segmentation model
✔️ Two new concepts, Pixel-Adaptive Memory and a Light Aggregation Encoder, to solve the problems faced by previous SOTA models
✔️ SOTA performance on two video object segmentation datasets

SwiftNet: Real-time Video Object Segmentation
written by Haochen Wang, Xiaolong Jiang, Haibing Ren, Yao Hu, Song Bai
(Submitted on 9 Feb 2021)
Comments: Accepted to arXiv.

Subjects: Computer Vision and Pattern Recognition (cs.CV)


In real-time video object segmentation (VOS), a model aims to segment the target objects in every frame of a video, given the annotations for the first frame. Despite constant endeavors, efficient real-time VOS has remained out of reach. In the diagram below, only the models to the right of the red line reach the frame rate required for real-time VOS: they are very few, and not particularly accurate.

Most of the above models focus on improving segmentation accuracy, which comes at the price of speed. Some memory-based models make use of all historical frames and non-local reference-query matching. Despite being very accurate, these approaches slow segmentation down. Several methods have been proposed to improve segmentation speed, but they do not yet meet the needs of real-time VOS. There therefore appears to be a tradeoff between efficiency and accuracy.

This paper identifies spatiotemporal redundancy as the bottleneck slowing down real-time VOS. To solve this problem, we introduce Pixel-Adaptive Memory (PAM), which is composed of variation-aware triggers together with pixel-wise update and matching. With this, our model SwiftNet breaks the real-time VOS barrier, setting records and showing outstanding performance on several benchmarks.

Some Background Information

The task of one-shot VOS consists of two parts: object segmentation, and matching the segmented objects across frames. Object segmentation is more or less similar across modules; modules differ in their reference-modeling and reference-query matching strategies. In last-frame reference modeling, the last and/or the first frame is used as a reference to locate the corresponding objects in the current frame. Although this approach is faster due to lower segmentation costs, it cannot cope effectively with objects appearing in or disappearing from the frames (object variations). In the all-frame approach, data from all previous frames are used, which makes it more accurate but slow. Another way of using all frames is to carry relevant information forward through the network, as in the memory-based STM model. Models like this propagate temporal information, which is quite effective against object variations. Reference-query matching strategies usually measure the similarity among objects using CNNs, cross-correlations, or non-local computations.


For a video sequence with frames V = [x1, x2, ..., xn] containing the objects O = [o1, o2, ..., on], let the current frame be xt, annotated with mask yt. The historical information from all previous frames [x1, ..., xt-1] and their masks [y1, ..., yt-1] is used to establish the model Mt-1 up to frame t-1 as follows:
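In the notation just defined, this modeling step can plausibly be written as follows (a sketched reconstruction; the indicator-based frame selection is inferred from the surrounding text):

```latex
M_{t-1} = \varphi\left(\left\{\, \mathrm{En}_R(x_i, y_i) \;\middle|\; \mathbb{I}(i) = 1,\; 1 \le i \le t-1 \,\right\}\right)
```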

Here, I is an indicator function denoting whether a frame is used in modeling, EnR is the reference encoder that extracts features, and φ is the object modeling process. Next, we generate the object localization map It as follows:
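A plausible form of this step, following the definitions above:

```latex
I_t = \gamma\!\left(\mathrm{En}_Q(x_t),\, M_{t-1}\right)
```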

EnQ is the frame query encoder and γ denotes a pixel-wise query matching function that searches Mt-1 within the encodings of xt.  

As shown in the above diagram, xt is first passed through the query encoder. The encodings are matched with the current model to generate a localization map It. The localization map and the query encodings are passed through the decoder to obtain the mask yt. Once the mask is obtained, xt, yt, xt−1, and yt−1 are passed through the variation-aware trigger. If it fires, i.e., if there are variations between the images, they are passed on to the Light Aggregation Encoder (LAE) for a pixel-wise memory update. Further details of the process are described in the next section.

Pixel Adaptive Memory (PAM)

The PAM consists of three parts each of which is described below:

1) Variation-Aware Trigger(VAT)

We want to include historical information to make the model more robust to object variations, while also compressing temporal redundancy. The VAT module evaluates the inter-frame variation for each pair of consecutive frames, and a memory update is triggered when the accumulated variation reaches a certain threshold. We calculate the variation of the mask and the image at each pixel i as follows:
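A plausible per-pixel form, based on the thresholds introduced below (the exact comparison is an assumption):

```latex
d_f^{\,i} = \mathbb{1}\!\left[\,\left|x_t^{\,i} - x_{t-1}^{\,i}\right| > th_f\,\right],
\qquad
d_m^{\,i} = \mathbb{1}\!\left[\,\left|y_t^{\,i} - y_{t-1}^{\,i}\right| > th_m\,\right]
```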

Then, at each pixel, we update the overall running variation P as,
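One plausible accumulation rule, consistent with the per-pixel variations above (the normalization by HW is an assumption):

```latex
P \;\leftarrow\; P + \frac{1}{HW}\sum_{i} \max\!\left(d_f^{\,i},\, d_m^{\,i}\right)
```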

The thresholds thf and thm are hyperparameters. When P exceeds the threshold Pth, a memory update is triggered for the frame.
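The trigger logic can be sketched as below. This is a minimal NumPy illustration, not the authors' implementation; the function name, threshold values, and the reset-after-trigger behavior are assumptions.

```python
import numpy as np

def variation_trigger(prev_frame, frame, prev_mask, mask,
                      th_f=0.1, th_m=0.5, p_running=0.0, p_th=0.2):
    """Sketch of a Variation-Aware Trigger (hypothetical thresholds).

    A pixel counts as "varied" when its frame-intensity change exceeds
    th_f or its mask change exceeds th_m. The fraction of varied pixels
    is accumulated into a running score P; a memory update fires once
    P exceeds p_th.
    """
    frame_var = np.abs(frame - prev_frame).mean(axis=-1) > th_f  # (H, W)
    mask_var = np.abs(mask - prev_mask) > th_m                   # (H, W)
    varied = np.logical_or(frame_var, mask_var)
    p_running += varied.mean()      # accumulate normalized variation
    triggered = p_running > p_th
    if triggered:
        p_running = 0.0             # reset after the update fires
    return triggered, p_running
```

With identical consecutive frames the score stays at zero and nothing fires; a large appearance change fires the trigger immediately.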

2) Pixel-Wise Memory Update

Whenever a frame xt triggers an update, the pixels with significant variation from the memory Bt are identified first. EnR encodes xt into key features KQ,t of size H×W×C/8 and value features VQ,t of size H×W×C/2. The shallower key features make matching efficient. Likewise, the memory Bt, containing kt pixels, is encoded into KR,t of size kt×C/8 and VR,t of size kt×C/2. Next, KQ,t is flattened and the cosine similarity is computed as follows:
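In the notation above, the similarity matrix can be sketched as:

```latex
S_{ij} = \frac{K_{Q,t}^{\,i} \cdot K_{R,t}^{\,j}}{\left\lVert K_{Q,t}^{\,i}\right\rVert \left\lVert K_{R,t}^{\,j}\right\rVert},
\qquad S \in \mathbb{R}^{HW \times k_t}
```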

The pixel-similarity vector Vp,t is computed by taking the largest similarity value in each row i of the matrix S:
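That is, plausibly:

```latex
V_{p,t}^{\,i} = \max_{j}\, S_{ij}
```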

Vp,t is sorted, and the β (≈10%) fraction of pixels exhibiting the most variation from the features in memory, i.e., the lowest similarity, is selected. The corresponding entries of KQ,t and VQ,t are then added directly to the memory Bt.
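The update step above can be sketched in NumPy as follows. This is an illustrative re-implementation under stated assumptions: the function name and the flattened (HW, C) layout are mine, and ties are broken arbitrarily by the sort.

```python
import numpy as np

def pixelwise_memory_update(k_query, v_query, k_mem, v_mem, beta=0.1):
    """Sketch of the pixel-wise memory update (names are illustrative).

    k_query: (HW, Ck) query keys,  v_query: (HW, Cv) query values,
    k_mem:   (K,  Ck) memory keys, v_mem:   (K,  Cv) memory values.
    Computes cosine similarity between every query pixel and the memory,
    keeps the beta fraction of query pixels least similar to the memory
    (i.e., with the most variation), and appends them to the memory.
    """
    qn = k_query / (np.linalg.norm(k_query, axis=1, keepdims=True) + 1e-8)
    mn = k_mem / (np.linalg.norm(k_mem, axis=1, keepdims=True) + 1e-8)
    sim = qn @ mn.T                       # (HW, K) cosine similarities
    per_pixel = sim.max(axis=1)           # best memory match per pixel
    n_new = max(1, int(beta * len(per_pixel)))
    novel = np.argsort(per_pixel)[:n_new]  # least similar = most novel
    k_mem = np.concatenate([k_mem, k_query[novel]], axis=0)
    v_mem = np.concatenate([v_mem, v_query[novel]], axis=0)
    return k_mem, v_mem
```

Storing only these novel pixels is what keeps the memory compact compared with appending entire frames.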

3) Pixel-Wise Memory Match

The localization map It and the query value VQ,t are decoded to obtain the mask of a frame. As shown in the diagram above, to generate the localization map, KQ,t and KR,t are reshaped to HW×C/8 and C/8×kt respectively, followed by a dot-product similarity to calculate It. The dot product is passed through a softmax function and multiplied with the memory value VR,t. The resulting HW×C/2 matrix is concatenated with VQ,t to obtain the activated feature VD, which is then passed to the decoder.

This method eliminates the redundant pixels and reduces the size of It to HW×kt, compared with HW×HWT if all historical frames and pixels had been used. This makes SwiftNet faster than other recent models.
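The attention-style readout just described can be sketched as below. This is a minimal NumPy version under assumptions: the function name is mine, tensors are pre-flattened to (HW, C), and the scaling of the dot product is omitted as in the description above.

```python
import numpy as np

def pixelwise_memory_match(k_query, v_query, k_mem, v_mem):
    """Sketch of pixel-wise memory matching (attention-style readout).

    k_query: (HW, Ck), v_query: (HW, Cv), k_mem: (K, Ck), v_mem: (K, Cv).
    Each query pixel attends over the K memory pixels; the retrieved
    memory value is concatenated with the query value to form the
    decoder input of width 2*Cv.
    """
    logits = k_query @ k_mem.T                    # (HW, K) localization map
    logits -= logits.max(axis=1, keepdims=True)   # numerically stable softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    retrieved = weights @ v_mem                   # (HW, Cv) memory readout
    return np.concatenate([retrieved, v_query], axis=1)  # (HW, 2*Cv)
```

Because the attention is over kt stored pixels rather than every pixel of every past frame, the cost of this step grows with the memory size, not with the video length.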

Light Aggregation Encoder

Both EnQ and EnR use ResNets for feature extraction. In addition to extracting features from the input image, EnR also performs frame-mask aggregation. The image and mask could simply be concatenated and then encoded, but this would require two passes over the image frame, once each through EnQ and EnR. Unlike previous approaches, where xt was encoded separately by EnQ and EnR, in SwiftNet the feature maps generated by EnQ for the image are reused by EnR. This makes SwiftNet quite efficient.

For frame-mask aggregation, we use a novel light-aggregation encoder, as shown in the diagram above. The upper blue cuboids represent the EnQ feature-map buffers and the lower green ones are the feature maps of the input mask. Vertically aligned features have the same size and are concatenated together. For the feature transformation of the input mask, we also use reversed sub-pixel (RSP) downsampling and 1x1 convolutions for channel adjustment. RSP allows downsampling without significant loss of information.
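Reversed sub-pixel downsampling is the inverse of the familiar sub-pixel (pixel-shuffle) upsampling: it folds each r×r spatial block into channels, so no values are discarded. A minimal NumPy sketch (the function name and (H, W, C) layout are mine):

```python
import numpy as np

def reversed_subpixel_downsample(x, r=2):
    """Reversed sub-pixel (space-to-depth) downsampling sketch.

    x: (H, W, C) feature map with H and W divisible by r. Every
    non-overlapping r x r spatial block is moved into the channel
    dimension, giving (H/r, W/r, C*r*r). Unlike strided pooling,
    all input values survive the downsampling.
    """
    h, w, c = x.shape
    assert h % r == 0 and w % r == 0
    x = x.reshape(h // r, r, w // r, r, c)
    x = x.transpose(0, 2, 1, 3, 4)      # (H/r, W/r, r, r, C)
    return x.reshape(h // r, w // r, c * r * r)
```

This is why RSP preserves information where pooling would not: the output is just a lossless rearrangement of the input tensor.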

Reversed Sub-Pixel Downsampling 


Segmentation results produced by SwiftNet on DAVIS-17

The SwiftNet model with ResNet-18 and ResNet-50 backbones was tested on the DAVIS 2017 and YouTube-VOS datasets. The metrics used are the Jaccard index (J) and the mean boundary F-score (F), along with their mean (J&F), for segmentation accuracy, and frames per second (FPS) for segmentation speed.

The above table shows the results on the DAVIS-2017 dataset. 

The above table shows the results on the YouTube-VOS dataset. Here, s and u represent seen and unseen classes, G is the overall score, and OL denotes online learning. On both datasets, SwiftNet is highly accurate and faster than other SOTA models; it lags behind only in the J-score for unseen classes on YouTube-VOS.

For more information on the experimental setup and ablation studies, refer to the paper.


This paper has introduced two novel methods, Pixel-Adaptive Memory and the Light Aggregation Encoder, to solve the speed-accuracy tradeoff of previous VOS models. PAM addresses the spatiotemporal redundancy in matching-based VOS that had been the bottleneck for real-time models. In doing so, this paper has created a strong baseline for future research in VOS. SwiftNet is highly efficient and, thanks to its speed, well suited to real-world applications.

Thapa Samrat
I am a second-year international student from Nepal, currently studying in the Department of Electronic and Information Engineering at Osaka University. I am interested in machine learning and deep learning, so I write articles about them in my spare time.

If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.
