
TrackFormer! Multi-Object Tracking With Transformers



3 main points
✔️ Simultaneous object detection and object tracking using transformers.
✔️ A new concept of autoregressive track queries to share information among video frames.
✔️ SOTA results on multiple benchmarks.

TrackFormer: Multi-Object Tracking with Transformers
written by Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, Christoph Feichtenhofer
(Submitted on 7 Jan 2021)
Comments: Accepted to arXiv.

Subjects: Computer Vision and Pattern Recognition (cs.CV)


Tracking objects is an important application of computer vision and artificial intelligence. Humans are mostly good at tracking one or a few objects at a time, but many applications require a system that can track several independent objects simultaneously. The common approach has been to first detect objects (using CNNs) in individual video frames and then associate the detections across frames. Most multi-object tracking (MOT) techniques differ in how they perform this association: graph optimization, CNN-based similarity scores between images, regression, etc.

Some previous MOT methods have used transformers for the association step. TrackFormer, the MOT model introduced in this paper, uses transformers to both detect objects and associate them simultaneously. It processes video frames autoregressively, allowing more information to be shared between frames, which makes it possible to track occluded objects and to detect newly appearing ones.

Detection Using Transformers

The module consists of a CNN backbone (like ResNet) that extracts frame-level features from the video. The features extracted by the CNN are encoded by the transformer encoder. The transformer decoder then produces output embeddings using the information from the encoder. Finally, an MLP head maps the decoded embeddings to bounding boxes and class predictions.
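This encoder-decoder pipeline can be sketched at the shape level in PyTorch. The following is a minimal illustration, not the authors' implementation: a single strided convolution stands in for the ResNet backbone, positional encodings are omitted, and all sizes are arbitrary:

```python
import torch
import torch.nn as nn

class MiniDetr(nn.Module):
    """Shape-level sketch of the detection path: CNN features ->
    transformer encoder -> decoder queried by learned object queries ->
    class and box heads. Sizes are illustrative only."""
    def __init__(self, d_model=64, n_queries=10, n_classes=5):
        super().__init__()
        # Stand-in "backbone": one strided conv producing a feature map
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        # N_obj learned object queries
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.class_head = nn.Linear(d_model, n_classes + 1)  # +1 for "no object"
        self.box_head = nn.Linear(d_model, 4)                # box coordinates

    def forward(self, images):                      # images: (B, 3, H, W)
        feats = self.backbone(images)               # (B, d, H/16, W/16)
        tokens = feats.flatten(2).transpose(1, 2)   # (B, HW/256, d)
        memory = self.encoder(tokens)
        q = self.queries.unsqueeze(0).expand(images.size(0), -1, -1)
        h = self.decoder(q, memory)                 # (B, N_obj, d)
        return self.class_head(h), self.box_head(h).sigmoid()
```

Each of the N_obj output embeddings thus yields one class distribution and one normalized bounding box.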

The transformer decoder outputs Nobj object embeddings for possible object detections in the respective frame. The frame features are combined with positional encodings, and the object embeddings are initialized from Nobj object queries. Unlike positional encodings, the object queries are learned parameters that can represent spatial properties of individual objects and help prevent multiple detections of the same object.

Detection Loss

The decoder outputs a set of predictions {ŷi}, i = 1…Nobj, each consisting of a bounding box b̂i and a class prediction ĉi. These predictions are matched to the ground truth {yi} using a cost based on the bounding boxes and class predictions. The minimum-cost mapping σ̂ is given by:
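The equation is reconstructed here from the surrounding description, following the set-matching formulation of DETR on which TrackFormer builds (notation may differ slightly from the paper):

```latex
\hat{\sigma} = \operatorname*{arg\,min}_{\sigma} \sum_{i}^{N_{\mathrm{obj}}} C_{\mathrm{match}}\left(y_i, \hat{y}_{\sigma(i)}\right) \qquad (1)
```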

The permutation σ̂ with the minimum total cost is chosen. The per-pair cost Cmatch is given by:
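Reconstructed from the description below, in the DETR-style form that TrackFormer adopts (the first term is the class-probability reward, the second the box penalty):

```latex
C_{\mathrm{match}}\left(y_i, \hat{y}_{\sigma(i)}\right) = -\hat{p}_{\sigma(i)}(c_i) + C_{\mathrm{box}}\left(b_i, \hat{b}_{\sigma(i)}\right) \qquad (2)
```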

Here, the first term is the predicted probability of the ground-truth class ci, and Cbox is a penalty for bounding box mispredictions, calculated as follows:
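Reconstructed from the term-by-term description below (an L1 term and a GIoU term, each with its own weight):

```latex
C_{\mathrm{box}}\left(b_i, \hat{b}_{\sigma(i)}\right) = \lambda_{l1} \left\lVert b_i - \hat{b}_{\sigma(i)} \right\rVert_1 + \lambda_{\mathrm{iou}}\, C_{\mathrm{iou}}\left(b_i, \hat{b}_{\sigma(i)}\right) \qquad (3)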

Here, the first term represents the L1 distance between the bounding boxes, and the second term is the generalized intersection over union (GIoU) cost. λl1 and λiou are the weights given to each term.

After obtaining the optimal σˆ, the detection loss can be computed as follows:
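Reconstructed in the DETR-style form consistent with equations (1)-(3): a cross-entropy term for the class and, for non-background matches, the box loss:

```latex
\mathcal{L}_{\mathrm{MOT}}\left(y, \hat{y}, \hat{\sigma}\right) = \sum_{i=1}^{N_{\mathrm{obj}}} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}\left(b_i, \hat{b}_{\hat{\sigma}(i)}\right) \right] \qquad (4)
```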


Here, Lbox can be computed using equation (3).
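As a concrete illustration of the matching step above, the cost matrix and its minimum-cost assignment can be sketched in Python. Function names, weights, and the plain-IoU stand-in for GIoU are illustrative choices, not the paper's code:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pairwise_iou(a, b):
    """IoU between every box in a (N, 4) and b (M, 4), [x1, y1, x2, y2] format."""
    tl = np.maximum(a[:, None, :2], b[None, :, :2])   # top-left of intersection
    br = np.minimum(a[:, None, 2:], b[None, :, 2:])   # bottom-right of intersection
    wh = np.clip(br - tl, 0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def match_predictions(class_probs, pred_boxes, gt_classes, gt_boxes,
                      lam_l1=5.0, lam_iou=2.0):
    """Bipartite matching of predictions to ground truth using the
    class and box costs described above.

    class_probs: (N_pred, num_classes) predicted class probabilities
    pred_boxes, gt_boxes: (N, 4) boxes; gt_classes: (N_gt,) class indices
    Returns (pred_indices, gt_indices) of the minimum-cost assignment.
    """
    cls_cost = -class_probs[:, gt_classes]                  # -p_hat(c_i)
    l1_cost = np.abs(pred_boxes[:, None] - gt_boxes[None]).sum(-1)
    iou_cost = 1.0 - pairwise_iou(pred_boxes, gt_boxes)     # IoU stand-in for GIoU
    cost = cls_cost + lam_l1 * l1_cost + lam_iou * iou_cost
    return linear_sum_assignment(cost)                      # Hungarian algorithm
```

Here lam_l1 and lam_iou play the role of λl1 and λiou; each prediction is paired with the ground-truth object for which the combined class/box cost is lowest.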

Track Query

Track queries are additional embeddings used to share spatial and identity information between adjacent frames. The track queries for an object are continuously updated. As shown in the figure, at frame t=0, the detector produces Nobj embeddings. For each frame t>0, an additional embedding is created for each of the Nobj output embeddings that resulted in an object detection, i.e., not the background. So each decoder step for t>0 takes Nobj + Ntrack embeddings, given by the object and track queries respectively. Equation (4) is slightly modified to obtain a new objective function.
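A plausible form of this modified objective, following the structure of equation (4) but summing over the combined query set (reconstructed from the description; notation may differ from the paper):

```latex
\mathcal{L}_{\mathrm{MOT}} = \sum_{i=1}^{N_{\mathrm{obj}} + N_{\mathrm{track}}} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}\left(b_i, \hat{b}_{\hat{\sigma}(i)}\right) \right] \qquad (5)
```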

This implies that the new objective function aims to detect new objects and track already-detected ones without any overlap. New objects are detected via the Nobj object queries, while the Ntrack track queries pass on the information of already-detected objects that are still present in the frame. Nobj is fixed, but Ntrack is dynamic: it decreases when an object exits the scene and increases when a new object is discovered. Ntrack depends on the object detections of the previous frame t-1.

Before the track queries from the previous frame are concatenated with the object queries, they are transformed by an independent self-attention block.
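This query-preparation step can be sketched as follows (class name, sizes, and attention configuration are illustrative, not the authors' implementation):

```python
import torch
import torch.nn as nn

class TrackQueryConcat(nn.Module):
    """Sketch: transform frame t-1's track queries with an independent
    self-attention block, then concatenate them with the static object
    queries to form the decoder input for frame t."""
    def __init__(self, d_model=64, n_obj=10):
        super().__init__()
        self.object_queries = nn.Parameter(torch.randn(n_obj, d_model))
        self.track_attn = nn.MultiheadAttention(d_model, num_heads=4,
                                                batch_first=True)

    def forward(self, track_queries):   # (B, N_track, d) from frame t-1
        # Independent self-attention over the track queries only
        t, _ = self.track_attn(track_queries, track_queries, track_queries)
        obj = self.object_queries.unsqueeze(0).expand(track_queries.size(0), -1, -1)
        # Decoder input for frame t: N_obj + N_track queries
        return torch.cat([obj, t], dim=1)
```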


We employ a two-step training process that makes use of two adjacent video frames. The first frame is used for object detection and the second for tracking and detecting new objects. In the first step, equation (4) is optimized for object detection; in the second step, the loss function given by (5) is used for tracking objects and detecting new ones.

As described earlier, a bipartite matching technique is used to find the ground truth corresponding to the objects detected in a frame. To map the Ntrack track identities to their ground truth, the track queries from the previous frame are used, since they inherently carry the object's identity information. A track query may match a ground-truth label or the background class. If no existing track matches a ground-truth object, a new object has been detected, and the bipartite matching technique is used to assign it.


1) To make the model robust, frame t-1 is sampled from a set of frames near frame t. This allows the model to track effectively even on videos with a low frame rate.

2) Generally, the number of new objects appearing per frame is relatively low. To compensate, track queries are sampled with a probability pFN and removed before the decoder step. This forces the model to re-detect the corresponding objects as new ones, making it more effective at detecting newly appearing objects.

Similarly, to teach the model to remove tracks, false-positive track queries are added with a probability pFP before the decoder step.

3) Random spatial jittering.  
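The track-query augmentations in 2) above can be sketched as follows (function name, probabilities, and noise scale are illustrative assumptions):

```python
import torch

def augment_track_queries(track_queries, p_fn=0.1, p_fp=0.1):
    """Illustrative track-query augmentation for training.

    track_queries: (N_track, d) embeddings carried over from frame t-1.
    - Each query is dropped with probability p_fn, simulating a false
      negative: the model must re-detect that object as a new one.
    - A noisy copy of a kept query is appended with probability p_fp,
      simulating a false-positive track the model should assign to
      the background class.
    """
    keep = torch.rand(track_queries.size(0)) >= p_fn
    kept = track_queries[keep]
    fp = torch.rand(kept.size(0)) < p_fp
    false_positives = kept[fp] + 0.1 * torch.randn_like(kept[fp])
    return torch.cat([kept, false_positives], dim=0)
```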


TrackFormer outperforms several state-of-the-art models such as Track R-CNN and PointTrack, achieving state-of-the-art results on the MOT17 and MOTS20 benchmarks. The following is a comparison of the segmentation results of TrackFormer and Track R-CNN:

TrackFormer achieves better pixel mask accuracy.

TrackFormer outperforms all other models on the MOTA metric on the challenging MOT17 dataset.

TrackFormer achieves state-of-the-art performance on the MOTS20 dataset.

 For implementation details, please refer to the original paper.


Transformers have made their way into machine translation, image recognition, 3D point processing, and now multi-object tracking. TrackFormer makes end-to-end multi-object tracking possible by using a self-attention mechanism, eliminating the need for additional operations such as graph optimization. TrackFormer proves its competence by achieving impressive results on the challenging MOTS20 and MOT17 datasets for both object detection and segmentation tasks. It will be interesting to see future work on the use of self-attention for MOT tasks.



Thapa Samrat
I am a second-year international student from Nepal, currently studying at the Department of Electronic and Information Engineering at Osaka University. I am interested in machine learning and deep learning, so I write articles about them in my spare time.
