
TrackFormer! Multi-Object Tracking With Transformers



3 main points
✔️ Simultaneous object detection and object tracking using transformers.
✔️ A new concept of autoregressive track queries to share information among video frames.
✔️ SOTA results on multiple benchmarks.

TrackFormer: Multi-Object Tracking with Transformers
written by Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, Christoph Feichtenhofer
(Submitted on 7 Jan 2021)
Comments: Accepted to arXiv.

Subjects: Computer Vision and Pattern Recognition (cs.CV)


Tracking objects is an important application of computer vision and artificial intelligence. Humans are mostly good at tracking one or a few objects at a time, but many applications require a system that can track several independent objects simultaneously. The common approach has been to first detect objects (using CNNs) in individual video frames and then associate the detections across frames. Most multi-object tracking (MOT) techniques differ in how they perform this association: graph optimization, CNN-based similarity scores between images, regression, etc.

Some previous MOT methods have used transformers for the association step. TrackFormer, the MOT model introduced in this paper, uses transformers to both detect objects and associate them simultaneously. It processes video frames autoregressively, allowing more information to be shared between frames, which makes it possible to track occluded objects and to detect newly appearing ones.

Detection Using Transformers

The module consists of a CNN backbone (like ResNet) that extracts frame-level features from the video. The features extracted by the CNN are encoded by the transformer encoder. The transformer decoder then produces output embeddings using the information from the encoder. Finally, an MLP head maps the decoded embeddings to bounding boxes and class predictions.
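This encoder-decoder pipeline can be sketched at the shape level in PyTorch. The following is a minimal illustration, not the authors' implementation: a single strided convolution stands in for the ResNet backbone, positional encodings are omitted, and all sizes are arbitrary:

```python
import torch
import torch.nn as nn

class MiniDetr(nn.Module):
    """Shape-level sketch of the detection path: CNN features ->
    transformer encoder -> decoder queried by learned object queries ->
    class and box heads. Sizes are illustrative only."""
    def __init__(self, d_model=64, n_queries=10, n_classes=5):
        super().__init__()
        # Stand-in "backbone": one strided conv producing a feature map
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        # N_obj learned object queries
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.class_head = nn.Linear(d_model, n_classes + 1)  # +1 for "no object"
        self.box_head = nn.Linear(d_model, 4)                # box coordinates

    def forward(self, images):                      # images: (B, 3, H, W)
        feats = self.backbone(images)               # (B, d, H/16, W/16)
        tokens = feats.flatten(2).transpose(1, 2)   # (B, HW/256, d)
        memory = self.encoder(tokens)
        q = self.queries.unsqueeze(0).expand(images.size(0), -1, -1)
        h = self.decoder(q, memory)                 # (B, N_obj, d)
        return self.class_head(h), self.box_head(h).sigmoid()
```

Each of the N_obj output embeddings thus yields one class distribution and one normalized bounding box.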

The transformer decoder outputs Nobj object embeddings for possible object detections in the respective frame. The frame features are combined with positional encodings, and the object embeddings are initialized from Nobj object queries. Unlike positional encodings, the object queries are learned parameters that can represent spatial properties of individual objects and help prevent multiple detections of the same object.

Detection Loss

The decoder outputs a set of predictions {ŷi}, i = 1…Nobj, each consisting of a bounding box b̂i and a class prediction ĉi. These predictions are matched to the ground truth {yi} using a cost based on the bounding boxes and class predictions. The minimum-cost mapping σ̂ is given by:
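The equation is reconstructed here from the surrounding description, following the set-matching formulation of DETR on which TrackFormer builds (notation may differ slightly from the paper):

```latex
\hat{\sigma} = \operatorname*{arg\,min}_{\sigma} \sum_{i}^{N_{\mathrm{obj}}} C_{\mathrm{match}}\left(y_i, \hat{y}_{\sigma(i)}\right) \qquad (1)
```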

The permutation σ̂ with the minimum total cost is chosen. The per-pair cost Cmatch is given by:
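Reconstructed from the description below, in the DETR-style form that TrackFormer adopts (the first term is the class-probability reward, the second the box penalty):

```latex
C_{\mathrm{match}}\left(y_i, \hat{y}_{\sigma(i)}\right) = -\hat{p}_{\sigma(i)}(c_i) + C_{\mathrm{box}}\left(b_i, \hat{b}_{\sigma(i)}\right) \qquad (2)
```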

Here, the first term is the predicted probability of the ground-truth class ci, and Cbox is a penalty for bounding box mispredictions, calculated as follows:
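Reconstructed from the term-by-term description below (an L1 term and a GIoU term, each with its own weight):

```latex
C_{\mathrm{box}}\left(b_i, \hat{b}_{\sigma(i)}\right) = \lambda_{l1} \left\lVert b_i - \hat{b}_{\sigma(i)} \right\rVert_1 + \lambda_{\mathrm{iou}}\, C_{\mathrm{iou}}\left(b_i, \hat{b}_{\sigma(i)}\right) \qquad (3)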

Here, the first term represents the L1 distance between the bounding boxes, and the second term is the generalized intersection over union (GIoU) cost. λl1 and λiou are the weights given to each term.

After obtaining the optimal σˆ, the detection loss can be computed as follows:
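Reconstructed in the DETR-style form consistent with equations (1)-(3): a cross-entropy term for the class and, for non-background matches, the box loss:

```latex
\mathcal{L}_{\mathrm{MOT}}\left(y, \hat{y}, \hat{\sigma}\right) = \sum_{i=1}^{N_{\mathrm{obj}}} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}\left(b_i, \hat{b}_{\hat{\sigma}(i)}\right) \right] \qquad (4)
```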


Here, Lbox can be computed using equation (3).
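As a concrete illustration of the matching step above, the cost matrix and its minimum-cost assignment can be sketched in Python. Function names, weights, and the plain-IoU stand-in for GIoU are illustrative choices, not the paper's code:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pairwise_iou(a, b):
    """IoU between every box in a (N, 4) and b (M, 4), [x1, y1, x2, y2] format."""
    tl = np.maximum(a[:, None, :2], b[None, :, :2])   # top-left of intersection
    br = np.minimum(a[:, None, 2:], b[None, :, 2:])   # bottom-right of intersection
    wh = np.clip(br - tl, 0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def match_predictions(class_probs, pred_boxes, gt_classes, gt_boxes,
                      lam_l1=5.0, lam_iou=2.0):
    """Bipartite matching of predictions to ground truth using the
    class and box costs described above.

    class_probs: (N_pred, num_classes) predicted class probabilities
    pred_boxes, gt_boxes: (N, 4) boxes; gt_classes: (N_gt,) class indices
    Returns (pred_indices, gt_indices) of the minimum-cost assignment.
    """
    cls_cost = -class_probs[:, gt_classes]                  # -p_hat(c_i)
    l1_cost = np.abs(pred_boxes[:, None] - gt_boxes[None]).sum(-1)
    iou_cost = 1.0 - pairwise_iou(pred_boxes, gt_boxes)     # IoU stand-in for GIoU
    cost = cls_cost + lam_l1 * l1_cost + lam_iou * iou_cost
    return linear_sum_assignment(cost)                      # Hungarian algorithm
```

Here lam_l1 and lam_iou play the role of λl1 and λiou; each prediction is paired with the ground-truth object for which the combined class/box cost is lowest.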

Track Query

Track queries are additional embeddings used to share spatial and identity information between adjacent frames. The track queries for an object are continuously updated. As shown in the figure, at frame t=0, the detector produces Nobj embeddings. For each frame t>0, an additional embedding is created for each of the Nobj output embeddings that resulted in an object detection, i.e., not the background. So each decoder step for t>0 takes Nobj + Ntrack embeddings, given by the object and track queries respectively. Equation (4) is slightly modified to obtain a new objective function.
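A plausible form of this modified objective, following the structure of equation (4) but summing over the combined query set (reconstructed from the description; notation may differ from the paper):

```latex
\mathcal{L}_{\mathrm{MOT}} = \sum_{i=1}^{N_{\mathrm{obj}} + N_{\mathrm{track}}} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}\left(b_i, \hat{b}_{\hat{\sigma}(i)}\right) \right] \qquad (5)
```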

This implies that the new objective function aims to detect new objects and track already-detected ones without any overlap. New objects are detected via the Nobj object queries, while the Ntrack track queries pass on the information of already-detected objects that are still present in the frame. Nobj is fixed, but Ntrack is dynamic: it decreases when an object exits the scene and increases when a new object is discovered. Ntrack depends on the object detections of the previous frame t-1.

Before the track queries from the previous frame are concatenated with the object queries, they are transformed by an independent self-attention block.
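This query-preparation step can be sketched as follows (class name, sizes, and attention configuration are illustrative, not the authors' implementation):

```python
import torch
import torch.nn as nn

class TrackQueryConcat(nn.Module):
    """Sketch: transform frame t-1's track queries with an independent
    self-attention block, then concatenate them with the static object
    queries to form the decoder input for frame t."""
    def __init__(self, d_model=64, n_obj=10):
        super().__init__()
        self.object_queries = nn.Parameter(torch.randn(n_obj, d_model))
        self.track_attn = nn.MultiheadAttention(d_model, num_heads=4,
                                                batch_first=True)

    def forward(self, track_queries):   # (B, N_track, d) from frame t-1
        # Independent self-attention over the track queries only
        t, _ = self.track_attn(track_queries, track_queries, track_queries)
        obj = self.object_queries.unsqueeze(0).expand(track_queries.size(0), -1, -1)
        # Decoder input for frame t: N_obj + N_track queries
        return torch.cat([obj, t], dim=1)
```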


We employ a two-step training process that makes use of two adjacent video frames. The first frame is used for object detection and the second for tracking and detecting new objects. In the first step, equation (4) is optimized for object detection; in the second step, the loss function given by (5) is used for tracking objects and detecting new ones.

As described earlier, a bipartite matching technique is used to find the ground truth corresponding to the objects detected in a frame. To map the Ntrack track identities to their ground truth, the track queries from the previous frame are used, since they inherently carry the object's identity information. A track query may match a ground-truth label or the background class. If no existing track matches a ground-truth object, a new object has been detected, and the bipartite matching technique is used to assign it.


1) To make the model robust, frame t-1 is sampled from a set of frames near frame t. This allows the model to track effectively even on videos with a low frame rate.

2) Generally, the number of new objects appearing per frame is relatively low. To compensate, track queries are sampled with a probability pFN and removed before the decoder step. This forces the model to re-detect the corresponding objects as new ones, making it more effective at detecting newly appearing objects.

Similarly, to teach the model to remove tracks, false-positive track queries are added with a probability pFP before the decoder step.

3) Random spatial jittering.  
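The track-query augmentations in 2) above can be sketched as follows (function name, probabilities, and noise scale are illustrative assumptions):

```python
import torch

def augment_track_queries(track_queries, p_fn=0.1, p_fp=0.1):
    """Illustrative track-query augmentation for training.

    track_queries: (N_track, d) embeddings carried over from frame t-1.
    - Each query is dropped with probability p_fn, simulating a false
      negative: the model must re-detect that object as a new one.
    - A noisy copy of a kept query is appended with probability p_fp,
      simulating a false-positive track the model should assign to
      the background class.
    """
    keep = torch.rand(track_queries.size(0)) >= p_fn
    kept = track_queries[keep]
    fp = torch.rand(kept.size(0)) < p_fp
    false_positives = kept[fp] + 0.1 * torch.randn_like(kept[fp])
    return torch.cat([kept, false_positives], dim=0)
```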


TrackFormer outperforms several state-of-the-art models such as Track R-CNN and PointTrack, achieving state-of-the-art results on the MOT17 and MOTS20 benchmarks. The following is a comparison of the segmentation results of TrackFormer and Track R-CNN:

TrackFormer achieves better pixel mask accuracy.

TrackFormer outperforms all other models on the MOTA metric on the challenging MOT17 dataset.

TrackFormer achieves state-of-the-art performance on the MOTS20 dataset.

 For implementation details, please refer to the original paper.


Transformers have made their way into machine translation, image recognition, 3D point processing, and now multi-object tracking. TrackFormer makes end-to-end multi-object tracking possible by using a self-attention mechanism, eliminating the need for additional operations such as graph optimization. TrackFormer proves its competence by achieving impressive results on the challenging MOTS20 and MOT17 datasets for both object detection and segmentation tasks. It will be interesting to see future work on the use of self-attention for MOT tasks.



Thapa Samrat
I am a second-year international student from Nepal, currently studying at the Department of Electronic and Information Engineering at Osaka University. I am interested in machine learning and deep learning, so I write articles about them in my spare time.
