GIAOTracker: Proposing A Comprehensive Framework For Multi-class, Multi-object Tracking!

Object Tracking

3 main points
✔️ Proposes a three-stage framework for Multi-Class Multi-Object Tracking (MCMOT)
✔️ Achieves higher accuracy than existing tracking models on drone video tracking tasks
✔️ Each technique is generic and can easily be implemented in other tracking methods

GIAOTracker: A comprehensive framework for MCMOT with global information and optimizing strategies in VisDrone 2021
written by Yunhao Du, Junfeng Wan, Yanyun Zhao, Binyu Zhang, Zhihang Tong, Junhao Dong
(Submitted on 24 Feb 2022)
ICCV 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

This article introduces a paper on one of the most intuitive and exciting tasks in image processing AI: tracking, that is, detecting and following individual objects in raw video data. The tasks most often mentioned in image processing AI are image classification, object detection, and semantic segmentation. Compared with these, object tracking is explained less often despite its importance. Object detection alone cannot capture motion information, and image classification is highly restrictive because it assumes that each image contains, within the angle of view, a representative view of the object to be classified. Object tracking, on the other hand, captures detailed temporal changes of objects from raw video data, which makes it very useful for data analysis, crime prevention, robot control, and so on.

In this article, we explain "GIAOTracker: A comprehensive framework for MCMOT with global information and optimizing strategies in VisDrone 2021", which proposes a highly accurate framework for the MCMOT (Multi-Class Multi-Object Tracking) task in drone video.

MCMOT is a challenging task because it requires tracking multiple objects of various classes simultaneously, rather than focusing on a specific class or a single object. Drone video additionally requires techniques to deal with the following conditions, which make it more complex and difficult than video from fixed cameras such as surveillance cameras.

  • A large number of small objects, as is typical of aerial footage
  • Irregular object movement and camera movement
  • The problem of objects being hidden by obstacles such as trees and bridges (occlusion)

These issues occur in video data in general, but they are especially pronounced in drone video, and techniques that solve them are expected to carry over to a variety of other tracking tasks.

GIAOTracker divides tracking into three stages and proposes and combines effective techniques for each of them. The overall picture of GIAOTracker, which consists of "Online Tracking", "Global Link", and "Post Processing", is shown in the figure below.


Furthermore, the authors claim that the techniques of each stage are not specific to GIAOTracker and can easily be applied to other tracking methods to improve their accuracy. An overview of each stage is as follows.

Overview of GIAOTracker

In the experiments, ablation studies demonstrated the effectiveness of each stage relative to the DeepSORT baseline, and in a comparison with SOTA models on the VisDrone MOT dataset the method achieved second place. The authors claim that accuracy will improve further as the detection model becomes more accurate and the training data broader. Let's take a look at the details of this method.

Systematic positioning

We start by explaining where this method sits among tracking approaches. Tracking is, precisely, the task of linking the same object across frames after it has been detected in each frame. The position at time t+1 of an object detected at time t is predicted and then matched against the actual detection results, which gives the frame-to-frame matching (ReID). In other words, tracking is built on detection, and tracking accuracy inevitably depends heavily on the accuracy of the detection stage.

Tracking approaches fall into two broad groups. SDE (Separate Detection and Embedding), also called Tracking by Detection, trains the detector first and then trains the tracking components. Because detection and tracking are optimized separately, it is more flexible and better suited to complex videos. JDE (Joint Detection and Embedding), on the other hand, unifies detection and tracking in a single model by adding branches for tracking feature extraction and prediction to detectors such as Faster R-CNN, CenterNet, and YOLOv3, so that both are performed simultaneously. In general, JDE is faster than SDE but less accurate in complex environments. For this reason, the paper selects SDE for complex drone videos and uses DeepSORT as the base method.

In addition, the words "tracklet" and "trajectory" often appear in papers on tracking. Both refer to a track obtained as a result of tracking. My impression from reading various papers is that "Tracklet" refers to "an incomplete trajectory which is tracked accurately but only for a short period" and "Trajectory" refers to "a complete trajectory which is a combination of multiple tracklets". This article uses the terms in the same way.

Online tracking (Stage1)

In this stage, the aim is to generate highly accurate Tracklets by using a new feature update method, object motion prediction, and camera motion compensation.

The difference between online and offline tracking is whether or not future information is used. Online tracking uses only the current and past frames, matching the latest detections against the existing tracks. Offline tracking, on the other hand, generates complete trajectories by combining the tracklets generated by online tracking. We start with the first stage, online tracking. The figure below shows the difference between the baseline DeepSORT and the proposed method. The keys are the appearance features and object motion prediction; the cost matrix is generated from this information.

EMA Bank

EMA (Exponential Moving Average) Bank introduces an inertia term when updating appearance features, so that a single feature both represents the Tracklet and follows changes in its appearance. Tracking requires appearance features that can distinguish between objects that the detection stage assigns to the same class. The proposed method uses a model called OSNet instead of the simple feature extractor used in DeepSORT and proposes a new feature update method called EMA Bank. The conventional method stores the features of every frame of a tracklet as they are, which has the disadvantage of being easily affected by detection noise. EMA Bank instead keeps one feature per Tracklet, updated and maintained with the inertia term.

e_i^t = α e_i^(t-1) + (1 - α) f_i^t

Here e_i^t is the feature of Tracklet i at time t, f_i^t is the feature of the detection matched to it at time t, and α is the inertia term. The update reflects changes in the feature while the averaging reduces detection noise.
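
As a rough illustration, the EMA update can be sketched in a few lines of Python. This is a minimal sketch and not the authors' code: the function names, the value α = 0.9, and the L2 normalization after each update are assumptions made for the example.

```python
import math

def l2_normalize(vec):
    """Scale a feature vector to unit length (no-op for an all-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def ema_update(track_feature, det_feature, alpha=0.9):
    """e_t = alpha * e_(t-1) + (1 - alpha) * f_t, then renormalize (assumed)."""
    blended = [alpha * e + (1.0 - alpha) * f
               for e, f in zip(track_feature, det_feature)]
    return l2_normalize(blended)

# One tracklet, three detections; the second one is a noisy outlier.
feature = l2_normalize([1.0, 0.0])
for det in ([1.0, 0.1], [0.0, 1.0], [1.0, 0.0]):
    feature = ema_update(feature, l2_normalize(det))
    print([round(x, 3) for x in feature])
```

Because only one vector is kept per Tracklet, the single outlier detection in the middle shifts the stored feature only slightly, and the next clean detection pulls it back.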

NSA Kalman

NSA Kalman adaptively varies the measurement noise of the Kalman filter according to the confidence of each detection in order to cope with complex object motion. The Kalman filter assumes that what we can observe (the coordinates) differs from the true state of the object (its coordinates) because of noise. In tracking, this noise represents the misalignment of the bbox at the detection stage, and the linear Kalman filter assumes that the noise is common to all objects. NSA Kalman, on the other hand, varies it depending on the confidence of the detection.

R̃k = (1 - ck) Rk

Rk is the preset measurement noise covariance, and ck is the detection confidence of each object. The higher the confidence that the detector outputs for an object, the smaller the measurement noise assigned to it, so confident detections pull the state estimate more strongly and positions can be estimated even for complex object motion. The exception is the vehicle class, which uses an Unscented Kalman Filter (UKF) that is more robust to non-linear motion.
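
To see the effect of scaling the noise by confidence, here is a minimal one-dimensional sketch. It assumes the scaling takes the form (1 - ck) times a preset noise value; the scalar Kalman update, the function names, and all numbers are simplifications chosen for illustration, whereas a real tracker filters a multi-dimensional box state.

```python
def nsa_measurement_noise(base_noise, confidence):
    """Scale the preset measurement noise variance by (1 - confidence)."""
    return (1.0 - confidence) * base_noise

def kalman_update_1d(pred_mean, pred_var, measurement, meas_var):
    """Standard scalar Kalman update: returns (posterior mean, posterior variance)."""
    gain = pred_var / (pred_var + meas_var)
    mean = pred_mean + gain * (measurement - pred_mean)
    var = (1.0 - gain) * pred_var
    return mean, var

predicted_x, predicted_var = 100.0, 4.0   # predicted box centre and its variance
detected_x, base_noise = 110.0, 4.0       # detector output and preset noise

for conf in (0.5, 0.9):
    r = nsa_measurement_noise(base_noise, conf)
    mean, _ = kalman_update_1d(predicted_x, predicted_var, detected_x, r)
    print(f"confidence={conf}: updated x = {mean:.2f}")
```

With the higher confidence, the updated position ends up closer to the detection, because the filter trusts that measurement more.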

Other components

In tracking, it is also important to cope with camera motion. The proposed method compensates for camera motion using ORB features and RANSAC.

In addition, the target data may contain classes that are difficult to distinguish (e.g., cars and vans). Instead of tracking each class independently, the paper proposes a method called Rough2Fine, which first performs tracking in a coarse, highly abstract category and then determines the fine-grained category of each tracklet with a voting mechanism. The conventional approach is a "hard-vote", in which each detection counts toward a single category, whereas this method is a "soft-vote", in which each detection is assigned multiple classes and the votes are weighted by the detection confidence of each class.
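
The difference between the two voting schemes is easy to see on a toy tracklet. The sketch below is only an illustration under assumed inputs: each detection is represented as a dictionary of per-class confidences, and the scores and function names are made up for the example.

```python
from collections import defaultdict

def hard_vote(detections):
    """Each detection casts one vote for its top-scoring class."""
    votes = defaultdict(int)
    for class_scores in detections:
        votes[max(class_scores, key=class_scores.get)] += 1
    return max(votes, key=votes.get)

def soft_vote(detections):
    """Each detection votes for every candidate class, weighted by confidence."""
    votes = defaultdict(float)
    for class_scores in detections:
        for cls, score in class_scores.items():
            votes[cls] += score
    return max(votes, key=votes.get)

# Hypothetical per-frame class confidences for one tracklet.
tracklet = [
    {"car": 0.55, "van": 0.45},
    {"car": 0.52, "van": 0.48},
    {"car": 0.05, "van": 0.95},
]
print(hard_vote(tracklet), soft_vote(tracklet))
```

Here two near-coin-flip frames outvote one confident frame under the hard vote, while the soft vote lets the confident frame decide the class.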

Global Link (Stage2)

In this stage, we aim to generate a complete Trajectory by linking the Tracklets generated in Stage1.

Stage1 generates highly accurate Tracklets. However, because they come from online tracking, they are incomplete. Online tracking is fast, but once an object is lost, it is difficult to recognize it as the same object even if it is detected again. This is because the object keeps moving during the interruption, and its size, orientation, and angle can change significantly, which makes matching difficult. The Tracklets of the same object therefore need to be reconnected after online tracking, and the offline processing that does this is global linking. Whereas online tracking matches detection results, global linking matches tracking results (Tracklets). This step-by-step matching lets the proposed method achieve high accuracy. The keys are GIModel and the Tracklet matching cost. The paper proposes GIModel as the appearance feature extractor used in this matching.


The proposed method, GIModel, is an appearance feature extractor that is robust against detection noise and sudden changes in object appearance. It extracts global and local features of the Tracklet generated in Stage1 and uses Transformer to perform feature extraction with proper consideration of inter-frame relationships.

Tracklet appearance features are extracted differently from online tracking. First, GIModel feeds each of the N frames of the Tracklet into a CNN (ResNet50-TP) to obtain a feature map. In the conventional approach (a), the obtained features are simply flattened and connected to a task-specific layer such as a fully connected layer. GIModel instead combines global features with local features, so that it attends to detailed features of different parts of the object and becomes more robust to occlusion, because it can make better use of the parts of the object that remain visible while it is occluded.

Next, the features of the N frames must be integrated. The traditional method (c) simply average-pools the N features, but simple averaging does not capture the relationship between frames well. GIModel instead feeds the features into a Transformer Encoder before averaging. Because the Transformer's Self-Attention mechanism can learn relationships between its inputs, the model can learn the relationship between frames more flexibly before the features are averaged.

The above methods can extract features robust against detection noise and sudden changes in object appearance.

Cost matrix

Online tracking uses appearance features and location information for matching. Tracklet matching uses time information in the matching cost as well. The matching cost for Tracklets i and j is a weighted combination of three terms.

Ca is the cosine distance between the features extracted by GIModel, Ct is the time difference between the Tracklets, and Cs is the spatial distance. To make the matching more appropriate, a threshold is set for each cost, and only pairs that satisfy all the conditions are considered for matching.
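
The combination of weighting and thresholding can be sketched as follows. The weighted-sum form, the weight and threshold values, and the units (frames for Ct, pixels for Cs) are assumptions for illustration and are not taken from the paper.

```python
import math

def link_cost(ca, ct, cs,
              weights=(1.0, 0.01, 0.01),        # hypothetical weights
              thresholds=(0.3, 30.0, 50.0)):    # hypothetical per-cost limits
    """Return the weighted cost, or math.inf if any term exceeds its threshold."""
    terms = (ca, ct, cs)
    if any(term > limit for term, limit in zip(terms, thresholds)):
        return math.inf  # this pair is excluded from matching
    return sum(w * term for w, term in zip(weights, terms))

# Similar appearance, 10 frames apart, 20 px apart: a valid candidate pair.
print(link_cost(0.1, 10.0, 20.0))
# Appearance too different: gated out regardless of the other terms.
print(link_cost(0.5, 10.0, 20.0))
```

Pairs whose cost is infinite are never linked; the remaining finite costs would then go into an assignment step over all Tracklet pairs.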

The global link can link incomplete Tracklets obtained in Stage1 based on the appearance features and spatiotemporal distance as described above.

Post Processing (Stage3)

In this stage, four post-processing steps are applied to the Trajectories generated by Stage1 and Stage2 to refine them.

Although post-processing tends to be neglected in tracking, the proposed method further improves accuracy by applying four effective post-processing steps: denoising, interpolation, rescoring, and trajectory fusion.


If the detector outputs duplicate detections for one object, the number of meaningless Trajectories increases. It is common to prevent duplicate detections by applying NMS based on IoU in the spatial direction. The proposed method instead applies SoftNMS based on Temporal-IoU, in the time direction, to remove Trajectories rather than detections.
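
This article does not spell out how Temporal-IoU is computed, so the sketch below shows only one plausible reading: each Trajectory is treated as a space-time tube, box overlaps are summed over the frames the two Trajectories share, and the result is divided by the union of both tubes. The data layout (frame index mapped to a box) and the function names are assumptions.

```python
def box_area(box):
    """Area of an (x1, y1, x2, y2) box; zero if the box is empty."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def box_intersection(a, b):
    """Overlap area of two (x1, y1, x2, y2) boxes."""
    return box_area((max(a[0], b[0]), max(a[1], b[1]),
                     min(a[2], b[2]), min(a[3], b[3])))

def temporal_iou(traj_a, traj_b):
    """IoU of two trajectories, each a dict: frame index -> (x1, y1, x2, y2)."""
    shared = traj_a.keys() & traj_b.keys()
    inter = sum(box_intersection(traj_a[f], traj_b[f]) for f in shared)
    total = sum(map(box_area, traj_a.values())) + sum(map(box_area, traj_b.values()))
    union = total - inter
    return inter / union if union > 0 else 0.0

a = {0: (0, 0, 10, 10), 1: (0, 0, 10, 10)}
b = {1: (0, 0, 10, 10), 2: (0, 0, 10, 10)}
print(temporal_iou(a, b))
```

Two Trajectories that cover the same object over the same frames score close to one and are candidates for suppression, while Trajectories that never share a frame score zero.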


Next is the interpolation of gaps in tracking caused by missed detections. Global linking connects the Tracklets before and after an interruption, but the interrupted interval itself is left unfilled. The proposed method applies linear interpolation when the gap is within 60 frames, which improves accuracy.
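
Gap filling by linear interpolation can be sketched as below. The 60-frame limit comes from the text; the data layout (frame index mapped to a box), the function name, and the choice to count the gap as the number of missing frames are assumptions.

```python
def interpolate_gaps(track, max_gap=60):
    """Fill missing frames of one trajectory by linear interpolation.

    track maps frame index -> (x1, y1, x2, y2). Gaps with more than
    max_gap missing frames are left untouched.
    """
    filled = dict(track)
    frames = sorted(track)
    for start, end in zip(frames, frames[1:]):
        missing = end - start - 1
        if 0 < missing <= max_gap:
            for frame in range(start + 1, end):
                ratio = (frame - start) / (end - start)
                filled[frame] = tuple(
                    a + ratio * (b - a)
                    for a, b in zip(track[start], track[end])
                )
    return filled

track = {0: (0, 0, 10, 10), 4: (40, 0, 50, 10)}
filled = interpolate_gaps(track)
print(sorted(filled))   # frames 1-3 are now present
print(filled[2])        # box halfway between the two observed boxes
```

Limiting the gap length matters because a straight line is a reasonable guess for a short interruption but becomes unreliable when the object was lost for a long time.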


In the detection task, the confidence of each detection is thresholded during evaluation, while in the tracking task the average confidence of a Trajectory is used. Instead of a simple average, the proposed Rescoring weights the confidence according to the Trajectory length. The weight ωi of Trajectory i is a function of its length li: the closer the length is to zero, the closer the weight is to zero, and the longer the Trajectory, the closer the weight is to one. This gives each Trajectory a more appropriate score for thresholding during evaluation.
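
The exact weighting equation is not reproduced in this article, so the sketch below uses a stand-in function chosen only because it has the behavior described: zero at length zero and approaching one for long Trajectories. The form 1 - exp(-l / tau), the constant tau, and the function names are all assumptions, not the paper's formula.

```python
import math

def length_weight(length, tau=20.0):
    """Hypothetical weight: 0 for length 0, approaching 1 as length grows."""
    return 1.0 - math.exp(-length / tau)

def rescore(confidences, tau=20.0):
    """Length-weighted mean confidence of one trajectory."""
    if not confidences:
        return 0.0
    mean_conf = sum(confidences) / len(confidences)
    return length_weight(len(confidences), tau) * mean_conf

short = rescore([0.9] * 5)     # 5-frame trajectory
long_ = rescore([0.9] * 100)   # 100-frame trajectory
print(round(short, 3), round(long_, 3))
```

Both Trajectories have the same average detection confidence, but the short one receives a much lower score, so a fixed evaluation threshold filters out short, likely spurious tracks first.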

Trajectory Fusion

Here the tracking results from multiple models are fused. Although approaches that use the results of multiple models have been taken in image classification and object detection, they have not received much attention in tracking. The paper therefore proposes a method called TrackNMS, which removes unnecessary Trajectories from the combined Trajectories of multiple models and keeps only the useful ones. The details of the mechanism are as follows.

  • TrackNMS is based on SoftNMS; whereas SoftNMS suppresses detection results based on IoU in the spatial direction, TrackNMS suppresses tracking results of multiple models based on IoU in the temporal direction.
  • SoftNMS uses confidence when sorting to select detections to suppress, while TrackNMS sorts based on the sum of the confidence of each frame. Again, longer trajectories tend to have higher priority due to Rescoring.
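
The two points above can be put together in a short sketch. It is not the authors' implementation: the Gaussian score decay is borrowed from the usual SoftNMS formulation, the overlap measure is passed in as a function, and the demo uses a simple frame-set overlap as a stand-in for Temporal-IoU. The names, sigma, and the score cutoff are assumptions.

```python
import math

def track_nms(tracks, overlap, sigma=0.5, min_score=0.1):
    """Soft-NMS over whole trajectories.

    tracks: list of dicts with an "id" and per-frame "scores".
    overlap: function(track_a, track_b) -> value in [0, 1].
    """
    # Priority is the sum of per-frame confidences, so long tracks rank high.
    pool = [(sum(t["scores"]), t) for t in tracks]
    kept = []
    while pool:
        pool.sort(key=lambda item: item[0], reverse=True)
        best_score, best = pool.pop(0)
        if best_score < min_score:
            break
        kept.append(best)
        # Gaussian decay: the more a track overlaps the kept one, the more it is suppressed.
        pool = [(s * math.exp(-overlap(best, t) ** 2 / sigma), t) for s, t in pool]
    return kept

def frame_overlap(a, b):
    """Stand-in overlap: Jaccard index of the frame sets of two tracks."""
    fa, fb = set(a["frames"]), set(b["frames"])
    return len(fa & fb) / len(fa | fb)

tracks = [
    {"id": "A", "frames": range(0, 10), "scores": [0.9] * 10},
    {"id": "B", "frames": range(0, 10), "scores": [0.5] * 10},  # duplicate of A from another model
    {"id": "C", "frames": range(20, 25), "scores": [0.8] * 5},
]
kept_ids = [t["id"] for t in track_nms(tracks, frame_overlap, min_score=1.0)]
print(kept_ids)
```

In this toy case the weaker duplicate B is suppressed by A, while C survives because it does not overlap A at all.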

The GIAOTracker consists of three stages.



In the experiments, the effectiveness of all the above mechanisms is demonstrated by ablation, and the method is compared with SOTA. The VisDrone MOT dataset used for validation is a 5-class tracking dataset with 96 sequences (39,988 frames in total). The evaluation metric is mean average precision (mAP), computed at different thresholds for each class and then averaged over all classes. A Trajectory is considered correct if its IoU with the ground-truth Trajectory is higher than the threshold.

Select a model

As mentioned earlier, the choice of detector has a significant impact on tracking accuracy. The detector is DetectoRS, a ResNet-based detector pre-trained on the MS COCO dataset and fine-tuned on VisDrone MOT. To avoid overfitting, one out of every five frames is sampled and used for training.

In this experiment, there are two detectors with different training methods: DetV1 and DetV2.

  • DetV1: The image size is fixed during training, and multi-scale handling is evaluated by changing the image size during testing. It achieved an AP50 of 56.9.
  • DetV2: During training, the input is cut into four pieces without overlap and enlarged; during testing, the entire image is enlarged and input, and evaluated in the same multi-scale manner. SoftNMS is also used to fuse in the detection results of DetV1, and it achieved an AP50 of 63.2.

The ReID dataset is created by sampling at 5-frame intervals and removing objects whose occlusion or truncation exceeds 50%. The ReID model is trained with different input sizes for each object class.

GIModel likewise starts from a model pre-trained on ImageNet. The dataset for GIModel is created, and the model trained, in the same way as for the ReID dataset above. The difference is that frames are sampled at 3-frame intervals.


The parameters for each stage are taken from the paper; they are based on DeepSORT and JDE.


For online tracking, accuracy is compared as mechanisms are added one at a time to the DeepSORT baseline. "add" shows whether the mechanism is employed in GIAOTracker or not. Matching based on ORB compensates for camera motion and improves tracking performance. The powerful feature extractor OSNet and the robust EMA Bank are also shown to be useful. NSA Kalman and UKF outperform the linear Kalman filter by a wide margin, and for Rough2Fine, "soft-vote" outperforms "hard-vote" by a wide margin. Thus, the effectiveness of each introduced mechanism is demonstrated.

Next is GIModel in the global link, evaluated on the ReID dataset with mAP and Rank1 as metrics. Relative to the baseline, +train is the model trained on VisDrone, +sat takes the relationship between frames into account with the Transformer, and finally +part introduces the combination of global and local features. From the table, we can see that each mechanism improves accuracy significantly.

The last one is post-processing. GIAOTracker-Global is GIAOTracker with the stages up to Stage2. The results show that denoising, interpolation, and rescoring remove redundant Trajectories and interpolate the Trajectories well. The quality of the tracking is correlated with the length of the Trajectory.

Performance comparison of all stages

Rows 2-4 show GIAOTracker with DetV1 as the detector, and -DetV2 is GIAOTracker-Post with the detector changed to DetV2. -Fusion shows the accuracy after the final post-processing step, trajectory fusion, applied to the results of -Post with DetV1 and -DetV2. The last model, GIAOTracker*, shows the accuracy when annotation data is used in place of detections. In other words, it shows the tracking accuracy when the detection accuracy is 100%. You can see that each stage improves the accuracy. The accuracy when using annotation data is 92% better than with the detections estimated by individual models such as DetV1 and DetV2. In other words, although the overall accuracy may seem low at first glance when it relies on detector estimates, the result shows that each stage of GIAOTracker is very effective if detection accuracy can be ensured.

Comparison with SOTA

The last experiment is a comparison with the VisDrone2021 SOTA model.

The authors attribute the fact that the accuracy is not the best to limited detector accuracy and a limited amount of training data, and they claim that the accuracy of the method will increase as detector accuracy increases.


I was a bit curious about the last comparison with SOTA. The authors say the result is affected by detection accuracy, but then I wonder what would happen if every SOTA method were run with annotation data as detections. A fair comparison may be difficult for JDE methods, but looking at VisDrone-MOT2021, SOMOT, which has lower accuracy, is an SDE method, so I think that comparison would be fair.

I also wondered whether the large number of hyperparameters would be a burden in practical applications. The global link alone requires controlling six parameters, three weights and three thresholds for the costs, none of which seem intuitive to set.


GIAOTracker was found to achieve excellent accuracy in MCMOT on drone video, and the effectiveness of each mechanism was demonstrated by ablation. What is attractive is that all of the mechanisms (camera motion compensation, improved Kalman filter, improved feature update, global linking, and post-processing) can easily be implemented in other tracking methods. StrongSORT, an improved version of DeepSORT, also adopts approaches from GIAOTracker.

I think this paper offers good insight for those who want to understand object tracking systematically and are interested in the latest frameworks.
