StrongSORT: DeepSORT Is Back Stronger! Upgraded Tracking Model!
3 main points
✔️ Improved DeepSORT, an early deep model in MOT task, to achieve SOTA!
✔️ Proposed two post-processing methods AFLink and GSI with low computational cost to achieve higher accuracy!
✔️ AFLink and GSI improved the accuracy of not only the proposed method but also multiple models.
StrongSORT: Make DeepSORT Great Again
written by Yunhao Du, Yang Song, Bo Yang, Yanyun Zhao
(Submitted on 28 Feb 2022)
Comments: Published on arxiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
First, here is the accuracy comparison on MOT17 and MOT20, which shows the superiority of StrongSORT. Just as VGGNet, famous as a feature extractor, recently came back in a more powerful form as RepVGG, DeepSORT has come back as StrongSORT. DeepSORT is an early deep learning-based object tracking model, and StrongSORT achieves SOTA by improving that early model with the latest techniques. Let's take a quick look at the improvements.
+BoT: improved appearance feature extractor
+EMA: feature update with inertia term
+NSA: Kalman filter for nonlinear motion
+MC: cost matrix including motion information
+ECC: camera motion correction
+woC: non-adoption of cascade algorithm
+AFLink: global link using only motion information
+GSI interpolation: interpolation of detection misses by Gaussian process
Rather than fundamentally changing the structure, the authors improved the feature extraction, the processing of motion information, the cost matrix, and the other processes necessary for tracking. StrongSORT++ is an even more accurate model obtained by applying AFLink (offline processing) and GSI interpolation (post-processing) to the improved StrongSORT. I think the key is in there, so I'd be glad if you could read it to the end. Let's take a look at StrongSORT.
First, I will explain the systematic positioning of this method. If you only want to know the details of the method, you can skip this section. Tracking methods in deep learning started with DeepSORT. After that, new methods such as FairMOT and ByteTrack appeared and overtook the accuracy of DeepSORT. In the process, two tracking approaches were born: SDE, which requires a separate detector, and JDE, which trains detection and tracking in a single model. DeepSORT belongs to the SDE approach. The motivation of this paper is that the low accuracy of DeepSORT is not due to a wrong approach but simply because it is old, and that it can be improved with the latest elemental technologies proposed since then.
There are more reasons why the authors chose to improve DeepSORT. First, JDE has the disadvantage that it is not easy to train: it trains the parameters for the different tasks of detection and tracking at the same time, so the tasks tend to conflict, which limits accuracy. In addition, JDE requires a dataset that is annotated for both detection and tracking, which limits the range of usable training data. In contrast, SDE can optimize the detection and tracking models individually. Recently, models such as ByteTrack that track objects at high speed using only motion information, without appearance information, have been proposed, but the authors point out that such models cannot track objects whose motion is not simple.
Thus, StrongSORT was proposed based on the motivation that it is optimal to track the object using the appearance features in the DeepSORT-based SDE approach.
In the tracking task, terms like Tracklet, Trajectory, Kalman filter, global link, etc. will come up. GIAOTracker provides a systematic understanding of the three-step process required for tracking. The two papers are also by the same team of authors.
Before explaining StrongSORT, we will briefly review DeepSORT. In the tracking task, the objects detected in the current frame t are compared with the tracklets (short-term trajectories) of the objects tracked over the past frames 0 to t-1, and re-identification is required to assign the same ID to the same individual. For this association, a cost matrix that measures dissimilarity using object appearance features and motion information is generated, and the combination that minimizes the cost is found. Below is a schematic diagram of DeepSORT and StrongSORT.
The feature bank holds the appearance features of the tracklets. DeepSORT stores the appearance features extracted by the CNN model for the last 100 frames as they are. In this case, the CNN model is a simple deep model pre-trained on the re-identification dataset MARS.
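To make the role of the feature bank concrete, here is a minimal sketch of an appearance cost between one tracklet and one detection. It assumes the cost is the smallest cosine distance between the detection feature and any stored feature, which is how DeepSORT is commonly described; the function names and the toy 2-D features are hypothetical and only for illustration.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def appearance_cost(feature_bank, detection_feature, budget=100):
    """Smallest cosine distance to the most recent `budget` stored features.

    Assumption: min-over-gallery cosine distance, as in common DeepSORT
    descriptions. `budget=100` mirrors the 100-frame bank in the text.
    """
    recent = feature_bank[-budget:]
    return min(cosine_distance(f, detection_feature) for f in recent)

# Toy example with 2-D features (real features have hundreds of dimensions).
bank = [[1.0, 0.0], [0.0, 1.0]]
cost_same = appearance_cost(bank, [0.0, 2.0])   # same direction as a stored feature
cost_diff = appearance_cost(bank, [-1.0, 0.0])  # unlike every stored feature
```

Because every stored feature is compared against every detection, the cost of this scheme grows with the size of the bank, which is the inefficiency that the EMA update later removes.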
Not only appearance features but also motion information is important. Instead of simple positional proximity, the Kalman filter predicts where an object that was in frame t-1 is likely to be in frame t. The distance between the predicted Tracklet coordinates and the detected object is the cost. The Kalman filter is a linear Kalman filter and assumes the same noise for all objects.
DeepSORT also employs a matching cascade algorithm that gives priority in association to the tracklets that were observed most recently.
Now for the main topic, StrongSORT. I'll go through the improvements I showed in the first section, which I feel are similar to GIAOTracker in many ways.
StrongSORT employs BoT as a more powerful appearance feature extractor. Unlike the simple CNN used in DeepSORT, BoT uses a ResNeSt50 backbone pre-trained on the DukeMTMC-reID dataset, which makes it better at distinguishing the features of different individuals.
EMA (Exponential Moving Average)
EMA is a feature update method proposed in GIAOTracker. The DeepSORT feature bank keeps 100 frames of features as they are, which is inefficient and highly sensitive to the detection noise of each frame. In contrast, EMA keeps the past features as an inertia term and updates them as e_i^t = α e_i^{t-1} + (1 - α) f_i^t, where f_i^t is the feature of the object detected at frame t and assigned to tracklet i, and e_i^{t-1} is the feature of tracklet i up to frame t-1. Weighting these features with α updates the tracklet feature efficiently and with reduced noise.
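The EMA update, e_t = α e_{t-1} + (1 - α) f_t, can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name is hypothetical, the default α = 0.9 is an assumed momentum value not stated in this article, and any re-normalization of the feature after the update is omitted.

```python
def ema_update(track_feature, detection_feature, alpha=0.9):
    """Element-wise EMA: e_t = alpha * e_{t-1} + (1 - alpha) * f_t.

    alpha=0.9 is an assumed value for illustration.
    """
    return [alpha * e + (1.0 - alpha) * f
            for e, f in zip(track_feature, detection_feature)]

# A tracklet whose appearance gradually shifts from [1, 0] towards [0, 1].
e = [1.0, 0.0]
for f in ([0.0, 1.0], [0.0, 1.0]):
    e = ema_update(e, f)
# After two updates e is approximately [0.81, 0.19].
```

Only one vector per tracklet is stored, so memory and matching cost no longer depend on how many frames the tracklet has been observed, and a single noisy detection can move the stored feature by at most a factor of (1 - α).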
NSA Kalman is also proposed in GIAOTracker. DeepSORT uses a simple linear Kalman filter, but it is not realistic to assume that all detected objects have the same observation noise. Therefore, NSA Kalman varies the measurement noise covariance adaptively according to the confidence of each detection, as R~_k = (1 - c_k) R_k. A low-confidence detection, for example of an object moving in a complex way, keeps a large noise so the filter relies more on its own prediction, while a high-confidence detection shrinks the noise so the estimate follows the detection more closely.
Here c_k is the detection confidence of each object and R_k is the preset measurement noise covariance. In this way, position estimation can cope with a variety of complex object motions.
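The effect of confidence-dependent noise can be seen in a one-dimensional toy case. The sketch below assumes the NSA scaling R~ = (1 - c) R and uses the textbook scalar Kalman gain P / (P + R); the function names and the numbers are hypothetical.

```python
def nsa_measurement_noise(base_noise, confidence):
    """Scale a measurement noise covariance matrix: R~ = (1 - c) * R."""
    return [[(1.0 - confidence) * r for r in row] for row in base_noise]

def kalman_gain_1d(pred_var, meas_var):
    """Scalar Kalman gain: how strongly the update follows the measurement."""
    return pred_var / (pred_var + meas_var)

R = [[4.0]]     # preset measurement noise (1x1 for this toy case)
pred_var = 1.0  # variance of the predicted state

gain_high_conf = kalman_gain_1d(pred_var, nsa_measurement_noise(R, 0.9)[0][0])
gain_low_conf = kalman_gain_1d(pred_var, nsa_measurement_noise(R, 0.3)[0][0])
# gain_high_conf is about 0.71 and gain_low_conf is about 0.26, so a confident
# detection pulls the state estimate much closer to the measurement.
```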
MC (matching with motion cost)
Although DeepSORT defines a cost that includes location information, in its experiments the weight of that term is set to 0, so only appearance features contribute to the cost. Unlike DeepSORT, StrongSORT builds the cost matrix with λ = 0.98 as C = λ A_a + (1 - λ) A_m, where A_a is the cost of appearance features and A_m is the cost of motion (position) information.
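As a small worked example of the weighted cost C = λ A_a + (1 - λ) A_m with λ = 0.98, the sketch below combines two toy 2x2 cost matrices (rows are tracklets, columns are detections). The function name and the matrix values are made up for illustration.

```python
def combine_costs(appearance, motion, lam=0.98):
    """Element-wise C = lam * A_a + (1 - lam) * A_m."""
    return [[lam * a + (1.0 - lam) * m for a, m in zip(row_a, row_m)]
            for row_a, row_m in zip(appearance, motion)]

A_a = [[0.10, 0.80],
       [0.70, 0.20]]  # appearance cost (toy values)
A_m = [[0.50, 0.90],
       [1.00, 0.00]]  # motion cost (toy values)

C = combine_costs(A_a, A_m)
# C[0][0] = 0.98 * 0.10 + 0.02 * 0.50 = 0.108
```

With λ this close to 1, appearance still dominates, and the motion term mainly separates candidates whose appearance costs are nearly tied.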
StrongSORT uses a correction algorithm called ECC to cope with changes in the camera's viewpoint. In addition, when matching on the cost matrix, the matching cascade used in DeepSORT is not employed, and the problem is solved as a simple linear assignment problem. Prioritizing the tracklets with the most recent observations acts as an extra constraint that ends up limiting accuracy once the tracking model itself is strong. Accuracy improved simply by letting the strong tracking model do all the work without any extra conditions. woC stands for abandoning the matching cascade.
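Solving the association as one global linear assignment, instead of a cascade, can be illustrated as follows. This sketch deliberately uses a brute-force search over permutations so that it stays self-contained; it is only feasible for tiny matrices. A real tracker would use the Hungarian algorithm (for example SciPy's `linear_sum_assignment`) and would also reject matches above a cost threshold, which is omitted here. The cost values are hypothetical.

```python
from itertools import permutations

def linear_assignment(cost):
    """Brute-force minimum-cost one-to-one assignment for a small square matrix.

    Returns (list of (tracklet_index, detection_index) pairs, total cost).
    """
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(enumerate(best_perm)), best_total

# Rows: tracklets, columns: detections (toy values).
cost = [[0.2, 0.9, 0.8],
        [0.7, 0.1, 0.9],
        [0.6, 0.8, 0.3]]
matches, total = linear_assignment(cost)
```

All tracklets compete for all detections in a single step, so no tracklet is favored merely because it was observed more recently.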
AFLink is one of the most important techniques in this paper. So far, we have described online tracking, i.e., real-time tracking by detecting each frame and connecting tracklets. From now on, offline processing is used. Although tracking is performed until the last frame, the tracking is interrupted by detection errors or occlusions, and accurate but incomplete tracklets are generated for a short period. AFLink is a new method for offline processing.
AFLink is the first mechanism proposed to perform global linking without using appearance features. For example, GIAOTracker proposes a global linking mechanism called GIModel, which extracts features from each frame of a tracklet with a ResNet-based CNN and then feeds them to a Transformer Encoder to extract relevance. Such a model is computationally expensive, and its reliance on appearance features also makes it vulnerable to noise.
In contrast, AFLink uses only the detected frame number f and the position (x, y) at that time for each tracklet T. The information Ti = {(fk, xk, yk)} (k = 1 to 30) from the last 30 frames of each of the two tracklets Ti and Tj is compressed and turned into features by convolution layers, and the output is the confidence that Ti and Tj represent the trajectory of the same individual. First, the Temporal Block repeatedly compresses along the time direction, 7 frames at a time, separately for each of the three features (f, x, y). Then a Fusion Block is applied to compress the three features into one. The results for the two tracklets are combined and fed to the Classifier, which converts them into a confidence level using affine layers and ReLU. The overall picture is shown in the figure below. Note that the two tracklets are processed by separate blocks.
This is a bit confusing so I'll show a diagram based on the source code.
The Temporal Block compresses only the time direction (7 frames) independently for each of f, x, and y. In contrast, the Fusion Block compresses across the three feature dimensions (f, x, y) of the output that has already been compressed in the temporal direction. The AFLink model used in the experiments is shown in the figure below.
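The difference between the two kinds of compression can be shown with a toy, single-layer example. This is only a shape-level sketch under loud assumptions: the kernels are fixed averaging weights rather than learned parameters, there is one channel, one layer, and no activation or normalization, so it is not the AFLink network itself. The function names are hypothetical.

```python
def temporal_conv(track, kernel):
    """Valid 1-D convolution along time, applied to each column independently.

    track: list of [f, x, y] rows. The columns never mix in this step.
    """
    k = len(kernel)
    steps = len(track) - k + 1
    width = len(track[0])
    return [[sum(kernel[j] * track[t + j][c] for j in range(k))
             for c in range(width)]
            for t in range(steps)]

def fusion_conv(features, kernel):
    """Collapse the (f, x, y) columns of every time step into a single value."""
    return [sum(w * v for w, v in zip(kernel, row)) for row in features]

# A 30-frame tracklet: frame number, x moving right, y constant (toy values).
track = [[float(t), 100.0 + 2.0 * t, 50.0] for t in range(30)]

temporal = temporal_conv(track, [1.0 / 7.0] * 7)  # placeholder weights, 7 frames
fused = fusion_conv(temporal, [1.0 / 3.0] * 3)    # placeholder weights, 3 features

# 30 time steps shrink to 24 after one 7-frame kernel; the three columns stay
# separate until the fusion step merges them into one value per time step.
shape_after_temporal = (len(temporal), len(temporal[0]))
```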
Since AFLink is a CNN model, it has to be trained in advance on the dataset used in the experiments. StrongSORT+ is the model that applies AFLink as an additional offline process to StrongSORT.
GSI (Gaussian-Smoothed Interpolation)
Another important technique is GSI interpolation. This is post-processing applied after both online and offline tracking are done. The important part of this post-processing is interpolating detection misses, because in SDE an object that the detector fails to detect cannot, of course, be tracked. The accuracy of a trajectory (the track over all frames) with such gaps can be improved by interpolation. Linear interpolation is widely used as the simplest approach, but its accuracy is limited because it does not use motion information. The blue line in the figure below shows linear interpolation; it does not reproduce the correct trajectory (GT) because it connects the detected segments (Tracked) unnaturally.
The core of GSI is a Gaussian process, which, when predicting b from a for observed data (a, b), models a multidimensional normal distribution with mean m and covariance K given as functions of a, as in p(b|a) = N(m(a), K(a)). Here, conditional multidimensional normal distributions p(x|t), p(y|t), p(w|t), and p(h|t) are assumed between the frame number t and each component of the location information (x, y, w, h). (x, y, w, h) is then estimated at each frame t where a detection was missed. Note that a Gaussian process is applied to each trajectory i separately, which is written as pt in the paper.
The RBF kernel k(x, x') = exp(-||x - x'||^2 / (2λ^2)) is used for the kernel function k. The denominator λ determines the smoothness of the trajectory and is set to 10 in this paper.
In the figure, we can see that detection misses are interpolated well by modeling the relationship between frame number and position coordinates with a normal distribution. StrongSORT++ is StrongSORT with AFLink and GSI interpolation applied. Let's see its superiority through experiments.
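A minimal sketch of Gaussian process regression with an RBF kernel, used to fill in missed frames of one coordinate, might look like this. It is a simplified illustration and not the authors' implementation: it assumes a constant prior mean equal to the average of the observations, a hand-picked noise term `noise=1e-2`, a fixed length scale `lam=10.0`, and a small hand-written linear solver so that the example needs no external library. All names are hypothetical.

```python
import math

def rbf(a, b, lam):
    """RBF kernel k(a, b) = exp(-(a - b)^2 / (2 * lam^2))."""
    return math.exp(-((a - b) ** 2) / (2.0 * lam ** 2))

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def gp_interpolate(frames, values, query_frames, lam=10.0, noise=1e-2):
    """Posterior mean of a GP (constant prior mean) at the query frames."""
    mean = sum(values) / len(values)
    gram = [[rbf(a, b, lam) + (noise if i == j else 0.0)
             for j, b in enumerate(frames)]
            for i, a in enumerate(frames)]
    weights = solve(gram, [v - mean for v in values])
    return [mean + sum(rbf(q, f, lam) * w for f, w in zip(frames, weights))
            for q in query_frames]

# x-coordinate observed at frames 1-3 and 7-9; frames 4-6 were missed.
frames = [1, 2, 3, 7, 8, 9]
xs = [10.0, 12.0, 14.0, 22.0, 24.0, 26.0]
filled = gp_interpolate(frames, xs, [4, 5, 6])
```

The same routine would be run separately for x, y, w, and h of each trajectory. A larger `lam` gives a smoother curve, while a smaller one follows the individual detections more closely.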
In the experiments, MOT17 and MOT20, benchmark datasets for person tracking, are used; MOT20 is the more challenging of the two, with a dense set of targets to track. In the ablation, the first half of each video in the MOT17 training set is used for training, and the second half is used for validation.
The appearance feature extractor, BoT, is pre-trained on the DukeMTMC-reID dataset. The detector, YOLOX-X, is pre-trained on the COCO dataset. The NMS threshold for suppressing detection overlap is 0.8 and the detection confidence threshold is 0.6.
AFLink associates tracklets within 30 frames and 75 pixels of each other. GSI sets the maximum number of misses that can be interpolated to 20 frames.
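The 30-frame and 75-pixel limits act as a gate on which tracklet pairs are even considered for linking. A minimal sketch of such a gate is shown below. It assumes the gap is measured from the last observation of the earlier tracklet to the first observation of the later one and that the limits are inclusive; both are assumptions, and the function name is hypothetical.

```python
import math

def is_link_candidate(track_i, track_j, max_gap=30, max_dist=75.0):
    """Check whether track_j may be a continuation of track_i.

    Each track is a list of (frame, x, y) tuples sorted by frame.
    Assumption: the gap is measured from the end of track_i to the
    start of track_j, and both limits are inclusive.
    """
    end_f, end_x, end_y = track_i[-1]
    start_f, start_x, start_y = track_j[0]
    gap = start_f - end_f
    dist = math.hypot(start_x - end_x, start_y - end_y)
    return 0 < gap <= max_gap and dist <= max_dist

track_a = [(98, 196.0, 300.0), (99, 198.0, 300.0), (100, 200.0, 300.0)]
near = [(110, 230.0, 340.0)]      # 10 frames later, 50 px away
too_late = [(150, 205.0, 300.0)]  # 50 frames later
too_far = [(105, 400.0, 300.0)]   # 200 px away

candidates = [is_link_candidate(track_a, t) for t in (near, too_late, too_far)]
```

Only pairs that pass this gate would be scored by the AFLink model, which keeps the number of model evaluations small.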
MOTA is an index that emphasizes detection accuracy because it is calculated from the False Positives, False Negatives, and ID Switches (IDs) of the tracks. HOTA is a well-balanced evaluation index that evaluates detection accuracy (DetA) and association accuracy (AssA) at the same time.
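For reference, the two metrics can be written as short formulas. The sketch below uses the standard definitions: MOTA = 1 - (FN + FP + IDSW) / GT, and HOTA at a single localization threshold as the geometric mean of DetA and AssA (the reported HOTA averages this over several thresholds, which is omitted here). The counts are toy numbers, not results from the paper.

```python
import math

def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / number of ground-truth boxes."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

def hota_at_threshold(det_a, ass_a):
    """HOTA at one localization threshold: sqrt(DetA * AssA)."""
    return math.sqrt(det_a * ass_a)

example_mota = mota(false_negatives=200, false_positives=100,
                    id_switches=20, num_gt=1000)
example_hota = hota_at_threshold(det_a=0.64, ass_a=0.49)
# In this toy example ID switches account for only 20 of the 320 errors,
# which is why MOTA mostly reflects detection quality.
```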
For StrongSORT, which performs online tracking, we compare the accuracy when the mechanism is added sequentially from DeepSORT, which is the baseline.
Improving the appearance feature extractor (BoT) raises IDF1 significantly, which shows the importance of appearance features. Adding ECC raises IDF1 and MOTA slightly, since camera motion correction allows more accurate motion information to be extracted. NSA Kalman improves HOTA, while MOTA and IDF1 remain unchanged. Changing the tracklet feature to EMA not only improves IDF1, which represents association accuracy, but also increases FPS. Including motion information in the cost (MC) also improves accuracy. Finally, IDF1 improves greatly when the matching cascade is dropped and the association is solved as an ordinary linear assignment problem, indicating that the cascade is not necessary.
AFLink and GSI
Here the superiority of AFLink and GSI is examined using StrongSORT and existing SoTA models. The StrongSORT ablation above had versions v1 to v6 in order; here v1, v3, and v6 are adopted, together with CenterTrack, TransTrack, and FairMOT as existing methods, and AFLink and GSI processing is added to each of the six models. The following is a summary of the results.
AFLink improves the accuracy of all models, especially those with low original accuracy, and improves CenterTrack's IDF1 by 3.7. GSI, unlike AFLink, works better for stronger tracking models.
GSI is compared separately with linear interpolation (LI) to show that GSI improves performance with a small increase in computational cost.
As described above, AFLink and GSI were found to improve the accuracy of various high-precision tracking models, not only StrongSORT.
The last comparison is with many SoTAs: we have validated with MOT17 and MOT20; we have not compared the FPS because the SDE model ignores the processing time of the detection stage, making the comparison between JDE and SDE difficult.
First is MOT17, where StrongSORT++ achieved first place for HOTA/AssA/DetA and second place for MOTA/IDs among all methods, far ahead of the second place accuracy.
MOT20 deals with a more crowded situation. Here, StrongSORT++ likewise achieved first place in HOTA/IDF1/AssA. In particular, there are very few IDs.
These results are achieved without tuning parameters for each dataset, which shows the versatility of the method.
As a reminder, the baseline DeepSORT* is a version of DeepSORT with an improved detector and hyperparameters compared with the original at the time of publication; the fact that this alone produces good results shows the effectiveness of the SDE paradigm.
I've included the results of the application at the end. You can see that the occlusion is also correctly ID'd, but the congestion is so high in MOT20 that it's hard to tell. It is impressive that it can track even with such high congestion.
AFLink was motivated by the fact that over-reliance on appearance features makes a model vulnerable to noise, but it seems to me that the motion information is also heavily noisy in such a situation. I wonder if NSA Kalman is the reason why the accuracy is still good. I'd like to investigate further the conditions under which appearance information is no longer needed and simply convolving the frame number and coordinates is enough.
As for StrongSORT itself, many of its improvements are quite similar to those of GIAOTracker. The two are not compared, so I wonder which is superior in accuracy. I find the processing after online tracking, such as the global link without appearance information and the Gaussian-process interpolation of tracks, to be the more attractive part of this method.
In this article, I introduced StrongSORT, an improved version of DeepSORT, which further improves accuracy by proposing AFLink and GSI in addition to elemental techniques for online tracking such as NSA Kalman, EMA, and ECC. The paper mentions several remaining issues: the execution speed is slower than that of JDE methods and of ByteTrack, which does not use appearance information at all; MOTA is slightly inferior, so the threshold should be determined more strictly; and AFLink performs worse on wrongly associated trajectories. I'm looking forward to future improvements.