
PD-SORT: Occlusion-Robust Multi-Object Tracking Using Pseudo-Depth Cues

Yanchao Wang, Dawei Zhang, Run Li, Zhonglong Zheng, and Minglu Li
Yanchao Wang, Dawei Zhang, Run Li, Zhonglong Zheng, and Minglu Li are with the School of Computer Science and Technology, Zhejiang Normal University, Jinhua 321004, China. The corresponding author is Dawei Zhang (Email: [email protected]).
Abstract

Multi-object tracking (MOT) is a rising topic in video processing technologies and has important application value in consumer electronics. Currently, tracking-by-detection (TBD) is the dominant paradigm for MOT, which performs target detection and association frame by frame. However, the association performance of TBD methods degrades in complex scenes with heavy occlusions, which hinders the application of such methods in real-world scenarios. To this end, we incorporate pseudo-depth cues to enhance the association performance and propose Pseudo-Depth SORT (PD-SORT). First, we extend the Kalman filter state vector with pseudo-depth states. Second, we introduce a novel depth volume IoU (DVIoU) by combining the conventional 2D IoU with pseudo-depth. Furthermore, we develop a quantized pseudo-depth measurement (QPDM) strategy for more robust data association. Besides, we also integrate camera motion compensation (CMC) to handle dynamic camera situations. With the above designs, PD-SORT significantly alleviates occlusion-induced ambiguous associations and achieves leading performance on DanceTrack, MOT17, and MOT20. Note that the improvement is especially obvious on DanceTrack, where objects show complex motions, similar appearances, and frequent occlusions. The code is available at https://github.com/Wangyc2000/PD_SORT.

Index Terms:
Multi-object tracking, pseudo-depth, tracking-by-detection.

I Introduction

Multi-object tracking (MOT) aims to detect all desired objects in a video and maintain their identities across frames, serving as a fundamental vision task. With the rapid development of consumer technologies, MOT systems can be deployed on diverse edge devices with cameras (e.g., smartphones, automobiles, and drones), enabling vast applications for consumer electronics, including but not limited to autonomous driving [1], video surveillance [2, 3], UAV applications [4], and human behavior analysis [5]. Nevertheless, complex object motions and dense crowds still pose challenges for the real-world application of MOT methods.

Currently, tracking-by-detection (TBD) [6, 7, 8, 9, 10] is the dominant paradigm for solving the MOT problem. Methods following the TBD paradigm decompose tracking into two sub-steps: i) performing frame-by-frame object detection, and ii) matching the detected objects across frames using association algorithms to form trajectories. Typically, the detection task is realized using off-the-shelf object detectors [11, 12], and the association task is achieved by bipartite graph matching with the Hungarian algorithm [13], where motion cues and appearance cues are used for similarity evaluation. However, in complex scenarios with crowded objects and non-linear motion (e.g., scenes from the DanceTrack [14] benchmark), occlusions happen frequently. In such cases, the bounding boxes of intersecting objects in 2D images are highly overlapped, and motion models in TBD methods based on spatial position can fail to provide sufficient discriminative cues. We identify three representative types of occlusion-induced identity (ID) consistency problems, as illustrated in Fig. 1: (a) the identity of the front object switches to the occluded object's identity; (b) the occluded object is reinitialized after reappearance; (c) the identities of two objects swap after occlusion and trajectory intersection.

Figure 1: Three examples of occlusion-induced tracking failures. The samples are OC-SORT results on DanceTrack, where objects have diverse motions and similar appearances.

To improve the tracking robustness against occlusions and non-linear motions, recent work has tried to introduce additional motion cues in similarity evaluation [10]. Meanwhile, depth information has been proven to be effective in target set decomposition under dense occlusions in MOT [15]. However, to the best of our knowledge, no existing methods have tried to incorporate depth as a state into the motion model in pure motion-based 2D MOT.

Figure 2: A comparison of association without depth information and with depth information on DanceTrack [14]. Bounding boxes and dashed arrows of different colors represent the location and depth of different objects. We intuitively and experimentally observe that depth information can compensate for the association failure after occlusion and reappearance.

In this paper, we use depth information to improve 2D MOT performance under complex scenes with dense occlusions by introducing pseudo-depth into the MOT motion model. First, we develop a simple method to extract pseudo-depth from 2D images. With the concept of a complementary view, our pseudo-depth is robust to boundary cases. Next, we employ the Kalman filter (KF) [16] to model the object’s motion, as it is a typical approach for motion prediction in TBD methods. Specifically, we extend the widely used KF motion state from SORT [6] with pseudo-depth and its velocity. To achieve more accurate target localization, we design a depth-volume intersection over union (DVIoU) that uses pseudo-depth to expand the standard 2D intersection over union (IoU) [17] similarity to 3D. In addition, we also introduce the camera motion compensation (CMC) [18] technique to improve the tracking quality in dynamic camera environments. As shown in Fig. 2, we experimentally find that depth information is consistent under occlusion, and can compensate for the association of 2D information.

For the implementation, we adopt OC-SORT [10] as our base method for its concise structure and strong performance. We inherit the observation-centric idea of OC-SORT and implement our designs using historical observations. Firstly, pseudo-depth computation and camera motion compensation are performed at the beginning of each frame. Secondly, our DVIoU replaces the IoU similarities in both the regular association and the recovery of lost tracklets using their historical observations (Observation-Centric Recovery, or OCR in OC-SORT). Finally, the QPDM cost is added to the cost matrix along with the DVIoU cost and the velocity consistency cost (Observation-Centric Momentum, or OCM in OC-SORT). As our focus is to introduce pseudo-depth into the MOT motion model, we name our method Pseudo-Depth SORT (PD-SORT). By integrating the above designs, PD-SORT consistently outperforms its baseline on MOT17, MOT20, and DanceTrack in most MOT metrics (see Tables I, II, and III) while remaining a simple, online, real-time, and pure motion-based tracker.

The main contributions of our work are three-fold:

  • We incorporate the pseudo-depth information into 2D MOT and demonstrate its effectiveness in alleviating association failures caused by occlusions and non-linear motions.

  • We design Depth Volume IoU (DVIoU) and Quantized Pseudo-Depth Measurement (QPDM) to leverage the depth information in association, which effectively reduces the cases of association errors.

  • We propose PD-SORT by integrating our designs into OC-SORT. PD-SORT consistently outperforms its baseline on MOT17, MOT20, and DanceTrack, which demonstrates its generalization ability across diverse MOT scenes.

The remainder of this paper is organized as follows: Section II reviews related works on data association and the use of depth information in multi-object tracking. Section III presents our proposed tracking method. Section IV reports the experimental setup and evaluation results, including ablation studies and benchmark comparisons. Finally, Section V concludes this paper with a summary of key contributions and potential future directions.

II Related Work

Multi-object tracking (MOT) is an essential task in the vision field that has become a hot research topic. Existing MOT methods can be categorized into two types: end-to-end tracking methods [19, 20, 21] and tracking-by-detection (TBD) methods [6, 8, 9, 10]. Due to its simplicity and strong performance, tracking-by-detection is the mainstream paradigm among MOT methods. In particular, the prevalent TBD paradigm divides MOT into two steps: detection and association. Given the rapid development of modern deep detectors [11, 22, 12], research in the field of MOT focuses on how to achieve more reliable association. At the same time, depth information provides key cues in 3D MOT and shows its potential to improve tracking quality in 2D MOT.

II-A Association in 2D MOT

To achieve reliable association, most MOT methods that follow the TBD paradigm leverage the target’s motion consistency [6, 7, 8, 23, 9, 10]. The pioneering work SORT [6] employs the Kalman filter (KF) [16] to model the target motion: at the beginning of each frame, the motion states of the targets are predicted by the KF under the linear motion assumption. Then, the IoU similarities between the predictions and the detections are calculated and used in the cost matrix for matching by the Hungarian algorithm [24]. After being successfully matched, the corresponding new detections are used to update the tracklets’ KF parameters. This association pipeline of SORT is followed and improved by later TBD methods [8, 9, 10]. To alleviate the frequent ID switches of SORT under occlusion, DeepSORT [8] introduces ReID-based appearance similarity in the cost matrix. It also proposes an association strategy that prioritizes the tracklets with more recent successful associations. To effectively integrate appearance cues, SAT [25] explores a deep Siamese network to extract instance-level appearance features, which are then used for similarity computation in the association stage. Besides, appearance features extracted by deep appearance models [26, 27] provide effective discriminating cues that benefit tracking quality, which are exploited by later works [28, 29, 30, 31]. To realize more reliable association, BoT-SORT [18] modifies the KF model and uses the camera motion compensation technique to generate more accurate KF predictions while combining motion and appearance cues. Due to factors like occlusion and motion blur, low-confidence detections can also indicate the existence of targets. However, both SORT and DeepSORT perform associations for high-confidence detection results only. Therefore, ByteTrack [9] proposes a new matching cascade strategy: once high-confidence detections have been matched, low-confidence detections and tracklets not matched with high-confidence detections are also matched. By considering all the detections, ByteTrack effectively improves the association performance of SORT-like methods, but it still has limitations when dealing with nonlinear motions and occlusions. When interruptions happen, the Kalman filter parameters cannot be updated due to the absence of new observations, and the KF prediction error accumulates over time. In contrast, the error of the observations (detections) depends only on the detector and is more stable and smaller than the KF error. Therefore, OC-SORT [10] uses the tracklets’ historical observations to compute the velocity-direction consistency with the new detections as well as to recover interrupted tracklets. Also, after a target reappears, the observations before and after the interruption are used to interpolate a virtual trajectory, which is then used to update the KF. Generally, the main challenge of TBD methods is the association under complex scenes, including dense objects, heavy occlusions, and nonlinear motions.

II-B Depth Information in MOT

In instance-level object identification tasks, effectively leveraging scene context information can enhance the model’s ability to distinguish targets [32]. For the MOT task, exploring richer scene context can contribute to more robust object association. As an effective form of spatial context, depth information can refine the motion modeling of targets, thereby improving the tracker’s localization and discrimination capabilities. In 3D MOT, AB3DMOT [33] obtains detections with depth information from a LiDAR point cloud and extends the KF to 3D. CenterPoint [34] detects object centers using a keypoint detector and estimates attributes like 3D size, orientation, and velocity. It refines these estimates using point features and simplifies tracking to greedy closest-point matching. To obtain a comprehensive understanding of the scene, EagerMOT [35] fuses object observations from both 3D and 2D object detectors. However, in mobile device applications (e.g., smartphones), deploying depth sensors brings additional costs. Meanwhile, actual depth data obtained from depth sensors is often limited by the sensors’ perception range, resulting in reduced tracking performance for distant targets. In fact, as a projection of the 3D scene, a 2D image also implies certain depth information. In 2D MOT, previous works have attempted to enhance tracking performance by incorporating pseudo-depth extracted from the image signal [36, 37, 15]. QuoVadis [36] combines the 2D detector with a monocular depth estimator and a segmentation network to achieve trajectory forecasting from a Bird’s-Eye View (BEV). However, this method has a high model complexity. On the other hand, DP-MOT [37] uses a geometry-based approach to estimate the depth of detected objects, and then performs tracking by jointly using the depth-aware motion cue and the appearance cue. Similarly, SparseTrack [15] proposes a projection-rule-based method for obtaining the relative depth of targets from 2D images, which does not require training any additional networks. Based on this pseudo-depth, the tracklets and detections are divided into subsets, and cascaded matching is performed on tracklets and detections at the same depth level. However, the aforementioned 2D MOT methods treat pseudo-depth as an auxiliary cue for constructing BEV, complementing it with appearance features, or partitioning object subsets. In contrast, we propose to integrate pseudo-depth directly into the target’s motion model as a reliable motion state, aiming at enhancing the tracker’s robustness in complex scenarios with dense occlusions.

III Methodology

In this section, we introduce the main components of the proposed PD-SORT, including the pseudo-depth modeling approach and the strategies to exploit depth information in the association stage, namely Depth Volume IoU (DVIoU) and Quantized Pseudo-Depth Measurement (QPDM). Camera motion compensation (CMC) is also integrated to alleviate the camera movement problems common in MOT scenes. The overall pipeline is shown in Fig. 3. PD-SORT produces tracking results for frame t+1 by matching detections of frame t+1 with tracklets from frame t, which comprises three core steps: (a) Preparation: CMC corrects the targets’ KF states and historical observations, and the pseudo-depth values of the detections are estimated. (b) Motion Cues Generation: the motion states in the new frame are predicted using the corrected KF states, the velocity directions are computed using historical observations, and the locations (bounding boxes and pseudo-depth) of detections and tracklets are both recorded. (c) Association: a two-stage association is performed using detection locations and tracklet cues. The first stage of regular association considers three similarities: DVIoU, which computes location similarity based on KF-predicted motion states; OCM, which computes velocity direction consistency with the detections; and QPDM, which checks pseudo-depth consistency. For unmatched detections and tracklets, the OCR association is then performed, using the DVIoU between the detections and the tracklets’ last historical observations as the association criterion to recover unmatched tracklets. Notably, PD-SORT is developed upon OC-SORT; it retains the observation-centric modules of OC-SORT (i.e., OCM, OCR, and ORU) and uses historical observations to calculate similarity.

Figure 3: Pipeline of PD-SORT. The preparation stage estimates pseudo-depth for new detections and uses CMC to correct both motion states from KF and historical observations. For the motion cues generation, pseudo-depth is incorporated into motion states and bounding box locations for both tracklets and detections. The association stage utilizes the motion cues to compute pseudo-depth guided matching similarities in terms of DVIoU and QPDM, and the velocity consistency described by OCM to perform a two-stage association to match between tracklets and detections.

III-A Pseudo-Depth Modeling

In 2D MOT, the robustness of association relies on the estimation of the object’s position, which is highly susceptible to nonlinear motions and occlusions. On the other hand, by expanding the spatial information of the object, 3D tracking that includes depth information can effectively improve the accuracy of object localization and the robustness to occlusions. Meanwhile, the effectiveness of projection-based pseudo-depth in MOT tasks has been verified in previous work [15]. However, to the best of our knowledge, there is no existing work that incorporates pseudo-depth as a state in the motion model for pure motion-based 2D MOT. A key challenge lies in maintaining the accuracy of depth estimation when handling difficult targets such as boundary objects. Reliable pseudo-depth estimation is essential, as it underpins the effectiveness of the subsequent similarity computation modules. Moreover, appropriate pseudo-depth-based motion states to be integrated into the Kalman filter are required to ensure the discrimination ability of the motion predictions.

This observation leads us to extend the MOT motion model by introducing pseudo-depth and its velocity, which in turn extends 2D MOT toward 3D for better processing. For the definition of pseudo-depth, as in SparseTrack [15], we first consider the projection-based depth given by the distance from the bottom of the target bounding box to the bottom of the image view. Such projection-based pseudo-depth estimation relies on the assumptions that the image capture device is above the ground plane and that all objects in the scene stand on the same plane. In practical tracking applications such as mobile device capture, pedestrian monitoring, and in-car camera sensing, these assumptions are typically satisfied, enabling pseudo-depth estimation to provide effective guidance. However, since the target bounding box may move to the boundary of the view during tracking, this pseudo-depth can become zero or negative, which no longer correctly reflects the depth of the target for modules that use depth values directly and corrupts the subsequent pseudo-depth-based computations.

Therefore, we propose a novel pseudo-depth based on a complementary view. By appending a complementary view of the same size below the real image view, we define the pseudo-depth as the distance from the bottom of the target bounding box to the bottom of the complementary view. Our pseudo-depth pd is computed as in Eq. 1.

pd = 2\times IMG_{h} - Y_{b} (1)

Here, IMG_{h} is the height of the real view, and Y_{b} is the y-coordinate of the bottom of the target bounding box. The visualization of the ground-plane real depth (depth) and our pseudo-depth (pd) is shown in Fig. 4.

Figure 4: Illustration of our pseudo-depth. The orange double-arrow line represents the real depth on the ground plane (depth), the dashed orange double-arrow line represents the length that corresponds to the pseudo-depth in the complementary view on the ground plane (depth_{complement}), and the blue double-arrow line represents the pseudo-depth obtained by projecting the real depth onto the view plane with both the real image view and the complementary view (pd).

For objects whose bounding boxes lie within the real view, such complementary-view pseudo-depth correctly reflects the depth information.
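As a minimal illustration, Eq. 1 can be sketched in a few lines of Python; the function name and the example image height below are ours and purely illustrative.

def pseudo_depth(y_bottom, img_height):
    """Pseudo-depth with the complementary view (Eq. 1)."""
    return 2 * img_height - y_bottom

# A box touching the bottom of a 720-pixel-high view (Y_b = 720) would get
# depth 0 under the single-view definition, but stays positive here:
print(pseudo_depth(y_bottom=720, img_height=720))  # -> 720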

Based on the proposed pseudo-depth, we extend the standard KF in SORT with two additional states: the target’s pseudo-depth pd and its velocity component v_{pd}. The standard Kalman filter state in SORT is shown in Eq. 2.

X = [x_{c},\ y_{c},\ s,\ r,\ v_{x},\ v_{y},\ v_{s}] (2)

Here, (x_{c}, y_{c}) is the center coordinate of the target’s bounding box, s and r are the area and aspect ratio of the bounding box, and v_{x}, v_{y}, v_{s} are the velocity components of x_{c}, y_{c}, s, respectively. By introducing two new states, pd and v_{pd}, the KF state is revised as in Eq. 3.

X = [x_{c},\ y_{c},\ pd,\ s,\ r,\ v_{x},\ v_{y},\ v_{pd},\ v_{s}] (3)
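For clarity, a minimal NumPy sketch of the constant-velocity transition and observation matrices implied by Eq. 3 is given below; it assumes the same linear motion model as SORT, with the aspect ratio r carrying no velocity term, and the variable names are ours rather than those of the released code.

import numpy as np

# State layout (Eq. 3): [x_c, y_c, pd, s, r, v_x, v_y, v_pd, v_s]
dim_x, dim_z = 9, 5

F = np.eye(dim_x)                              # constant-velocity transition matrix
F[0, 5] = F[1, 6] = F[2, 7] = F[3, 8] = 1.0    # x_c<-v_x, y_c<-v_y, pd<-v_pd, s<-v_s

H = np.zeros((dim_z, dim_x))                   # observation matrix
H[:dim_z, :dim_z] = np.eye(dim_z)              # measurements are [x_c, y_c, pd, s, r]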

III-B Depth Volume IoU

To utilize the depth information in location consistency evaluation, we extend the 2D IoU similarity to 3D by introducing the concept of depth volume. Given two object observations b^{1} = (x_{1}^{1}, y_{1}^{1}, x_{2}^{1}, y_{2}^{1}, pd^{1}) and b^{2} = (x_{1}^{2}, y_{1}^{2}, x_{2}^{2}, y_{2}^{2}, pd^{2}), where (x_{1}^{1/2}, y_{1}^{1/2}), (x_{2}^{1/2}, y_{2}^{1/2}), and pd^{1/2} denote the top-left corner, the bottom-right corner, and the pseudo-depth, respectively, we define the depth volume of the intersection between the two objects, V^{inter}, as in Eq. 4.

\left\{\begin{array}{l}V^{inter}=w^{inter}\cdot h^{inter}\cdot pd^{inter}\\ w^{inter}=\min\left(x_{2}^{1},\ x_{2}^{2}\right)-\max\left(x_{1}^{1},\ x_{1}^{2}\right)\\ h^{inter}=\min\left(y_{2}^{1},\ y_{2}^{2}\right)-\max\left(y_{1}^{1},\ y_{1}^{2}\right)\\ pd^{inter}=\min\left(pd^{1},\ pd^{2}\right)\end{array}\right. (4)

Here, w^{inter} and h^{inter} are the width and height of the intersection box. Meanwhile, we define the pseudo-depth of the intersection, pd^{inter}, as the smaller of the two objects’ pseudo-depths. Similarly, we obtain the depth volumes of the two objects, V^{1} and V^{2}, as in Eq. 5.

\left\{\begin{array}{l}V^{1/2}=w^{1/2}\cdot h^{1/2}\cdot pd^{1/2}\\ w^{1/2}=x_{2}^{1/2}-x_{1}^{1/2},\quad h^{1/2}=y_{2}^{1/2}-y_{1}^{1/2}\end{array}\right. (5)

Furthermore, to distinguish objects more robustly, we introduce the depth volume IoU (DVIoU) based on this volume metric, as shown in Eq. 6.

DVIoU = \frac{V^{inter}}{V^{1}+V^{2}-V^{inter}} (6)

The comparison between the standard IoU and DVIoU is illustrated in Fig. 5. By using depth to modulate the IoU similarity, the robustness of the target location consistency measurement is improved, and the extra discriminative information provided by the depth cue benefits the overall association accuracy.
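A minimal per-pair sketch of Eqs. 4-6 in Python is given below; the released implementation presumably vectorizes this over all tracklet-detection pairs, and the function name is illustrative.

def dviou(b1, b2):
    """Depth volume IoU between two observations (x1, y1, x2, y2, pd), Eqs. 4-6."""
    w_inter = min(b1[2], b2[2]) - max(b1[0], b2[0])
    h_inter = min(b1[3], b2[3]) - max(b1[1], b2[1])
    if w_inter <= 0 or h_inter <= 0:                     # no 2D overlap
        return 0.0
    pd_inter = min(b1[4], b2[4])                         # Eq. 4: smaller pseudo-depth
    v_inter = w_inter * h_inter * pd_inter
    v1 = (b1[2] - b1[0]) * (b1[3] - b1[1]) * b1[4]       # Eq. 5
    v2 = (b2[2] - b2[0]) * (b2[3] - b2[1]) * b2[4]
    return v_inter / (v1 + v2 - v_inter)                 # Eq. 6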

Figure 5: Illustration of IoU and DVIoU. By integrating pseudo-depth (the extra dimension represented by the dashed line in the figure), area-based standard 2D IoU is extended to volume-based DVIoU.

III-C Quantized Pseudo-Depth Measurement

Occlusions can harm the reliability of the pseudo-depth, which in turn decreases tracking accuracy. On the other hand, in successive frames, the relative depth of an object with respect to other objects fluctuates only within a narrow interval. Therefore, we propose a quantized pseudo-depth cost to better utilize the pseudo-depth to guide the association.

For each frame, we find the minimum pseudo-depth value of all detected objects, pd_{min}, and the maximum value, pd_{max}. Then, the interval [pd_{min}, pd_{max}] is divided uniformly into interval_{num} sub-intervals, and each sub-interval is assigned an interval depth (in this paper, the interval depth is defined as the upper limit of the sub-interval after min-max normalization). After that, each object is assigned the interval depth of the sub-interval it falls in. The interval depth of the i^{th} (i = 0, 1, \ldots, interval_{num}-1) sub-interval is computed as in Eq. 7.

\left\{\begin{array}{l}interdepth_{i}=[(i+1)\times len_{interval}]\ /\ len_{total}\\ len_{interval}=len_{total}\ /\ interval_{num}\\ len_{total}=pd_{max}-pd_{min}\end{array}\right. (7)

Next, the interval depth is computed for the last historical observation of each tracklet in the same manner. Finally, the quantized pseudo-depth cost C_{QPD} is computed as the absolute difference between the interval depths of the new detections, interdepth_{dets}, and those of the tracklets, interdepth_{tracks}, as shown in Eq. 8.

C_{QPD} = \mathrm{abs}\left(interdepth_{tracks}-interdepth_{dets}\right) (8)

The pseudo-depth difference between the tracklets and the new detections can then be evaluated through their interval depth values. Compared to directly computing the difference in pseudo-depth, using the proposed interval depth between detections and tracklets reduces the depth estimation error caused by partial occlusions, thus improving the robustness of pseudo-depth utilization. Meanwhile, the interval-depth-based cost helps to alleviate the association error caused by the velocity direction consistency evaluation when the object is turning, which further improves the algorithm’s robustness against nonlinear motions. Finally, the pseudo-code of the QPDM algorithm is given in Algorithm 1.

Input: number of sub-intervals interval_{num}, pseudo-depth set of tracklets’ previous observations pd_{obs}, pseudo-depth set of new detections pd_{dets}
Output: The pseudo-depth cost matrix between tracklets and detections C_{QPD}
len_{obs} \leftarrow \max(pd_{obs}) - \min(pd_{obs})
pd_{obs} \leftarrow (pd_{obs} - \min(pd_{obs}))\ /\ len_{obs}
min_{previous} \leftarrow 1
/* Compute interval depth for previous observations */
for inter \leftarrow 0 to interval_{num}-1 do
      min_{current} \leftarrow 1 - (inter+1)/interval_{num}
      inter_{obs}^{depth}[min_{current} \leq pd_{obs} \leq min_{previous}] \leftarrow min_{current} + 1/interval_{num}
      min_{previous} \leftarrow min_{current}
end for
len_{dets} \leftarrow \max(pd_{dets}) - \min(pd_{dets})
pd_{dets} \leftarrow (pd_{dets} - \min(pd_{dets}))\ /\ len_{dets}
min_{previous} \leftarrow 1
/* Compute interval depth for new detections */
for inter \leftarrow 0 to interval_{num}-1 do
      min_{current} \leftarrow 1 - (inter+1)/interval_{num}
      inter_{dets}^{depth}[min_{current} \leq pd_{dets} \leq min_{previous}] \leftarrow min_{current} + 1/interval_{num}
      min_{previous} \leftarrow min_{current}
end for
C_{QPD} \leftarrow \mathrm{abs}(inter_{obs}^{depth} - inter_{dets}^{depth})
return C_{QPD}
Algorithm 1 The pseudocode of QPDM.
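For concreteness, the following NumPy sketch mirrors Algorithm 1, assuming the pseudo-depths are given as 1D arrays; the function name, the epsilon guard, and the vectorized quantization are our additions rather than the released code.

import numpy as np

def qpdm_cost(pd_obs, pd_dets, interval_num=8):
    """Quantized pseudo-depth cost matrix between tracklets and detections (Eqs. 7-8)."""
    def interval_depth(pd):
        pd = np.asarray(pd, dtype=float)
        span = max(pd.max() - pd.min(), 1e-12)       # guard against a degenerate range
        norm = (pd - pd.min()) / span                # min-max normalization
        idx = np.minimum((norm * interval_num).astype(int), interval_num - 1)
        return (idx + 1) / interval_num              # upper limit of each sub-interval

    d_obs = interval_depth(pd_obs)                   # shape: (num_tracklets,)
    d_det = interval_depth(pd_dets)                  # shape: (num_detections,)
    return np.abs(d_obs[:, None] - d_det[None, :])   # pairwise |difference|, Eq. 8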

III-D Camera Motion Compensation

In our association method, the motion information consists of three parts: the DVIoU similarity, the OCM velocity-direction consistency, and the quantized pseudo-depth cost. Among them, both DVIoU and OCM are sensitive to the position information of the target. For example, in DVIoU, the depth volume is the product of the pseudo-depth and the 2D box area. The pseudo-depth is relative position information that is robust to camera motion, but the 2D bounding box overlap is sensitive to position drift. Once the position of either the previous observation or the current detection drifts, the overlap area changes substantially and can lead to incorrect association. Meanwhile, OCM relies on the center-point coordinates of historical observations to calculate the velocity direction, which is also sensitive to offsets of the target center point. Thus, the accuracy of the target position is essential for association quality.

However, when the camera moves, the position of the target in the view also shifts, which affects the association result. To this end, we introduce CMC before the KF prediction step for more robust tracklet-detection association in the coming frame. Specifically, we use the OpenCV [38] implementation of the Video Stabilization module with affine transformation to generate transforms using key point extraction [39], sparse optical flow [40], and RANSAC [41], as in previous work [18]. Given a scale and rotation matrix M \in R^{2\times 2} and a translation T \in R^{2\times 1}, we correct the camera motion of the KF state and the target’s historical observations as follows.

III-D1 KF State Correction

The KF state X of our method is given in Eq. 3, where (x_{c}, y_{c}) is the center coordinate of the target, pd is the pseudo-depth of the target, and s, r are the bounding box area and aspect ratio, respectively; v_{x}, v_{y}, v_{pd}, v_{s} are the corresponding velocities. We apply CMC to the state X and the KF covariance matrix P following Eq. 9.

\left\{\begin{array}{l}X[0:2]=MX[0:2]+T\\ X[5:7]=MX[5:7]+T\\ P[0:2,\ 0:2]=MP[0:2,\ 0:2]M^{T}\\ P[5:7,\ 5:7]=MP[5:7,\ 5:7]M^{T}\end{array}\right. (9)

III-D2 Historical Observation Correction

The three observation-centric modules in OC-SORT (OCM, ORU, and OCR) use the center positions of historical observations to compute the direction of target motion, to generate virtual positions when trajectory interruptions and reappearances happen, and to match with KF predictions, respectively. Thus, we also apply CMC to the tracklets’ historical observations. Supposing the center position of a historical observation is p_{c} = (x_{c}, y_{c}), the CMC is performed as in Eq. 10.

p_{c} = Mp_{c} + T (10)

By correcting the target center position in Kalman filter state vectors and historical observations, we reduce the error in the DVIoU computation, while making the velocity-direction consistency computation of the OCM module more accurate, thus improving the overall association accuracy.
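The correction in Eqs. 9-10 can be sketched as follows, where X is the 9-dimensional state of Eq. 3, P its covariance, and centers an array of historical observation centers; the helper is a simplified assumption of how the transforms are applied (with T flattened to shape (2,) for convenience), not the released implementation.

import numpy as np

def apply_cmc(X, P, centers, M, T):
    """Warp the KF state/covariance (Eq. 9) and historical observation centers (Eq. 10)
    with a 2x2 scale-rotation matrix M and a translation vector T of shape (2,)."""
    X, P, centers = X.copy(), P.copy(), centers.copy()
    X[0:2] = M @ X[0:2] + T                      # center position (x_c, y_c)
    X[5:7] = M @ X[5:7] + T                      # position velocities, as written in Eq. 9
    P[0:2, 0:2] = M @ P[0:2, 0:2] @ M.T
    P[5:7, 5:7] = M @ P[5:7, 5:7] @ M.T
    centers = centers @ M.T + T                  # Eq. 10 applied to every (x_c, y_c) row
    return X, P, centers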

III-E Algorithm Overall Framework

For new detections in each frame, OC-SORT performs a two-stage association: a first stage of regular association using the IoU and the velocity consistency (OCM), followed by a second stage that recovers lost tracklets using the IoU only (OCR). PD-SORT follows the association flow of OC-SORT and additionally injects pseudo-depth cues into both stages. First, the QPDM module, which directly leverages pseudo-depth, is introduced into the regular association. Meanwhile, the conventional IoU similarities used in both rounds of association are replaced with the proposed DVIoU, which also uses pseudo-depth. The composition of the final cost matrix is shown in Eq. 11.

C = C_{DVIoU} + \lambda_{1}C_{QPD} + \lambda_{2}C_{OCM} (11)

Here, C_{DVIoU} is the negative of the DVIoU between the KF predictions and the detections, C_{QPD} is the QPDM cost, and C_{OCM} is inherited from OC-SORT as the velocity direction consistency difference between historical observations and new detections. \lambda_{1} and \lambda_{2} are two weighting factors. The detailed pseudo-code of PD-SORT is given in Algorithm 2.

Input: Detections Z = \{z_{k}^{i} \mid 1 \leq k \leq T,\ 1 \leq i \leq N_{k}\}; Kalman filter KF; threshold to remove untracked tracklets t_{expire}
Output: The set of tracklets \mathcal{T} = \{\tau_{i}\}
Initialization: \mathcal{T} \leftarrow \emptyset
for timestep t \leftarrow 1 to T do
      /* Step 1: regular association to match detections with tracklets */
      Z_{t} \leftarrow \{z_{t}^{1},\ \ldots,\ z_{t}^{N_{t}}\}
      Apply CMC to the last observations and last KF states of all tracklets in \mathcal{T}
      \hat{X}_{t} \leftarrow \{\hat{x}_{t}^{1},\ \ldots,\ \hat{x}_{t}^{|\mathcal{T}|}\}   /* estimations by KF.predict */
      Z \leftarrow historical observations of the existing tracklets
      C_{t} \leftarrow C_{DVIoU}(\hat{X}_{t},\ Z_{t}) + \lambda_{1}C_{QPD}(Z,\ Z_{t}) + \lambda_{2}C_{OCM}(Z,\ Z_{t})
      Linear assignment by the Hungarian algorithm with cost C_{t}
      \mathcal{T}_{t}^{matched} \leftarrow tracklets matched to a detection
      \mathcal{T}_{t}^{remain} \leftarrow tracklets not matched to a detection
      Z_{t}^{remain} \leftarrow detections not matched to any tracklet
      /* Step 2: perform OCR to recover lost tracklets */
      Z^{\mathcal{T}_{t}^{remain}} \leftarrow last matched detection of each tracklet in \mathcal{T}_{t}^{remain}
      C_{t}^{remain} \leftarrow C_{DVIoU}(Z^{\mathcal{T}_{t}^{remain}},\ Z_{t}^{remain})
      Linear assignment by the Hungarian algorithm with cost C_{t}^{remain}
      Z_{t}^{unmatched} \leftarrow detections unmatched to any tracklet
      Update \mathcal{T}_{t}^{matched} and \mathcal{T}_{t}^{remain}
      /* Step 3: update states of matched tracklets */
      for \tau in \mathcal{T}_{t}^{matched} do
            Perform ORU as in OC-SORT to update the KF parameters
      end for
      /* Step 4: initialize and remove tracklets */
      \mathcal{T}_{t}^{new} \leftarrow new tracklets initialized from Z_{t}^{unmatched}
      for \tau in \mathcal{T}_{t}^{remain} do
            \tau.untracked \leftarrow \tau.untracked + 1
      end for
      \mathcal{T}_{t}^{reserved} \leftarrow \{\tau \mid \tau \in \mathcal{T}_{t}^{remain}\ \text{and}\ \tau.untracked < t_{expire}\}
      \mathcal{T} \leftarrow \{\mathcal{T}_{t}^{new},\ \mathcal{T}_{t}^{matched},\ \mathcal{T}_{t}^{reserved}\}
end for
return \mathcal{T}
Algorithm 2 The pseudocode of PD-SORT.
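To make the two-stage matching in Algorithm 2 concrete, the sketch below performs one round of linear assignment with SciPy's Hungarian solver; the threshold value, function names, and the placeholder cost functions in the trailing comments are illustrative assumptions rather than the released code.

import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(cost, cost_threshold=1.0):
    """One round of linear assignment on a (num_tracklets, num_detections) cost
    matrix; matched pairs whose cost exceeds the threshold are rejected."""
    if cost.size == 0:
        return [], list(range(cost.shape[0])), list(range(cost.shape[1]))
    rows, cols = linear_sum_assignment(cost)
    matches = []
    un_trk, un_det = set(range(cost.shape[0])), set(range(cost.shape[1]))
    for r, c in zip(rows, cols):
        if cost[r, c] <= cost_threshold:
            matches.append((r, c))
            un_trk.discard(r)
            un_det.discard(c)
    return matches, sorted(un_trk), sorted(un_det)

# Regular association (Step 1, cost as in Eq. 11), then OCR on the leftovers (Step 2):
# C = -dviou_matrix(kf_preds, dets) + lam1 * qpdm_cost(pd_obs, pd_dets) + lam2 * ocm_cost(obs, dets)
# matches, un_trk, un_det = associate(C)
# C_remain = -dviou_matrix(last_obs[un_trk], dets[un_det])
# ocr_matches, _, _ = associate(C_remain)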

IV Experiments

IV-A Datasets and Metrics

IV-A1 Datasets

We evaluated our model under the “private detection” protocol on multiple MOT datasets, including DanceTrack [14], MOT17 [42] and MOT20 [43]. The MOT17 dataset contains 7 training videos and 7 test videos, in which the targets have different appearances and nearly linear motions. The MOT20 dataset contains 4 training videos and 4 test videos, where the scenes are similar to those in MOT17 but are more crowded. DanceTrack is a recently proposed dataset where targets have similar appearances, nonlinear motions, and frequent occlusions. DanceTrack consists of 40 training videos, 25 validation videos, and 35 test videos, with more frames to comprehensively reflect the tracker’s performance. Meanwhile, the detection task in DanceTrack is relatively simple, making it ideal for association quality evaluation. Considering the characteristics of the above datasets and the goal of improving association ability in scenes with occlusions and nonlinear motions, we prioritize the comparison results on the DanceTrack dataset. Meanwhile, the generalization ability of our tracker is evaluated on both MOT17 and MOT20.

IV-A2 Metrics

We take HOTA [44] as our main metric as it provides a comprehensive evaluation of tracking quality in terms of both the detection accuracy and the association accuracy. Besides, we also adopt MOTA, AssA, IDF1, and other commonly used metrics to reflect the performance of tracking algorithms from different aspects [44, 45, 46]. Here, MOTA combines false positives, missed targets, and identity switches (IDs), and focuses on the detection performance, while AssA and IDF1 reflect the ability of associations.

IV-A3 Implementation Details

To maintain a fair comparison, we use the same detector as previous works. Specifically, our detection model is YOLOX [12] with publicly available weights from our baseline OC-SORT. The weight factor for the QPDM cost is 0.2 on both DanceTrack and MOT17, and 0.36 on MOT20, where our QPDM is more beneficial. For simplicity, we divide the pseudo-depth into 8 sub-intervals in QPDM for all three benchmarks. The OCM cost weights are 0.2 on DanceTrack and MOT17, and 0.04 on MOT20. The IoU thresholds during association are 0.3 for DanceTrack and MOT17, and 0.35 for MOT20. Following the common practice of SORT-like methods, we set the detection confidence threshold to 0.4 for MOT20 and 0.6 for the other datasets. All experiments are performed on an Intel i5-13600K CPU @ 2.60 GHz and a single NVIDIA GeForce RTX 4090 GPU.
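For reference, the per-dataset settings listed above can be summarized as a configuration dictionary; the key names below are illustrative and do not correspond to the released code.

HYPERPARAMS = {
    # dataset: QPDM weight, OCM weight, association IoU threshold,
    #          detection confidence threshold, number of pseudo-depth sub-intervals
    "DanceTrack": {"qpdm_weight": 0.20, "ocm_weight": 0.20, "iou_thresh": 0.30,
                   "det_thresh": 0.60, "interval_num": 8},
    "MOT17":      {"qpdm_weight": 0.20, "ocm_weight": 0.20, "iou_thresh": 0.30,
                   "det_thresh": 0.60, "interval_num": 8},
    "MOT20":      {"qpdm_weight": 0.36, "ocm_weight": 0.04, "iou_thresh": 0.35,
                   "det_thresh": 0.40, "interval_num": 8},
}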

IV-B Benchmarks Evaluation

We compare our PD-SORT with state-of-the-art trackers on the test sets of DanceTrack, MOT17, and MOT20, as shown in Tables I, II, and III, respectively. Note that all of the test results are evaluated on official websites.

IV-B1 Baseline Selection

OC-SORT is a motion-based, SORT-like tracker. As shown in Table I, OC-SORT shows strong tracking performance on the DanceTrack dataset in terms of HOTA, IDF1, AssA, and AssR compared to previous methods. Among the methods with comparable performance, StrongSORT++ and STAT integrate additional appearance feature components, and SparseTrack employs a subset decomposition and cascading strategy; these models involve more sophisticated designs and higher computational costs. In contrast, OC-SORT achieves competitive performance while maintaining a simple, extensible architecture and real-time tracking speed. Therefore, we select OC-SORT as our baseline method.

IV-B2 DanceTrack

We report experimental results on the DanceTrack test set in Table I to evaluate PD-SORT under complex scenes with similar appearances, nonlinear motions, and frequent occlusions. Compared with its baseline OC-SORT, PD-SORT makes considerable progress in most core metrics (i.e., +3.6 HOTA, +0.2 DetA, +1.9 AssA, +2.9 IDF1). Specifically, it achieves a significantly higher HOTA than previous trackers, a relative gain of 6.6% over the base method, which shows the strength of depth cues in improving the overall tracking quality. Also, the improvements on both AssA (+1.9) and IDF1 (+2.9) are substantial, which further indicates the benefit of depth information to the association.

The underlying reason is that previous methods leverage pure 2D motion information, making it difficult to distinguish objects with highly overlapped bounding boxes, which often happens in occlusion cases. In contrast, we use pseudo-depth to provide additional cues for association. By integrating our proposed pseudo-depth modules, the occlusion-induced problems are effectively alleviated, demonstrating the robustness of PD-SORT in handling challenging scenes with diverse motions and occlusions, as in DanceTrack. For computational efficiency, we test the frames per second (FPS) of our method (28.7 FPS) and the baseline (35.1 FPS) on the same device. At the cost of only 6.4 FPS, the tracking performance improves significantly.

TABLE I: Results on DanceTrack test set. SORT, DeepSORT, ByteTrack, StrongSORT++, SparseTrack, STAT, OC-SORT and our method share the same detections.
Tracker Reference HOTA↑ DetA↑ AssA↑ MOTA↑ IDF1↑
TraDeS [47] CVPR21 43.3 74.5 25.4 86.2 41.2
MOTR [21] ECCV22 54.2 73.5 40.2 79.7 51.5
GTR [48] CVPR22 48.0 72.5 31.9 84.7 50.3
CenterTrack [49] ECCV20 41.8 78.1 22.6 86.8 35.7
FairMOT [29] IJCV21 39.7 66.7 23.8 82.2 40.8
QDTrack [50] CVPR21 45.7 72.1 29.2 83.0 44.8
TransTrack [19] arXiv20 45.5 75.9 27.5 88.4 45.2
SORT [6] ICIP16 47.9 72.0 31.2 91.8 50.8
DeepSORT [8] ICIP17 45.6 71.0 29.7 87.8 47.9
ByteTrack [9] ECCV22 47.3 71.6 31.4 89.5 52.5
StrongSORT++ [30] TMM23 55.6 80.7 38.6 91.1 55.2
SparseTrack [15] arXiv23 55.5 78.9 39.1 91.3 58.3
STAT [31] TMM23 57.4 80.8 40.9 91.5 59.2
OC-SORT [10] CVPR23 54.6 80.4 40.2 89.6 54.6
PD-SORT Ours 58.2 80.6 42.1 89.6 57.5

IV-B3 MOT17 & MOT20

In addition to DanceTrack, we also evaluate our method on the general MOT Challenge datasets under the private detection protocol. For the results on MOT17 and MOT20, we inherit the linear interpolation from the baseline methods for a fair comparison. The results on the MOT17 test set are presented in Table II. Compared with OC-SORT, PD-SORT makes considerable progress in most core metrics (i.e., +0.8 HOTA, +1.3 MOTA, +1.7 IDF1, +0.9 AssA). The results show that PD-SORT can still achieve performance improvements on linear motion scenes. Generally, the results on MOT17 indicate that PD-SORT generalizes well to scenes with simple motions.

We also report the performance of PD-SORT on MOT20 in Table III. Compared with OC-SORT, PD-SORT achieves performance gains in several core metrics (i.e., +0.5 HOTA, +0.8 IDF1, +1.1 AssA). MOT20 has more crowded scenes and a longer video length than MOT17. Such characteristics pose the challenges of long-term tracking and more severe occlusions for MOT. The results on MOT20 further demonstrate the good generalization ability of PD-SORT and its robustness against dense scenes with occlusions.

TABLE II: Results on MOT17 test set with the private detections. ByteTrack, STAT, OC-SORT and our method share the same detections.
Tracker Reference HOTA↑ MOTA↑ IDF1↑ IDs↓ AssA↑ AssR↑
TrackFormer [20] CVPR22 57.3 74.1 68.0 2829 - -
MOTR [21] ECCV22 57.8 73.4 68.6 2439 55.7 -
MeMOT [51] CVPR22 56.9 72.5 69.0 2724 55.2 -
MOTFR [52] TCSVT22 61.8 74.4 76.3 2652 62.6 67.8
MAA [53] WACVW22 62.0 79.4 75.9 1452 60.2 67.3
MOTRv2 [54] CVPR23 62.0 78.6 75.0 - 60.6 -
MO3TR-PIQ[55] TPAMI23 57.3 72.3 69.0 2200 - -
FairMOT [29] IJCV21 59.3 73.7 72.3 3303 58.0 63.6
QDTrack [50] CVPR21 53.9 68.7 66.3 3378 52.7 57.2
CorrTracker [56] CVPR21 60.7 76.5 73.6 3369 58.9 64.4
ByteTrack [9] ECCV22 63.1 80.3 77.3 2196 62.0 68.2
STAT [31] TMM23 63.7 78.7 79.0 2754 63.4 70.6
OC-SORT [10] CVPR23 63.2 78.0 77.5 1950 63.2 67.5
PD-SORT Ours 63.9 79.3 79.2 1062 64.1 69.4
TABLE III: Results on MOT20 test set with the private detections. ByteTrack, GHOST, STAT, OC-SORT and Ours share the same detections.
Tracker Reference HOTA↑ MOTA↑ IDF1↑ IDs↓ AssA↑ AssR↑
MeMOT [51] CVPR22 54.1 63.7 66.1 1938 55.0 -
MOTRv2 [54] CVPR23 61.0 76.2 73.1 - 59.3 -
TransMOT [57] WACV23 61.9 77.5 75.2 1615 60.1 66.3
RelationTrack [58] TMM23 56.5 67.2 70.5 4243 56.4 60.3
FairMOT [29] IJCV21 54.6 61.8 67.3 5243 54.7 60.7
MOTFR [52] TCSVT22 57.2 69.0 71.7 3648 57.1 62.6
MAA [53] WACVW22 57.3 73.9 71.2 1331 55.1 61.1
CSTrack [59] TIP22 54.0 66.6 68.6 3196 54.0 57.6
ByteTrack [9] ECCV22 61.3 77.8 75.2 1223 59.6 66.2
GHOST [60] CVPR23 61.2 73.7 75.2 1264 - -
STAT [31] TMM23 62.5 75.5 76.4 975 62.8 68.2
OC-SORT [10] CVPR23 62.1 75.5 75.9 913 62.0 67.5
PD-SORT Ours 62.6 75.4 76.7 908 63.1 68.4

IV-C Ablation Study

IV-C1 Component Ablation

We perform ablation studies on the validation set of DanceTrack to evaluate the impact of each module in the proposed PD-SORT under complex occlusion scenes. To achieve a valid assessment, we use the same detection model and weights as the base method, OC-SORT, across all experiments. Also, the parameter settings follow those in the baseline. Table IV presents the contribution of each module by progressively adding modules to the base method. By correcting the position states, the CMC module benefits the other modules with more accurate motion estimation in dynamic camera scenes. Notably, nonlinear object motions and occlusions happen frequently in DanceTrack. In such situations, the depth information becomes a reliable cue to compensate for the cases where pure 2D association fails. Thus, with proper strategies to leverage pseudo-depth in the association, both DVIoU and QPDM are effective in scenes like DanceTrack. DVIoU modulates the box similarities of the objects with pseudo-depth, which is stable and rich in discriminative information while having no negative impact on the model. In particular, the QPDM module directly uses the pseudo-depth to guide the association and achieves significant performance gains on DanceTrack. This also indicates that pseudo-depth quantization is a robust technique for handling occlusions with nonlinear motions. Additionally, DanceTrack sequences are considerably longer than those in conventional datasets like MOT17. The effectiveness of DVIoU and QPDM on this dataset also shows the potential of the pseudo-depth-based method for long-term MOT. In general, the results in Table IV demonstrate the contributions of each component in challenging scenes with complex motions and occlusions.

To more intuitively display the contribution of the modules, we also visualize the performance of the methods on the DanceTrack validation set, as illustrated in Fig. 6. We can see that each step from the base method to PD-SORT achieves improvements in most metrics. It is worth noting that QPDM, as a module that directly utilizes pseudo-depth information, brings particularly obvious performance improvements, which further verifies the effectiveness of pseudo-depth in scenarios similar to DanceTrack.

TABLE IV: Ablation study on DanceTrack-val.
CMC DVIoU QPDM HOTA↑ AssA↑ MOTA↑ IDF1↑
52.2 35.3 87.3 51.9
52.9 36.4 87.2 52.8
53.2 36.6 87.3 52.9
54.7 38.5 87.5 54.2
55.5 39.8 87.4 55.4
Figure 6: Radar chart of the gains obtained through different combinations of modules on the validation set of DanceTrack. The values in the graph are obtained by min-max normalizing each metric in Table IV.

IV-C2 Impact of Pseudo-Depth Quantization

We compare the QPDM module using quantized pseudo-depth as the matching metric with an alternative approach that directly uses the absolute difference (ABS) between continuous pseudo-depth values. As shown in Table V, QPDM with six or more pseudo-depth intervals consistently outperforms ABS across metrics. This highlights the advantage of quantizing pseudo-depth into subintervals for robust similarity distance measurement.

TABLE V: Results of different pseudo-depth matching strategies on DanceTrack validation set.
Matching Strategy HOTA↑ AssA↑ MOTA↑ IDF1↑
ABS 54.8 38.8 87.4 54.6
QPDM (Interval Num=2) 54.6 38.7 87.4 55.0
QPDM (Interval Num=4) 54.2 38.0 87.5 53.7
QPDM (Interval Num=6) 55.3 39.6 87.4 55.2
QPDM (Interval Num=8) 55.5 39.8 87.4 55.4
QPDM (Interval Num=10) 55.4 39.7 87.4 55.1

IV-C3 Number of Pseudo-Depth Intervals in QPDM

In Table V, we also investigate the influence of the sub-interval number on the DanceTrack validation set. Specifically, we test sub-interval numbers from 2 to 10 with a step of 2. The performance gain from QPDM is low for small numbers of sub-intervals. We attribute this to the fact that coarser sub-interval divisions yield fewer differences in depth and thus provide less guidance for distinguishing targets. As the number of sub-intervals reaches 6 to 8, most metrics reach their best results, and they drop as the number increases to 10. The reason is that overly fine-grained sub-interval divisions cause oversensitivity to changes in the targets' relative locations. Furthermore, the sparsity of the target distribution influences the choice of the ideal number of sub-intervals. Overall, the tracker with a pseudo-depth sub-interval number of 8 achieves the best metrics. Thus, we use 8 as the sub-interval number for all experiments and the reported results on the test sets.

IV-C4 DVIoU or Standard IoU

We also investigate the proper IoU strategies for both rounds of association, namely the regular association and the OCR in OC-SORT. Specifically, we test the standard 2D IoU and our proposed depth volume IoU (DVIoU) for similarity evaluation in these associations. The experimental results on the DanceTrack validation set are shown in Table VI. Using DVIoU for both rounds of association brings the best performance, which further demonstrates that the depth cue provides stable discriminative information and robustly improves the tracking quality.

TABLE VI: Results of different IoU on DanceTrack validation set.
Regular Association OCR HOTA↑ AssA↑ MOTA↑ IDF1↑
IoU IoU 55.2 39.4 87.4 55.0
DVIoU IoU 55.4 39.7 87.4 55.3
IoU DVIoU 55.2 39.4 87.4 55.1
DVIoU DVIoU 55.5 39.8 87.4 55.4

IV-C5 Impact of Complementary View

We evaluate the effectiveness of the complementary view in pseudo-depth estimation by constructing a variant for comparison. Following SparseTrack, this variant estimates the pseudo-depth directly as the distance from the bottom of the target bounding box to the bottom of the image view. As shown in Table VII, incorporating the complementary view contributes to superior performance across multiple metrics. By improving the estimation robustness in boundary cases, the subsequent components DVIoU and QPDM based on pseudo-depth can provide more accurate guidance for target association.

TABLE VII: Results of different pseudo-depth estimation methods on DanceTrack validation set.
Estimation Method HOTA↑ AssA↑ MOTA↑ IDF1↑
w/o complementary view 54.0 37.8 87.1 53.3
w/ complementary view 55.5 39.8 87.4 55.4

IV-C6 Validation of CMC on KF States and Historical Observations

In addition, we also explore the effectiveness of applying CMC correction to the KF states as well as to the historical observations in our PD-SORT, and the results are shown in Table VIII. Both applying CMC to the KF states (CMC-KF) and applying it to the historical observations (CMC-HISOB) individually benefits the tracking performance. Furthermore, jointly applying CMC to both the KF states and the historical observations brings even better overall performance.

TABLE VIII: Evaluation of different CMC strategies on DanceTrack validation set.
CMC-KF CMC-HISOB HOTA↑ AssA↑ MOTA↑ IDF1↑
54.7 38.5 87.5 54.2
55.2 39.7 87.5 55.1
54.8 39.4 87.5 54.4
55.5 39.8 87.4 55.4

IV-D Visualization

The performance comparisons between the classical 2D tracker (OC-SORT) and our proposed approach (PD-SORT) utilizing pseudo-depth on DanceTrack are shown in Fig. 7. From the visualized results, our method can handle identity consistency problems well in challenging scenes with occlusions and nonlinear object motions, thus leading to a robust association. Specifically, PD-SORT can handle three typical kinds of occlusion-induced ID problems, namely the ID replacement of the foreground object by the occluded object, the ID reinitialization of the occluded object after reappearance, and the ID swap of objects under occlusion and trajectory intersection. In such cases, the depth of the object provides discriminative information that fixes the association failure of pure 2D information.

Figure 7: Visualization of the tracking results between the 2D tracker OC-SORT and the proposed PD-SORT tracker utilizing pseudo-depth on the DanceTrack dataset. Different colors represent different identities. Our PD-SORT produces fewer identity-related association errors under occlusions.

IV-E Limitations

Our experiments reveal several limitations of PD-SORT. One concern is its association ability under long-term occlusion. In such cases, if the occluded object moves quickly, the motion consistency can fail to match the reappeared object to its previous trajectory. This is a common problem for motion-based MOT trackers; incorporating appearance models or learnable association matchers could be effective remedies. Another concern is that our projection-based pseudo-depth estimation is performed at the instance level, without generating a full depth map of the entire image, which limits the full use of depth information. Besides, in highly crowded environments, the presence of numerous targets with similar pseudo-depth values and significant overlap between objects can reduce the discriminative power of pseudo-depth cues. Similarly, rapid motion changes challenge the tracker's ability to maintain accurate pseudo-depth estimates, potentially affecting association precision. Exploring network-based depth estimators and incorporating context-aware techniques could be potential solutions to these issues. In addition, although our method performs well on the HOTA metric, the gain on the MOTA metric is not significant, and PD-SORT even shows a slightly lower MOTA than the baseline on the MOT20 test set. This may be caused by missing low-confidence detection results, which could be addressed with an adaptive detection threshold strategy. Future work is needed to incorporate appearance cues and to develop more comprehensive strategies to exploit all possible targets.

V Conclusion

In this paper, we demonstrate the feasibility of incorporating pseudo-depth into the object motion model in motion-based MOT. Pseudo-depth information can provide guidance for association when 2D information fails. Consequently, we present PD-SORT, which leverages pseudo-depth to enhance the tracker's association performance. Specifically, we integrate pseudo-depth into the KF and employ two simple designs, DVIoU and QPDM, to leverage the depth information in matching. Moreover, we use the camera motion compensation technique to handle camera motion. Notably, PD-SORT remains a simple, online, real-time, and pure motion-based tracker while being more robust against occlusions. Experiments on diverse datasets show that PD-SORT consistently outperforms its baseline and most state-of-the-art methods on scenes with different motions and densities. The performance gain is especially significant in dense scenes with similar appearances and nonlinear object motions. Specifically, PD-SORT achieves 58.2 HOTA, 80.6 DetA, 42.1 AssA, and 57.5 IDF1 on the DanceTrack test set at 28.7 FPS, which is +3.6 HOTA, +0.2 DetA, +1.9 AssA, and +2.9 IDF1 over the baseline.

In future work, we plan to explore more effective depth utilization strategies and integrate learnable association modules to further enhance tracking performance. Also, we plan to incorporate additional context-aware information (e.g., actual depth data, appearance cues, infrared data) to improve the tracker’s robustness in complex scenes that contain highly crowded and fast-moving objects. Finally, we hope the occlusion-robust characteristic and generalization ability of PD-SORT can make it attractive for application in consumer electronics and inspire future research to further investigate the depth cues and make MOT methods more practical.

References

  • [1] C. Luo, X. Yang, and A. Yuille, “Exploring simple 3d multi-object tracking for autonomous driving,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 488–10 497.
  • [2] S. Kim, B.-j. Lee, J.-w. Jeong, and M.-j. Lee, “Multi-object tracking coprocessor for multi-channel embedded dvr systems,” IEEE Transactions on Consumer Electronics, vol. 58, no. 4, pp. 1366–1374, 2012.
  • [3] B. Iepure and A. W. Morales, “A novel tracking algorithm using thermal and optical cameras fused with mmwave radar sensor data,” IEEE Transactions on Consumer Electronics, vol. 67, no. 4, pp. 372–382, 2021.
  • [4] K. Yang, H. Zhang, J. Shi, and J. Ma, “Bandt: A border-aware network with deformable transformers for visual tracking,” IEEE Transactions on Consumer Electronics, vol. 69, no. 3, pp. 377–390, 2023.
  • [5] M. Zhao, L. Cheng, Y. Sun, and J. Ma, “Human video instance segmentation and tracking via data association and single-stage detector,” IEEE Transactions on Consumer Electronics, vol. 70, no. 1, pp. 2979–2988, 2024.
  • [6] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple online and realtime tracking,” in 2016 IEEE International Conference on Image Processing (ICIP).   IEEE, 2016, pp. 3464–3468.
  • [7] E. Bochinski, V. Eiselein, and T. Sikora, “High-speed tracking-by-detection without using image information,” in 2017 14th IEEE International Conference on Advanced Video and Signal based Surveillance (AVSS).   IEEE, 2017, pp. 1–6.
  • [8] N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime tracking with a deep association metric,” in 2017 IEEE International Conference on Image Processing (ICIP).   IEEE, 2017, pp. 3645–3649.
  • [9] Y. Zhang, P. Sun, Y. Jiang, D. Yu, F. Weng, Z. Yuan, P. Luo, W. Liu, and X. Wang, “Bytetrack: Multi-object tracking by associating every detection box,” in European Conference on Computer Vision.   Springer, 2022, pp. 1–21.
  • [10] J. Cao, J. Pang, X. Weng, R. Khirodkar, and K. Kitani, “Observation-centric sort: Rethinking sort for robust multi-object tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 9686–9696.
  • [11] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” Advances in Neural Information Processing Systems, vol. 28, 2015.
  • [12] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, “Yolox: Exceeding yolo series in 2021,” arXiv preprint arXiv:2107.08430, 2021.
  • [13] D. F. Crouse, “On implementing 2d rectangular assignment algorithms,” IEEE Transactions on Aerospace and Electronic Systems, vol. 52, no. 4, pp. 1679–1696, 2016.
  • [14] P. Sun, J. Cao, Y. Jiang, Z. Yuan, S. Bai, K. Kitani, and P. Luo, “Dancetrack: Multi-object tracking in uniform appearance and diverse motion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 20 993–21 002.
  • [15] Z. Liu, X. Wang, C. Wang, W. Liu, and X. Bai, “Sparsetrack: Multi-object tracking by performing scene decomposition based on pseudo-depth,” arXiv preprint arXiv:2306.05238, 2023.
  • [16] R. E. Kalman et al., “Contributions to the theory of optimal control,” Bol. soc. mat. mexicana, vol. 5, no. 2, pp. 102–119, 1960.
  • [17] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 658–666.
  • [18] N. Aharon, R. Orfaig, and B.-Z. Bobrovsky, “Bot-sort: Robust associations multi-pedestrian tracking,” arXiv preprint arXiv:2206.14651, 2022.
  • [19] P. Sun, J. Cao, Y. Jiang, R. Zhang, E. Xie, Z. Yuan, C. Wang, and P. Luo, “Transtrack: Multiple object tracking with transformer,” arXiv preprint arXiv:2012.15460, 2020.
  • [20] T. Meinhardt, A. Kirillov, L. Leal-Taixe, and C. Feichtenhofer, “Trackformer: Multi-object tracking with transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8844–8854.
  • [21] F. Zeng, B. Dong, Y. Zhang, T. Wang, X. Zhang, and Y. Wei, “Motr: End-to-end multiple-object tracking with transformer,” in European Conference on Computer Vision.   Springer, 2022, pp. 659–675.
  • [22] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
  • [23] J. He, Z. Huang, N. Wang, and Z. Zhang, “Learnable graph matching: Incorporating graph partitioning with deep feature learning for multiple object tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5299–5309.
  • [24] H. W. Kuhn, “The hungarian method for the assignment problem,” Naval Research Logistics Quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
  • [25] H. Suljagic, E. Bayraktar, and N. Celebi, “Similarity based person re-identification for multi-object tracking using deep siamese network,” Neural Computing and Applications, vol. 34, no. 20, pp. 18 171–18 182, 2022.
  • [26] H. Luo, W. Jiang, Y. Gu, F. Liu, X. Liao, S. Lai, and J. Gu, “A strong baseline and batch normalization neck for deep person re-identification,” IEEE Transactions on Multimedia, vol. 22, no. 10, pp. 2597–2609, 2020.
  • [27] L. He, X. Liao, W. Liu, X. Liu, P. Cheng, and T. Mei, “Fastreid: A pytorch toolbox for general instance re-identification,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 9664–9667.
  • [28] Z. Wang, L. Zheng, Y. Liu, Y. Li, and S. Wang, “Towards real-time multi-object tracking,” in European Conference on Computer Vision.   Springer, 2020, pp. 107–122.
  • [29] Y. Zhang, C. Wang, X. Wang, W. Zeng, and W. Liu, “Fairmot: On the fairness of detection and re-identification in multiple object tracking,” International Journal of Computer Vision, vol. 129, pp. 3069–3087, 2021.
  • [30] Y. Du, Z. Zhao, Y. Song, Y. Zhao, F. Su, T. Gong, and H. Meng, “Strongsort: Make deepsort great again,” IEEE Transactions on Multimedia, 2023.
  • [31] J. Zhang, M. Wang, H. Jiang, X. Zhang, C. Yan, and D. Zeng, “Stat: Multi-object tracking based on spatio-temporal topological constraints,” IEEE Transactions on Multimedia, 2023.
  • [32] E. Bayraktar, Y. Wang, and A. DelBue, “Fast re-obj: Real-time object re-identification in rigid scenes,” Machine Vision and Applications, vol. 33, no. 6, p. 97, 2022.
  • [33] X. Weng, J. Wang, D. Held, and K. Kitani, “Ab3dmot: A baseline for 3d multi-object tracking and new evaluation metrics,” arXiv preprint arXiv:2008.08063, 2020.
  • [34] T. Yin, X. Zhou, and P. Krahenbuhl, “Center-based 3d object detection and tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11 784–11 793.
  • [35] A. Kim, A. Ošep, and L. Leal-Taixé, “Eagermot: 3d multi-object tracking via sensor fusion,” in 2021 IEEE International conference on Robotics and Automation (ICRA).   IEEE, 2021, pp. 11 315–11 321.
  • [36] P. Dendorfer, V. Yugay, A. Osep, and L. Leal-Taixé, “Quo vadis: Is trajectory forecasting the key towards long-term multi-object tracking?” Advances in Neural Information Processing Systems, vol. 35, pp. 15 657–15 671, 2022.
  • [37] K. G. Quach, P. Nguyen, C. N. Duong, T. D. Bui, and K. Luu, “Depth perspective-aware multiple object tracking,” in Engineering Applications of AI and Swarm Intelligence.   Springer, 2024, pp. 181–205.
  • [38] G. Bradski, “The opencv library.” Dr. Dobb’s Journal: Software Tools for the Professional Programmer, vol. 25, no. 11, pp. 120–123, 2000.
  • [39] J. Shi and Tomasi, “Good features to track,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 1994, pp. 593–600.
  • [40] J.-Y. Bouguet et al., “Pyramidal implementation of the affine lucas kanade feature tracker description of the algorithm,” Intel corporation, vol. 5, no. 1-10, p. 4, 2001.
  • [41] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
  • [42] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler, “Mot16: A benchmark for multi-object tracking,” arXiv preprint arXiv:1603.00831, 2016.
  • [43] P. Dendorfer, H. Rezatofighi, A. Milan, J. Shi, D. Cremers, I. Reid, S. Roth, K. Schindler, and L. Leal-Taixé, “Mot20: A benchmark for multi object tracking in crowded scenes,” arXiv preprint arXiv:2003.09003, 2020.
  • [44] J. Luiten, A. Osep, P. Dendorfer, P. Torr, A. Geiger, L. Leal-Taixé, and B. Leibe, “Hota: A higher order metric for evaluating multi-object tracking,” International Journal of Computer Vision, vol. 129, pp. 548–578, 2021.
  • [45] K. Bernardin and R. Stiefelhagen, “Evaluating multiple object tracking performance: the clear mot metrics,” EURASIP Journal on Image and Video Processing, vol. 2008, pp. 1–10, 2008.
  • [46] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, “Performance measures and a data set for multi-target, multi-camera tracking,” in European conference on computer vision.   Springer, 2016, pp. 17–35.
  • [47] J. Wu, J. Cao, L. Song, Y. Wang, M. Yang, and J. Yuan, “Track to detect and segment: An online multi-object tracker,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12 352–12 361.
  • [48] X. Zhou, T. Yin, V. Koltun, and P. Krähenbühl, “Global tracking transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8771–8780.
  • [49] X. Zhou, V. Koltun, and P. Krähenbühl, “Tracking objects as points,” in European Conference on Computer Vision.   Springer, 2020, pp. 474–490.
  • [50] J. Pang, L. Qiu, X. Li, H. Chen, Q. Li, T. Darrell, and F. Yu, “Quasi-dense similarity learning for multiple object tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 164–173.
  • [51] J. Cai, M. Xu, W. Li, Y. Xiong, W. Xia, Z. Tu, and S. Soatto, “Memot: Multi-object tracking with memory,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8090–8100.
  • [52] J. Kong, E. Mo, M. Jiang, and T. Liu, “Motfr: Multiple object tracking based on feature recoding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 11, pp. 7746–7757, 2022.
  • [53] D. Stadler and J. Beyerer, “Modelling ambiguous assignments for multi-person tracking in crowds,” in 2022 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), 2022, pp. 133–142.
  • [54] Y. Zhang, T. Wang, and X. Zhang, “Motrv2: Bootstrapping end-to-end multi-object tracking by pretrained object detectors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22 056–22 065.
  • [55] T. Zhu, M. Hiller, M. Ehsanpour, R. Ma, T. Drummond, I. Reid, and H. Rezatofighi, “Looking beyond two frames: End-to-end multi-object tracking using spatial and temporal transformers,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 11, pp. 12 783–12 797, 2023.
  • [56] Q. Wang, Y. Zheng, P. Pan, and Y. Xu, “Multiple object tracking with correlation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3876–3886.
  • [57] P. Chu, J. Wang, Q. You, H. Ling, and Z. Liu, “Transmot: Spatial-temporal graph transformer for multiple object tracking,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 4870–4880.
  • [58] E. Yu, Z. Li, S. Han, and H. Wang, “Relationtrack: Relation-aware multiple object tracking with decoupled representation,” IEEE Transactions on Multimedia, vol. 25, pp. 2686–2697, 2023.
  • [59] C. Liang, Z. Zhang, X. Zhou, B. Li, S. Zhu, and W. Hu, “Rethinking the competition between detection and reid in multiobject tracking,” IEEE Transactions on Image Processing, vol. 31, pp. 3182–3196, 2022.
  • [60] J. Seidenschwarz, G. Brasó, V. C. Serrano, I. Elezi, and L. Leal-Taixé, “Simple cues lead to a strong multi-object tracker,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13 813–13 823.