
ShaSTA: Modeling Shape and Spatio-Temporal Affinities for 3D Multi-Object Tracking

Tara Sadjadpour Stanford University
Stanford, CA 94305
[email protected]
   Jie Li NVIDIA
Santa Clara, CA 95051
[email protected]
   Rares Ambrus Toyota Research Institute
Los Altos, CA 94022
[email protected]
   Jeannette Bohg Stanford University
Stanford, CA 94305
[email protected]
Abstract

Multi-object tracking (MOT) is a cornerstone capability of any robotic system. The quality of tracking is largely dependent on the quality of the detector used. In many applications, such as autonomous vehicles, it is preferable to over-detect objects to avoid catastrophic outcomes due to missed detections. As a result, current state-of-the-art 3D detectors produce high rates of false-positives to ensure a low number of false-negatives. This can negatively affect tracking by making data association and track lifecycle management more challenging. Additionally, occasional false-negative detections due to difficult scenarios like occlusions can harm tracking performance. To address these issues in a unified framework, we propose to learn shape and spatio-temporal affinities between tracks and detections in consecutive frames. Our affinity provides a probabilistic matching that leads to robust data association, track lifecycle management, false-positive elimination, false-negative propagation, and sequential track confidence refinement. Though past 3D MOT approaches address a subset of components in this problem domain, we offer the first self-contained framework that addresses all these aspects of the 3D MOT problem. We quantitatively evaluate our method on the nuScenes tracking benchmark where we achieve 1st place amongst LiDAR-only trackers using CenterPoint detections. Our method estimates accurate and precise tracks, while decreasing the overall number of false-positive and false-negative tracks and increasing the number of true-positive tracks. Unlike past works, we analyze our performance with 5 metrics, including AMOTA, the most common tracking accuracy metric. Thereby, we give a comprehensive overview of how our tracking framework may impact the ultimate goal of an autonomous mobile agent. We also present ablative experiments, as well as qualitative results that demonstrate our framework’s capabilities in complex scenarios. Please see our project website for source code and demo videos: sites.google.com/view/shasta-3d-mot/home.

I Introduction

In this work, we address the problem of 3D multi-object tracking using LiDAR point cloud data as input. 3D multi-object tracking is an essential capability for autonomous agents to navigate effectively in their environments. For example, autonomous cars need to understand the motion of surrounding traffic agents to drive safely. Online tracking is achieved by (1) accurately matching uncertain detections to existing tracks and (2) determining when to birth and kill tracks of unmatched observations. These two processes are referred to as data association and track lifecycle management, respectively.

Most recent approaches [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] toward 3D multi-object tracking in autonomous driving apply the tracking-by-detection paradigm whereby off-the-shelf detections are available from a state-of-the-art 3D detector. Though this framework has been successful, there exists a practical gap between the objectives of 3D detection and tracking. Since missing a detection can lead to catastrophic events in self-driving, 3D detectors are evaluated on the full recall spectrum of detections (e.g., mAP [12]) and are deployed with the goal of having high recall. As a result, they tend to generate high rates of false-positives to ensure a low number of false-negatives [7, 13]. These redundant detections lead to highly cluttered outputs that make data association and track lifecycle management more challenging. Additionally, occasional false-negative detections due to factors such as heavy occlusion can cause significant fragmentations in the tracks or lead to the premature termination of tracks.

In this work, we address the task of LiDAR-only tracking. In contrast to camera-only and camera-fusion tracking [14, 11, 3, 9, 10], LiDAR tracking offers many practical advantages, balancing high accuracy with a less complex system configuration and lower computational demands. Still, LiDAR-only systems face significant challenges due to the sparse, unstructured nature of LiDAR point clouds. Recent works in LiDAR-only tracking-by-detection [4, 5, 6, 8] have attempted to develop learning-based and statistical approaches to better handle data association, false-positive elimination, false-negative propagation, and track lifecycle management. However, these approaches do not effectively leverage spatio-temporal and shape information from the LiDAR sensor input, and in some cases, only rely on detected bounding box parameters for spatio-temporal information.

Our main contribution is a 3D multi-object tracking framework called ShaSTA that models shape and spatio-temporal affinities between tracks and detections in consecutive frames. By better understanding objects’ shapes and spatio-temporal contexts, ShaSTA improves data association, false-positive (FP) elimination, false-negative (FN) propagation, newborn (NB) initialization, dead track (DT) termination, and track confidence refinement. At its core, ShaSTA learns an affinity matrix that encodes shape and spatio-temporal information to obtain a probabilistic matching between detections and tracks. ShaSTA also learns representations of FP, NB, FN, and DT anchors for each frame. By computing the affinity between detections and tracks to these learned representations, we can classify detections as newborn tracks or false-positives, while tracks can be classified as dead tracks or false-negatives. Once we use these matches to form tracks, ShaSTA continues to leverage the affinity estimation with a novel sequential track confidence refinement technique that represents track score refinement as a time-dependent problem and leverages information extracted from shape and spatio-temporal encodings to obtain significant improvements in overall tracking accuracy.

This overall approach not only aids in tracking performance but also addresses concerns for real-world deployment and downstream decision-making tasks by eliminating false-positive tracks and minimizing false-negative tracks. ShaSTA offers an efficient and practical implementation that can re-use the LiDAR backbone of most off-the-shelf LiDAR-based 3D detectors. Without loss of generality, we choose the LiDAR backbone from CenterPoint [7] for a fair comparison against most previous works. We demonstrate that our approach achieves state-of-the-art results on the nuScenes tracking benchmark [15] where we achieve significant gains in overall tracking accuracy and precision, while improving the raw number of true-positive, false-positive, and false-negative tracks. We also include ablation studies to analyze the contribution of each component.

II Related Work

II-A LiDAR-based 3D Detection

Most recent 3D multi-object tracking (3D MOT) works follow the tracking-by-detection paradigm [1, 16, 3, 8, 10, 5], making the quality of 3D detectors a key factor in tracking performance [3, 17, 18]. LiDAR-based 3D detectors extend the image-based 2D detection algorithms [19, 20] to 3D. The typical detector starts with an encoder [21, 22, 23] applied to unstructured point cloud data to extract an intermediate 3D feature grid [13] or bird’s-eye-view (BEV) feature map [24, 7]. Then, a decoder is applied to extract objectness [25] and detailed object attributes [26, 7]. Our approach proposes to leverage the intermediate BEV feature map for robust multi-object tracking, which can be generalized to most modern 3D detectors. CenterPoint detections have become a standard input for the 3D MOT community, since the authors released official detection files for all data splits. As of this writing, CenterPoint has the highest mAP of all publicly-released 3D detections. Thus, in this work, we apply CenterPoint [7] for fair comparison against existing 3D MOT techniques.

II-B 3D Multi-Object Tracking

With the advances in 3D detection, the 3D MOT community has been focusing on four directions to establish robust tracking within the tracking-by-detection paradigm: motion prediction, data association, track lifecycle management, and confidence refinement.

In 3D MOT, abstracted detection predictions are often used for motion prediction and data association, which is completed using pair-wise feature distance between tracks and detections [1, 2, 7]. AB3DMOT [1] provides a baseline to combine Kalman Filters and Intersection Over Union (IoU) association for 3D MOT. Subsequent works in this line explore variations of data association metrics, such as the Mahalanobis distance to leverage uncertainty measurements in the Kalman Filter [2]. As an alternative, CenterPoint [7] uses high-fidelity instance velocity predictions as a motion model to complete data association. Unlike these works that only use low-dimensional bounding boxes and explicit motion models for tracking, we propose to leverage intermediate representations from a detection network for richer information about each object, while also modeling the global relationship of all the objects in a scene to improve data association.

Additionally, track lifecycle management has been reasonably successful with heuristics that are tuned to each dataset [1, 2, 7], but these hand-tuned techniques have shortcomings in more complex situations. Recently, OGR3MOT [8] has proposed combining tracking and predictive modeling through a unified graph representation that learns data association, track lifecycle management, FP elimination, and FN interpolation. Additionally, [5] proposes a statistical approach for data association under a Bayesian estimation framework. This work was later extended with Neural-Enhanced Belief Propagation (NEBP) [6] to add a learning-based component to the framework for FP suppression. Though these techniques yield competitive results compared to heuristics, their main shortfall is their inability to capture object shape information. Unlike these techniques, our approach leverages raw LiDAR data to capture shape information about objects in our scene, which we can use to further improve tracking performance.

Finally, track confidence is a less-explored but also important aspect of the tracking problem. Track confidence indicates the quality of a track relative to tracks for the same object class within a given frame. Just as is the case for detection confidences, the most important characteristic of a track’s confidence is its value relative to other tracks’ confidences. In most 3D MOT works, algorithms set the track confidence to the confidence score of the off-the-shelf detection matched to the track [1, 2, 7, 8]. In a recent effort to improve on this aspect of the tracking problem, [4] used an existing detection score refinement technique from [27] to obtain new tracking scores and improve the overall tracking performance. Though this method demonstrates improvements in ablative studies, its main weakness is that it directly applies a detection score refinement technique to the track score, overlooking the fact that unlike detections, tracks are ever-changing, time-dependent quantities that cannot be treated in isolation for the current time step. In this paper, we argue that track score refinement should be a sequential process that reflects changing environmental factors. Thus, we propose a first-of-its-kind approach that represents track score refinement as a sequential problem and leverages shape and spatio-temporal information encoded in our learned affinity matrix to obtain significant improvement in overall tracking accuracy.

Figure 2: Algorithm Flowchart. Going from top to bottom and left to right, ShaSTA uses $N_{max}$ low-dimensional bounding boxes from the current and previous frames to learn bounding box representations for the FP, NB, FN, and DT anchors. The FP and NB anchors are appended to previous frame bounding boxes to form the augmented bounding boxes $\hat{B}_{t-1}$, while the FN and DT anchors are added to the current frame detections to form the augmented bounding boxes $\hat{B}_{t}$. Using the pre-trained frozen LiDAR backbone from our off-the-shelf detector, we extract shape descriptors that are then used to learn shape representations for our anchors. The learned shape representations for the anchors are appended to create augmented shape descriptors. These augmented bounding boxes and shape descriptors are then used to find a residual that captures the spatio-temporal and shape similarities between current and previous frame detections, as well as between detections and anchors. The residual is then used to predict the affinity matrix for probabilistic data associations, track lifecycle management, FP elimination, FN propagation, and track confidence refinement.

Another line of 3D MOT algorithms focuses on using camera-LiDAR fusion to learn data association, track lifecycle management, and FP suppression [11, 3, 9, 10]. However, in this work we focus on LiDAR-only 3D MOT due to the practical advantages it offers. First, it is more cost-efficient to use fewer sensors. Additionally, LiDAR-only tracking reduces the risk of calibration and synchronization issues between various sensor modalities in a more complex system configuration.

III ShaSTA

Our algorithm operates under the tracking-by-detection paradigm and is visualized in Fig. 2. ShaSTA extracts shape and spatio-temporal relationships between consecutive frames to learn affinity matrices that capture associations between detections and tracks. ShaSTA learns bounding box and shape representations for FP, NB, FN, and DT anchors, where each anchor captures the common features between the set of detections and tracks that falls under each of these four categories in a given frame. Thus, ShaSTA’s affinity matrix not only estimates the probabilities of previous frame tracks matching with current frame detections, but also considers previous frame tracks matching with DT or FN anchors and current frame detections matching with NB or FP anchors. Unlike past techniques that only use low-dimensional bounding box representations, ShaSTA leverages shape and spatio-temporal information from the LiDAR sensor input, resulting in a robust data association technique that effectively accounts for FP elimination, FN propagation, NB initialization, and DT termination.

III-A Affinity Matrix Usage for Online Tracking

Allowing for up to $N_{max}$ detections per frame, ShaSTA estimates the affinity matrix $A\in\mathbb{R}^{(N_{max}+2)\times(N_{max}+2)}$. As shown in the teaser figure (Fig. 1), the rows of our affinity matrix represent the previous frame tracks and the columns represent our current frame detections. The two augmented columns are for the DT and FN anchors, so previous frame tracks can match with DT or FN. Similarly, because current frame detections can match with NB or FP, the affinity matrix has two augmented rows for the NB and FP anchors. Though there can be no more than one match for each bounding box in the current and previous frames, the same is not true for our learned anchors, i.e. more than one current frame bounding box can match with the NB or FP anchors, respectively. To handle this situation, we create a forward matching affinity matrix $A_{fm}\in\mathbb{R}^{N_{max}\times(N_{max}+2)}$ by removing the two row augmentations from $A$ and applying a row-wise softmax so that more than one previous frame track can have sufficient probability to match with the DT and FN anchors, respectively:

A_{fm} = \text{softmax}_{\text{row}}\left(A_{1}\right). (1)

Using similar logic, we create a backward matching affinity matrix $A_{bm}\in\mathbb{R}^{(N_{max}+2)\times N_{max}}$ by removing the two column augmentations from $A$ and applying a column-wise softmax so that more than one current frame detection can have sufficient probability to match with the NB and FP anchors, respectively:

A_{bm} = \text{softmax}_{\text{col}}\left(A_{2}\right). (2)

Here, forward matching indicates that we want to find the best current frame match for each previous frame track, while backward matching refers to finding the best previous frame match for each current frame detection.
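To make the matrix manipulations concrete, below is a minimal NumPy sketch of how the forward and backward matching matrices could be derived from the full affinity logits. The helper names are our own, and placing the anchor rows/columns as the last two indices is an assumption consistent with the ground-truth construction in Section IV-B.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def matching_matrices(A):
    """Split the (N_max+2) x (N_max+2) affinity logits A into
    forward- and backward-matching probability matrices.

    Rows of A: previous-frame tracks, plus NB and FP anchor rows (last two).
    Cols of A: current-frame detections, plus DT and FN anchor cols (last two).
    """
    # Forward matching: drop the two anchor rows (NB, FP), keep the DT/FN anchor
    # columns, and normalize each track row over all current-frame detections
    # and the DT/FN anchors.
    A_fm = softmax(A[:-2, :], axis=1)          # (N_max, N_max+2)

    # Backward matching: drop the two anchor columns (DT, FN), keep the NB/FP
    # anchor rows, and normalize each detection column over all previous-frame
    # tracks and the NB/FP anchors.
    A_bm = softmax(A[:, :-2], axis=0)          # (N_max+2, N_max)
    return A_fm, A_bm

# Example with N_max = 4 detections/tracks per frame.
A = np.random.randn(6, 6)
A_fm, A_bm = matching_matrices(A)
assert np.allclose(A_fm.sum(axis=1), 1.0) and np.allclose(A_bm.sum(axis=0), 1.0)
```

Because the anchor rows/columns survive the respective softmax, several tracks can simultaneously carry high probability mass on the DT or FN anchor, and several detections can do so on the NB or FP anchor.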

To form tracks, we combine the affinity matrix outputs and the matching algorithm from [7] as follows. We take $A_{fm}$ and label any previous frame track that has a probability above the threshold $\tau_{dt}$ in the DT column as a DT, and we forward propagate any previous frame track into the current frame if it has a probability above the threshold $\tau_{fn}$ in the FN column. Therefore, in addition to provided detections, forward propagation creates a new detection for the current frame to handle occluded objects that off-the-shelf detections missed. We create a box in the current frame $b^{\prime}_{t}:=(x^{\prime},y^{\prime},z,w,l,h,r_{y})$ by moving the 2D center coordinate of the existing track $b_{t-1}:=(x,y,z,w,l,h,r_{y})$ to the next frame based on the estimated velocity provided by our off-the-shelf 3D detector and $\Delta t$ between the two frames, i.e. $x^{\prime}=x+v_{x}\Delta t$ and $y^{\prime}=y+v_{y}\Delta t$. All previous frame tracks that do not qualify as FN or DT are kept as is. Analogously, we take $A_{bm}$ and remove any current frame detection that has a probability above the threshold $\tau_{fp}$ in the FP row, and we label any current frame detection that has a value above the threshold $\tau_{nb}$ in the NB row as an NB. All current frame detections that do not qualify as FP or NB are kept as is. See Section V-B for details on the threshold values we choose.

With these detections, we then run the greedy algorithm from [7]. In its original form, the greedy algorithm takes all unmatched detections and initializes NBs with them. However, we check if an unmatched detection is labeled as an NB in the affinity matrix based on the threshold $\tau_{nb}$ and whether it is outside the maximum distance threshold with respect to other tracks to initialize it as an NB. Otherwise, it is discarded. Additionally, we check if an unmatched track is labeled as a DT based on the threshold $\tau_{dt}$ and whether it is outside the maximum distance threshold with respect to other detections to terminate it as a DT.
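The lifecycle decisions above can be summarized in the following hedged sketch. The array layout, helper names, and the anchor index convention are assumptions; the thresholds and the velocity-based forward propagation follow the text.

```python
import numpy as np

# Thresholds from Sec. V-B. Assumed layout: DT anchor = column -2, FN anchor = column -1,
# NB anchor = row -2, FP anchor = row -1 (consistent with Sec. IV-B).
TAU_FP, TAU_FN, TAU_NB, TAU_DT = 0.7, 0.5, 0.5, 0.5

def classify_tracks_and_detections(A_fm, A_bm, prev_boxes, velocities, dt):
    """Label previous-frame tracks as DT/FN and current detections as FP/NB.

    prev_boxes: (N, 7) array of (x, y, z, w, l, h, r_y) for previous-frame tracks.
    velocities: (N, 2) estimated (v_x, v_y) from the off-the-shelf detector.
    dt: time elapsed between the two frames.
    """
    dead = A_fm[:, -2] > TAU_DT                    # tracks to terminate
    occluded = A_fm[:, -1] > TAU_FN                # tracks to forward-propagate

    # Forward propagation: shift the 2D center by the estimated velocity to
    # create a surrogate detection for the occluded object.
    propagated = prev_boxes[occluded].copy()
    propagated[:, 0] += velocities[occluded, 0] * dt
    propagated[:, 1] += velocities[occluded, 1] * dt

    false_pos = A_bm[-1, :] > TAU_FP               # detections to discard
    newborn = A_bm[-2, :] > TAU_NB                 # detections eligible as newborn tracks
    return dead, propagated, false_pos, newborn
```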

III-B Learning to Predict Affinity Matrices

ShaSTA uses off-the-shelf 3D detections, as well as the pre-trained LiDAR backbone used to generate the detections. ShaSTA first learns bounding box and shape representations for the anchors so we can later use this information to learn a residual that encodes spatio-temporal and shape similarities not only between current frame detections and previous frame tracks, but also between current frame detections and FP and NB anchors or between previous frame tracks and FN and DT anchors. The residual is then used to predict affinity matrices that assign probabilities for matching detections to tracks, eliminating FPs, propagating FNs, initializing NBs, and terminating DTs.

III-B1 Learning Bounding Box Representations for Anchors

Our goal is to learn bounding box representations for each of the FP, FN, NB, and DT anchors in a given frame pair. The aim of the bounding box representation of each anchor is to capture the commonalities between detection bounding boxes that fall under each of these four categories.

We use off-the-shelf 3D detections, which provide us with 3D bounding box representations that include each box’s center coordinate, dimensions, and yaw rotation angle: $b:=(x,y,z,w,l,h,r_{y})$. We take the set of bounding boxes for the current frame $B_{t}\in\mathbb{R}^{N_{max}\times 7}$ and previous frame $B_{t-1}\in\mathbb{R}^{N_{max}\times 7}$. We fix the maximum number of bounding boxes we take per frame to be $N_{max}$; we zero pad the matrix if there are fewer than $N_{max}$ bounding boxes, and we sample the top $N_{max}$ detection bounding boxes if there are more than $N_{max}$ boxes.

Then, we create a learned bounding box representation at time step $t$ for FP and NB anchors using the current frame detections $B_{t}$ as follows, where each $\sigma$ represents an MLP:

b_{fp} = \sigma_{fp}^{b}(B_{t}) (3)
b_{nb} = \sigma_{nb}^{b}(B_{t}). (4)

Similarly, we find learned bounding box representations for FN and DT anchors using previous frame detections $B_{t-1}$:

b_{fn} = \sigma_{fn}^{b}(B_{t-1}) (5)
b_{dt} = \sigma_{dt}^{b}(B_{t-1}). (6)

For all four cases, we apply the absolute value to the MLP outputs corresponding to $(w,l,h)$, since dimensions need to be non-negative values. We then concatenate $b_{fp}\in\mathbb{R}^{7}$ and $b_{nb}\in\mathbb{R}^{7}$ to $B_{t-1}$ to get $\hat{B}_{t-1}\in\mathbb{R}^{(N_{max}+2)\times 7}$, as well as $b_{fn}\in\mathbb{R}^{7}$ and $b_{dt}\in\mathbb{R}^{7}$ to $B_{t}$ to get $\hat{B}_{t}\in\mathbb{R}^{(N_{max}+2)\times 7}$. The reason we append FP and NB anchors to $B_{t-1}$ is so that current frame detections can match to them, and the same logic applies for appending FN and DT anchors to $B_{t}$.
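One possible realization of Equations 3–6 is sketched below. The pooling over the detection dimension and the hidden layer size are assumptions, since the text only specifies that each anchor representation is produced by an MLP with non-negative dimension outputs.

```python
import torch
import torch.nn as nn

class AnchorBoxHead(nn.Module):
    """Produce one 7-dim anchor box (x, y, z, w, l, h, r_y) from a padded set of
    N_max detection boxes. The mean pooling over detections and the hidden size
    are illustrative assumptions."""

    def __init__(self, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(7, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 7))

    def forward(self, boxes):                        # boxes: (N_max, 7)
        b = self.mlp(boxes).mean(dim=0)              # aggregate over detections -> (7,)
        b = torch.cat([b[:3], b[3:6].abs(), b[6:]])  # force (w, l, h) >= 0
        return b

# One head per anchor type (Eqs. 3-6).
heads = {k: AnchorBoxHead() for k in ["fp", "nb", "fn", "dt"]}

B_t, B_tm1 = torch.randn(20, 7), torch.randn(20, 7)   # padded current / previous boxes
b_fp, b_nb = heads["fp"](B_t), heads["nb"](B_t)
b_fn, b_dt = heads["fn"](B_tm1), heads["dt"](B_tm1)

# Augmented box sets: FP/NB anchors join the previous frame, FN/DT the current frame.
B_hat_tm1 = torch.cat([B_tm1, b_fp[None], b_nb[None]], dim=0)   # (N_max+2, 7)
B_hat_t = torch.cat([B_t, b_fn[None], b_dt[None]], dim=0)       # (N_max+2, 7)
```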

III-B2 Learning Shape Representations for Anchors

In this section, we extract shape descriptors for each 3D detection using the pre-trained LiDAR backbone of our off-the-shelf detector to leverage spatio-temporal and shape information from the raw LiDAR data. In similar fashion to the previous section, our goal is to learn shape representations for the FP, FN, NB, and DT anchors using the shape descriptors from existing detections. These shape descriptors will be used to match detections to each other or one of the anchors if they share similar shape information.

We pass the current frame’s 4D LiDAR point cloud with an added temporal dimension into the frozen pre-trained VoxelNet [13] LiDAR backbone used in the CenterPoint [7] detector. This outputs a bird’s-eye-view (BEV) map for the current frame, where each voxelized region encodes a high-dimensional volumetric representation of the shape information in that region. Then, for each current frame detection bounding box, we extract a shape descriptor using bilinear interpolation from the BEV map as in [7] using the bounding box center, left face center, right face center, front face center, and back face center. We concatenate these 5 shape features to create the overall shape descriptor for each bounding box. Note that in BEV, the box center, bottom face center, and top face center all project to the same point, so we forgo extracting the latter two centers. We accumulate all of the shape features extracted for the current frame detections and call this cumulative shape descriptor $S_{t}\in\mathbb{R}^{N_{max}\times F}$. We repeat the same procedure for the previous frame’s LiDAR point cloud and bounding boxes to get the overall shape descriptor $S_{t-1}\in\mathbb{R}^{N_{max}\times F}$.
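The sketch below illustrates bilinear sampling of BEV features at the box center and the four face centers. The BEV resolution, the world-to-pixel mapping, and the coordinate convention for the rotated footprint are illustrative assumptions rather than the CenterPoint implementation.

```python
import numpy as np

def bilinear_sample(bev, u, v):
    """Bilinearly interpolate a (H, W, C) BEV feature map at continuous
    pixel coordinates (u, v) = (col, row)."""
    H, W, _ = bev.shape
    u = np.clip(u, 0, W - 1.001); v = np.clip(v, 0, H - 1.001)
    u0, v0 = int(u), int(v)
    du, dv = u - u0, v - v0
    return ((1 - du) * (1 - dv) * bev[v0, u0] + du * (1 - dv) * bev[v0, u0 + 1]
            + (1 - du) * dv * bev[v0 + 1, u0] + du * dv * bev[v0 + 1, u0 + 1])

def box_shape_descriptor(bev, box, meters_per_pixel, bev_origin):
    """Concatenate BEV features at the box center and the four face centers.

    box: (x, y, z, w, l, h, r_y); bev_origin: world (x, y) of pixel (0, 0).
    The world-to-pixel convention here is an illustrative assumption.
    """
    x, y, _, w, l, _, ry = box
    c, s = np.cos(ry), np.sin(ry)
    # Center plus face centers of the rotated footprint (front/back along length,
    # left/right along width).
    offsets = np.array([[0, 0], [l / 2, 0], [-l / 2, 0], [0, w / 2], [0, -w / 2]])
    feats = []
    for dx, dy in offsets:
        px = x + c * dx - s * dy
        py = y + s * dx + c * dy
        u = (px - bev_origin[0]) / meters_per_pixel
        v = (py - bev_origin[1]) / meters_per_pixel
        feats.append(bilinear_sample(bev, u, v))
    return np.concatenate(feats)          # length 5 * C

bev = np.random.randn(128, 128, 64)       # toy BEV map
desc = box_shape_descriptor(bev, (10.0, 5.0, 0.0, 2.0, 4.5, 1.6, 0.3), 0.5, (-32.0, -32.0))
```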

Using the extracted shape features for the current frame $S_{t}$, we obtain our learned shape descriptors for the FP and NB anchors with MLPs $\sigma_{fp}^{s}$ and $\sigma_{nb}^{s}$, respectively:

s_{fp} = \sigma_{fp}^{s}(S_{t}) (7)
s_{nb} = \sigma_{nb}^{s}(S_{t}). (8)

Likewise, we use our previous frame’s shape features $S_{t-1}$ to learn shape descriptors for FN and DT anchors as such:

s_{fn} = \sigma_{fn}^{s}(S_{t-1}) (9)
s_{dt} = \sigma_{dt}^{s}(S_{t-1}). (10)

Just as we did in the previous section, we concatenate $s_{fp}\in\mathbb{R}^{F}$ and $s_{nb}\in\mathbb{R}^{F}$ to $S_{t-1}$ to get the augmented shape descriptor $\hat{S}_{t-1}\in\mathbb{R}^{(N_{max}+2)\times F}$, as well as $s_{fn}\in\mathbb{R}^{F}$ and $s_{dt}\in\mathbb{R}^{F}$ to $S_{t}$ to get $\hat{S}_{t}\in\mathbb{R}^{(N_{max}+2)\times F}$.

III-B3 Residual Prediction

Using the augmented 3D bounding boxes and shape descriptors, we aim to find three residuals that measure the similarities between current and previous frames’ bounding box and shape representations. Since the boxes are low-dimensional abstractions, we aim to maximize the amount of spatio-temporal information we extract from them. Thus, we obtain two residuals for the augmented 3D bounding boxes called the VoxelNet and bounding box residuals, $R_{v}$ and $R_{b}$ respectively. We also learn one shape residual $R_{s}$ between the augmented shape descriptors. We obtain our overall residual $R$ that captures the spatio-temporal and shape similarities by taking a weighted sum of the three residuals, where the weights are also learned.

VoxelNet Residual. Our first residual $R_{v}\in\mathbb{R}^{(N_{max}+2)\times(N_{max}+2)}$ is a variation of the VoxelNet [13] bounding box residual. We adapt the VoxelNet residual for the end-goal of data association between $\hat{B}_{t-1}$ and $\hat{B}_{t}$ to capture their similarities based on 3D box center, dimension, and rotation.

We define each entry $(i,j)$ in $R_{v}$ as follows:

L_{c}(i,j) = \big\|c_{t-1}^{i}-c_{t}^{j}\big\|_{2}^{2} (11)
L_{d}(i,j) = \Big|\log\Big(\frac{w_{t-1}^{i}}{w_{t}^{j}}\Big)\Big| + \Big|\log\Big(\frac{l_{t-1}^{i}}{l_{t}^{j}}\Big)\Big| + \Big|\log\Big(\frac{h_{t-1}^{i}}{h_{t}^{j}}\Big)\Big| (12)
L_{r}(i,j)^{2} = \big(\cos(r_{y,t-1}^{i})-\cos(r_{y,t}^{j})\big)^{2} + \big(\sin(r_{y,t-1}^{i})-\sin(r_{y,t}^{j})\big)^{2} (13)
R_{v}(i,j) = \hat{L}_{c}(i,j) + L_{d}(i,j) + L_{r}(i,j) (14)

where $c:=(x,y,z)$ is the 3D box center. For the overall residual in Equation 14, $L_{c}(i,j)$ is normalized to make all the normalized elements $\hat{L}_{c}(i,j)$ dimensionless like the terms in $L_{d}$ and $L_{r}$.

Bounding Box Residual. The learned bounding box residual $R_{b}\in\mathbb{R}^{(N_{max}+2)\times(N_{max}+2)}$ is found by first expanding $\hat{B}_{t-1}$ and $\hat{B}_{t}$ and concatenating them to get a matrix $\hat{B}\in\mathbb{R}^{(N_{max}+2)\times(N_{max}+2)\times 6}$. Note that for this step we only use the box centers from $\hat{B}_{t-1}$ and $\hat{B}_{t}$. Then, we obtain our residual $R_{b}\in\mathbb{R}^{(N_{max}+2)\times(N_{max}+2)}$ with an MLP $\sigma_{r}^{b}$ as such:

R_{b} = \sigma_{r}^{b}(\hat{B}). (15)

Shape Residual. We use $\hat{S}_{t}$ and $\hat{S}_{t-1}$ to obtain the learned shape residual $R_{s}$ between the two frames. We expand $\hat{S}_{t-1}$ and $\hat{S}_{t}$ and concatenate them to get a matrix $\hat{S}\in\mathbb{R}^{(N_{max}+2)\times(N_{max}+2)\times 2F}$. Then, we obtain our residual $R_{s}\in\mathbb{R}^{(N_{max}+2)\times(N_{max}+2)}$ with an MLP:

R_{s} = \sigma_{r}^{s}(\hat{S}). (16)

Overall Residual. The overall residual $R$ is a weighted sum of the three residuals we previously obtained: $R_{v}$, $R_{b}$, and $R_{s}$. We concatenate $\hat{B}$ and $\hat{S}$ that we previously obtained in the early steps to create our input $\hat{W}\in\mathbb{R}^{(N_{max}+2)\times(N_{max}+2)\times(2F+6)}$. Then, we pass this through an MLP $\sigma_{\alpha}$ to get the learned weights $\boldsymbol{\alpha}\in\mathbb{R}^{(N_{max}+2)\times(N_{max}+2)\times 3}$ as follows:

\boldsymbol{\alpha} = \sigma_{\alpha}(\hat{W}). (17)

We split $\boldsymbol{\alpha}$ into matrices $\alpha_{v}$, $\alpha_{b}$, $\alpha_{s}\in\mathbb{R}^{(N_{max}+2)\times(N_{max}+2)}$, and obtain our overall residual $R\in\mathbb{R}^{(N_{max}+2)\times(N_{max}+2)}$:

R = \alpha_{v}\odot R_{v} + \alpha_{b}\odot R_{b} + \alpha_{s}\odot R_{s}. (18)

Note that $\odot$ is the Hadamard product.
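Below is a NumPy sketch of the VoxelNet residual (Eqs. 11–14) and the weighted combination (Eq. 18). The normalization of the center term by its maximum is an assumption (the text only states that it is made dimensionless), and the learned residuals $R_b$, $R_s$ and the weights $\boldsymbol{\alpha}$ are replaced by random stand-ins.

```python
import numpy as np

def voxelnet_residual(B_prev, B_cur):
    """Pairwise residual between augmented box sets (Eqs. 11-14).
    B_prev, B_cur: (M, 7) arrays of (x, y, z, w, l, h, r_y) with M = N_max + 2."""
    c_prev, c_cur = B_prev[:, None, :3], B_cur[None, :, :3]
    L_c = ((c_prev - c_cur) ** 2).sum(-1)                          # squared center distance
    L_d = np.abs(np.log(B_prev[:, None, 3:6] / B_cur[None, :, 3:6])).sum(-1)
    ry_prev, ry_cur = B_prev[:, None, 6], B_cur[None, :, 6]
    L_r = np.sqrt((np.cos(ry_prev) - np.cos(ry_cur)) ** 2
                  + (np.sin(ry_prev) - np.sin(ry_cur)) ** 2)
    L_c_hat = L_c / (L_c.max() + 1e-8)                             # assumed normalization
    return L_c_hat + L_d + L_r

def combine_residuals(R_v, R_b, R_s, alpha):
    """Eq. 18: element-wise weighted sum of the three residuals.
    alpha: (M, M, 3) learned weights from the MLP sigma_alpha."""
    return alpha[..., 0] * R_v + alpha[..., 1] * R_b + alpha[..., 2] * R_s

M = 22                                   # N_max + 2 with N_max = 20
B_prev = np.abs(np.random.randn(M, 7)) + 0.1
B_cur = np.abs(np.random.randn(M, 7)) + 0.1
R_v = voxelnet_residual(B_prev, B_cur)
R_b, R_s = np.random.randn(M, M), np.random.randn(M, M)   # stand-ins for the MLP residuals
alpha = np.random.rand(M, M, 3)
R = combine_residuals(R_v, R_b, R_s, alpha)
```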

III-B4 Affinity Matrix Estimation

Given our overall residual, we have information about the pairwise similarities between current frame detections and previous frame tracks, as well as each of the four augmented anchors. We want to use these spatio-temporal and shape relationships to learn probabilities for matching the detections and tracks to each other or the anchors. This will ultimately allow us to leverage global relationships for learning our matching probabilities, as opposed to using greedy local searches on residuals for matching [7, 2]. We apply an MLP to get our overall affinity matrix $A$:

A = \sigma_{aff}(R). (19)

III-C Log Affinity Loss

Inspired by [28, 29, 30], we train our model with the log affinity loss $\mathcal{L}_{la}$. We define the ground-truth affinity matrix $A_{gt}$, the estimated affinity matrix $A$, and the Hadamard product $\odot$ to get:

\mathcal{L}_{la} = \frac{\sum_{i}\sum_{j}\left(A_{gt}\odot-\log(A)\right)}{\sum_{i}\sum_{j}A_{gt}}. (20)

The affinity matrix prediction has corresponding ground-truth matrices $A_{gt,fm}$ and $A_{gt,bm}$ for forward matching and backward matching, respectively. In total, our overall loss function $\mathcal{L}$ is defined as follows:

\mathcal{L}_{fm} = \frac{\sum_{i}\sum_{j}\left(A_{gt,fm}\odot-\log(A_{fm})\right)}{\sum_{i}\sum_{j}A_{gt,fm}} (21)
\mathcal{L}_{bm} = \frac{\sum_{i}\sum_{j}\left(A_{gt,bm}\odot-\log(A_{bm})\right)}{\sum_{i}\sum_{j}A_{gt,bm}} (22)
\mathcal{L} = \frac{1}{2}\left(\mathcal{L}_{fm}+\mathcal{L}_{bm}\right). (23)
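A compact sketch of the loss in Equations 20–23 is given below; the small epsilon added for numerical stability is our own assumption.

```python
import torch

def log_affinity_loss(A_pred, A_gt, eps=1e-8):
    """Eqs. 20-22: ground-truth-weighted negative log of the predicted
    matching probabilities, normalized by the number of GT matches."""
    return (A_gt * -torch.log(A_pred + eps)).sum() / (A_gt.sum() + eps)

# A_fm, A_bm are row-/column-softmaxed probability matrices (Eqs. 1-2);
# A_gt_fm, A_gt_bm are the binary ground-truth matrices from Sec. IV-B.
A_fm = torch.softmax(torch.randn(20, 22), dim=1)
A_bm = torch.softmax(torch.randn(22, 20), dim=0)
A_gt_fm = torch.zeros_like(A_fm); A_gt_fm[0, 0] = 1.0
A_gt_bm = torch.zeros_like(A_bm); A_gt_bm[0, 0] = 1.0
loss = 0.5 * (log_affinity_loss(A_fm, A_gt_fm) + log_affinity_loss(A_bm, A_gt_bm))   # Eq. 23
```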

III-D Sequential Track Confidence Refinement

We propose a sequential track confidence refinement technique that is used to assess the quality of tracks in real-time. We first offer an intuition behind our sequential track confidence refinement technique before formalizing our approach in Equation 24.

As stated earlier, track confidence should accurately reflect a track’s quality relative to the quality of tracks for objects of the same class in a given time step. Thus, to obtain more accurate confidence scores to characterize our tracks, we want to leverage our affinity matrix estimation as follows: (1) If the affinity matrix indicates that the detection matched to our track has a high probability of being true-positive, we want to take a weighted average between the confidence of our matched detection and the existing confidence of our track. (2) If the affinity matrix indicates that the detection matched to our track almost got eliminated as an FP – only marginally missing the elimination threshold $\tau_{FP}$ – then we want to downscale the existing confidence for our track, because it likely should not be kept alive with this matched detection. This approach is captured in Equation 24.

c_{trk,i}^{(t)} \leftarrow \mathds{1}[P_{FP,i}^{(t)}<\beta_{1}]\beta_{2}c_{det,i}^{(t)} + (1-\beta_{2})c_{trk,i}^{(t-1)} (24)
s.t. 0\leq c_{det,i}^{(t)}\leq 1 \quad \forall i,t
0\leq c_{trk,i}^{(t)}\leq 1 \quad \forall i,t
0\leq\beta_{1}\leq 1
0\leq\beta_{2}\leq 1

According to Equation 24, for track ID $i$ at time step $t$, we have a confidence of $c_{trk,i}^{(t)}$. The affinity matrix gets leveraged in the indicator function through $P_{FP,i}^{(t)}$, which is the probability that the detection got matched with the false-positive anchor at time step $t$. We fix $\beta_{1}$ to be a value slightly less than $\tau_{FP}$ to account for detections that are very close to getting eliminated as FP. In other words, we view detections with $P_{FP,i}^{(t)}\in[\beta_{1},\tau_{FP}]$ as very uncertain detections, and we take this into account by reducing the overall track confidence for tracks matching to such detections. However, if the matched detection has a very high probability of being true-positive, then the track confidence $c_{trk,i}^{(t)}$ becomes a weighted average between $c_{det,i}^{(t)}$ and $c_{trk,i}^{(t-1)}$. Based on cross validation, we found that this refinement technique is not very sensitive to the values of $\beta_{1}$ and $\beta_{2}$. In fact, we set $\beta_{1}=0.5$ for all object classes, and $\beta_{2}=0.5$ for all object classes except for bicycle ($\beta_{2}=0.4$), bus ($\beta_{2}=0.7$), and trailer ($\beta_{2}=0.4$). For bicycles and trailers, we choose $\beta_{2}$ to be slightly lower because we want to give more weight to the accumulated track confidence; 3D detectors tend to perform poorly on these object types and thus skew towards lower detection confidence scores. For the converse reason, we give bus a higher $\beta_{2}$.

In the special case of a newborn track, the track confidence is only the first term of Equation 24, i.e.

c_{trk,i}^{(t)} \leftarrow \mathds{1}[P_{FP,i}^{(t)}<\beta_{1}]\beta_{2}c_{det,i}^{(t)}.

This works well because the detection confidences are also defined in relative terms. Thus, if all newly initialized track confidences are the detection confidences scaled with $\beta_{2}$, their relative confidence rankings will be preserved.
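The refinement rule of Equation 24, including the newborn special case, can be written in a few lines of code; the numeric example below uses the $\beta$ values reported above.

```python
def refine_track_confidence(c_track_prev, c_det, p_fp, beta1=0.5, beta2=0.5,
                            is_newborn=False):
    """Sequential track confidence update (Eq. 24).

    c_track_prev: previous track confidence (ignored for newborn tracks).
    c_det: confidence of the detection matched to the track at time t.
    p_fp: probability that this detection matched the FP anchor.
    """
    gate = 1.0 if p_fp < beta1 else 0.0      # indicator function
    if is_newborn:
        return gate * beta2 * c_det          # newborn case: first term only
    return gate * beta2 * c_det + (1.0 - beta2) * c_track_prev

# A detection that nearly got eliminated as FP drags the track confidence down.
print(refine_track_confidence(0.8, 0.9, p_fp=0.65))   # 0.4 (gated by the indicator)
print(refine_track_confidence(0.8, 0.9, p_fp=0.10))   # 0.85 (weighted average)
```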

IV Data Preparation

We evaluate our tracking performance on the nuScenes benchmark [15]. The nuScenes dataset contains 1000 scenes, where each scene is 20 seconds long and is generated with a 2Hz frame-rate. For the tracking task, nuScenes has 7 classes: bicycle, bus, car, motorcycle, pedestrian, trailer, and truck. In this section, we provide an overview of the LiDAR preprocessing and ground-truth affinity matrix formation we complete to deploy ShaSTA for the nuScenes dataset.

TABLE I: nuScenes Test Results with evaluation in terms of overall and individual AMOTA and AMOTP. Our method achieves the highest AMOTA, while effectively balancing AMOTP to obtain the second lowest AMOTP, only 0.005m shy of the first spot. We only compare against LiDAR-only methods that use CenterPoint detections to ensure a fair comparison. The blue entry is our LiDAR-only method ShaSTA.
Metric Method Detector Input Overall Bicycle Bus Car Motorcycle Pedestrian Trailer Truck
AMOTA \uparrow ShaSTA (Ours) CenterPoint LiDAR 69.6 41.0 73.3 83.8 72.7 81.0 70.4 65.0
NEBP [6] CenterPoint LiDAR 68.3 44.7 70.8 83.5 69.8 80.2 69.0 59.8
OGR3MOT [8] CenterPoint LiDAR 65.6 38.0 71.1 81.6 64.0 78.7 67.1 59.0
CenterPoint [7] CenterPoint LiDAR 65.0 33.1 71.5 81.8 58.7 78.0 69.3 62.5
AMOTP \downarrow ShaSTA (Ours) CenterPoint LiDAR 0.540 0.674 0.629 0.383 0.504 0.369 0.742 0.650
NEBP [6] CenterPoint LiDAR 0.624 1.026 0.677 0.391 0.557 0.393 0.781 0.541
OGR3MOT [8] CenterPoint LiDAR 0.620 0.899 0.675 0.395 0.615 0.383 0.790 0.585
CenterPoint [7] CenterPoint LiDAR 0.535 0.561 0.636 0.391 0.519 0.409 0.729 0.5
TABLE II: nuScenes Test Results with evaluation in terms of overall TP, FP, and FN. We only compare against LiDAR-only methods that use CenterPoint detections to ensure a fair comparison. The blue entry is our LiDAR-only method ShaSTA.
Method Detector Input TP \uparrow FP \downarrow FN \downarrow
ShaSTA (Ours) CenterPoint LiDAR 97,799 16,746 21,293
NEBP [6] CenterPoint LiDAR 97,367 16,773 21,971
OGR3MOT [8] CenterPoint LiDAR 95,264 17,877 24,013
CenterPoint [7] CenterPoint LiDAR 94,324 17,355 24,557

IV-A LiDAR Point Cloud Preprocessing

Since CenterPoint [7] provides detections at 2Hz while the LiDAR has a sampling rate of 20Hz, we use the nuScenes [15] feature for accumulating multiple LiDAR sweeps with motion compensation. This provides us with a denser 4D point cloud with an added temporal dimension, and makes the LiDAR input match the detection sampling rate. We use 10 LiDAR sweeps to generate the point cloud.

IV-B Ground-Truth Affinity Matrix

We find the true-positive, false-positive, and false-negative detections using the matching algorithm from the nuScenes [15] dev-kit. For each frame, we define a ground-truth affinity matrix $A_{t}\in\mathbb{R}^{(N_{max}+2)\times(N_{max}+2)}$ between the current frame and previous frame off-the-shelf detections. For all true-positive previous and current frame detections, $B_{t-1}^{tp}$ and $B_{t}^{tp}$, we say they are matched if the ground-truth boxes they each correspond to have the same ground-truth tracking IDs. If there is a match between true-positive boxes $b_{t-1,i}^{tp}\in B_{t-1}^{tp}$ and $b_{t,j}^{tp}\in B_{t}^{tp}$, then $A_{t}(i,j)=1$. For all false-positive previous frame detections and true-positive previous frame detections whose tracking IDs do not appear in any of the current frame ground-truth tracking IDs, we call them dead tracks $B_{t-1}^{dt}$. For all $b_{t-1,i}^{dt}\in B_{t-1}^{dt}$, we set $A_{t}(i,N_{max}+1)=1$. For all true-positive previous frame detections that do not have a match in the current frame but whose tracking IDs do appear in the current frame ground-truth tracking IDs, we call them false-negatives $B_{t-1}^{fn}$. For all $b_{t-1,i}^{fn}\in B_{t-1}^{fn}$, we set $A_{t}(i,N_{max}+2)=1$. Additionally, for all true-positive current frame detections whose tracking IDs do not appear in the previous frame ground-truth tracking IDs, we call them newborn tracks $B_{t}^{nb}$, and we set $A_{t}(N_{max}+1,j)=1$ for all $b_{t,j}^{nb}\in B_{t}^{nb}$. Finally, we define the set of false-positive current frame detections as $B_{t}^{fp}$, and we set $A_{t}(N_{max}+2,j)=1$ for all $b_{t,j}^{fp}\in B_{t}^{fp}$. Unmatched $(i,j)$ entries are set to 0.
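A sketch of this labeling procedure is given below, assuming detections have already been classified as TP/FP and matched to ground-truth tracking IDs by the dev-kit. The input data structures are our own simplification, and padded detection slots are simply left as zero rows.

```python
import numpy as np

def build_gt_affinity(prev_ids, cur_ids, prev_gt_ids, cur_gt_ids, n_max):
    """Assemble the (N_max+2) x (N_max+2) ground-truth affinity matrix (Sec. IV-B).

    prev_ids / cur_ids: per-detection GT tracking ID (for true positives) or None
        (for false positives); lists may be shorter than n_max (unused slots stay zero).
    prev_gt_ids / cur_gt_ids: sets of all GT tracking IDs annotated in each frame.
    Column n_max is the DT anchor, column n_max+1 the FN anchor; row n_max is the
    NB anchor, row n_max+1 the FP anchor (0-indexed).
    """
    A = np.zeros((n_max + 2, n_max + 2))
    cur_index = {tid: j for j, tid in enumerate(cur_ids) if tid is not None}

    for i, tid in enumerate(prev_ids):
        if tid is None or tid not in cur_gt_ids:
            A[i, n_max] = 1.0                    # dead track (FP or object left the scene)
        elif tid in cur_index:
            A[i, cur_index[tid]] = 1.0           # true-positive match
        else:
            A[i, n_max + 1] = 1.0                # false negative: object present but undetected

    for j, tid in enumerate(cur_ids):
        if tid is None:
            A[n_max + 1, j] = 1.0                # false-positive detection
        elif tid not in prev_gt_ids:
            A[n_max, j] = 1.0                    # newborn track
    return A
```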

V Experimental Results

This section provides an overview of evaluation metrics, training details, comparisons against state-of-the-art 3D multi-object tracking techniques, and ablation studies to analyze our technique.

V-A Evaluation Metrics

The primary metrics for evaluating nuScenes are Average Multi-Object Tracking Accuracy (AMOTA) and Average Multi-Object Tracking Precision (AMOTP) [31]. AMOTA measures the ability of a tracker to track objects correctly, while AMOTP measures the quality of the estimated tracks.

AMOTA averages the recall-weighted MOTA, known as MOTAR, at $n$ evenly spaced recall thresholds. Note that MOTA is a metric that penalizes FPs, FNs, and ID switches (IDS). GT indicates the number of ground-truth tracklets.

AMOTA = \frac{1}{n-1}\sum_{r\in\{\frac{1}{n-1},\frac{2}{n-1},\dots,1\}} MOTAR (25)
MOTAR = \max\left(0,\;1-\frac{IDS_{r}+FP_{r}+FN_{r}-(1-r)\cdot GT}{r\cdot GT}\right) (26)

Additionally, AMOTP averages MOTP over $n$ evenly spaced recall thresholds. Note that MOTP measures the misalignment between ground-truth and predicted bounding boxes. For the definition of AMOTP below, $d_{i,t}$ indicates the distance error for track $i$ at time step $t$ and $TP_{t}$ indicates the number of true-positive (TP) matches at time step $t$.

AMOTP = \frac{1}{n-1}\sum_{r\in\{\frac{1}{n-1},\frac{2}{n-1},\dots,1\}}\frac{\sum_{i,t}d_{i,t}}{\sum_{t}TP_{t}} (27)
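To make Equations 25–27 concrete, the following toy computation evaluates MOTAR and AMOTA at two recall thresholds; the per-threshold error counts are made up purely for illustration.

```python
def motar(ids, fp, fn, gt, r):
    """Recall-normalized MOTA at recall threshold r (Eq. 26)."""
    return max(0.0, 1.0 - (ids + fp + fn - (1.0 - r) * gt) / (r * gt))

def amota(per_threshold_errors, gt, n):
    """Average MOTAR over n-1 evenly spaced recall thresholds (Eq. 25).
    per_threshold_errors: dict mapping recall r -> (IDS_r, FP_r, FN_r)."""
    recalls = [k / (n - 1) for k in range(1, n)]
    return sum(motar(*per_threshold_errors[r], gt, r) for r in recalls) / (n - 1)

# Toy example with n = 3, i.e. recall thresholds {0.5, 1.0}, and 100 GT tracklets.
errors = {0.5: (1, 5, 50), 1.0: (3, 20, 0)}
print(amota(errors, gt=100, n=3))   # (0.88 + 0.77) / 2 = 0.825
```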

The nuScenes benchmark [15] defines TP tracks as estimated tracks that are within an L2 distance of 2m from ground-truth tracks. This allows for a margin of error that can become unsafe in certain driving scenarios where small distances can make a dramatic difference in safety. Thus, even though the official nuScenes leaderboard uses AMOTA solely to rank trackers, we emphasize AMOTP as well due to its safety implications.

We also present secondary metrics, including the total number of TP, FP, and FN tracks to better analyze the effectiveness of our approach for downstream decision-making tasks.

TABLE III: Ablation on Confidence Refinement. We assess the effect of confidence refinement on tracking performance for each object. The version of ShaSTA without refinement indicates that we apply the CenterPoint detection confidence scores to our tracks. Evaluation is on nuScenes validation set in terms of overall AMOTA. The blue entry is our full method.
Metric Method Overall Bicycle Bus Car Motorcycle Pedestrian Trailer Truck
AMOTA \uparrow Without Refinement 70.2 49.4 85.2 85.3 70.1 80.9 50.1 70.1
With Refinement 72.8 58.8 85.6 85.6 74.5 81.4 53.1 70.3
Difference +2.6 +9.4 +0.4 +0.3 +4.4 +0.5 +3.0 +0.2

V-B Training Specifications

We train a different network for each object category following [3, 4]. The value of $N_{max}$ depends on the object type since some classes are more common than others. We choose the value for $N_{max}$ based on the per-object detection frequency we gather in the training set. The smallest is $N_{max}=20$ and the largest is $N_{max}=90$. We use the Adam optimizer with a constant learning rate of $10^{-4}$ and L2 weight regularization of $10^{-2}$. We use the pre-trained VoxelNet weights from CenterPoint [7] and freeze them for our LiDAR backbone so that the shape information correlates with our detection bounding boxes from [7]. Additionally, there is a significant class imbalance between FPs and TPs since state-of-the-art detectors function in low precision-high recall regimes. Thus, we downsample the number of FP detections during training. Finally, we fix the thresholds $\tau_{fp}$ for FP elimination, $\tau_{fn}$ for FN propagation, $\tau_{nb}$ for NB initialization, and $\tau_{dt}$ for DT termination with cross-validation. We found that $\tau_{fp}=0.7$, $\tau_{fn}=0.5$, $\tau_{nb}=0.5$, and $\tau_{dt}=0.5$ work best for all classes.
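A minimal sketch of this training configuration is shown below; the model object is a placeholder, and only the optimizer settings and thresholds come from the text.

```python
import torch

# Per-class N_max is chosen from detection frequency, between the stated extremes.
N_MAX_MIN, N_MAX_MAX = 20, 90
TAU = {"fp": 0.7, "fn": 0.5, "nb": 0.5, "dt": 0.5}         # lifecycle thresholds

model = torch.nn.Linear(7, 7)                              # stand-in for ShaSTA's MLP heads
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-2)
```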

V-C Comparison with State-of-the-Art LiDAR Tracking

Our nuScenes test results can be seen in Tables I and II. Because previous works have shown that the standard AMOTA tracking metric heavily depends on the upstream 3D detections [3, 17, 18], all tracking methods reported use CenterPoint detections to ensure a fair comparison (at the time of this writing, CenterPoint detections achieve the highest mAP of all publicly-available 3D detections; higher-scoring detections appear on the nuScenes leaderboard, but are proprietary). For more information on this evaluation protocol and the AMOTA metric, please refer to Section V-A.

The CenterPoint tracking algorithm [7] is our baseline, and we include recent, peer-reviewed works cited in Section II that tackle data association, track lifecycle management, FPs, and FNs in the LiDAR-only tracking domain. Most notably, before ShaSTA was developed, NEBP [6] was the #1 LiDAR-only tracker using CenterPoint detections.

In Table I, we can see that ShaSTA balances correct tracking (AMOTA) with precise, high-quality tracking (AMOTP). Most notably, ShaSTA not only achieves the highest overall AMOTA, but also does so for 6 out of 7 object classes. The only class in which we are not first in AMOTA is the bicycle class, with NEBP taking the lead. However, it is interesting to note that our precision on bicycles is much better than NEBP’s. NEBP averages over 1 meter in distance from the ground-truth tracks on bicycles, which can be very dangerous in real-world traffic scenes. Compared to NEBP, ShaSTA has a 34% improvement in AMOTP while only having an 8% reduction in AMOTA. Thus, the trade-off for bicycles is clearly in ShaSTA’s favor. For the overall AMOTP, we achieve the second best value, trailing CenterPoint by only 0.005 meters, which is a favorable trade-off when compared to our significant boost in overall AMOTA.

Examining further metric breakdowns in Table II, it becomes clear that ShaSTA is effective in de-cluttering environments suffering from FPs and compensating for missed detections that lead to missing TPs and rising FNs. The two most observed classes, which constitute over 80% of the test set, are cars and pedestrians, and we get highly competitive results in both of those categories, which tend to suffer the most from crowded scenes. Thus, our FP counts reflect our effectiveness at dealing with these sorts of scenes. Another interesting interplay between these metrics is that oftentimes decreasing FPs comes at the risk of increasing FNs and decreasing TPs. Thus, it is quite challenging to tackle the FP, FN, and TP problems simultaneously, yet our approach does so most effectively out of all existing tracking methods.

TABLE IV: Ablation on Shape Descriptors. We assess the effect of the number of 3D bounding box points used to extract shape descriptors. Evaluation is on nuScenes validation set in terms of overall AMOTA. The blue entry is our full method.
Box Center Used Face Centers Used AMOTA \uparrow
68.8
70.2
72.8
TABLE V: Ablation on Residuals. We assess the effect of isolating each residual on the tracking algorithm’s performance. Evaluation is on nuScenes validation set in terms of overall AMOTA. The blue entry is our full method.
Residual Used AMOTA \uparrow
VoxelNet Only 68.9
Bounding Box Only 69.1
Shape Only 69.5
All (ShaSTA) 72.8
TABLE VI: Ablation on Learned Affinity Matrix. We assess the effect of each augmentation from our learned affinity matrix on track formation. Evaluation is on nuScenes validation set in terms of overall AMOTA. The blue entry is our full method.
Affinity Augmentation Method AMOTA \uparrow
False-Positive Elimination Only 72.7
False-Negative Propagation Only 70.4
Newborn Initialization Only 72.6
Dead Track Termination Only 70.4
All (ShaSTA) 72.8

V-D Ablation Studies

We complete four ablative experiments on the nuScenes validation set to measure the impacts of (1) the sequential track confidence refinement, (2) the number of shape descriptors from the BEV map, (3) each of the residuals used to make the overall residual prediction, and (4) including FP elimination, FN propagation, NB initialization, and DT termination in the learned affinity matrix. In total, our ablative studies demonstrate that all of the components in ShaSTA contribute to the overall success of this tracking framework.

(a) CenterPoint: Scene 270
(b) ShaSTA: Scene 270
Figure 3: Visualization of tracking results for the car object class projected onto the front camera. We show three consecutive frames (29–31) from Scene 270 of the validation set for CenterPoint [7] and ShaSTA. CenterPoint has 3 false-positive tracks in frame 29, and one of the false-positives persists into frames 30 and 31. However, ShaSTA perfectly tracks the car in this sequence without any false-positives or false-negatives. Note that both methods start with the CenterPoint detections as input.

First, we analyze the effect of the sequential track confidence refinement as seen in Table III. When we do not use our refinement technique, we apply the CenterPoint detection confidence scores to our tracks as in past works [1, 6, 2, 3, 7, 8]. We note a significant boost in the overall AMOTA for the validation set when using the proposed refinement method. All object classes benefit, but the ones that improve the least are large objects like trucks, cars, and buses. This makes sense since these types of objects tend to produce the best LiDAR point clouds and are less likely to suffer from full occlusion, allowing the detection confidence to be a good indicator for tracking. Conversely, smaller objects like bicycles and motorcycles, which tend to suffer the most from LiDAR point cloud sparsity due to increased distance from the ego vehicle and occlusion in traffic scenes, benefit the most from this technique. In these cases, the pure detection confidences can drop dramatically in scenarios such as full occlusion of these small objects, but with our sequential refinement we take into account the existing confidence that has accumulated for a track before such scenarios occur.

Additionally, in Table IV, we can see that as we add more bounding box face centers for extracting shape descriptors, the accuracy of our model monotonically increases. This result supports our argument that leveraging raw LiDAR point clouds to get shape information is integral to the success of our tracking framework. Most notably, when we have less shape information, objects within the same class are more likely to appear similar, leading to more ambiguity in distinguishing between TPs and FPs. Such ambiguity makes the track confidence refinement less effective, specifically in the indicator function in Equation 24.

Furthermore, our ablation on the individual residuals in Table V demonstrates that even though we can achieve competitive tracking accuracy solely with spatio-temporal (VoxelNet or Bounding Box only) or shape (Shape only) information, we achieve the best results when leveraging both types of information.

Lastly, in Table VI, we see that the most impactful augmentations are FP elimination and NB initialization. Eliminating FPs is expected to boost our accuracy since off-the-shelf detections suffer from such high rates of FP detections. NB initialization can be seen as a second line of defense against FP detections, because unmatched detections that do not have a probability of at least $\tau_{nb}$ of being NBs do not get initialized as tracks and are discarded. Finally, even though FNs and dead tracks have not been as detrimental to tracking algorithms in the past, the FN propagation and DT termination augmentations alone still give us competitive accuracies, showing that every augmentation contributes to effective tracking.

V-E Qualitative Results

As indicated in Table I, there are significant increases in test accuracy over the CenterPoint [7] baseline, particularly for objects like cars and pedestrians that are often found in crowded scenes and are thus more susceptible to high rates of FP tracks.

At high recall thresholds, FP tracks tend to appear more frequently, but the AMOTA metric does not strongly penalize FP tracks at these thresholds, even though such FPs have a significant effect on real-world deployment and downstream decision-making tasks. To showcase ShaSTA’s ability to eliminate FPs effectively, we present tracking results in Figure 3. We only show the tracking results on the car object class from validation scene 270. (a) shows the CenterPoint [7] baseline, which has FP tracks in all three frames. On the other hand, (b) demonstrates ShaSTA’s ability to declutter the scene and accurately track the car, while ignoring background information that should not be tracked. Typically, past works [4] use non-maximum suppression heuristics such as intersection over union (IoU) to remove FP tracks, but in more complex settings like scene 270, such heuristics would fail because none of the FPs overlap with other track predictions. As a result, ShaSTA proves to be especially effective at using shape and spatio-temporal information from LiDAR point clouds to localize and track the actual objects of interest in the scene.

VI Conclusion

We present ShaSTA, a 3D multi-object tracking framework that leverages shape and spatio-temporal information from LiDAR sensors to learn affinities for robust data association, track lifecycle management, FP elimination, FN propagation, and track confidence refinement. Our approach effectively addresses false-positive and missed detections arising from cluttered scenes and occlusion, yielding state-of-the-art performance. Since ShaSTA offers a flexible framework, we can easily extend it in future work to fuse various sensor modalities. Other interesting future directions include the use of our learned affinities in prediction and planning.

VII Acknowledgment

Toyota Research Institute provided funds to support this work.

References

  • [1] X. Weng, J. Wang, D. Held, and K. Kitani, “3D Multi-Object Tracking: A Baseline and New Evaluation Metrics,” IROS, 2020.
  • [2] H.-k. Chiu, A. Prioletti, J. Li, and J. Bohg, “Probabilistic 3d multi-object tracking for autonomous driving,” arXiv preprint arXiv:2001.05673, 2020.
  • [3] H.-k. Chiu, J. Li, R. Ambruş, and J. Bohg, “Probabilistic 3d multi-modal, multi-object tracking for autonomous driving,” in 2021 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2021, pp. 14 227–14 233.
  • [4] C. Stearns, D. Rempe, J. Li, R. Ambrus, S. Zakharov, V. Guizilini, Y. Yang, and L. J. Guibas, “Spot: Spatiotemporal modeling for 3d object tracking,” arXiv preprint arXiv:2207.05856, 2022.
  • [5] F. Meyer, T. Kropfreiter, J. L. Williams, R. Lau, F. Hlawatsch, P. Braca, and M. Z. Win, “Message passing algorithms for scalable multitarget tracking,” Proceedings of the IEEE, vol. 106, no. 2, pp. 221–259, 2018.
  • [6] M. Liang and F. Meyer, “Neural enhanced belief propagation for data association in multiobject tracking,” in 2022 25th International Conference on Information Fusion (FUSION).   IEEE, 2022, pp. 1–7.
  • [7] T. Yin, X. Zhou, and P. Krahenbuhl, “Center-based 3d object detection and tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 11 784–11 793.
  • [8] J.-N. Zaech, A. Liniger, D. Dai, M. Danelljan, and L. Van Gool, “Learnable online graph representations for 3d multi-object tracking,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 5103–5110, 2022.
  • [9] A. Kim, A. Ošep, and L. Leal-Taixé, “Eagermot: 3d multi-object tracking via sensor fusion,” in 2021 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2021, pp. 11 315–11 321.
  • [10] X. Wang, C. Fu, Z. Li, Y. Lai, and J. He, “Deepfusionmot: A 3d multi-object tracking framework based on camera-lidar fusion with deep association,” arXiv preprint arXiv:2202.12100, 2022.
  • [11] X. Weng, Y. Wang, Y. Man, and K. Kitani, “Gnn3dmot: Graph neural network for 3d multi-object tracking with multi-feature learning,” in CVPR, 2020.
  • [12] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
  • [13] Y. Zhou and O. Tuzel, “Voxelnet: End-to-end learning for point cloud based 3d object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4490–4499.
  • [14] S. Chen, X. Wang, T. Cheng, Q. Zhang, C. Huang, and W. Liu, “Polar parametrization for vision-based surround-view 3d detection,” arXiv preprint arXiv:2206.10965, 2022.
  • [15] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11 621–11 631.
  • [16] M. Liang, B. Yang, W. Zeng, Y. Chen, R. Hu, S. Casas, and R. Urtasun, “Pnpnet: End-to-end perception and prediction with tracking in the loop,” in CVPR, 2020.
  • [17] J. Luiten, A. Osep, P. Dendorfer, P. Torr, A. Geiger, L. Leal-Taixé, and B. Leibe, “Hota: A higher order metric for evaluating multi-object tracking,” International journal of computer vision, vol. 129, pp. 548–578, 2021.
  • [18] L. Wang, X. Zhang, W. Qin, X. Li, L. Yang, Z. Li, L. Zhu, H. Wang, J. Li, and H. Liu, “Camo-mot: Combined appearance-motion optimization for 3d multi-object tracking with camera-lidar fusion,” arXiv preprint arXiv:2209.02540, 2022.
  • [19] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
  • [20] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” in arXiv preprint arXiv:1904.07850, 2019.
  • [21] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 652–660.
  • [22] M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner, “Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks,” in 2017 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2017, pp. 1355–1361.
  • [23] B. Yang, W. Luo, and R. Urtasun, “Pixor: Real-time 3d object detection from point clouds,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2018, pp. 7652–7660.
  • [24] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, “Pointpillars: Fast encoders for object detection from point clouds,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 12 697–12 705.
  • [25] H. Choi, M. Kang, Y. Kwon, and S.-e. Yoon, “An objectness score for accurate and fast detection during navigation,” arXiv preprint arXiv:1909.05626, 2019.
  • [26] C. R. Qi, O. Litany, K. He, and L. J. Guibas, “Deep hough voting for 3d object detection in point clouds,” in proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9277–9286.
  • [27] A. Simonelli, S. R. Bulo, L. Porzi, M. López-Antequera, and P. Kontschieder, “Disentangling monocular 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1991–1999.
  • [28] X. Guo and Y. Yuan, “Joint class-affinity loss correction for robust medical image segmentation with noisy labels,” arXiv preprint arXiv:2206.07994, 2022.
  • [29] M. Hayat, S. Khan, W. Zamir, J. Shen, and L. Shao, “Max-margin class imbalanced learning with gaussian affinity,” arXiv preprint arXiv:1901.07711, 2019.
  • [30] S. Sun, N. Akhtar, H. Song, A. Mian, and M. Shah, “Deep affinity network for multiple object tracking,” IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 1, pp. 104–119, 2019.
  • [31] X. Weng and K. Kitani, “A baseline for 3d multi-object tracking,” arXiv preprint arXiv:1907.03961, vol. 1, no. 2, p. 6, 2019.