BOTT: Box Only Transformer Tracker for 3D Object Tracking
Abstract
Tracking 3D objects is an important task in autonomous driving. Classical Kalman Filtering based methods are still the most popular solutions. However, these methods require handcrafted motion models and cannot benefit from growing amounts of data. In this paper, the Box Only Transformer Tracker (BOTT) is proposed to learn to link 3D boxes of the same object across frames by taking all the 3D boxes in a time window as input. Specifically, transformer self-attention is applied to exchange information between all the boxes to learn globally informative box embeddings. The similarity between these learned embeddings can then be used to link the boxes of the same object. BOTT can be used for both online and offline tracking seamlessly. Its simplicity enables us to significantly reduce the engineering effort required by traditional Kalman Filtering based methods. Experiments show BOTT achieves competitive performance on the two largest 3D MOT benchmarks: 69.9 and 66.7 AMOTA on the nuScenes validation and test splits, respectively, and 56.45 and 59.57 MOTA L2 on the Waymo Open Dataset validation and test splits, respectively. This work suggests that tracking 3D objects by learning features directly from 3D boxes with transformers is a simple yet effective approach.
1 Introduction
Autonomous driving is an open challenge that has attracted tremendous attention over the past decade. One of the most essential tasks for autonomous vehicles is to perceive 3D objects accurately, which includes both detection and tracking. Encouraging progress has been made in 3D object detection, owing to the emergence of large public multi-modality datasets [4, 18] and advanced 3D object detection methods [2, 9, 10, 23]. On the other hand, tracking-by-detection methods for 3D Multi-Object Tracking (MOT) [7, 13, 20, 24] remain competitive and popular due to their ability to benefit from powerful 3D object detectors. Among them, Kalman Filtering (KF) based trackers [13, 20] are dominant, as their kinematic models are naturally designed for tracking 3D motion.
Despite their competitiveness, KF-based trackers have two main disadvantages. First, a series of Kalman filters must be defined to cover various types of motion kinematics, including static, constant velocity, constant acceleration, constant angular velocity, and more sophisticated non-constant ones. Moreover, Kalman filters require specific parameters for each object category, such as the means and variances of measurements and noises. Hence, KF-based trackers require considerable engineering effort to tune these parameters for decent performance. Second, KF-based trackers cannot make use of modern large datasets [4, 18] to boost performance.
One approach to data-driven 3D MOT is to perform joint detection and tracking in a single stage, as in SimTrack [11] and CenterPoint [23]. While these methods can simultaneously detect and track 3D objects in point clouds with a single model, tracking is normally limited to consecutive frames to fit the architecture of the lidar-based 3D object detector. However, there is a fundamental conflict between the two tasks: 3D detection focuses on the instantaneous spatial localization of objects from only a few point cloud frames, while 3D tracking requires a much longer spatial-temporal memory. In practice, 4D spatial-temporal learning over significantly more point clouds remains an open challenge due to computational complexity and hardware limitations.

An alternative research direction is to learn to track the bounding boxes of the 3D objects directly [6, 7, 24]. This approach offers a straightforward replacement for KF-based trackers in the existing tracking-by-detection paradigm. Machine learning methods consuming only the geometric properties of bounding boxes inherit the merits of KF-based trackers and can benefit from growing data amounts. However, 3D box-based learning methods face two key challenges. Firstly, each input frame contains a varying number of unordered boxes, making it difficult to establish a consistent identity for each object. Secondly, unlike image appearance features, 3D box geometric features lack spatial-temporal consistency for each object identity. Nevertheless, humans can easily connect boxes from the same object when viewing bird's-eye-view boxes sequentially, by interpreting the global box distribution and the spatial-temporal context of each individual box. In other words, the box features, i.e. position, size, orientation, object type, and their temporal-spatial distributions, should be sufficient for tracking. The key is to find a suitable tool to learn such information for each box. PolarMOT [7] is an inspiring work in this regard, which uses a graph neural network (GNN) to iteratively learn box features from spatial-temporal local boxes. In contrast, we propose a novel approach, called the Box Only Transformer Tracker (BOTT), that uses attention [19] to globally learn per-box embeddings from all multi-class boxes with a single model, as shown in Figure 1. BOTT is well-suited for 3D box tracking, as attention mechanisms have repeatedly demonstrated their effectiveness in communicating information temporally and spatially between inputs of varying length [2, 12, 25].
In summary, the proposed BOTT is a simple multi-class multi-object 3D box-only tracker. Attentive box features are learned globally to encode each box and its spatial-temporal relations to the other boxes. The linking (similarity) scores between box features are then used to link boxes. Our main contributions are as follows:
- We propose BOTT, a simple self-attention based tracker that consumes only 3D bounding boxes. Its simplicity and effectiveness pave the path for further work on tracking 3D boxes with transformers, and it can be extended to other applications such as multi-modal 3D box tracking.
- We provide complete online and offline tracking algorithms for multi-class 3D tracking with a single model under the BOTT framework.
- We conduct extensive ablation studies to examine the key designs that enable strong performance and validate BOTT's generalization ability across datasets and input frequencies.
2 Related Work
This section first reviews the 3D MOT algorithms based on the tracking-by-detection paradigm, then reviews the transformer-based trackers, and lastly reviews online and offline MOT.
2.1 3D MOT
Under the tracking-by-detection framework, AB3DMOT [20] serves as a baseline, using a simple KF tracking framework. Many methods have been proposed to improve tracking performance within the same KF-based framework, such as ProbTrack [5] and SimpleTrack [13]. Their major difference lies in the association metrics: AB3DMOT used 3D Intersection over Union (IoU); ProbTrack used the Mahalanobis distance; and SimpleTrack used 3D generalized IoU (GIoU), with a second association stage for better handling of low-score objects.
Lately, more learning-based tracking algorithms have been reported. Several of them use GNNs [3, 7, 21, 24], as graphs are a natural representation of the MOT problem, where detected objects are encoded as nodes and their spatial-temporal relations are represented by edges. GNN3DMOT [21] used GNNs to estimate an affinity matrix and solved the association with the Hungarian algorithm. OGR3MOT [24] solved 3D MOT with GNNs in an end-to-end manner, focusing on data association and the classification of active tracklets. Batch3DMOT [3] presented a multi-modal GNN framework for offline 3D MOT on multi-class tracking graphs, introducing nearest-neighborhood attention to allow information exchange across disconnected graph components. Closest to our approach, PolarMOT [7] also explored the impact of geometric relationships between objects using GNNs with solely 3D boxes. The difference is that in PolarMOT, mutual object interactions within local regions are learned implicitly through iterative message-passing steps, whereas we use self-attention to capture global context information in one shot.
2.2 Transformer Tracker
Over the past few years, transformer networks have gained great momentum for their ability to effectively process sequence data. Encouraging 2D MOT performance has been demonstrated by transformer trackers using image appearance features [6, 12, 15, 17, 25, 26], mainly because transformers can handle long-term dependencies and are robust to occlusions and complex scenarios. Among these trackers, SODA [6] and GTR [25] are the most related to our work. SODA proposed an attention measurement encoding to compute track embeddings from objects' appearance features and reason about the spatial-temporal dependencies between all objects. GTR fed the objects' appearance features into the transformer encoder, additionally took trajectory queries as the decoder input, and produced association scores between each query and object. Different from their work, ours learns spatial-temporal context from 3D bounding boxes alone, without appearance features, using simple self-attention encoders.
2.3 Offline and Online Tracking
Recently, offline auto-labeling methods [14, 22] have drawn great attention in the autonomous driving industry, as they can scale up data annotation drastically. In an auto-labeling pipeline, future information can be introduced to improve tracking performance. It is not straightforward for KF-based trackers [5, 13, 20] to leverage future information, as they are designed to operate recursively and rely solely on the current state and observations. In contrast, BOTT provides a convenient solution for both online and offline multi-class tracking.
3 Box Only Transformer Tracker
In this section, the proposed BOTT framework will be presented, including the BOTT network and box tracking algorithms.
3.1 Overview
In a scene with $T$ frames, frame $t$ contains $N_t$ detected 3D boxes $B_t = \{b_t^1, \dots, b_t^{N_t}\}$. Each box contains raw features $(c, s, \theta, t, p)$, where $c$ is the center, $s$ the size, $\theta$ the yaw, $t$ the time, and $p$ the classification scores over the object categories. A sliding window is defined as the set of all boxes from $K$ consecutive frames: $W_t = B_{t-K+1} \cup \dots \cup B_t$. For simplicity, $W$ denotes a sliding window whose latest frame is left implicit, and $N$ denotes the total number of boxes within $W$.
As shown in Figure 1, BOTT takes sliding windows as input and estimates pairwise box linking scores with the BOTT network. A box tracking module then connects the boxes into tracks using the linking scores. The BOTT network contains three components: 1) single-box feature encoding; 2) inter-box encoding with self-attention; and 3) linking score estimation.
3.2 Single-Box Feature Encoding
This module aims to learn high-level per-box features from the raw geometric features. The raw center values may vary greatly between sliding windows because they are expressed in shared global coordinates. To reduce this variance, the center positions of all boxes in $W$ are normalized by subtracting the minimal center values over all boxes, e.g. $\tilde{x}_i = x_i - \min_j x_j$ for the $x$ coordinate. The time feature of a box is encoded as the offset between the box frame and the center frame of $W$. The box heading angle $\theta$ is encoded as $(\sin\theta, \cos\theta)$. In summary, the input feature of each box consists of the normalized center, the size, the encoded heading, the encoded time, and the classification scores. Each box feature vector is fed into a shared Multi-Layer Perceptron (MLP) with several latent layers and a final output layer of dimension $D$, yielding one feature embedding for each of the $N$ boxes.
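As a concrete illustration, the PyTorch sketch below builds the per-box input vector and applies a shared MLP. The feature layout (assuming the 7 nuScenes classes) and the layer widths are our own illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SingleBoxEncoder(nn.Module):
    """Shared MLP lifting raw per-box features to a D-dimensional embedding."""
    def __init__(self, in_dim: int = 16, hidden: int = 128, out_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, box_feats: torch.Tensor) -> torch.Tensor:
        # box_feats: (N, in_dim) raw features of all boxes in one sliding window
        return self.mlp(box_feats)

def encode_raw_features(center, size, yaw, frame_idx, cls_scores, center_frame):
    """Build the per-box input vector: min-normalized center, size,
    (sin, cos) of the heading, time offset to the window's center frame,
    and classification scores (3 + 3 + 2 + 1 + 7 = 16 dims for 7 classes)."""
    center = center - center.min(dim=0, keepdim=True).values   # reduce variance
    heading = torch.stack([yaw.sin(), yaw.cos()], dim=-1)
    dt = (frame_idx - center_frame).float().unsqueeze(-1)
    return torch.cat([center, size, heading, dt, cls_scores], dim=-1)
```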
3.3 Inter-Box Encoding with Self-Attention
After per-box feature encoding, the output embeddings are fed to a self-attention module to encode inter-box relationships. Specifically, a stack of identical transformer encoder blocks [19] is applied sequentially to exchange information between all input box embeddings. Each encoder block consists of a multi-head attention network followed by a feed-forward network. After self-attention, the output box embeddings keep their size $D$ for all $N$ boxes.
Notably, the self-attention in BOTT is class-agnostic, i.e. each box learns to gather information from all the other boxes in the sliding window. As shown in Figure 5.1, the moving car in sub-figures (a) and (b) also receives attention from nearby pedestrians, and the pedestrian in sub-figures (c) and (d) also receives attention from nearby cars. Self-attention is expected to learn robust box representations by encoding the global spatial-temporal box distribution into each box. Another advantage of class-agnostic self-attention is that it enables BOTT to handle multi-class object tracking with a single model, which greatly reduces deployment complexity in practice.
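A minimal sketch of such an inter-box encoder, built from standard PyTorch transformer layers; the head count and feed-forward width are placeholders rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class InterBoxEncoder(nn.Module):
    """Stack of transformer encoder blocks exchanging information between
    all box embeddings in a sliding window, in a class-agnostic way."""
    def __init__(self, d_model: int = 128, nhead: int = 4,
                 dim_ff: int = 256, num_layers: int = 3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=dim_ff,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, emb: torch.Tensor, pad_mask: torch.Tensor = None):
        # emb: (B, N, d_model) padded box embeddings for a batch of windows
        # pad_mask: (B, N) bool, True marks padded (invalid) boxes
        return self.encoder(emb, src_key_padding_mask=pad_mask)
```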
3.4 Linking Score Estimation
Tracked boxes from the same object share a similar spatial-temporal context within a sliding window. With the learned box embeddings, each embedding is first normalized to unit length. A simple dot product is then used to compute inter-box linking scores, resulting in an $N \times N$ linking score matrix $S$ (illustrated in Figure 1). By convention, the linking score is mapped to the range $[0, 1]$, e.g. via $S_{ij} = (1 + \hat{e}_i^{\top}\hat{e}_j)/2$ for unit embeddings $\hat{e}_i$ and $\hat{e}_j$. A linking score of 1 indicates that two boxes belong to the same object, and 0 means the opposite. In this way, tracking is converted into a binary classification problem over box links.
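The following sketch computes the score matrix, assuming the $(1 + \cos)/2$ mapping given above as one plausible normalization.

```python
import torch
import torch.nn.functional as F

def linking_scores(emb: torch.Tensor) -> torch.Tensor:
    """Pairwise linking score matrix from the attended box embeddings.

    emb: (N, D) embeddings of all boxes in a sliding window.
    Returns S: (N, N) scores in [0, 1]; S[i, j] close to 1 means boxes i and j
    are predicted to belong to the same object.
    """
    e = F.normalize(emb, dim=-1)      # normalize each embedding to unit length
    cos = e @ e.t()                   # dot product: cosine similarity in [-1, 1]
    return 0.5 * (cos + 1.0)          # map to [0, 1]
```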
3.5 Loss
Let $G$ denote the ground truth linking score matrix corresponding to $S$; the binary cross entropy between them is used as the loss. During training, to prevent easy cases from overwhelming the loss, a binary mask $M$ is constructed to ignore losses from: 1) inter-class box links; 2) box links within the same frame; 3) box links between two false positive boxes (detected boxes without associated ground truth boxes); and 4) box links whose center distances exceed the expected maximum displacement of the object over the given time duration.
Given a batch $\mathcal{B}$ of sliding windows, the linking loss is computed as:

$$\mathcal{L} = -\frac{1}{|\mathcal{B}|} \sum_{W \in \mathcal{B}} \frac{1}{\sum_{i,j} M_{ij}} \sum_{i,j} M_{ij} \big[ w_p\, G_{ij} \log S_{ij} + (1 - G_{ij}) \log (1 - S_{ij}) \big] \qquad (1)$$

where $w_p$ is the positive sample weight and $M$ is the binary loss mask defined above.
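A sketch of this masked, positive-weighted binary cross entropy, following Eq. (1) as reconstructed above; the default value of the positive weight is illustrative.

```python
import torch

def linking_loss(S: torch.Tensor, G: torch.Tensor, M: torch.Tensor,
                 w_p: float = 10.0, eps: float = 1e-6) -> torch.Tensor:
    """Masked, positive-weighted binary cross entropy over box links (Eq. 1).

    S, G, M: (N, N) predicted scores, ground-truth links (0/1), and the binary
    mask that zeroes out ignored links (inter-class pairs, same-frame pairs,
    FP-FP pairs, physically impossible pairs). The value of w_p is illustrative.
    """
    S = S.clamp(eps, 1.0 - eps)
    bce = -(w_p * G * S.log() + (1.0 - G) * (1.0 - S).log())
    return (bce * M).sum() / M.sum().clamp(min=1.0)
```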
3.6 Box Tracking with BOTT
Linking scores are utilized to create tracks in the box tracking module. Depending on whether future data is available, BOTT can perform either online or offline tracking.
3.6.1 Online Tracking
Figure 2 illustrates online box tracking under the BOTT framework. At time $t$, the set of history tracks is denoted as $\mathcal{T}_t = \{T_1, \dots, T_M\}$, where a track is defined as a list of time-ordered 3D boxes. At frame $t$, the latest sliding window $W_t$ is fed into the BOTT network to generate linking scores between all the boxes in $W_t$. Since the goal of online tracking is to connect the new detections of frame $t$ to the existing tracks $\mathcal{T}_t$, we only use the linking scores between the detection boxes of frame $t$ and the boxes of the history tracks within $W_t$. Moreover, linking scores are set to zero for box links under two conditions: 1) the two boxes are from different object categories; 2) the center distance violates the physical constraints discussed in Section 3.5. Next, the affinity score $A(T_m, d_n)$ between each detection $d_n$ and each track $T_m$ is calculated as the maximum over all box pairs:
$$A(T_m, d_n) = \max_{b_i \in T_m \cap W_t} S(b_i, d_n) \qquad (2)$$
Instead of propagating tracks with a motion model and associating only in the latest frame, BOTT directly conducts Hungarian association between all detections and tracks using a cost matrix derived from the affinity scores $A$. This enables BOTT to handle multi-class tracking with a single model.
BOTT online tracking uses a simple track management strategy to control the birth, update, and termination of tracks. Each associated detection is appended as the new tail box of its associated track, sharing the same track identity. Unmatched detections give birth to new tracks with unconfirmed status. An unconfirmed track is promoted to confirmed status after it has accumulated a minimum number of boxes, and a confirmed track is terminated after a fixed number of seconds without new detections. After track management, the tail boxes of the confirmed tracks in frame $t$ are published as the latest tracking result.
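A sketch of the online association step under these definitions, using the Hungarian solver from SciPy; the function signature, data layout, and association gate are illustrative assumptions rather than the paper's exact interface.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(det_ids, tracks, S, min_score: float = 0.5):
    """Match the latest-frame detections to existing tracks via Eq. (2).

    det_ids: window indices of the boxes detected in the latest frame.
    tracks: dict {track_id: [window indices of that track's boxes]}.
    S: (N, N) linking score matrix, already gated by class and physical
    constraints. min_score is an illustrative association gate.
    Returns a list of (track_id, detection window index) matches.
    """
    track_ids = list(tracks.keys())
    affinity = np.zeros((len(track_ids), len(det_ids)))
    for m, tid in enumerate(track_ids):
        # Affinity = max linking score over all (track box, detection) pairs.
        affinity[m] = S[np.ix_(tracks[tid], det_ids)].max(axis=0)
    rows, cols = linear_sum_assignment(-affinity)   # maximize total affinity
    return [(track_ids[m], det_ids[n]) for m, n in zip(rows, cols)
            if affinity[m, n] >= min_score]
```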

3.6.2 Offline Tracking
BOTT can also be used to perform offline tracking. Thanks to the effectiveness of BOTT, a simple greedy approach is enough to achieve promising performance. In the offline setting, all sliding windows are first constructed with a stride of one and fed to the BOTT network to generate linking scores. Let $S_{ij}^{W}$ denote the estimated linking score between boxes $b_i$ and $b_j$ in sliding window $W$. The linking score between $b_i$ and $b_j$ across the scene is computed by aggregating over the set $\mathcal{W}_{ij}$ of sliding windows that contain both boxes:

$$\bar{S}_{ij} = \frac{1}{|\mathcal{W}_{ij}|} \sum_{W \in \mathcal{W}_{ij}} S_{ij}^{W} \qquad (3)$$
An optimal threshold is then applied to remove links with low $\bar{S}_{ij}$. To remove redundant links, non-maximum suppression is performed according to $\bar{S}_{ij}$: for example, if a link between $b_i$ and $b_j$ from two frames has been selected, all the other links to $b_i$ or $b_j$ between the same two frames are pruned. To further prevent false links, links that violate the physical constraints used in online tracking or that connect boxes from different classes are pruned. Finally, box interpolation is performed to fill any gaps between the linked boxes.
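A minimal sketch of this greedy offline linking. The cross-window averaging follows Eq. (3) as reconstructed above and is therefore an assumption, as are the threshold value and the simplified suppression (at most one link per box, rather than per frame pair).

```python
from collections import defaultdict

def offline_links(window_scores, threshold: float = 0.5):
    """Aggregate window-level linking scores and greedily select links.

    window_scores: an iterable of dicts {(box_id_i, box_id_j): score}, one per
    sliding window, keyed by scene-global box ids from different frames.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for scores in window_scores:
        for pair, s in scores.items():
            sums[pair] += s
            counts[pair] += 1
    agg = {p: sums[p] / counts[p] for p in sums}     # Eq. (3), averaged

    links, used = [], set()
    for (i, j), s in sorted(agg.items(), key=lambda kv: -kv[1]):
        if s < threshold:                            # remove low-score links
            break
        if i in used or j in used:                   # simplified suppression
            continue
        links.append((i, j, s))
        used.update((i, j))
    return links
```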
4 Experimental Setup
4.1 Datasets and Metrics
We evaluate BOTT on the two largest benchmarks for 3D MOT: nuScenes [4] and Waymo Open Dataset (WOD) [18].
nuScenes
consists of 1000 driving scenes of approximately 20 seconds each, with 700, 150, and 150 scenes used for training, validation, and test, respectively. Lidar scans and ground truth (GT) 3D box annotations are provided at 20Hz and 2Hz, respectively. We report the overall Average Multi-Object Tracking Accuracy (AMOTA) [20], recall, and identity switches (IDS) across all 7 tracking categories, i.e. car, pedestrian, bicycle, bus, motorcycle, trailer, and truck, as well as the AMOTA for each individual category. Overall AMOTA is the primary metric for the nuScenes benchmark.
Waymo Open Dataset
contains 1150 driving sequences of 20 seconds each, with 798, 202, and 150 sequences for training, validation, and test, respectively. Both point clouds and GT 3D boxes are provided at 10Hz. We report MOTA for both the L1 and L2 difficulty levels, the mismatch ratio [18] for objects in the L2 difficulty level, and MOTA for all 3 tracking categories (vehicle, pedestrian, and cyclist) in the L2 difficulty level. MOTA in the L2 difficulty level is the primary metric for the WOD 3D MOT benchmark.
4.2 Track Database Generation
CenterPoint [23] is deployed on the training, validation, and test sets of both nuScenes and WOD to obtain the detections used to generate a track database. Non-Maximum Suppression (NMS) is applied to the detections to remove overlapping boxes, and boxes with detection scores below a threshold are filtered out. To generate the database, detections are associated with the GT boxes provided by nuScenes or WOD. The track IDs of the GT boxes are assigned to the associated detection boxes, and unmatched detection boxes are considered false positives. For both nuScenes and WOD, the track database is generated at 10Hz; the 2Hz GT provided by nuScenes is interpolated to 10Hz. Each scene in the track database is divided into overlapping $K$-frame sliding windows with a stride of one. This work uses $K = 16$.
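A sketch of the sliding-window construction over one scene of the track database; the per-scene data layout assumed here is hypothetical.

```python
def build_sliding_windows(scene_frames, K: int = 16, stride: int = 1):
    """Split one scene of the track database into overlapping K-frame windows.

    scene_frames: list where scene_frames[t] holds the detection boxes of
    frame t (a hypothetical per-scene layout of the track database).
    Returns a list of windows, each a flat list of (frame index, box) pairs.
    """
    windows = []
    for start in range(0, len(scene_frames) - K + 1, stride):
        window = [(t, box) for t in range(start, start + K)
                  for box in scene_frames[t]]
        windows.append(window)
    return windows
```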
4.3 Implementation Details
4.3.1 Network Details
The MLP for single-box encoding consists of four Linear-ReLU blocks. Three identical encoder blocks are stacked for inter-box encoding, each block including 1) a multi-head attention layer following a LayerNorm [1]; and 2) a feed-forward net with two Linear-ReLU blocks, following a LayerNorm. The output box embeddings keep the size $D$.
4.3.2 Training Procedure
The link distribution within sliding windows is highly imbalanced, with the vast majority of links being negative. Hard negative sample mining is therefore applied: all positive links are kept (say there are $P$ of them), together with at most a fixed multiple of $P$ negative links with the largest linking score errors. The positive sample weight $w_p$ in Eq. (1) is set to a fixed constant. BOTT is trained with the Adam optimizer [8] for 50 epochs with a batch size of 4 sliding windows, using the 1cycle learning rate policy [16]. As each sliding window contains a varying number of boxes, zero padding is applied so that every window in a batch matches the largest box count, and a binary mask prevents attention to the padded boxes in the attention layers.
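The padding and masking step could look like the sketch below; the function name and data layout are illustrative.

```python
import torch

def pad_windows(window_feats):
    """Zero-pad variable-length windows in a batch and build the attention mask.

    window_feats: list of (N_i, F) tensors, one per sliding window.
    Returns padded features of shape (B, N_max, F) and a (B, N_max) bool mask
    where True marks padded positions (to be blocked in attention, e.g. via
    the src_key_padding_mask of a transformer encoder).
    """
    n_max = max(f.shape[0] for f in window_feats)
    feat_dim = window_feats[0].shape[1]
    padded = torch.zeros(len(window_feats), n_max, feat_dim)
    pad_mask = torch.ones(len(window_feats), n_max, dtype=torch.bool)
    for b, f in enumerate(window_feats):
        padded[b, : f.shape[0]] = f
        pad_mask[b, : f.shape[0]] = False
    return padded, pad_mask
```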
4.3.3 Data Augmentation
First, we randomly drop some tracks in the sliding windows to reduce the number of boxes to at most 3000 and to mimic occlusions and false negatives. Next, all boxes in a sliding window are shifted from global coordinates to local coordinates centered at the middle of all box centers. Finally, we perform two sets of global augmentations on all boxes in a sliding window: a random flip along the x-axis and/or y-axis, followed by a global rotation with the yaw angle drawn uniformly from $[0°, 90°]$.
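A sketch of the global flip and rotation augmentation; the 90-degree maximum mirrors the appendix ablation, while the independent 0.5 flip probabilities and the function interface are assumptions.

```python
import math
import random
import torch

def augment_window(centers: torch.Tensor, yaws: torch.Tensor,
                   max_yaw: float = math.pi / 2):
    """Global flip and rotation applied to all boxes of one sliding window.

    centers: (N, 3) box centers already shifted to window-local coordinates;
    yaws: (N,) heading angles.
    """
    if random.random() < 0.5:                 # flip along the x-axis: y -> -y
        centers[:, 1] = -centers[:, 1]
        yaws = -yaws
    if random.random() < 0.5:                 # flip along the y-axis: x -> -x
        centers[:, 0] = -centers[:, 0]
        yaws = math.pi - yaws
    angle = random.uniform(0.0, max_yaw)      # global rotation about the z-axis
    c, s = math.cos(angle), math.sin(angle)
    rot = torch.tensor([[c, -s], [s, c]], dtype=centers.dtype)
    centers[:, :2] = centers[:, :2] @ rot.t()
    yaws = yaws + angle
    return centers, yaws
```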
4.4 Public Benchmark
Tracking performance heavily relies on detection quality. For a fair comparison, we compare BOTT with published trackers that are also based on the commonly used CenterPoint detections [23]. Among them, AB3DMOT [20] and SimpleTrack [13] belong to classic motion trackers, while PolarMOT [7], OGR3MOT [24], CenterPoint [23], and Batch3DMOT [3] are learning-based trackers. Table 1 shows the 3D MOT results on the nuScenes validation and test sets. On the validation set, we compare performance for both online and offline tracking. Our online BOTT achieves a 2.53 AMOTA improvement over the learning-based tracker PolarMOT, and slightly better performance than the advanced classic motion tracker SimpleTrack [13]. Our offline BOTT achieves state-of-the-art performance among Lidar box-based offline tracking algorithms, with 71.38 AMOTA. Note that the tracking performance of Batch3DMOT [3] is reported with box features only. For the test set, only online tracking results are compared. Similarly, our online BOTT achieves better performance than the learning-based box trackers and comparable results to the classic motion trackers.
Table 2 shows the 3D MOT results on WOD. Our online BOTT achieves better results than the learning-based tracker CenterPoint [23], and comparable performance to the advanced classic motion tracker SimpleTrack [13].
Split | Method | Modality | AMOTA | IDS | Recall | class-specific AMOTA: car | ped | bicycle | bus | motor | trailer | truck
---|---|---|---|---|---|---|---|---|---|---|---|---
val set | AB3DMOT* [20] | Box3D | 57.8 | 1275 | - | - | - | - | - | - | - | - |
ProbTrack*[5] | Box3D | 62.4 | 1098 | - | 73.5 | 75.5 | 27.2 | 74.1 | 50.6 | 33.7 | 58.0 | |
SimpleTrack* [13] | Box3D | 69.57 | 403 | 73.61 | 83.9 | 80.67 | 50.3 | 79.52 | 74.19 | 52.7 | 65.69 | |
CenterPoint [23] | Lidar | 66.75 | 616 | 70.5 | 83.83 | 77.02 | 47.56 | 84.83 | 60.22 | 45.73 | 68.02 | |
Online PolarMOT [7] | Box3D | 67.27 | 439 | 72.46 | 81.26 | 78.79 | 49.38 | 82.76 | 67.19 | 45.8 | 65.7 | |
Online BOTT | Box3D | 69.91 | 438 | 72.08 | 83.67 | 80.02 | 50.34 | 83.66 | 71.53 | 51.34 | 68.85 | |
Offline Batch3DMOT [3] | Box3D | 70.6 | 758 | 72.0 | 83.4 | 81.1 | 54.8 | 83.5 | 73.3 | 49.6 | 68.5 | |
Offline PolarMOT [7] | Box3D | 71.14 | 213 | 75.14 | 85.83 | 81.7 | 54.1 | 87.36 | 72.32 | 48.67 | 68.03 | |
Offline BOTT | Box3D | 71.38 | 310 | 73.3 | 84.3 | 82.1 | 53.8 | 85.4 | 74.2 | 51.2 | 68.7 | |
test set | AB3DMOT* [20] | Box3D | 15.1 | 9027 | 27.6 | 27.8 | 14.1 | 0 | 40.8 | 8.1 | 13.6 | 1.3 |
ProbTrack* [5] | Box3D | 55.0 | 950 | 76.8 | 71.9 | 74.5 | 25.5 | 64.1 | 48.1 | 49.5 | 51.3 | |
SimpleTrack* [13] | Box3D | 66.8 | 575 | 70.3 | 82.3 | 79.6 | 40.7 | 71.5 | 67.4 | 67.3 | 58.7 | |
CenterPoint [23] | Lidar | 63.8 | 760 | 67.5 | 82.9 | 76.7 | 32.1 | 71.1 | 59.1 | 65.1 | 59.9 | |
PolarMOT [7] | Box3D | 66.4 | 242 | 70.2 | 85.3 | 80.6 | 34.9 | 70.8 | 65.6 | 67.3 | 60.2 | |
OGR3MOT [24] | Box3D | 65.6 | 288 | 69.2 | 81.6 | 78.7 | 38.0 | 71.1 | 64.0 | 67.1 | 59.0 | |
Online BOTT | Box3D | 66.7 | 743 | 67.7 | 83.1 | 75.9 | 32.6 | 74.8 | 65.8 | 70.1 | 64.3 |
*Non-ML methods
5 Results
Split | Method | Modality | MOTA(L1) | MOTA(L2) | Mismatch | class-specific MOTA(L2): vehicle | pedestrian | cyclist
---|---|---|---|---|---|---|---|---
val set | AB3DMOT* [20] | Box3D | - | - | - | 40.1 | 37.7 | - |
ProbTrack* [5] | Box3D | - | - | - | 54.06 | 48.10 | - | |
SimpleTrack* [13] | Box3D | 59.44 | 56.92 | 0.36 | 56.12 | 57.76 | 56.88 | |
CenterPoint [23] | Lidar | 58.35 | 55.81 | 0.74 | 55.05 | 54.94 | 57.44 | |
Online BOTT | Box3D | 58.98 | 56.50 | 0.35 | 55.11 | 56.48 | 57.78 | |
Offline BOTT | Box3D | 59.67 | 57.14 | 0.12 | 55.17 | 57.05 | 59.21 | |
test set | AB3DMOT* [20] | Box3D | - | - | - | 57.73 | 53.80 | - |
ProbTrack* [5] | Box3D | 49.16 | 47.65 | 1.01 | 49.32 | 44.38 | 25.29 | |
SimpleTrack* [13] | Box3D | 61.82 | 60.18 | 0.38 | 60.3 | 60.13 | 60.12 | |
CenterPoint [23] | Lidar | 60.31 | 58.67 | 0.72 | 59.38 | 56.64 | 60.0 | |
Online BOTT | Box3D | 61.20 | 59.57 | 0.31 | 59.49 | 58.82 | 60.41 |
*Non-ML methods
5.1 Qualitative Analysis
Figure 3 illustrates qualitative results of BOTT accumulated over 40 frames of a scene from the nuScenes validation set. Figure 3 (a) and (b) show the GT and the raw detections from CenterPoint [23], respectively, while (c) and (d) show the tracking results of CenterPoint [23] and our BOTT, respectively.
Figure 5.1 showcases a few examples of attentive boxes from the same frame that relate to the circled reference boxes. Boxes that are closer to the reference box normally have a stronger attention impact than boxes far away; nevertheless, faraway boxes also contribute to the reference box due to the global self-attention. Note that the figure only shows attentive boxes from the same frame for illustration convenience; in fact, all boxes in the sliding window contribute to the reference box, yielding a robust box embedding.

[Figure: attention visualizations. (a), (b): a moving car at two frames; (c), (d): a static pedestrian at two frames.]
6 Ablation Studies
We conduct ablation studies on nuScenes val set to analyse the impact of attention and physical constraints, and study the generalization capability of BOTT.
6.1 Attention mechanism
We ablate the necessity of the attention mechanism for learning context information. First, we remove the attention layers and rely solely on the raw box features (without inter-box attention) to establish links between all boxes. We also conduct experiments with different numbers of encoder layers. The ablation results on the nuScenes validation set are presented in Table 3. Without attention, the performance is worse than our best setting in Table 1, with a 1.91 AMOTA drop. The best performance is achieved with 3 encoder layers, and adding further attention layers to the encoder does not improve performance.
# encoder layers | AMOTA | IDS | Recall
---|---|---|---|
0 | 68.0 | 734 | 69.9 |
1 | 68.76 | 520 | 70.4 |
3 | 69.91 | 438 | 72.1 |
6 | 69.76 | 455 | 70.4 |
6.2 Physical Constraints
During deployment, physical constraints are added to prevent faraway boxes from being linked together: 1) a maximal velocity constraint, where the center distance of two boxes cannot exceed the product of the time difference and the category-specific maximal velocity; and 2) a static position constraint, where static boxes (with an absolute speed below a small threshold according to the CenterPoint [23] velocity estimates) are prohibited from linking to boxes more than 2 meters away. Table 4 compares the tracking results on the nuScenes val set with different constraints; a sketch of this gating follows the table. The results show that removing interference from impossible links benefits tracking performance. This agrees with the observation in PolarMOT [7] that constructing a sparse graph using boxes within a certain range improves tracking performance.
Constraints | AMOTA | IDS | Recall |
---|---|---|---|
none | 64.84 | 1485 | 68.3 |
velocity | 67.4 | 927 | 71.3 |
position | 67.1 | 1245 | 71.7 |
pos. + vel. | 69.91 | 438 | 72.1 |
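A sketch of how such gating could be applied as a multiplicative mask on the linking score matrix; the static-speed threshold, the function signature, and the data layout are illustrative assumptions (the 2 m static gate follows the text above).

```python
import numpy as np

def constraint_mask(centers, times, speeds, classes, max_speed,
                    static_speed: float = 0.5, static_dist: float = 2.0):
    """Mask out linking scores for physically implausible box pairs.

    centers: (N, 2) BEV centers; times: (N,) timestamps in seconds;
    speeds: (N,) detector-estimated speeds; classes: (N,) class ids;
    max_speed: dict {class id: maximal speed}.
    Returns an (N, N) float mask to multiply into the linking score matrix.
    """
    dist = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    dt = np.abs(times[:, None] - times[None, :])
    vmax = np.array([max_speed[c] for c in classes])
    limit = np.minimum(vmax[:, None], vmax[None, :]) * dt   # max displacement
    mask = (dist <= limit).astype(float)                    # velocity constraint
    static = speeds < static_speed
    either_static = static[:, None] | static[None, :]       # static position gate
    mask *= np.where(either_static, (dist <= static_dist).astype(float), 1.0)
    return mask
```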
6.3 Generalization Studies
Training | class (nuScenes) | AMOTA | IDS | Recall | class (WOD) | MOTA(L1) | MOTA(L2) | Mismatch (L2)
---|---|---|---|---|---|---|---|---
nuScenes | car | 83.1 | 181 | 82.2 | vehicle | 58.2 | 54.68 | 0.53 |
WOD | 83.2 | 222 | 84.5 | 58.62 | 55.08 | 0.14 | ||
nuScenes | ped | 78.2 | 360 | 82.6 | ped | 59.83 | 55.92 | 1.23 |
WOD | 77.2 | 308 | 76.3 | 60.35 | 56.42 | 0.73 | ||
nuScenes | bicycle | 49.8 | 1 | 47.7 | cyclist | 57.02 | 56.95 | 0.67 |
WOD | 48.1 | 1 | 48.3 | 57.55 | 57.49 | 0.38 |
Cross-dataset Generalization: This study shows how the BOTT online tracker generalizes across datasets. The nuScenes and WOD benchmarks have different input frequencies, numbers of classes, and slightly different annotation instructions. For example, vehicle in WOD includes any object that can be recognized as a vehicle, including motorcycles; also, parked bicycles are not labelled in WOD but are labelled in nuScenes. To mitigate these differences, we apply the following processing steps: 1) the nuScenes model is trained with 3 classes by merging (car, bus, trailer, truck, motorcycle) into vehicle and mapping bicycle to cyclist; 2) both nuScenes and WOD use 10Hz data for training and validation; and 3) only car, pedestrian, and bicycle are evaluated on the 7-class nuScenes validation set, as the merged vehicle category cannot be evaluated in the nuScenes benchmark. Table 5 reports the class-specific generalization results of the BOTT online tracker between nuScenes and WOD. BOTT shows very promising generalization capability between nuScenes and WOD for all categories: the model trained on a different (source) dataset performs only slightly worse than the model trained on the target dataset.
Frequency | K | AMOTA | IDS | Recall |
---|---|---|---|---|
2Hz | 4 | 68.7 | 992 | 73.2 |
5Hz | 8 | 68.6 | 617 | 71.5 |
10Hz* | 16 | 69.9 | 438 | 72.1 |
20Hz | 32 | 68.8 | 529 | 71.3 |
*Model training frequency
Input Frequency Generalization: This ablation assesses whether BOTT is sensitive to the input frequency. CenterPoint [23] is deployed to generate 20Hz detections for the nuScenes validation set, from which a corresponding 20Hz track database is generated following Section 4.2. Validation track databases at 10Hz, 5Hz, and 2Hz are then generated by keeping every 2nd, 4th, and 10th frame of the 20Hz database, respectively. To test generalization against input frequency, a BOTT model trained on the 10Hz track database with sliding window size $K = 16$ is tested on the validation track databases at 20Hz, 10Hz, 5Hz, and 2Hz, with $K$ adjusted accordingly to 32, 16, 8, and 4. The results reported in Table 6 are encouraging: the shared BOTT model achieves fairly good performance across the various input frequencies and window sizes. In practice, this gives the BOTT tracker higher deployment flexibility with respect to the input data.
7 Conclusion
We propose BOTT, a novel machine-learning-based 3D MOT method that relies only on 3D bounding boxes. BOTT leverages a transformer encoder with global self-attention to learn box features enriched with spatial-temporal context from all the boxes across a time window, leading to a significant performance boost. Notably, BOTT does not rely on appearance information from raw sensor data, which leads to better generalization ability. BOTT can be easily configured as an online or offline tracker, and our experiments demonstrate that it achieves state-of-the-art performance among learning-based trackers on nuScenes and WOD.
References
- [1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- [2] Xuyang Bai, Zeyu Hu, Xinge Zhu, Qingqiu Huang, Yilun Chen, Hongbo Fu, and Chiew-Lan Tai. Transfusion: Robust lidar-camera fusion for 3d object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1090–1099, 2022.
- [3] Martin Büchner and Abhinav Valada. 3d multi-object tracking using graph neural networks with cross-edge modality attention. IEEE Robotics and Automation Letters, 7(4):9707–9714, 2022.
- [4] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.
- [5] Hsu-kuang Chiu, Antonio Prioletti, Jie Li, and Jeannette Bohg. Probabilistic 3d multi-object tracking for autonomous driving. arXiv preprint arXiv:2001.05673, 2020.
- [6] Wei-Chih Hung, Henrik Kretzschmar, Tsung-Yi Lin, Yuning Chai, Ruichi Yu, Ming-Hsuan Yang, and Dragomir Anguelov. Soda: Multi-object tracking with soft data association. arXiv preprint arXiv:2008.07725, 2020.
- [7] Aleksandr Kim, Guillem Brasó, Aljoša Ošep, and Laura Leal-Taixé. Polarmot: How far can geometric relations take us in 3d multi-object tracking? In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXII, pages 41–58. Springer, 2022.
- [8] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [9] Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12697–12705, 2019.
- [10] Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela Rus, and Song Han. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. arXiv preprint arXiv:2205.13542, 2022.
- [11] Chenxu Luo, Xiaodong Yang, and Alan Yuille. Exploring simple 3d multi-object tracking for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10488–10497, 2021.
- [12] Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, and Christoph Feichtenhofer. Trackformer: Multi-object tracking with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8844–8854, 2022.
- [13] Ziqi Pang, Zhichao Li, and Naiyan Wang. Simpletrack: Understanding and rethinking 3d multi-object tracking. In Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part I, pages 680–696. Springer, 2023.
- [14] Charles R. Qi, Yin Zhou, Mahyar Najibi, Pei Sun, Khoa Vo, Boyang Deng, and Dragomir Anguelov. Offboard 3d object detection from point cloud sequences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6134–6144, 2021.
- [15] Felicia Ruppel, Florian Faion, Claudius Gläser, and Klaus Dietmayer. Transformers for multi-object tracking on point clouds. In 2022 IEEE Intelligent Vehicles Symposium (IV), pages 852–859, 2022.
- [16] Leslie N. Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large learning rates. arXiv preprint arXiv:1708.07120, 2017.
- [17] Peize Sun, Jinkun Cao, Yi Jiang, Rufeng Zhang, Enze Xie, Zehuan Yuan, Changhu Wang, and Ping Luo. Transtrack: Multiple object tracking with transformer. arXiv preprint arXiv:2012.15460, 2020.
- [18] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, and et al. Scalability in perception for autonomous driving: An open dataset benchmark. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020.
- [19] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- [20] Xinshuo Weng, Jianren Wang, David Held, and Kris Kitani. Ab3dmot: A baseline for 3d multi-object tracking and new evaluation metrics. arXiv preprint arXiv:2008.08063, 2020.
- [21] Xinshuo Weng, Yongxin Wang, Yunze Man, and Kris Kitani. Gnn3dmot: Graph neural network for 3d multiobject tracking with 2d-3d multi-feature learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, page 6498–6507, 2020.
- [22] Bin Yang, Min Bai, Ming Liang, Wenyuan Zeng, and Raquel Urtasun. Auto4d: Learning to label 4d objects from sequential point clouds. arXiv preprint arXiv:2101.06586, 2021.
- [23] Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11784–11793, 2021.
- [24] Jan-Nico Zaech, Alexander Liniger, Dengxin Dai, Martin Danelljan, and Luc Van Gool. Learnable online graph representations for 3d multi-object tracking. IEEE Robotics and Automation Letters, 7(2):5103–5110, 2022.
- [25] Xingyi Zhou, Tianwei Yin, Vladlen Koltun, and Philipp Krähenbühl. Global tracking transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8771–8780, 2022.
- [26] Tianyu Zhu, Markus Hiller, Mahsa Ehsanpour, Rongkai Ma, Tom Drummond, Ian Reid, and Hamid Rezatofighi. Looking beyond two frames: End-to-end multi-object tracking using spatial and temporal transformers. arXiv preprint arXiv:2103.14829, 2021.
BOTT: Box Only Transformer Tracker for 3D Object Tracking
- Supplementary Material -
In this appendix, we (i) provide more details of track database generation; (ii) study the impact of various data augmentations; (iii) analyse the runtime breakdown of the online tracker; and (iv) provide more details of online tracking.
A.1 Track database generation details
This section presents further details of track database generation. The track database is generated by matching detection boxes to the ground truth provided by nuScenes and WOD, in four steps. First, we deploy CenterPoint [23] on the training, validation, and test sets of nuScenes and WOD following the official instructions at https://github.com/tianweiy/CenterPoint. The pretrained VoxelNet model with flip augmentation is used for nuScenes. As the track database is generated at 10Hz, we downsample the input point clouds but keep all key frames. Since nuScenes only provides 2Hz GT, we interpolate the GT to 10Hz. For WOD, we use the two-stage VoxelNet model with velocity. Second, as the raw detections contain many overlapping boxes and low-score objects, we perform both NMS and low-score filtering on the raw detection boxes; the detailed parameters used for nuScenes and WOD are listed in Table A.1. Then, we perform a class-aware Hungarian association between the filtered detection boxes and the GT boxes. A box pair is considered matched only if the intersection over union (IoU) between them is higher than a minimal threshold (0.0001). A matched detection box is assigned the track ID of its matched GT box. Finally, all detection boxes and GT boxes of a scene, the matching information between them, and other meta information are stored in an individual file, indexed by time frame.
Param | nuScenes car | nuScenes ped | nuScenes other | WOD vehicle | WOD pedestrian | WOD cyclist
---|---|---|---|---|---|---
NMS | 0.1 | 0.25 | 0.1 | 0.1 | 0.25 | 0.1 |
Score | 0.2 | 0.2 | 0.1 | 0.2 | 0.2 | 0.1 |
A.2 Ablation Study on Data Augmentation
This section presents an ablation study on data augmentation, covering three augmentation methods: dropping tracks, global flipping, and global yaw rotation. The results in Table A.2 show that randomly dropping some tracks boosts performance. Global flipping along either the x-axis or the y-axis also improves performance; for larger diversity, flipping along both axes is enabled. Global rotation is applied with the yaw angle drawn uniformly between 0 and a maximal value. All tested maxima, i.e. 30, 60, 90, and 180 degrees, improve performance; since the results are close, we choose 90 degrees for larger diversity. We do not enable global jittering, because jittering all boxes by the same offset does not change the box raw feature encoding after the min-subtraction normalization described in Section 3.2.
Augmentation | AMOTA | IDS | Recall |
---|---|---|---|
none | 68.96 | 509 | 70.9 |
drop track | 69.38 | 434 | 71.8 |
xflip | 69.79 | 537 | 72.7 |
yflip | 69.65 | 479 | 73.3 |
xyflip | 69.40 | 572 | 72.0 |
yaw30 | 69.67 | 522 | 71.5 |
yaw60 | 69.07 | 507 | 71.5 |
yaw90 | 69.56 | 502 | 72.2 |
yaw180 | 69.85 | 505 | 71.6 |
drop+xyflip+yaw90 | 69.91 | 438 | 72.1 |
A.3 Runtime of BOTT Online Tracking
Real-time tracking is critical for autonomous driving. Table A.3 reports the detailed runtime of the BOTT online tracker on a desktop with a 3.6GHz CPU and a GTX 1080-Ti GPU. Overall, the online tracking runtime consists of data loading time, network inference time, and box tracking time. The BOTT tracker runs fast: for instance, the BOTT network with 3 encoder layers tracks at 47.8 fps on the nuScenes val set with an average of 802 boxes per frame. In comparison, running the official AB3DMOT [20] and SimpleTrack [13] code on the nuScenes validation set requires 322.1ms and 1128.4ms per frame, respectively. Their slower runtime is due to (1) class-wise tracking for 7 classes, which sums up the processing time across classes (PolarMOT also trains different networks for different classes); and (2) the computationally expensive generalized IoU between boxes. Additionally, SimpleTrack employs a two-stage association approach that boosts performance at the cost of runtime. In contrast, BOTT uses a single network for class-agnostic affinity matrix computation.
encoder | Data (ms) | Network (ms) | Tracking (ms) | fps |
---|---|---|---|---|
0 | 1.8 | 2.6 | 9.5 | 71.9 |
1 | 1.8 | 4.7 | 9.4 | 62.9
3 | 1.8 | 8.6 | 10.5 | 47.8
6 | 1.8 | 13.6 | 10.0 | 39.4
Table A.4 shows BOTT inference time with different box numbers. In real-world driving scenarios, it is uncommon to encounter more than 3000 boxes. Various optimizations, such as half-precision and TensorRT inference, can be applied to further reduce the time cost.
box number | 500 | 1000 | 2000 | 3000 | 4000 |
---|---|---|---|---|---|
model time (ms) | 3.32 | 7.84 | 23.43 | 53.94 | 85.44 |
A.4 Additional Details for Online Tracking
This section provides more details of the BOTT online tracking described in Section 3.6.1. As mentioned, velocity constraints are applied to forbid faraway box links: the box center distance cannot exceed the product of the class-specific maximal speed and the time difference. The maximal speed for each class is: bicycle=20, pedestrian=10, car=bus=motorcycle=trailer=truck=35. Meanwhile, all links with center distance below the following thresholds will be used for data association: bicycle=2m, pedestrian=1.5m, car=bus=motorcycle=trailer=truck=3m. During detection-track association, associated pairs with cost below the thresholds are used as matched pairs; correspondingly, the minimal linking scores for each class are: bicycle=0.6, car=0.4, pedestrian=bus=motorcycle=trailer=truck=0.5.
A.5 Sliding Window Size
Table A.5 shows how different window sizes affect the results. Two models are evaluated: on the left side, each model is trained with the same window size as used at deployment ($K_{train} = K$); on the right side, a single model trained with $K_{train} = 16$ is deployed with different $K$. Smaller window sizes provide less temporal context, while larger window sizes require larger model capacity, as the number of box pairs grows quadratically, leading to increased computation.
$K_{train}$ | $K$ | AMOTA | IDS | $K_{train}$ | $K$ | AMOTA | IDS
---|---|---|---|---|---|---|---
4 | 4 | 66.4 | 1369 | 16 | 4 | 65.8 | 1586 |
8 | 8 | 68.8 | 818 | 16 | 8 | 67.9 | 889 |
16 | 16 | 69.9 | 438 | 16 | 16 | 69.9 | 438 |
24 | 24 | 69.8 | 483 | 16 | 24 | 68.1 | 663 |