
MotionTrack: Learning Motion Predictor for Multiple Object Tracking

Changcheng Xiao [email protected] Qiong Cao Yujie Zhong Long Lan Xiang Zhang Zhigang Luo Dacheng Tao
Abstract

Significant progress has been achieved in multi-object tracking (MOT) through the evolution of detection and re-identification (ReID) techniques. Despite these advancements, accurately tracking objects in scenarios with homogeneous appearance and heterogeneous motion remains a challenge. This challenge arises from two main factors: the insufficient discriminability of ReID features and the predominant use of linear motion models in MOT. In this context, we introduce a novel motion-based tracker, MotionTrack, centered around a learnable motion predictor that relies solely on object trajectory information. This predictor integrates two levels of granularity of motion features to enhance the modeling of temporal dynamics and facilitate accurate prediction of the future motion of individual objects. Specifically, the proposed approach adopts a self-attention mechanism to capture token-level information and a Dynamic MLP layer to model channel-level features. MotionTrack is a simple, online tracking approach. Our experimental results demonstrate that MotionTrack yields state-of-the-art performance on DanceTrack and SportsMOT, datasets characterized by highly complex object motion.

keywords:
multi-object tracking, nonlinear motion, motion modeling, Transformer
journal: Neural Networks
[label1] School of Computer Science, National University of Defense Technology, Changsha 410073, Hunan, China
[label2] JD Explore Academy, Beijing 102628, China
[label3] Meituan Inc., Beijing 100000, China
[label4] Laboratory of Digitizing Software for Frontier Equipment, National University of Defense Technology, Changsha 410073, Hunan, China
[label5] Institute for Quantum & State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha 410073, Hunan, China

1 Introduction

Multi-object tracking (MOT) has received increasing attention in recent years due to its promising applications [1, 2, 3, 4, 5] in intelligent surveillance, autonomous driving, mobile robotics, etc. Benefiting from the rapid development of object detection [6, 7, 8, 9, 10] and re-identification (ReID) [11, 12, 13, 14], tracking-by-detection methods [15] have become dominant. This paradigm consists of two main steps: 1) using an off-the-shelf object detector to obtain detection results for each frame, and 2) associating the detections into trajectories using visual and motion cues. At the association step, influenced by the inherent characteristics of the existing multi-object/pedestrian benchmarks [16, 17], most recent successes in MOT [13, 14, 18] rely either on appearance features or on a combination of strong detection and a simple tracking mechanism, leaving motion information under-explored. As a result, existing trackers fail in situations [19, 20] where the objects of interest share very similar appearance, as in group dancing, or move rapidly, as in sports scenes. This observation motivates us to incorporate only motion cues into our model, which proves crucial for accurately and efficiently associating objects across frames in these complex situations.

In this work, we focus on learning a motion predictor to boost the accuracy of association and thus the performance of tracking. This is challenging due to complex motion variations across different scenarios and severe occlusions. The motion models used by existing trackers can be divided into classical algorithms based on Bayesian estimation [21, 11, 22, 14, 23] and data-driven algorithms [24, 25, 26, 27]. A representative of the former is the Kalman filter [28]. It assumes constant velocity and thus works well with linear motion, but tends to fail on nonlinear motion [20, 19]. For the latter, the optical flow-based tracking method [24] requires a complex and heavy optical flow model to calculate inter-frame pixel offsets; it can only consider local motion information and is limited by the time-consuming optical flow computation [25]. Furthermore, methods relying on Long Short-Term Memory (LSTM) networks [26, 29] store the motion information of objects within their hidden states. However, LSTMs have faced criticism for their memory mechanism [30] and their limited ability to model long-term temporal interactions [31]. These step-by-step prediction methods also suffer from error accumulation [32, 33], which may result in inaccurate predictions of the spatial locations of objects.

(a) OC_SORT  (b) MotionTrack
Figure 1: A qualitative comparison between the proposed tracker and OC_SORT in a typical nonlinear motion scene. Samples were extracted from frames 87, 126, 128, and 132 of the video Dancetrack0058. In the sequence, as the black-clad dancer turns around and crosses paths with the red-haired dancer, OC_SORT suffers an ID switch (4 → 3), while our tracker successfully continues tracking.

To address the aforementioned challenges and efficiently utilize object trajectory information, we propose a new online tracker, MotionTrack, built upon a motion predictor that directly takes longer trajectories as input and predicts the future location of each object. As shown in Figure 1, our proposed method can robustly track objects in scenarios with occlusion and nonlinear motion, where Kalman filter-based trackers [21, 32] fail. Specifically, we make the following contributions.

Firstly, to address the limitations of existing works [21, 11, 32, 26, 29], which typically process observations sequentially and then predict object bounding boxes autoregressively, we harness the powerful long-range dependency modeling capability of the Transformer [34]. To maintain algorithmic simplicity and computational efficiency, we design a motion predictor based solely on the Transformer encoder to model the long-term trajectory information of individual objects for motion prediction. Specifically, the proposed motion predictor considers multiple observations simultaneously, weighting trajectory embeddings based on token-level pair-wise similarities [34].

Secondly, to further enhance the utilization of trajectory embedding sequences, we introduce a Multi-Layer Perceptron-like architecture named Dynamic MLP. As is commonly recognized, different semantic information in the feature space tends to be distributed across distinct channels [35, 36], and channel mixing provides greater flexibility to explore cross-channel interaction [37, 38, 39]. Inspired by this, we design a more powerful attention module to advance complex motion modeling and capture information distributed across different channels, such as relative position changes and directions of object motion. The proposed Dynamic MLP can precisely explore motion information distributed in different channels within a non-local range through content-adaptive token mixing. We further integrate the Dynamic MLP with the self-attention module in the Transformer [34] to enable message passing at two different granularities, namely the token level and the channel level, so as to aggregate different semantic messages.

Moreover, we obtain more complex motion patterns via different data augmentations to better understand motion dynamics and boost the tracking performance. Specifically, random drop, random spatial jitter and random length are adopted to create motions with fast-moving objects, tracklets of different lengths, etc. Our proposed method, although straightforward, has demonstrated exceptional performance on large-scale datasets such as SportsMOT [19] and DanceTrack [20], where varying motions and uniform appearances are present.

2 Related work

Multi-object Tracking. Early research on multi-object tracking mainly relied on optimization algorithms [40, 41, 42] to solve the data association problem. However, with the advent of deep learning, tracking-by-detection using stronger detectors has become the dominant paradigm in multi-object tracking. In recent years, some approaches have fused detection and tracking tasks into a single network, benefiting from the success of multi-task learning [43] in neural networks. For instance, Tracktor [27] predicts an object’s position in the next frame by utilizing Faster RCNN’s [6] regression head, but it may fail at low frame rates. JDE [13] extends YOLOv3 [7] with a ReID branch to obtain object embedding for data association. To address the problem of detection and ReID tasks competing with each other in the JDE paradigm, Zhang et al. [14] designed FairMOT based on an anchor-free object detector, Centernet [8], achieving better tracking results. Additionally, ByteTrack [23] demonstrated that the performance bottleneck of trackers on mainstream multi-object tracking datasets, MOT [16, 17], is not in the association part but in the detection part. Thus, a powerful detector coupled with a simple hierarchical association strategy can achieve good tracking results.

Motion model. Motion estimation is crucial for object trackers. Early classical multi-object tracking algorithms, such as SORT [21], DeepSORT [11], and MOTDT [22], use the Kalman filter (KF) [28] to predict the inter-frame position offset of each object. However, the KF is limited by its constant velocity assumption and performs poorly in complex situations and under nonlinear motion patterns.

To cope with these challenges, researchers have proposed various data-driven models. For example, Zhang et al.[24] used optical flow to obtain pixel-level motion information of objects, while CenterTrack [25] added an offset branch to an object detector to predict the motion information of the object center. Milan et al.[44] proposed an online tracker based on recurrent neural networks (RNNs) for multi-object tracking, and subsequent studies [45, 46, 47] applied RNNs to either fuse visual and motion information or calculate affinity scores for subsequent data association. ArTIST [26] and DEFT [29] used RNNs to predict the inter-frame motion of the target directly. TMOH [48] extended Tracktor [27] to use a simple linear model to predict the location of lost tracks, replacing visual cues for trajectory retracking.

Recent studies have also improved the KF to handle nonlinear motion better. For example, MAT [49] proposed an IML module that considers both camera and pedestrian motion information, achieving better performance than the vanilla KF. Cao et al.[32] recognized the limitations of the original KF in dealing with occlusion and nonlinear motion and proposed corresponding improvements. They suggested that the KF should trust the recent observations more.

Despite these improvements, most of these studies are still based on the classical KF and its constant velocity model assumption. QuoVadis [50] improves the robustness of existing advanced trackers against long-term occlusions by performing trajectory prediction in a bird's-eye-view (BEV) scene representation; it consists of several complex sub-modules that overcome the limitations of traditional methods.

Transformer-based methods. The recent success of transformer models in computer vision[51, 52, 53, 54], particularly in the field of object detection, has led to the emergence of numerous transformer-based approaches. These include TransTrack [55], TrackFormer [18], TransCenter [56], and MOTR [57], which are contemporaneous online trackers based on DETR [51] and its variants. TrackFormer employs track queries to maintain object identities and utilizes heuristics to suppress duplicate tracks, as in Tracktor [27]. TransTrack directly employs previous object features as track queries to acquire tracking boxes and associates detection boxes based on IoU-matching. TransCenter obtains object center representation through transformer and performs tracking through CenterTrack’s object association [25]. Additionally, MOTR performs object tracking in an end-to-end manner by iteratively updating the track query, without requiring post-processing, making it a concise method. Furthermore, GTR [58] is an offline transformer-based tracker that employs queries to divide detected boxes into trajectories all at once, instead of generating tracking boxes. It should be noted that the training of all these models necessitates a high volume of training samples, expensive computational resources, and lengthy training times. In contrast, our methodology exclusively utilizes the Transformer to leverage object trajectory information. Furthermore, our proposed method requires only trajectory data as input and is distinguished by its rapid training process.

Figure 2: An overview of the proposed method. The proposed motion predictor $\mathcal{MP}$ considers at most $n_{past}$ historical observations of a trajectory when predicting the object position. Given the predicted bounding boxes $\hat{\mathbb{D}}_{t}$, data association is performed by a linear solver, the Hungarian algorithm, based solely on their spatial similarity to the current-frame detections $\mathbb{D}_{t}$. Blank boxes represent missing observations and dashed boxes represent predicted bounding boxes. Different colors represent different objects.

3 Method

The multi-object tracking task involves identifying the spatial and temporal locations of objects, i.e., their trajectories, in a given video sequence. The Transformer model, known for its ability to capture long-term dependencies, has proven highly effective in processing sequence data. Building on this success, we present a motion predictor that utilizes an object's past trajectory to directly predict its position in the next frame, as illustrated in Figure 2. Our data association approach relies solely on spatial similarity between the detection results of the current frame and the predicted bounding boxes of the objects. To achieve this, we first introduce a simple base model, outlined in Section 3.2, based on the vanilla Transformer encoder to capture the temporal dynamics of individual objects. Following that, we introduce the Dynamic MLP, an MLP-like architecture designed to explore channel-wise interactions, and integrate it with the self-attention module to learn information at both token and channel granularities in Section 3.3. Finally, we study various augmentations that construct more complex motion patterns to improve the understanding of motion dynamics. Overall, our approach leverages the strengths of the Transformer model and builds on it to develop a simple and effective motion predictor for multi-object tracking.

3.1 Notations

The trajectory of an object consists of an ordered set of bounding boxes $\mathcal{T}=\{\mathbf{b}_{t_{1}},\mathbf{b}_{t_{2}},\dots\}$, where a bounding box is defined as $\mathbf{b}_{t}=\{x,y,w,h\}$ and $t$ stands for the timestamp. At some moments, observations may be missing due to occlusion or detector failure, so the trajectory is not always continuous. $\mathbf{D}_{t}$ is the set of detections of the $t$-th frame provided by an off-the-shelf detector. The past trajectory of an object can be represented as a sequence $\mathbf{X}=(\dots,\mathbf{x}_{t-2},\mathbf{x}_{t-1})\in\mathbb{R}^{n\times 9}$. The representation of an object at moment $t-1$ is denoted as follows:

$\mathbf{x}_{t-1}=(c_{x},c_{y},w,h,a,\delta_{c_{x}},\delta_{c_{y}},\delta_{w},\delta_{h}),$ (1)

where $(c_{x},c_{y})$ represents the center coordinate of the object in the image plane, and $w$, $h$ and $a$ stand for the width, height and aspect ratio of its bounding box, respectively. $\delta_{c_{x}}$, $\delta_{c_{y}}$, $\delta_{w}$, and $\delta_{h}$ denote the changes in the center position, width, and height relative to the previous observation.
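For concreteness, the sketch below builds the nine-dimensional representation of Eq. (1) from two consecutive boxes. The $(x,y,w,h)$ top-left-corner convention and the use of raw (unnormalized) differences for the $\delta$ terms are assumptions of this sketch, since the paper does not fix them.

```python
import numpy as np

def observation_features(curr_box, prev_box):
    """Build the 9-dim per-step representation of Eq. (1).

    Boxes are (x, y, w, h) with (x, y) the top-left corner; this convention and the
    raw differences used for the delta terms are assumptions of this sketch.
    """
    x, y, w, h = curr_box
    cx, cy = x + w / 2.0, y + h / 2.0
    a = w / h                                        # aspect ratio
    px, py, pw, ph = prev_box
    pcx, pcy = px + pw / 2.0, py + ph / 2.0
    deltas = (cx - pcx, cy - pcy, w - pw, h - ph)    # change w.r.t. the previous observation
    return np.array([cx, cy, w, h, a, *deltas], dtype=np.float32)

# Example: two consecutive detections of the same object.
x_prev = observation_features((100, 50, 40, 80), (98, 50, 40, 80))
```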

For each individual object, the motion predictor, denoted as $\mathcal{MP}$, leverages up to $n_{past}$ of its historical observations, represented by $\mathbf{X}_{t-n_{past}:t-1}$, to predict the positional offset $\boldsymbol{\hat{O}}_{t}$ in the subsequent frame. The objective of object motion prediction is to forecast the relative spatial displacement of the object bounding box based on its historical trajectory information:

$\mathbf{X}_{t-n_{past}:t-1}=\text{Concat}(\mathbf{x}_{t-n_{past}},\dots,\mathbf{x}_{t-1}), \quad \boldsymbol{\hat{O}}_{t}=\mathcal{MP}(\mathbf{X}_{t-n_{past}:t-1}).$ (2)

Before being fed into the encoder layers for further processing, $\mathbf{X}$ is embedded into a higher-dimensional space by a linear projection, i.e., $\bar{\mathbf{X}}=\mathbf{W}_{x}\mathbf{X}\in\mathbb{R}^{n\times d_{m}}$. In order to make the input token sequence $\mathbf{E}\in\mathbb{R}^{n\times d_{m}}$ contain relative position information, we inject sinusoidal positional encodings into the input embeddings as in [34].
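A minimal sketch of this input embedding follows, assuming the standard sinusoidal encoding of [34] and using $d_m = 512$ and $n_{past} = 10$ as reported in the implementation details.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_pe(n, d_model):
    """Standard sinusoidal positional encoding from Vaswani et al. [34]."""
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)                 # (n, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(n, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

proj = nn.Linear(9, 512, bias=False)          # W_x: 9-dim observations -> d_m = 512
X = torch.randn(10, 9)                        # n_past = 10 historical observations
E = proj(X) + sinusoidal_pe(10, 512)          # token sequence fed to the encoder
```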

3.2 Vanilla Transformer-based Motion Predictor

The future motion of an object is significantly influenced by its past dynamic information. An intuitive approach is to utilize a vanilla Transformer encoder to capture the historical context of individual objects flexibly and efficiently. This encoder can model the long-term dependencies within the object's trajectory history, and its primary component is the multi-head self-attention (MHSA) mechanism. By employing MHSA, the encoder can efficiently attend to various elements of the trajectory sequence and identify the most informative features that contribute to predicting the future motion of the object.

We compute the attention for each object separately. The input sequence of tokens $\mathbf{E}$ is linearly transformed into $Q$, $K$, and $V$, which are the query, key, and value, respectively. In multi-head self-attention, a set of single-head attentions jointly attends to information from different representation subspaces. The attention of a single head is calculated as:

$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V,$ (3)

where $d_{k}$ is the dimensionality of the corresponding hidden representation and serves as a scaling factor. The multi-head attention variant employs $h$ heads, whose outputs $head_{1},\dots,head_{i},\dots,head_{h}$ are computed as

$head_{i}=\text{Attention}_{i}(Q_{i},K_{i},V_{i}),$ (4)

where $Q_{i}$, $K_{i}$, $V_{i}$ are the $i$-th parts of $Q$, $K$ and $V$. The outputs of the $h$ heads are concatenated and linearly transformed to obtain the final output.
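The following sketch illustrates Eqs. (3)-(4) for a single trajectory of $n = 10$ tokens with $d_m = 512$ and $h = 8$ heads; the fused QKV projection is an implementation convenience assumed here, not a detail confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    """Scaled dot-product attention of Eq. (3)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    return F.softmax(scores, dim=-1) @ V

# Multi-head variant: split d_m = 512 into h = 8 heads of 64 channels each,
# apply Eq. (4) per head, then concatenate and linearly project the result.
E = torch.randn(10, 512)                       # one trajectory, n = 10 tokens
W_qkv = torch.nn.Linear(512, 3 * 512)
W_o = torch.nn.Linear(512, 512)
Q, K, V = W_qkv(E).chunk(3, dim=-1)
heads = [attention(q, k, v) for q, k, v in zip(Q.chunk(8, -1), K.chunk(8, -1), V.chunk(8, -1))]
out = W_o(torch.cat(heads, dim=-1))            # (n, d_m)
```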

Multi-object tracking is challenging because the interaction between objects, i.e., the effect of one object's behaviour on others, is a sophisticated process. Like some previous works [26, 59, 60], we also tried to model "interactions" between objects. We follow AgentFormer [61] and use an agent-aware attention mechanism to model multi-object motion in both temporal and social dimensions using sequential representations. We denote this variant as mult.

Figure 3: The network structure of the dynamic MLP is shown in (a), and the dynamic FC operation is shown in (b).

3.3 Dual-granularity Information Fusion

Token-level message passing alone, applied uniformly across all channels, lacks the capacity to adequately convey divergent semantics. This limitation restricts their exploitation and consequently leads to sub-optimal performance [35, 36, 38, 39]. Therefore, a more fine-grained message passing mechanism is needed to cater to varying semantics in an adaptive manner. Such an approach enables more sufficient motion modeling by allowing finer-grained communication of information across channels.

Dynamic MLP

We explore a new module named Dynamic MLP (DyMLP), placed in parallel with self-attention in the vanilla Transformer encoder layer, to further capture complicated motion patterns of objects. It aggregates multiple positions distributed across different temporal channels. As shown in Fig. 3, the core of DyMLP is the channel fusion layer (CFL), which consists of a dynamic fully-connected layer (DyFC) and an identity layer. Given the input token sequence $\mathbf{E}\in\mathbb{R}^{n\times d_{m}}$, for each token $e_{i}\in\mathbb{R}^{d_{m}}$ we first use an FC layer to predict $d_{m}$ offsets: $\Delta=\{\delta_{i}\}_{i=1}^{d_{m}}$. Since there is no restriction on the generation of offsets, DyFC can aggregate temporally global channel information, as shown in Fig. 3. The basic DyFC operator can be formulated as follows:

$\hat{e}^{T}_{i}={\rm DyFC}(e_{i})=\tilde{e}_{i}\cdot\mathbf{W}+\mathbf{b}, \quad \tilde{e}_{i}=[\mathbf{E}_{[i+\delta_{1},1]},\mathbf{E}_{[i+\delta_{2},2]},\cdots,\mathbf{E}_{[i+\delta_{d_{m}},d_{m}]}],$ (5)

where $\mathbf{W}\in\mathbb{R}^{d_{m}\times d_{m}}$ and $\mathbf{b}\in\mathbb{R}^{d_{m}}$ are learnable parameters. Besides, we preserve the original token information using the identity layer, whose output is $\hat{e}^{I}_{i}$. The CFL outputs the fusion result as a weighted sum of $\hat{e}^{T}$ and $\hat{e}^{I}$, which can be formulated as:

$\hat{e}=\omega^{T}\odot\hat{e}^{T}+\omega^{I}\odot\hat{e}^{I},$ (6)

where $\odot$ is the Hadamard product and the weights $\omega^{\{T,I\}}\in\mathbb{R}^{d_{m}}$ are calculated from the following equation:

$[\omega^{I},\omega^{T}]=\text{softmax}([W^{I}\cdot\dot{x},W^{T}\cdot\dot{x}]),$ (7)

where $\dot{x}\in\mathbb{R}^{d_{m}}$ is the average of $\hat{e}^{T}$ and $\hat{e}^{I}$, $W^{\{I,T\}}\in\mathbb{R}^{d_{m}\times d_{m}}$ are learnable parameters, and $\text{softmax}(\cdot)$ is the channel-wise normalization operation.

Hence, each encoder layer contains two sub-layers. We use residual connections [62] in both sub-layers and then perform layer normalization (LN) [63]. Mathematically, the whole process in the encoder layer can be described as follows:

$\begin{aligned}\text{DIF}(E^{l-1})&=\text{MHSA}(E^{l-1})+\text{DyMLP}(E^{l-1}),\\ \hat{E}^{l}&=\text{LN}(\text{DIF}(E^{l-1}))+E^{l-1},\\ E^{l}&=\text{LN}(\text{FFN}(\hat{E}^{l}))+\hat{E}^{l},\end{aligned}$ (8)

where $l$ denotes the $l$-th layer and FFN denotes a feed-forward network.
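To make Eqs. (5)-(8) concrete, the sketch below implements DyFC, the channel fusion layer of DyMLP, and one encoder layer on batched token sequences. Rounding and clamping the predicted offsets to valid token indices, and the feed-forward width of 2048, are assumptions of this sketch rather than details confirmed by the paper.

```python
import torch
import torch.nn as nn

class DyFC(nn.Module):
    """Dynamic FC of Eq. (5): gather one token per channel using predicted offsets."""
    def __init__(self, d_model):
        super().__init__()
        self.offset_fc = nn.Linear(d_model, d_model)   # predicts one offset per channel
        self.fc = nn.Linear(d_model, d_model)          # W, b of Eq. (5)

    def forward(self, E):                              # E: (B, n, d_model)
        B, n, d = E.shape
        offsets = self.offset_fc(E).round().long()     # discretised offsets (assumption)
        base = torch.arange(n, device=E.device).view(1, n, 1)
        idx = (base + offsets).clamp(0, n - 1)
        gathered = E.gather(1, idx)                    # e~_i[c] = E[i + delta_c, c]
        return self.fc(gathered)

class DyMLP(nn.Module):
    """Channel fusion layer of Eqs. (6)-(7): weighted sum of dynamic and identity branches."""
    def __init__(self, d_model):
        super().__init__()
        self.dyfc = DyFC(d_model)
        self.w_i = nn.Linear(d_model, d_model, bias=False)   # W^I
        self.w_t = nn.Linear(d_model, d_model, bias=False)   # W^T

    def forward(self, E):
        e_t, e_i = self.dyfc(E), E
        x_dot = 0.5 * (e_t + e_i)                            # average of the two branches
        w = torch.softmax(torch.stack([self.w_i(x_dot), self.w_t(x_dot)]), dim=0)
        return w[0] * e_i + w[1] * e_t

class DualIFLayer(nn.Module):
    """Encoder layer of Eq. (8): MHSA and DyMLP fused in parallel, then an FFN."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.dymlp = DyMLP(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, E):
        attn, _ = self.mhsa(E, E, E)
        E = self.ln1(attn + self.dymlp(E)) + E           # dual-granularity fusion + residual
        return self.ln2(self.ffn(E)) + E
```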

Figure 4: The architecture of the proposed motion predictor.

Dual-granularity Information Fusion Layer

With the introduction of the proposed Dynamic MLP, we integrate it with the self-attention module of the Transformer [34]. This integration leverages information at both token-level and channel-level granularity. We refer to this combined module as the Dual-granularity Information Fusion Layer (DualIF). As illustrated in Fig. 4, the backbone of the motion predictor $\mathcal{MP}$ comprises $L$ sequentially connected encoding layers. Each encoding layer is composed of two sub-modules: a multi-head self-attention module and a dynamic MLP module. The former incorporates token-level information from other tokens into the query based on computed pairwise attention weights, while the latter adaptively performs feature fusion of context distributed at the channel level.
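A sketch of the overall predictor assembly follows: it stacks $L = 6$ encoder layers, mean-pools the encoder memory over time (the best pooling choice in Table 7), and regresses the four offsets of Eq. (2). For self-containment it uses PyTorch's stock TransformerEncoderLayer as a stand-in for the DualIF layer sketched above; the exact head design is an assumption.

```python
import torch
import torch.nn as nn

class MotionPredictorSketch(nn.Module):
    def __init__(self, d_in=9, d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        self.embed = nn.Linear(d_in, d_model, bias=False)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)   # stand-in for L DualIF layers
        self.head = nn.Linear(d_model, 4)                       # offsets (d_cx, d_cy, d_w, d_h)

    def forward(self, X, pe):
        # X: (B, n, 9) past observations; pe: (n, d_model) sinusoidal positional encoding
        memory = self.encoder(self.embed(X) + pe)
        return self.head(memory.mean(dim=1))                    # mean pooling over time

predictor = MotionPredictorSketch()
offsets = predictor(torch.randn(64, 10, 9), torch.zeros(10, 512))  # (64, 4)
```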

3.4 Training and inference

Data augmentation creates complex motion patterns

As in other deep learning methods, such as CenterNet [8], CenterTrack [25], and YOLOX [9], data augmentation is crucial for enhancing the performance of the model. The limited availability of training samples in existing datasets, coupled with the variability of motion across diverse datasets, necessitates the exploration of augmentation strategies that enrich the training samples and advance the proficiency of our predictor in modeling motion dynamics. In this study, we examine three distinct augmentation strategies (sketched in code after the list below). They correspond to various cases that must be handled by the motion model, namely object position jitter and motion mutations, detection noise, and newly initialized shorter trajectories.

1. Random drop. In generating the target trajectory representation $\mathbf{X}$, we randomly ignore an observation $\mathbf{b}$ with probability $p_{i}$. This strategy allows us to simulate fast motion and low frame rates.

2. Spatial jitter. To enhance robustness against detection noise, it is a widely adopted practice to add spatial jitter to the bounding boxes. This technique introduces small variations in the position and size of the bounding boxes during training. By doing so, the model becomes more resilient to variations in object localization caused by detection noise. In a temporal sense, this approach also augments the training data with a broader range of motion patterns.

3. Random length. Newly born trajectories are short, so little temporal information is available for them. By randomly varying the length of the training sequence, the model is exposed to a diverse range of motion patterns, which enables it to generalize to different temporal contexts. To mimic this, we train the motion model using a sequence of observations of arbitrary length (in the range [2, $n_{past}$]) for each object.
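A minimal sketch of these three augmentations applied to a single trajectory is given below; the drop probability follows the implementation details ($p_{i} = 0.1$), while the jitter magnitude and the exact order of the operations are assumptions.

```python
import random

def augment_trajectory(boxes, p_drop=0.1, jitter=0.05, n_past=10):
    """Apply the three augmentations to one trajectory (list of (x, y, w, h) boxes).

    p_drop follows the paper (p_i = 0.1); the jitter magnitude is an assumption.
    """
    # Random length: keep only a suffix of random length in [2, n_past].
    if len(boxes) > 2:
        boxes = boxes[-random.randint(2, min(n_past, len(boxes))):]
    out = []
    for t, (x, y, w, h) in enumerate(boxes):
        # Random drop: skip an intermediate observation with probability p_drop
        # (simulates fast motion and low frame rates); always keep the latest box.
        if t < len(boxes) - 1 and random.random() < p_drop:
            continue
        # Spatial jitter: perturb position and size proportionally to the box scale
        # to mimic detection noise.
        x += random.uniform(-jitter, jitter) * w
        y += random.uniform(-jitter, jitter) * h
        w *= 1.0 + random.uniform(-jitter, jitter)
        h *= 1.0 + random.uniform(-jitter, jitter)
        out.append((x, y, w, h))
    return out
```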

Training loss

We adopt the smooth L1 loss [64] to supervise the training process. Formally, given the predicted offsets $\boldsymbol{\hat{O}}=\{\delta_{c_{x}},\delta_{c_{y}},\delta_{w},\delta_{h}\}$ and the corresponding ground truth $\boldsymbol{O}$, the loss is obtained by

$L(\hat{O},O)=\sum_{i\in\{c_{x},c_{y},w,h\}}\text{smooth}_{L_{1}}(\hat{\delta}_{i}-\delta_{i}),$ (9)

in which

$\text{smooth}_{L_{1}}(x)=\begin{cases}0.5x^{2}&\text{if }|x|<1\\ |x|-0.5&\text{otherwise}.\end{cases}$ (10)
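With the threshold fixed at 1 as in Eq. (10), this loss coincides with PyTorch's built-in smooth L1 loss with its default beta of 1. The sketch below sums it over the four offsets; averaging over the batch is an assumption, as the paper does not state the reduction.

```python
import torch
import torch.nn.functional as F

# Eqs. (9)-(10): smooth L1 over the four predicted offsets, averaged over the batch
# (the batch reduction is an assumption of this sketch).
pred_offsets = torch.randn(64, 4, requires_grad=True)   # (d_cx, d_cy, d_w, d_h)
gt_offsets = torch.randn(64, 4)
loss = F.smooth_l1_loss(pred_offsets, gt_offsets, reduction="sum") / pred_offsets.size(0)
loss.backward()
```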

Inference

First, we decode the predicted offsets $\hat{\mathbf{O}}_{t}$ into the trajectory bounding boxes $\hat{\mathbf{D}}_{t}$ in the current frame. As in some classic online trackers [21, 11, 22, 23], we exploit a simple association algorithm. Detections are assigned to tracklets based on the Intersection-over-Union (IoU) similarity between $\hat{\mathbf{D}}_{t}$ and $\mathbf{D}_{t}$ using the Hungarian algorithm [65]. Unassigned detections are initialized as new trajectories. If no detection is assigned to a trajectory, the trajectory is marked as lost, and if the lost duration exceeds a given threshold, the target is considered out of view and removed from the trajectory set. Lost trajectories may also be re-tracked in the assignment step.
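A minimal sketch of this association step uses SciPy's Hungarian solver on an IoU cost matrix; the corner-format boxes and the IoU acceptance threshold of 0.3 are assumptions not specified in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(pred, det):
    """Pairwise IoU between predicted track boxes and detections, both (x1, y1, x2, y2)."""
    x1 = np.maximum(pred[:, None, 0], det[None, :, 0])
    y1 = np.maximum(pred[:, None, 1], det[None, :, 1])
    x2 = np.minimum(pred[:, None, 2], det[None, :, 2])
    y2 = np.minimum(pred[:, None, 3], det[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_d = (det[:, 2] - det[:, 0]) * (det[:, 3] - det[:, 1])
    return inter / (area_p[:, None] + area_d[None, :] - inter + 1e-6)

def associate(pred, det, iou_thresh=0.3):
    """Hungarian matching on the IoU cost; the threshold value is an assumption."""
    cost = 1.0 - iou_matrix(pred, det)
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= 1.0 - iou_thresh]
    unmatched_tracks = set(range(len(pred))) - {r for r, _ in matches}
    unmatched_dets = set(range(len(det))) - {c for _, c in matches}
    return matches, unmatched_tracks, unmatched_dets
```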

Given the detections from an object detector, our tracker associates identities over video sequences in an online manner, exploiting only motion cues. The overall tracking pipeline is shown in Algorithm 1. For brevity, trajectory rebirth is not shown.

Input: Detections $D=\{\mathbf{b}_{t}^{i}\mid 1\leq t\leq M,1\leq i\leq N_{t}\}$, motion predictor $\mathcal{MP}$, threshold for retaining a lost track $t_{max}$.
Output: Tracks $T$ of the video.
Initialization: $T\leftarrow\emptyset$ and $\mathcal{MP}$;
for $t\leftarrow 1:M$ do
      $\mathbf{D}_{t}\leftarrow[\mathbf{b}_{t}^{1},\cdots,\mathbf{b}_{t}^{N_{t}}]$  // Detections of the current frame
      $\hat{\mathbf{D}}_{t}\leftarrow[\hat{\mathbf{b}}_{t}^{1},\cdots,\hat{\mathbf{b}}_{t}^{|T|}]$ from $T$  // Predicted bounding boxes
      $\mathbf{C}_{t}\leftarrow C_{\rm IoU}(\hat{\mathbf{D}}_{t},\mathbf{D}_{t})$  // Cost matrix based on IoU similarity
      /* Assign detections to tracks using the Hungarian algorithm */
      $\mathcal{M},\mathcal{T}_{u},\mathcal{D}_{u}\leftarrow{\rm assignment}(\mathbf{C}_{t})$
      $T\leftarrow\{T_{i}(\mathbf{b}_{t}^{j}),\forall(i,j)\in\mathcal{M}\}$  // Update the matched tracks
      $T\leftarrow\{T_{i}.age\mathrel{+}=1,\forall i\in\mathcal{T}_{u}\}$  // Age the unmatched tracks
      $T\leftarrow\{T_{i}(D_{j}),\forall j\in\mathcal{D}_{u},i=|T|+1\}$  // Initialize new tracks from unmatched detections
      Kill lost tracks with age $\geq t_{max}$
      for each track $T_{i}$ in $T$ do
            $\mathcal{MP}(T_{i})$  // Predict motion of tracks
      end for
end for
Algorithm 1: Pseudo-code of MotionTrack.

4 Experiments

4.1 Datasets and Metrics

Datasets. In order to conduct a comprehensive evaluation of the proposed algorithm, we performed experiments on two datasets characterized by complex motion patterns, namely DanceTrack [20] and SportsMOT [19]. The SportsMOT dataset offers video clips of three different sports categories, i.e., basketball, football, and volleyball. These clips are collected from various sources such as the Olympic Games, the NCAA Championship, and the NBA on YouTube, and cover a wide range of complex sports scenes captured from different perspectives. The dataset contains 45 video clips in each of the training and validation sets. Due to the similar appearance of athletes and the complex motion scenarios, SportsMOT requires a high level of robustness in tracking algorithms. DanceTrack, on the other hand, is a recently proposed dataset that offers more training and evaluation videos. Objects in this dataset possess highly similar appearance, are severely occluded from each other, and exhibit a high degree of nonlinear motion, making it challenging for existing advanced approaches based on appearance and motion information. Therefore, we aim to propose a better motion model that improves the tracking algorithm's ability to cope with frequent crossings and nonlinear motion. The SportsMOT and DanceTrack datasets provide ideal benchmarks for evaluating the performance of tracking algorithms.

Evaluation Metrics. To evaluate our algorithm, we adopt the Higher Order Tracking Accuracy metrics (HOTA, AssA, DetA) [66], IDF1 [67] and the CLEAR metrics (MOTA, FP, FN, IDs, etc.) [68] to assess different aspects of the tracking algorithm. MOTA is calculated from FN, FP and IDs, and is susceptible to the influence of detection results. IDF1 focuses on measuring association performance. HOTA is designed to fairly combine the evaluation of detection and association; therefore, we use it as the primary metric.

4.2 Implementation Details

Our primary focus is on developing a motion model for tracking objects. For this purpose, we use the publicly available YOLOX detector weights provided by ByteTrack [23], DanceTrack [20] and SportsMOT [19], respectively, to detect objects in the MOT, DanceTrack and SportsMOT datasets. The motion predictor includes $L=6$ encoder layers, and the input token dimension $d_{m}$ is set to 512. The multi-head self-attention uses 8 heads, and the drop probability for data augmentation during training is set to $p_{i}=0.1$. We set the maximum historical observation window $n_{past}$ to 10 and the batch size to 64. We use the Adam optimizer [69] with $\beta_{1}=0.9$, $\beta_{2}=0.98$ and $\epsilon=10^{-8}$. The learning rate is adjusted dynamically during training following the approach proposed in [34].
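For reference, the schedule of [34] can be wired in as sketched below; the warm-up length of 4000 steps and the use of LambdaLR are assumptions, since the paper only states that it follows [34].

```python
import torch

def transformer_lr(step, d_model=512, warmup=4000):
    """Learning-rate schedule of Vaswani et al. [34]; the warm-up length is an assumption."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Wiring it into Adam with the stated hyper-parameters (base lr of 1.0 so the
# LambdaLR multiplier equals the effective learning rate).
model = torch.nn.Linear(9, 4)  # placeholder for the motion predictor
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-8)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda s: transformer_lr(s + 1))
```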

Table 1: Comparison to the state-of-the-art methods on the DanceTrack test set. The best results are shown in bold. ↑: higher is better.
Tracker HOTA↑ DetA↑ AssA↑ MOTA↑ IDF1↑
DeepSORT [11] 45.6 71.0 29.7 87.8 47.9
MOTR [57] 54.2 73.5 40.2 79.7 51.5
FairMOT [14] 39.7 66.7 23.8 82.2 40.8
QDTrack [70] 45.7 72.1 29.2 83.0 44.8
TransTrk [55] 45.5 75.9 27.5 88.4 45.2
TraDes [71] 43.3 74.5 25.4 86.2 41.2
CenterTrack [25] 41.8 78.1 22.6 86.8 35.7
SORT [21] 47.9 72.0 31.2 91.8 50.8
ByteTrack [23] 47.3 71.6 31.4 89.5 52.5
OC_SORT [32] 55.1 80.3 38.0 89.4 54.2
Ours 58.2 81.4 41.7 91.3 58.6
Table 2: Comparison to the state-of-the-art methods on the SportsMOT test set. The best results are shown in bold. ↑: higher is better.
Tracker HOTA↑ DetA↑ AssA↑ MOTA↑ IDF1↑
FairMOT[14] 49.3 70.2 34.7 86.4 53.5
MixSort-Byte [19] 65.7 78.8 54.8 96.2 74.1
MixSort-OC [19] 74.1 88.5 62.0 96.5 74.4
QDTrack[70] 60.4 77.5 47.2 90.1 62.3
TransTrack[55] 68.9 82.7 57.5 92.6 71.5
GTR[58] 54.5 64.8 45.9 67.9 55.8
CenterTrack[25] 62.7 82.1 48.0 90.8 60.0
ByteTrack[23] 62.8 77.1 51.2 94.1 69.8
OC-SORT[32] 71.9 86.4 59.8 94.5 72.2
Ours 74.0 88.8 61.7 96.6 74.0

4.3 Benchmark Evaluation

We assessed the efficacy of the proposed method by testing it on two datasets comprising various motion patterns.

DanceTrack. Table 1 presents a comparison between our proposed method and the current state-of-the-art techniques on the DanceTrack test set. The data presented in the table illustrate that our method, relying solely on motion cues, exhibits superior performance compared with a spectrum of algorithms that integrate both motion and appearance cues, or use either of these tracking cues in isolation. Specifically, our method achieves a HOTA score that is 3.1 percentage points better than OC_SORT [32], which enhances the Kalman filter. Furthermore, on the evaluation metrics that concentrate on measuring association performance, namely AssA and IDF1, our approach leads by 3.7 and 4.4 percentage points, respectively. These findings suggest that our proposed data-driven motion model outperforms SORT-like techniques [21, 11, 23] based on the standard Kalman filter [28]. Additionally, our method surpasses the advanced fully end-to-end MOTR [57] method. These results provide evidence that our proposed motion model can effectively model the temporal motion of objects and achieve superior performance in complex motion scenes.

SportsMOT. SportsMOT is a newly proposed dataset mainly characterized by a wide variety of object motion patterns, while the appearance of the athletes is similar yet distinguishable. As shown in Table 2, our approach surpasses all methods that rely solely on either motion or appearance cues for tracking. More specifically, our method outperforms OC_SORT [32] by 2.1 percentage points on the HOTA metric and by 1.9 percentage points on the AssA metric. Furthermore, our method demonstrates performance comparable to state-of-the-art hybrid tracking algorithms that leverage both appearance and motion information. The excellent performance on the SportsMOT dataset underscores the proficiency of our algorithm in accurately modeling dynamic and variable-speed motion, thereby enabling precise tracking of athletes across a diverse range of sports scenarios.

Qualitative Results. To provide a more intuitive picture of MotionTrack's superiority over OC_SORT, we provide additional visualizations for comparison. We show qualitative results of our method on SportsMOT in Fig. 5. In addition, Fig. 6 shows samples where OC_SORT suffers from ID switches caused by nonlinear motion or occlusion while our method successfully copes.

Figure 5: Qualitative results of our method on SportsMOT. Bounding boxes of different colors indicate different identities. Best viewed in color and zoomed in.
(a) OC_SORT: dancetrack0004.  (b) MotionTrack: dancetrack0004.
(c) OC_SORT: dancetrack0097.  (d) MotionTrack: dancetrack0097.
(e) OC_SORT: dancetrack0094.  (f) MotionTrack: dancetrack0094.
Figure 6: Qualitative comparison between MotionTrack and OC_SORT [32] on the DanceTrack validation set. Each pair of rows shows the comparison for one sequence. The color of the bounding boxes represents the identity of the tracks. To be exact, the ID switch in (a) occurs between frames #45 → #54; in (c) between #310 → #333; and in (e) between #432 → #441.

4.4 Ablation Study

We conduct ablation studies on the validation set of DanceTrack [20] to investigate the impact of different training data, model components, data augmentation methods, and some hyper-parameters on the proposed method.

Impact of motion modelling. To validate the efficacy of the proposed motion predictor, we conducted a comparative analysis with several existing motion models. As illustrated in Figure 7, the outcomes demonstrate that incorporating motion information yields significantly better performance than not using motion information (i.e., naive IoU association only). Furthermore, the HOTA metric reveals that our approach outperforms the Kalman filter [28], which relies on a linear motion hypothesis, by 7.8 percentage points, and the LSTM [29], which employs a recurrent neural network, by 3.4 percentage points. This notable improvement is attributed to the dual-granularity information incorporated in our motion model, which provides a global temporal perspective and thereby improves motion learning.

Figure 7: Comparing different motion models.
Table 3: Ablation study on different components. “MHSA” stands for multi-head self-attention, “DyMLP” for dynamic MLP.
HOTA↑ DetA↑ AssA↑ MOTA↑ IDF1↑
w/o MHSA 53.6 78.0 37.0 89.2 54.1
w/o DyMLP 53.0 78.6 35.8 89.1 53.2
Ours 54.6 78.6 38.1 89.2 54.6
Table 4: Evaluation of the impact of inter-object interactions. Mult. denotes the method of modeling inter-target interactions using agent-aware attention.
HOTA↑ DetA↑ AssA↑ MOTA↑ IDF1↑
Mult. 53.1 78.9 35.9 89.2 53.0
Ours 54.6 78.6 38.1 89.2 54.6

Impact of model components. To further investigate the impact of the key components of our motion predictor, we conducted an ablation study by removing them individually and evaluating the resulting model’s performance. The results of this study are presented in Table 3. As shown in the table, when the MHSA module and the DyMLP module are removed separately from the motion predictor, there is a decline in the HOTA metric by 1% and 1.6%, respectively. Moreover, the association score (AssA) also suffers a reduction of 1.1% and 2.3%, respectively. These findings highlight the crucial role of both token-level and channel-level feature fusion in learning motion models that capture the object’s temporal dynamics effectively.

Does the interaction between objects help? In the context of multi-object tracking, it is pertinent to consider object interactions due to the complexity of the task. In this part, we adopt the encoder component of AgentFormer [61] to implement an agent-aware attention mechanism that jointly models the temporal and social dimensions of multi-object trajectories using a sequence representation. It is noteworthy that conventional trajectory prediction tasks typically require a complete history of the target's past trajectory to predict a fixed-length future trajectory. To adapt this, mult (see Sec. 3.2) utilizes a sequence representation of multi-object trajectories by concatenating trajectory features across time and objects. Unlike traditional trajectory prediction tasks, the number of targets in multi-object tracking varies over time and the target trajectories may not be continuous at all times. To adapt the agent-aware attention mechanism, we employ learnable embeddings to fill in the missing objects' representations within the maximum observation window $n_{past}$. Consequently, mult can be considered a trajectory prediction task consistent with AgentFormer, but predicting only a single time step. Our experimental results, presented in Tab. 4, reveal that in scenarios with multiple motion patterns created by choreographers, modeling inter-target interaction does not enhance the performance of multi-object tracking. We must emphasize that our study only explores the potential of joint attention-based social and temporal modeling; other methods may be effective.

Table 5: Ablation on different data augmentation strategies. 'D' stands for random drop, 'J' for spatial jitter and 'L' for random length.
HOTA↑ DetA↑ AssA↑ MOTA↑ IDF1↑
No aug. 53.5 78.6 36.5 89.3 53.5
+ D 54.3 78.4 37.8 89.2 54.2
+ J 54.4 78.6 37.8 89.2 54.5
+ D + J 54.6 78.6 38.1 89.2 54.6
+ D + J + L 53.5 78.7 36.5 89.3 53.5
+ L 52.4 78.6 35.1 89.2 52.6
+ D + L 53.9 78.5 37.1 89.3 53.9
+ J + L 54.1 78.2 37.6 89.3 54.5

Impact of different data augmentations. As delineated in Section 3.4, we introduce three data augmentation strategies, namely Random drop, Spatial jitter, and Random length. To investigate their efficacy, we conducted a comparative analysis using the HOTA metric. The results of this experiment are illustrated in Table 5 (①). It is worth noting that using Random drop and Spatial jitter brings HOTA improvements of 0.8 and 0.9 percentage points, respectively, compared to not using any data augmentation. Using both techniques at the same time brings a 1.1 percentage point improvement. However, when Random length is added to the augmentation process, a decrease in performance is observed. To verify this trend, we conducted a second set of experiments, detailed in Table 5 (②), which confirmed that Random length does not contribute any positive benefit. Based on these results, we retain only the first two data augmentation techniques, namely Random drop and Spatial jitter.

Table 6: Impact of the random drop probability $p_{i}$ during training.
$p_{i}$ HOTA↑ DetA↑ AssA↑ MOTA↑ IDF1↑
0 53.5 78.6 36.5 89.3 53.5
0.1 54.3 78.4 37.8 89.2 54.2
0.2 54.0 78.6 37.3 89.2 54.1
0.3 53.6 78.7 36.6 89.3 54.2

Impact of random drop probability during training. The approach of randomly dropping a historical observation with a certain probability $p_{i}$ enables the generation of more complex motion patterns, such as rapid movements of an object. The impact of different values of $p_{i}$ on the tracking performance is investigated in Table 6. The results indicate that as $p_{i}$ gradually increases from 0 to 0.3, the performance of the tracking method varies. Notably, the highest score on the HOTA metric is achieved when $p_{i}$ is set to 0.1. This finding suggests that a moderate level of dropping historical observations can enhance the tracking performance, while excessively high probabilities of dropping such observations may have a negative impact on the tracking quality.

Table 7: Comparison of different pooling types.
HOTA↑ DetA↑ AssA↑ MOTA↑ IDF1↑
sum 51.7 78.2 34.3 89.2 51.9
last 53.8 78.7 37.0 89.2 54.0
mean 54.6 78.6 38.1 89.2 54.6

Impact of different memory pooling types. The motion predictor proposed in this study utilizes the encoder of a Transformer to capture the temporal dynamics of an object, resulting in the generation of memory. To feed this memory to the regression head, a pooling operation is necessary. Specifically, we explore three different types of pooling operations, namely mean, sum, and last, which respectively refer to taking the mean value of the encoder memory along the time dimension, summing the memory, or utilizing the last moment representation. The results presented in Table 7 demonstrate that the mean pooling operation achieves the best performance among the three tested options.
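A minimal sketch of the three pooling variants over the encoder memory:

```python
import torch

memory = torch.randn(64, 10, 512)     # encoder memory: (batch, n_past, d_m)
pooled_mean = memory.mean(dim=1)      # "mean": average over the time dimension (best in Table 7)
pooled_sum = memory.sum(dim=1)        # "sum": summation over time
pooled_last = memory[:, -1]           # "last": representation of the final time step
```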

Table 8: The impact of maximum historical observation window.
$n_{past}$ HOTA↑ DetA↑ AssA↑ MOTA↑ IDF1↑
3 51.6 78.2 34.2 89.2 51.2
5 52.3 78.2 35.2 89.2 52.8
10 54.6 78.6 38.1 89.2 54.6
13 54.2 78.3 37.8 89.1 54.1
15 53.3 78.4 36.4 89.1 53.3

Impact of maximum historical observation window. As presented in Tab. 8, very small historical observation windows fail to provide sufficient information, resulting in inaccurate predictions. In contrast, very large historical observation windows introduce a considerable amount of noise, leading to degraded performance. Based on the results, we set the value of $n_{past}$ to 10, as it attains the best performance across all metrics. This finding highlights the importance of selecting an appropriate value for the historical observation window size to achieve optimal performance in object tracking tasks.

5 Conclusion

In this paper, we present MotionTrack, an online tracker with a learnable motion predictor. By relying solely on the object's trajectory information, MotionTrack enhances robustness in challenging scenarios, including nonlinear motion and occlusion. The proposed predictor incorporates two distinct modules to capture information at different levels of granularity, thus enabling efficient modeling of an object's temporal dynamics. Specifically, we employ a Transformer encoder for motion prediction, which is capable of capturing complex motion patterns at the token level. We further introduce DyMLP, an MLP-like architecture, to extract semantic information distributed across different channels. By integrating DyMLP with the self-attention module of the Transformer, our approach leverages dual-granularity information to achieve superior performance. Our experiments illustrate that the proposed method surpasses current state-of-the-art techniques on datasets featuring intricate motion scenarios, and its performance demonstrates resilience in the face of complex motion patterns.

References

  • Wen et al. [2020] L. Wen, D. Du, Z. Cai, Z. Lei, M.-C. Chang, H. Qi, J. Lim, M.-H. Yang, S. Lyu, Ua-detrac: A new benchmark and protocol for multi-object detection and tracking, Computer Vision and Image Understanding 193 (2020) 102907.
  • Yuan et al. [2022] Y. Yuan, U. Iqbal, P. Molchanov, K. Kitani, J. Kautz, Glamr: Global occlusion-aware human mesh recovery with dynamic cameras, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11038–11049.
  • Caesar et al. [2020] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, O. Beijbom, nuscenes: A multimodal dataset for autonomous driving, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11621–11631.
  • Sun et al. [2020] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al., Scalability in perception for autonomous driving: Waymo open dataset, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2446–2454.
  • Martin-Martin et al. [2021] R. Martin-Martin, M. Patel, H. Rezatofighi, A. Shenoi, J. Gwak, E. Frankel, A. Sadeghian, S. Savarese, Jrdb: A dataset and benchmark of egocentric robot visual perception of humans in built environments, IEEE transactions on pattern analysis and machine intelligence (2021).
  • Ren et al. [2015] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, Advances in neural information processing systems 28 (2015).
  • Redmon and Farhadi [2018] J. Redmon, A. Farhadi, Yolov3: An incremental improvement, arXiv preprint arXiv:1804.02767 (2018).
  • Zhou et al. [2019] X. Zhou, D. Wang, P. Krähenbühl, Objects as points, in: arXiv preprint arXiv:1904.07850, 2019.
  • Ge et al. [2021] Z. Ge, S. Liu, F. Wang, Z. Li, J. Sun, Yolox: Exceeding yolo series in 2021, arXiv preprint arXiv:2107.08430 (2021).
  • Yang et al. [2019] K. Yang, D. Li, Y. Dou, Towards precise end-to-end weakly supervised object detection network, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8372–8381.
  • Wojke et al. [2017] N. Wojke, A. Bewley, D. Paulus, Simple online and realtime tracking with a deep association metric, in: 2017 IEEE international conference on image processing (ICIP), IEEE, 2017, pp. 3645–3649.
  • Hermans et al. [2017] A. Hermans, L. Beyer, B. Leibe, In defense of the triplet loss for person re-identification, arXiv preprint arXiv:1703.07737 (2017).
  • Wang et al. [2020] Z. Wang, L. Zheng, Y. Liu, Y. Li, S. Wang, Towards real-time multi-object tracking, in: European Conference on Computer Vision, Springer, 2020, pp. 107–122.
  • Zhang et al. [2021] Y. Zhang, C. Wang, X. Wang, W. Zeng, W. Liu, Fairmot: On the fairness of detection and re-identification in multiple object tracking, International Journal of Computer Vision 129 (2021) 3069–3087.
  • Wang et al. [2021] F. Wang, L. Luo, E. Zhu, Two-stage real-time multi-object tracking with candidate selection, in: International Conference on Multimedia Modeling, Springer, 2021, pp. 49–61.
  • Milan et al. [2016] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, K. Schindler, Mot16: A benchmark for multi-object tracking, arXiv preprint arXiv:1603.00831 (2016).
  • Dendorfer et al. [2020] P. Dendorfer, H. Rezatofighi, A. Milan, J. Shi, D. Cremers, I. Reid, S. Roth, K. Schindler, L. Leal-Taixé, Mot20: A benchmark for multi object tracking in crowded scenes, arXiv preprint arXiv:2003.09003 (2020).
  • Meinhardt et al. [2022] T. Meinhardt, A. Kirillov, L. Leal-Taixe, C. Feichtenhofer, Trackformer: Multi-object tracking with transformers, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • Cui et al. [2023] Y. Cui, C. Zeng, X. Zhao, Y. Yang, G. Wu, L. Wang, Sportsmot: A large multi-object tracking dataset in multiple sports scenes, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 9921–9931.
  • Sun et al. [2022] P. Sun, J. Cao, Y. Jiang, Z. Yuan, S. Bai, K. Kitani, P. Luo, Dancetrack: Multi-object tracking in uniform appearance and diverse motion, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 20993–21002.
  • Bewley et al. [2016] A. Bewley, Z. Ge, L. Ott, F. Ramos, B. Upcroft, Simple online and realtime tracking, in: 2016 IEEE international conference on image processing (ICIP), IEEE, 2016, pp. 3464–3468.
  • Long et al. [2018] C. Long, A. Haizhou, Z. Zijie, S. Chong, Real-time multiple people tracking with deeply learned candidate selection and person re-identification, in: ICME, 2018.
  • Zhang et al. [2022] Y. Zhang, P. Sun, Y. Jiang, D. Yu, F. Weng, Z. Yuan, P. Luo, W. Liu, X. Wang, Bytetrack: Multi-object tracking by associating every detection box, in: European Conference on Computer Vision, Springer, 2022, pp. 1–21.
  • Zhang et al. [2020] J. Zhang, S. Zhou, X. Chang, F. Wan, J. Wang, Y. Wu, D. Huang, Multiple object tracking by flowing and fusing, arXiv preprint arXiv:2001.11180 (2020).
  • Zhou et al. [2020] X. Zhou, V. Koltun, P. Krähenbühl, Tracking objects as points, in: European Conference on Computer Vision, Springer, 2020, pp. 474–490.
  • Saleh et al. [2021] F. Saleh, S. Aliakbarian, H. Rezatofighi, M. Salzmann, S. Gould, Probabilistic tracklet scoring and inpainting for multiple object tracking, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14329–14339.
  • Bergmann et al. [2019] P. Bergmann, T. Meinhardt, L. Leal-Taixe, Tracking without bells and whistles, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 941–951.
  • Kalman et al. [1960] R. E. Kalman, et al., Contributions to the theory of optimal control, Bol. soc. mat. mexicana 5 (1960) 102–119.
  • Chaabane et al. [2021] M. Chaabane, P. Zhang, R. Beveridge, S. O’Hara, Deft: Detection embeddings for tracking, arXiv preprint arXiv:2102.02267 (2021).
  • Luo et al. [2018] W. Luo, B. Yang, R. Urtasun, Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net, in: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2018, pp. 3569–3577.
  • Bai et al. [2018] S. Bai, J. Z. Kolter, V. Koltun, An empirical evaluation of generic convolutional and recurrent networks for sequence modeling, arXiv preprint arXiv:1803.01271 (2018).
  • Cao et al. [2023] J. Cao, J. Pang, X. Weng, R. Khirodkar, K. Kitani, Observation-centric sort: Rethinking sort for robust multi-object tracking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 9686–9696.
  • Zhou et al. [2021] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, W. Zhang, Informer: Beyond efficient transformer for long sequence time-series forecasting, in: Proceedings of the AAAI conference on artificial intelligence, volume 35, 2021, pp. 11106–11115.
  • Vaswani et al. [2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017).
  • Bau et al. [2020] D. Bau, J.-Y. Zhu, H. Strobelt, A. Lapedriza, B. Zhou, A. Torralba, Understanding the role of individual units in a deep neural network, Proceedings of the National Academy of Sciences 117 (2020) 30071–30078.
  • Wu et al. [2021] Z. Wu, D. Lischinski, E. Shechtman, Stylespace analysis: Disentangled controls for stylegan image generation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12863–12872.
  • Li et al. [2023] Z. Li, Z. Rao, L. Pan, Z. Xu, Mts-mixers: Multivariate time series forecasting via factorized temporal and channel mixing, ArXiv abs/2302.04501 (2023).
  • Zhang and Yan [2023] Y. Zhang, J. Yan, Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting, in: The Eleventh International Conference on Learning Representations, 2023.
  • Chen et al. [2023] S.-A. Chen, C.-L. Li, S. O. Arik, N. C. Yoder, T. Pfister, TSMixer: An all-MLP architecture for time series forecasting, Transactions on Machine Learning Research (2023).
  • Roshan Zamir et al. [2012] A. Roshan Zamir, A. Dehghan, M. Shah, Gmcp-tracker: Global multi-object tracking using generalized minimum clique graphs, in: European conference on computer vision, Springer, 2012, pp. 343–356.
  • Wen et al. [2014] L. Wen, W. Li, J. Yan, Z. Lei, D. Yi, S. Z. Li, Multiple target tracking based on undirected hierarchical relation hypergraph, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 1282–1289.
  • Lan et al. [2016] L. Lan, D. Tao, C. Gong, N. Guan, Z. Luo, Online multi-object tracking by quadratic pseudo-boolean optimization., in: IJCAI, 2016, pp. 3396–3402.
  • Zhang and Yang [2021] Y. Zhang, Q. Yang, A survey on multi-task learning, IEEE Transactions on Knowledge and Data Engineering (2021).
  • Milan et al. [2017] A. Milan, S. H. Rezatofighi, A. Dick, I. Reid, K. Schindler, Online multi-target tracking using recurrent neural networks, in: Thirty-First AAAI conference on artificial intelligence, 2017.
  • Wan et al. [2018] X. Wan, J. Wang, S. Zhou, An online and flexible multi-object tracking framework using long short-term memory, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 1230–1238.
  • Sadeghian et al. [2017] A. Sadeghian, A. Alahi, S. Savarese, Tracking the untrackable: Learning to track multiple cues with long-term dependencies, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 300–311.
  • Ran et al. [2019] N. Ran, L. Kong, Y. Wang, Q. Liu, A robust multi-athlete tracking algorithm by exploiting discriminant features and long-term dependencies, in: International Conference on Multimedia Modeling, Springer, 2019, pp. 411–423.
  • Stadler and Beyerer [2021] D. Stadler, J. Beyerer, Improving multiple pedestrian tracking by track management and occlusion handling, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 10958–10967.
  • Han et al. [2022] S. Han, P. Huang, H. Wang, E. Yu, D. Liu, X. Pan, Mat: Motion-aware multi-object tracking, Neurocomputing 476 (2022) 75–86.
  • Dendorfer et al. [2022] P. Dendorfer, V. Yugay, A. Ošep, L. Leal-Taixé, Quo vadis: Is trajectory forecasting the key towards long-term multi-object tracking?, Advances in neural information processing systems (2022).
  • Carion et al. [2020] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with transformers, in: European conference on computer vision, Springer, 2020, pp. 213–229.
  • Zhu et al. [2020] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, J. Dai, Deformable detr: Deformable transformers for end-to-end object detection, ICLR (2020).
  • Wang et al. [2024] J. Wang, C. Lai, Y. Wang, W. Zhang, Emat: Efficient feature fusion network for visual tracking via optimized multi-head attention, Neural Networks 172 (2024) 106110.
  • Cai et al. [2024] H. Cai, L. Lan, J. Zhang, X. Zhang, Y. Zhan, Z. Luo, Iouformer: Pseudo-iou prediction with transformer for visual tracking, Neural Networks 170 (2024) 548–563.
  • Sun et al. [2020] P. Sun, J. Cao, Y. Jiang, R. Zhang, E. Xie, Z. Yuan, C. Wang, P. Luo, Transtrack: Multiple object tracking with transformer, arXiv preprint arXiv:2012.15460 (2020).
  • Xu et al. [2021] Y. Xu, Y. Ban, G. Delorme, C. Gan, D. Rus, X. Alameda-Pineda, Transcenter: Transformers with dense representations for multiple-object tracking, 2021. arXiv:2103.15145.
  • Zeng et al. [2022] F. Zeng, B. Dong, Y. Zhang, T. Wang, X. Zhang, Y. Wei, Motr: End-to-end multiple-object tracking with transformer, in: European Conference on Computer Vision (ECCV), 2022.
  • Zhou et al. [2022] X. Zhou, T. Yin, V. Koltun, P. Krähenbühl, Global tracking transformers, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8771–8780.
  • Yu et al. [2022] E. Yu, Z. Li, S. Han, H. Wang, Relationtrack: Relation-aware multiple object tracking with decoupled representation, IEEE Transactions on Multimedia (2022).
  • Brasó and Leal-Taixé [2020] G. Brasó, L. Leal-Taixé, Learning a neural solver for multiple object tracking, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 6247–6257.
  • Yuan et al. [2021] Y. Yuan, X. Weng, Y. Ou, K. M. Kitani, Agentformer: Agent-aware transformers for socio-temporal multi-agent forecasting, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9813–9823.
  • He et al. [2016] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • Ba et al. [2016] J. L. Ba, J. R. Kiros, G. E. Hinton, Layer normalization, arXiv preprint arXiv:1607.06450 (2016).
  • Girshick [2015] R. Girshick, Fast r-cnn, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
  • Kuhn [1955] H. W. Kuhn, The hungarian method for the assignment problem, Naval research logistics quarterly 2 (1955) 83–97.
  • Luiten et al. [2021] J. Luiten, A. Osep, P. Dendorfer, P. Torr, A. Geiger, L. Leal-Taixé, B. Leibe, Hota: A higher order metric for evaluating multi-object tracking, International journal of computer vision 129 (2021) 548–578.
  • Ristani and Solera [2016] E. Ristani, F. Solera, Performance measures and a data set for multi-target, multi-camera tracking, in: European conference on computer vision, Springer, 2016, pp. 17–35.
  • Ristani et al. [2016] E. Ristani, F. Solera, R. Zou, R. Cucchiara, C. Tomasi, Performance measures and a data set for multi-target, multi-camera tracking, in: Computer Vision–ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part II, Springer, 2016, pp. 17–35.
  • Kingma and Ba [2014] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
  • Pang et al. [2021] J. Pang, L. Qiu, X. Li, H. Chen, Q. Li, T. Darrell, F. Yu, Quasi-dense similarity learning for multiple object tracking, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 164–173.
  • Wu et al. [2021] J. Wu, J. Cao, L. Song, Y. Wang, M. Yang, J. Yuan, Track to detect and segment: An online multi-object tracker, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 12352–12361.