
Siamese-DETR for Generic Multi-Object Tracking

Qiankun Liu, Yichen Li, Yuqi Jiang, and Ying Fu

The authors are with the School of Computer Science and Technology, Beijing Institute of Technology; Email: {liuqk3, liyichen, yqjiang, fuying}@bit.edu.cn. Ying Fu is the corresponding author.
Abstract

The ability to detect and track dynamic objects in different scenes is fundamental to real-world applications, e.g., autonomous driving and robot navigation. However, traditional Multi-Object Tracking (MOT) is limited to tracking objects that belong to pre-defined closed-set categories. Recently, Generic MOT (GMOT) has been proposed to track objects of interest beyond the pre-defined categories, and it can be divided into Open-Vocabulary MOT (OVMOT) and Template-Image-based MOT (TIMOT). Considering that expensive well pre-trained (vision-)language models and fine-grained category annotations are required to train OVMOT models, in this paper we focus on TIMOT and propose a simple but effective method, Siamese-DETR. Only commonly used detection datasets (e.g., COCO) are required for training. Different from existing TIMOT methods, which train a Single Object Tracking (SOT) based detector to detect objects of interest and then apply a data-association-based MOT tracker to obtain trajectories, we leverage the inherent object queries in DETR variants. Specifically: 1) multi-scale object queries are designed based on the given template image, which are effective for detecting objects of different scales that share the category of the template image; 2) a dynamic matching training strategy is introduced to train Siamese-DETR on commonly used detection datasets, taking full advantage of the provided annotations; 3) the online tracking pipeline is simplified to a tracking-by-query manner by incorporating the tracked boxes in the previous frame as additional query boxes, and the complex data association is replaced with much simpler Non-Maximum Suppression (NMS). Extensive experimental results show that Siamese-DETR surpasses existing MOT methods on the GMOT-40 dataset by a large margin.

Index Terms:
multi-object tracking, object detection, Siamese network, DETR.

I Introduction

Multi-Object Tracking (MOT) aims at estimating the locations of objects of interest in a given video while maintaining their identities consistently, and it has various applications, such as autonomous driving, robot navigation, and video surveillance. Benefiting from the advances in object detection, the tracking-by-detection paradigm has become popular for MOT in the past decade. Although great progress has been made, the generalization ability of existing MOT methods is still limited by the pre-defined closed-set categories, such as pedestrian [1, 2, 3, 4, 5] and car [6].

To overcome this drawback of the traditional MOT task, Generic Multi-Object Tracking (GMOT) has recently been introduced to track objects of arbitrary categories. It is based on the assumption that, at test time, we are given descriptions of the objects of interest. According to the type of description, GMOT can be divided into Open-Vocabulary MOT (OVMOT) [7] and Template-Image-based MOT (TIMOT) [8]. OVMOT methods use a text prompt (e.g., a category name) as the description, while TIMOT methods use a template image. Both types of descriptions are flexible and enlarge the closed set of categories to an open set, making multi-object tracking more suitable for real-world applications. However, due to the domain gap between text and images, well pre-trained (vision-)language models (e.g., BERT [9] and CLIP [10]) and fine-grained category annotations are needed to train the detectors in OVMOT methods. Besides the fact that annotating fine-grained category information is laborious and requires expertise, pre-training a (vision-)language model requires a huge amount of training data and computational resources, which makes OVMOT methods expensive to use. Taking this into consideration, we focus on TIMOT in this paper and propose a simple but effective method, Siamese-DETR. Only commonly used detection datasets (e.g., COCO [11]) are required to train the proposed method.

Figure 1: The online tracking pipeline of Siamese-DETR for generic multi-object tracking based on template image. The template image is fed into the backbone network to get the query contents, while the query boxes consist of the learned query boxes and the tracked boxes in the previous frame. With this design, the objects in current frame are tracked by their corresponding boxes, while the missed objects in the previous frame (but still exist in the current frame) or newly appeared objects in the current frame are detected and tracked by the learned query boxes.

Early TIMOT methods [12, 13] learn a Support Vector Machine (SVM) for each object identity through multi-task learning [14] based on hand-crafted features (e.g., HoG [15]). The identities of different objects are naturally maintained since each object is detected and tracked independently by its dedicated SVM. Recently, inspired by the success of the traditional MOT task, where the tracking-by-detection paradigm [12, 13, 8, 16] dominates and achieves appealing performance, the newly proposed TIMOT method [8] also follows the tracking-by-detection paradigm. Specifically, the tracking pipeline is divided into an object detection stage and an object tracking stage: 1) for object detection, a Single Object Tracking (SOT) [17] based detector is designed to detect all objects that share the same category as the template image; since there is no training data provided for the TIMOT task [8], SOT datasets (e.g., LaSOT [18] and GOT-10K [19]) and an object detection dataset (e.g., COCO [11]) are used to train the detector; 2) for object tracking, existing MOT trackers (e.g., SORT [20], DeepSORT [3], IOU [21], etc.) are directly utilized as data association algorithms to obtain the trajectories of different objects. Unfortunately, the tracking performance is still moderate even though the pipeline is of high complexity. The TIMOT task thus still needs to be studied to achieve better overall tracking performance with a simpler tracking pipeline.

In this paper, we leverage the inherent object queries in DETR variants [22, 23, 24, 25, 26] and propose a simple but effective method, Siamese-DETR. As shown in Fig. 1, the object queries contain the information of the template image for detection and of the tracked boxes for tracking. Although Siamese-DETR follows the tracking-by-detection paradigm, detection and tracking are performed simultaneously. The complex data association procedure is replaced by a much simpler Non-Maximum Suppression (NMS) step that removes duplicated boxes. Compared with existing methods, Siamese-DETR detects objects of interest more effectively and tracks them with a simpler pipeline.

To detect objects of interest with the given template image effectively, the Multi-Scale Object Queries (MSOQ) and the Dynamic Matching Training Strategy (DMTS) are designed: 1) Multi-scale object queries. The decoupled object query [25], which consists of a query content and a query box, is adopted; the query content is obtained from the template image while the query box is learned during training. In detail, we feed the template image into the backbone network of the detector to get hierarchical multi-scale features and map each scale into a query content. The multi-scale query contents (e.g., 4 scales) are equally replicated to match the number of learned query boxes (e.g., 600). Since features at different scales are sensitive to objects of different scales, Siamese-DETR effectively detects objects of various scales that share the category of the template image. 2) Dynamic matching training strategy. Given a training image in a commonly used detection dataset (e.g., COCO [11]), all of its annotations are utilized, and possibly more than once. Specifically, the objects that share the same category as the template image are treated as positive samples while the others are treated as negative samples. The introduction of negative samples takes full advantage of the provided annotations. By sampling more than one template image for each training image, the annotations are dynamically reused, which benefits Siamese-DETR further.

To track objects simply, we propose a Tracking-by-Query (TbQ) strategy. The tracked boxes are used as additional query boxes and query denoising is optimized to adapt it to TIMOT: 1) Tracked boxes as additional query boxes. The tracked boxes in the previous frame are paired with the query contents to serve as additional object queries. Object queries with tracked boxes and with learned boxes are responsible for tracking and detection, respectively and independently. Simple NMS is utilized to remove detected boxes that duplicate tracked boxes. 2) Optimized query denoising. Since there is no video training data for TIMOT [8], the query denoising [24] strategy is adapted from common object detection to mimic tracking scenarios on static images. Experimental results demonstrate that Siamese-DETR surpasses existing MOT methods on GMOT-40 [8] by a large margin. In summary, the contributions of this work are as follows:

  • We propose Siamese-DETR for template-image-based multi-object tracking and introduce multi-scale object queries to effectively detect different scales of objects that share the same category with the template image.

  • We introduce a dynamic matching training strategy for Siamese-DETR, enabling the training on commonly used detection datasets effectively.

  • We design a simple online tracking strategy by incorporating the tracked boxes as additional query boxes. Objects are tracked in a tracking-by-query manner.

The remainder of this paper is organized as follows: Section II firstly reviews the related works about object tracking and DETR variants. Next, the details of the proposed method are illustrated in Section III. Then, we provide the implementation details and compare the proposed method with existing methods in Section IV. Finally, we provide the conclusion and the discussions on the limitations of the proposed method in Section V.

II Related Work

This section briefly reviews related works from different aspects, including multi-object tracking, template-image-based multi-object tracking, open-vocabulary multi-object tracking and DETR variants.

II-A Multi-Object Tracking

In the past decade, Multi-Object Tracking (MOT) has emerged as a popular research area and has been dominated by the tracking-by-detection paradigm. Existing tracking-by-detection methods involve object detection and data association stages, and can be divided into offline and online methods. Offline methods [27, 28, 29, 30, 31, 32] process the video in a batch manner and can even utilize the information of the whole video to better handle the data association problem. In contrast, online methods [20, 33, 1, 2, 3, 4, 5, 34] process the video frame by frame and generate trajectories using only the information up to the current frame, which is more suitable for causal applications than offline processing.

Traditional MOT methods mainly focus on the data association problem, using the Hungarian algorithm, network flow [27, 28, 35], and graph multicut [29, 30]. Among them, all but the Hungarian algorithm can only be performed in an offline manner. In recent years, with the advancement of deep learning and object detection, online tracking has attracted more and more attention. In contrast to offline methods, online methods usually adopt the Hungarian algorithm for data association, but focus on the joint learning of object detection and some useful priors, such as object motion [1, 6, 36, 37], appearance features [4, 38, 39], occlusion maps [38], object relations [40], and so on. However, besides box and category ID annotations, extra annotations are required to learn these priors, e.g., object identities for appearance feature learning.

Though great progress has been made in MOT, most existing methods are designed to track objects that are limited to a small pre-defined closed set of categories, e.g., car and pedestrian. In this paper, we focus on generic multi-object tracking, which extends the closed set of categories in MOT to an open set of generic categories that is not limited to several specific ones.

II-B Template-Image-based Multi-Object Tracking

The Template-Image-based Multi-Object Tracking (TIMOT) task was introduced to address the generalization issue of MOT about ten years ago [12, 13]. Similar to MOT, TIMOT follows the tracking-by-detection paradigm [12, 13, 8]. However, TIMOT has received much less attention than MOT, mainly because data suitable for TIMOT is scarce. Recently, GMOT-40 [8] has been developed as a public dataset for the evaluation of TIMOT. Nevertheless, no well-annotated training data is available for TIMOT.

Early methods [12, 13] track generic objects based on Support Vector Machines (SVMs) and hand-crafted features. Each object is detected and tracked by a dedicated SVM, which is initialized with the given template image and updated online while tracking. Recently proposed methods [8, 41] first detect all objects that share the same category as the template image with a Single Object Tracking (SOT) based detector (specifically, GlobalTrack [17]), and then apply online data-association MOT trackers (e.g., SORT [20], DeepSORT [3], IOU [21], etc.) to obtain the trajectories. To mitigate the gap between SOT and TIMOT, single object tracking datasets (LaSOT [18], GOT-10K [19]) and an object detection dataset (COCO [11]) are used to train the SOT-based detector. However, the tracking performance is far from satisfactory.

Different from the existing TIMOT method [8], which detects objects with an SOT tracker [17], we leverage the object queries in DETR variants for both object detection and tracking. With the proper query design and training strategy, our method distinguishes objects of interest from others effectively. In addition, the tracking pipeline is simplified by incorporating the tracked boxes into the object queries.

II-C Open-Vocabulary Multi-Object Tracking

With the recent development of language [9, 42] and vision-language models [10], the Open-Vocabulary Multi-Object Tracking (OVMOT) [7] is proposed to track objects that belong to arbitrary categories.

Similar to traditional MOT and TIMOT, OVMOT also follows the tracking-by-detection pipeline, where open-vocabulary object detection plays a key role. OVTrack [7] localizes objects of interest with a class-agnostic R-CNN (i.e., Faster R-CNN [43]) and a vision-language model (i.e., CLIP [10]). Specifically, all objects, including those that are not of interest, are detected by the R-CNN, whose object features are aligned with the counterparts extracted by the image encoder of CLIP through knowledge distillation. The objects of interest are then selected by comparing the similarity between object features and the text features extracted by the text encoder of CLIP. Finally, a data association procedure is adopted to link objects across adjacent frames. Similarly, GLIP [44] detects objects using DyHead [45] and BERT [9]; the text and image features are aligned by iteratively fusing them in several successive blocks rather than by supervising the model with knowledge distillation. Though OVMOT trackers can track objects of arbitrary categories, expensive well pre-trained language or vision-language models are required to bridge the domain gap between text and images. In addition, laborious fine-grained category annotations are needed to help the model recognize the correct objects given different text descriptions. For example, OVTrack is trained on LVIS [46] with 1200+ annotated categories, and GLIP is trained on Objects365 [47] with 365 categories (further increased to 1300+ by the authors).

Compared with the aforementioned methods, the proposed Siamese-DETR requires neither an expensive pre-trained (vision-)language model nor laborious fine-grained category annotations, and it achieves better performance when only the COCO [11] dataset (with 80 categories) is used for training.

II-D DETR Variants

DETR [48] is the first end-to-end object detector. Its key ideas are the object query and the Hungarian loss. The anchor boxes and NMS components are removed, which significantly reduces the complexity of detectors. However, DETR suffers from slow convergence, and many works have been proposed to address this issue.

Deformable-DETR [23] replaces the common attention with deformable attention, which reduces the computational cost and makes it possible to use multi-scale features for object detection. DAB-DETR [25] decouples the object queries into learnable contents and learnable boxes, where the query boxes are iteratively updated at each decoder layer. DN-DETR [24] finds that the slow convergence of DETR is mainly caused by the unstable matching between object queries and ground-truth boxes, and introduces a denoising training approach to accelerate convergence: besides the object queries, noisy ground-truth boxes and labels are additionally fed into the decoder, which improves the model's ability of box regression and classification. Once training is finished, the object queries in the aforementioned DETR variants are fixed; all images share the same queries, which cannot be dynamically updated according to the input image. To solve this, DINO [22] proposes a mixed query selection mechanism, where the query boxes are dynamically selected based on the image features.

In this paper, multi-scale object queries are designed, which contain the information of the template image. The tracked boxes are further used as additional query boxes. Object detection and tracking are performed simultaneously.

Figure 2: Overview of Siamese-DETR in the training stage. The multi-scale object queries are decoupled into learnable query boxes and query contents. The query contents are mapped from the multi-scale features extracted from the template image by the backbone network. The model is trained with Hungarian loss [48] and the proposed dynamic matching training strategy which turns the provided annotations into positive and negative samples dynamically according to the given template image. For simplicity, the optimized query denoising is not presented in the figure.

III Methodology

In this section, we first present the overall architecture of Siamese-DETR. Then, we introduce the multi-scale object queries for the detection of objects that share the same category with the template image. Next, we introduce the dynamic matching training strategy that trains Siamese-DETR on commonly used detection datasets. Furthermore, we show how to apply Siamese-DETR to online tracking straightforwardly and simply in a tracking-by-query manner. Finally, the training details of Siamese-DETR are presented.

III-A Overview

The overview of the proposed Siamese-DETR in the training stage is shown in Fig. 2. Siamese-DETR contains a backbone network (e.g., Swin Transformer [49]), a transformer (including the encoder and the decoder) [42, 50], a detection head for classification and box regression (the same with DINO [22]), and a set of object queries. In order to detect objects of different scales that share the same category with the template image, the Multi-Scale Object Queries (MSOQ, Section III-B) are generated based on the template image. Since no well-annotated training data is available for TIMOT, we design a Dynamic Matching Training Strategy (DMTS, Section III-C), which supports the training of Siamese-DETR on commonly used detection datasets (e.g., COCO [11]). The provided annotations are fully utilized more than once when multiple template images are provided for training. During the inference stage, objects are tracked in a Track-by-Query (TbQ, Section III-D) manner, as shown in Fig. 1. The tracked boxes in the previous frame are used as additional query boxes to track corresponding objects. A simple NMS operation, rather than the complex data association algorithm, is adopted to remove some duplicated boxes. To make Siamese-DETR compatible with such a tracking strategy, the query denoising [24] is adopted and optimized to train Siamese-DETR.

III-B Multi-scale Object Queries

Following previous works [25, 22, 24, 51], we use decoupled object queries. Formally, let $Q=\{q_n \mid q_n=(\mathbf{q}_{c_n},\mathbf{q}_{b_n}),\ n=0,1,\dots,N-1\}$ be the set of object queries, where $N$ is the number of queries. For each query $q_n=(\mathbf{q}_{c_n},\mathbf{q}_{b_n})$, the query content $\mathbf{q}_{c_n}\in\mathbb{R}^{D}$ is a feature vector with dimensionality $D$, and the query box $\mathbf{q}_{b_n}\in\mathbb{R}^{4}$ is represented by its center coordinates, width, and height. In DETR variants [23, 24, 25, 22], the query contents are usually a set of parameters learned by the model. Such a design works for object detection with closed-set categories, but it is not suitable for generic object detection/tracking, where the category of the template image provided at inference time may be unseen during training.

In order to detect all objects that share the same category as the template image, we derive the query contents from the template image; more specifically, the features extracted from the template image are used as the query contents. Our hypothesis is that the query contents store the semantic information of objects, e.g., intra-category common ground, which is vital for object detection. On the other hand, objects in the same scene vary a lot in scale even if they share the same category. To handle this, multi-scale features are extracted from the template image and used as multi-scale query contents. Formally, let $F=\{\mathbf{f}_s \mid s=0,1,\dots,S-1\}$ be the set of multi-scale feature maps extracted by the backbone network (e.g., Swin Transformer [49]) from the given template image. We first obtain the feature vectors $\hat{F}=\{\hat{\mathbf{f}}_s \mid s=0,1,\dots,S-1\}$ by spatially average pooling the feature maps:

\hat{\mathbf{f}}_s = {\rm AvgPool}(\mathbf{f}_s). \qquad (1)

For the $n$-th object query, its content $\mathbf{q}_{c_n}$ is determined by:

\mathbf{q}_{c_n} = \hat{\mathbf{f}}_{n \bmod S}, \qquad (2)

where $n \bmod S$ is the index into $\hat{F}$. As for the query boxes, they are a set of learnable parameters optimized in the training stage following previous works [25, 24], which means that different template images share the same query boxes.
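Below is a minimal PyTorch sketch of how the multi-scale query contents can be built from the template features following Eqs. (1)-(2); the function and variable names are illustrative assumptions rather than the authors' released code.

```python
import torch

def build_query_contents(template_feats, num_queries):
    """template_feats: list of S feature maps from the template image,
    each of shape (1, D, H_s, W_s)."""
    # Eq. (1): spatial average pooling gives one D-dim vector per scale.
    pooled = [f.mean(dim=(2, 3)).squeeze(0) for f in template_feats]
    S = len(pooled)
    # Eq. (2): the n-th query content is the (n mod S)-th pooled feature.
    return torch.stack([pooled[n % S] for n in range(num_queries)], dim=0)  # (N, D)

# The query boxes stay a single learned set shared by all template images,
# e.g. N = 600 boxes in (cx, cy, w, h) form.
learned_query_boxes = torch.nn.Parameter(torch.rand(600, 4))
```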

III-C Dynamic Matching Training Strategy

Different from traditional MOT, where well-annotated training data is provided, there is no well-annotated training data available for TIMOT [8]. The common practice is to train the model on external datasets and then test it on the evaluation benchmark (i.e., GMOT-40 [8]). The existing method [8] uses multiple datasets to train its detector, including LaSOT [18], GOT-10K [19], and COCO [11]; however, the detection/tracking performance is far from satisfactory. In this paper, we design the Dynamic Matching Training Strategy (DMTS) to train Siamese-DETR on commonly used detection datasets. As will be seen in Section IV, Siamese-DETR surpasses the existing method [8] by a large margin in terms of both detection and tracking, even when trained only on COCO [11]. The superiority of the dynamic matching training strategy comes from two aspects: 1) utilizing all annotations even if they belong to different categories; 2) utilizing all annotations more than once in each training step.

III-C1 Utilizing All Annotations

Let $A=\{a_k \mid a_k=(\mathbf{b}_k, c_k),\ k=0,1,\dots,K-1\}$ be the set of annotations for the input training image, where $\mathbf{b}_k\in\mathbb{R}^{4}$ and $c_k\in\mathbb{Z}$ are the bounding box and category ID of the $k$-th object. We randomly sample a category ID from $\{c_0, c_1,\dots,c_{K-1}\}$, which is used as the category of the template image and denoted as $\hat{c}_t$. Given the category ID $\hat{c}_t$, the template image is cropped from another image in the training split. The corresponding annotations for the given template image and the category ID $\hat{c}_t$ are:

A^{\hat{c}_t} = \{a^{\hat{c}_t}_k \mid a^{\hat{c}_t}_k = (\mathbf{b}_k, \mathbbm{1}_{c_k,\hat{c}_t}),\ k=0,1,\dots,K-1\}, \qquad (3)

where:

\mathbbm{1}_{c_k,\hat{c}_t} = \begin{cases} 1 & \text{if } c_k = \hat{c}_t, \\ 0 & \text{otherwise}. \end{cases} \qquad (4)

As we can see, the boxes in $A^{\hat{c}_t}$ are divided into positive ($c_k=\hat{c}_t$) and negative ($c_k\neq\hat{c}_t$) samples, resulting in a two-category object detection task, as shown in Fig. 2. There is also a more naive setting that results in a single-category detection task: keeping only the boxes that share the same category as the template image and discarding the others, so that only positive samples are utilized. Though the latter seems more intuitive, it is not a good choice: it yields poorer detection performance because no negative samples are used during training, which weakens the model's ability to distinguish objects of interest from others.
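A sketch of this relabeling step is given below; the helper is a hypothetical illustration (the cropping of the template image from another training image is omitted), assuming the boxes and category IDs of one image are given as plain Python lists.

```python
import random

def relabel_for_template(boxes, category_ids):
    """boxes: list of [cx, cy, w, h]; category_ids: list of ints for one image."""
    template_cat = random.choice(category_ids)        # sampled category \hat{c}_t
    # Eq. (4): positive if the annotation has the template's category, else negative.
    binary_labels = [1 if c == template_cat else 0 for c in category_ids]
    # All K boxes are kept; the negatives are what teach the model to reject
    # objects of other categories.
    return template_cat, list(zip(boxes, binary_labels))
```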

III-C2 Utilizing All Annotations More Than Once

Considering that the template image and the multi-scale object queries take up only a small proportion of device memory compared with the input image and the whole model, we can provide multiple template images during training.

Let $\{\hat{c}_0, \hat{c}_1,\dots,\hat{c}_{T-1}\}$ be the randomly sampled category IDs of $T$ different template images. For each template image with category $\hat{c}_t$, the obtained multi-scale object queries are denoted as $Q^{\hat{c}_t}$ and are associated with the annotations $A^{\hat{c}_t}$. The $T$ groups of object queries $\{Q^{\hat{c}_0}, Q^{\hat{c}_1},\dots,Q^{\hat{c}_{T-1}}\}$ contain $N\times T$ object queries in total, which are concatenated and fed into the transformer for object detection within a single forward pass. Note that interactions between object queries are only allowed within each group; object queries in different groups cannot see each other. This can be implemented simply by providing an attention mask to the self-attention layers in the transformer decoder.
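The group-wise isolation can be realized with a block-diagonal attention mask, as in the sketch below. It assumes a decoder built on PyTorch-style attention, where True entries of the boolean mask are blocked.

```python
import torch

def build_group_mask(num_groups, queries_per_group):
    """Block attention across query groups; allow it within each group."""
    total = num_groups * queries_per_group
    mask = torch.ones(total, total, dtype=torch.bool)   # True = blocked
    for g in range(num_groups):
        s, e = g * queries_per_group, (g + 1) * queries_per_group
        mask[s:e, s:e] = False                           # within-group attention allowed
    return mask

# e.g., T = 7 template images with N = 600 queries each -> 4200 x 4200 mask
attn_mask = build_group_mask(num_groups=7, queries_per_group=600)
```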

III-D Tracking-by-Query

The common tracking-by-detection paradigm usually contains two stages: 1) object detection, where the objects of interest are first detected by the detector; and 2) data association, where the trajectories are obtained by matching objects across frames. However, performing data association properly is non-trivial, since it involves computing an affinity matrix between objects, setting an affinity threshold that prevents wrong associations, and so on. In this paper, we instead use the tracked boxes in the previous frame as additional query boxes to track the corresponding objects, and we optimize query denoising to mimic tracking scenarios on static images.

III-D1 Tracked Boxes as Additional Query Boxes

Let $B=\{\hat{\mathbf{b}}_m \mid m=0,1,\dots,M-1\}$ be the set of tracked boxes in the previous frame. We construct additional object queries as follows:

\hat{Q} = \hat{Q}_0 \cup \hat{Q}_1 \cup \dots \cup \hat{Q}_{S-1}, \qquad (5)

where $\hat{Q}_s$ is the subset of additional object queries constructed for the $s$-th scale:

\hat{Q}_s = \{\hat{q}_{s,m} \mid \hat{q}_{s,m} = (\hat{\mathbf{f}}_s, \hat{\mathbf{b}}_m),\ m=0,1,\dots,M-1\}. \qquad (6)

Different from the object queries $Q$, which are responsible for object detection, $\hat{Q}$ is used for object tracking. The idea behind this design is that the category-aware information conveyed by the template features is embedded into the query contents, while the tracked boxes in the previous frame are close enough to the corresponding objects in the current frame. While tracking online, the object queries in $Q$ and $\hat{Q}$ are concatenated and fed into the transformer together; they detect and track objects simultaneously but independently. For object tracking, different object instances are distinguished by their corresponding tracked boxes. To avoid interactions between the object queries in $Q$ and $\hat{Q}$, an attention mask is provided to each self-attention layer in the transformer decoder.

For each box $\hat{\mathbf{b}}_m$, $S$ tracked boxes are produced by the object queries $\{\hat{q}_{0,m}, \hat{q}_{1,m},\dots,\hat{q}_{S-1,m}\}$. Among these $S$ boxes, the one with the largest Intersection over Union (IoU) with $\hat{\mathbf{b}}_m$ is selected. If the classification score of the selected box is higher than the predefined confidence threshold, it is kept as the tracking result for $\hat{\mathbf{b}}_m$; otherwise, the corresponding object is treated as disappeared. Since the object queries in $Q$ and $\hat{Q}$ detect and track objects independently, the detection boxes from queries in $Q$ may duplicate those from queries in $\hat{Q}$. Following the MOT method Tracktor [1], an NMS operation is used to remove the duplicated detection boxes. The remaining detection boxes whose classification scores exceed the predefined confidence threshold are treated as newly appeared objects.
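One online tracking step can be sketched as below. The siamese_detr call signature, its output shapes, and the corner-format boxes are assumptions made for illustration; the per-box scale selection and confidence test follow the description above, and a simple IoU test against the kept tracks stands in for the NMS-based duplicate removal.

```python
import torch
from torchvision.ops import box_iou

def track_step(siamese_detr, frame, query_contents, learned_boxes, tracked_boxes,
               conf_thr=0.4, dup_iou_thr=0.5):
    """query_contents: (S, D) pooled template features; learned_boxes and
    tracked_boxes: boxes in (x1, y1, x2, y2) format."""
    S, M = query_contents.shape[0], tracked_boxes.shape[0]
    # Eqs. (5)-(6): pair every tracked box with each of the S query contents.
    track_contents = query_contents.repeat_interleave(M, dim=0)   # (S*M, D)
    track_boxes = tracked_boxes.repeat(S, 1)                      # (S*M, 4)
    # Detection and tracking queries are decoded together; the attention mask
    # keeping them independent is omitted here.
    det_boxes, det_scores, trk_boxes, trk_scores = siamese_detr(
        frame, query_contents, learned_boxes, track_contents, track_boxes)

    # For each previous box, keep the scale whose output overlaps it most.
    trk_boxes, trk_scores = trk_boxes.view(S, M, 4), trk_scores.view(S, M)
    kept = []
    for m in range(M):
        ious = box_iou(trk_boxes[:, m], tracked_boxes[m:m + 1]).squeeze(1)  # (S,)
        s = int(ious.argmax())
        if trk_scores[s, m] > conf_thr:        # below threshold -> object disappeared
            kept.append(trk_boxes[s, m])
    tracks = torch.stack(kept) if kept else tracked_boxes.new_zeros(0, 4)

    # New objects: confident detections that do not duplicate a tracked box.
    new_boxes = det_boxes[det_scores > conf_thr]
    if len(tracks) and len(new_boxes):
        overlap = box_iou(new_boxes, tracks).max(dim=1).values
        new_boxes = new_boxes[overlap < dup_iou_thr]
    return tracks, new_boxes
```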

Figure 3: Illustration of query denoising. (a) Input image and template image. (b) Original query denoising [24], which conflicts with TIMOT: the noisy object queries are classified according to the labeled category IDs associated with the query boxes, without taking the noisy query contents into consideration. (c) Optimized query denoising: the noisy object queries are classified according to the matching results between the query contents and the noisy query boxes. The numbers 1 and 0 denote that the model tries to classify the object queries as positive and negative samples, while the markers ✗ and ✓ indicate whether the classification is wrong or right.

Figure 4: Qualitative comparison of different methods: (a) Siamese-DETR + TbQ; (b) Siamese-DETR + SORT [20]; (c) GLIP-T (B) [44] + TbQ; (d) GLIP-T (B) [44] + SORT [20]. Two points can be drawn: 1) when combined with the same tracker (e.g., TbQ), our Siamese-DETR tracks more objects than GLIP-T (B) [44]; 2) based on the same detector, our TbQ pipeline also tracks more objects than SORT [20]. The Siamese-DETR trained on COCO [11] with Swin-T [49] as the backbone network is evaluated.

III-D2 Optimized Query Denoising

Since there is no well-annotated video data for training, a model trained on static images finds it hard to track objects with tracked boxes during online tracking. In addition, the number of objects of interest varies from frame to frame, which introduces further challenges. To handle this, we add noise to the ground-truth boxes and use them as the tracked boxes (i.e., additional query boxes) during training. The objects are tracked by their corresponding noisy query boxes, and Siamese-DETR is thereby trained to handle a varying number of objects. With this mimicked tracking scenario in the training stage, Siamese-DETR is able to track objects with the proposed tracking-by-query strategy. As in the online tracking stage, the query groups for detection and tracking are kept independent of each other during training.

The aforementioned strategy is similar to query denoising [24], which is commonly used in DETR variants. However, we find that existing query denoising is not suitable for the TIMOT task, because besides the box noise in the query boxes, there is also category noise in the query contents. Specifically, the category ID of a ground-truth box is randomly switched to another category ID; for each category ID, an embedding vector is learned by the detector and used as the noisy query content. During training, the noisy object queries are classified based on the labeled category IDs associated with the query boxes, without taking the query contents into consideration. This raises two conflicts when applied to Siamese-DETR: 1) Siamese-DETR performs a two-category detection task, so learning two embedding vectors for positive and negative samples is not suitable, since which samples are positive or negative changes dynamically with the provided template image (i.e., the query contents); 2) a positive noisy query box may be paired with a negative query content yet still be classified as a positive sample in the original query denoising [24], whereas in Siamese-DETR an object query is positive only when its query content and query box match each other.

To avoid these conflicts, we optimize query denoising by pairing all noisy query boxes with positive query contents (i.e., the features extracted from the template image). With this optimized query denoising, noisy object queries are classified according to the matching results between query contents and query boxes. The difference between the original query denoising [24] and the optimized version is illustrated in Fig. 3.
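A sketch of how the optimized denoising queries can be constructed is shown below. The uniform box-jitter noise model is an assumption; the essential point is that every noisy box is paired with the positive template content, and its target is the binary label of the underlying ground-truth box.

```python
import torch

def make_denoising_queries(gt_boxes, gt_binary_labels, template_content, box_noise=0.2):
    """gt_boxes: (K, 4) as (cx, cy, w, h) in [0, 1]; gt_binary_labels: (K,) in {0, 1};
    template_content: (D,) pooled feature of the template image."""
    # Jitter each box proportionally to its own width/height (assumed noise model).
    scale = gt_boxes[:, 2:].repeat(1, 2)                       # (w, h, w, h)
    noise = (torch.rand_like(gt_boxes) * 2 - 1) * box_noise
    noisy_boxes = (gt_boxes + noise * scale).clamp(0, 1)
    # Every noisy box shares the positive query content from the template image.
    contents = template_content.unsqueeze(0).expand(gt_boxes.shape[0], -1)
    # Target: positive only when content and box match, i.e. the underlying
    # ground-truth object has the template's category.
    targets = gt_binary_labels.float()
    return contents, noisy_boxes, targets
```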

III-E Training of Siamese-DETR

The proposed Siamese-DETR is trained on a commonly used detection dataset, i.e., COCO [11]. Given a training image, we first randomly sample $T$ category IDs from its annotations. For each category ID, we then randomly sample a box of that category from the annotations of another image, which is itself randomly sampled from the training split. Finally, the template images are cropped from the sampled images with the sampled boxes. For the $t$-th category ID $\hat{c}_t$, besides the object queries $Q^{\hat{c}_t}$, we also construct the additional object queries $\hat{Q}^{\hat{c}_t} = \hat{Q}^{\hat{c}_t}_0 \cup \hat{Q}^{\hat{c}_t}_1 \cup \dots \cup \hat{Q}^{\hat{c}_t}_{S-1}$ with the noisy ground-truth boxes to mimic tracking scenarios. The overall training loss is:

\mathcal{L} = \frac{1}{T}\sum_{t=0}^{T-1}\Big(\mathcal{L}_{\rm H}(A^{\hat{c}_t}, \hat{A}^{Q^{\hat{c}_t}}) + \frac{1}{S}\sum_{s=0}^{S-1}\mathcal{L}_{\rm R}(A^{\hat{c}_t}, \hat{A}^{\hat{Q}^{\hat{c}_t}_s})\Big), \qquad (7)

where $\hat{A}^{Q^{\hat{c}_t}}$ and $\hat{A}^{\hat{Q}^{\hat{c}_t}_s}$ denote the predictions from the object queries in $Q^{\hat{c}_t}$ and $\hat{Q}^{\hat{c}_t}_s$, and $\mathcal{L}_{\rm H}(\cdot,\cdot)$ and $\mathcal{L}_{\rm R}(\cdot,\cdot)$ are the Hungarian loss [48] and the reconstruction loss [24], respectively. Note that the noise-free ground-truth boxes are used when computing the loss.
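The loss of Eq. (7) can be assembled as in the sketch below, where hungarian_loss and reconstruction_loss stand for the Hungarian loss of [48] and the denoising reconstruction loss of [24]; their internals are not reproduced here.

```python
def total_loss(det_preds, dn_preds, targets, hungarian_loss, reconstruction_loss):
    """det_preds[t]: predictions of query group Q^{c_t};
    dn_preds[t][s]: predictions of the denoising group at scale s;
    targets[t]: the relabelled annotations A^{c_t} (noise-free boxes)."""
    T = len(det_preds)
    loss = 0.0
    for t in range(T):
        l_t = hungarian_loss(det_preds[t], targets[t])          # L_H term
        S = len(dn_preds[t])
        l_t = l_t + sum(reconstruction_loss(dn_preds[t][s], targets[t])
                        for s in range(S)) / S                  # averaged L_R terms
        loss = loss + l_t
    return loss / T                                              # average over templates
```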

IV Experiments

In this section, the involved datasets and metrics are firstly introduced, followed by the implementation details. Then, we compare Siamese-DETR with existing multi-object tracking methods. Finally, some discussions are provided to show the effectiveness of different components.

IV-A Datasets and Metrics

We follow the setup of the template-image-based MOT task [8] to compare the proposed Siamese-DETR with other methods. Specifically, a tracker is tested on all videos in the GMOT-40 benchmark [8] and can be trained on any other dataset except GMOT-40. It is worth noting that template-image-based MOT trackers not only need to track all objects of the same category as the template image in the video but also need to maintain the identity of each object. Because each video in GMOT-40 contains only one category, the Urban Tracker [52] dataset is used to analyze the capability of Siamese-DETR in multi-category multi-object tracking. In this work, the commonly used object detection dataset COCO [11] is mainly used to train Siamese-DETR; other datasets, such as LVIS [46] and Objects365 [47], are used for more detailed analysis. Though the categories in GMOT-40 and Urban Tracker may be seen in the training datasets (e.g., COCO, LVIS, Objects365), this complies with the setup of TIMOT [8].

GMOT-40 contains 40 videos that consist of 10 different object categories with 4 videos for each category. The entire dataset contains 9.6K frames, where 85.28% of them contain more than 10 objects. The videos are shot with FPS ranging from 24 to 30. All the videos are used for evaluation.

Urban Tracker has 4 outdoor videos, each containing more than one category, resulting in 4 categories in total. The videos are captured at 25 or 30 FPS. All 4 outdoor videos are used for evaluation.

COCO is widely used for object detection. It contains a total number of 118K images and 860K annotated instances for training. There are 80 different categories in COCO, such as person, car, dog, and so on. Note that only the category IDs and bounding boxes are used in this work.

LVIS shares nearly the same images with COCO but provides more fine-grained annotations. It contains 100K images and 1.27M instances for training. The number of annotated categories is 1203, providing more fine-grained category annotations than COCO. Note that the used bounding boxes are obtained from the annotated instance-level masks since there are no annotated bounding boxes in LVIS.

Objects365 contains 0.6M images and 8.54M instances for training. The number of annotated categories is 365. We use Objects365 to show that more training data can boost the performance of Siamese-DETR.

We adopt the standard multi-object tracking metrics for evaluation, including Multi-Object Tracking Accuracy (MOTA) [53], IDentity F1 Score (IDF1), Mostly Tracked objects (MT), Mostly Lost objects (ML), number of False Positives (FP), number of False Negatives (FN), and number of Identity Switches (IDSw) [54]. Some other metrics, including mean Average Precision at an IoU threshold of 0.5 (mAP@0.5) and mean Average Recall (mAR), are also adopted for the evaluation of object detection.

TABLE I: The details of compared detection methods. Note that the extra costs of FLOPs and inference time in the processing of the template image or text prompt are not included since they only need to be processed in the first frame and the costs are negligible when the number of frames is large enough. Time consumption is evaluated on a workstation with a 3.9GHz CPU and an RTX 3090 GPU. The best results are shown in bold with underline.
Detection Methods Backbone Neck Backbone Pre-training Dataset (Vision-)Language Model Inference Resolution Parameters (M) FLOPs (G) Inference Time (ms)
YOLOv5l6 [55] CSP-DarkNet53 [56] SPPF [57], CSP-PAN [58] ✗ ✗ 1280×1280 76.8 112.4 16.7
DINO [22] ResNet50 [59] DETR’s Neck [48] ImageNet [60] ✗ 800×1200 45.2 261.9 87.5
Conditional DETR [61] ResNet50 [59] DETR’s Neck [48] ImageNet [60] ✗ 800×1200 43.2 88.9 44.4
OVTrack [7] ResNet50 [59] Faster RCNN’s Neck [43] ImageNet [60] CLIP [10] 800×1333 67.6 191.2 57.5
GLIP-T (B) [44] Swin-Tiny [49] DyHead’s Neck [45] ImageNet [60] BERT [9] 800×1333 195.2 322.4 158.7
GlobalTrack [17] ResNet50 [59] Faster RCNN’s Neck [43] ImageNet [60] ✗ 800×1333 41.3 169.9 53.7
Siamese-DETR (Ours, Swin-T) Swin-Tiny [49] DETR’s Neck [48] ImageNet [60] ✗ 800×1200 47.6 267.3 86.6
Siamese-DETR (Ours, Swin-B) Swin-Base [49] DETR’s Neck [48] ImageNet [60] ✗ 800×1200 108.2 542.7 139.9
TABLE II: The details of evaluated tracking methods. The inference times of different methods are evaluated based on the detection results of Siamese-DETR (Swin-T). Time consumption is evaluated on a workstation with a 3.9GHz CPU and an RTX 3090 GPU, and the detection time consumption is excluded. The best results are shown in bold with underline.
Tracking Methods Online Kalman Filter Appearance Cues Hierarchical Matching Inference Time (ms)
IOU [21] ✗ ✗ ✗ ✗ <0.01
SORT [20] ✓ ✓ ✗ ✗ 0.1
DeepSORT [3] ✓ ✓ ✓ ✓ 87.3
ByteTrack [5] ✓ ✓ ✓ ✓ 33.7
BoT-SORT [62] ✓ ✓ ✗ ✓ 9.5
TbQ (Ours, Swin-T) ✓ ✗ ✗ ✗ 2.2
TABLE III: Comparison with different methods on the GMOT-40 benchmark. Besides the TIMOT methods, methods with closed-set detectors and open-vocabulary detectors are also evaluated. Since the different methods follow the tracking-by-detection pipeline, we divide each of them into the combination of a detector and a tracker. When combining our TbQ tracking pipeline with other detectors, the detection results provided by the evaluated detectors are fed into Siamese-DETR frame by frame, where the set of object queries $Q$ is removed and only $\hat{Q}$ is used for tracking. Both detection and tracking results are presented for a more comprehensive comparison. The best results are shown in bold with underline.
Detection Methods Language Models Training Datasets for Detection Methods Detection Results Tracking Methods Tracking Results
mAP@0.5\uparrow mAR\uparrow MOTA\uparrow IDF1\uparrow MT\uparrow ML\downarrow FP\downarrow FN\downarrow IDSw\downarrow
Closed-Set MOT Methods
YOLOv5l6 [55] (manual) ✗ COCO [11] 41.1% 38.6% + IOU [21] 22.7% 29.0% 257 1346 38752 161207 4861
+ SORT [20] 22.9% 31.2% 232 1311 39574 162323 2498
+ DeepSORT [3] 23.0% 32.4% 269 1283 40129 159833 2274
+ ByteTrack [5] 23.6% 34.4% 301 1261 38763 162037 3691
+ BoT-SORT [62] 23.9% 35.1% 298 1214 37991 158429 3437
+ TbQ (Ours, Swin-T) 24.5% 25.1% 284 1241 40032 152860 8431
DINO [22] (manual) ✗ COCO [11] 28.7% 30.1% + IOU [21] 21.7% 29.0% 257 1346 38752 161207 1789
+ SORT [20] 19.3% 23.1% 182 1437 41786 167923 1658
+ DeepSORT [3] 20.0% 23.4% 184 1390 42013 165689 1931
+ ByteTrack [5] 21.2% 25.7% 201 1376 43543 162480 1774
+ BoT-SORT [62] 21.1% 26.0% 199 1345 42081 163093 3762
+ TbQ (Ours, Swin-T) 21.9% 24.1% 215 1247 41678 163881 6597
Conditional DETR [61] (manual) ✗ COCO [11] 33.1% 22.0% + IOU [21] 17.7% 20.5% 161 1316 21109 184826 5056
+ SORT [20] 16.7% 22.7% 111 1423 15595 195583 2283
+ DeepSORT [3] 19.3% 28.7% 179 1296 23141 180550 3094
+ ByteTrack [5] 17.9% 30.9% 183 1129 32922 174627 3964
+ BoT-SORT [62] 19.1% 34.2% 232 1093 32866 170789 3716
+ TbQ (Ours, Swin-T) 19.2% 24.7% 230 1075 31317 163955 9499
Open-Vocabulary MOT Methods
OVTrack [7] ✓ LVIS [46] 31.7% 32.7% + IOU [21] 20.3% 18.8% 139 1257 50467 154367 1473
+ SORT [20] 18.9% 20.1% 145 1387 49850 158905 1578
+ DeepSORT [3] 20.2% 21.2% 165 1367 49984 160378 1470
+ ByteTrack [5] 19.9% 20.6% 164 1345 51356 156329 1669
+ BoT-SORT [62] 20.0% 20.3% 167 1328 45721 163378 3278
+ TbQ (Ours, Swin-T) 21.3% 18.7% 186 1304 50784 154893 6381
GLIP-T (B) [44] ✓ Objects365 [47] 50.8% 44.6% + IOU [21] 25.1% 39.3% 458 721 55802 139560 6320
+ SORT [20] 25.2% 40.8% 354 877 54891 143623 2987
+ DeepSORT [3] 25.5% 41.6% 401 877 46610 141330 2892
+ ByteTrack [5] 27.0% 45.1% 447 746 52591 131759 2706
+ BoT-SORT [62] 27.3% 49.1% 553 643 51308 133462 4675
+ TbQ (Ours, Swin-T) 27.5% 39.8% 581 592 49470 136602 9972
Template-Image-based MOT Methods
GlobalTrack [17] ✗ COCO [11], LaSOT [18], GOT-10K [19] 28.3% 18.3% + IOU [21] 11.8% 20.3% 56 1491 8299 216821 1668
+ SORT [20] 19.5% 30.3% 140 1187 15132 189315 1785
+ DeepSORT [3] 14.5% 24.4% 72 1363 9000 208818 1315
+ ByteTrack [5] 19.1% 32.1% 178 1069 23881 181829 1791
+ BoT-SORT [62] 19.4% 34.0% 251 978 22229 176991 7375
+ TbQ (Ours, Swin-T) 20.6% 27.4% 213 1066 13507 182376 6407
Siamese-DETR (Ours, Swin-T) ✗ COCO [11] 57.5% 46.6% + IOU [21] 30.7% 35.0% 361 759 42504 127024 8158
+ SORT [20] 30.1% 34.7% 235 943 29518 145613 4060
+ DeepSORT [3] 31.1% 41.8% 382 773 47336 124257 5131
+ ByteTrack [5] 33.7% 41.4% 331 764 53417 104765 4204
+ BoT-SORT [62] 34.1% 47.5% 431 674 45769 119288 6775
+ TbQ (Ours, Swin-T) 35.9% 42.8% 504 666 44882 107894 11664
Siamese-DETR (Ours, Swin-B) ✗ COCO [11] 63.3% 49.9% + TbQ (Ours, Swin-B) 39.4% 35.7% 482 586 33968 106079 10233
Objects365 [47] 69.6% 55.4% + TbQ (Ours, Swin-B) 50.0% 51.3% 1083 278 44390 68189 11252

IV-B Implementation Details

We use the Swin Transformer [49] as the backbone network. As in most DETR variants [24, 22], there are 6 encoder layers and 6 decoder layers in the transformer, and the hidden dimensionality is set to 256. Following the settings of Deformable-DETR [23], the number of feature scales $S$ is set to 4. The number of object queries $N$ is set to 600. Unless otherwise specified, all evaluated Siamese-DETR variants are optimized with AdamW [63] for 12 epochs. The batch size is set to 16 and the number of templates $T$ is set to 7 by default. The initial learning rate is set to $5\times 10^{-5}$ and decayed by a factor of 0.1 at epoch 11. The longer side of the template image is resized to 400 before being fed into the backbone network. While tracking online, the NMS threshold is set to 0.5 to remove duplicated detection boxes, and we follow the settings of TIMOT [8] to crop the template image from the first frame with a randomly sampled box for each video. In the following, Siamese-DETR denotes the model for both object detection and tracking if there is no ambiguity; otherwise, we use Siamese-DETR and TbQ to denote the detector and the tracker, respectively.
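For reference, these hyper-parameters can be collected into a configuration sketch as below; this is an illustration of the settings listed above, not the authors' released configuration file.

```python
import torch

config = dict(
    backbone="swin_tiny",        # Swin-T by default
    num_encoder_layers=6,
    num_decoder_layers=6,
    hidden_dim=256,
    num_feature_scales=4,        # S
    num_queries=600,             # N
    num_templates=7,             # T
    epochs=12,
    batch_size=16,
    lr=5e-5,                     # decayed by 0.1 at epoch 11
    template_long_side=400,
    nms_threshold=0.5,
)

def build_optimizer(model, cfg=config):
    optimizer = torch.optim.AdamW(model.parameters(), lr=cfg["lr"])
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[11], gamma=0.1)
    return optimizer, scheduler
```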

IV-C Comparison with Existing Methods

We compare the proposed method with several existing closed-set MOT methods, open-vocabulary MOT methods, and template-image-based MOT methods. It is worth noting that the model weights of existing methods provided by their authors are directly used for evaluation; this is fair because the GMOT-40 benchmark is only used for evaluation and is not used in the training of any method (including ours). Since the different types of MOT methods all follow a tracking-by-detection pipeline, we divide each evaluated method into a detector and a tracker. The details of the different detectors and trackers are shown in Tab. I and Tab. II, respectively. For a comprehensive comparison, we not only apply our TbQ tracking strategy to different detectors but also apply different trackers to our detector. The detection and tracking results of the different methods on GMOT-40 [8] are provided in Tab. III.

IV-C1 Comparison of Details

From Tab. I, we can see that different detectors are not strictly constrained to have the same backbone network, detection neck, etc. Compared with closed-set detectors (i.e., YOLOv5l6 [55], DINO [22], Conditional DETR [61]), the open-vocabulary detectors (i.e., OVTrack [7] and GLIP-T [44]) and template-image-based detectors (i.e., GlobalTrack [17] and Siamese-DETR) need extra costs of FLOPs and inference time to process the text prompts or template images. However, the extra costs are negligible when the number of frames is large enough since the text prompt and template image only need to be processed in the first frame. Note that the (vision-)language models are also necessary for open-vocabulary methods. Our Siamese-DETR (Swin-T) has almost the same number of parameters, FLOPs, and inference time as DINO. From Tab. II, it can be observed that the commonly used Kalman Filter, appearance cues and hierarchical matching strategies are not used in our TbQ. The time consumption of TbQ for tracking is much less than that of detection for the reason that TbQ only introduces a few additional object queries for tracking, which is performed simultaneously with detection.

IV-C2 Comparison of Detection Performance

We first show the detection performance of different methods on GMOT-40 [8]. Closed-set methods, e.g., YOLOv5 [55] (specifically, YOLOv5l6), DINO [22], and Conditional DETR [61], tend to detect all objects belonging to the categories in the pre-defined closed set, which results in poor detection performance because only the objects of one specific category are treated as foreground. To make these closed-set methods compatible with the TIMOT setting, we manually remove the predicted boxes that do not share the category of the template image for each video (denoted as YOLOv5l6 (manual), DINO (manual), and Conditional DETR (manual)). As expected, Siamese-DETR (Swin-T) outperforms YOLOv5l6 (manual), DINO (manual), and Conditional DETR (manual) by a large margin. For example, when trained on the same dataset (i.e., COCO [11]), Siamese-DETR (Swin-T) achieves 16.4%, 28.8%, and 24.4% higher mAP@0.5 than YOLOv5l6 (manual), DINO (manual), and Conditional DETR (manual), respectively. YOLOv5l6 (manual) performs better than DINO (manual) and Conditional DETR (manual); the reason is that the objects in GMOT-40 are much smaller than those in COCO, and DETR-based detectors (DINO and Conditional DETR) cannot handle small objects well [48]. Interestingly, compared with DINO (manual), Conditional DETR (manual) achieves better mAP@0.5 but much poorer mAR, resulting in poorer tracking results.

Open-vocabulary methods can detect objects of interest given different text prompts. However, due to the domain gap between vision and language, they need well pre-trained language models to extract features from the given text prompts. In addition, to recognize the correct objects from different text prompts, fine-grained category annotations are required. For example, GLIP-T (B) [44] utilizes the pre-trained BERT [9] to extract text features, and the detection model is trained on Objects365 [47], whose number of categories is increased from 365 to 1300+ by the authors to provide more fine-grained annotations. During testing, we follow the settings of OVTrack and GLIP-T (B) and use the category names as the text prompts for object detection. It can be observed that, without the help of a well pre-trained language model and fine-grained category annotations, our Siamese-DETR (Swin-T) outperforms OVTrack and GLIP-T (B) by a large margin when only the COCO [11] dataset is used for training.

The template-image-based method GlobalTrack [17] was originally designed for the Single Object Tracking (SOT) task. It is implemented on top of a traditional detector [43] and can detect objects of interest within the whole image, since it computes correlation scores between the features extracted from the template image and those of the whole image; a high score indicates a potential object. Although there are many other SOT methods (e.g., the transformer-based TransT [64] and TrDiMP [65]), they are not suitable for the template-image-based MOT task because they only accept a small search-region image rather than the whole image when tracking a single object; directly feeding the whole image to these SOT methods produces poor detection results (i.e., mAP@0.5 = 0.0%). Though multiple datasets are used to train GlobalTrack, its performance remains poor. For example, our COCO-trained Siamese-DETR (Swin-T) achieves 29.2% higher mAP@0.5 than GlobalTrack.

Lastly, we adopt a larger backbone network and train Siamese-DETR with more data to show its scalability. It can be seen that: 1) with the same training data (i.e., COCO), mAP@0.5 increases from 57.5% to 63.3% when the backbone network is switched from Swin-T to Swin-B; 2) with the same model (Siamese-DETR (Swin-B)), mAP@0.5 is further improved to 69.3% when Objects365 is used for training. These results demonstrate that the detection performance of Siamese-DETR can be boosted by a larger model or more training data.

TABLE IV: Impact of the number of scales in Multi-Scale Object Queries (MSOQ). Except for the different number of scales in object queries, all models are trained with 1 template image (refer to Section III-C2) and the negative samples are removed (refer to Section III-C1). The query denoising (refer to Section III-D2) is not utilized. The best results are shown in bold with underline.
Number of Scales Detection Results Tracking Results
Small Medium Large Overall MOTA\uparrow IDF1\uparrow MT\uparrow ML\downarrow FP\downarrow FN\downarrow IDSw \downarrow
mAP@0.5\uparrow mAR\uparrow mAP@0.5\uparrow mAR\uparrow mAP@0.5\uparrow mAR\uparrow mAP@0.5\uparrow mAR\uparrow
1 11.0% 23.4% 37.7% 49.8% 42.3% 58.7% 35.9% 28.4% 19.3% 18.5% 132 1097 23213 175968 12622
2 13.1% 35.2% 40.0% 55.2% 48.9% 71.0% 39.7% 33.9% 20.2% 19.5% 138 1084 24372 170329 12566
3 16.9% 36.8% 42.9% 57.5% 50.2% 71.2% 41.6% 35.5% 20.5% 20.4% 146 1069 25761 165839 11893
4 23.3% 45.8% 45.9% 60.3% 55.4% 77.8% 47.9% 43.5% 23.4% 22.3% 157 1001 34583 155681 18562

IV-C3 Comparison of Tracking Performance

First, we show the generalization ability of the proposed tracking strategy TbQ by applying it to different detectors. Specifically, while tracking online, the detection results of each detector are fed into Siamese-DETR frame by frame, where the set of object queries $Q$ is removed and only $\hat{Q}$ is used for tracking. Since the tracking pipeline follows the tracking-by-detection paradigm, different tracking performances are achieved with different detectors. Compared with GlobalTrack [17], MOTA is increased from 20.6% to 35.9% by our Siamese-DETR (Swin-T). However, we find that this improvement of Siamese-DETR (Swin-T) mainly comes from the lower FN. Based on this finding, we tried to reduce the number of false negatives (FN) of GlobalTrack, but failed, because GlobalTrack produces very low and similar confidence scores for most of its predicted boxes.

Then we apply different trackers to a specific detector. Taking the proposed Siamese-DETR (Swin-T) as an example, our TbQ achieves the best MOTA among all trackers, even though TbQ is much simpler than the others (refer to Tab. II). For example, TbQ achieves 35.9% MOTA, which is higher than the 33.7% and 34.1% MOTA achieved by ByteTrack [5] and BoT-SORT [62]. It is worth noting that ByteTrack and BoT-SORT are recently proposed trackers for traditional multi-object tracking and achieve remarkable performance on the MOTChallenge datasets (https://motchallenge.net/). But they are complex and contain many hyper-parameters, e.g., the confidence scores and matching thresholds of the two-stage matching strategy. All these hyper-parameters are well tuned for pedestrian tracking, and they are even tuned per video. In our experiments, directly using the parameters tuned for MOTChallenge on GMOT-40 produces very poor tracking performance (i.e., MOTA < 0). Though we tried our best to tune these parameters for GMOT-40, the tracking performance of ByteTrack and BoT-SORT still lags behind that of TbQ, which may be caused by the domain gap between the GMOT-40 and MOTChallenge datasets. Different from ByteTrack and BoT-SORT, our TbQ tracks objects without bells and whistles yet achieves better overall tracking performance (specifically, MOTA).

Through deeper analysis, we find that TbQ sometimes produces worse IDSw or IDF1 scores than other tracking methods. For example, when applying TbQ and SORT to GlobalTrack, the IDSw of TbQ is higher than that of SORT (6407 vs. 1315) and the IDF1 of TbQ is lower than that of SORT (27.4% vs. 30.3%). The reasons are twofold: 1) IDSw and IDF1 are related to the number of tracked objects and trajectory segments; since TbQ tracks more objects (higher MT), it tends to produce a higher IDSw and a lower IDF1; 2) TbQ has weaker discriminability than existing tracking methods. This is expected, since TbQ tracks objects without bells and whistles; for example, the common Kalman filter, appearance cues, and hierarchical matching are not used.

Some qualitative tracking results are shown in Fig. 4. It can be seen that Siamese-DETR (Swin-T) tracks more interested objects than GLIP-T (B) when combined with the same tracker, and that TbQ tracks more objects than SORT when the same detection results are used, demonstrating the effectiveness of our Siamese-DETR and TbQ.

IV-D Discussions

In the following, Swin-T [49] is used as the backbone network unless otherwise specified.

IV-D1 Multi-Scale Object Queries

In Siamese-DETR, the multi-scale features extracted from the template image are used as the query contents in order to detect different scales of objects that share the same category with the template image. To show the effectiveness of MSOQ, we design different counterparts that use different numbers of scales, i.e., S \in \{1, 2, 3, 4\}. For a specific S, the S feature maps that have the smallest spatial size are used. The results are shown in Tab. IV. As we can see, both detection and tracking performances are improved when more scales of features are used. Specifically, compared with the results of 1-scale object queries, 4-scale object queries boost the mAP@50 and MOTA by 12.0% and 4.1%. With the help of multi-scale features, objects of different scales are more easily detected and recognized. For example, when the number of scales is increased from 1 to 4, the mAP@50 scores for small, medium and large objects are improved by 12.3%, 8.2% and 13.1%, respectively.
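As a rough illustration of this design, the sketch below (Python/PyTorch) builds S-scale query contents from the template image by pooling the S coarsest feature maps of a shared backbone. The pooling choice and the query layout are assumptions made for illustration, not the exact construction used in Siamese-DETR.

import torch
import torch.nn.functional as F

def build_multiscale_query_contents(backbone, template, num_scales=4, queries_per_scale=300):
    # The template image is encoded by the shared backbone into a feature
    # pyramid; each of the S feature maps with the smallest spatial size is
    # pooled into one content vector, which is repeated as the content part
    # of the object queries for that scale.
    feature_pyramid = backbone(template)            # list of (1, C, H_i, W_i) maps (assumed)
    coarsest = sorted(feature_pyramid, key=lambda f: f.shape[-2] * f.shape[-1])[:num_scales]
    contents = []
    for feat in coarsest:
        vec = F.adaptive_avg_pool2d(feat, 1).flatten(1)        # (1, C)
        contents.append(vec.repeat(queries_per_scale, 1))      # (N, C)
    return torch.cat(contents, dim=0)               # (S * N, C) query contents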

TABLE V: Impact of Dynamic Matching Training Strategy (DMTS). The models are trained without query denoising (refer to Section III-D2). The best results are shown in bold with underline.
DMTS Detection Results Tracking Results
Utilizing all annotations Number of templates mAP@50\uparrow mAR\uparrow MOTA\uparrow IDF1\uparrow MT\uparrow ML\downarrow FP\downarrow FN\downarrow IDSw\downarrow
× 1 47.9% 43.5% 23.4% 22.3% 157 1001 34583 155681 18562
✓ 1 48.6% 43.1% 23.7% 21.3% 142 1031 33583 156411 18055
✓ 2 46.0% 41.3% 22.7% 21.1% 137 1072 33861 158935 18239
✓ 3 50.9% 44.4% 24.6% 22.4% 186 953 32402 150154 17715
✓ 4 52.5% 44.9% 26.3% 24.3% 283 872 30374 142489 16993
✓ 5 53.2% 45.1% 26.6% 24.4% 299 831 32712 139210 16918
✓ 6 52.7% 43.9% 26.3% 24.3% 284 875 30384 141489 16974
✓ 7 54.9% 46.3% 27.8% 25.1% 316 764 33169 133245 17542
✓ 8 51.2% 44.3% 24.6% 23.4% 286 862 35027 142154 17366
✓ 9 53.1% 44.5% 26.2% 24.8% 308 842 31600 140580 16695
TABLE VI: Effectiveness of query denoising. The models are trained with DMTS, where the number of template images is set to 7. The best results are shown in bold with underline.
Query denoising Detection Results Tracking Results GT boxes as Query Boxes
mAP@50\uparrow mAR\uparrow MOTA\uparrow IDF1\uparrow MT\uparrow ML\downarrow FP\downarrow FN\downarrow IDSw\downarrow Avg. Conf.\uparrow Avg. IoU\uparrow
× 54.9% 46.3% 27.8% 25.1% 316 764 33169 133245 17542 0.101 0.021
Original [22] 55.4% 46.3% 28.4% 27.1% 319 837 32847 137591 13197 0.093 0.103
Optimized (Ours) 57.5% 46.6% 35.9% 42.8% 504 666 44882 107894 11664 0.426 0.730
TABLE VII: Impact of different training datasets. The models are equipped with MSOQ (Section III-B) and trained with DMTS (7 template images, Section III-C) and optimized query denoising (Section III-D2). The best results are shown in bold with underline.
Methods Datasets Detection Results Tracking Results
mAP@50\uparrow mAR\uparrow MOTA\uparrow IDF1\uparrow MT\uparrow ML\downarrow FP\downarrow FN\downarrow IDSw\downarrow
GLIP-T (B) [44] + TbQ (Ours, Swin-T) COCO [11] 30.5% 23.4% 19.6% 18.4% 108 1436 24067 189542 3671
LVIS [46] 38.8% 30.2% 22.3% 21.3% 116 1361 26478 178435 6549
Objects365 [47] 50.8% 44.6% 27.5% 39.8% 581 592 49470 136602 9972
Siamese-DETR (Ours, Swin-T) COCO [11] 57.5% 46.6% 35.9% 42.8% 504 666 44882 107894 11664
LVIS [46] 56.5% 42.5% 30.3% 39.5% 845 303 32471 136960 14653
Objects365 [47] 59.3% 48.0% 40.8% 44.2% 668 519 39300 98199 14186
Siamese-DETR (Ours, Swin-B) COCO [11] 65.6% 50.6% 43.1% 46.6% 681 466 49478 84765 13723
LVIS [46] 62.4% 45.0% 33.8% 41.1% 415 562 38285 119800 12312
Objects365 [47] 69.6% 55.4% 50.0% 51.3% 1083 278 44390 68189 11252
TABLE VIII: Tracking results of different methods. Since the DETR-based TrackFormer is a closed-set tracking method mainly designed for pedestrian tracking, only the pedestrian/person videos in GMOT-40 [8] are used for evaluation. The best results are shown in bold with underline.
Methods Datasets for Training Detection Results Tracking Results
mAP@50\uparrow mAR\uparrow MOTA\uparrow IDF1\uparrow MT\uparrow ML\downarrow FP\downarrow FN\downarrow IDSw\downarrow
TrackFormer [66] MOT17 [67] 28.4% 31.8% -8.5% 36.4% 39 45 13500 10990 86
Siamese-DETR (Ours, Swin-T) COCO [11] 73.1% 62.3% 43.7% 29.5% 43 12 4661 6493 1609

IV-D2 Dynamic Matching Training Strategy

The dynamic matching training strategy is designed to efficiently train Siamese-DETR on commonly used detection datasets by utilizing all annotations and by utilizing all annotations more than once. The results are shown in Tab. V. When all annotations are used, Siamese-DETR performs a two-category object detection task. The mAP@50 and MOTA scores are improved by 0.7% and 0.3%, respectively. However, mAR is reduced by 0.4%, which is reasonable since an object is more likely to be classified as background when negative samples are introduced to train the model (refer to Section III-C).

Utilizing all annotations more than once is implemented by using more than one template image for training. We conduct extensive experiments to train Siamese-DETR with different numbers of template images. As we can see from Tab. V, Siamese-DETR achieves the best detection and tracking results when trained with 7 template images. Specifically, compared with the counterpart trained with 1 template image, utilizing 7 template images for training achieves 6.3% higher mAP@50 and 4.1% higher MOTA. By default, 7 template images are used for the training of Siamese-DETR.
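To make the two mechanisms concrete, the following is a hedged sketch of the label assignment behind DMTS: for each training image, several template categories are sampled (here, from the categories present in the image, which is an assumption of this sketch), and every ground-truth box is relabelled as positive or negative with respect to each sampled template, so all annotations are used, and used once per template.

import random

def dynamic_matching_labels(annotations, categories_in_image, num_templates=7):
    # annotations: list of {"box": [...], "category": int} (assumed format).
    # Each sampled template category induces a two-category (positive/negative)
    # relabelling of all ground-truth boxes in the image.
    templates = random.choices(categories_in_image, k=num_templates)
    per_template_targets = []
    for tpl_cat in templates:
        targets = [
            {"box": ann["box"], "label": 1 if ann["category"] == tpl_cat else 0}
            for ann in annotations
        ]
        per_template_targets.append({"template_category": tpl_cat, "targets": targets})
    return per_template_targets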

IV-D3 Tracking-by-Query

The effectiveness of our simple online tracking pipeline, TbQ, has been demonstrated in Tab. III and Section IV-C3 by comparing TbQ with other trackers. Here, we further show the effectiveness of the optimized query denoising. Results are shown in Tab. VI. As we can see, both the original query denoising [22] and the optimized query denoising are effective in improving the detection and tracking performance. However, as stated in Section III-C, the original query denoising introduces some conflicts with template-image-based object detection/tracking. With the help of our optimized query denoising, the tracking scenario is more accurately mimicked and the TbQ tracking pipeline is learned more effectively during the training stage, which brings a larger performance gain than the original query denoising.

To further study the impact of query denoising, we use ground-truth boxes as query boxes to detect objects. The average confidence score (Avg. Conf.) of the predicted boxes and the average IoU (Avg. IoU) between the predicted boxes and their corresponding ground-truth boxes are calculated to show the classification and box regression capabilities of Siamese-DETR. As we can see, the original query denoising harms the classification capability of Siamese-DETR, and the detection and tracking performance gain mainly comes from the better box regression capability. However, with the optimized query denoising, both the classification and box regression capabilities of Siamese-DETR are improved.
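This probing experiment can be sketched as follows, reusing the hypothetical model interface from the TbQ sketch above: ground-truth boxes are fed as query boxes, and the mean confidence of the resulting predictions and their mean IoU with the corresponding ground-truth boxes are reported. The one-to-one pairing of predictions with their source query boxes is an assumption of this sketch.

import torch
from torchvision.ops import box_iou

@torch.no_grad()
def probe_with_gt_queries(model, frames, template, gt_boxes_per_frame):
    # Feed ground-truth boxes as query boxes and measure Avg. Conf. / Avg. IoU.
    confs, ious = [], []
    for frame, gt_boxes in zip(frames, gt_boxes_per_frame):
        _, _, pred_boxes, pred_scores = model(frame, template, track_boxes=gt_boxes)
        confs.append(pred_scores.mean())
        # each query box is assumed to yield one prediction, so compare them pairwise
        ious.append(box_iou(pred_boxes, gt_boxes).diag().mean())
    return torch.stack(confs).mean().item(), torch.stack(ious).mean().item()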

IV-D4 Impact of Training Data

We show the impact of the training data on our Siamese-DETR and on open-vocabulary methods (e.g., GLIP-T (B) [44]) in Tab. VII.

Compared with the COCO-trained Siamese-DETR, the LVIS-trained Siamese-DETR achieves poorer detection and tracking performance. We attribute this to the fewer training images in LVIS than in COCO (100K vs. 118K). In contrast, the LVIS-trained GLIP-T (B) performs much better than the COCO-trained one, indicating that fine-grained category annotations play a key role in boosting the performance of open-vocabulary methods. Compared with GLIP-T (B), our Siamese-DETR has a lower demand for category annotations, which reduces the labeling cost when collecting training data. When trained on Objects365, Siamese-DETR is greatly boosted thanks to the larger amount of training data. The same phenomenon is also observed for GLIP-T (B). However, Siamese-DETR outperforms GLIP-T (B) by a large margin when trained with the same data, demonstrating the effectiveness of our method.

TABLE IX: Results of multi-category multi-object tracking on Urban Tracker [52] dataset. The default trackers in Siamese-DETR and OVTrack are adopted. The best results are shown in bold with underline.
Methods Datasets for Training Detection Results Tracking Results
mAP@50\uparrow mAR\uparrow MOTA\uparrow IDF1\uparrow MT\uparrow ML\downarrow FP\downarrow FN\downarrow IDSw\downarrow
OVTrack [7] LVIS [46] 20.0% 27.4% 5.5% 26.1% 8 47 4110 19309 68
Siamese-DETR (Ours, Swin-T) COCO [11] 35.9% 30.4% 14.2% 27.2% 11 30 4967 14298 542
Figure 5: The template images used for each category (Airplane, Ball, Balloon, Bird, Boat, Car, Fish, Insect, Person, Stock). We present the 4 template images used for each category since there are 4 videos for each category. All template images are padded to a square resolution.
TABLE X: The impact of different sets of template images. Different template images produce different detection performances, but the variabilities are negligible. The Siamese-DETR (Swin-T) trained on COCO is evaluated.
Sets of Template Images Detection Results
mAP@50\uparrow mAR\uparrow
First Set (Default) 57.5% 46.6%
Second Set 57.7% 47.2%
Third Set 57.7% 47.4%

IV-D5 Comparing with DETR-Based MOT Methods

In Section IV-C, several DETR-based methods, including DINO [22] and Conditional DETR [61], are evaluated. However, these methods are mainly designed for object detection, and the tracking results are obtained with different tracking methods. Here, we further compare Siamese-DETR with DETR-based MOT methods (e.g., TrackFormer [66]), which are closed-set tracking methods mainly designed for pedestrian/person tracking. Considering that there is no annotated video data for the training of TIMOT methods while TrackFormer needs to be trained on annotated video data, we directly use the publicly available model weights of TrackFormer and compare it with Siamese-DETR on the four person videos in the GMOT-40 benchmark. Results are shown in Tab. VIII. TrackFormer achieves much poorer detection results than Siamese-DETR, resulting in poor tracking results (MOTA < 0 due to the high FP and FN). The reason is that TrackFormer fails to detect the persons in some scenes (e.g., wingsuit flying in the bottom row of the subfigure Person in Fig. 5), while in other scenes it tends to detect all persons even though we only focus on standing persons (e.g., the top row of the subfigure Person in Fig. 5). In contrast, Siamese-DETR can follow the template image to detect/track the interested objects in different scenes.

Figure 6: Visualization of attention weights. Left column: the attention weights for different reference points in the last encoder layer. Middle column: the attention weights for different detection object queries (in Q) in the last decoder layer. Right column: the attention weights for different tracking object queries (in \hat{Q}) in the last decoder layer. The reference points are shown with cross markers. The sampling points are marked as filled circles, with the attention weights (ranging from 0.0 to 1.0) encoded with different colors. The rectangle boxes are the boxes predicted by the decoder.

IV-D6 Multi-Category Multi-Object Tracking

Our Siamese-DETR has the capability to track different categories of objects by providing a template image for each category, since it supports multiple template images simultaneously. For comparison, OVTrack [7] is also evaluated by providing all category names to it. As we can see from Tab. IX, Siamese-DETR achieves much better detection and tracking performance than OVTrack. The poor performance of OVTrack mainly comes from the higher FN. Through visualization, we find that OVTrack fails to detect some objects if the provided text prompt cannot describe them in detail. However, providing fine-grained text prompts for all objects is impractical in real applications.
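A minimal sketch of how the per-template grouping can be turned into category labels is given below (the detection format is an assumption made for illustration): since each group of object queries is derived from one template image, a detection simply inherits the category of the template that produced it.

def assign_categories(detections, template_categories):
    # detections: list of (template_index, box, score) triples (assumed format);
    # template_categories[i] is the category of the i-th template image.
    return [
        {"category": template_categories[t_idx], "box": box, "score": score}
        for t_idx, box, score in detections
    ]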

IV-D7 Impact of Template Images

The default template images used for each video are presented in Fig. 5. It can be seen that there exists variability within the same category, so the impact of different template images needs to be discussed. To study this, we further randomly sample another two sets of template images. The detection results on GMOT-40 are shown in Tab. X. As we can see, though different sets of template images produce different detection performances, the variabilities are negligible, demonstrating the generalization ability of Siamese-DETR with respect to template images. A carefully selected set of template images may produce better performance, but that is beyond the scope of this work.

Figure 7: Two failure cases of Siamese-DETR (ground-truth and prediction rows for frames 001/011/021 and 081/091/101). Siamese-DETR fails to detect/track objects when they are highly overlapped with each other (the top case) and produces some false positives if some objects share a similar appearance with the interested objects (the bottom case). For each case, the ground-truth boxes are plotted (red boxes) for reference. Please pay attention to the objects within the ellipses.

IV-D8 Visualization of Attention Weights

To show the effectiveness of Siamese-DETR, we visualize some attention weights produced by the transformer encoder and decoder layers. Since Siamese-DETR is implemented based on DINO [22] and Deformable DETR [23], where the attention is calculated in a sparse manner, we first select a reference point and then show the attention weights of all sampling points that contribute to this reference point. Results are shown in Fig. 6. For the attention weights produced by the encoder layer (left column), the reference points mainly attend to foreground sampling points. For the attention weights produced by the decoder layer (middle and right columns), the reference points mainly attend to the sampling points at object extremities. From the visualization results in the right column, we can see that the objects successfully draw attention from the model based on the tracked boxes in the previous frame.
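The visualization itself can be reproduced with a short plotting routine such as the one below (a sketch; the pixel-space reference point, sampling locations, and weights are assumed to be extracted beforehand from the deformable attention module).

import matplotlib.pyplot as plt

def plot_sparse_attention(image, reference_point, sampling_points, weights):
    # For one selected reference point, draw the sampling locations as filled
    # circles colored by their attention weights (0.0 to 1.0), and the
    # reference point as a cross marker.
    plt.imshow(image)
    xs, ys = zip(*sampling_points)
    plt.scatter(xs, ys, c=weights, cmap="jet", vmin=0.0, vmax=1.0, s=20)
    plt.scatter([reference_point[0]], [reference_point[1]], marker="x", c="white", s=60)
    plt.colorbar(label="attention weight")
    plt.axis("off")
    plt.show()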

IV-D9 Failure Cases

Two typical failure cases are presented in Fig. 7. For the top case, Siamese-DETR fails to detect/track different object instances when they are heavily overlapped with each other. For the bottom case, Siamese-DETR produces some false positive boxes when some objects share a similar appearance with the interested objects.

V Conclusions and Limitations

In this paper, we focus on template-image-based multi-object tracking, where the interested objects are described by the given template image. We take advantage of the object queries in DETR variants and propose Siamese-DETR to track multiple generic objects. In order to detect different scales of objects that share the same category with the given template image, multi-scale object queries are designed, where the query contents are obtained from the template image. In addition, a dynamic matching training strategy is proposed to train Siamese-DETR efficiently on commonly used detection datasets. To handle the scarcity of video training data, query denoising is adopted and optimized, which mimics the tracking scenario on static images. While tracking online, the tracking pipeline is simplified by incorporating the tracked boxes as additional query boxes. Object detection and tracking are performed simultaneously, and the complex data association is replaced with the simpler NMS operation. Experimental results demonstrate the effectiveness of the proposed Siamese-DETR.

The main limitations of Siamese-DETR are twofold: 1) Siamese-DETR is trained with a two-category detection task, where objects are classified into positive and negative samples. An object may be treated as a positive sample for different template images if they share a similar appearance. This may be mitigated by providing several different template images and training the model with a multi-category detection task; 2) Siamese-DETR tracks objects solely based on the tracked boxes in the previous frame without exploring appearance cues. The absence of appearance cues may result in tracking failures when occlusion between different objects happens, producing a higher IDSw and a lower IDF1. This could be addressed by pairing the tracked boxes with the corresponding appearance features rather than the features extracted from the template image. We leave these limitations to future work.

Acknowledgments

This work was supported by the National Key R&D Program of China (2022YFC3300704), the National Natural Science Foundation of China (62331006, 62171038, and 62088101), and the Fundamental Research Funds for the Central Universities.

References

  • [1] P. Bergmann, T. Meinhardt, and L. Leal-Taixe, “Tracking without bells and whistles,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 941–951.
  • [2] G. Brasó and L. Leal-Taixé, “Learning a neural solver for multiple object tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6247–6257.
  • [3] N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime tracking with a deep association metric,” in Proceedings of IEEE International Conference on Image Processing, 2017, pp. 3645–3649.
  • [4] P. Chu and H. Ling, “Famnet: Joint learning of feature, affinity and multi-dimensional assignment for online multiple object tracking,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6172–6181.
  • [5] Y. Zhang, P. Sun, Y. Jiang, D. Yu, F. Weng, Z. Yuan, P. Luo, W. Liu, and X. Wang, “Bytetrack: Multi-object tracking by associating every detection box,” in Proceedings of the European Conference on Computer Vision, 2022, pp. 1–21.
  • [6] X. Zhou, V. Koltun, and P. Krähenbühl, “Tracking objects as points,” in Proceedings of the European Conference on Computer Vision, 2020, pp. 474–490.
  • [7] S. Li, T. Fischer, L. Ke, H. Ding, M. Danelljan, and F. Yu, “Ovtrack: Open-vocabulary multiple object tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5567–5577.
  • [8] H. Bai, W. Cheng, P. Chu, J. Liu, K. Zhang, and H. Ling, “Gmot-40: A benchmark for generic multiple object tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6719–6728.
  • [9] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, pp. 4171–4186.
  • [10] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in Proceedings of the IEEE International Conference on Machine Learning, 2021, pp. 8748–8763.
  • [11] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Proceedings of the European Conference on Computer Vision, 2014, pp. 740–755.
  • [12] W. Luo and T.-K. Kim, “Generic object crowd tracking by multi-task learning.” in Proceedings of the British Machine Vision Conference, 2013.
  • [13] W. Luo, T.-K. Kim, B. Stenger, X. Zhao, and R. Cipolla, “Bi-label propagation for generic multiple object tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2014, pp. 1290–1297.
  • [14] T. Evgeniou and M. Pontil, “Regularized multi-task learning,” in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 109–117.
  • [15] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2005, pp. 886–893.
  • [16] Y. Fu, H. Liu, Y. Zou, S. Wang, Z. Li, and D. Zheng, “Category-level band learning based feature extraction for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, pp. 1–16, 2023.
  • [17] L. Huang, X. Zhao, and K. Huang, “Globaltrack: A simple and strong baseline for long-term tracking,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 11 037–11 044.
  • [18] H. Fan, H. Bai, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, M. Huang, J. Liu, Y. Xu et al., “Lasot: A high-quality large-scale single object tracking benchmark,” International Journal of Computer Vision, vol. 129, pp. 439–461, 2021.
  • [19] L. Huang, X. Zhao, and K. Huang, “Got-10k: A large high-diversity benchmark for generic object tracking in the wild,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1562–1577, 2019.
  • [20] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple online and realtime tracking,” in Proceedings of the IEEE International Conference on Image Processing, 2016, pp. 3464–3468.
  • [21] E. Bochinski, V. Eiselein, and T. Sikora, “High-speed tracking-by-detection without using image information,” in Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance, 2017, pp. 1–6.
  • [22] H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. Ni, and H.-Y. Shum, “Dino: Detr with improved denoising anchor boxes for end-to-end object detection,” in The Eleventh International Conference on Learning Representations, 2022.
  • [23] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” in International Conference on Learning Representations, 2020.
  • [24] F. Li, H. Zhang, S. Liu, J. Guo, L. M. Ni, and L. Zhang, “Dn-detr: Accelerate detr training by introducing query denoising,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13 619–13 627.
  • [25] S. Liu, F. Li, H. Zhang, X. Yang, X. Qi, H. Su, J. Zhu, and L. Zhang, “Dab-detr: Dynamic anchor boxes are better queries for detr,” in International Conference on Learning Representations, 2021.
  • [26] M. Li, Y. Fu, T. Zhang, and G. Wen, “Supervise-assisted self-supervised deep-learning method for hyperspectral image restoration,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–14, 2024.
  • [27] A. Roshan Zamir, A. Dehghan, and M. Shah, “Gmcp-tracker: Global multi-object tracking using generalized minimum clique graphs,” in Proceedings of the European Conference on Computer Vision, 2012, pp. 343–356.
  • [28] L. Zhang, Y. Li, and R. Nevatia, “Global data association for multi-object tracking using network flows,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
  • [29] M. Keuper, E. Levinkov, N. Bonneel, G. Lavoué, T. Brox, and B. Andres, “Efficient decomposition of image and mesh graphs by lifted multicuts,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1751–1759.
  • [30] S. Tang, M. Andriluka, B. Andres, and B. Schiele, “Multiple people tracking by lifted multicut and person re-identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3539–3548.
  • [31] L. Chen, Y. Fu, K. Wei, D. Zheng, and F. Heide, “Instance segmentation in the dark,” International Journal of Computer Vision, vol. 131, no. 8, pp. 2198–2218, 2023.
  • [32] Y. Fu, Y. Hong, Y. Zou, Q. Liu, Y. Zhang, N. Liu, and C. Yan, “Raw image based over-exposure correction using channel-guidance strategy,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, pp. 2749–2762, 2023.
  • [33] K. Fang, Y. Xiang, X. Li, and S. Savarese, “Recurrent autoregressive networks for online multi-object tracking,” in Proceedings of IEEE Winter Conference on Applications of Computer Vision, 2018, pp. 466–475.
  • [34] T. Zhang, Y. Fu, J. Zhang, and C. Yan, “Deep guided attention network for joint denoising and demosaicing in real image,” Chinese Journal of Electronics, vol. 33, no. 1, pp. 303–312, 2024.
  • [35] W. Ren, X. Wang, J. Tian, Y. Tang, and A. B. Chan, “Tracking-by-counting: Using network flows on crowd density maps for tracking multiple targets,” IEEE Transactions on Image Processing, vol. 30, pp. 1439–1452, 2020.
  • [36] S.-H. Lee, D.-H. Park, and S.-H. Bae, “Decode-mot: How can we hurdle frames to go beyond tracking-by-detection?” IEEE Transactions on Image Processing, pp. 4378–4392, 2023.
  • [37] Y. Fu, Z. Wang, T. Zhang, and J. Zhang, “Low-light raw video denoising with a high-quality realistic motion dataset,” IEEE Transactions on Multimedia, pp. 8119–8131, 2022.
  • [38] Q. Liu, D. Chen, Q. Chu, L. Yuan, B. Liu, L. Zhang, and N. Yu, “Online multi-object tracking with unsupervised re-identification learning and occlusion estimation,” Neurocomputing, vol. 483, pp. 333–347, 2022.
  • [39] X. Wan, J. Cao, S. Zhou, J. Wang, and N. Zheng, “Tracking beyond detection: learning a global response map for end-to-end multi-object tracking,” IEEE Transactions on Image Processing, vol. 30, pp. 8222–8235, 2021.
  • [40] Q. Liu, Q. Chu, B. Liu, and N. Yu, “Gsm: Graph similarity model for multi-object tracking.” in International Joint Conference on Artificial Intelligence, 2020, pp. 530–536.
  • [41] R. Li, B. Zhang, J. Liu, W. Liu, and Z. Teng, “Inference-domain network evolution: A new perspective for one-shot multi-object tracking,” IEEE Transactions on Image Processing, vol. 32, pp. 2147–2159, 2023.
  • [42] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, 2017.
  • [43] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” Advances in Neural Information Processing Systems, vol. 28, 2015.
  • [44] L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang et al., “Grounded language-image pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10 965–10 975.
  • [45] X. Dai, Y. Chen, B. Xiao, D. Chen, M. Liu, L. Yuan, and L. Zhang, “Dynamic head: Unifying object detection heads with attentions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7373–7382.
  • [46] A. Gupta, P. Dollar, and R. Girshick, “Lvis: A dataset for large vocabulary instance segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5356–5364.
  • [47] S. Shao, Z. Li, T. Zhang, C. Peng, G. Yu, X. Zhang, J. Li, and J. Sun, “Objects365: A large-scale, high-quality dataset for object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8430–8439.
  • [48] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Proceedings of the European Conference on Computer Vision, 2020, pp. 213–229.
  • [49] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 012–10 022.
  • [50] Q. Liu, Y. Jiang, Z. Tan, D. Chen, Y. Fu, Q. Chu, G. Hua, and N. Yu, “Transformer based pluralistic image completion with reduced information loss,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  • [51] Z. Lai, Y. Fu, and J. Zhang, “Hyperspectral image super resolution with real unaligned rgb guidance,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–13, 2024.
  • [52] J.-P. Jodoin, G.-A. Bilodeau, and N. Saunier, “Urban tracker: Multiple object tracking in urban mixed traffic,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision.   IEEE, 2014, pp. 885–892.
  • [53] K. Bernardin and R. Stiefelhagen, “Evaluating multiple object tracking performance: the clear mot metrics,” Journal on Image and Video Processing, vol. 2008, pp. 1–10, 2008.
  • [54] Y. Li, C. Huang, and R. Nevatia, “Learning to associate: Hybridboosted multi-target tracker for crowded scene,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2009, pp. 2953–2960.
  • [55] R. Couturier, H. N. Noura, O. Salman, and A. Sider, “A deep learning object detection method for an efficient clusters initialization,” arXiv preprint arXiv:2104.13634, 2021.
  • [56] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
  • [57] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
  • [58] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network for instance segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8759–8768.
  • [59] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [60] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in Neural Information Processing Systems, vol. 25, 2012.
  • [61] D. Meng, X. Chen, Z. Fan, G. Zeng, H. Li, Y. Yuan, L. Sun, and J. Wang, “Conditional detr for fast training convergence,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3651–3660.
  • [62] N. Aharon, R. Orfaig, and B.-Z. Bobrovsky, “Bot-sort: Robust associations multi-pedestrian tracking,” arXiv preprint arXiv:2206.14651, 2022.
  • [63] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2018.
  • [64] X. Chen, B. Yan, J. Zhu, D. Wang, X. Yang, and H. Lu, “Transformer tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8126–8135.
  • [65] N. Wang, W. Zhou, J. Wang, and H. Li, “Transformer meets tracker: Exploiting temporal context for robust visual tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1571–1580.
  • [66] T. Meinhardt, A. Kirillov, L. Leal-Taixe, and C. Feichtenhofer, “Trackformer: Multi-object tracking with transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8844–8854.
  • [67] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler, “Mot16: A benchmark for multi-object tracking,” arXiv preprint arXiv:1603.00831, 2016.