
Siamese-DETR for Generic Multi-Object Tracking

Qiankun Liu, Yichen Li, Yuqi Jiang, and Ying Fu

The authors are with the School of Computer Science and Technology, Beijing Institute of Technology; Email: {liuqk3, liyichen, yqjiang, fuying}@bit.edu.cn. Ying Fu is the corresponding author.
Abstract

The ability to detect and track dynamic objects in different scenes is fundamental to real-world applications, e.g., autonomous driving and robot navigation. However, traditional Multi-Object Tracking (MOT) is limited to tracking objects that belong to pre-defined closed-set categories. Recently, Generic MOT (GMOT) has been proposed to track objects of interest beyond the pre-defined categories, and it can be divided into Open-Vocabulary MOT (OVMOT) and Template-Image-based MOT (TIMOT). Considering that expensive well pre-trained (vision-)language models and fine-grained category annotations are required to train OVMOT models, in this paper we focus on TIMOT and propose a simple but effective method, Siamese-DETR. Only commonly used detection datasets (e.g., COCO) are required for training. Different from existing TIMOT methods, which train a Single Object Tracking (SOT) based detector to detect objects of interest and then apply a data-association-based MOT tracker to obtain trajectories, we leverage the inherent object queries in DETR variants. Specifically: 1) multi-scale object queries are designed based on the given template image, which are effective for detecting objects of different scales that share the category of the template image; 2) a dynamic matching training strategy is introduced to train Siamese-DETR on commonly used detection datasets, taking full advantage of the provided annotations; 3) the online tracking pipeline is simplified to a tracking-by-query manner by incorporating the tracked boxes in the previous frame as additional query boxes, and the complex data association is replaced with much simpler Non-Maximum Suppression (NMS). Extensive experimental results show that Siamese-DETR surpasses existing MOT methods on the GMOT-40 dataset by a large margin.

Index Terms:
multi-object tracking, object detection, Siamese network, DETR.

I Introduction

Multi-Object Tracking (MOT) aims at estimating the locations of objects of interest in a given video while maintaining their identities consistently, and it has various applications, such as autonomous driving, robot navigation, and video surveillance. Benefiting from the advances in object detection, the tracking-by-detection paradigm has become popular for MOT in the past decade. Although great progress has been made, the generalization ability of existing MOT methods is still limited by the pre-defined closed-set categories, such as pedestrian [1, 2, 3, 4, 5] and car [6].

To overcome this drawback of the traditional MOT task, Generic Multi-Object Tracking (GMOT) has recently been introduced to track objects of arbitrary categories. It is based on the assumption that, at test time, we are given descriptions of the objects of interest. According to the type of description, GMOT can be divided into Open-Vocabulary MOT (OVMOT) [7] and Template-Image-based MOT (TIMOT) [8]. OVMOT methods use a text prompt (e.g., a category name) as the description, while TIMOT methods use a template image. Both types of descriptions are flexible and enlarge the closed set of categories to an open set, making multi-object tracking more suitable for real-world applications. However, due to the domain gap between text and images, well pre-trained (vision-)language models (e.g., BERT [9] and CLIP [10]) and fine-grained category annotations are needed to train the detectors in OVMOT methods. Besides the fact that annotating fine-grained category information is laborious and requires expertise, pre-training a (vision-)language model requires a huge amount of training data and computational resources, which makes OVMOT methods expensive to use. Taking this into consideration, we focus on TIMOT in this paper and propose a simple but effective method, Siamese-DETR. Only commonly used detection datasets (e.g., COCO [11]) are required to train the proposed method.

Figure 1: The online tracking pipeline of Siamese-DETR for generic multi-object tracking based on template image. The template image is fed into the backbone network to get the query contents, while the query boxes consist of the learned query boxes and the tracked boxes in the previous frame. With this design, the objects in current frame are tracked by their corresponding boxes, while the missed objects in the previous frame (but still exist in the current frame) or newly appeared objects in the current frame are detected and tracked by the learned query boxes.

Early TIMOT methods [12, 13] learn a Support Vector Machine (SVM) for each object identity through multi-task learning [14] based on hand-crafted features (e.g., HoG [15]). The identities of different objects are naturally maintained since each object is detected and tracked independently by its dedicated SVM. Recently, inspired by the success of the traditional MOT task, where the tracking-by-detection paradigm [12, 13, 8, 16] dominates and achieves appealing performance, the newly proposed TIMOT method [8] also follows the tracking-by-detection paradigm. Specifically, the tracking pipeline is divided into an object detection stage and an object tracking stage: 1) for object detection, a Single Object Tracking (SOT) [17] based detector is designed to detect all objects that share the same category as the template image; since there is no training data provided for the TIMOT task [8], SOT datasets (e.g., LaSOT [18] and GOT-10K [19]) and an object detection dataset (e.g., COCO [11]) are used to train the detector; 2) for object tracking, existing MOT trackers (e.g., SORT [20], DeepSORT [3], IOU [21], etc.) are directly utilized as data association algorithms to obtain the trajectories of different objects. Unfortunately, the tracking performance is still moderate even though the pipeline is of high complexity. The TIMOT task thus still needs to be studied to achieve better overall tracking performance with a simpler tracking pipeline.

In this paper, we leverage the inherent object queries in DETR variants [22, 23, 24, 25, 26] and propose a simple but effective method, Siamese-DETR. As shown in Fig. 1, the object queries contain the information of the template image for detection and of the tracked boxes for tracking. Although Siamese-DETR follows the tracking-by-detection paradigm, detection and tracking are performed simultaneously. The complex data association procedure is replaced by a much simpler Non-Maximum Suppression (NMS) step that removes duplicated boxes. Compared with existing methods, Siamese-DETR detects objects of interest more effectively and tracks them with a simpler pipeline.

To detect objects of interest with the given template image effectively, the Multi-Scale Object Queries (MSOQ) and the Dynamic Matching Training Strategy (DMTS) are designed: 1) Multi-scale object queries. The decoupled object query [25], which consists of a query content and a query box, is adopted; the query content is obtained from the template image while the query box is learned during training. In detail, we feed the template image into the backbone network of the detector to get hierarchical multi-scale features and map each scale into a query content. The multi-scale query contents (e.g., 4 scales) are equally replicated to match the number of learned query boxes (e.g., 600). Since features at different scales are sensitive to objects of different scales, Siamese-DETR effectively detects objects of various scales that share the category of the template image. 2) Dynamic matching training strategy. Given a training image in a commonly used detection dataset (e.g., COCO [11]), all of its annotations are utilized, and possibly more than once. Specifically, the objects that share the same category as the template image are treated as positive samples while the others are treated as negative samples. The introduction of negative samples takes full advantage of the provided annotations. By sampling more than one template image for each training image, the annotations are dynamically reused, which benefits Siamese-DETR further.

To track objects simply, we propose a Tracking-by-Query (TbQ) strategy. The tracked boxes are used as additional query boxes and query denoising is optimized to adapt it to TIMOT: 1) Tracked boxes as additional query boxes. The tracked boxes in the previous frame are paired with the query contents to serve as additional object queries. Object queries with tracked boxes and with learned boxes are responsible for tracking and detection, respectively and independently. Simple NMS is utilized to remove detected boxes that duplicate tracked boxes. 2) Optimized query denoising. Since there is no video training data for TIMOT [8], the query denoising [24] strategy is adapted from common object detection to mimic tracking scenarios on static images. Experimental results demonstrate that Siamese-DETR surpasses existing MOT methods on GMOT-40 [8] by a large margin. In summary, the contributions of this work are as follows:

  • We propose Siamese-DETR for template-image-based multi-object tracking and introduce multi-scale object queries to effectively detect different scales of objects that share the same category with the template image.

  • We introduce a dynamic matching training strategy for Siamese-DETR, enabling the training on commonly used detection datasets effectively.

  • We design a simple online tracking strategy by incorporating the tracked boxes as additional query boxes. Objects are tracked in a tracking-by-query manner.

The remainder of this paper is organized as follows: Section II firstly reviews the related works about object tracking and DETR variants. Next, the details of the proposed method are illustrated in Section III. Then, we provide the implementation details and compare the proposed method with existing methods in Section IV. Finally, we provide the conclusion and the discussions on the limitations of the proposed method in Section V.

II Related Work

This section briefly reviews related works from different aspects, including multi-object tracking, template-image-based multi-object tracking, open-vocabulary multi-object tracking and DETR variants.

II-A Multi-Object Tracking

In the past decade, Multi-Object Tracking (MOT) has emerged as a popular research area and has been dominated by the tracking-by-detection paradigm. Existing tracking-by-detection methods involve object detection and data association stages, and can be divided into offline and online methods. Offline methods [27, 28, 29, 30, 31, 32] process the video in a batch manner and can even utilize the information of the whole video to better handle the data association problem. In contrast, online methods [20, 33, 1, 2, 3, 4, 5, 34] process the video frame by frame and generate trajectories using only the information up to the current frame, which is more suitable for causal applications than offline processing.

Traditional MOT methods mainly focus on the data association problem, using the Hungarian algorithm, network flow [27, 28, 35], and graph multicut [29, 30]. Among them, all but the Hungarian algorithm can only be performed in an offline manner. In recent years, with the advancement of deep learning and object detection, online tracking has attracted more and more attention. In contrast to offline methods, online methods usually adopt the Hungarian algorithm for data association, but focus on the joint learning of object detection and some useful priors, such as object motion [1, 6, 36, 37], appearance features [4, 38, 39], occlusion maps [38], object relations [40], and so on. However, besides box and category ID annotations, extra annotations are required to learn these priors, e.g., object identities for appearance feature learning.

Though great progress has been made in MOT, most existing methods are designed to track objects that are limited to a small pre-defined closed set of categories, e.g., car and pedestrian. In this paper, we focus on generic multi-object tracking, which extends the closed set of categories in MOT to an open set of generic categories that is not limited to several specific ones.

II-B Template-Image-based Multi-Object Tracking

The Template-Image-based Multi-Object Tracking (TIMOT) task was introduced to address the generalization issue of MOT about ten years ago [12, 13]. Similar to MOT, TIMOT follows the tracking-by-detection paradigm [12, 13, 8]. However, TIMOT has received much less attention than MOT, mainly because data suitable for TIMOT is scarce. Recently, GMOT-40 [8] has been developed as a public dataset for the evaluation of TIMOT. Nevertheless, no well-annotated training data is available for TIMOT.

Early methods [12, 13] track generic objects based on Support Vector Machines (SVMs) and hand-crafted features. Each object is detected and tracked by a dedicated SVM, which is initialized with the given template image and updated online while tracking. Recently proposed methods [8, 41] first detect all objects that share the same category as the template image with a Single Object Tracking (SOT) based detector (specifically, GlobalTrack [17]), and then apply online data-association MOT trackers (e.g., SORT [20], DeepSORT [3], IOU [21], etc.) to obtain the trajectories. To mitigate the gap between SOT and TIMOT, single object tracking datasets (LaSOT [18], GOT-10K [19]) and an object detection dataset (COCO [11]) are used to train the SOT-based detector. However, the tracking performance is far from satisfactory.

Different from the existing TIMOT method [8], which detects objects with an SOT tracker [17], we leverage the object queries in DETR variants for both object detection and tracking. With the proper query design and training strategy, our method distinguishes objects of interest from others effectively. In addition, the tracking pipeline is simplified by incorporating the tracked boxes into the object queries.

II-C Open-Vocabulary Multi-Object Tracking

With the recent development of language [9, 42] and vision-language models [10], the Open-Vocabulary Multi-Object Tracking (OVMOT) [7] is proposed to track objects that belong to arbitrary categories.

Similar to traditional MOT and TIMOT, OVMOT also follows the tracking-by-detection pipeline, where open-vocabulary object detection plays a key role. OVTrack [7] localizes objects of interest with a class-agnostic R-CNN (i.e., Faster R-CNN [43]) and a vision-language model (i.e., CLIP [10]). Specifically, all objects, including those that are not of interest, are detected by the R-CNN, whose object features are aligned with the counterparts extracted by the image encoder of CLIP through knowledge distillation. The objects of interest are then selected by comparing the similarity between object features and the text features extracted by the text encoder of CLIP. Finally, a data association procedure is adopted to link objects across adjacent frames. Similarly, GLIP [44] detects objects using DyHead [45] and BERT [9]; the text and image features are aligned by iteratively fusing them in several successive blocks rather than by supervising the model with knowledge distillation. Though OVMOT trackers can track objects of arbitrary categories, expensive well pre-trained language or vision-language models are required to bridge the domain gap between text and images. In addition, laborious fine-grained category annotations are needed to help the model recognize the correct objects given different text descriptions. For example, OVTrack is trained on LVIS [46] with 1200+ annotated categories, and GLIP is trained on Objects365 [47] with 365 categories (further increased to 1300+ by the authors).

Compared with the aforementioned methods, the proposed Siamese-DETR requires neither an expensive pre-trained (vision-)language model nor laborious fine-grained category annotations, and it achieves better performance when only the COCO [11] dataset (with 80 categories) is used for training.

II-D DETR Variants

DETR [48] is the first end-to-end object detector. Its key ideas are the object query and the Hungarian loss. The anchor boxes and NMS components are removed, which significantly reduces the complexity of detectors. However, DETR suffers from slow convergence, and many works have been proposed to address this issue.

Deformable-DETR [23] replaces the common attention with deformable attention, which reduces the computational cost and makes it possible to use multi-scale features for object detection. DAB-DETR [25] decouples the object queries into learnable contents and learnable boxes, where the query boxes are iteratively updated at each decoder layer. DN-DETR [24] finds that the slow convergence of DETR is mainly caused by the unstable matching between object queries and ground-truth boxes, and introduces a denoising training approach to accelerate convergence: besides the object queries, noisy ground-truth boxes and labels are additionally fed into the decoder, which improves the model's ability of box regression and classification. Once training is finished, the object queries in the aforementioned DETR variants are fixed; all images share the same queries, which cannot be dynamically updated according to the input image. To solve this, DINO [22] proposes a mixed query selection mechanism, where the query boxes are dynamically selected based on the image features.

In this paper, multi-scale object queries are designed, which contain the information of the template image. The tracked boxes are further used as additional query boxes. Object detection and tracking are performed simultaneously.

Figure 2: Overview of Siamese-DETR in the training stage. The multi-scale object queries are decoupled into learnable query boxes and query contents. The query contents are mapped from the multi-scale features extracted from the template image by the backbone network. The model is trained with Hungarian loss [48] and the proposed dynamic matching training strategy which turns the provided annotations into positive and negative samples dynamically according to the given template image. For simplicity, the optimized query denoising is not presented in the figure.

III Methodology

In this section, we first present the overall architecture of Siamese-DETR. Then, we introduce the multi-scale object queries for the detection of objects that share the same category with the template image. Next, we introduce the dynamic matching training strategy that trains Siamese-DETR on commonly used detection datasets. Furthermore, we show how to apply Siamese-DETR to online tracking straightforwardly and simply in a tracking-by-query manner. Finally, the training details of Siamese-DETR are presented.

III-A Overview

The overview of the proposed Siamese-DETR in the training stage is shown in Fig. 2. Siamese-DETR contains a backbone network (e.g., Swin Transformer [49]), a transformer (including the encoder and the decoder) [42, 50], a detection head for classification and box regression (the same with DINO [22]), and a set of object queries. In order to detect objects of different scales that share the same category with the template image, the Multi-Scale Object Queries (MSOQ, Section III-B) are generated based on the template image. Since no well-annotated training data is available for TIMOT, we design a Dynamic Matching Training Strategy (DMTS, Section III-C), which supports the training of Siamese-DETR on commonly used detection datasets (e.g., COCO [11]). The provided annotations are fully utilized more than once when multiple template images are provided for training. During the inference stage, objects are tracked in a Track-by-Query (TbQ, Section III-D) manner, as shown in Fig. 1. The tracked boxes in the previous frame are used as additional query boxes to track corresponding objects. A simple NMS operation, rather than the complex data association algorithm, is adopted to remove some duplicated boxes. To make Siamese-DETR compatible with such a tracking strategy, the query denoising [24] is adopted and optimized to train Siamese-DETR.

III-B Multi-scale Object Queries

Following previous works [25, 22, 24, 51], we use decoupled object queries. Formally, let $Q=\{q_n \mid q_n=(\mathbf{q}_{c_n},\mathbf{q}_{b_n}),\ n=0,1,\dots,N-1\}$ be the set of object queries, where $N$ is the number of queries. For each query $q_n=(\mathbf{q}_{c_n},\mathbf{q}_{b_n})$, the query content $\mathbf{q}_{c_n}\in\mathbb{R}^{D}$ is a feature vector with dimensionality $D$, and the query box $\mathbf{q}_{b_n}\in\mathbb{R}^{4}$ is represented by its center coordinates, width, and height. In DETR variants [23, 24, 25, 22], the query contents are usually a set of parameters learned by the model. Such a design works for object detection with closed-set categories, but it is not suitable for generic object detection/tracking, where the category of the template image provided at inference time may be unseen during training.

In order to detect all objects that share the same category as the template image, we derive the query contents from the template image; more specifically, the features extracted from the template image are used as the query contents. Our hypothesis is that the query contents store the semantic information of objects, e.g., intra-category common ground, which is vital for object detection. On the other hand, objects in the same scene vary a lot in scale even if they share the same category. To handle this, multi-scale features are extracted from the template image and used as multi-scale query contents. Formally, let $F=\{\mathbf{f}_s \mid s=0,1,\dots,S-1\}$ be the set of multi-scale feature maps extracted by the backbone network (e.g., Swin Transformer [49]) from the given template image. We first obtain the feature vectors $\hat{F}=\{\hat{\mathbf{f}}_s \mid s=0,1,\dots,S-1\}$ by spatially average pooling the feature maps:

\hat{\mathbf{f}}_s = {\rm AvgPool}(\mathbf{f}_s). \qquad (1)

For the $n$-th object query, its content $\mathbf{q}_{c_n}$ is determined by:

\mathbf{q}_{c_n} = \hat{\mathbf{f}}_{n \bmod S}, \qquad (2)

where $n \bmod S$ is the index into $\hat{F}$. As for the query boxes, they are a set of learnable parameters optimized in the training stage following previous works [25, 24], which means that different template images share the same query boxes.
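Below is a minimal PyTorch sketch of how the multi-scale query contents can be built from the template features following Eqs. (1)-(2); the function and variable names are illustrative assumptions rather than the authors' released code.

```python
import torch

def build_query_contents(template_feats, num_queries):
    """template_feats: list of S feature maps from the template image,
    each of shape (1, D, H_s, W_s)."""
    # Eq. (1): spatial average pooling gives one D-dim vector per scale.
    pooled = [f.mean(dim=(2, 3)).squeeze(0) for f in template_feats]
    S = len(pooled)
    # Eq. (2): the n-th query content is the (n mod S)-th pooled feature.
    return torch.stack([pooled[n % S] for n in range(num_queries)], dim=0)  # (N, D)

# The query boxes stay a single learned set shared by all template images,
# e.g. N = 600 boxes in (cx, cy, w, h) form.
learned_query_boxes = torch.nn.Parameter(torch.rand(600, 4))
```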

III-C Dynamic Matching Training Strategy

Different from traditional MOT, where well-annotated training data is provided, there is no well-annotated training data available for TIMOT [8]. The common practice is to train the model on external datasets and then test it on the evaluation benchmark (i.e., GMOT-40 [8]). The existing method [8] uses multiple datasets to train its detector, including LaSOT [18], GOT-10K [19], and COCO [11]; however, the detection/tracking performance is far from satisfactory. In this paper, we design the Dynamic Matching Training Strategy (DMTS) to train Siamese-DETR on commonly used detection datasets. As will be seen in Section IV, Siamese-DETR surpasses the existing method [8] by a large margin in terms of both detection and tracking, even when trained only on COCO [11]. The superiority of the dynamic matching training strategy comes from two aspects: 1) utilizing all annotations even if they belong to different categories; 2) utilizing all annotations more than once in each training step.

III-C1 Utilizing All Annotations

Let $A=\{a_k \mid a_k=(\mathbf{b}_k, c_k),\ k=0,1,\dots,K-1\}$ be the set of annotations for the input training image, where $\mathbf{b}_k\in\mathbb{R}^{4}$ and $c_k\in\mathbb{Z}$ are the bounding box and category ID of the $k$-th object. We randomly sample a category ID from $\{c_0, c_1,\dots,c_{K-1}\}$, which is used as the category of the template image and denoted as $\hat{c}_t$. Given the category ID $\hat{c}_t$, the template image is cropped from another image in the training split. The corresponding annotations for the given template image and the category ID $\hat{c}_t$ are:

A^{\hat{c}_t} = \{a^{\hat{c}_t}_k \mid a^{\hat{c}_t}_k = (\mathbf{b}_k, \mathbbm{1}_{c_k,\hat{c}_t}),\ k=0,1,\dots,K-1\}, \qquad (3)

where:

\mathbbm{1}_{c_k,\hat{c}_t} = \begin{cases} 1 & \text{if } c_k = \hat{c}_t, \\ 0 & \text{otherwise}. \end{cases} \qquad (4)

As we can see, the boxes in $A^{\hat{c}_t}$ are divided into positive ($c_k=\hat{c}_t$) and negative ($c_k\neq\hat{c}_t$) samples, resulting in a two-category object detection task, as shown in Fig. 2. There is also a more naive setting that results in a single-category detection task: keeping only the boxes that share the same category as the template image and discarding the others, so that only positive samples are utilized. Though the latter seems more intuitive, it is not a good choice: it yields poorer detection performance because no negative samples are used during training, which weakens the model's ability to distinguish objects of interest from others.
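A sketch of this relabeling step is given below; the helper is a hypothetical illustration (the cropping of the template image from another training image is omitted), assuming the boxes and category IDs of one image are given as plain Python lists.

```python
import random

def relabel_for_template(boxes, category_ids):
    """boxes: list of [cx, cy, w, h]; category_ids: list of ints for one image."""
    template_cat = random.choice(category_ids)        # sampled category \hat{c}_t
    # Eq. (4): positive if the annotation has the template's category, else negative.
    binary_labels = [1 if c == template_cat else 0 for c in category_ids]
    # All K boxes are kept; the negatives are what teach the model to reject
    # objects of other categories.
    return template_cat, list(zip(boxes, binary_labels))
```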

III-C2 Utilizing All Annotations More Than Once

Considering that the template image and the multi-scale object queries take up only a small proportion of device memory compared with the input image and the whole model, we can provide multiple template images during training.

Let $\{\hat{c}_0, \hat{c}_1,\dots,\hat{c}_{T-1}\}$ be the randomly sampled category IDs of $T$ different template images. For each template image with category $\hat{c}_t$, the obtained multi-scale object queries are denoted as $Q^{\hat{c}_t}$ and are associated with the annotations $A^{\hat{c}_t}$. The $T$ groups of object queries $\{Q^{\hat{c}_0}, Q^{\hat{c}_1},\dots,Q^{\hat{c}_{T-1}}\}$ contain $N\times T$ object queries in total, which are concatenated and fed into the transformer for object detection within a single forward pass. Note that interactions between object queries are only allowed within each group; object queries in different groups cannot see each other. This can be implemented simply by providing an attention mask to the self-attention layers in the transformer decoder.
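The group-wise isolation can be realized with a block-diagonal attention mask, as in the sketch below. It assumes a decoder built on PyTorch-style attention, where True entries of the boolean mask are blocked.

```python
import torch

def build_group_mask(num_groups, queries_per_group):
    """Block attention across query groups; allow it within each group."""
    total = num_groups * queries_per_group
    mask = torch.ones(total, total, dtype=torch.bool)   # True = blocked
    for g in range(num_groups):
        s, e = g * queries_per_group, (g + 1) * queries_per_group
        mask[s:e, s:e] = False                           # within-group attention allowed
    return mask

# e.g., T = 7 template images with N = 600 queries each -> 4200 x 4200 mask
attn_mask = build_group_mask(num_groups=7, queries_per_group=600)
```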

III-D Tracking-by-Query

The common tracking-by-detection paradigm usually contains two stages: 1) object detection, where the objects of interest are first detected by the detector; and 2) data association, where the trajectories are obtained by matching objects across frames. However, performing data association properly is non-trivial, since it involves computing an affinity matrix between objects, setting an affinity threshold that prevents wrong associations, and so on. In this paper, we instead use the tracked boxes in the previous frame as additional query boxes to track the corresponding objects, and we optimize query denoising to mimic tracking scenarios on static images.

III-D1 Tracked Boxes as Additional Query Boxes

Let $B=\{\hat{\mathbf{b}}_m \mid m=0,1,\dots,M-1\}$ be the set of tracked boxes in the previous frame. We construct additional object queries as follows:

\hat{Q} = \hat{Q}_0 \cup \hat{Q}_1 \cup \dots \cup \hat{Q}_{S-1}, \qquad (5)

where $\hat{Q}_s$ is the subset of additional object queries constructed for the $s$-th scale:

\hat{Q}_s = \{\hat{q}_{s,m} \mid \hat{q}_{s,m} = (\hat{\mathbf{f}}_s, \hat{\mathbf{b}}_m),\ m=0,1,\dots,M-1\}. \qquad (6)

Different from the object queries $Q$, which are responsible for object detection, $\hat{Q}$ is used for object tracking. The idea behind this design is that the category-aware information conveyed by the template features is embedded into the query contents, while the tracked boxes in the previous frame are close enough to the corresponding objects in the current frame. While tracking online, the object queries in $Q$ and $\hat{Q}$ are concatenated and fed into the transformer together; they detect and track objects simultaneously but independently. For object tracking, different object instances are distinguished by their corresponding tracked boxes. To avoid interactions between the object queries in $Q$ and $\hat{Q}$, an attention mask is provided to each self-attention layer in the transformer decoder.

For each box $\hat{\mathbf{b}}_m$, $S$ tracked boxes are produced by the object queries $\{\hat{q}_{0,m}, \hat{q}_{1,m},\dots,\hat{q}_{S-1,m}\}$. Among these $S$ boxes, the one with the largest Intersection over Union (IoU) with $\hat{\mathbf{b}}_m$ is selected. If the classification score of the selected box is higher than the predefined confidence threshold, it is kept as the tracking result for $\hat{\mathbf{b}}_m$; otherwise, the corresponding object is treated as disappeared. Since the object queries in $Q$ and $\hat{Q}$ detect and track objects independently, the detection boxes from queries in $Q$ may duplicate those from queries in $\hat{Q}$. Following the MOT method Tracktor [1], an NMS operation is used to remove the duplicated detection boxes. The remaining detection boxes whose classification scores exceed the predefined confidence threshold are treated as newly appeared objects.
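One online tracking step can be sketched as below. The siamese_detr call signature, its output shapes, and the corner-format boxes are assumptions made for illustration; the per-box scale selection and confidence test follow the description above, and a simple IoU test against the kept tracks stands in for the NMS-based duplicate removal.

```python
import torch
from torchvision.ops import box_iou

def track_step(siamese_detr, frame, query_contents, learned_boxes, tracked_boxes,
               conf_thr=0.4, dup_iou_thr=0.5):
    """query_contents: (S, D) pooled template features; learned_boxes and
    tracked_boxes: boxes in (x1, y1, x2, y2) format."""
    S, M = query_contents.shape[0], tracked_boxes.shape[0]
    # Eqs. (5)-(6): pair every tracked box with each of the S query contents.
    track_contents = query_contents.repeat_interleave(M, dim=0)   # (S*M, D)
    track_boxes = tracked_boxes.repeat(S, 1)                      # (S*M, 4)
    # Detection and tracking queries are decoded together; the attention mask
    # keeping them independent is omitted here.
    det_boxes, det_scores, trk_boxes, trk_scores = siamese_detr(
        frame, query_contents, learned_boxes, track_contents, track_boxes)

    # For each previous box, keep the scale whose output overlaps it most.
    trk_boxes, trk_scores = trk_boxes.view(S, M, 4), trk_scores.view(S, M)
    kept = []
    for m in range(M):
        ious = box_iou(trk_boxes[:, m], tracked_boxes[m:m + 1]).squeeze(1)  # (S,)
        s = int(ious.argmax())
        if trk_scores[s, m] > conf_thr:        # below threshold -> object disappeared
            kept.append(trk_boxes[s, m])
    tracks = torch.stack(kept) if kept else tracked_boxes.new_zeros(0, 4)

    # New objects: confident detections that do not duplicate a tracked box.
    new_boxes = det_boxes[det_scores > conf_thr]
    if len(tracks) and len(new_boxes):
        overlap = box_iou(new_boxes, tracks).max(dim=1).values
        new_boxes = new_boxes[overlap < dup_iou_thr]
    return tracks, new_boxes
```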

Figure 3: Illustration of query denoising. (a) Input image and template image. (b) Original query denoising [24], which conflicts with TIMOT: the noisy object queries are classified according to the labeled category IDs associated with the query boxes, without taking the noisy query contents into consideration. (c) Optimized query denoising: the noisy object queries are classified according to the matching results between the query contents and the noisy query boxes. The numbers 1 and 0 denote that the model tries to classify the object queries as positive and negative samples, while the markers ✗ and ✓ indicate whether the classification is wrong or right.

Figure 4: Qualitative comparison of different methods: (a) Siamese-DETR + TbQ; (b) Siamese-DETR + SORT [20]; (c) GLIP-T (B) [44] + TbQ; (d) GLIP-T (B) [44] + SORT [20]. Two points can be drawn: 1) when combined with the same tracker (e.g., TbQ), our Siamese-DETR tracks more objects than GLIP-T (B) [44]; 2) based on the same detector, our TbQ pipeline also tracks more objects than SORT [20]. The Siamese-DETR trained on COCO [11] with Swin-T [49] as the backbone network is evaluated.

III-D2 Optimized Query Denoising

Since there is no well-annotated video data for training, a model trained on static images finds it hard to track objects with tracked boxes during online tracking. In addition, the number of objects of interest varies from frame to frame, which introduces further challenges. To handle this, we add noise to the ground-truth boxes and use them as the tracked boxes (i.e., additional query boxes) during training. The objects are tracked by their corresponding noisy query boxes, and Siamese-DETR is thereby trained to handle a varying number of objects. With this mimicked tracking scenario in the training stage, Siamese-DETR is able to track objects with the proposed tracking-by-query strategy. As in the online tracking stage, the query groups for detection and tracking are kept independent of each other during training.

The aforementioned strategy is similar to query denoising [24], which is commonly used in DETR variants. However, we find that existing query denoising is not suitable for the TIMOT task, because besides the box noise in the query boxes, there is also category noise in the query contents. Specifically, the category ID of a ground-truth box is randomly switched to another category ID; for each category ID, an embedding vector is learned by the detector and used as the noisy query content. During training, the noisy object queries are classified based on the labeled category IDs associated with the query boxes, without taking the query contents into consideration. This raises two conflicts when applied to Siamese-DETR: 1) Siamese-DETR performs a two-category detection task, so learning two embedding vectors for positive and negative samples is not suitable, since which samples are positive or negative changes dynamically with the provided template image (i.e., the query contents); 2) a positive noisy query box may be paired with a negative query content yet still be classified as a positive sample in the original query denoising [24], whereas in Siamese-DETR an object query is positive only when its query content and query box match each other.

To avoid these conflicts, we optimize query denoising by pairing all noisy query boxes with positive query contents (i.e., the features extracted from the template image). With this optimized query denoising, noisy object queries are classified according to the matching results between query contents and query boxes. The difference between the original query denoising [24] and the optimized version is illustrated in Fig. 3.
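A sketch of how the optimized denoising queries can be constructed is shown below. The uniform box-jitter noise model is an assumption; the essential point is that every noisy box is paired with the positive template content, and its target is the binary label of the underlying ground-truth box.

```python
import torch

def make_denoising_queries(gt_boxes, gt_binary_labels, template_content, box_noise=0.2):
    """gt_boxes: (K, 4) as (cx, cy, w, h) in [0, 1]; gt_binary_labels: (K,) in {0, 1};
    template_content: (D,) pooled feature of the template image."""
    # Jitter each box proportionally to its own width/height (assumed noise model).
    scale = gt_boxes[:, 2:].repeat(1, 2)                       # (w, h, w, h)
    noise = (torch.rand_like(gt_boxes) * 2 - 1) * box_noise
    noisy_boxes = (gt_boxes + noise * scale).clamp(0, 1)
    # Every noisy box shares the positive query content from the template image.
    contents = template_content.unsqueeze(0).expand(gt_boxes.shape[0], -1)
    # Target: positive only when content and box match, i.e. the underlying
    # ground-truth object has the template's category.
    targets = gt_binary_labels.float()
    return contents, noisy_boxes, targets
```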

III-E Training of Siamese-DETR

The proposed Siamese-DETR is trained on a commonly used detection dataset, i.e., COCO [11]. Given a training image, we first randomly sample $T$ category IDs from its annotations. For each category ID, we then randomly sample a box of that category from the annotations of another image, which is itself randomly sampled from the training split. Finally, the template images are cropped from the sampled images with the sampled boxes. For the $t$-th category ID $\hat{c}_t$, besides the object queries $Q^{\hat{c}_t}$, we also construct the additional object queries $\hat{Q}^{\hat{c}_t} = \hat{Q}^{\hat{c}_t}_0 \cup \hat{Q}^{\hat{c}_t}_1 \cup \dots \cup \hat{Q}^{\hat{c}_t}_{S-1}$ with the noisy ground-truth boxes to mimic tracking scenarios. The overall training loss is:

\mathcal{L} = \frac{1}{T}\sum_{t=0}^{T-1}\Big(\mathcal{L}_{\rm H}(A^{\hat{c}_t}, \hat{A}^{Q^{\hat{c}_t}}) + \frac{1}{S}\sum_{s=0}^{S-1}\mathcal{L}_{\rm R}(A^{\hat{c}_t}, \hat{A}^{\hat{Q}^{\hat{c}_t}_s})\Big), \qquad (7)

where $\hat{A}^{Q^{\hat{c}_t}}$ and $\hat{A}^{\hat{Q}^{\hat{c}_t}_s}$ denote the predictions from the object queries in $Q^{\hat{c}_t}$ and $\hat{Q}^{\hat{c}_t}_s$, and $\mathcal{L}_{\rm H}(\cdot,\cdot)$ and $\mathcal{L}_{\rm R}(\cdot,\cdot)$ are the Hungarian loss [48] and the reconstruction loss [24], respectively. Note that the noise-free ground-truth boxes are used when computing the loss.
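The loss of Eq. (7) can be assembled as in the sketch below, where hungarian_loss and reconstruction_loss stand for the Hungarian loss of [48] and the denoising reconstruction loss of [24]; their internals are not reproduced here.

```python
def total_loss(det_preds, dn_preds, targets, hungarian_loss, reconstruction_loss):
    """det_preds[t]: predictions of query group Q^{c_t};
    dn_preds[t][s]: predictions of the denoising group at scale s;
    targets[t]: the relabelled annotations A^{c_t} (noise-free boxes)."""
    T = len(det_preds)
    loss = 0.0
    for t in range(T):
        l_t = hungarian_loss(det_preds[t], targets[t])          # L_H term
        S = len(dn_preds[t])
        l_t = l_t + sum(reconstruction_loss(dn_preds[t][s], targets[t])
                        for s in range(S)) / S                  # averaged L_R terms
        loss = loss + l_t
    return loss / T                                              # average over templates
```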

IV Experiments

In this section, the involved datasets and metrics are firstly introduced, followed by the implementation details. Then, we compare Siamese-DETR with existing multi-object tracking methods. Finally, some discussions are provided to show the effectiveness of different components.

IV-A Datasets and Metrics

We follow the setup of the template-image-based MOT task [8] to compare the proposed Siamese-DETR with other methods. Specifically, a tracker is tested on all videos in the GMOT-40 benchmark [8] and can be trained on any other dataset except GMOT-40. It is worth noting that template-image-based MOT trackers not only need to track all objects of the same category as the template image in the video but also need to maintain the identity of each object. Because each video in GMOT-40 contains only one category, the Urban Tracker [52] dataset is used to analyze the capability of Siamese-DETR in multi-category multi-object tracking. In this work, the commonly used object detection dataset COCO [11] is mainly used to train Siamese-DETR; other datasets, such as LVIS [46] and Objects365 [47], are used for more detailed analysis. Though the categories in GMOT-40 and Urban Tracker may be seen in the training datasets (e.g., COCO, LVIS, Objects365), this complies with the setup of TIMOT [8].

GMOT-40 contains 40 videos that consist of 10 different object categories with 4 videos for each category. The entire dataset contains 9.6K frames, where 85.28% of them contain more than 10 objects. The videos are shot with FPS ranging from 24 to 30. All the videos are used for evaluation.

Urban Tracker has 4 outdoor videos, each containing more than one category, resulting in 4 categories in total. The videos are captured at 25 or 30 FPS. All 4 outdoor videos are used for evaluation.

COCO is widely used for object detection. It contains a total number of 118K images and 860K annotated instances for training. There are 80 different categories in COCO, such as person, car, dog, and so on. Note that only the category IDs and bounding boxes are used in this work.

LVIS shares nearly the same images with COCO but provides more fine-grained annotations. It contains 100K images and 1.27M instances for training. The number of annotated categories is 1203, providing more fine-grained category annotations than COCO. Note that the used bounding boxes are obtained from the annotated instance-level masks since there are no annotated bounding boxes in LVIS.

Objects365 contains 0.6M images and 8.54M instances for training. The number of annotated categories is 365. We use Objects365 to show that more training data can boost the performance of Siamese-DETR.

We adopt the standard multi-object tracking metrics for evaluation, including Multi-Object Tracking Accuracy (MOTA) [53], IDentity F1 Score (IDF1), Mostly Tracked objects (MT), Mostly Lost objects (ML), number of False Positives (FP), number of False Negatives (FN), and number of Identity Switches (IDSw) [54]. Some other metrics, including mean Average Precision at an IoU threshold of 0.5 (mAP@0.5) and mean Average Recall (mAR), are also adopted for the evaluation of object detection.

TABLE I: The details of compared detection methods. Note that the extra costs of FLOPs and inference time in the processing of the template image or text prompt are not included since they only need to be processed in the first frame and the costs are negligible when the number of frames is large enough. Time consumption is evaluated on a workstation with a 3.9GHz CPU and an RTX 3090 GPU. The best results are shown in bold with underline.
Detection Methods Backbone Neck Backbone Pre-training Dataset (Vision-)Language Model Inference Resolution Parameters (M) FLOPs (G) Inference Time (ms)
YOLOv5l6 [55] CSP-DarkNet53 [56] SPPF [57], CSP-PAN [58] ✗ ✗ 1280×1280 76.8 112.4 16.7
DINO [22] ResNet50 [59] DETR’s Neck [48] ImageNet [60] ✗ 800×1200 45.2 261.9 87.5
Conditional DETR [61] ResNet50 [59] DETR’s Neck [48] ImageNet [60] ✗ 800×1200 43.2 88.9 44.4
OVTrack [7] ResNet50 [59] Faster RCNN’s Neck [43] ImageNet [60] CLIP [10] 800×1333 67.6 191.2 57.5
GLIP-T (B) [44] Swin-Tiny [49] DyHead’s Neck [45] ImageNet [60] BERT [9] 800×1333 195.2 322.4 158.7
GlobalTrack [17] ResNet50 [59] Faster RCNN’s Neck [43] ImageNet [60] ✗ 800×1333 41.3 169.9 53.7
Siamese-DETR (Ours, Swin-T) Swin-Tiny [49] DETR’s Neck [48] ImageNet [60] ✗ 800×1200 47.6 267.3 86.6
Siamese-DETR (Ours, Swin-B) Swin-Base [49] DETR’s Neck [48] ImageNet [60] ✗ 800×1200 108.2 542.7 139.9
TABLE II: The details of evaluated tracking methods. The inference times of different methods are evaluated based on the detection results of Siamese-DETR (Swin-T). Time consumption is evaluated on a workstation with a 3.9GHz CPU and an RTX 3090 GPU, and the detection time consumption is excluded. The best results are shown in bold with underline.
Tracking Methods Online Kalman Filter Appearance Cues Hierarchical Matching Inference Time (ms)
IOU [21] ✗ ✗ ✗ ✗ <0.01
SORT [20] ✓ ✓ ✗ ✗ 0.1
DeepSORT [3] ✓ ✓ ✓ ✓ 87.3
ByteTrack [5] ✓ ✓ ✓ ✓ 33.7
BoT-SORT [62] ✓ ✓ ✗ ✓ 9.5
TbQ (Ours, Swin-T) ✓ ✗ ✗ ✗ 2.2
TABLE III: Comparison with different methods on the GMOT-40 benchmark. Besides the TIMOT methods, methods with closed-set detectors and open-vocabulary detectors are also evaluated. Since the different methods follow the tracking-by-detection pipeline, we divide each of them into the combination of a detector and a tracker. When combining our TbQ tracking pipeline with other detectors, the detection results provided by the evaluated detectors are fed into Siamese-DETR frame by frame, where the set of object queries $Q$ is removed and only $\hat{Q}$ is used for tracking. Both detection and tracking results are presented for a more comprehensive comparison. The best results are shown in bold with underline.
Detection Methods Language Models Training Datasets for Detection Methods Detection Results Tracking Methods Tracking Results
mAP@0.5\uparrow mAR\uparrow MOTA\uparrow IDF1\uparrow MT\uparrow ML\downarrow FP\downarrow FN\downarrow IDSw\downarrow
Closed-Set MOT Methods
YOLOv5l6 [55] (manual) ✗ COCO [11] 41.1% 38.6% + IOU [21] 22.7% 29.0% 257 1346 38752 161207 4861
+ SORT [20] 22.9% 31.2% 232 1311 39574 162323 2498
+ DeepSORT [3] 23.0% 32.4% 269 1283 40129 159833 2274
+ ByteTrack [5] 23.6% 34.4% 301 1261 38763 162037 3691
+ BoT-SORT [62] 23.9% 35.1% 298 1214 37991 158429 3437
+ TbQ (Ours, Swin-T) 24.5% 25.1% 284 1241 40032 152860 8431
DINO [22] (manual) ✗ COCO [11] 28.7% 30.1% + IOU [21] 21.7% 29.0% 257 1346 38752 161207 1789
+ SORT [20] 19.3% 23.1% 182 1437 41786 167923 1658
+ DeepSORT [3] 20.0% 23.4% 184 1390 42013 165689 1931
+ ByteTrack [5] 21.2% 25.7% 201 1376 43543 162480 1774
+ BoT-SORT [62] 21.1% 26.0% 199 1345 42081 163093 3762
+ TbQ (Ours, Swin-T) 21.9% 24.1% 215 1247 41678 163881 6597
Conditional DETR [61] (manual) ✗ COCO [11] 33.1% 22.0% + IOU [21] 17.7% 20.5% 161 1316 21109 184826 5056
+ SORT [20] 16.7% 22.7% 111 1423 15595 195583 2283
+ DeepSORT [3] 19.3% 28.7% 179 1296 23141 180550 3094
+ ByteTrack [5] 17.9% 30.9% 183 1129 32922 174627 3964
+ BoT-SORT [62] 19.1% 34.2% 232 1093 32866 170789 3716
+ TbQ (Ours, Swin-T) 19.2% 24.7% 230 1075 31317 163955 9499
Open-Vocabulary MOT Methods
OVTrack [7] ✓ LVIS [46] 31.7% 32.7% + IOU [21] 20.3% 18.8% 139 1257 50467 154367 1473
+ SORT [20] 18.9% 20.1% 145 1387 49850 158905 1578
+ DeepSORT [3] 20.2% 21.2% 165 1367 49984 160378 1470
+ ByteTrack [5] 19.9% 20.6% 164 1345 51356 156329 1669
+ BoT-SORT [62] 20.0% 20.3% 167 1328 45721 163378 3278
+ TbQ (Ours, Swin-T) 21.3% 18.7% 186 1304 50784 154893 6381
GLIP-T (B) [44] ✓ Objects365 [47] 50.8% 44.6% + IOU [21] 25.1% 39.3% 458 721 55802 139560 6320
+ SORT [20] 25.2% 40.8% 354 877 54891 143623 2987
+ DeepSORT [3] 25.5% 41.6% 401 877 46610 141330 2892
+ ByteTrack [5] 27.0% 45.1% 447 746 52591 131759 2706
+ BoT-SORT [62] 27.3% 49.1% 553 643 51308 133462 4675
+ TbQ (Ours, Swin-T) 27.5% 39.8% 581 592 49470 136602 9972
Template-Image-based MOT Methods
GlobalTrack [17] ✗ COCO [11], LaSOT [18], GOT-10K [19] 28.3% 18.3% + IOU [21] 11.8% 20.3% 56 1491 8299 216821 1668
+ SORT [20] 19.5% 30.3% 140 1187 15132 189315 1785
+ DeepSORT [3] 14.5% 24.4% 72 1363 9000 208818 1315
+ ByteTrack [5] 19.1% 32.1% 178 1069 23881 181829 1791
+ BoT-SORT [62] 19.4% 34.0% 251 978 22229 176991 7375
+ TbQ (Ours, Swin-T) 20.6% 27.4% 213 1066 13507 182376 6407
Siamese-DETR (Ours, Swin-T) ✗ COCO [11] 57.5% 46.6% + IOU [21] 30.7% 35.0% 361 759 42504 127024 8158
+ SORT [20] 30.1% 34.7% 235 943 29518 145613 4060
+ DeepSORT [3] 31.1% 41.8% 382 773 47336 124257 5131
+ ByteTrack [5] 33.7% 41.4% 331 764 53417 104765 4204
+ BoT-SORT [62] 34.1% 47.5% 431 674 45769 119288 6775
+ TbQ (Ours, Swin-T) 35.9% 42.8% 504 666 44882 107894 11664
Siamese-DETR (Ours, Swin-B) ✗ COCO [11] 63.3% 49.9% + TbQ (Ours, Swin-B) 39.4% 35.7% 482 586 33968 106079 10233
Objects365 [47] 69.6% 55.4% + TbQ (Ours, Swin-B) 50.0% 51.3% 1083 278 44390 68189 11252

IV-B Implementation Details

We use the Swin Transformer [49] as the backbone network. As in most DETR variants [24, 22], there are 6 encoder layers and 6 decoder layers in the transformer, and the hidden dimensionality is set to 256. Following the settings of Deformable-DETR [23], the number of feature scales $S$ is set to 4. The number of object queries $N$ is set to 600. Unless otherwise specified, all evaluated Siamese-DETR variants are optimized with AdamW [63] for 12 epochs. The batch size is set to 16 and the number of templates $T$ is set to 7 by default. The initial learning rate is set to $5\times 10^{-5}$ and decayed by a factor of 0.1 at epoch 11. The longer side of the template image is resized to 400 before being fed into the backbone network. While tracking online, the NMS threshold is set to 0.5 to remove duplicated detection boxes, and we follow the settings of TIMOT [8] to crop the template image from the first frame with a randomly sampled box for each video. In the following, Siamese-DETR denotes the model for both object detection and tracking if there is no ambiguity; otherwise, we use Siamese-DETR and TbQ to denote the detector and the tracker, respectively.
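For reference, these hyper-parameters can be collected into a configuration sketch as below; this is an illustration of the settings listed above, not the authors' released configuration file.

```python
import torch

config = dict(
    backbone="swin_tiny",        # Swin-T by default
    num_encoder_layers=6,
    num_decoder_layers=6,
    hidden_dim=256,
    num_feature_scales=4,        # S
    num_queries=600,             # N
    num_templates=7,             # T
    epochs=12,
    batch_size=16,
    lr=5e-5,                     # decayed by 0.1 at epoch 11
    template_long_side=400,
    nms_threshold=0.5,
)

def build_optimizer(model, cfg=config):
    optimizer = torch.optim.AdamW(model.parameters(), lr=cfg["lr"])
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[11], gamma=0.1)
    return optimizer, scheduler
```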

IV-C Comparison with Existing Methods

We compare the proposed method with several existing closed-set MOT methods, open-vocabulary MOT methods, and template-image-based MOT methods. It is worth noting that the model weights of existing methods provided by their authors are directly used for evaluation; this is fair because the GMOT-40 benchmark is only used for evaluation and is not used in the training of any method (including ours). Since the different types of MOT methods all follow a tracking-by-detection pipeline, we divide each evaluated method into a detector and a tracker. The details of the different detectors and trackers are shown in Tab. I and Tab. II, respectively. For a comprehensive comparison, we not only apply our TbQ tracking strategy to different detectors but also apply different trackers to our detector. The detection and tracking results of the different methods on GMOT-40 [8] are provided in Tab. III.

IV-C1 Comparison of Details

From Tab. I, we can see that different detectors are not strictly constrained to have the same backbone network, detection neck, etc. Compared with closed-set detectors (i.e., YOLOv5l6 [55], DINO [22], Conditional DETR [61]), the open-vocabulary detectors (i.e., OVTrack [7] and GLIP-T [44]) and template-image-based detectors (i.e., GlobalTrack [17] and Siamese-DETR) need extra costs of FLOPs and inference time to process the text prompts or template images. However, the extra costs are negligible when the number of frames is large enough since the text prompt and template image only need to be processed in the first frame. Note that the (vision-)language models are also necessary for open-vocabulary methods. Our Siamese-DETR (Swin-T) has almost the same number of parameters, FLOPs, and inference time as DINO. From Tab. II, it can be observed that the commonly used Kalman Filter, appearance cues and hierarchical matching strategies are not used in our TbQ. The time consumption of TbQ for tracking is much less than that of detection for the reason that TbQ only introduces a few additional object queries for tracking, which is performed simultaneously with detection.

IV-C2 Comparison of Detection Performance

We first show the detection performance of different methods on GMOT-40 [8]. Closed-set methods, e.g., YOLOv5 [55] (specifically, YOLOv5l6), DINO [22], and Conditional DETR [61], tend to detect all objects belonging to the categories in the pre-defined closed set, which results in poor detection performance because only the objects of one specific category are treated as foreground. To make these closed-set methods compatible with the TIMOT setting, we manually remove the predicted boxes that do not share the category of the template image for each video (denoted as YOLOv5l6 (manual), DINO (manual), and Conditional DETR (manual)). As expected, Siamese-DETR (Swin-T) outperforms YOLOv5l6 (manual), DINO (manual), and Conditional DETR (manual) by a large margin. For example, when trained on the same dataset (i.e., COCO [11]), Siamese-DETR (Swin-T) achieves 16.4%, 28.8%, and 24.4% higher mAP@0.5 than YOLOv5l6 (manual), DINO (manual), and Conditional DETR (manual), respectively. YOLOv5l6 (manual) performs better than DINO (manual) and Conditional DETR (manual); the reason is that the objects in GMOT-40 are much smaller than those in COCO, and DETR-based detectors (DINO and Conditional DETR) cannot handle small objects well [48]. Interestingly, compared with DINO (manual), Conditional DETR (manual) achieves better mAP@0.5 but much poorer mAR, resulting in poorer tracking results.

Open-vocabulary methods can detect objects of interest given different text prompts. However, due to the domain gap between vision and language, they need well pre-trained language models to extract features from the given text prompts. In addition, to recognize the correct objects from different text prompts, fine-grained category annotations are required. For example, GLIP-T (B) [44] utilizes the pre-trained BERT [9] to extract text features, and the detection model is trained on Objects365 [47], whose number of categories is increased from 365 to 1300+ by the authors to provide more fine-grained annotations. During testing, we follow the settings of OVTrack and GLIP-T (B) and use the category names as the text prompts for object detection. It can be observed that, without the help of a well pre-trained language model and fine-grained category annotations, our Siamese-DETR (Swin-T) outperforms OVTrack and GLIP-T (B) by a large margin when only the COCO [11] dataset is used for training.

The template-image-based method GlobalTrack [17] was originally designed for the Single Object Tracking (SOT) task. It is implemented on top of a traditional detector [43] and can detect objects of interest within the whole image, since it computes correlation scores between the features extracted from the template image and those of the whole image; a high score indicates a potential object. Although there are many other SOT methods (e.g., the transformer-based TransT [64] and TrDiMP [65]), they are not suitable for the template-image-based MOT task because they only accept a small search-region image rather than the whole image when tracking a single object; directly feeding the whole image to these SOT methods produces poor detection results (i.e., mAP@0.5 = 0.0%). Though multiple datasets are used to train GlobalTrack, its performance remains poor. For example, our COCO-trained Siamese-DETR (Swin-T) achieves 29.2% higher mAP@0.5 than GlobalTrack.

Lastly, we adopt a larger backbone network and train Siamese-DETR with more data to show its scalability. It can be seen that: 1) with the same training data (i.e., COCO), mAP@0.5 increases from 57.5% to 63.3% when the backbone network is switched from Swin-T to Swin-B; 2) with the same model (Siamese-DETR (Swin-B)), mAP@0.5 is further improved to 69.3% when Objects365 is used for training. These results demonstrate that the detection performance of Siamese-DETR can be boosted by a larger model or more training data.

TABLE IV: Impact of the number of scales in Multi-Scale Object Queries (MSOQ). Except for the different number of scales in object queries, all models are trained with 1 template image (refer to Section III-C2) and the negative samples are removed (refer to Section III-C1). The query denoising (refer to Section III-D2) is not utilized. The best results are shown in bold with underline.
Number of Scales Detection Results Tracking Results
Small Medium Large Overall MOTA\uparrow IDF1\uparrow MT\uparrow ML\downarrow FP\downarrow FN\downarrow IDSw \downarrow
mAP@0.5\uparrow mAR\uparrow mAP@0.5\uparrow mAR\uparrow mAP@0.5\uparrow mAR\uparrow mAP@0.5\uparrow mAR\uparrow
1 11.0% 23.4% 37.7% 49.8% 42.3% 58.7% 35.9% 28.4% 19.3% 18.5% 132 1097 23213 175968 12622
2 13.1% 35.2% 40.0% 55.2% 48.9% 71.0% 39.7% 33.9% 20.2% 19.5% 138 1084 24372 170329 12566
3 16.9% 36.8% 42.9% 57.5% 50.2% 71.2% 41.6% 35.5% 20.5% 20.4% 146 1069 25761 165839 11893
4 23.3% 45.8% 45.9% 60.3% 55.4% 77.8% 47.9% 43.5% 23.4% 22.3% 157 1001 34583 155681 18562

IV-C3 Comparison of Tracking Performance

First, we show the generalization ability of the proposed tracking strategy TbQ by applying it to different detectors. Specifically, while tracking online, the detection results of each detector are fed into Siamese-DETR frame by frame, where the set of object queries $Q$ is removed and only $\hat{Q}$ is used for tracking. Since the tracking pipeline follows the tracking-by-detection paradigm, different tracking performances are achieved with different detectors. Compared with GlobalTrack [17], MOTA is increased from 20.6% to 35.9% by our Siamese-DETR (Swin-T). However, we find that this improvement of Siamese-DETR (Swin-T) mainly comes from the lower FN. Based on this finding, we tried to reduce the number of false negatives (FN) of GlobalTrack, but failed, because GlobalTrack produces very low and similar confidence scores for most of its predicted boxes.

Then we apply different trackers to a specific detector. Taking the proposed Siamese-DETR (Swin-T) as an example, our TbQ achieves the best MOTA among all trackers, even though TbQ is much simpler than the others (refer to Tab. II). For example, TbQ achieves 35.9% MOTA, which is higher than the 33.7% and 34.1% MOTA achieved by ByteTrack [5] and BoT-SORT [62]. It is worth noting that ByteTrack and BoT-SORT are recently proposed trackers for traditional multi-object tracking and achieve remarkable performance on the MOTChallenge datasets (https://motchallenge.net/). But they are complex and contain many hyper-parameters, e.g., the confidence scores and matching thresholds of the two-stage matching strategy. All these hyper-parameters are well tuned for pedestrian tracking, and they are even tuned per video. In our experiments, directly using the parameters tuned for MOTChallenge on GMOT-40 produces very poor tracking performance (i.e., MOTA < 0). Though we tried our best to tune these parameters for GMOT-40, the tracking performance of ByteTrack and BoT-SORT still lags behind that of TbQ, which may be caused by the domain gap between the GMOT-40 and MOTChallenge datasets. Different from ByteTrack and BoT-SORT, our TbQ tracks objects without bells and whistles yet achieves better overall tracking performance (specifically, MOTA).

Through deeper analysis, we find that TbQ sometimes produces worse IDSw or IDF1 scores than other tracking methods. For example, when applying TbQ and SORT to GlobalTrack, the IDSw of TbQ is higher than that of SORT (6407 vs. 1315) and the IDF1 of TbQ is lower than that of SORT (27.4% vs. 30.3%). The reasons are twofold: 1) IDSw and IDF1 are related to the number of tracked objects and trajectory segments; since TbQ tracks more objects (higher MT), it tends to produce a higher IDSw and a lower IDF1; 2) TbQ has weaker discriminability than existing tracking methods. This is expected, since TbQ tracks objects without bells and whistles; for example, the common Kalman filter, appearance cues, and hierarchical matching are not used.

Some qualitative tracking results are shown in Fig. 4. It can be seen that Siamese-DETR (Swin-T) tracks more interested objects than GLIP-T (B) when combined with the same tracker, and that TbQ tracks more objects than SORT when the same detection results are used, demonstrating the effectiveness of our Siamese-DETR and TbQ.

IV-D Discussions

In the following, Swin-T [49] is used as the backbone network unless otherwise specified.

IV-D1 Multi-Scale Object Queries

In Siamese-DETR, the multi-scale features extracted from the template image are used as the query contents in order to detect different scales of objects that share the same category with the template image. To show the effectiveness of MSOQ, we design different counterparts that use different numbers of scales, i.e., S \in \{1, 2, 3, 4\}. For a specific S, the S feature maps that have the smallest spatial size are used. The results are shown in Tab. IV. As we can see, both detection and tracking performances are improved when more scales of features are used. Specifically, compared with the results of 1-scale object queries, 4-scale object queries boost the mAP@50 and MOTA by 12.0% and 4.1%. With the help of multi-scale features, objects of different scales are more easily detected and recognized. For example, when the number of scales is increased from 1 to 4, the mAP@50 scores for small, medium and large objects are improved by 12.3%, 8.2% and 13.1%, respectively.
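As a rough illustration of this design, the sketch below (Python/PyTorch) builds S-scale query contents from the template image by pooling the S coarsest feature maps of a shared backbone. The pooling choice and the query layout are assumptions made for illustration, not the exact construction used in Siamese-DETR.

import torch
import torch.nn.functional as F

def build_multiscale_query_contents(backbone, template, num_scales=4, queries_per_scale=300):
    # The template image is encoded by the shared backbone into a feature
    # pyramid; each of the S feature maps with the smallest spatial size is
    # pooled into one content vector, which is repeated as the content part
    # of the object queries for that scale.
    feature_pyramid = backbone(template)            # list of (1, C, H_i, W_i) maps (assumed)
    coarsest = sorted(feature_pyramid, key=lambda f: f.shape[-2] * f.shape[-1])[:num_scales]
    contents = []
    for feat in coarsest:
        vec = F.adaptive_avg_pool2d(feat, 1).flatten(1)        # (1, C)
        contents.append(vec.repeat(queries_per_scale, 1))      # (N, C)
    return torch.cat(contents, dim=0)               # (S * N, C) query contents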

TABLE V: Impact of Dynamic Matching Training Strategy (DMTS). The models are trained without query denoising (refer to Section III-D2). The best results are shown in bold with underline.
DMTS Detection Results Tracking Results
Utilizing all annotations Number of templates mAP@50\uparrow mAR\uparrow MOTA\uparrow IDF1\uparrow MT\uparrow ML\downarrow FP\downarrow FN\downarrow IDSw\downarrow
× 1 47.9% 43.5% 23.4% 22.3% 157 1001 34583 155681 18562
✓ 1 48.6% 43.1% 23.7% 21.3% 142 1031 33583 156411 18055
✓ 2 46.0% 41.3% 22.7% 21.1% 137 1072 33861 158935 18239
✓ 3 50.9% 44.4% 24.6% 22.4% 186 953 32402 150154 17715
✓ 4 52.5% 44.9% 26.3% 24.3% 283 872 30374 142489 16993
✓ 5 53.2% 45.1% 26.6% 24.4% 299 831 32712 139210 16918
✓ 6 52.7% 43.9% 26.3% 24.3% 284 875 30384 141489 16974
✓ 7 54.9% 46.3% 27.8% 25.1% 316 764 33169 133245 17542
✓ 8 51.2% 44.3% 24.6% 23.4% 286 862 35027 142154 17366
✓ 9 53.1% 44.5% 26.2% 24.8% 308 842 31600 140580 16695
TABLE VI: Effectiveness of query denoising. The models are trained with DMTS, where the number of template images is set to 7. The best results are shown in bold with underline.
Query denoising Detection Results Tracking Results GT boxes as Query Boxes
mAP@50\uparrow mAR\uparrow MOTA\uparrow IDF1\uparrow MT\uparrow ML\downarrow FP\downarrow FN\downarrow IDSw\downarrow Avg. Conf.\uparrow Avg. IoU\uparrow
× 54.9% 46.3% 27.8% 25.1% 316 764 33169 133245 17542 0.101 0.021
Original [22] 55.4% 46.3% 28.4% 27.1% 319 837 32847 137591 13197 0.093 0.103
Optimized (Ours) 57.5% 46.6% 35.9% 42.8% 504 666 44882 107894 11664 0.426 0.730
TABLE VII: Impact of different training datasets. The models are equipped with MSOQ (Section III-B) and trained with DMTS (7 template images, Section III-C) and optimized query denoising (Section III-D2). The best results are shown in bold with underline.
Methods Datasets Detection Results Tracking Results
mAP@50\uparrow mAR\uparrow MOTA\uparrow IDF1\uparrow MT\uparrow ML\downarrow FP\downarrow FN\downarrow IDSw\downarrow
GLIP-T (B) [44] + TbQ (Ours, Swin-T) COCO [11] 30.5% 23.4% 19.6% 18.4% 108 1436 24067 189542 3671
LVIS [46] 38.8% 30.2% 22.3% 21.3% 116 1361 26478 178435 6549
Objects365 [47] 50.8% 44.6% 27.5% 39.8% 581 592 49470 136602 9972
Siamese-DETR (Ours, Swin-T) COCO [11] 57.5% 46.6% 35.9% 42.8% 504 666 44882 107894 11664
LVIS [46] 56.5% 42.5% 30.3% 39.5% 845 303 32471 136960 14653
Objects365 [47] 59.3% 48.0% 40.8% 44.2% 668 519 39300 98199 14186
Siamese-DETR (Ours, Swin-B) COCO [11] 65.6% 50.6% 43.1% 46.6% 681 466 49478 84765 13723
LVIS [46] 62.4% 45.0% 33.8% 41.1% 415 562 38285 119800 12312
Objects365 [47] 69.6% 55.4% 50.0% 51.3% 1083 278 44390 68189 11252
TABLE VIII: Tracking results of different methods. Since the DETR-based TrackFormer is a closed-set tracking method mainly designed for pedestrian tracking, only the pedestrian/person videos in GMOT-40 [8] are used for evaluation. The best results are shown in bold with underline.
Methods Datasets for Training Detection Results Tracking Results
mAP@50\uparrow mAR\uparrow MOTA\uparrow IDF1\uparrow MT\uparrow ML\downarrow FP\downarrow FN\downarrow IDSw\downarrow
TrackFormer [66] MOT17 [67] 28.4% 31.8% -8.5% 36.4% 39 45 13500 10990 86
Siamese-DETR (Ours, Swin-T) COCO [11] 73.1% 62.3% 43.7% 29.5% 43 12 4661 6493 1609

IV-D2 Dynamic Matching Training Strategy

The dynamic matching training strategy is designed to efficiently train Siamese-DETR on commonly used detection datasets by utilizing all annotations and by utilizing all annotations more than once. The results are shown in Tab. V. When all annotations are used, Siamese-DETR performs a two-category object detection task. The mAP@50 and MOTA scores are improved by 0.7% and 0.3%, respectively. However, mAR is reduced by 0.4%, which is reasonable since an object is more likely to be classified as background when negative samples are introduced to train the model (refer to Section III-C).

Utilizing all annotations more than once is implemented by using more than one template image for training. We conduct extensive experiments to train Siamese-DETR with different numbers of template images. As we can see from Tab. V, Siamese-DETR achieves the best detection and tracking results when trained with 7 template images. Specifically, compared with the counterpart trained with 1 template image, utilizing 7 template images for training achieves 6.3% higher mAP@50 and 4.1% higher MOTA. By default, 7 template images are used for the training of Siamese-DETR.
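To make the two mechanisms concrete, the following is a hedged sketch of the label assignment behind DMTS: for each training image, several template categories are sampled (here, from the categories present in the image, which is an assumption of this sketch), and every ground-truth box is relabelled as positive or negative with respect to each sampled template, so all annotations are used, and used once per template.

import random

def dynamic_matching_labels(annotations, categories_in_image, num_templates=7):
    # annotations: list of {"box": [...], "category": int} (assumed format).
    # Each sampled template category induces a two-category (positive/negative)
    # relabelling of all ground-truth boxes in the image.
    templates = random.choices(categories_in_image, k=num_templates)
    per_template_targets = []
    for tpl_cat in templates:
        targets = [
            {"box": ann["box"], "label": 1 if ann["category"] == tpl_cat else 0}
            for ann in annotations
        ]
        per_template_targets.append({"template_category": tpl_cat, "targets": targets})
    return per_template_targets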

IV-D3 Tracking-by-Query

The effectiveness of our simple online tracking pipeline, TbQ, has been demonstrated in Tab. III and Section IV-C3 by comparing TbQ with other trackers. Here, we further show the effectiveness of the optimized query denoising. Results are shown in Tab. VI. As we can see, both the original query denoising [22] and the optimized query denoising are effective in improving the detection and tracking performance. However, as stated in Section III-C, the original query denoising introduces some conflicts with template-image-based object detection/tracking. With the help of our optimized query denoising, the tracking scenario is more accurately mimicked and the TbQ tracking pipeline is learned more effectively during the training stage, which brings a larger performance gain than the original query denoising.

To further study the impact of query denoising, we use ground-truth boxes as query boxes to detect objects. The average confidence score (Avg. Conf.) of the predicted boxes and the average IoU (Avg. IoU) between the predicted boxes and their corresponding ground-truth boxes are calculated to show the classification and box regression capabilities of Siamese-DETR. As we can see, the original query denoising harms the classification capability of Siamese-DETR, and the detection and tracking performance gain mainly comes from the better box regression capability. However, with the optimized query denoising, both the classification and box regression capabilities of Siamese-DETR are improved.
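This probing experiment can be sketched as follows, reusing the hypothetical model interface from the TbQ sketch above: ground-truth boxes are fed as query boxes, and the mean confidence of the resulting predictions and their mean IoU with the corresponding ground-truth boxes are reported. The one-to-one pairing of predictions with their source query boxes is an assumption of this sketch.

import torch
from torchvision.ops import box_iou

@torch.no_grad()
def probe_with_gt_queries(model, frames, template, gt_boxes_per_frame):
    # Feed ground-truth boxes as query boxes and measure Avg. Conf. / Avg. IoU.
    confs, ious = [], []
    for frame, gt_boxes in zip(frames, gt_boxes_per_frame):
        _, _, pred_boxes, pred_scores = model(frame, template, track_boxes=gt_boxes)
        confs.append(pred_scores.mean())
        # each query box is assumed to yield one prediction, so compare them pairwise
        ious.append(box_iou(pred_boxes, gt_boxes).diag().mean())
    return torch.stack(confs).mean().item(), torch.stack(ious).mean().item()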

IV-D4 Impact of Training Data

We show the impact of the training data on our Siamese-DETR and on open-vocabulary methods (e.g., GLIP-T (B) [44]) in Tab. VII.

Compared with the COCO-trained Siamese-DETR, the LVIS-trained Siamese-DETR achieves poorer detection and tracking performance. We attribute this to the fewer training images in LVIS than in COCO (100K vs. 118K). In contrast, the LVIS-trained GLIP-T (B) performs much better than the COCO-trained one, indicating that fine-grained category annotations play a key role in boosting the performance of open-vocabulary methods. Compared with GLIP-T (B), our Siamese-DETR has a lower demand for category annotations, which reduces the labeling cost when collecting training data. When trained on Objects365, Siamese-DETR is greatly boosted thanks to the larger amount of training data. The same phenomenon is also observed for GLIP-T (B). However, Siamese-DETR outperforms GLIP-T (B) by a large margin when trained with the same data, demonstrating the effectiveness of our method.

TABLE IX: Results of multi-category multi-object tracking on Urban Tracker [52] dataset. The default trackers in Siamese-DETR and OVTrack are adopted. The best results are shown in bold with underline.
Methods Datasets for Training Detection Results Tracking Results
mAP@50\uparrow mAR\uparrow MOTA\uparrow IDF1\uparrow MT\uparrow ML\downarrow FP\downarrow FN\downarrow IDSw\downarrow
OVTrack [7] LVIS [46] 20.0% 27.4% 5.5% 26.1% 8 47 4110 19309 68
Siamese-DETR (Ours, Swin-T) COCO [11] 35.9% 30.4% 14.2% 27.2% 11 30 4967 14298 542
Figure 5: The template images used for each category (Airplane, Ball, Balloon, Bird, Boat, Car, Fish, Insect, Person, Stock). We present the 4 template images used for each category since there are 4 videos for each category. All template images are padded to a square resolution.
TABLE X: The impact of different sets of template images. Different template images produce different detection performances, but the variabilities are negligible. The Siamese-DETR (Swin-T) trained on COCO is evaluated.
Sets of Template Images Detection Results
mAP@50\uparrow mAR\uparrow
First Set (Default) 57.5% 46.6%
Second Set 57.7% 47.2%
Third Set 57.7% 47.4%

IV-D5 Comparing with DETR-Based MOT Methods

In Section IV-C, several DETR-based methods, including DINO [22] and Conditional DETR [61], are evaluated. However, these methods are mainly designed for object detection, and the tracking results are obtained with different tracking methods. Here, we further compare Siamese-DETR with DETR-based MOT methods (e.g., TrackFormer [66]), which are closed-set tracking methods mainly designed for pedestrian/person tracking. Considering that there is no annotated video data for the training of TIMOT methods while TrackFormer needs to be trained on annotated video data, we directly use the publicly available model weights of TrackFormer and compare it with Siamese-DETR on the four person videos in the GMOT-40 benchmark. Results are shown in Tab. VIII. TrackFormer achieves much poorer detection results than Siamese-DETR, resulting in poor tracking results (MOTA < 0 due to the high FP and FN). The reason is that TrackFormer fails to detect the persons in some scenes (e.g., wingsuit flying in the bottom row of the subfigure Person in Fig. 5), while in other scenes it tends to detect all persons even though we only focus on standing persons (e.g., the top row of the subfigure Person in Fig. 5). In contrast, Siamese-DETR can follow the template image to detect/track the interested objects in different scenes.

Figure 6: Visualization of attention weights. Left column: the attention weights for different reference points in the last encoder layer. Middle column: the attention weights for different detection object queries (in Q) in the last decoder layer. Right column: the attention weights for different tracking object queries (in \hat{Q}) in the last decoder layer. The reference points are shown with cross markers. The sampling points are marked as filled circles, with the attention weights (ranging from 0.0 to 1.0) encoded with different colors. The rectangle boxes are the boxes predicted by the decoder.

IV-D6 Multi-Category Multi-Object Tracking

Our Siamese-DETR has the capability to track different categories of objects by providing a template image for each category, since it supports multiple template images simultaneously. For comparison, OVTrack [7] is also evaluated by providing all category names to it. As we can see from Tab. IX, Siamese-DETR achieves much better detection and tracking performance than OVTrack. The poor performance of OVTrack mainly comes from the higher FN. Through visualization, we find that OVTrack fails to detect some objects if the provided text prompt cannot describe them in detail. However, providing fine-grained text prompts for all objects is impractical in real applications.
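A minimal sketch of how the per-template grouping can be turned into category labels is given below (the detection format is an assumption made for illustration): since each group of object queries is derived from one template image, a detection simply inherits the category of the template that produced it.

def assign_categories(detections, template_categories):
    # detections: list of (template_index, box, score) triples (assumed format);
    # template_categories[i] is the category of the i-th template image.
    return [
        {"category": template_categories[t_idx], "box": box, "score": score}
        for t_idx, box, score in detections
    ]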

IV-D7 Impact of Template Images

The default template images used for each video are presented in Fig. 5. It can be seen that there exists variability within the same category, so the impact of different template images needs to be discussed. To study this, we further randomly sample another two sets of template images. The detection results on GMOT-40 are shown in Tab. X. As we can see, though different sets of template images produce different detection performances, the variabilities are negligible, demonstrating the generalization ability of Siamese-DETR with respect to template images. A carefully selected set of template images may produce better performance, but that is beyond the scope of this work.

Figure 7: Two failure cases of Siamese-DETR (ground-truth and prediction rows for frames 001/011/021 and 081/091/101). Siamese-DETR fails to detect/track objects when they are highly overlapped with each other (the top case) and produces some false positives if some objects share a similar appearance with the interested objects (the bottom case). For each case, the ground-truth boxes are plotted (red boxes) for reference. Please pay attention to the objects within the ellipses.

IV-D8 Visualization of Attention Weights

To show the effectiveness of Siamese-DETR, we visualize some attention weights produced by the transformer encoder and decoder layers. Since Siamese-DETR is implemented based on DINO [22] and Deformable DETR [23], where the attention is calculated in a sparse manner, we first select a reference point and then show the attention weights of all sampling points that contribute to this reference point. Results are shown in Fig. 6. For the attention weights produced by the encoder layer (left column), the reference points mainly attend to foreground sampling points. For the attention weights produced by the decoder layer (middle and right columns), the reference points mainly attend to the sampling points at object extremities. From the visualization results in the right column, we can see that the objects successfully draw attention from the model based on the tracked boxes in the previous frame.
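The visualization itself can be reproduced with a short plotting routine such as the one below (a sketch; the pixel-space reference point, sampling locations, and weights are assumed to be extracted beforehand from the deformable attention module).

import matplotlib.pyplot as plt

def plot_sparse_attention(image, reference_point, sampling_points, weights):
    # For one selected reference point, draw the sampling locations as filled
    # circles colored by their attention weights (0.0 to 1.0), and the
    # reference point as a cross marker.
    plt.imshow(image)
    xs, ys = zip(*sampling_points)
    plt.scatter(xs, ys, c=weights, cmap="jet", vmin=0.0, vmax=1.0, s=20)
    plt.scatter([reference_point[0]], [reference_point[1]], marker="x", c="white", s=60)
    plt.colorbar(label="attention weight")
    plt.axis("off")
    plt.show()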

IV-D9 Failure Cases

Two typical failure cases are presented in Fig. 7. For the top case, Siamese-DETR fails to detect/track different object instances when they are heavily overlapped with each other. For the bottom case, Siamese-DETR produces some false positive boxes when some objects share a similar appearance with the interested objects.

V Conclusions and Limitations

In this paper, we focus on template-image-based multi-object tracking, where the interested objects are described by the given template image. We take advantage of the object queries in DETR variants and propose Siamese-DETR to track multiple generic objects. In order to detect different scales of objects that share the same category with the given template image, multi-scale object queries are designed, where the query contents are obtained from the template image. In addition, a dynamic matching training strategy is proposed to train Siamese-DETR efficiently on commonly used detection datasets. To handle the scarcity of video training data, query denoising is adopted and optimized, which mimics the tracking scenario on static images. While tracking online, the tracking pipeline is simplified by incorporating the tracked boxes as additional query boxes. Object detection and tracking are performed simultaneously, and the complex data association is replaced with the simpler NMS operation. Experimental results demonstrate the effectiveness of the proposed Siamese-DETR.

The main limitations of Siamese-DETR are twofold: 1) Siamese-DETR is trained with a two-category detection task, where objects are classified into positive and negative samples. An object may be treated as a positive sample for different template images if they share a similar appearance. This may be mitigated by providing several different template images and training the model with a multi-category detection task; 2) Siamese-DETR tracks objects solely based on the tracked boxes in the previous frame without exploring appearance cues. The absence of appearance cues may result in tracking failures when occlusion between different objects happens, producing a higher IDSw and a lower IDF1. This could be addressed by pairing the tracked boxes with the corresponding appearance features rather than the features extracted from the template image. We leave these limitations to future work.

Acknowledgments

This work was supported by the National Key R&D Program of China (2022YFC3300704), the National Natural Science Foundation of China (62331006, 62171038, and 62088101), and the Fundamental Research Funds for the Central Universities.

References

  • [1] P. Bergmann, T. Meinhardt, and L. Leal-Taixe, “Tracking without bells and whistles,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 941–951.
  • [2] G. Brasó and L. Leal-Taixé, “Learning a neural solver for multiple object tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6247–6257.
  • [3] N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime tracking with a deep association metric,” in Proceedings of IEEE International Conference on Image Processing, 2017, pp. 3645–3649.
  • [4] P. Chu and H. Ling, “Famnet: Joint learning of feature, affinity and multi-dimensional assignment for online multiple object tracking,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6172–6181.
  • [5] Y. Zhang, P. Sun, Y. Jiang, D. Yu, F. Weng, Z. Yuan, P. Luo, W. Liu, and X. Wang, “Bytetrack: Multi-object tracking by associating every detection box,” in Proceedings of the European Conference on Computer Vision, 2022, pp. 1–21.
  • [6] X. Zhou, V. Koltun, and P. Krähenbühl, “Tracking objects as points,” in Proceedings of the European Conference on Computer Vision, 2020, pp. 474–490.
  • [7] S. Li, T. Fischer, L. Ke, H. Ding, M. Danelljan, and F. Yu, “Ovtrack: Open-vocabulary multiple object tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5567–5577.
  • [8] H. Bai, W. Cheng, P. Chu, J. Liu, K. Zhang, and H. Ling, “Gmot-40: A benchmark for generic multiple object tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6719–6728.
  • [9] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, pp. 4171–4186.
  • [10] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in Proceedings of the IEEE International Conference on Machine Learning, 2021, pp. 8748–8763.
  • [11] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Proceedings of the European Conference on Computer Vision, 2014, pp. 740–755.
  • [12] W. Luo and T.-K. Kim, “Generic object crowd tracking by multi-task learning.” in Proceedings of the British Machine Vision Conference, 2013.
  • [13] W. Luo, T.-K. Kim, B. Stenger, X. Zhao, and R. Cipolla, “Bi-label propagation for generic multiple object tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2014, pp. 1290–1297.
  • [14] T. Evgeniou and M. Pontil, “Regularized multi-task learning,” in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 109–117.
  • [15] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2005, pp. 886–893.
  • [16] Y. Fu, H. Liu, Y. Zou, S. Wang, Z. Li, and D. Zheng, “Category-level band learning based feature extraction for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, pp. 1–16, 2023.
  • [17] L. Huang, X. Zhao, and K. Huang, “Globaltrack: A simple and strong baseline for long-term tracking,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 11 037–11 044.
  • [18] H. Fan, H. Bai, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, M. Huang, J. Liu, Y. Xu et al., “Lasot: A high-quality large-scale single object tracking benchmark,” International Journal of Computer Vision, vol. 129, pp. 439–461, 2021.
  • [19] L. Huang, X. Zhao, and K. Huang, “Got-10k: A large high-diversity benchmark for generic object tracking in the wild,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1562–1577, 2019.
  • [20] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple online and realtime tracking,” in Proceedings of the IEEE International Conference on Image Processing, 2016, pp. 3464–3468.
  • [21] E. Bochinski, V. Eiselein, and T. Sikora, “High-speed tracking-by-detection without using image information,” in Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance, 2017, pp. 1–6.
  • [22] H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. Ni, and H.-Y. Shum, “Dino: Detr with improved denoising anchor boxes for end-to-end object detection,” in The Eleventh International Conference on Learning Representations, 2022.
  • [23] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” in International Conference on Learning Representations, 2020.
  • [24] F. Li, H. Zhang, S. Liu, J. Guo, L. M. Ni, and L. Zhang, “Dn-detr: Accelerate detr training by introducing query denoising,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13 619–13 627.
  • [25] S. Liu, F. Li, H. Zhang, X. Yang, X. Qi, H. Su, J. Zhu, and L. Zhang, “Dab-detr: Dynamic anchor boxes are better queries for detr,” in International Conference on Learning Representations, 2021.
  • [26] M. Li, Y. Fu, T. Zhang, and G. Wen, “Supervise-assisted self-supervised deep-learning method for hyperspectral image restoration,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–14, 2024.
  • [27] A. Roshan Zamir, A. Dehghan, and M. Shah, “Gmcp-tracker: Global multi-object tracking using generalized minimum clique graphs,” in Proceedings of the European Conference on Computer Vision, 2012, pp. 343–356.
  • [28] L. Zhang, Y. Li, and R. Nevatia, “Global data association for multi-object tracking using network flows,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
  • [29] M. Keuper, E. Levinkov, N. Bonneel, G. Lavoué, T. Brox, and B. Andres, “Efficient decomposition of image and mesh graphs by lifted multicuts,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1751–1759.
  • [30] S. Tang, M. Andriluka, B. Andres, and B. Schiele, “Multiple people tracking by lifted multicut and person re-identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3539–3548.
  • [31] L. Chen, Y. Fu, K. Wei, D. Zheng, and F. Heide, “Instance segmentation in the dark,” International Journal of Computer Vision, vol. 131, no. 8, pp. 2198–2218, 2023.
  • [32] Y. Fu, Y. Hong, Y. Zou, Q. Liu, Y. Zhang, N. Liu, and C. Yan, “Raw image based over-exposure correction using channel-guidance strategy,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, pp. 2749–2762, 2023.
  • [33] K. Fang, Y. Xiang, X. Li, and S. Savarese, “Recurrent autoregressive networks for online multi-object tracking,” in Proceedings of IEEE Winter Conference on Applications of Computer Vision, 2018, pp. 466–475.
  • [34] T. Zhang, Y. Fu, J. Zhang, and C. Yan, “Deep guided attention network for joint denoising and demosaicing in real image,” Chinese Journal of Electronics, vol. 33, no. 1, pp. 303–312, 2024.
  • [35] W. Ren, X. Wang, J. Tian, Y. Tang, and A. B. Chan, “Tracking-by-counting: Using network flows on crowd density maps for tracking multiple targets,” IEEE Transactions on Image Processing, vol. 30, pp. 1439–1452, 2020.
  • [36] S.-H. Lee, D.-H. Park, and S.-H. Bae, “Decode-mot: How can we hurdle frames to go beyond tracking-by-detection?” IEEE Transactions on Image Processing, pp. 4378–4392, 2023.
  • [37] Y. Fu, Z. Wang, T. Zhang, and J. Zhang, “Low-light raw video denoising with a high-quality realistic motion dataset,” IEEE Transactions on Multimedia, pp. 8119–8131, 2022.
  • [38] Q. Liu, D. Chen, Q. Chu, L. Yuan, B. Liu, L. Zhang, and N. Yu, “Online multi-object tracking with unsupervised re-identification learning and occlusion estimation,” Neurocomputing, vol. 483, pp. 333–347, 2022.
  • [39] X. Wan, J. Cao, S. Zhou, J. Wang, and N. Zheng, “Tracking beyond detection: learning a global response map for end-to-end multi-object tracking,” IEEE Transactions on Image Processing, vol. 30, pp. 8222–8235, 2021.
  • [40] Q. Liu, Q. Chu, B. Liu, and N. Yu, “Gsm: Graph similarity model for multi-object tracking.” in International Joint Conference on Artificial Intelligence, 2020, pp. 530–536.
  • [41] R. Li, B. Zhang, J. Liu, W. Liu, and Z. Teng, “Inference-domain network evolution: A new perspective for one-shot multi-object tracking,” IEEE Transactions on Image Processing, vol. 32, pp. 2147–2159, 2023.
  • [42] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, 2017.
  • [43] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” Advances in Neural Information Processing Systems, vol. 28, 2015.
  • [44] L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang et al., “Grounded language-image pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10 965–10 975.
  • [45] X. Dai, Y. Chen, B. Xiao, D. Chen, M. Liu, L. Yuan, and L. Zhang, “Dynamic head: Unifying object detection heads with attentions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7373–7382.
  • [46] A. Gupta, P. Dollar, and R. Girshick, “Lvis: A dataset for large vocabulary instance segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5356–5364.
  • [47] S. Shao, Z. Li, T. Zhang, C. Peng, G. Yu, X. Zhang, J. Li, and J. Sun, “Objects365: A large-scale, high-quality dataset for object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8430–8439.
  • [48] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Proceedings of the European Conference on Computer Vision, 2020, pp. 213–229.
  • [49] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 012–10 022.
  • [50] Q. Liu, Y. Jiang, Z. Tan, D. Chen, Y. Fu, Q. Chu, G. Hua, and N. Yu, “Transformer based pluralistic image completion with reduced information loss,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  • [51] Z. Lai, Y. Fu, and J. Zhang, “Hyperspectral image super resolution with real unaligned rgb guidance,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–13, 2024.
  • [52] J.-P. Jodoin, G.-A. Bilodeau, and N. Saunier, “Urban tracker: Multiple object tracking in urban mixed traffic,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision.   IEEE, 2014, pp. 885–892.
  • [53] K. Bernardin and R. Stiefelhagen, “Evaluating multiple object tracking performance: the clear mot metrics,” Journal on Image and Video Processing, vol. 2008, pp. 1–10, 2008.
  • [54] Y. Li, C. Huang, and R. Nevatia, “Learning to associate: Hybridboosted multi-target tracker for crowded scene,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2009, pp. 2953–2960.
  • [55] R. Couturier, H. N. Noura, O. Salman, and A. Sider, “A deep learning object detection method for an efficient clusters initialization,” arXiv preprint arXiv:2104.13634, 2021.
  • [56] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
  • [57] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
  • [58] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network for instance segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8759–8768.
  • [59] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [60] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in Neural Information Processing Systems, vol. 25, 2012.
  • [61] D. Meng, X. Chen, Z. Fan, G. Zeng, H. Li, Y. Yuan, L. Sun, and J. Wang, “Conditional detr for fast training convergence,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3651–3660.
  • [62] N. Aharon, R. Orfaig, and B.-Z. Bobrovsky, “Bot-sort: Robust associations multi-pedestrian tracking,” arXiv preprint arXiv:2206.14651, 2022.
  • [63] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2018.
  • [64] X. Chen, B. Yan, J. Zhu, D. Wang, X. Yang, and H. Lu, “Transformer tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8126–8135.
  • [65] N. Wang, W. Zhou, J. Wang, and H. Li, “Transformer meets tracker: Exploiting temporal context for robust visual tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1571–1580.
  • [66] T. Meinhardt, A. Kirillov, L. Leal-Taixe, and C. Feichtenhofer, “Trackformer: Multi-object tracking with transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8844–8854.
  • [67] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler, “Mot16: A benchmark for multi-object tracking,” arXiv preprint arXiv:1603.00831, 2016.