
RTrack: Accelerating Convergence for Visual Object Tracking
via Pseudo-Boxes Exploration

Guotian Zeng1, Bi Zeng1, Hong Zhang2, Jianqi Liu1, Qingmao Wei1,
1Guangdong University of Technology
2South University of Science and Technology
[email protected]
Abstract

Single object tracking (SOT) heavily relies on representing the target object as a bounding box. However, because the tracked target may deform and rotate, the genuine bounding box fails to explicitly capture the target's appearance and introduces background clutter. This paper proposes RTrack, a novel object-representation baseline tracker that utilizes a set of sample points to generate a pseudo bounding box. RTrack automatically arranges these points to define the spatial extent of the target and highlight local areas. Building upon this baseline, we conduct an in-depth exploration of the training potential and introduce a one-to-many leading assignment strategy. Notably, our approach achieves performance competitive with state-of-the-art trackers on the GOT-10k dataset while reducing training time to just 10% of the previous state-of-the-art (SOTA) trackers' training costs. This substantial reduction in training cost brings single object tracking (SOT) closer to the object detection (OD) task. Extensive experiments demonstrate that our proposed RTrack achieves SOTA results with faster convergence.

1 Introduction

Visual object tracking (VOT) aims to locate and track an arbitrary target over time in a video sequence, given only its initial appearance. VOT potentially benefits the study of object detection, classification, and other related tasks such as action recognition and human-computer interaction, etc.

In single object tracking (SOT), the bounding box plays a crucial role in locating and tracking the object throughout a video sequence. It provides information about the target’s position and scale in the current frame, and serves as input for feature extraction and classification. The widespread adoption of genuine bounding box representation can be attributed to the following factors. Firstly, it aligns with commonly used performance metrics [19, 10, 35] that evaluate the overlap between estimated and ground truth boxes. Secondly, it offers convenience for feature extraction in deep networks [17, 8, 15] due to its regular orientation, allowing for easy subdivision of a rectangular window into a matrix of pooled cells.

Figure 1: Comparison of AO and training pairs of state-of-the-art trackers on GOT-10k under one-shot setting. Fewer Search-Template pairs means faster training convergence.
Figure 2: Previous works on visual object tracking mainly relied on exploring the genuine bounding box. We propose to use a set of points to explore the appearance and ultimately generate pseudo bounding boxes. This is achieved through weak localization supervision from rectangular ground-truth boxes.

However, similar to object detection (OD), the bounding box representation only provides a rough localization of the target and does not account for its arbitrary appearance [28, 6, 3]. When features are extracted solely from the bounding box, they may encompass background clutter, which is detrimental to the performance of single object tracking (SOT). Accurate bounding box prediction plays a vital role in SOT, as the overlap between the estimated and ground-truth boxes is a widely adopted evaluation indicator.

Overall, current tracker designs have the following problems: 1) The majority of mainstream public visual object tracking benchmarks use rectangular bounding box annotations, since point-based annotations are costly and impractical. 2) Tracked objects have arbitrary orientations, and genuine bounding boxes often introduce unnecessary background information. This forces the network to implicitly learn how to highlight the foreground, which slows the convergence of training [4, 46, 6, 12, 3, 41, 44, 43]. 3) The improper distribution of positive and negative samples leads to training costs that are ten times higher than those of object detection algorithms [24, 56].

In this paper, we propose RTrack, a baseline tracker that utilizes a set of sample points to achieve accurate foreground localization. This design adaptively converts the sample points into pseudo bounding boxes, allowing training with only bounding box annotations. Building upon our baseline, we further explore the potential of sample allocation strategies and propose a one-to-many leading assignment strategy that focuses on the trade-off between training efficiency and performance. Additionally, we implicitly incorporate correlation modeling into classification and regression. Our experiments demonstrate that RTrack is effective and achieves state-of-the-art performance on several tracking benchmarks. Moreover, as depicted in Fig. 1, compared to the recent state-of-the-art tracker Swin-V2 [45], RTrack-256 converges 18.7 times faster while achieving a 3.8% higher AO score on the GOT-10k dataset. It is worth emphasizing that prior methods heavily rely on extensive training time to compensate for limited prior knowledge. In contrast, our RTrack takes a different approach by thoroughly exploring the training potential and performance, as shown in Fig. 2.

In summary, the contributions of this work are as follows:

  • We propose RTrack, a baseline tracker that leverages a set of sample points to achieve foreground localization. It adaptively converts the sample points into pseudo bounding boxes, allowing training using only bounding box annotations.

  • We present a progressive one-to-many leading assignment strategy that effectively discriminates between positive and negative samples. Additionally, we implicitly incorporate correlation modeling into classification and regression.

  • Comprehensive experiments demonstrate that RTrack achieves state-of-the-art (SOTA) performance with remarkably fast training convergence, e.g., 35 epochs (0.25 RTX 3090 GPU-days) on the GOT-10k dataset.

Figure 3: The architecture of the proposed RTrack consists of two key components: the Encoder and the subsequent arbitrary representation head. The Encoder extracts visual features. The two-stage arbitrary representation head generates pseudo boxes for the tracked objects, which are used for subsequent sample assignment and loss computation.

2 Related Work

2.1 Non-axis aligned representations

Traditional methods [28, 30] primarily focus on detecting axis-aligned objects. However, these methods may encounter difficulties in detecting non-axis-aligned targets that are densely distributed in complex backgrounds. More recently, there has been a focus on bottom-up approaches in object detection, including methods like CornerNet [23, 38, 52], ExtremeNet [53], and CenterNet [9, 48, 13]. CornerNet predicts the top-left and bottom-right heatmaps of objects to generate the corresponding bounding boxes. However, it’s important to note that these corner points still essentially represent a rectangular bounding box. ExtremeNet aims to locate the extreme points of objects in both horizontal and vertical directions, with supervision from ground-truth mask annotations. This approach focuses on identifying specific points on the object’s boundary, which allows for potentially more precise localization.

2.2 Pseudo boxes generation without annotations

Bottom-up detectors offer certain advantages, such as a reduced hypothesis space and the potential for more precise localization. However, they often rely on manual clustering or post-processing steps to reconstruct complete object representations. In contrast, RepPoints [47, 5, 33] provides a flexible object representation without the need for handcrafted clustering. RepPoints can learn extreme points and key semantic points automatically, even without additional supervision beyond ground-truth bounding boxes. In our proposed RTrack, we integrate RepPoints with deformable convolution [7, 55], aligning with the point representation and effectively aggregating information from multiple sample points. Moreover, RepPoints can easily generate a “pseudo box”, making it naturally compatible with existing SOT benchmarks.

2.3 Label assignment

Many detection methods rely on a hand-crafted threshold [28, 4] for selecting positive samples. However, hand-crafted settings do not guarantee the overall quality of training samples, especially in the presence of noise and hard cases [25, 32]. Hard samples, which often have a low IoU with anchors or cover a limited number of feature points, lead to a scarcity of positive samples. ATSS [50], AutoAssign [54], and OTA [14] have emphasized the importance of label assignment for improving detector performance. These methods employ an optimization strategy to select high-quality samples. In single object tracking (SOT), selecting high-quality samples is crucial because of the diverse orientations and sparse distribution of objects. In this paper, we propose an effective one-to-many leading sample assignment scheme specifically designed for our proposed baseline RTrack. Our scheme selects positive samples that accurately reflect the target's quality and facilitates faster convergence.

3 Methodology

This section provides a detailed presentation of our proposed RTrack, which is divided into three parts: 1) Baseline tracking framework: we present a novel object representation baseline tracking framework, which serves as the foundation for the following explorations; 2) Building upon the baseline tracker, we further propose a progressive one-to-many leading assignment strategy to enhance performance and convergence speed. Additionally, we introduce an approach for modeling the correlation between classification and regression tasks to implicitly encode mutual affinity; 3) Finally, we describe the training loss of RTrack. For more details about adaptive object representation, please refer to the appendix.

3.1 Baseline tracker

Our RTrack consists of two key components: a simple Encoder for simultaneous feature extraction and relation modeling, and an arbitrary representation head designed to generate pseudo target bounding boxes. The overview of the model is shown in Fig. 3.

3.1.1 Encoder

Our RTrack employs a progressive multi-layer architecture design [8, 48, 2], where each layer operates on feature maps of the same scale with the same number of channels. Given the initial templates of size $H_t \times W_t \times 3$ and the search region of size $H_s \times W_s \times 3$, we first map them to non-overlapping patch embeddings using a convolutional operation with stride 16 and kernel size 16. This convolutional token embedding layer is applied to each input to increase the number of channels while reducing the spatial resolution. Next, we flatten the embeddings and concatenate them to produce a fused token sequence of size $L \times C$, where $L$ is the fused sequence length and $C$ is the number of channels. Finally, we obtain a deep interaction-aware feature of size $L' \times C$, where $L'$ is less than $L$. Before being fed to the prediction head, the search-region feature is recovered, split, and reshaped to a 2D spatial feature map of size $\frac{H_s}{16} \times \frac{W_s}{16}$.
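For concreteness, the token pipeline can be sketched in a few lines of PyTorch. This is a minimal illustration under our reading of the paper; the module and helper names (PatchEmbed, fuse_tokens, recover_search_feature) are hypothetical placeholders, and the actual relation modeling happens inside the ViT encoder layers.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Map an image to non-overlapping 16x16 patch tokens (hypothetical layer name)."""
    def __init__(self, in_chans=3, embed_dim=768, patch_size=16):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                        # x: (B, 3, H, W)
        x = self.proj(x)                         # (B, C, H/16, W/16)
        return x.flatten(2).transpose(1, 2)      # (B, H/16 * W/16, C)

def fuse_tokens(embed, template, search):
    """Flatten and concatenate template and search tokens into one sequence of length L."""
    z, x = embed(template), embed(search)        # (B, Lz, C), (B, Lx, C)
    return torch.cat([z, x], dim=1), x.shape[1]

def recover_search_feature(tokens, num_search_tokens, hs, ws):
    """Split off the search tokens and reshape them to a Hs/16 x Ws/16 spatial map."""
    x = tokens[:, -num_search_tokens:]           # (B, Lx, C)
    b, _, c = x.shape
    return x.transpose(1, 2).reshape(b, c, hs // 16, ws // 16)
```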

Figure 4: The sample assignment strategies of the init stage and refine stage. Compared with the assigner strategy (a), the schema in (b) introduces our proposed leading strategy, which leverages the strong learning ability of the init stage. In the init stage, we consistently maintain the center point-based initial representation and assign a unique positive sample, which we refer to as the one-to-one assigner strategy. In the refine stage (c), (d), we propose the one-to-many assigner strategy to accelerate the convergence speed.

3.1.2 Arbitrary Representation Head

As illustrated in Fig. 3 (b), our arbitrary representation head consists of two stages. 1) Init stage: generating the first points set by refining from the object-center hypothesis. For the init stage, we use center points as the initial representation of objects (a feature map bin is considered positive if the center point of the ground-truth object falls within it). Starting from the center point, the first set of TrackPoints is obtained by regressing offsets over the center point. The head automatically arranges the TrackPoints to delineate the foreground of the target object. These sample points are then transformed into a pseudo box, which serves as the initial pseudo bounding box. 2) Refine stage: generating the second points set by refining from the first points set. In this second localization stage, the head further refines the representation by adapting the bounding box to better fit the target, and we adopt the approach used in common trackers, where the feature bin with the highest score is considered the unique positive sample. This multi-stage design allows for progressive improvement and refinement of the object representation.
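As a rough sketch (not the exact implementation), the two-stage head can be written as follows. The layer names are illustrative, the number of TrackPoints is 9 as in the paper, and the min-max conversion is the one detailed in Appendix A.

```python
import torch
import torch.nn as nn

class TrackPointsHead(nn.Module):
    """Two-stage point-set head: init offsets from the center hypothesis, then refinement."""
    def __init__(self, channels=256, num_points=9):
        super().__init__()
        self.num_points = num_points
        self.cls = nn.Conv2d(channels, 1, 3, padding=1)              # foreground score map
        self.init_offset = nn.Conv2d(channels, 2 * num_points, 3, padding=1)
        self.refine_offset = nn.Conv2d(channels, 2 * num_points, 3, padding=1)

    def forward(self, feat, centers):
        # feat: (B, C, H, W); centers: (B, 2, H, W) bin-center coordinates.
        pts_init = centers.repeat(1, self.num_points, 1, 1) + self.init_offset(feat)
        # In the full model the refinement aggregates features sampled at pts_init
        # (e.g. via deformable convolution); here we refine from the same feature map.
        pts_refine = pts_init.detach() + self.refine_offset(feat)
        return self.cls(feat).sigmoid(), pts_init, pts_refine

def points_to_pseudo_box(points):
    """Min-max conversion: the tightest axis-aligned box over the TrackPoints."""
    xs, ys = points[:, 0::2], points[:, 1::2]                        # (B, n, H, W) each
    return torch.stack([xs.min(1).values, ys.min(1).values,
                        xs.max(1).values, ys.max(1).values], dim=1)  # (B, 4, H, W)
```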

Figure 5: The relationship between feature bins and the ground truth (GT) center is as follows: In the init stage, we select the feature bin with the smallest center distance (CD) as the unique positive sample. In the refine stage, we select multiple feature bins with the smallest CD or the highest IoU value (IV) according to the one-to-many strategy and assign them as positive samples.

3.1.3 Discussion of the proposed baseline tracker

1) Sample allocation strategy: our proposed baseline tracker employs a one-to-one matching strategy for allocating positive and negative samples in both stages, as shown in Fig. 4 (a). As a result, each stage has only one positive sample, which leads to a significant discrepancy in training costs between single object tracking (SOT) and efficient object detection (OD) algorithms [40, 56, 24]. To tackle this issue, we propose a one-to-many leading allocation strategy specifically designed to unleash the potential of our baseline tracker in the following section; 2) In our proposed baseline tracker, we only allow localization information (IoU) to flow to the classification branch (score) in a one-way manner, as illustrated in Fig. 6 (b). This information imbalance can limit the performance of the tracker. We will introduce an implicit approach to model the mutual affinity between these two branches, addressing the correlation between classification (score) and localization (IoU), in the following section.

3.2 One-to-many leading assigner strategy

Current mainstream trackers commonly employ a one-to-one matching strategy that allows only a unique positive sample. However, this leads to a sparse distribution of samples and significantly increases the training cost required by these trackers. As shown in Fig. 4 (a), our baseline tracker separates the two stages (an init stage and a refine stage), each of which uses its own prediction results and the ground truth to perform one-to-one label assignment. Based on the observations above, we propose a progressive one-to-many leading assignment strategy, as shown in Fig. 4 (d):

  1. In the init stage, we select the feature bin closest to the ground-truth object's center point based on center distance (CD) as the unique positive sample, termed the one-to-one assigner strategy, as illustrated in Fig. 5 (b).

  2. In the refine stage, we adopt a one-to-many assigner strategy, allowing more feature bins to be treated as positive targets by relaxing the constraints on the training potential. We select the top-K candidate positive samples with the highest intersection-over-union (IoU) values between the predicted bounding boxes and the ground-truth box, as depicted in Fig. 5 (c). We then set the dynamic threshold as the sum of the mean and variance of the candidate samples.

We note that the additional positive samples may introduce a poor prior at the final prediction. Therefore, to reduce the impact of these extra coarse positive samples, we constrain them through the following "leading" scheme.

The leading label assigner, depicted in Fig. 4 (d), uses the predictions from the init stage together with the ground truth to calculate dynamic labels. These labels are generated through an optimization process that leverages guidance from the init stage to inform the refine stage. Specifically, the init-stage predictions are used as guidance to generate hierarchical labels, which are then utilized in the refine stage. The rationale is that the init stage has a relatively strong learning capability, so the labels it generates should be more representative of the distribution and correlation between the source data and the target. By letting the refine stage directly learn the information that the init stage has already captured, the init stage is freed to focus on information that has not yet been learned.
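A sketch of this refine-stage assignment, under our reading of the description above: the IoUs are computed against the init-stage pseudo boxes (the "leading" predictions), the top-k candidates are kept, and the dynamic threshold is the sum of their mean and variance. Function names and the value of k are illustrative.

```python
import torch

def box_iou(boxes, gt):
    """IoU between (N, 4) boxes and (M, 4) boxes in (x1, y1, x2, y2) format."""
    area1 = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area2 = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    lt = torch.max(boxes[:, None, :2], gt[None, :, :2])
    rb = torch.min(boxes[:, None, 2:], gt[None, :, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area1[:, None] + area2[None, :] - inter)

def one_to_many_leading_assign(init_boxes, gt_box, k=12):
    """Refine-stage labels: positives are bins whose leading IoU clears a dynamic threshold.

    init_boxes: (N, 4) pseudo boxes from the init stage ("leading" predictions)
    gt_box:     (4,)   ground-truth box of the single target
    """
    ious = box_iou(init_boxes, gt_box[None]).squeeze(1)          # (N,)
    candidates = torch.topk(ious, k=min(k, ious.numel())).values
    # Dynamic threshold: sum of the mean and variance of the candidate IoUs.
    threshold = candidates.mean() + candidates.var()
    return ious >= threshold                                     # boolean positive mask
```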

Figure 6: The relationship between classification (cls) and localization (loc) tasks can be approached in several ways: (a) Independent optimization of both tasks; (b) One-way information flow; (c) Additional branches; (d) Our mutual affinity design.

3.3 Mutual affinity between classification and localization

Instead of training the classification and localization tasks independently (Fig. 6 (a)), most trackers [1, 11] optimize a weighted sum of classification and localization losses during training. Recent works [49, 16, 13] suggest that performance improves when these two loss functions are forced to interact with each other, as illustrated in Fig. 6 (b). Several methods (Fig. 6 (c)) have recently been proposed to improve the correlation between these tasks: training an auxiliary head to regress the localization quality of the positive examples, e.g., centerness or IoU, has proven useful [20, 39, 51]. Indeed, there are approaches [6, 46, 12] that incorporate plug-in components [21] and train an additional branch for predicting IoU. While these two-stage training approaches can potentially improve performance, they also increase training costs and model complexity, and they may not fully capture the correlation between the two branches, which can limit their performance.

As shown in Fig. 6 (d), given the loss function $\mathcal{L}$, the correlation loss $\mathcal{L}_{corr}$ is simply added using a weighting hyper-parameter $\lambda_{corr}$:

$\mathcal{L}' = \mathcal{L} + \lambda_{corr}\mathcal{L}_{corr}.$ (1)

$\mathcal{L}_{corr}$ is the correlation loss, defined as:

$\mathcal{L}_{corr} = 1 - \rho(\hat{\mathrm{IoU}}, \hat{s}),$ (2)

where $\hat{s}$ is the foreground confidence score, $\hat{\mathrm{IoU}}$ is the IoU of the predicted pseudo boxes, and $\rho(\cdot,\cdot)$ is a correlation coefficient [22] computed over the positive samples, defined as follows:

$\bar{s}=\frac{1}{|\mathcal{C}|}\sum_{c\in\mathcal{C}}\hat{s}^{c}, \qquad \bar{b}=\frac{1}{|\mathcal{C}|}\sum_{c\in\mathcal{C}}\hat{b}^{c},$
$v_{s}=\hat{s}-\bar{s}, \qquad v_{b}=\hat{b}-\bar{b},$
$\rho(\hat{s},\hat{b}) = \dfrac{\frac{2\sum_{c\in\mathcal{C}}(v_{s}^{c}\times v_{b}^{c})}{\sqrt{\sum_{c\in\mathcal{C}}(v_{s}^{c})^{2}}\times\sqrt{\sum_{c\in\mathcal{C}}(v_{b}^{c})^{2}}}\times\operatorname{std}(\hat{s})\times\operatorname{std}(\hat{b})}{\operatorname{var}(\hat{s})+\operatorname{var}(\hat{b})+(\bar{s}-\bar{b})^{2}}$ (3)

We only use the gradients of $\mathcal{L}_{corr}$ w.r.t. the classification score, i.e., we backpropagate the gradients through only the classification branch, termed gradient truncation, as shown in Fig. 6 (d).
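A minimal sketch of Eqs. (1)–(3) for the positive samples, assuming the concordance-style coefficient of [22]; the IoUs are detached so that, as stated above, gradients flow only through the classification scores.

```python
import torch

def correlation_loss(scores, ious, eps=1e-7):
    """L_corr = 1 - rho(IoU, s) over the positive samples, with gradient truncation.

    scores: (P,) foreground confidence scores of the positive bins
    ious:   (P,) IoUs between the predicted pseudo boxes and the ground truth
    """
    ious = ious.detach()                         # gradient truncation: no grad to the box branch
    s_mean, b_mean = scores.mean(), ious.mean()
    v_s, v_b = scores - s_mean, ious - b_mean
    # Pearson term of Eq. (3).
    pearson = (v_s * v_b).sum() / (v_s.pow(2).sum().sqrt() * v_b.pow(2).sum().sqrt() + eps)
    # Full coefficient: 2 * pearson * std(s) * std(b) / (var(s) + var(b) + (mean gap)^2).
    num = 2.0 * pearson * scores.std(unbiased=False) * ious.std(unbiased=False)
    den = scores.var(unbiased=False) + ious.var(unbiased=False) + (s_mean - b_mean) ** 2 + eps
    return 1.0 - num / den                       # Eq. (2)
```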

| | TransT [4] | STARK [46] | SBT [44] | OSTrack [48] | VideoTrack [43] | TATrack [16] | SwinV2 [45] | ARTrack [41] | RTrack-256 | RTrack-384 |
| Epochs | 1000 | 500 | 600 | 100 | 300 | 60 | 300 | 300 | 35 | 60 |
| Pairs / epoch ($\times 10^4$) | 3.8 | 6 | 5 | 6 | 6 | 30 | 13.1 | 6 | 6 | 6 |
| Training pairs ($\times 10^6$) $\downarrow$ | 38 | 30 | 30 | 6 | 18 | 18 | 39.3 | 18 | 2.1 | 3.6 |
| mAO (%) $\uparrow$ | 67.1 | 68.8 | 69.9 | 71.0 | 72.9 | 73.0 | 70.8 | 73.5 | 74.6 | 76.4 |
| SR$_{0.5}$ (%) $\uparrow$ | 76.8 | 78.1 | 80.4 | 80.4 | 81.9 | 83.3 | - | 82.2 | 84.4 | 86.0 |
| SR$_{0.75}$ (%) $\uparrow$ | 60.9 | 64.1 | 63.6 | 68.2 | 69.8 | 68.5 | - | 70.9 | 72.1 | 74.1 |
Table 1: Comparison with state-of-the-art trackers on the GOT-10k test set. All reported results strictly adhere to the one-shot protocol, i.e., our models were trained only on the GOT-10k training set with no additional data.

3.4 Training loss

In this section, we describe the training procedure for our proposed RTrack. First, we feed the $\frac{H_s}{16} \times \frac{W_s}{16} \times C$ output of the backbone network into our proposed arbitrary representation head. We adopt the weighted focal loss [26] for classification as follows:

$\mathcal{L}_{cls}=-\sum_{xy}\begin{cases}\left(1-\bm{P}_{xy}\right)^{\alpha}\log\left(\bm{P}_{xy}\right), & \text{if }\hat{\bm{P}}_{xy}=1\\ \left(1-\hat{\bm{P}}_{xy}\right)^{\beta}\left(\bm{P}_{xy}\right)^{\alpha}\log\left(1-\bm{P}_{xy}\right), & \text{otherwise}\end{cases}$ (4)

where $\alpha=2$ and $\beta=4$ are the regularization parameters in our experiments, as in [48].
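A short sketch of Eq. (4), assuming `pred` is the predicted score map and `target` the Gaussian-smoothed ground-truth map (both in [0, 1]); the names are illustrative.

```python
import torch

def weighted_focal_loss(pred, target, alpha=2, beta=4, eps=1e-7):
    """Gaussian-weighted focal loss of Eq. (4) with alpha = 2 and beta = 4."""
    pos = target.eq(1).float()                   # bins where the ground-truth map equals 1
    neg = 1.0 - pos
    pos_term = (1 - pred).pow(alpha) * torch.log(pred + eps) * pos
    neg_term = (1 - target).pow(beta) * pred.pow(alpha) * torch.log(1 - pred + eps) * neg
    return -(pos_term + neg_term).sum()
```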

With the predicted bounding box, the generalized IoU loss [37] is employed for bounding box regression, for fair comparison, as follows:

$\mathcal{L}_{\text{det}} = \lambda_{init}\mathcal{L}_{init} + \lambda_{refine}\mathcal{L}_{refine}$ (5)

where $\lambda_{init}=1$ and $\lambda_{refine}=2$, as in [48].

Finally, we model the correlation between the classification loss $\mathcal{L}_{\text{cls}}$ and the localization loss $\mathcal{L}_{\text{det}}$. The full loss function is:

$\mathcal{L}_{\text{all}} = \lambda_{\text{cls}}\mathcal{L}_{\text{cls}} + \lambda_{\text{det}}\mathcal{L}_{\text{det}} + \lambda_{\text{corr}}\mathcal{L}_{\text{corr}}$ (6)

where $\lambda_{cls}=2$, $\lambda_{det}=1$ and $\lambda_{corr}=0.5$.
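Putting Eqs. (5) and (6) together in a minimal sketch (the per-stage GIoU terms and the correlation loss are assumed to be computed as described above):

```python
def total_loss(cls_loss, giou_init, giou_refine, corr_loss,
               l_cls=2.0, l_det=1.0, l_corr=0.5, l_init=1.0, l_refine=2.0):
    """L_all = lambda_cls * L_cls + lambda_det * L_det + lambda_corr * L_corr."""
    det_loss = l_init * giou_init + l_refine * giou_refine           # Eq. (5)
    return l_cls * cls_loss + l_det * det_loss + l_corr * corr_loss  # Eq. (6)
```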

| | TransT [4] | CSWinTT [38] | MixFormer [6] | SimTrack [2] | OSTrack [48] | MAT [52] | SwinV2 [45] | SeqTrack [3] | RTrack-256 | RTrack-256 | RTrack-384 |
| Epochs | 1000 | 600 | 500 | 500 | 300 | 500 | 300 | 500 | 100 | 300 | 300 |
| Pairs / epoch ($\times 10^4$) | 3.8 | 6 | 6 | 6 | 6 | 6.4 | 13.1 | 3 | 6 | 6 | 6 |
| Training pairs ($\times 10^6$) $\downarrow$ | 38 | 36 | 30 | 30 | 18 | 32 | 39.3 | 15 | 6 | 18 | 18 |
| AUC (%) $\uparrow$ | 81.4 | 81.9 | 83.1 | 82.3 | 83.1 | 81.9 | 82.0 | 83.3 | 83.7 | 84.2 | 85.0 |
| PRE$_{norm}$ (%) $\uparrow$ | 86.7 | 86.7 | 88.1 | 86.5 | 87.8 | 86.8 | - | 88.3 | 88.6 | 88.8 | 89.4 |
| PRE (%) $\uparrow$ | 80.3 | 79.5 | 81.6 | - | 82.0 | - | - | 82.2 | 82.7 | 83.2 | 85.0 |
Table 2: Comparison with state-of-the-art trackers on the TrackingNet test set.
| # | MaxIoU Assigner | Leading Assigner (IV) | Mutual Affinity | GOT-10k mAO (%) | GOT-10k mSR50 (%) | GOT-10k mSR75 (%) | LASOT SUC (%) | LASOT PRE (%) | OTB100 SUC (%) | OTB100 RPE (%) | Δ |
| 1 (baseline‡) | - | - | - | 70.9 | 80.1 | 68.3 | 65.5 | 69.6 | 65.9 | 85.4 | - |
| 2 (baseline‡) | - | - | ✓ | 68.8 | 77.8 | 65.7 | 64.9 | 68.2 | 65.9 | 85.4 | -1.2 |
| 3 | ✓ | - | - | 72.5 | 81.9 | 70.9 | 67.1 | 72.9 | 70.0 | 91.6 | +3.0 |
| 4 | ✓ | - | ✓ | 72.9 | 82.3 | 71.0 | 66.8 | 72.7 | 70.0 | 91.9 | +3.1 |
| 5 | - | ✓ | - | 74.6 | 84.1 | 72.9 | 66.4 | 72.1 | 70.5 | 92.3 | +3.9 |
| 6 | - | ✓ | ✓ | 75.2 | 84.8 | 73.4 | 67.4 | 73.1 | 71.1 | 93.2 | +4.6 |
(a) Ablation studies on RTrack. Δ denotes the performance change (averaged over benchmarks). The superscript ‡ denotes that the tracker adopts a one-to-one assigner strategy, while all other assigners utilize a one-to-many strategy.
| CD (top-k) | GOT-10k mAO (%) $\uparrow$ | LASOT AUC (%) $\uparrow$ | TKNET AUC (%) $\uparrow$ |
| 9 | 74.1 | 67.6 | 83.4 |
| 12 | 75.0 | 67.9 | 83.6 |
| 14 | 74.6 | 67.5 | 83.5 |
| 16 | 74.4 | 67.6 | 83.5 |
| 20 | 74.4 | 67.4 | 83.6 |
(b) Top-k Center Distance (CD) candidate samples count.
| IV (top-k) | GOT-10k mAO (%) $\uparrow$ | LASOT AUC (%) $\uparrow$ | TKNET AUC (%) $\uparrow$ |
| 9 | 74.2 | 65.0 | 83.5 |
| 12 | 74.5 | 66.3 | 83.7 |
| 14 | 73.3 | 66.5 | 83.7 |
| 16 | 74.9 | 67.3 | 83.6 |
| 20 | 74.0 | 67.6 | 83.5 |
(c) Top-k IoU Value (IV) candidate samples count.
| Assigner | w/ leading | GOT-10k | LASOT | TKNET |
| MaxIoU | - | 72.1 | 65.8 | 82.3 |
| MaxIoU | ✓ | 72.9 | 66.8 | 83.4 |
| CD (ours) | - | 73.2 | 66.0 | 83.0 |
| CD (ours) | ✓ | 75.0 | 67.9 | 83.6 |
| IV (ours) | - | 72.0 | 64.9 | 82.2 |
| IV (ours) | ✓ | 75.2 | 67.4 | 83.5 |
(d) Experiments on the impact of the leading strategy.
Table 3: A set of ablative studies on GOT-10k, LASOT, OTB100 and TrackingNet (TKNET).

4 Experiments

4.1 Implementation Details

Our tracker is implemented in PyTorch. The models are trained on a single NVIDIA RTX 3090 GPU with 24 GB of memory, and inference is performed on an NVIDIA RTX 3060 Ti GPU.

4.1.1 Architectures.

The vanilla ViT-Base [8] model pre-trained with MAE [15] is adopted as the backbone of our encoder. It contains 12 layers with a hidden dimension of 768. We adopt an alternating training strategy over RTrack's 12 layers. In the shallow layers $l_0$ to $l_2$, we learn the importance of each patch token. The fine-tuned layers ($l_3$, $l_6$, $l_9$) are trained on alternate epochs with $N$ fully dense patch tokens (no discarding) and $N'$ ($\leq N$) sparse patch tokens after discarding. Training with fully dense patch tokens preserves the accuracy of the model, unlike DynamicViT [36] and OSTrack [48], which are unable to recover the original accuracy with dense tokens. We present the following RTrack versions to demonstrate its performance:

-RTrack-256. Template: 128 × 128 pixels; Search region: 256 × 256 pixels.

-RTrack-384. Template: 192 × 192 pixels; Search region: 384 × 384 pixels.

4.1.2 Training.

We use the LaSOT [10], GOT-10k [19], COCO [29], and TrackingNet [35] datasets to train RTrack. We use the same data augmentations as OSTrack [48], including horizontal flip and brightness jittering. Each GPU processes 64 image pairs, giving a total batch size of 64. We use the AdamW optimizer [31] with a weight decay of $10^{-4}$. The initial learning rate is $1.5\times 10^{-4}$ and is reduced to $1.5\times 10^{-5}$ at 80% of the total epochs. For RTrack-256, 60k image pairs are sampled per training epoch.
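The optimization setup, as we read it, in a minimal PyTorch sketch (the function name and total-epoch argument are placeholders):

```python
import torch

def build_optimizer_and_scheduler(model, total_epochs):
    """AdamW with lr 1.5e-4 and weight decay 1e-4; the lr drops by 10x at 80% of training."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-4, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[int(0.8 * total_epochs)], gamma=0.1)
    return optimizer, scheduler
```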

4.2 Comparison with State-of-the-art Trackers

To evaluate the performance of our proposed RTrack, we compare it with several state-of-the-art (SOTA) trackers on GOT-10k [19] and TrackingNet [35]. In addition to the mentioned datasets, the detailed results and analysis on UAV123 [34], LaSOT [10], and OTB-100 [42] can be found in the appendix of our paper.

4.2.1 GOT-10k.

The GOT-10k dataset is a comprehensive benchmark for object tracking. It comprises 10,000 video clips and approximately 1.5 million target annotations. The dataset covers 560 common object classes and 87 motion modes, making it diverse and representative of real-world tracking scenarios. One notable feature of the GOT-10k dataset is the introduction of the one-shot protocol for training and testing. This protocol ensures that the training and test sets do not overlap in terms of object categories, allowing for better evaluation and assessment of the performance of non-specific target tracking tasks. Tab. 1 shows that RTrack-256 achieves the highest average overlap (AO) of 74.6%, surpassing OSTrack-256 (71.0%) [48] by 3.6% and outperforming Swin-V2 (70.8%) [45]. It is worth noting that RTrack achieves these results with only $\frac{1}{3}$ of the training sample pairs used by OSTrack and $\frac{1}{18}$ of the training sample pairs used by Swin-V2. This demonstrates the effectiveness and efficiency of RTrack in extracting discriminative features for unseen classes.

4.2.2 TrackingNet.

The TrackingNet [35] dataset is currently the largest and longest video object tracking dataset available. It consists of over 30,000 video sequences, encompassing various challenging scenarios, and includes more than 14 million object bounding box annotations.  Tab. 2 shows that RTrack achieves superior performance compared to most trackers on the TrackingNet dataset. These results highlight the effectiveness of our proposed method, especially considering the smaller number of training samples used in our approach.

4.3 Ablation and Analysis

To verify the effectiveness of our proposed framework, we analyze different components and settings of RTrack and perform detailed exploration studies. The result of the baseline is reported in Tab. 3(a) (#1). All experiments are reported in Tab. 3(a), Tab. 3(b), Tab. 3(c) and Tab. 3(d). More ablation studies and analyses are provided in the appendix.

4.3.1 Analysis on one-to-many assigner strategy in the refine stage.

The design of the one-to-many assigner strategy in the refine stage allows for faster convergence with high performance. We also experimented with a commonly used static one-to-many sample assignment strategy, referred to as the "MaxIoU assigner": a prediction whose IoU between the initial induced pseudo box and the ground-truth bounding box is larger than 0.5 is considered a positive sample, one smaller than 0.4 is considered a negative sample, and all other IoU values are ignored. As shown in Tab. 3(a) (#1, #3, #5), comparing the one-to-one assigner strategy with the one-to-many strategy shows that the one-to-many strategy leads to higher performance. These results highlight the efficiency of our approach. By utilizing the one-to-many assigner strategy and optimizing the training process, we achieve comparable or even superior performance to existing state-of-the-art trackers while significantly reducing the training time.
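For reference, a sketch of the static MaxIoU assignment used in this ablation, with the thresholds from the description above (positive above 0.5, negative below 0.4, otherwise ignored); `ious` would be computed between the initial pseudo boxes and the ground-truth box.

```python
def maxiou_assign(ious, pos_thr=0.5, neg_thr=0.4):
    """Static MaxIoU labels: 1 = positive, 0 = negative, -1 = ignored.

    ious: (N,) tensor of IoUs between the init-stage pseudo boxes and the ground-truth box.
    """
    labels = ious.new_full(ious.shape, -1.0)   # ignored by default
    labels[ious < neg_thr] = 0.0
    labels[ious > pos_thr] = 1.0
    return labels
```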

4.3.2 Analysis on the correlation between classification and localization.

Our baseline only considers the one-way information flow from classification (cls) to localization (loc). To explore the impact of the mutual affinity between loc and cls, we implicitly incorporate correlation modeling into classification and regression. This integration allows us to incorporate IoU information into the classification branch, thereby improving object localization. By considering the correlation between cls and loc, i.e., Tab. 3(a) (#2, #4, #6), we enhance the integration of classification and localization information, leading to more accurate and reliable object tracking.

4.3.3 Further analysis on the leading assigner strategy.

The "leading" design is the key component of RTrack in the refine stage. By enabling the refine stage to directly learn the information acquired by the init stage, the init stage can focus more on acquiring new information. Tab. 3(d) reports the performance of RTrack with the leading strategy under different assigner strategies; the leading strategy unleashes the potential of RTrack and improves its performance across all of them. Within the one-to-many assigner strategy, we also explored two sorting approaches, based on Center Distance (CD) and IoU Value (IV) respectively, to obtain the top-k candidate samples. Tab. 3(b) and Tab. 3(c) show the results under different values of k. This further demonstrates the capacity of the leading assigner design.

5 Conclusion

In this paper, we propose a baseline tracker that handles arbitrary representations by adaptively converting sample points into pseudo bounding boxes for visual object tracking. Additionally, we enhance the baseline tracker by introducing a one-to-many leading assignment strategy. To the best of our knowledge, we are the first to conduct an in-depth exploration of the training potential across multiple stages. As pioneers in this approach, we present RTrack, which significantly improves training convergence speed.

References

  • [1] Luca Bertinetto, Jack Valmadre, Joao F Henriques, Andrea Vedaldi, and Philip HS Torr. Fully-convolutional siamese networks for object tracking. In Computer Vision–ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part II 14, pages 850–865. Springer, 2016.
  • [2] Boyu Chen, Peixia Li, Lei Bai, Lei Qiao, Qiuhong Shen, Bo Li, Weihao Gan, Wei Wu, and Wanli Ouyang. Backbone is all your need: a simplified architecture for visual object tracking. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXII, pages 375–392. Springer, 2022.
  • [3] Xin Chen, Houwen Peng, Dong Wang, Huchuan Lu, and Han Hu. Seqtrack: Sequence to sequence learning for visual object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14572–14581, 2023.
  • [4] Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, and Huchuan Lu. Transformer tracking. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 8126–8135. Computer Vision Foundation / IEEE, 2021.
  • [5] Yihong Chen, Zheng Zhang, Yue Cao, Liwei Wang, Stephen Lin, and Han Hu. Reppoints v2: Verification meets regression for object detection. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • [6] Yutao Cui, Cheng Jiang, Limin Wang, and Gangshan Wu. Mixformer: End-to-end tracking with iterative mixed attention. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 13598–13608. IEEE, 2022.
  • [7] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 764–773. IEEE Computer Society, 2017.
  • [8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [9] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. Centernet: Keypoint triplets for object detection. In IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
  • [10] Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling. Lasot: A high-quality benchmark for large-scale single object tracking. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 5374–5383. Computer Vision Foundation / IEEE, 2019.
  • [11] Zhihong Fu, Zehua Fu, Qingjie Liu, Wenrui Cai, and Yunhong Wang. Sparsett: Visual tracking with sparse transformers. In Luc De Raedt, editor, IJCAI, pages 905–912. ijcai.org, 2022.
  • [12] Shenyuan Gao, Chunluan Zhou, Chao Ma, Xinggang Wang, and Junsong Yuan. Aiatrack: Attention in attention for transformer visual tracking. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXII, pages 146–164. Springer, 2022.
  • [13] Shenyuan Gao, Chunluan Zhou, and Jun Zhang. Generalized relation modeling for transformer tracking. CoRR, abs/2303.16580, 2023.
  • [14] Zheng Ge, Songtao Liu, Zeming Li, Osamu Yoshie, and Jian Sun. Ota: Optimal transport assignment for object detection. CVPR, 2021.
  • [15] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
  • [16] Kaijie He, Canlong Zhang, Sheng Xie, Zhixin Li, and Zhiwen Wang. Target-aware tracking with long-term context attention. arXiv preprint arXiv:2302.13840, 2023.
  • [17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [19] Lianghua Huang, Xin Zhao, and Kaiqi Huang. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell., 43(5):1562–1577, 2021.
  • [20] Borui Jiang, Ruixuan Luo, Jiayuan Mao, Tete Xiao, and Yuning Jiang. Acquisition of localization confidence for accurate object detection. In ECCV, pages 784–799, 2018.
  • [21] Borui Jiang, Ruixuan Luo, Jiayuan Mao, Tete Xiao, and Yuning Jiang. Acquisition of localization confidence for accurate object detection. In Proceedings of the European conference on computer vision (ECCV), pages 784–799, 2018.
  • [22] Fehmi Kahraman, Kemal Oksuz, Sinan Kalkan, and Emre Akbas. Correlation loss: Enforcing correlation between classification and localization. CoRR, abs/2301.01019, 2023.
  • [23] Hei Law and Jia Deng. Cornernet: Detecting objects as paired keypoints. In ECCV, pages 734–750, 2018.
  • [24] Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M Ni, and Heung-Yeung Shum. Mask dino: Towards a unified transformer-based framework for object detection and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3041–3050, 2023.
  • [25] Hengduo Li, Zuxuan Wu, Chen Zhu, Caiming Xiong, Richard Socher, and Larry S. Davis. Learning from noisy anchors for one-stage object detection. In CVPR, pages 10588–10597, 2020.
  • [26] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 2999–3007. IEEE Computer Society, 2017.
  • [27] Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, pages 2117–2125, 2017.
  • [28] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2980–2988, 2017.
  • [29] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In The European Conference on Computer Vision (ECCV), 2014.
  • [30] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In ECCV, pages 21–37. Springer, 2016.
  • [31] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
  • [32] Yuchen Ma, Songtao Liu, Zeming Li, and Jian Sun. Iqdet: Instance-wise quality distribution sampling for object detection. In CVPR, pages 1717–1725, 2021.
  • [33] Ziang Ma, Linyuan Wang, Haitao Zhang, Wei Lu, and Jun Yin. Rpt: Learning point set representation for siamese visual tracking. In Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16, pages 653–665. Springer, 2020.
  • [34] Matthias Mueller, Neil Smith, and Bernard Ghanem. A benchmark and simulator for uav tracking. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, pages 445–461. Springer, 2016.
  • [35] Matthias Müller, Adel Bibi, Silvio Giancola, Salman Al-Subaihi, and Bernard Ghanem. Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, ECCV, volume 11205 of Lecture Notes in Computer Science, pages 310–327. Springer, 2018.
  • [36] Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. Advances in neural information processing systems, 34:13937–13949, 2021.
  • [37] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 658–666, 2019.
  • [38] Zikai Song, Junqing Yu, Yi-Ping Phoebe Chen, and Wei Yang. Transformer tracking with cyclic shifting window attention. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 8781–8790. IEEE, 2022.
  • [39] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos: Fully convolutional one-stage object detection. In IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
  • [40] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. CoRR, abs/2207.02696, 2022.
  • [41] Xing Wei, Yifan Bai, Yongchao Zheng, Dahu Shi, and Yihong Gong. Autoregressive visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9697–9706, 2023.
  • [42] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Object tracking benchmark. IEEE Trans. Pattern Anal. Mach. Intell., 37(9):1834–1848, 2015.
  • [43] Fei Xie, Lei Chu, Jiahao Li, Yan Lu, and Chao Ma. Videotrack: Learning to track objects via video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22826–22835, 2023.
  • [44] Fei Xie, Chunyu Wang, Guangting Wang, Yue Cao, Wankou Yang, and Wenjun Zeng. Correlation-aware deep tracking. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 8741–8750. IEEE, 2022.
  • [45] Zhenda Xie, Zigang Geng, Jingcheng Hu, Zheng Zhang, Han Hu, and Yue Cao. Revealing the dark secrets of masked image modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14475–14485, 2023.
  • [46] Bin Yan, Houwen Peng, Jianlong Fu, Dong Wang, and Huchuan Lu. Learning spatio-temporal transformer for visual tracking. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 10428–10437. IEEE, 2021.
  • [47] Ze Yang, Shaohui Liu, Han Hu, Liwei Wang, and Stephen Lin. Reppoints: Point set representation for object detection. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 9656–9665. IEEE, 2019.
  • [48] Botao Ye, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. Joint feature learning and relation modeling for tracking: A one-stream framework. In ECCV, pages 341–357. Springer, 2022.
  • [49] Dawei Zhang, Yanwei Fu, and Zhonglong Zheng. UAST: uncertainty-aware siamese tracking. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors, ICML, volume 162 of Proceedings of Machine Learning Research, pages 26161–26175. PMLR, 2022.
  • [50] Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z Li. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In CVPR, pages 9759–9768, 2020.
  • [51] Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z. Li. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • [52] Haojie Zhao, Dong Wang, and Huchuan Lu. Representation learning for visual object tracking by masked appearance transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18696–18705, 2023.
  • [53] Xingyi Zhou, Jiacheng Zhuo, and Philipp Krähenbühl. Bottom-up object detection by grouping extreme and center points. In CVPR, 2019.
  • [54] Benjin Zhu, Jianfeng Wang, Zhengkai Jiang, Fuhang Zong, Songtao Liu, Zeming Li, and Jian Sun. Autoassign: Differentiable label assignment for dense object detection. arXiv preprint arXiv:2007.03496, 2020.
  • [55] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. In CVPR, 2019.
  • [56] Zhuofan Zong, Guanglu Song, and Yu Liu. Detrs with collaborative hybrid assignments training. CoRR, abs/2211.12860, 2022.

Supplementary Material

A. Review of RepPoints

RepPoints adopts pure regression to achieve object localization. Starting from a feature map position $p=(x,y)$, it directly regresses a set of points $\mathcal{S}'=\{p'_i=(x'_i,y'_i)\}_{i=1}^{n}$ to represent the appearance, using two progressive steps:

$\textbf{Step 1:}\quad p_i = p + \Delta p_i = p + g_i(F_p)$
$\textbf{Step 2:}\quad p'_i = p_i + \Delta p'_i = p_i + g'_i\big(\operatorname{concat}(\{F_{p_i}\}_{i=1}^{n})\big),$

where $\mathcal{S}=\{p_i=(x_i,y_i)\}_{i=1}^{n}$ is the intermediate point-set representation; $F_p$ denotes the feature vector at position $p$; $g_i$ and $g'_i$ are 2-d regression functions implemented by a linear layer. The pseudo bounding box is obtained by applying a conversion function $\mathcal{T}$ to the point sets $\mathcal{S}$ and $\mathcal{S}'$, where $\mathcal{T}$ is modeled as a min-max or moment function.
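A minimal sketch of the two regression steps; the layer names are illustrative, and in the full model the second step aggregates the features sampled at the first point set (e.g. through deformable convolution) rather than reusing the bin feature directly.

```python
import torch
import torch.nn as nn

class TwoStepPointRegressor(nn.Module):
    """Step 1: p_i = p + g_i(F_p);  Step 2: p'_i = p_i + g'_i(aggregated features)."""
    def __init__(self, channels=256, num_points=9):
        super().__init__()
        self.num_points = num_points
        self.g = nn.Conv2d(channels, 2 * num_points, 1)        # offsets from the bin feature
        self.g_prime = nn.Conv2d(channels, 2 * num_points, 1)  # refinement offsets

    def forward(self, feat, grid):
        # feat: (B, C, H, W); grid: (B, 2, H, W) holding the (x, y) position p of every bin.
        pts = grid.repeat(1, self.num_points, 1, 1) + self.g(feat)   # Step 1
        pts = pts + self.g_prime(feat)                               # Step 2 (simplified)
        return pts
```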

To achieve accurate SOT, a set of adaptive sample points are modeled in our proposed method, which we refer to as TrackPoints. The sample points are defined as

$\mathcal{S}=\{(x_k, y_k)\}_{k=1}^{n},$ (1)

where $n$ is the number of sample points used in the representation. In our approach, we set $n$ to 9 by default.

To improve the localization accuracy of the tracked object, we progressively refine the position of the TrackPoints. The refinement process can be expressed as

$\mathcal{S}_r=\{(x_k+\Delta x_k,\ y_k+\Delta y_k)\}_{k=1}^{n},$ (2)

where $\{(\Delta x_k, \Delta y_k)\}_{k=1}^{n}$ are the predicted offsets of the new sample points with respect to the old ones. The refinement process implicitly considers the changes in the object's position and appearance.

To evaluate the performance of our proposed SOT method, we need to convert the TrackPoints into a bounding box for comparison with the ground-truth bounding box. We use a predefined converting function $\mathcal{T}:\mathcal{T}_O\rightarrow\mathcal{B}_O$, where $\mathcal{T}_O$ denotes the TrackPoints for object $O$ and $\mathcal{B}_O$ represents a pseudo box.

Two converting functions are considered in our approach:

  • Min-max function. This function performs a min-max operation over both axes of the TrackPoints to determine $\mathcal{B}_O$, which is equivalent to the bounding box over the sample points.

  • Moment-based function. This function computes the center point and scale of the rectangular box $\mathcal{B}_O$ using the mean value and the standard deviation of the TrackPoints, where the scale is multiplied by globally-shared learnable multipliers $\lambda_x$ and $\lambda_y$.

Both converting functions are differentiable, which allows us to perform end-to-end learning when they are incorporated into the SOT system.
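Both conversions in a short sketch; the parameterization of the learnable multipliers $\lambda_x$ and $\lambda_y$ below is our own illustrative choice.

```python
import torch
import torch.nn as nn

def minmax_box(points):
    """Min-max conversion: the tightest axis-aligned box over the TrackPoints."""
    xs, ys = points[..., 0], points[..., 1]                 # points: (..., n, 2)
    return torch.stack([xs.min(-1).values, ys.min(-1).values,
                        xs.max(-1).values, ys.max(-1).values], dim=-1)

class MomentBox(nn.Module):
    """Moment-based conversion: center = mean of the points, scale = std times learnable multipliers."""
    def __init__(self):
        super().__init__()
        self.log_lambda_x = nn.Parameter(torch.zeros(1))    # learned in log-space (assumption)
        self.log_lambda_y = nn.Parameter(torch.zeros(1))

    def forward(self, points):                              # points: (..., n, 2)
        mean, std = points.mean(dim=-2), points.std(dim=-2)
        half_w = std[..., 0] * self.log_lambda_x.exp()
        half_h = std[..., 1] * self.log_lambda_y.exp()
        return torch.stack([mean[..., 0] - half_w, mean[..., 1] - half_h,
                            mean[..., 0] + half_w, mean[..., 1] + half_h], dim=-1)
```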

B. Differences with RepPoints

Our proposed RTrack shares a similar spirit with RepPoints [47]: both cast the object representation as a set of sample points. However, our method differs from RepPoints in four fundamental ways. 1) The tasks are different: RepPoints is designed for the object detection (OD) task, while ours is for tracking. 2) The architectures are different: RepPoints adopts ResNet [18] as its backbone followed by feature pyramid networks (FPN) [27], whereas our method is more compact, using a single transformer encoder without hierarchical features; RTrack employs ViT [8] as the encoder for feature extraction. 3) The sample allocation strategies are different. In the refine stage, RepPoints selects positive and negative samples based on a hand-crafted static IoU threshold. Since single object tracking (SOT) involves only a single target, the choice of strategy is crucial for training stability and convergence speed. In contrast, our sample allocation strategy utilizes one-to-many k-center nearest neighbors (KCNN) to obtain candidate positive samples, and we further divide positive and negative samples for effective training based on a dynamic threshold derived from the mean and variance. 4) The perspective on classification and regression is different. In RepPoints, the classification task typically involves a multi-class detection problem, e.g., 80 classes in the COCO [29] dataset; one-hot labels and a cross-entropy loss naturally match the requirements of multi-class classification. In our task, we treat classification as a foreground-background segmentation problem. After truncating the gradient information of the regression branch, we combine classification and IoU collaboratively to enhance performance.

C. Performance metrics

The official online evaluation server for the GOT-10k [19] dataset uses the average overlap (AO), success rate ($SR_{0.5}$), and success rate ($SR_{0.75}$) as evaluation criteria.

  • Average Overlap (AO): This measures the average intersection-over-union (IOU) between the predicted bounding box and the ground truth over the entire sequence. AO is calculated as follows:

    $AO=\frac{1}{N}\sum_{i=1}^{N}\mathrm{IOU}_{i}$ (3)
  • Success Rate ($SR_{0.5}$): This measures the proportion of frames for which the IOU between the predicted bounding box and the ground truth is greater than or equal to 0.5. $SR_{0.5}$ is calculated as follows:

    $SR_{0.5}=\frac{1}{N}\sum_{i=1}^{N}\left[\mathrm{IOU}_{i}\geq 0.5\right]$ (4)

    where $[\mathrm{IOU}_{i}\geq 0.5]$ equals 1 if $\mathrm{IOU}_{i}\geq 0.5$ and 0 otherwise.

  • $SR_{0.75}$ is similar to $SR_{0.5}$, but it measures the success rate at a higher overlap threshold of 0.75.
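These metrics for a single sequence in a short sketch, assuming `ious` holds the per-frame IoU between the prediction and the ground truth:

```python
import numpy as np

def got10k_metrics(ious):
    """Average overlap and success rates at the 0.5 / 0.75 IoU thresholds (Eqs. 3-4)."""
    ious = np.asarray(ious, dtype=np.float64)
    return {
        "AO": ious.mean(),
        "SR_0.5": (ious >= 0.5).mean(),
        "SR_0.75": (ious >= 0.75).mean(),
    }
```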

| | SiamFC++ | DiMP | PrDiMP | TrDiMP | SiamRCNN | TransT | STARK-ST101 | ToMP-50 | MixFormer-1k | MAT | RTrack (ours) |
| UAV123 | - | 65.4 | 68.0 | 67.5 | 64.9 | 69.1 | 68.2 | 69.0 | 68.7 | 68.0 | 69.5 |
| LASOT | 54.4 | 56.9 | 59.8 | 63.9 | 64.8 | 64.9 | 67.1 | 67.6 | 67.9 | 65.6 | 68.9 |
| OTB100 | 68.3 | 68.4 | 69.6 | 71.1 | 70.1 | 69.4 | 68.1 | 70.1 | 70.4 | - | 71.1 |
Table A1: Comparison with state-of-the-art trackers on the LASOT [10], UAV123 [34] and OTB100 [42] datasets in terms of AUC (%).

We use AUC, $P_{norm}$, and $P$ as evaluation metrics for all datasets except GOT-10k. AUC (Area Under the Curve) measures the area under the success plot, i.e., the curve of success rate versus overlap threshold. $P_{norm}$ is the normalized precision at 20 pixels, which measures the precision of the predicted bounding box with the center distance normalized by the size of the ground-truth box. $P$ is the precision score, which measures the percentage of frames in which the distance between the centers of the predicted and ground-truth boxes is below a given pixel threshold.

Figure A1: Influence of the init position count and sample points on GOT-10k [19].

D. Experiments on UAV123, LASOT and OTB-100

UAV123 [34] is a dataset comprising 123 video sequences captured from the perspective of a low-altitude unmanned aerial vehicle. With an average sequence length of 915 frames, it is designed for long-term tracking evaluation. LaSOT [10] is also a benchmark specifically designed for long-term tracking evaluation. It features a total of 1,400 video sequences covering 70 object classes, with a test set of 280 videos. The Object Tracking Benchmark (OTB) [42] evaluates the performance of visual tracking algorithms. The results in Tab. A1 show that RTrack outperforms most trackers on all three benchmarks, demonstrating its strong generalizability.

| Correlation setting | GOT-10k mAO (%) $\uparrow$ | LASOT AUC (%) $\uparrow$ |
| w/o gradient truncation | N/A | N/A |
| pos | 74.3 | 66.8 |
| pos & neg | 73.7 | 66.4 |
(a) Correlation settings. 'N/A' means unstable inference.
| Trackers | GOT-10k mAO (%) | GOT-10k mSR50 (%) | GOT-10k mSR75 (%) | LASOT SUC (%) | LASOT PRE (%) | TKNET SUC (%) | TKNET RPE (%) | Training Pairs ($\times 10^6$) |
| RTrack100 | 74.4 | 84.1 | 72.3 | 67.4 | 73.4 | 83.6 | 82.5 | 6 |
| RTrack300 | 75.9 | 85.2 | 74.8 | 67.0 | 72.7 | 84.2 | 83.3 | 18 |
| RTrack500 | 74.2 | 83.6 | 72.7 | 68.1 | 74.3 | 83.9 | 82.7 | 30 |
(b) RTrack extension as epochs (training pairs) increase. Note: for each tracker, we performed a full round of training (i.e., the decay epoch of the learning rate is always at 0.8 of the total epochs).
Table A2: More ablative studies on GOT-10k, LASOT and TrackingNet (TKNET).

E. More ablation studies and analyses

Mutual affinity of pos & neg samples. To investigate the impact of the correlation between classification and localization on positive and negative samples, we first limited the correlation to positive samples only. In Tab. A2(a), we extend the correlation to both positive and negative samples and find that this extension does not have a significant impact on the results. In our design, we implicitly incorporate correlation modeling into classification and regression. This integration allows us to incorporate IoU information into the classification branch, thereby improving object localization, as shown in Fig. A2.

Figure A2: The distribution of classification scores in the search region. The top row shows the case where there is no correlation between classification and localization; in the bottom row, the mutual affinity between classification and localization is modeled.

Init position count in baseline tracker. In our proposed baseline RTrack, we use a center point-based representation as the initial point representation in the init stage. To investigate the impact of the number of initial point representations on performance, we expanded this setting to {2, 3, 4, 5, 6, 7, 8, 9} in Fig. A1. We found that increasing the init position count degrades tracking performance, indicating that the center point-based representation benefits both the representation of the target and the convergence of training.

Sampling point count in baseline tracker. To explore whether increasing the number of sampling points is beneficial, we gradually increased the number of sampling points from 9 to {16, 25, 36, 49}. As shown in Fig. A1, increasing the number of sampling points does not lead to performance improvement. This suggests that simply adding more sample points may degrade tracking performance, perhaps because objects face complex background effects, such as occlusion, as tracking progresses.

Analysis on the extension of RTrack. To explore whether an increase in training samples improves performance, we gradually increase the number of epochs to {100, 300, 500} to align with the training costs of current state-of-the-art trackers, as reported in Tab. A2(b). The refine stage adopts a sorting strategy based on center distance (CD), where the top 20 samples are selected as candidates. We found that the proportional increase in training time resulted in only a slight performance improvement on short-term tracking datasets.

Figure A3: Training IoU vs. epoch. Best viewed in color and zoom in.

Analysis on training convergence potential. To better illustrate the improvement in training convergence speed achieved by the one-to-many leading strategy, we compare our baseline tracker using the one-to-one matching strategy with our proposed one-to-many leading assignment strategy. As shown in Fig. A3, the simple one-to-one strategy, which only assigns a unique positive sample without incorporating tricks for localization, exhibits the slowest convergence. The one-to-one matching strategy used in the training-efficient OSTrack [48], which incorporates a Gaussian focal loss [26], improves the training convergence speed. The experimental results demonstrate that our one-to-many leading assignment strategy can further tap into the training convergence potential of the tracker.

Figure A4: Point representation. We employ a min-max operation over both axes of the TrackPoints to determine the final pseudo bounding box.

Visualization of non-rigid representation. To better understand RTrack, we plot the sampling points while predicting point coordinates. To test the robustness of our model, we use complex scenarios encountered in real-world tracking, such as rotation and background clutter, as shown in Fig. A4. Interestingly, our tracker focuses on the appropriate extremities when predicting each coordinate, demonstrating its ability for precise localization.