Lightweight Full-Convolutional Siamese Tracker
Abstract
Although single object trackers have achieved advanced performance, their large-scale models hinder their deployment on platforms with limited resources. Moreover, existing lightweight trackers only balance two or three of the four factors of parameters, performance, Flops and FPS. To achieve the optimal balance among all four, this paper proposes a lightweight full-convolutional Siamese tracker called LightFC. LightFC employs a novel efficient cross-correlation module (ECM) and a novel efficient rep-center head (ERH) to improve the feature representation of the convolutional tracking pipeline. The ECM uses an attention-like module design, which conducts spatial and channel linear fusion of the fused features and enhances their nonlinearity. Additionally, it draws on the successful factors of current lightweight trackers by introducing skip-connections and the reuse of search area features. The ERH reparameterizes the feature dimensional stage in the standard center-head and introduces channel attention to optimize the bottleneck of key feature flows. Comprehensive experiments show that LightFC achieves the optimal balance between performance, parameters, Flops and FPS. The precision score of LightFC outperforms MixFormerV2-S on TrackingNet and TNL2K by 3.7 % and 6.5 %, respectively, while using 5x fewer parameters and 4.6x fewer Flops. Besides, LightFC runs 2x faster than MixFormerV2-S on CPUs. In addition, a higher-performance version named LightFC-vit is proposed by replacing the backbone with a more powerful network. The code and raw results can be found at https://github.com/LiYunfengLYF/LightFC.
1 Introduction
Single object tracking (SOT) is a fundamental task in computer vision that aims to track a target throughout a sequence of images or a video by utilizing the target's appearance template. In recent years, while single object trackers have achieved impressive performance improvements, they are built on large network structures with many parameters and Flops. Deploying them on devices with limited resources is therefore challenging because of their high cost.
A lightweight tracker needs to achieve an optimal balance between performance, parameters, Flops and FPS. Full-convolutional lightweight trackers with fewer parameters and Flops, such as LightTrack [1] and FEAR [2], show insufficient performance. For example, the AUC of LightTrack [1] and FEAR [2] on LaSOT [3] is 7 % and 6.8 % lower than that of MixFormerV2-S [4], respectively. To improve the performance of lightweight trackers, efficient attention mechanisms have been introduced into their design, and E.T.Track [5] and MixFormerV2-S [4] achieve significant performance improvements. However, the use of attention mechanisms results in a significant increase in both parameters and Flops. For instance, E.T.Track [5] and MixFormerV2-S [4] have 3.5x and 8x more parameters, and 3x and 8x more Flops, than LightTrack [1], respectively. As a result, current lightweight trackers only balance two or three of these factors rather than all four.
This paper aims to design an efficient tracker that effectively balances these four factors. To minimize the number of parameters and Flops, a full-convolutional tracking pipeline is selected as the baseline. To achieve better performance, this paper focuses on improving the model's feature expressiveness, since an effective network design serves as the foundation for other improvement strategies, including training methods. Thus, the critical problem is to design an efficient feature fusion module and an efficient prediction head that boost the tracker's performance.

Therefore, a lightweight full-convolutional tracking pipeline, LightFC, is proposed. LightFC comprises two novel components: an Efficient Cross-Correlation Module (ECM) and an Efficient Rep-Center Head (ERH). The ECM fuses the spatial and channel information of the fused features and then enhances their nonlinearity. The success factors of the feature fusion modules of classic lightweight trackers are also integrated to further improve the ECM. The ERH introduces reparameterization and channel attention to optimize the feature representation bottleneck between the first and second convolutional blocks of each branch of the center-head [7]. These two modules significantly improve the model's feature representation at minimal expense (LightFC: 3.2 M parameters vs. LightTrack-LargeB [1]: 3.1 M parameters). Higher performance is attained by another version called LightFC-vit, which substitutes the more powerful TinyViT [8] for the MobileNetV2 [9] backbone. As shown in Fig. 1, LightFC achieves better performance than conv-based lightweight trackers and even outperforms the most advanced attention-based lightweight trackers, while requiring fewer parameters and Flops. LightFC outperforms MixFormerV2-S [4] by 3.7 % and 6.5 % in precision and 1.2 % and 2.6 % in AUC on TrackingNet [10] and TNL2K [11], respectively, while using 5x fewer parameters and 4.6x fewer Flops. In addition, LightFC is 2x faster than MixFormerV2-S [4] on CPUs. LightFC-vit outperforms LightFC on LaSOT by 4.8 % in precision and 3 % in AUC.
The paper’s main contributions are:
1. An Efficient Cross-Correlation Module (ECM) is designed based on the characteristics of pixel-wise correlation relationship modeling. Subsequently, a spatial and channel fusion (SCF) unit and an inverted activation block (IAB) are added to combine the local space and channel features and enhance the model’s nonlinearity, respectively. In addition, the success factors of existing feature fusion modules are analyzed and applied to further improve the performance of ECM.
2. An Efficient Rep-Center Head (ERH) is proposed to eliminate the feature flow representation bottleneck of the center-head [7] by utilizing the reparameterization technique. To improve the spatial and channel feature representation, the first convolutional block of each branch of the center-head [7] is reparameterized. In addition, an SE [12] module is added between the first and second blocks of each branch to further optimize the bottleneck of key feature expression.
3. This paper proposes two lightweight full-convolutional Siamese trackers, namely LightFC and LightFC-vit. Comprehensive experiments show that they achieve state-of-the-art performance on several benchmarks while maintaining a good balance between parameters, Flops, FPS, and performance.
2 Related Work
2.1 Lightweight Convolutional Networks
Lightweight convolutional neural networks (CNNs) have made significant progress, and many efficient module designs have been proposed to improve their performance. Separable convolution, proposed by MobileNetV1 [13], decomposes a standard convolution into a depth-wise convolution and a point-wise convolution to reduce size and latency. Channel shuffle, proposed by ShuffleNets [14][15], reduces cost and achieves better accuracy. The inverted residual block proposed by MobileNetV2 [9] is a classic and efficient module design in lightweight networks; it also plays an important role in modern large-scale network design, such as the conv-based FFN [16] in transformer networks. Many efficient networks, such as MobileNetV3 [17] and EfficientNets [18], use neural architecture search (NAS) to design their models. Structural reparameterization, employed by RepVGG [19], MobileOne [20] and RepViT [21], provides a technique for improving model performance without adding parameters: it uses additional convolutional kernels during training to enhance the feature expression of the model and fuses the multiple kernels into a single convolutional kernel during inference. Due to its effectiveness, reparameterization has been introduced into object detection, e.g., YOLOv6 [22] and YOLOv7 [23], but it has not yet received much attention in SOT.
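To make the inference-time merging step concrete, the PyTorch sketch below (not taken from any of the cited implementations) folds a Conv-BN pair into a single kernel and then sums a parallel 3x3 and 1x1 branch into one 3x3 convolution, which is the basic trick these reparameterization methods share:

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    """Fold BatchNorm statistics into the preceding convolution's weight and bias."""
    std = (bn.running_var + bn.eps).sqrt()
    scale = bn.weight / std                                   # per-channel scale
    fused_w = conv.weight * scale.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused_b = (conv_bias - bn.running_mean) * scale + bn.bias
    return fused_w, fused_b

def merge_3x3_and_1x1(w3, b3, w1, b1):
    """Zero-pad the 1x1 kernel to 3x3 and sum the parallel branches into one kernel."""
    w1_padded = nn.functional.pad(w1, [1, 1, 1, 1])
    return w3 + w1_padded, b3 + b1

# Toy check: a single merged 3x3 conv reproduces the two-branch output (eval mode).
x = torch.randn(1, 8, 16, 16)
conv3, bn3 = nn.Conv2d(8, 8, 3, padding=1, bias=False), nn.BatchNorm2d(8)
conv1, bn1 = nn.Conv2d(8, 8, 1, bias=False), nn.BatchNorm2d(8)
for m in (bn3, bn1):
    m.eval()
w3, b3 = fuse_conv_bn(conv3, bn3)
w1, b1 = fuse_conv_bn(conv1, bn1)
w, b = merge_3x3_and_1x1(w3, b3, w1, b1)
two_branch = bn3(conv3(x)) + bn1(conv1(x))
merged = nn.functional.conv2d(x, w, b, padding=1)
print(torch.allclose(two_branch, merged, atol=1e-5))          # expected: True
```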
2.2 Siamese Trackers
The Siamese network has become the basic architecture of trackers due to its simplicity and efficiency, and powerful feature fusion networks and prediction heads are essential components of Siamese-based trackers. As the pioneering work, SiamFC [24] employed naïve correlation as its fusion operation and prediction head. SiamRPN [25] and SiamRPN++ [26] improved SiamFC by employing an RPN as the prediction head. SiamAttn [27] introduced deformable convolutional attention to improve feature fusion and proposed a region refinement module to obtain more accurate boxes. As anchor-free trackers, SiamBAN [28] and SiamCAR [29] employed depth-wise correlation to fuse the features of the template and search area, and then used different types of fully convolutional networks (FCN) or multi-layer perceptrons (MLP) as their prediction heads. SiamOAN [30] proposed a conv-based channel attention module to enhance the features of the template and search area, followed by naïve depth-wise correlation for further feature fusion. SiamAGN [31] proposed convolutional self-attention-like and cross-attention-like modules to model long-term and dynamic context and to preliminarily model the relationship between the template and search area, with depth-wise correlation used for further fusion. SiamBAN-ACM [32] proposed an asymmetric convolution in which the convolution operation is decomposed into two mathematically equivalent operations to improve depth-wise correlation.
In contrast to them, LightFC focuses on enhancing the feature representations after fusion, rather than before fusion or during fusion. In addition, LightFC performs a detailed analysis and optimization of the center-head’s feature representation bottleneck [7] rather than simply applying an FCN as the prediction head.
With the introduction of the transformer architecture into SOT, trackers based on attention mechanisms have achieved more powerful feature fusion, including self-attention [6], combined self-attention and cross-attention [33], improved attention [34], vision transformers [35], and mixed attention [36].
Although these transformer-based models are difficult to apply directly to lightweight trackers, tricks in these attention models, such as nonlinear layers and skip-connections, can provide a reference for designing full-convolutional fusion modules.
2.3 Lightweight Trackers
Alpha-Refine [37] discussed the performance of naïve correlation, depth-wise correlation, and pixel-wise correlation in detail in a small-scale tracker and demonstrated the effectiveness of pixel-wise correlation. LightTrack [1] employed neural architecture search (NAS) to find a convolutional architecture that is both lightweight and efficient; its feature fusion module contains only pixel-wise correlation and an SE [12] module, and its prediction head is an FCN consisting of two convolutional branches. FEAR [2] used separable convolutional blocks to enhance the expressiveness of the fused features and proposed a more efficient hand-crafted convolutional structure and a template update approach; it used two branches with the same structure for classification and regression, respectively. Stark-Lightning [6] adopted RepVGG [19] as the backbone for feature extraction and a single transformer encoder for feature fusion. E.T.Track [5] proposed exemplar attention to improve the prediction head of LightTrack [1]. MixFormerV2-S [4] proposed a single-stream, fully transformer-based tracking architecture. These attention-based lightweight trackers perform competitively, but their major shortcoming is that they remain expensive and heavy. For the feature fusion module, LightFC adopts the fusion module of LightTrack [1] and E.T.Track [5] as its baseline. Compared to Alpha-Refine [37] and LightTrack [1], LightFC improves the feature representation and enhances the model's nonlinearity, and the reuse of search area features also supplements the lost target semantic information. Compared to FEAR [2], LightFC uses only one branch consisting of the fusion module and the prediction head, rather than two branches with the same structure. Compared to Alpha-Refine [37], LightTrack [1], FEAR [2], Stark-Lightning [6] and MixFormerV2-S [4], LightFC improves the feature representation of the prediction head at no additional cost. Compared to E.T.Track [5], the fully convolutional prediction head of LightFC has fewer parameters and Flops.
3 LightFC
The key to designing LightFC is improving the feature representation of the full-convolutional model. Fig. 2 shows the overall architecture of LightFC, which consists of a backbone, an Efficient Cross-Correlation Module (ECM), and an Efficient Rep-Center Head (ERH). This work can be summarized in four parts. First, the most suitable backbone network for LightFC's tracking pipeline is identified from a broad range of efficient networks. Second, this paper analyzes the success factors of current feature fusion modules of lightweight trackers and proposes three hypotheses, namely adding nonlinear blocks, adding skip-connections, and reusing search area features, to develop the ECM. Third, this paper analyzes the bottleneck of feature flow representation in the center-head [7] and introduces reparameterization and channel attention to propose the ERH. Fourth, out of a large number of intersection-over-union (IoU) loss functions, the paper selects the most suitable box IoU loss. A detailed analysis of the effectiveness of each part is provided in Section 4.3.

3.1 Backbone
A lightweight and efficient backbone is essential for improving the performance of a lightweight tracking pipeline. [1] and [35] demonstrate the effectiveness of using a pretrained backbone to improve a tracker's performance. Therefore, this paper conducts a comprehensive performance analysis of current pretrained lightweight backbones in LightFC's tracking pipeline to select the most suitable one (refer to Section 4.3.1). MobileNetV2 [9] and TinyViT [8] are selected as baselines because they exhibit the best performance among convolutional and transformer backbone networks, respectively. The input of the backbone is a pair of images, namely the template image $Z \in \mathbb{R}^{3\times H_z\times W_z}$ and the search area image $X \in \mathbb{R}^{3\times H_x\times W_x}$. They are fed into the shared backbone to extract the features $F_z \in \mathbb{R}^{C\times \frac{H_z}{s}\times \frac{W_z}{s}}$ and $F_x \in \mathbb{R}^{C\times \frac{H_x}{s}\times \frac{W_x}{s}}$, where $C$ is the output channel and $s$ is the stride.
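As a rough illustration of this Siamese feature extraction, the sketch below pushes a template and a search image through a shared, truncated MobileNetV2 from torchvision; the truncation point and the 128x128 template size are assumptions for illustration, while the 256x256 search size matches the default used later in Table 15:

```python
import torch
import torchvision

# Truncating torchvision's MobileNetV2 after its stride-16 stage is an assumption
# for illustration; the paper's exact truncation point may differ.
backbone = torchvision.models.mobilenet_v2(weights=None).features[:14]  # output stride 16

template = torch.randn(1, 3, 128, 128)   # template image Z (size is an assumption)
search = torch.randn(1, 3, 256, 256)     # search area image X (256 matches Table 15's default)

with torch.no_grad():
    f_z = backbone(template)             # shared weights: the same backbone processes both inputs
    f_x = backbone(search)

print(f_z.shape, f_x.shape)              # e.g. (1, 96, 8, 8) and (1, 96, 16, 16)
```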
3.2 Efficient Cross-Correlation Module
First, the designs of the feature fusion modules of Stark-Lightning [6], LightTrack [1], E.T.Track [5] and FEAR [2] are reviewed. Nonlinear blocks such as the Linear-Activation-Linear block in Stark-Lightning [6] and the SepConv-BN-ReLU block in FEAR [2], shown in Fig. 3, are crucial for enhancing the nonlinearity of the model. Besides, the skip-connections in Stark-Lightning [6] preserve the original information. Additionally, both Stark-Lightning [6] and FEAR [2] reuse the search area features to supplement target semantic information. Even though the feature fusion module used by LightTrack [1] and E.T.Track [5] is very efficient, it does not further consider improving the nonlinearity of the features or optimizing the use of existing information to improve the feature representation.
Therefore, this work employs pixel-wise correlation and an SE [12] module as the baseline and makes the following assumptions to improve the feature representation: 1. adding nonlinear blocks could enhance the nonlinearity of the feature fusion module; 2. introducing skip-connections could preserve the semantic information of the original features [38]; 3. reusing search area features could supplement the target's appearance semantic information that is lost during feature fusion [39]. The feature fusion module baseline is described as:
$F_{cc} = \phi_{SE}(F_z \star F_x)$ (1)
where $F_{cc} \in \mathbb{R}^{\frac{H_z}{s}\frac{W_z}{s}\times \frac{H_x}{s}\times \frac{W_x}{s}}$ denotes the fused features, $\phi_{SE}(\cdot)$ denotes the SE function, and $\star$ denotes pixel-wise correlation, which applies a naïve correlation between each template point and the search area features.

Pixel-wise correlation treats each point in the template features as a convolution kernel and convolves it with the entire search area features. Thus, each point in the template features interacts with the global search area features. As shown in Fig. 4, pixel-wise correlation operates like a conv-based cross-attention, modeling the relationship between the template and the search area in a way similar to the self-attention used in transformer trackers. Therefore, following the first factor above and the architecture design of attention-like modules, two nonlinear blocks are introduced behind the pixel-wise correlation layer to improve the model's performance.
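The following sketch illustrates the Eq. (1) baseline under this view: a pixel-wise correlation implemented with an einsum (each template point acting as a 1x1 kernel over the search features) followed by a minimal SE block. The shapes, the SE reduction ratio, and the class names are illustrative assumptions rather than the exact configuration of LightFC:

```python
import torch
import torch.nn as nn

def pixel_wise_corr(f_z: torch.Tensor, f_x: torch.Tensor) -> torch.Tensor:
    """Pixel-wise correlation: every template point is swept as a 1x1 kernel over
    the whole search-area feature map.
    f_z: (B, C, Hz, Wz), f_x: (B, C, Hx, Wx) -> (B, Hz*Wz, Hx, Wx)."""
    kernels = f_z.flatten(2)                          # (B, C, Hz*Wz)
    return torch.einsum('bcn,bchw->bnhw', kernels, f_x)

class SEBlock(nn.Module):
    """Minimal squeeze-and-excitation channel attention (assumed reduction of 4)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))               # global average pool -> channel weights
        return x * w[:, :, None, None]

# Eq. (1) baseline: SE applied on top of the pixel-wise correlation map.
f_z, f_x = torch.randn(2, 96, 8, 8), torch.randn(2, 96, 16, 16)
corr = pixel_wise_corr(f_z, f_x)                      # (2, 64, 16, 16): one map per template point
f_cc = SEBlock(corr.shape[1])(corr)
print(f_cc.shape)
```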
Each layer of the fused features corresponds to the naïve correlation map between one point on the template and the entire search region. The local spatial information around each layer's response peak lacks further refinement, and the layers of the fused features operate independently, resulting in a lack of information exchange between them. To address these issues, this paper first introduces the spatial and channel fusion (SCF) unit to linearly fuse the local spatial information and channel information of the fused features. The SCF consists of a 3x3 Conv-BN branch $f_{3\times3}$ and a 1x1 Conv-BN branch $f_{1\times1}$. $f_{3\times3}$ maps the fused features to the same dimension and models the local spatial information, while $f_{1\times1}$ makes the same dimensional transformation as $f_{3\times3}$ and fuses the channel information. The structure of SCF can be merged into a single 3x3 convolutional kernel by the structural reparameterization technique. Moreover, a skip-connection is introduced to retain the original fused features. The process of SCF can be written as follows:
$F_{SCF} = f_{3\times3}(F_{cc}) + f_{1\times1}(F_{cc}) + F_{cc}$ (2)
$F_{SCF} = \mathrm{Rep}(f_{3\times3}, f_{1\times1})(F_{cc}) + F_{cc}$ (3)
where Eq. (2) is the training-time form, Eq. (3) is the inference-time form, and $\mathrm{Rep}(\cdot)$ denotes the reparameterization transformation that merges the two branches.
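A minimal sketch of the training-time SCF unit described by Eq. (2) is given below; the channel count is an assumption, and at inference the two branches would be folded into one 3x3 kernel as in the reparameterization sketch of Section 2.1:

```python
import torch
import torch.nn as nn

class SCF(nn.Module):
    """Spatial and channel fusion unit (training-time form of Eq. (2)): a 3x3
    Conv-BN branch for local spatial fusion, a 1x1 Conv-BN branch for channel
    fusion, and a skip connection that keeps the original fused features."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv3 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(channels))
        self.conv1 = nn.Sequential(nn.Conv2d(channels, channels, 1, bias=False),
                                   nn.BatchNorm2d(channels))

    def forward(self, x):
        # At inference the two branches can be merged into a single 3x3 kernel (Eq. (3)).
        return self.conv3(x) + self.conv1(x) + x

f_cc = torch.randn(2, 64, 16, 16)          # fused features from pixel-wise correlation + SE
print(SCF(64)(f_cc).shape)                 # torch.Size([2, 64, 16, 16])
```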
To enhance the model's nonlinearity, an inverted activation block (IAB) is introduced. IAB contains two different 1x1 Conv-BN branches, namely $g_{up}$ and $g_{down}$, and a nonlinear GELU activation function. $g_{up}$ maps the features from $C$ to $rC$ channels, where $r$ is the channel expansion rate, and $g_{down}$ reduces the dimension from $rC$ back to $C$. A skip-connection is also introduced. The detailed process is described as follows:
$F_{IAB} = g_{down}(\mathrm{GELU}(g_{up}(F_{SCF}))) + F_{SCF}$ (4)
Following the design of LightTrack [1] and E.T.Track [5], a 1x1 Conv branch $h_{1\times1}$ is used to increase the feature dimension from $C$ to $C'$, where $C' > C$. Besides, feature concatenation is introduced to reuse the search area features and supplement the semantic information lost during feature fusion:
$F_{ECM} = \mathrm{Cat}(h_{1\times1}(F_{IAB}), F_x)$ (5)
where $F_{ECM}$ denotes the output features of the ECM and $\mathrm{Cat}(\cdot,\cdot)$ denotes tensor concatenation.
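The sketch below assembles the remaining ECM steps: an IAB as in Eq. (4) with the expansion rate r = 2 used later in Table 16, a 1x1 lifting convolution, and the concatenation of Eq. (5). The specific channel sizes are assumptions for illustration:

```python
import torch
import torch.nn as nn

class IAB(nn.Module):
    """Inverted activation block: 1x1 expand (x r), GELU, 1x1 reduce, plus a skip
    connection, mirroring Eq. (4). r = 2 follows the ablation in Table 16."""
    def __init__(self, channels: int, r: int = 2):
        super().__init__()
        self.up = nn.Sequential(nn.Conv2d(channels, channels * r, 1, bias=False),
                                nn.BatchNorm2d(channels * r))
        self.act = nn.GELU()
        self.down = nn.Sequential(nn.Conv2d(channels * r, channels, 1, bias=False),
                                  nn.BatchNorm2d(channels))

    def forward(self, x):
        return self.down(self.act(self.up(x))) + x

# Eq. (5): lift the fused features with a 1x1 conv, then concatenate the search
# features to reuse their semantic information. Channel sizes here are assumptions.
f_iab = IAB(64)(torch.randn(2, 64, 16, 16))
lift = nn.Conv2d(64, 160, kernel_size=1)              # hypothetical C -> C' expansion
f_x = torch.randn(2, 96, 16, 16)                      # backbone search-area features
f_ecm = torch.cat([lift(f_iab), f_x], dim=1)          # (2, 256, 16, 16)
print(f_ecm.shape)
```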

3.3 Efficient Rep-Center Head
There are two classic types of prediction heads. One outputs a box end-to-end, as seen in the corner-head utilized by the Stark family [6]. The other outputs boxes with confidence scores and selects the one with the highest confidence as the output. For instance, the center-head [7] used by OSTrack [35] outputs a response map, an offset map, and a size map before decoding the output box, whereas the MLP head used by TransT [33] outputs 1024 boxes directly. NeighborTrack [40] and UOSTrack [41] demonstrate that tracker performance can be enhanced by reusing candidate boxes. This paper therefore selects the FCN-based center-head [7] as the prediction head baseline so that well-designed post-processing methods can additionally be used to improve the performance of LightFC.
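For clarity, the sketch below shows a generic way to decode a single box from center-head-style outputs (peak of the response map, plus the offset and size read at that peak); it follows the general center-head recipe rather than the exact implementation of [7] or [35]:

```python
import torch

def decode_center_head(score_map, offset_map, size_map):
    """Decode one box per image from center-head style outputs: take the peak of
    the response (score) map, then read the offset and size at that location.
    score_map: (B, 1, H, W), offset_map: (B, 2, H, W), size_map: (B, 2, H, W)."""
    b, _, h, w = score_map.shape
    conf, idx = score_map.flatten(2).max(dim=2)                    # peak response, (B, 1)
    ys = torch.div(idx, w, rounding_mode='floor').float()
    xs = (idx % w).float()
    gather_idx = idx.unsqueeze(1).expand(-1, 2, -1)                # (B, 2, 1)
    off = offset_map.flatten(2).gather(2, gather_idx).squeeze(2)   # (dx, dy) at the peak
    size = size_map.flatten(2).gather(2, gather_idx).squeeze(2)    # (bw, bh) at the peak
    cx = (xs.squeeze(1) + off[:, 0]) / w                           # normalized box center
    cy = (ys.squeeze(1) + off[:, 1]) / h
    return torch.stack([cx, cy, size[:, 0], size[:, 1]], dim=1), conf

boxes, conf = decode_center_head(torch.rand(2, 1, 16, 16),
                                 torch.rand(2, 2, 16, 16),
                                 torch.rand(2, 2, 16, 16))
print(boxes.shape, conf.shape)                                     # torch.Size([2, 4]) torch.Size([2, 1])
```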
The standard center-head [7] has three branches with the same structure. Each branch contains five Conv-BN-ReLU (CBR) blocks, denoted $\mathrm{CBR}_1$ to $\mathrm{CBR}_5$. Different from a plain Conv-BN block, a CBR block has a nonlinear activation function. $\mathrm{CBR}_1$ increases the feature dimension from the input channel to $C_{head}$, where $C_{head}$ is a hyper-parameter set as in Stark [6]. Each subsequent block reduces the feature channels by half, and the last block of each branch outputs a response map. The center-head [7] has a feature representation bottleneck between $\mathrm{CBR}_1$ and $\mathrm{CBR}_2$, which limits the transmission of key information in the network.
To optimize the feature flow bottleneck in the center-head [7], inspired by RepVGG [19], the impact of various reparameterization kernels on the center-head is investigated in detail and an Efficient Rep-Center Head is proposed. In particular, a Rep-Conv-Branch (RCB) is designed to replace the standard branch of the center-head. Specifically, a RepN33 block (illustrated in Fig. 8) is introduced to replace $\mathrm{CBR}_1$. During training, RepN33 is composed of two 3x3 Conv-BN branches with different parameters; during inference, they are merged into a single block with the same structure as $\mathrm{CBR}_1$. In addition, an SE [12] module is added between $\mathrm{CBR}_1$ and $\mathrm{CBR}_2$ to further enhance the expression of key information. Using reparameterization techniques on the second stage reduces model performance; therefore, its standard CBR blocks are preserved. Compared to the conventional branch of the center-head [7], RCB widens the bottleneck of essential feature expression. The function of RCB can be written as follows:
$F_{RCB} = \phi_{SE}\big(\sigma(f^{1}_{3\times3}(F_{ECM}) + f^{2}_{3\times3}(F_{ECM}))\big)$ (6)
$F_{RCB} = \phi_{SE}\big(\sigma(\mathrm{Rep}(f^{1}_{3\times3}, f^{2}_{3\times3})(F_{ECM}))\big)$ (7)
where $f^{1}_{3\times3}$ and $f^{2}_{3\times3}$ denote the two Conv-BN branches of RepN33, $\sigma$ denotes the ReLU activation, Eq. (6) is the training-time form, Eq. (7) is the inference-time form, and $\mathrm{Rep}(\cdot)$ denotes the reparameterization transformation.
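A sketch of one ERH branch, assembled from Eqs. (6)-(7) as reconstructed above, is given below. The exact channel schedule and block count are assumptions; the point is the training-time two-branch RepN33, the SE module between the two stages, and the plain CBR blocks that follow:

```python
import torch
import torch.nn as nn
from torchvision.ops import SqueezeExcitation

def cbr(c_in, c_out):
    """Plain Conv-BN-ReLU block."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1, bias=False),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class RepCenterBranch(nn.Module):
    """One head branch: training-time RepN33 (two parallel 3x3 Conv-BN paths that
    can be merged into one kernel at inference), SE between the stages, then
    standard CBR blocks that halve the channels down to the output map."""
    def __init__(self, c_in=256, c_head=256, out_channels=1):
        super().__init__()
        self.branch_a = nn.Sequential(nn.Conv2d(c_in, c_head, 3, padding=1, bias=False),
                                      nn.BatchNorm2d(c_head))
        self.branch_b = nn.Sequential(nn.Conv2d(c_in, c_head, 3, padding=1, bias=False),
                                      nn.BatchNorm2d(c_head))
        self.act = nn.ReLU(inplace=True)
        self.se = SqueezeExcitation(c_head, c_head // 4)
        self.stage2 = nn.Sequential(cbr(c_head, c_head // 2),
                                    cbr(c_head // 2, c_head // 4),
                                    cbr(c_head // 4, c_head // 8))
        self.out = nn.Conv2d(c_head // 8, out_channels, 3, padding=1)

    def forward(self, x):
        x = self.act(self.branch_a(x) + self.branch_b(x))          # RepN33, Eq. (6)
        return self.out(self.stage2(self.se(x)))

print(RepCenterBranch()(torch.randn(2, 256, 16, 16)).shape)        # torch.Size([2, 1, 16, 16])
```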
4 Experiments
4.1 Implementation Details
LightFC is developed on Ubuntu 20.04 using Python 3.9 and PyTorch 1.13.0. The tracker is trained with 2 NVIDIA RTX A6000 GPUs for 400 epochs, which took roughly 57 hours. The training set includes LaSOT [3], TrackingNet [10], GOT10k [44] and COCO [45]. Each epoch samples 60,000 image pairs to train the model, with a total batch size of 64. LightFC uses the AdamW optimizer with a learning rate of 0.0001 for the backbone, 0.001 for the other parameters, and a weight decay of 0.0001. The learning rate decreases to 0.0001 after the 160th epoch. Following the training paradigm of [6][35], only brightness jitter, grayscale and random horizontal flip data augmentations are employed. The training settings of LightFC-vit are the same as those of LightFC.
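A hedged sketch of this two-rate optimizer setup is shown below; it assumes the backbone parameters are exposed under a 'backbone' prefix and interprets the decay at epoch 160 as a factor-of-10 step:

```python
import torch

def build_optimizer_and_scheduler(model: torch.nn.Module):
    # Split parameters by (assumed) name prefix: backbone at 1e-4, the rest at 1e-3.
    backbone_params = [p for n, p in model.named_parameters() if n.startswith('backbone')]
    other_params = [p for n, p in model.named_parameters() if not n.startswith('backbone')]
    optimizer = torch.optim.AdamW(
        [{'params': backbone_params, 'lr': 1e-4},
         {'params': other_params, 'lr': 1e-3}],
        weight_decay=1e-4)
    # Multiply both rates by 0.1 after epoch 160 (assumed decay factor of 10).
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[160], gamma=0.1)
    return optimizer, scheduler
```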
This paper follows the one-pass evaluation (OPE) protocol proposed by OTB100 [46] to test the trackers on the benchmarks LaSOT [3], TrackingNet [10], TNL2K [11], OTB100 [46], UAV123 [47], TC128 [48], UOT100 [49] and UTB180 [50]. Success (AUC), precision (P) and norm-precision (PNORM) are employed to evaluate the performance of the trackers on these benchmarks. On the VOT short-term 2020 benchmark [51], this paper uses the vot-toolkit package to evaluate the trackers, and expected average overlap (EAO), accuracy (A), and robustness (R) are employed to evaluate the performance.
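For reference, the sketch below computes the success (AUC) and precision metrics from per-frame IoUs and center errors under their standard OPE definitions; the thresholds and array names are illustrative:

```python
import numpy as np

def success_auc(ious: np.ndarray, n_thresholds: int = 21) -> float:
    """Success/AUC: mean fraction of frames whose IoU exceeds thresholds in [0, 1]."""
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    return float(np.mean([(ious > t).mean() for t in thresholds]))

def precision_at(center_errors: np.ndarray, threshold_px: float = 20.0) -> float:
    """Precision: fraction of frames whose center error is within 20 pixels."""
    return float((center_errors <= threshold_px).mean())

ious = np.random.rand(500)               # per-frame IoU of one sequence (dummy data)
errs = np.random.rand(500) * 50          # per-frame center error in pixels (dummy data)
print(success_auc(ious), precision_at(errs))
```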
4.2 Results and Comparisons
4.2.1 LaSOT
LaSOT [3] is a large-scale benchmark for visual single object tracking with 1400 long sequences covering 14 tracking challenge attributes. Following Protocol II of LaSOT [3], this paper evaluates the performance of LightFC and LightFC-vit on the LaSOT [3] test subset of 280 sequences. The results are shown in Table 1. LightFC achieves state-of-the-art performance. Compared to other lightweight conv-based trackers, LightFC outperforms FEAR-L [2] by 2.6 % in AUC and 2.3 % in precision. Compared to attention-based lightweight trackers, LightFC achieves higher precision and outperforms the best attention-based tracker, MixFormerV2-S [4], by 2.8 %. LightFC-vit further outperforms MixFormerV2-S [4] by 2.9 % in AUC and 7.6 % in precision. The results show that LightFC-vit and LightFC achieve competitive tracking performance in both success rate and precision.
In addition, Fig. 5 shows the AUC scores of nine trackers (Stark [6], TrDiMP [52], LightFC, LightFC-vit, LightTrack [1], TransT [33], DiMP [53], E.T.Track [5], and ATOM [54]) on the 14 attributes of LaSOT [3]. These attributes are IV: illumination variation; POC: partial occlusion; DEF: deformation; MB: motion blur; CM: camera motion; ROT: rotation; BC: background clutter; VC: view change; SV: scale variation; FOC: full occlusion; FM: fast motion; OV: out-of-view; LR: low resolution; ARC: aspect ratio change. Compared to LightTrack [1] and E.T.Track [5], LightFC and LightFC-vit perform competitively across all 14 attributes. In particular, under FM they exhibit better adaptability than E.T.Track [5], and their stronger feature representation helps them perform better on attributes such as DEF and SV. The robustness of LightFC and LightFC-vit still needs to be strengthened, since their success rates on the BC, FM, MB, and LR attributes are insufficient. In addition, they shorten the performance gap between lightweight trackers and advanced large-scale trackers.
LaSOT | TrackingNet | TNL2K | ||||||||
Source | AUC | P-Norm | P | AUC | P-Norm | P | AUC | P-Norm | P | |
LightTrack-Mobile[1] | CVPR2021 | 53.8 | - | 53.7 | 72.5 | 77.9 | 69.5 | - | - | - |
LightTrack-LargeB[1] | CVPR2021 | 55.5 | - | 56.1 | 73.3 | 78.9 | 70.8 | - | - | - |
STARK-Lightning[6] | ICCV2021 | 58.6 | 69.0 | 57.9 | - | - | - | - | - | - |
FEAR-L[2] | ECCV2022 | 57.9 | 68.6 | 60.9 | - | - | - | - | - | - |
FEAR-XS[2] | ECCV2022 | 53.5 | 64.1 | 54.5 | - | - | - | - | - | - |
E.T.Track[5] | WACV2023 | 59.1 | - | - | 74.5 | 80.3 | 70.6 | - | - | - |
MixFormerV2-S[4] | Arxiv2023 | 60.6 | 69.9 | 60.4 | 75.8 | 81.1 | 70.4 | 47.2 | - | 41.8 |
LightFC | - | 60.5 | 70.2 | 63.2 | 77.0 | 83.1 | 74.1 | 49.8 | 66.2 | 48.3 |
LightFC-vit | - | 63.5 | 74.3 | 68.0 | 77.8 | 84.4 | 75.3 | 51.4 | 68.8 | 51.2 |

4.2.2 TrackingNet
TrackingNet [10] is a large-scale short-term benchmark for single object tracking that contains 511 sequences in its test subset. The trackers are evaluated on TrackingNet [10] and the results are reported in Table 1. LightFC outperforms all other lightweight trackers. The AUC, norm-precision and precision of LightFC are 1.2 %, 2 % and 3.7 % higher than those of MixFormerV2-S [4], while using 5x fewer parameters (3.16 vs. 16.04 M) and 4.6x fewer Flops (0.95 vs. 4.4 G). The AUC, norm-precision and precision of LightFC-vit are 2 %, 3.3 % and 4.9 % higher than those of MixFormerV2-S [4], while using 2.9x fewer parameters (5.5 vs. 16.04 M) and 1.7x fewer Flops (2.48 vs. 4.4 G). The experimental results demonstrate that LightFC and LightFC-vit outperform the other four lightweight trackers and achieve the optimal balance between performance, parameters, Flops and FPS.
4.2.3 TNL2K
TNL2K [11] is a large-scale, high-quality benchmark for single object tracking with 700 sequences for testing. The results are reported in Table 1. Compared to MixFormerV2-S [4], LightFC improves the AUC and precision by 2.6 % and 6.5 %, respectively, and LightFC-vit improves them by 4.2 % and 9.4 %, respectively. LightFC sets a new state-of-the-art result for lightweight trackers on the TNL2K [11] benchmark.
4.2.4 OTB100
OTB100 [46] is a classic object tracking benchmark with 100 short-term sequences. The AUC results are reported in Table 2. LightFC and LightFC-vit achieve AUC scores of 68.5 % and 70.9 %, outperforming E.T.Track [5] by 0.7 % and 3.1 %, respectively, and LightFC outperforms all previous lightweight trackers. Compared to non-lightweight trackers, LightFC performs comparably to DiMP [53] while requiring 8x fewer parameters (3.2 vs. 26.1 M), and LightFC-vit achieves an AUC score 1.5 % higher than TransT [33] with 3x fewer parameters (5.5 vs. 18.5 M) and 6x fewer Flops (2.5 vs. 16.8 G).
non-lightweight | lightweight | ||||||||||
ATOM [54] | DiMP [53] | SiamRCNN [55] | Stark [6] | KeepTrack [56] | TransT [33] | LightTrack [1] | E.T.Track [5] | MixFormerV2-S [4] | LightFC | LightFC-vit | |
---|---|---|---|---|---|---|---|---|---|---|---|
OTB100 | 66.9 | 68.4 | 70.1 | 67.3 | 70.9 | 69.4 | 66.2 | 67.8 | - | 68.5 | 70.9 |
UAV123 | 64.2 | 65.3 | 64.9 | 68.8 | 69.7 | 69.1 | 62.5 | 62.3 | 65.1 | 64.8 | 64.9 |
TC128 | 59.5 | 60.8 | - | 62.5 | - | 59.6 | 55.0 | 57.1 | - | 61.0 | 63.5 |
4.2.5 UAV123
UAV123 [47] is a classic benchmark for visual object tracking from aerial viewpoints and contains 123 sequences in total. The AUC results are reported in Table 2. LightFC and LightFC-vit surpass LightTrack [1] by 2.3 % and 2.4 %, respectively, and perform comparably to MixFormerV2-S [4] and SiamRCNN [55]. Overall, LightFC and LightFC-vit achieve competitive performance in UAV tracking tasks.
4.2.6 TC128
TC128 [48] is a classic visual tracking benchmark that focuses on color information. It has 128 color sequences. As shown in Table 2, compared to LightTrack [1] and E.T.Track [5], LightFC improves the AUC by 6.0 % and 3.9 %, respectively. LightFC performs comparably to DiMP [53]. LightFC-vit achieves better AUC, being 1.0 % higher than Stark-ST50 [6]. LightFC and LightFC-vit demonstrate better adaptability and higher performance in color information challenges.
4.2.7 UOT100
UOT100 [49] is a typical underwater visual object tracking benchmark and contains 100 short-term underwater image sequences, which mainly address the challenge of underwater image distortion in different scenes. Since there are no official UOT100 [49] results for lightweight trackers, this paper compares LightFC with non-lightweight trackers. Table 3 shows that LightFC achieves a better AUC (61.6 %) and precision (52.2 %), both of which are 1.0 % higher than those of KeepTrack [56]. LightFC-vit performs comparably to TransT [33], and its precision score is 2.4 % higher than that of KeepTrack [56]. Under the challenge of underwater image degradation, LightFC and LightFC-vit achieve competitive performance with far fewer parameters and Flops.
SiamFC [24] | SiamRPN [25] | SiamCAR [29] | DiMP [53] | Stark [6] | KeepTrack [56] | TransT [33] | LightFC | LightFC-vit | |
---|---|---|---|---|---|---|---|---|---|
AUC | 43.8 | 59.7 | 53.6 | 59.8 | 66.3 | 60.6 | 63.8 | 61.6 | 63.9 |
P-Norm | 53.4 | 74.8 | 69.4 | 75.4 | 81.6 | 78.1 | 79.9 | 75.8 | 78.2 |
P | 30.4 | 48.7 | 46.0 | 48.9 | 57.9 | 51.2 | 56.3 | 52.2 | 53.6 |

Fig. 6 presents the AUC scores of the six best trackers in Table 3 on six color distortion attributes of UOT100 [49]. LightFC performs competitively under dark and cyan color distortion, but its performance decreases under yellow distortion. LightFC-vit significantly improves LightFC's performance under dark, blue, and yellow color distortion, although it still falls short under yellow and cyan distortion.
4.2.8 UTB180
UTB180 [50] is a high-quality underwater object tracking benchmark with 180 short-term sequences. It reflects the challenges of tracking underwater targets, such as scale variation, occlusion, and similar objects. For the same reason as for UOT100 [49], this paper compares LightFC with non-lightweight trackers. As shown in Table 4, compared to KeepTrack [56], LightFC improves the AUC by 0.3 % and the precision by 1.1 %. Compared to TransT [33], LightFC-vit improves the AUC by 1.4 % and the norm-precision by 3.5 %. In typical underwater tracking tasks, LightFC also achieves competitive performance with far fewer parameters and Flops.
SiamFC [24] | SiamRPN [25] | SiamCAR [29] | DiMP [53] | Stark [6] | KeepTrack [56] | TransT [33] | LightFC | LightFC-vit | |
---|---|---|---|---|---|---|---|---|---|
AUC | 35.0 | 53.4 | 49.8 | 50.5 | 55.9 | 54.6 | 57.5 | 54.9 | 58.9 |
P-Norm | 41.2 | 63.5 | 60.1 | 58.5 | 64.6 | 64.1 | 66.1 | 63.7 | 69.6 |
P | 22.8 | 41.9 | 40.8 | 37.5 | 48.5 | 43.5 | 50.3 | 44.6 | 50.0 |

Fig. 7 presents the AUC score of the trackers of the best six from Table 4 for 8 attributes of UTB180 [50]. These attributes are SV: scale variation; OV: out-of-view; OC: occlusion (both partial occlusion and full occlusion); DA: deformation; LR: low resolution; FM: fast motion; MB: motion blur; SM: similar object. Both LightFC and LightFC-vit have insufficient success rate in low resolution and motion blur attributes. This indicates that they have shortcomings in discriminating low-frequency information in underwater images.
4.2.9 VOT20-ST
The VOT-ST2020 [51] benchmark has 60 challenging sequences. EAO, A and R are used to evaluate the performance of the trackers. Table 5 shows that LightFC performs better than existing lightweight trackers, surpassing LightTrack [1] and MixFormerV2-S [4] by 3.1 % and 1.5 % in terms of EAO, respectively. LightFC-vit further outperforms the other lightweight trackers with 0.292 EAO, 0.466 A, and 0.768 R. Overall, LightFC and LightFC-vit achieve state-of-the-art performance among lightweight trackers on the VOT-ST2020 challenge.
Non-lightweight | lightweight | |||||||||
SiamFC [24] | ATOM [54] | DiMP [53] | Stark-S50 [6] | Stark-ST50 [6] | LightTrack [1] | E.T.Track [5] | MixFormerV2-S [4] | LightFC | LightFC-vit |
---|---|---|---|---|---|---|---|---|---|---|
EAO | 0.179 | 0.271 | 0.274 | 0.280 | 0.308 | 0.242 | 0.267 | 0.258 | 0.273 | 0.292 |
A | 0.418 | 0.462 | 0.457 | 0.477 | 0.478 | 0.422 | 0.432 | - | 0.459 | 0.466 |
R | 0.502 | 0.734 | 0.74 | 0.728 | 0.799 | 0.689 | 0.741 | - | 0.723 | 0.768 |
4.2.10 Speed
Table 6 presents a comparison of the parameters, Flops, and FPS of LightFC and other lightweight trackers on different CPU and GPU configurations. The results show that LightFC achieves state-of-the-art speeds of 72 FPS, 55 FPS and 129 FPS on the Intel I9-12900KF CPU, the AMD Ryzen7 4800H CPU and the GTX 1050Ti GPU, respectively, and a competitive 284 FPS on the RTX 3090Ti GPU. Moreover, LightFC adapts well to different devices, outperforming the other trackers in this respect. For instance, MixFormerV2-S [4] reaches the highest speed (584 FPS) on the RTX 3090Ti GPU, but its speed drops sharply on the other devices; this problem is also evident in FEAR-XS [2] and E.T.Track [5]. LightTrack [1] is the fastest (95 FPS) on the I9-12900KF CPU, while its speed on GPUs is not high. Only LightFC maintains consistent speed behavior across these devices. Although LightFC-vit runs slower than LightFC, it still achieves competitive speed on GPUs.
FPS | ||||||
Tracker | Param /M | Flops /G | CPU | GPU | ||
Intel I9-12900KF | AMD R7 4800H | RTX 3090Ti | GTX 1050Ti | |||
LightTrack-Mobile[1] | 1.97 | 0.54 | 95 | 19 | 119 | 61 |
FEAR-XS [2] | 1.37 | 0.48 | 40 | 33 | 323 | 117 |
E.T.Track [5] | 6.98 | 1.56 | 20 | 16 | 80 | 39 |
MixFormerV2-S [4] | 16.04 | 4.40 | 32 | 20 | 584 | 126 |
LightFC | 3.16 | 0.95 | 72 | 55 | 284 | 129 |
LightFC-vit | 5.50 | 2.48 | 34 | 10 | 202 | 98 |
4.3 Ablation and Analysis
To investigate the individual influence of each component and determine the optimal module design, this work conducts several ablation experiments on LightFC and reports the AUC, precision and norm-precision on LaSOT [3] and TNL2K [11].
4.3.1 Backbone
The performance of various efficient backbone networks is evaluated in the tracking pipeline and the results are reported in Table 7. The output feature stride of each backbone is set to 16. After a thorough comparison, this work selects MobileNetV2 [9] and TinyViT [8] as the backbones of LightFC and LightFC-vit, respectively, due to their superior performance.
LaSOT | TNL2K | |||||
AUC | P-Norm | P | AUC | P-Norm | P | |
ResNet18[38] | 60.0 | 69.5 | 62.2 | 48.0 | 63.6 | 46.3 |
ShuffleNetV2 [15] | 57.8 | 66.9 | 59.4 | 47.5 | 63.1 | 44.8 |
MobileNetV2[9] | 60.4 | 70.0 | 63.1 | 49.8 | 66.2 | 48.3 |
MobileNetV3[17] | 56.6 | 65.2 | 57.0 | 46.1 | 61.1 | 41.9 |
LT-Mobile[1] | 58.9 | 67.7 | 61.2 | 48.4 | 63.3 | 45.2 |
EfficientNet [18] | 59.3 | 69.2 | 62.0 | 49.4 | 66.5 | 47.9 |
LCNet[58] | 58.5 | 68.7 | 61.1 | 48.3 | 64.9 | 45.9 |
MobileOne[20] | 51.5 | 60.8 | 50.7 | 43.4 | 58.0 | 37.4 |
TinyViT[8] | 63.5 | 74.3 | 68.0 | 51.4 | 68.8 | 51.2 |
LightViT[59] | 59.6 | 69.0 | 62.2 | 49.4 | 64.9 | 47.5 |
MobileFormer [60] | 56.3 | 65.0 | 56.9 | 46.0 | 61.2 | 42.6 |
MiniDeiT [61] | 60.8 | 70.3 | 63.3 | 49.8 | 65.7 | 47.9 |
4.3.2 ECM
This work evaluates the contributions of the three factors presented in Section 3.2 to ECM and explores the function of each factor.
Firstly, this paper evaluates the impact of incorporating nonlinear blocks into the ECM. The results are presented in Table 8. The SCF unit and the IAB together contribute improvements of 2.1 % and 1.5 % in AUC and 2.4 % and 1.8 % in norm-precision on LaSOT [3] and TNL2K [11], respectively. Overall, improving the feature representation and enhancing the model's nonlinearity effectively improve the performance of ECM.
SCF | IAB | LaSOT | TNL2K | |||||
AUC | P-Norm | P | AUC | P-Norm | P | |||
✓ | ✓ | 60.5 | 70.2 | 63.2 | 49.8 | 66.2 | 48.3 | |
nonlinear blocks | ✓ | 58.8 | 68.2 | 61.4 | 49.1 | 65.3 | 47.6 | |
✓ | 59.2 | 68.8 | 61.9 | 49.2 | 65.4 | 47.8 | ||
58.4 | 67.8 | 61.0 | 48.3 | 64.4 | 46.7 |
Analyzing the performance change before and after adding the nonlinear blocks in Table 8 makes it clear that using only the feature fusion module of LightTrack [1] and E.T.Track [5] (pixel-wise correlation plus the SE [12] module) results in insufficient expressiveness of the fused features. SCF fuses the local spatial features and channel information in each layer of the fused features, and the inverted activation of the features in IAB effectively enhances the model's nonlinearity.
Secondly, this paper evaluates the impact of the different skip-connections on tracker performance. As shown in Table 9, the skip-connections in both SCF and IAB are critical for ECM: they improve the performance by 2.2 % and 1.0 % in AUC and by 2.3 % and 1.6 % in precision, respectively.
SCF | IAB | LaSOT | TNL2K | |||||
AUC | P-Norm | P | AUC | P-Norm | P | |||
✓ | ✓ | 60.5 | 70.2 | 63.2 | 49.8 | 66.2 | 48.3 | |
skip-connection | ✓ | 59.5 | 69.3 | 62.4 | 49.5 | 66.0 | 48.0 | |
✓ | 58.6 | 68.1 | 61.1 | 49.4 | 65.5 | 47.7 | ||
58.3 | 67.7 | 60.9 | 48.8 | 64.9 | 47.3 |
Thirdly, the prediction head is divided into two branches: CLS (the classification branch of ERH) and BOX (the offset and size branches). The impact of reusing search area features on performance is explored for both branches. The results in Table 10 indicate that reusing the search area features leads to performance improvements in each branch. Reusing features in both the CLS and BOX branches results in combined improvements of 2.3 % and 1.6 % in AUC and 2.5 % and 2.4 % in norm-precision on LaSOT [3] and TNL2K [11], respectively.
CLS | BOX | LaSOT | TNL2K | |||||
AUC | P-Norm | P | AUC | P-Norm | P | |||
concatenate | ✓ | ✓ | 60.5 | 70.2 | 63.2 | 49.8 | 66.2 | 48.3 |
✓ | 59.1 | 68.8 | 61.8 | 48.6 | 64.6 | 46.5 | |
✓ | 59.7 | 69.2 | 62.4 | 49.4 | 65.5 | 47.9 | ||
58.2 | 67.7 | 60.8 | 48.2 | 63.8 | 46.4 |
To better reuse the search area features, this work compares the concatenation operation against an element-wise addition operation. Table 11 shows that concatenation performs better than addition.
LaSOT | TNL2K | |||||
AUC | P-Norm | P | AUC | P-Norm | P | |
concat | 60.5 | 70.2 | 63.2 | 49.8 | 66.2 | 48.3 |
add | 59.3 | 68.7 | 61.6 | 48.6 | 64.0 | 46.2 |
4.3.3 ERH
The center-head [7] branch is divided into two stages: the first stage consists of $\mathrm{CBR}_1$, while the second stage consists of the remaining CBR blocks. The impact of reparameterization on the two stages and the characteristics of the feature flow are evaluated. The reparameterization models are shown in Fig. 8. Table 12 reports the results, and the phenomenon can be summarized in two key aspects.

1. Using the reparameterization technique in stage 1 effectively improves model performance, while using it in stage 2 decreases model performance.
2. The RepN33 block always performs better than the RepN31 block.
For the first point, RepN33 performs better than CBR33 in the first stage, which demonstrates the feature representation bottleneck of stage 1 of the standard center-head [7] branch. Therefore, using two 3x3 convolutional kernels to reparameterize and enrich the expression of new features is identified as a key factor in the success of ERH. Additionally, during the second stage of feature dimension reduction, the convolutional kernels focus on compressing and extracting crucial information. Redundant convolutional kernels introduced by reparameterization may reintroduce interference information that has already been filtered out, which can affect the model's comprehension of significant features.
Stage1 | Stage2 | LaSOT | TNL2K | |||||
AUC | P-Norm | P | AUC | P-Norm | P | |||
RepN33 | RepN33 | 59.7 | 69.1 | 62.2 | 49.3 | 65.7 | 47.6 | |
RepN33 | RepN31 | 59.2 | 68.6 | 61.5 | 49.5 | 65.7 | 48.3 | |
RepN31 | RepN33 | 58.5 | 67.9 | 61.1 | 49.6 | 66.1 | 48.6 | |
RepN31 | RepN31 | 58.8 | 68.0 | 61.0 | 49.5 | 65.5 | 48.1 | |
RepN-Conv | CBR33 | RepN33 | 58.5 | 67.7 | 60.3 | 48.6 | 64.3 | 46.3 |
CBR33 | RepN31 | 58.2 | 67.3 | 59.9 | 48.6 | 64.1 | 46.7 | |
CBR33 | CBR33 | 59.6 | 69.2 | 62.1 | 49.3 | 65.1 | 46.8 | |
RepN33 | CBR33 | 60.5 | 70.2 | 63.2 | 49.8 | 66.2 | 48.3 | |
RepN31 | CBR33 | 59.2 | 68.4 | 61.5 | 49.7 | 65.9 | 48.1 |
Stage1 | - | Stage2 | LaSOT | TNL2K | |||||
RepN33 | SE | Conv | AUC | P-Norm | P | AUC | P-Norm | P | |
ERH | ✓ | ✓ | ✓ | 60.5 | 70.2 | 63.2 | 49.8 | 66.2 | 48.3 |
✓ | ✓ | 59.6 | 69.2 | 61.9 | 49.4 | 65.3 | 47.5 | ||
✓ | ✓ | 59.5 | 69.0 | 61.9 | 49.7 | 65.9 | 48.6 | ||
✓ | 59.3 | 68.7 | 61.3 | 49.1 | 64.9 | 46.8 |
For the second point, the 1x1 convolutional kernel mostly learns channel features within a single spatial point, while the 3x3 convolutional kernel can learn local spatial features in addition to channel features. This may indicate that the center-head [7] requires a stronger representation of spatial features rather than channel features.
Then the contribution of each component of ERH is measured. Table 13 presents the results of the ablation experiment for ERH. Both the RepN33 block and the SE [12] module bring performance improvements. On LaSOT [3] and TNL2K [11], ERH surpasses the center-head [7] by 1.2 % and 0.7 % in AUC and by 1.9 % and 1.5 % in precision, respectively.
4.3.4 IOU
To fully optimize the tracking pipeline, this paper evaluates several IoU loss functions and presents the results in Table 14. WIoU [43] achieves an average improvement of 0.2 % in both AUC and precision over the widely used GIoU [62] on LightFC, outperforming all other IoU loss functions. Consequently, WIoU [43] is selected as the IoU loss function of the trackers.
4.4 Hyperparameter Analysis
4.4.1 Input Size
Table 15 reports the model's parameters, Flops, and performance for different input sizes. The performance of LightFC192 and LightFC224 degrades because smaller input sizes reduce the spatial resolution of the feature maps and thus limit the model's ability to represent key features. However, LightFC384 and LightFC448 also show performance degradation. Although a larger input image increases the spatial feature resolution, it dilutes the key features contained in each patch of the feature map. In pixel-wise correlation, each patch of the template feature interacts globally with the search area feature, so fewer key features per patch weaken the representation of the fused features and make them more susceptible to interference and inaccurate responses. This may account for the performance drop when the input size is increased from 256 to 384. However, when the input size is increased to 448, LightFC448 outperforms LightFC384, suggesting that while simply increasing the input size does not necessarily improve performance, a higher feature resolution may help.
Input Size of Search Area | Params /M | Flops /G | LaSOT | TNL2K | ||||
AUC | P-Norm | P | AUC | P-Norm | P | |||
192 | 3.15 | 0.53 | 55.7 | 65.7 | 56.5 | 45.3 | 60.7 | 41.4
224 | 3.15 | 0.72 | 58.1 | 68.1 | 59.7 | 46.9 | 62.7 | 43.8
256 | 3.16 | 0.95 | 60.4 | 70.2 | 63.1 | 49.8 | 66.2 | 48.3
384 | 3.28 | 2.18 | 58.3 | 66.6 | 60.3 | 47.9 | 62.3 | 45.7
448 | 3.39 | 3.03 | 59.7 | 67.1 | 61.4 | 49.7 | 64.0 | 47.4
4.4.2 Channel Expansion Rate of IAB
To improve the feature representation, this paper analyzes the impact of the channel expansion rate of IAB on model performance. Table 16 shows that both smaller and larger channel expansion rates in IAB impair the performance of the model. Overall, $r = 2$ is the optimal channel expansion rate.
Channel expansion rate of IAB | LaSOT | TNL2K | ||||
AUC | PNORM | P | AUC | PNORM | P | |
0.5 | 57.6 | 70.0 | 60.3 | 47.1 | 62.4 | 44.3 |
1 | 58.3 | 68.0 | 60.7 | 47.7 | 63.3 | 45.1 |
2 | 60.4 | 70.2 | 63.1 | 49.8 | 66.2 | 48.3 |
3 | 58.4 | 67.8 | 60.7 | 47.6 | 63.3 | 45.5 |
4 | 58.5 | 68.1 | 61.1 | 47.9 | 63.6 | 45.8 |
4.4.3 Output Feature Channel of ERH
To explore whether increasing the feature channels can alleviate the feature representation bottleneck, this paper conducts ablation experiments on the output feature channel $C_{head}$ of stage 1. As shown in Table 17, 256 is the optimal choice for $C_{head}$. Increasing the number of output feature channels alone does not improve the feature representation; on the contrary, it significantly reduces the performance of the model.
Output feature channels of stage 1 | LaSOT | TNL2K | ||||
AUC | PNORM | P | AUC | PNORM | P | |
192 | 57.0 | 66.3 | 59.3 | 46.2 | 61.7 | 43.6 |
256 | 60.4 | 70.2 | 63.1 | 49.8 | 66.2 | 48.3 |
384 | 58.7 | 68.4 | 61.2 | 68.4 | 63.6 | 45.4 |
448 | 57.0 | 66.0 | 59.1 | 47.5 | 63.1 | 45.2 |
512 | 56.1 | 65.4 | 58.5 | 47.0 | 62.6 | 44.9 |
4.5 Visualization
4.5.1 Heatmap
Fig. 9 shows the heat maps of the LightFC baseline, LightFC with only ECM, LightFC with only ERH, and the full LightFC. The heat maps visualize the output features of the feature fusion module. ECM and ERH each increase the model's focus on key features, and they jointly improve the model's feature representation. Overall, ECM and ERH effectively improve the feature representation of LightFC, proving the effectiveness of this work.

4.5.2 Tracking Results
Fig. 10 shows tracking results of LightFC, LightFC-vit and previous state-of-the-art lightweight trackers on four sequences of LaSOT [3]. In the giraffe-2 sequence, LightFC and LightFC-vit achieve more accurate bounding box predictions than the other lightweight trackers. In the goldfish-7 sequence, when the target is partially occluded, LightFC and LightFC-vit can still locate the target, while the other lightweight trackers fail. LightFC and LightFC-vit outperform the other lightweight trackers in the rabbit-10 sequence by overcoming challenges such as deformation, fast motion, and motion blur. In the sheep-3 sequence, LightFC and LightFC-vit show better adaptability to the challenges of similar objects and partial occlusion.

5 Conclusion
This paper proposes LightFC, a lightweight fully convolutional tracker. It abandons the heavy and expensive attention-based network structure to reduce parameters and Flops, and focuses on enhancing the model's nonlinearity to improve the performance of the convolutional tracking pipeline. Specifically, we propose an efficient cross-correlation module (ECM) and an efficient rep-center head (ERH). Comprehensive experiments show that: (1) It is reasonable to design the architecture of the ECM based on the commonality between pixel-wise correlation and self-attention in relationship modeling; the introduction of the SCF unit and the IAB effectively enhances the feature representation and nonlinearity, and incorporating reasonable skip-connections and reusing search area features further improve model performance. (2) The use of reparameterization techniques and the addition of a channel attention module effectively alleviate the feature representation bottleneck of the standard center-head. (3) Compared with current lightweight trackers, LightFC and LightFC-vit achieve state-of-the-art performance and the optimal balance between performance, parameters, Flops and FPS.
6 Acknowledgments
This research was funded by the National Natural Science Foundation of China, grant number 52371350, by the Natural Science Foundation of Hainan Province, grant number 2021JJLH0002, by the Foundation of National Key Laboratory, grant number JCKYS2022SXJQR-06 and 2021JCJQ-SYSJJ-LB06912.
References
- [1] Bin Yan, Houwen Peng, Kan Wu, Dong Wang, Jianlong Fu, and Huchuan Lu. Lighttrack: Finding lightweight neural networks for object tracking via one-shot architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180–15189, 2021.
- [2] Vasyl Borsuk, Roman Vei, Orest Kupyn, Tetiana Martyniuk, Igor Krashenyi, and Jiři Matas. Fear: Fast, efficient, accurate and robust visual tracker. In European Conference on Computer Vision, pages 644–663. Springer, 2022.
- [3] Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling. Lasot: A high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5374–5383, 2019.
- [4] Yutao Cui, Tianhui Song, Gangshan Wu, and Limin Wang. Mixformerv2: Efficient fully transformer tracking. arXiv preprint arXiv:2305.15896, 2023.
- [5] Philippe Blatter, Menelaos Kanakis, Martin Danelljan, and Luc Van Gool. Efficient visual tracking with exemplar transformers. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1571–1581, 2023.
- [6] Bin Yan, Houwen Peng, Jianlong Fu, Dong Wang, and Huchuan Lu. Learning spatio-temporal transformer for visual tracking. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10448–10457, 2021.
- [7] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. Centernet: Keypoint triplets for object detection. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 6568–6577, 2019.
- [8] Kan Wu, Jinnian Zhang, Houwen Peng, Mengchen Liu, Bin Xiao, Jianlong Fu, and Lu Yuan. Tinyvit: Fast pretraining distillation for small vision transformers. In European Conference on Computer Vision, pages 68–85. Springer, 2022.
- [9] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018.
- [10] Matthias Muller, Adel Bibi, Silvio Giancola, Salman Alsubaihi, and Bernard Ghanem. Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the European conference on computer vision (ECCV), pages 300–317, 2018.
- [11] Xiao Wang, Xiujun Shu, Zhipeng Zhang, Bo Jiang, Yaowei Wang, Yonghong Tian, and Feng Wu. Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13763–13773, 2021.
- [12] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
- [13] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- [14] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6848–6856, 2018.
- [15] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV), pages 116–131, 2018.
- [16] Jianyuan Guo, Kai Han, Han Wu, Yehui Tang, Xinghao Chen, Yunhe Wang, and Chang Xu. Cmt: Convolutional neural networks meet vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12175–12185, 2022.
- [17] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1314–1324, 2019.
- [18] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019.
- [19] Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, and Jian Sun. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13733–13742, 2021.
- [20] Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel, and Anurag Ranjan. Mobileone: An improved one millisecond mobile backbone. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7907–7917, 2023.
- [21] Ao Wang, Hui Chen, Zijia Lin, Hengjun Pu, and Guiguang Ding. Repvit: Revisiting mobile cnn from vit perspective. arXiv preprint arXiv:2307.09283, 2023.
- [22] Chuyi Li, Lulu Li, Hongliang Jiang, Kaiheng Weng, Yifei Geng, Liang Li, Zaidan Ke, Qingyuan Li, Meng Cheng, Weiqiang Nie, et al. Yolov6: A single-stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976, 2022.
- [23] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7464–7475, 2023.
- [24] Luca Bertinetto, Jack Valmadre, Joao F Henriques, Andrea Vedaldi, and Philip HS Torr. Fully-convolutional siamese networks for object tracking. In Computer Vision–ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part II 14, pages 850–865. Springer, 2016.
- [25] Bo Li, Junjie Yan, Wei Wu, Zheng Zhu, and Xiaolin Hu. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8971–8980, 2018.
- [26] Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4282–4291, 2019.
- [27] Yuechen Yu, Yilei Xiong, Weilin Huang, and Matthew R Scott. Deformable siamese attention networks for visual object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6728–6737, 2020.
- [28] Zedu Chen, Bineng Zhong, Guorong Li, Shengping Zhang, and Rongrong Ji. Siamese box adaptive network for visual tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6668–6677, 2020.
- [29] Dongyan Guo, Jun Wang, Ying Cui, Zhenhua Wang, and Shengyong Chen. Siamcar: Siamese fully convolutional classification and regression for visual tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6269–6277, 2020.
- [30] Bingbing Wei, Hongyu Chen, Qinghai Ding, and Haibo Luo. Siamoan: Siamese object-aware network for real-time target tracking. Neurocomputing, 471:161–174, 2022.
- [31] Bingbing Wei, Hongyu Chen, Qinghai Ding, and Haibo Luo. Siamagn: Siamese attention-guided network for visual tracking. Neurocomputing, 512:69–82, 2022.
- [32] Wencheng Han, Xingping Dong, Fahad Shahbaz Khan, Ling Shao, and Jianbing Shen. Learning to fuse asymmetric feature maps in siamese trackers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16570–16580, 2021.
- [33] Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, and Huchuan Lu. Transformer tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8126–8135, 2021.
- [34] Shenyuan Gao, Chunluan Zhou, Chao Ma, Xinggang Wang, and Junsong Yuan. Aiatrack: Attention in attention for transformer visual tracking. In European Conference on Computer Vision, pages 146–164. Springer, 2022.
- [35] Botao Ye, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. Joint feature learning and relation modeling for tracking: A one-stream framework. In European Conference on Computer Vision, pages 341–357. Springer, 2022.
- [36] Yutao Cui, Cheng Jiang, Limin Wang, and Gangshan Wu. Mixformer: End-to-end tracking with iterative mixed attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13608–13618, 2022.
- [37] Bin Yan, Xinyu Zhang, Dong Wang, Huchuan Lu, and Xiaoyun Yang. Alpha-refine: Boosting tracking performance by precise bounding box estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5289–5298, 2021.
- [38] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
- [39] Zhiqiang Yan, Kun Wang, Xiang Li, Zhenyu Zhang, Jun Li, and Jian Yang. Desnet: Decomposed scale-consistent network for unsupervised depth completion. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 3109–3117, 2023.
- [40] Yu-Hsi Chen, Chien-Yao Wang, Cheng-Yun Yang, Hung-Shuo Chang, Youn-Long Lin, Yung-Yu Chuang, and Hong-Yuan Mark Liao. Neighbortrack: Single object tracking by bipartite matching with neighbor tracklets and its applications to sports. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5138–5147, 2023.
- [41] Yunfeng Li, Bo Wang, Ye Li, Zhuoyan Liu, Wei Huo, Yueming Li, and Jian Cao. Underwater object tracker: Uostrack for marine organism grasping of underwater vehicles. Ocean Engineering, 285:115449, 2023.
- [42] Hei Law and Jia Deng. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European conference on computer vision (ECCV), pages 734–750, 2018.
- [43] Zanjia Tong, Yuhang Chen, Zewei Xu, and Rong Yu. Wise-iou: Bounding box regression loss with dynamic focusing mechanism. arXiv preprint arXiv:2301.10051, 2023.
- [44] Lianghua Huang, Xin Zhao, and Kaiqi Huang. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(5):1562–1577, 2019.
- [45] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
- [46] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Online object tracking: A benchmark. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2411–2418, 2013.
- [47] Matthias Mueller, Neil Smith, and Bernard Ghanem. A benchmark and simulator for uav tracking. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, pages 445–461. Springer, 2016.
- [48] Pengpeng Liang, Erik Blasch, and Haibin Ling. Encoding color information for visual tracking: Algorithms and benchmark. IEEE transactions on image processing, 24(12):5630–5644, 2015.
- [49] Karen Panetta, Landry Kezebou, Victor Oludare, and Sos Agaian. Comprehensive underwater object tracking benchmark dataset and underwater image enhancement with gan. IEEE Journal of Oceanic Engineering, 47(1):59–75, 2021.
- [50] Basit Alawode, Yuhang Guo, Mehnaz Ummar, Naoufel Werghi, Jorge Dias, Ajmal Mian, and Sajid Javed. Utb180: A high-quality benchmark for underwater tracking. In Proceedings of the Asian Conference on Computer Vision, pages 3326–3342, 2022.
- [51] Matej Kristan, Aleš Leonardis, Jiří Matas, Michael Felsberg, Roman Pflugfelder, Joni-Kristian Kämäräinen, Martin Danelljan, Luka Čehovin Zajc, Alan Lukežič, Ondrej Drbohlav, et al. The eighth visual object tracking vot2020 challenge results. In Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16, pages 547–601. Springer, 2020.
- [52] Ning Wang, Wengang Zhou, Jie Wang, and Houqiang Li. Transformer meets tracker: Exploiting temporal context for robust visual tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1571–1580, 2021.
- [53] Goutam Bhat, Martin Danelljan, Luc Van Gool, and Radu Timofte. Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6182–6191, 2019.
- [54] Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. Atom: Accurate tracking by overlap maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4660–4669, 2019.
- [55] Paul Voigtlaender, Jonathon Luiten, Philip HS Torr, and Bastian Leibe. Siam r-cnn: Visual tracking by re-detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6578–6588, 2020.
- [56] Christoph Mayer, Martin Danelljan, Danda Pani Paudel, and Luc Van Gool. Learning target candidate association to keep track of what not to track. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13444–13454, 2021.
- [57] Yunfeng Li, Wei Huo, Zhuoyan Liu, Bo Wang, and Ye Li. Ustark: underwater image domain-adaptive tracker based on stark. Journal of Electronic Imaging, 31(5):053012–053012, 2022.
- [58] Cheng Cui, Tingquan Gao, Shengyu Wei, Yuning Du, Ruoyu Guo, Shuilong Dong, Bin Lu, Ying Zhou, Xueying Lv, Qiwen Liu, et al. Pp-lcnet: A lightweight cpu convolutional neural network. arXiv preprint arXiv:2109.15099, 2021.
- [59] Tao Huang, Lang Huang, Shan You, Fei Wang, Chen Qian, and Chang Xu. Lightvit: Towards light-weight convolution-free vision transformers. arXiv preprint arXiv:2207.05557, 2022.
- [60] Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Xiaoyi Dong, Lu Yuan, and Zicheng Liu. Mobile-former: Bridging mobilenet and transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5270–5279, 2022.
- [61] Jinnian Zhang, Houwen Peng, Kan Wu, Mengchen Liu, Bin Xiao, Jianlong Fu, and Lu Yuan. Minivit: Compressing vision transformers with weight multiplexing. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12135–12144, 2022.
- [62] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 658–666, 2019.
- [63] Zhaohui Zheng, Ping Wang, Dongwei Ren, Wei Liu, Rongguang Ye, Qinghua Hu, and Wangmeng Zuo. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE transactions on cybernetics, 52(8):8574–8586, 2021.
- [64] Yi-Fan Zhang, Weiqiang Ren, Zhang Zhang, Zhen Jia, Liang Wang, and Tieniu Tan. Focal and efficient iou loss for accurate bounding box regression. Neurocomputing, 506:146–157, 2022.
- [65] Zhora Gevorgyan. Siou loss: More powerful learning for bounding box regression. arXiv preprint arXiv:2205.12740, 2022.