Spatio-Temporal Matching for Siamese Visual Tracking

Jinpu Zhang Yuehuan Wang
School of Artificial Intelligence and Automation,
Huazhong University of Science and Technology, Wuhan 430074, China
[email protected] [email protected]

Abstract

Similarity matching is a core operation in Siamese trackers. Most Siamese trackers carry out similarity learning via cross correlation that originates from the image matching field. However, unlike 2-D image matching, the matching network in object tracking requires 4-D information (height, width, channel and time). Cross correlation neglects the information from channel and time dimensions, and thus produces ambiguous matching. This paper proposes a spatio-temporal matching process to thoroughly explore the capability of 4-D matching in space (height, width and channel) and time. In spatial matching, we introduce a space-variant channel-guided correlation (SVC-Corr) to recalibrate channel-wise feature responses for each spatial location, which can guide the generation of the target-aware matching features. In temporal matching, we investigate the time-domain context relations of the target and the background and develop an aberrance repressed module (ARM). By restricting the abrupt alteration in the interframe response maps, our ARM can clearly suppress aberrances and thus enables more robust and accurate object tracking. Furthermore, a novel anchor-free tracking framework is presented to accommodate these innovations. Experiments on challenging benchmarks including OTB100, VOT2018, VOT2020, GOT-10k, and LaSOT demonstrate the state-of-the-art performance of the proposed method.

1 Introduction

Refer to caption — Figure 1: Comparison of DW-XCorr and STM. STM indicates spatio-temporal matching. The spatial responses are generated by averaging all channels of each pixel, the channel responses are generated by normalizing the maximum of each channel. The green box and red box represent groundtruth and prediction respectively. Our STM considering the matching of channel and time dimensions is more robust to distractors.

Visual object tracking is a fundamental task in computer vision. It aims to infer the location of a target in subsequent frames of the video based on the target state in the first frame. Object tracking has been widely adopted in the field of surveillance[52], robotics[17], autonomous driving[26] and human-computer interaction[36], etc. Despite its rapid progress in recent decades, challenges such as occlusion, deformation and background interference[50, 51] still must be overcome.

Recently, Siamese network[7] based trackers have attracted attention due to their favorable balance of speed and tracking performance. These trackers formulate the tracking problem as learning a general similarity metric between the feature embedding of the target template and the search region. Therefore, how to embed the information of the two branches to obtain informative response maps is particularly vital. The seminal work, SiamFC[2], utilizes cross correlation (XCorr) for similarity measurement. Furthermore, SiamRPN[29] extends up-channel cross correlation (UP-XCorr) to adapt to the RPN structure, however, this gives rise to an imbalanced parameter distribution and hinders training optimization. To address this issue, SiamRPN++[28] presents a lightweight depth-wise cross correlation (DW-XCorr) to achieve efficient information association. Almost all of the current Siamese trackers[54, 15, 6, 58] employ DW-XCorr for similarity matching. Some recent approaches[13, 55] of local pixel-wise correlation are also studied by researchers.

Despite the great success of the current cross correlation, it still has several limitations. First, cross correlation originates from image matching[1]. The core concept is to combine feature maps using the inner product and evaluate the similarity on each subwindow. Siamese network only provides an efficient way to implement this operation with convolution. It is essentially still a 2-D spatial matching and assigns the same weight to the matching results of each channel of deep features. In practice, the data-driven deep features are sparsely activated[30]. When applied to the tracking task, only a few channels are active in describing the target. The remaining large portion of channels that contain redundant and irrelevant information cannot benefit the matching. We calculate the normalized maximum values of each channel, as shown in Figure 1a. The channel response of DW-XCorr is dense since both the target and distractors are activated. This makes it difficult for the tracker to discriminate the desired target from distractors. The channel attention mechanism[20, 40, 49] is usually used for channel-wise feature calibration in the classification task. However, it squeezes global spatial information into a single channel descriptor and hence emphasizes the features of all semantic categories. This is not appropriate for the tracking task that needs to distinguish instances.

The second limitation of traditional correlation is the neglect of temporal context in tracking. Siamese trackers decompose the tracking problem into independent matching problems for each frame. Such approaches are prone to fail in case of e.g., occlusion, fast deformation or presence of distractors, where a single-frame matching model is insufficient for robust tracking. If we can utilize the previous frame and construct temporal constraint relations, these aberrances are easily repressed. To this end, traditional trackers[5, 19, 38] implement a simple linear update strategy for the template. However, the constant learning rate on all spatial dimensions across all frames is often inadequate to cope with changing updating requirements in different tracking scenarios[57]. Another approach is learning an online classifier[10, 3] from the tracking sequence. Such a model is more discriminative but at the cost of time-consuming iterative optimization.

1.1 Contributions

In view of the aforementioned issues, we leverage the 4-D information of height, width, channel and time for comprehensive matching, called Spatio-Temporal Matching.

In spatial matching, we increase the exploration of channel-wise matching and design a space-variant channel-guided correlation (SVC-Corr) that can emphasize target-aware responses and weaken interference responses. The innovation of SVC-Corr is two-fold: 1) It establishes various channel associations between the template and each subwindow in the search area. This allows an independent and accurate recalibration of the response to a specific instance at each location without semantic disturbance from other spatial locations. 2) It is a learnable correlation module that can benefit from large-scale offline training, rather than a handcrafted parameter-free metric such as cross correlation. Figure 1b shows the sparse channel responses using SVC-Corr. The generated target-aware responses are more effective in separating semantic objects than the DW-XCorr features.

In temporal matching, we propose an aberrance repressed module (ARM) to investigate the hidden information propagated in the interframe response maps. Specifically, an efficient optimization function is added to restrict the abrupt alteration of the interframe response maps in both training and testing stages. It takes into account the variation of target and background simultaneously in the time domain, and adjust the response map flexibly for different tracking scenarios. As shown in Figure 1b, the heatmap of frame 5 imposes a strong constraint on the subsequent heatmaps, thus avoiding the drifting. Importantly, our ARM brings negligible additional computation cost.

Using the introduced SVC-Corr and ARM, we present a novel anchor-free tracking framework, termed Siamese Spatio-Temporal Matching (SiamSTM) tracking network. The effectiveness of the framework is verified on five benchmarks: OTB100[51], VOT2018[24], VOT2020[23], GOT-10k[21] and LaSOT[14]. Our approach achieves leading performance, while running at 66 FPS.

2 Related Work

Siamese-based trackers have attracted wide attention due to their superior performance and speed. Similarity matching is one of their most crucial components. The pioneering method SiamFC[2] utilizes XCorr to obtain a single channel response map and the instance with the highest similarity score is considered as the target. XCorr efficiently realizes the similarity measurement of traditional image matching in the form of convolution. Following this similarity-learning work, CFNet[46] incorporates the Correlation Filter into the Siamese network for end-to-end training. DSiam[16] adds a fast transformation learning that enables online learning of target appearance variation and background suppression from previous frames. Dong[12] employs triplet loss[44] to take full advantage of the relationship among the exemplar, positive instance and negative instance instead of the simple pairwise relationship. Although these methods improve the matching ability to some extent, they still leverage XCorr for information embedding.

SiamRPN combines Siamese network with the region proposal network (RPN)[42], which discards the traditional multi-scale tests. To embed the information of anchors, XCorr is extended by adding a huge convolutional layer to scale the channels (UP-XCorr). However, the heavy up-channel module leads to a severe imbalance of the parameter distribution. SiamRPN++ presents a lightweight DW-XCorr to efficiently generate a multi-channel correlation response map. Thus, the parameters on the template and the search branches are balanced, making the training procedure more stable. Since RPN-based Siamese trackers are sensitive to hyper-parameters associated with the anchors, later works focus on anchor-free models, such as SiamFC++[54], SiamCAR[15], SiamBAN[6] and Ocean[58]. They directly predict classification and regression results in a per-pixel-prediction manner, and still use DW-XCorr to encode similarity.

Several recent studies[13, 55] state that the part-level relations are more robust to variations than global matching in tracking and propose pixel-wise correlation to implement this idea. However, similar to DW-XCorr, this matching method ignores the influence of channel association that is one of the main factors for generating target-aware features as discussed above. The pixel-to-global matching in PG-Net[33] adds channel kernels on this basis to unify the local positions similarity of the various channels.

All of the above are single-frame independent matching regardless of temporal information. The main use of temporal information in the current methods is updating the template with simple linear interpolation[5, 19, 38, 9] or learning an online classifier[10, 3, 37, 59]. The former have been proved insufficient in most tracking situations[57], while the latter takes excess time on the frames that require gradient optimization. By contrast, this paper extends the idea of matching to the time domain and achieves both high robustness and high efficiency.

3 Method

This section details the proposed SiamSTM. Figure 2 illustrates an overview of the framework. In the following, we first introduce the basic Siamese tracker that is inspired by CenterNet[60]. Then we elaborate the designed Spatio-Temporal Matching, which consists of a space-variant channel-guided correlation for spatial matching and an aberrance repressed module for temporal matching.

3.1 Basic Siamese Tracker

The basic Siamese tracker is composed of feature extraction subnetwork and target localization subnetwork, as shown in Figure 2. In the feature extraction subnetwork, we modify the ResNet50[18] according to SiamRPN++[28] to make it more suitable for the tracking task. Moreover, we cut off the $conv5$ of the ResNet50 and only retain the $conv1-conv4$ to reduce computations.

In the target localization subnetwork, considering that anchor-based trackers rely on a large number of heuristic hyper-parameters[54] and lack the competence to amend the weak predictions[58], we leverage an anchor-free structure. The right part of Figure 2 illustrates its architecture that represents the object with object center, object size and local offset[60]. Let $\hat{Y}\in\mathbb{R}^{H\times W\times 1}$ be an output center heatmap of height $H$ and width $W$ . For an arbitrary position $(x,y)$ in the heatmap, a prediction $\hat{Y}_{x,y}=1$ corresponds to the tracking object center, while $\hat{Y}_{x,y}=0$ is the background. The classification label $Y\in\mathbb{R}^{H\times W\times 1}$ adopts a Gaussian function $Y_{x,y}=exp\left(-\frac{\left(x-x_{c}\right)^{2}+\left(y-y_{c}\right)^{2}}{2\sigma_{p}^{2}}\right)$ , where $(x_{c},y_{c})$ denotes the coordinate of the target center point in the heatmap, and $\sigma_{p}$ is an object size-adaptive standard deviation[27]. The training objective is a penalty-reduced pixel-wise logistic regression with focal loss[34]. We refer readers to [27, 60] for more details.

To eliminate the discretization gap caused by the spatial stride $R=8$ , an additional local offset $\hat{O}\in\mathbb{R}^{H\times W\times 2}$ is calculated for each center point. For the object size, we directly predict a $\hat{S}\in\mathbb{R}^{H\times W\times 2}$ to represent the height and width in each location. The supervisions of local offset and object size only act at the unique target location in the tracking task, and other locations are ignored. In particular, the offset label $\{o_{x},o_{y}\}$ is $\left\{\frac{i}{R}-\lfloor\frac{i}{R}\rfloor,\frac{j}{R}-\lfloor\frac{j}{R}\rfloor\right\}$ for the target position $(i,j)$ in the original image, and the size label $\{s_{w},s_{h}\}$ is the groundtruth width and height after downsampling. Both of these are trained with L1 loss. The overall training objective of the basic tracker is:

\displaystyle L_{base}=L_{cls}+\lambda_{off}L_{off}+\lambda_{size}L_{size}

(1)

where $L_{off},L_{size}$ are the loss functions of local offset and object size. $\lambda_{off},\lambda_{size}$ are the trade-off hyper-parameters.

We argue that our center-based structure is more suitable for object tracking than the currently popular FCOS-based structure. The Gaussian label-assign method in the center-based structure is equivalent to unifying the classification score and localization quality in the FCOS-based structure into a joint and single representation. Thus, it eliminates the training-test inconsistency mentioned in [31] and enables a stronger correlation between classification and localization quality. Additionally, the ideal prediction heatmap should be a Gaussian-like probability distribution, and we can easily constrain its alteration in the time domain as detailed in Sec. 3.3.

3.2 Space-variant Channel-guided Correlation

The difference between 2-D image matching and 3-D deep feature matching, namely modeling of the associations between the channels, has been analyzed above. The convolutional network learns specific semantic information mainly based on a few channels. When applied to the tracking task, we can obtain the target-aware feature responses by identifying channels that are active to the target and are inactive to the interference.

For this purpose, we introduce a space-variant channel-guided correlation (SVC-Corr) to establish channel correspondence between the template and the search area on the basis of traditional 2-D correlation. The red dotted box in Figure 2 illustrates its structure, including two parallel branches. Given the template feature $z\in\mathbb{R}^{H_{z}\times W_{z}\times C}$ and the search feature $x\in\mathbb{R}^{H_{x}\times W_{x}\times C}$ , we first employ a DW-XCorr to obtain a spatial response, i.e., $f_{sa}(z,x)=z\star x$ , where $\star$ denotes DW-XCorr and $f_{sa}(z,x)\in\mathbb{R}^{(H_{x}-H_{z}+1)\times(W_{x}-W_{z}+1)\times C}$ is the correlation result.

The DW-XCorr branch embeds the template feature into the search features along height and width, while the other branch aims to channel-wise embedding. The core step is the extraction of channel descriptors for template and search area, meanwhile modeling the interdependencies between the channels. This is implemented by the Channel Transform (Ch Trans) in Figure 3. Specifically, after a $3\times 3$ convolution $\varphi_{1}$ , we squeeze the spatial dimension of the feature through max pooling $f_{k}^{max}$ and average pooling $f_{k}^{avg}$ to gather different clues about the channel importance[49]. Note that the pooling kernel size $k$ is equal to the template feature size (i.e., $f_{k}^{max,avg}(z)\in\mathbb{R}^{1\times 1\times C},f_{k}^{max,avg}(x)\in\mathbb{R}^{(H_{x}-H_{z}+1)\times(W_{x}-W_{z}+1)\times C}$ ), hence we can obtain various channel descriptors from each search subwindow (a sliding window of the same size as template feature size in the search region) and associate them with the template respectively. The pooled descriptors are further fed into a shared multiple fully connected ( $FC$ ) layer to capture the channel-wise interdependencies. Considering that there are multiple channel descriptors on the search feature, we can efficiently implement all $FC$ operations in parallel with $1\times 1$ convolution. The learned descriptors are merged by element-wise summation. In short, the Ch Trans is computed as:

\displaystyle T_{ch}(\omega)=FC(f_{k}^{max}(\varphi_{1}(\omega)))\oplus FC(f_{k}^{avg}(\varphi_{1}(\omega)))

(2)

where $\omega\in\{z,x\}$ denotes template or search features, $\oplus$ is element-wise summation.

With the channel descriptors of the template and search area, we adopt broadcasting element-wise summation followed by $1\times 1$ convolution $\varphi_{2}$ to generate space-variant weights for different channels at each location. Finally, these weights recalibrate channel-wise feature responses of the previous DW-XCorr using element-wise summation. The whole process of SVC-Corr can be expressed as:

	$\displaystyle f_{ca}(z,x)$	$\displaystyle=\varphi_{2}(T_{ch}(z)\oplus T_{ch}(x))$		(3)
	$\displaystyle svc\_corr(z,x)$	$\displaystyle=f_{ca}(z,x)\oplus f_{sa}(z,x)$		(4)

where $f_{ca}$ and $ca\_corr$ denote the channel weights and SVC-Corr result, respectively.

Our SVC-Corr has two prominent advantages. 1) Unlike channel attention that focuses on all semantic categories, our SVC-Corr is tailored to different instances. For each subwindow in the search area, SVC-Corr associates it with the template independently in channel dimension, i.e., $f_{ca}$ has a total of $(H_{x}-H_{z}+1)\times(W_{x}-W_{z}+1)$ space-variant channel weights. This makes the channel information of various spatial locations not interfere with each other, and thus is more beneficial for the generation of discriminative target-aware responses. Figure 4 shows a comparison between SVC-Corr and DW-XCorr. 2) All parameters in Eq. (2)- (4) are learnable rather than handcrafted and can benefit from the large-scale offline training.

3.3 Aberrance Repressed Module

After fully exploiting the 3-D information of deep features for single-frame matching, we further extend matching to time-dimension to explore the information hidden across multiple frames. Figure 5 indicates our temporal matching method, aberrance repressed module (ARM).

Instead of raw image patches, center heatmaps of the adjacent frames are used as the input data for our ARM. Based on the above analysis and existing works[22, 32], we observe two criteria: 1) An ideal heatmap should be unimodal with narrow peak. 2) The distribution of heatmaps between the adjacent frames should be as similar as possible after alignment. Consequently, the purpose of ARM is to maximize the similarity of the peak-aligned heatmaps between the adjacent frames and minimize their errors with Gaussian labels. We propose to train the objective function with KL divergence:

\displaystyle\begin{split}L_{arm}=&KL\left(Y_{i+k}\otimes Y_{i+k},\,\hat{Y}_{i}[\Delta_{p,q}]\otimes\hat{Y}_{i+k}\right)\\ +&KL\left(Y_{i}\otimes Y_{i},\,\hat{Y}_{i+k}[\Delta_{q,p}]\otimes\hat{Y}_{i}\right)\end{split}

(5)

where $\hat{Y}_{i},\hat{Y}_{i+k},Y_{i},Y_{i+k}$ denote heatmaps and Gaussian labels for adjacent frame $i$ and $i+k$ , as detailed in Sec. 3.1. $p,q$ represent peak locations of $\hat{Y}_{i},\hat{Y}_{i+k}$ , and $\Delta_{p,q}$ indicates the circular shifting operation to coincide peak $p$ with peak $q$ , while $\Delta_{q,p}$ indicates the opposite. $\otimes$ is element-wise multiplication. The KL divergence is calculated as $KL(y,x)=\int y\log(y)-\int ylog(x)$ .

In the first line of Eq. 5, the $\hat{Y}_{i}[\Delta_{p,q}]\otimes\hat{Y}_{i+k}$ term fuses the two peak-aligned heatmaps, the distribution of which can reflect the alteration between two frames. A sharp unimodal distribution indicates that the two heatmaps are similar, and the loss between it and the Gaussian label is low. When an aberrance occurs, this distribution will change suddenly, e.g., the peak value decreases or additional interference peaks appear, causing a high loss. Therefore, the minimization of this loss function can restrict the alteration of heatmaps to suppress aberrances. The second line of Eq. 5 is symmetrical with the first line, but the Gaussian label is replaced by that of another frame. This makes the most of the monitoring information of each frame and leads to a more reliable restriction.

The ARM realizes the temporal matching without extra parameters and can be trained end-to-end in conjunction with the above basic tracker. The overall loss function is:

\displaystyle L=L_{base}+\lambda_{arm}L_{arm}

(6)

where $L_{base}$ comes from the basic tracker loss in Eq. 1 and $\lambda_{arm}$ is a hyper-parameter.

Input: Sequence of length L, Initial bounding box

Output: Target bounding box for each frame

1 Initialize:

t=1

2 heatmap

\hat{Y}_{1}

\leftarrow

basic tracker with SVC-Corr

3 label

Y

, peak position

p

\leftarrow

initial box

4 Update

Y_{last}\leftarrow Y,\hat{Y}_{last}\leftarrow\hat{Y}_{1},p_{last}\leftarrow p

5 for $t\in[2,L]$ do

\hat{Y}_{t},\hat{O}_{t},\hat{S}_{t}

\leftarrow

basic tracker with SVC-Corr ;

7 Top

K

response positions

\{q_{1},...,q_{K}\}

\hat{Y}_{t},

;

8 for $k\in[1,K]$ do

9 Shift

Y_{last},\hat{Y}_{last}

according to

\Delta_{p_{last},q_{k}}

;

10 Calculate

L_{arm}^{k}

according to Eq.5 ;

12 end for

\hat{k}\leftarrow argmin_{k}L_{arm}^{k}

;

14 if $\hat{k}\neq 1$ then

\hat{Y}_{t}\leftarrow\left(1+\hat{Y}_{last}[\Delta_{p_{last},q_{\hat{k}}}]\right)\otimes\hat{Y}_{t}

;

16 Update

Y_{last}\leftarrow Y_{last}[\Delta_{p_{last},q_{\hat{k}}}],\hat{Y}_{last}\leftarrow\hat{Y}_{last}[\Delta_{p_{last},q_{\hat{k}}}],p_{last}\leftarrow q_{\hat{k}}

;

18 end if

19 Hanning window and scale penalty[29] on

\hat{Y}_{t}

;

\hat{x}_{t},\hat{y}_{t}\leftarrow argmax_{x,y}\hat{Y}_{t}

;

(\delta\hat{x}_{t},\delta\hat{y}_{t})\leftarrow\hat{O}_{\hat{x}_{t},\hat{y}_{t}},(\hat{w}_{t},\hat{h}_{t})\leftarrow\hat{S}_{\hat{x}_{t},\hat{y}_{t}}

;

22 Output

box_{t}\leftarrow\left(\hat{x}_{t}+\delta\hat{x}_{t},\hat{y}_{t}+\delta\hat{y}_{t},\hat{w}_{t},\hat{h}_{t}\right)

23 end for

Algorithm 1 SiamSTM online tracking

The online tracking procedure is summarised in Algorithm 1. After initialization, the basic tracker with SVC-Corr first generates center heatmap $\hat{Y}_{t}$ , offset prediction $\hat{O}_{t}$ and size prediction $\hat{S}_{t}$ (line 6). Lines 7-16 indicate the inference process of ARM. Specifically, we align the label $Y_{last}$ and the heatmap $\hat{Y}_{last}$ of the previous frame with the top $K$ response locations of $\hat{Y}_{t}$ respectively, and then calculate the KL divergence $L_{arm}^{k}$ for each pair. Note that only the first term in Eq. 5 is calculated here, since we only have the label of the previous frame. The $q_{\hat{k}}$ that minimize $L_{arm}^{k}$ can be considered as the optimal peak position with the temporal constraint. If $q_{\hat{k}}$ is different from the original peak position $q_{1}$ in $\hat{Y}_{t}$ , we add $\hat{Y}_{last}[\Delta_{p_{last},q_{\hat{k}}}]$ as a weight on $\hat{Y}_{t}$ (line 14) and update related variables (line 15). Finally, the maximum response location on the weighted heatmap $\hat{Y}_{t}$ is extracted as the target center (line 18), combining with the offset prediction and the size prediction on this point to produce the bounding box (lines 19-20). Our ARM utilizes temporal information during both raining and inference. Compared with linear updating, it benefits from large-scale offline training and considers both the target and the background. Consequently, ARM is more flexible in adjusting each position of the response for different tracking situations. Meanwhile, it is much more efficient than online classifiers and achieves a tracking speed of 66 FPS.

4 Experiments

This section presents the results of SiamSTM on OTB100[51], VOT2018[24], VOT2020[23], GOT-10k[21] and LaSOT[14], with comparisons to the state-of-the-art algorithms. A detailed ablation study is then performed to evaluate the effects of each component in our model.

4.1 Implementation Details

The proposed SiamSTM is trained on ImageNet VID[43], ImageNet DET[43], Youtube-BB[41], COCO[35] and GOT-10k[21] for experiments on OTB100[51], VOT2018[24] and VOT2020[23]. In addition, for experiments on GOT-10k[21] and LaSOT[14], the model is only trained with their official training set for a fair comparison. To train ARM, the training input is a triplet containing a template image and two search images with an interval less than 100 frames. The template size is $127\times 127$ pixels, while the search image is $255\times 255$ pixels. The training batch size is set to 96 and a total of 20 epochs are trained with SGD on 4 TITANV. We use a learning rate that linearly increased from 0.001 to 0.005 for the first 5 warmup epochs and then exponentially decayed to 0.0005 for the rest of 15 epochs. For the first 10 epochs, we freeze the parameters of the backbone. For the remaining epochs, the backbone network is unfrozen, and the whole network is trained end-to-end. The hyper-parameters $\lambda_{off},\lambda_{size}$ and $\lambda_{arm}$ in Eq. 1 and Eq. 6 are set to 1, 0.1 and 0.5, respectively. The $K$ in Algorithm 1 is set to 3.

4.2 State-of-the-art Comparison

LADCF

[53]

UPDT

[4]

SiamRPN++

[28]

SiamFC++

[54]

SiamBAN

[6]

SiamMask

[48]

SiamRCNN

[47]

DiMP

[3]

PGNet

[33]

SiamSTM

EAO

\uparrow

0.389

0.379

0.416

0.430

0.452

0.383

0.408

0.440

0.451

0.488

\uparrow

0.503

0.530

0.596

0.581

0.592

0.604

0.609

0.597

0.618

0.604

\downarrow

0.159

0.182

0.234

0.183

0.178

0.276

0.220

0.153

0.192

0.103

Table 1: Performance comparisons on VOT2018. Red, blue and green fonts indicate the top-3 trackers.

SiamFC

[2]

SiamRPN++

[28]

SiamRCNN

[47]

SiamMargin

[25]

FCOT

[8]

ATOM

[10]

DiMP

[3]

SiamMask

[48]

SiamSTM

mask

EAO

\uparrow

0.179

0.247

0.233

0.226

0.247

0.271

0.281

0.328

0.299

0.391

\uparrow

0.418

0.435

0.458

0.415

0.421

0.462

0.448

0.589

0.480

0.625

\uparrow

0.502

0.670

0.610

0.637

0.703

0.734

0.743

0.634

0.756

\times

\times

\times

\times

\times

\times

\times

\checkmark

\times

\checkmark

Table 2: Performance comparisons on VOT2020. ”M” denotes whether the mask results are predicted. The last column indicates that the mask branch of SiamMask is added to our SiamSTM to output the object mask.

ECO

[11]

SiamFC

[2]

SiamRPN++

[28]

SiamFC++

[54]

ROAM

[56]

SiamCAR

[15]

ATOM

[10]

DiMP

[3]

Ocean

[58]

SiamSTM

\uparrow

0.316

0.348

0.518

0.595

0.465

0.569

0.556

0.611

0.624

\rm SR_{50}

\uparrow

0.309

0.353

0.618

0.695

0.532

0.670

0.635

0.717

0.721

0.730

\rm SR_{75}

\uparrow

0.111

0.098

0.325

0.479

0.236

0.415

0.402

0.492

0.473

0.503

Table 3: Performance comparisons on the GOT-10k test-set.

To extensively evaluate the proposed method, we compare it with state-of-the-art trackers on five challenging tracking datasets. The selected comparison methods include XCorr based tracker[2, 46], DW-XCorr based trackers[28, 48, 54, 15, 6, 58], pixel-wise correlation based tracker[33], correlation filter based trackers[53, 4, 11], meta-learning based trackers[10, 3, 8, 56], multi-domain learning based trackers[39, 45] and others[25, 47].

VOT2018[24] VOT2018 consists of 60 sequences with various challenging factors. The overall performance of the tracker is evaluated in terms of Expected Average Overlap (EAO) that takes both accuracy (A) and robustness (R) into account. Table 1 shows the comparison with the existing top-performing trackers on VOT2018. Our SiamSTM attains the best EAO and robustness among all of the methods. Compared with the offline trackers, i.e., anchor-based SiamRPN++[28] and anchor-free SiamBAN[6], our model achieves EAO improvements of 7.2 and 3.6 points. It is worth noting that the improvements mainly come from the robustness, obtaining 56% and 42.1% increases over SiamRPN++ and SiamBAN, respectively. Impressively, the robustness of our method exceeds those methods relying on online adaptation (LADCF[53], UPDT[4], ATOM[10] and DiMP[3]). This further demonstrates that SiamSTM is more feasible in the use of temporal information. PGNet is superior in terms of accuracy because of its more detailed local pixel-wise correlation, but it is still inferior to our method with respect to robustness and EAO.

Sequences in VOT2018 are annotated by the following visual attributes: occlusion, illumination change, motion change, size change and camera motion. Frames that do not correspond to any of the five attributes are denoted as unassigned. We compare the EAO of the above methods on these visual attributes in Figure 6. Our SiamSTM ranks first on almost all of the attributes except for unassigned and outperforms the second place by a wide margin in the challenging occlusion and camera motion.

VOT2020[23] VOT2020 contains 60 sequences and redefines accuracy, robustness and EAO in the context of the new anchor-based evaluation protocol[23]. Another significant novelty is that the target position was encoded by a segmentation mask. Since mask prediction is not our main purpose, we utilize the bbox evaluation mode to compare different trackers. Table 2 shows the evaluation results on VOT2020, and ’M’ denotes whether mask results are predicted. The proposed model achieves the best performance (EAO, A and R) among all of the compared trackers that output the bounding box. The robustness of our model surpasses that of DiMP with elaborate meta-learning by 1.3 points. Moreover, the online mechanism in Algorithm 1 is a linear operation and consumes much less time than the fine-tuning of meta-learning. We observe that SiamMask surpasses SiamRPN++ by more than 8.1 points on EAO and 15.4 points on accuracy. Therefore, we simply complement our framework with the mask branch in SiamMask to obtain the object mask and improve the EAO by 9.2 points and accuracy by 14.5 points. We have reason to believe that the results can be further improved by a careful modification of the mask prediction.

GOT-10k[21] GOT-10k is a large-scale dataset containing over 10 thousand videos and is challenging in terms of zero-class-overlap between the provided train-set and test-set. For a fair comparison, we follow the the protocol of GOT-10k and only train our model with its train-set and evaluate the performance on the test-set of 180 videos. The performance indicators are average overlap (AO) and success rate (SR). As shown in Table 3, the proposed SiamSTM achieves a best AO of 0.624, outperforming previous SOTA DiMP[3] and Ocean[58] by 2.1 points.

OTB100[51] OTB100 is a widely used public tracking benchmark consisting of 100 sequences. Precision (Prec.) and area under curve (AUC) are used to rank the trackers. As reported in Figure 7, among the compared methods, our SiamSTM achieves the best performance on both precision (0.922) and AUC (0.707). It even improves compared to the redetection-based SiamRCNN[47] by 1% on AUC with a much simple network architecture.

LaSOT[14] To further evaluate the proposed model on long-term tracking, we report the results on LaSOT. Compared with the previous datasets, LaSOT has longer sequences with an average sequence length of more than 2,500 frames. In Figure 8, we show normalized precision and success plots for 280 videos in the test-set. Our tracker ranks first in terms of AUC, second in terms of normalized precision and 0.5% higher than DiMP. This demonstrates that the 4-D spatio-temporal matching can adapt to the complexity of long-term tracking and avoid model drift.

4.3 Ablation Study

To further verify the efficacy of the proposed method, we perform an ablation study on the GOT-10k test-set, as presented in Table 4. Note that triplet input is only required when training ARM, otherwise the training input is still a pair of a template image and a search image.

Head Structure. The FCOS head is used to estimate the distances from each pixel within the target object to the four sides of the groundtruth bounding box, while the Center head is used to estimate the target center point and its corresponding size. By replacing FCOS head with Center head, the AO is improved by 1.7 points (② vs. ①). The improvement mainly originates from the difference in the label assign, where our Gaussian label can jointly optimize the classification branch and the quality estimation branch, and thus produces more confident and reliable predictions.

SVC-Corr vs. DW-XCorr. To analyze the contribution of the proposed SVC-Corr, we conduct a comparison to the popular DW-XCorr. The SVC-Corr obtains significant AO gains of 1.6 points compared to DW-XCorr (③ vs. ②). We attribute these gains to the benefits from the recalibration of the channel-wise feature responses. Moreover, Figure 1 and 4 visualize the responses of these two correlation methods. Our SVC-Corr only activates a few target-aware channels, and meanwhile weaken the remaining redundant and irrelevant channels, hence its activated channel responses are sparser than that of DW-XCorr in Figure 1. Figure 4 further demonstrates that the SVC-Corr establishes different channel associations between each subwindow and template, making the response maps more capable to discriminate the targets and disturbances. We quantitatively analyze the benefits of this space-variant association, meanwhile integrate it into two other trackers (SiamRPN++[28] and SiamBAN[6]) to show the generalization ability. Detailed results are provided in the supplementary material.

Aberrance Repressed Module. Last, we evaluate the impact of the temporal matching carried out by ARM. The ARM brings 2.3 points AO gains on DW-XCorr (④ vs. ②) and 1.8 points AO gains on SVC-Corr (⑤ vs. ③), respectively. Additionally, we state that the performance gains benefit from the temporal constraint rather than triplet training input, as shown in the supplementary material. The results verify that ARM can utilize information hidden in the time domain to suppress aberrances and is thus more robust and accurate for object tracking. Figure 1b intuitively proves that ARM is competent in dealing with distractors.

Num	Head	DW-XCorr	SVC-Corr	ARM	AO
①	FCOS	$\checkmark$			0.570
②	Center	$\checkmark$			0.587
③	Center		$\checkmark$		0.603
④	Center	$\checkmark$		$\checkmark$	0.610
⑤	Center		$\checkmark$	$\checkmark$	0.624

Table 4: Ablation study on GOT-10k. Head represents the tracking head using FCOS-based structure or Center-based structure.

5 Conclusion

We propose a novel Siamese Spatial-Temporal Matching (SiamSTM) tracking network based upon the observation that traditional cross correlation ignores the matching relationships of channel and time. Our SVC-Corr realizes the space-variant channel-wise information recalibration for each matching position and yields a target-aware response map. Moreover, by extending temporal matching ARM, our approach efficiently mines the information propagated in multiframes to suppress the aberrances during tracking. Both innovations can benefit from large-scale offline training and run with a high speed. Comprehensive experiments on five benchmark datasets indicate that the proposed SiamSTM achieves state-of-the-art performance. In future work, we plan to embed the mask prediction end-to-end into this framework to accommodate VOT2020 and VOS tasks.

References

[1] D. I. Barnea and H. F. Silverman. A class of algorithms for fast digital image registration. IEEE Transactions on Computers, C-21(2):179–186, 1972.
[2] Luca Bertinetto, Jack Valmadre, Joao F Henriques, Andrea Vedaldi, and Philip HS Torr. Fully-convolutional siamese networks for object tracking. In Proceedings of the European Conference on Computer Vision, pages 850–865. Springer, 2016.
[3] Goutam Bhat, Martin Danelljan, Luc Van Gool, and Radu Timofte. Learning discriminative model prediction for tracking. In Proceedings of the IEEE International Conference on Computer Vision, pages 6182–6191, 2019.
[4] Goutam Bhat, Joakim Johnander, Martin Danelljan, Fahad Shahbaz Khan, and Michael Felsberg. Unveiling the power of deep tracking. In Proceedings of the European Conference on Computer Vision, pages 483–498, 2018.
[5] David S Bolme, J Ross Beveridge, Bruce A Draper, and Yui Man Lui. Visual object tracking using adaptive correlation filters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2544–2550. IEEE, 2010.
[6] Zedu Chen, Bineng Zhong, Guorong Li, Shengping Zhang, and Rongrong Ji. Siamese box adaptive network for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6668–6677, 2020.
[7] Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 539–546. IEEE, 2005.
[8] Yutao Cui, Cheng Jiang, Limin Wang, and Gangshan Wu. Fully convolutional online tracking. arXiv preprint arXiv:2004.07109, 2020.
[9] Kaiheng Dai, Yuehuan Wang, and Xiaoyun Yan. Long-term object tracking based on siamese network. In IEEE International Conference on Image Processing, pages 3640–3644. IEEE, 2017.
[10] Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. Atom: Accurate tracking by overlap maximization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4660–4669, 2019.
[11] Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. Eco: Efficient convolution operators for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6638–6646, 2017.
[12] Xingping Dong and Jianbing Shen. Triplet loss in siamese network for object tracking. In Proceedings of the European Conference on Computer Vision, pages 459–474, 2018.
[13] Fei Du, Peng Liu, Wei Zhao, and Xianglong Tang. Correlation-guided attention for corner detection based visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6836–6845, 2020.
[14] Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling. Lasot: A high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5374–5383, 2019.
[15] Dongyan Guo, Jun Wang, Ying Cui, Zhenhua Wang, and Shengyong Chen. Siamcar: Siamese fully convolutional classification and regression for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6269–6277, 2020.
[16] Qing Guo, Wei Feng, Ce Zhou, Rui Huang, Liang Wan, and Song Wang. Learning dynamic siamese network for visual object tracking. In Proceedings of the IEEE International Conference on Computer Vision, pages 1763–1771, 2017.
[17] Meenakshi Gupta, Swagat Kumar, Laxmidhar Behera, and Venkatesh K Subramanian. A novel vision-based tracking algorithm for a human-following mobile robot. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 47(7):1415–1427, 2016.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[19] João F Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):583–596, 2014.
[20] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
[21] Lianghua Huang, Xin Zhao, and Kaiqi Huang. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[22] Ziyuan Huang, Changhong Fu, Yiming Li, Fuling Lin, and Peng Lu. Learning aberrance repressed correlation filters for real-time uav tracking. In Proceedings of the IEEE International Conference on Computer Vision, pages 2891–2900, 2019.
[23] Matej Kristan, Ales Leonardis, Jiri Matas, Michael Felsberg, Roman Pflugfelder, Joni-Kristian Kamarainen, Luka Čehovin Zajc, Martin Danelljan, Alan Lukezic, Ondrej Drbohlav, Linbo He, Yushan Zhang, Song Yan, Jinyu Yang, Gustavo Fernandez, and et al. The eighth visual object tracking vot2020 challenge results, 2020.
[24] Matej Kristan, Ales Leonardis, Jiri Matas, Michael Felsberg, Roman Pfugfelder, Luka Čehovin Zajc, Tomas Vojir, Goutam Bhat, Alan Lukezic, Abdelrahman Eldesokey, Gustavo Fernandez, and et al. The sixth visual object tracking vot2018 challenge results, 2018.
[25] Matej Kristan, Jiri Matas, Ales Leonardis, Michael Felsberg, Roman Pflugfelder, Joni-Kristian Kamarainen, Luka Cehovin Zajc, Ondrej Drbohlav, Alan Lukezic, Amanda Berg, et al. The seventh visual object tracking vot2019 challenge results. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 0–0, 2019.
[26] Vincent A Laurense, Jonathan Y Goh, and J Christian Gerdes. Path-tracking for autonomous vehicles at the limit of friction. In American Control Conference, pages 5586–5591. IEEE, 2017.
[27] Hei Law and Jia Deng. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision, pages 734–750, 2018.
[28] Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4282–4291, 2019.
[29] Bo Li, Junjie Yan, Wei Wu, Zheng Zhu, and Xiaolin Hu. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8971–8980, 2018.
[30] Xin Li, Chao Ma, Baoyuan Wu, Zhenyu He, and Ming-Hsuan Yang. Target-aware deep tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1369–1378, 2019.
[31] Xiang Li, Wenhai Wang, Lijun Wu, Shuo Chen, Xiaolin Hu, Jun Li, Jinhui Tang, and Jian Yang. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. In Advances in Neural Information Processing Systems, 2020.
[32] Yiming Li, Changhong Fu, Fangqiang Ding, Ziyuan Huang, and Geng Lu. Autotrack: Towards high-performance visual tracking for uav with automatic spatio-temporal regularization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11923–11932, 2020.
[33] Bingyan Liao, Chenye Wang, Yayun Wang, Yaonong Wang, and Jun Yin. Pg-net: Pixel to global matching network for visual tracking. In Proceedings of the European Conference on Computer Vision, pages 429–444. Springer, 2020.
[34] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
[35] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, pages 740–755. Springer, 2014.
[36] Liwei Liu, Junliang Xing, Haizhou Ai, and Xiang Ruan. Hand posture recognition using finger geometric feature. In Proceedings of the International Conference on Pattern Recognition, pages 565–568. IEEE, 2012.
[37] Alan Lukezic, Jiri Matas, and Matej Kristan. D3s-a discriminative single shot segmentation tracker. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7133–7142, 2020.
[38] Chao Ma, Jia-Bin Huang, Xiaokang Yang, and Ming-Hsuan Yang. Hierarchical convolutional features for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, pages 3074–3082, 2015.
[39] Hyeonseob Nam and Bohyung Han. Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4293–4302, 2016.
[40] Jongchan Park, Sanghyun Woo, Joon-Young Lee, and In So Kweon. Bam: Bottleneck attention module. arXiv preprint arXiv:1807.06514, 2018.
[41] Esteban Real, Jonathon Shlens, Stefano Mazzocchi, Xin Pan, and Vincent Vanhoucke. Youtube-boundingboxes: A large high-precision human-annotated data set for object detection in video. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5296–5305, 2017.
[42] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2016.
[43] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
[44] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[45] Yibing Song, Chao Ma, Xiaohe Wu, Lijun Gong, Linchao Bao, Wangmeng Zuo, Chunhua Shen, Rynson W. H. Lau, and Ming-Hsuan Yang. VITAL: visual tracking via adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8990–8999, 2018.
[46] Jack Valmadre, Luca Bertinetto, Joao Henriques, Andrea Vedaldi, and Philip HS Torr. End-to-end representation learning for correlation filter based tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2805–2813, 2017.
[47] Paul Voigtlaender, Jonathon Luiten, Philip HS Torr, and Bastian Leibe. Siam r-cnn: Visual tracking by re-detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6578–6588, 2020.
[48] Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, and Philip HS Torr. Fast online object tracking and segmentation: A unifying approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1328–1338, 2019.
[49] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, pages 3–19, 2018.
[50] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Online object tracking: A benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2411–2418, 2013.
[51] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1834–1848, 2015.
[52] Junliang Xing, Haizhou Ai, and Shihong Lao. Multiple human tracking based on multi-view upper-body detection and discriminative learning. In Proceedings of the International Conference on Pattern Recognition, pages 1698–1701. IEEE, 2010.
[53] Tianyang Xu, Zhen-Hua Feng, Xiao-Jun Wu, and Josef Kittler. Learning adaptive discriminative correlation filters via temporal consistency preserving spatial feature selection for robust visual object tracking. IEEE Transactions on Image Processing, 28(11):5596–5609, 2019.
[54] Yinda Xu, Zeyu Wang, Zuoxin Li, Ye Yuan, and Gang Yu. Siamfc++: Towards robust and accurate visual tracking with target estimation guidelines. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 12549–12556, 2020.
[55] Bin Yan, Xinyu Zhang, Dong Wang, Huchuan Lu, and Xiaoyun Yang. Alpha-refine: Boosting tracking performance by precise bounding box estimation. arXiv preprint arXiv:2012.06815, 2020.
[56] Tianyu Yang, Pengfei Xu, Runbo Hu, Hua Chai, and Antoni B Chan. Roam: Recurrently optimizing tracking model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6718–6727, 2020.
[57] Lichao Zhang, Abel Gonzalez-Garcia, Joost van de Weijer, Martin Danelljan, and Fahad Shahbaz Khan. Learning the model update for siamese trackers. In Proceedings of the IEEE International Conference on Computer Vision, pages 4010–4019, 2019.
[58] Zhipeng Zhang, Houwen Peng, Jianlong Fu, Bing Li, and Weiming Hu. Ocean: Object-aware anchor-free tracking. In Proceedings of the European Conference on Computer Vision, pages 771–787. Springer, 2020.
[59] Jinghao Zhou, Peng Wang, and Haoyang Sun. Discriminative and robust online learning for siamese visual tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 13017–13024, 2020.
[60] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.