All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment

Chunhui Zhang, Xin Sun, Li Liu, Yiqian Yang, Qiong Liu, Xi Zhou, Yanfeng Wang Chunhui Zhang, and Xin Sun are with the Cooperative Medianet Innovation Center, Shanghai Jiao Tong University, Shanghai, 200240, China and the CloudWalk Technology Co., Ltd, 201203, China. Emails: [email protected], [email protected] Liu is with the Hong Kong University of Science and Technology (Guangzhou), Guangzhou, 511458, China. E-mail: [email protected] Yang is with the Northwestern Polytechnical University, Xi’an, 710072, China. E-mail: [email protected] Liu, and Xi Zhou are with the CloudWalk Technology Co., Ltd, 201203, China. E-mails: [email protected], [email protected] Wang is with the Cooperative Medianet Innovation Center, Shanghai Jiao Tong University, Shanghai, 200240, China and the Shanghai AI Lab. E-mail: [email protected] author: Li Liu.

Abstract

Current mainstream vision-language (VL) tracking framework consists of three parts, i.e., a visual feature extractor, a language feature extractor, and a fusion model. To pursue better performance, a natural modus operandi for VL tracking is employing customized and heavier unimodal encoders, and multi-modal fusion models. Albeit effective, existing VL trackers separate feature extraction and feature integration, resulting in extracted features that lack semantic guidance and have limited target-aware capability in complex scenarios, e.g., similar distractors and extreme illumination. In this work, inspired by the recent success of exploring foundation models with unified architecture for both natural language and computer vision tasks, we propose an All-in-One framework, which learns joint feature extraction and interaction by adopting a unified transformer backbone. Specifically, we mix raw vision and language signals to generate language-injected vision tokens, which we then concatenate before feeding into the unified backbone architecture. This approach achieves feature integration in a unified backbone, removing the need for carefully-designed fusion modules and resulting in a more effective and efficient VL tracking framework. To further improve the learning efficiency, we introduce a multi-modal alignment module based on cross-modal and intra-modal contrastive objectives, providing more reasonable representations for the unified All-in-One transformer backbone. Extensive experiments on five benchmarks, i.e., OTB99-L, TNL2K, LaSOT, LaSOT_Ext and WebUAV-3M, demonstrate the superiority of the proposed tracker against existing state-of-the-arts on VL tracking. Codes will be made publicly available.

Index Terms:

Unified vision-language tracking, Multi-modal alignment, Transformer, Foundation model.

I Introduction

Vision-language (VL) tracking [1, 2, 3, 4, 5], one of the fundamental and challenging problems at the intersection of computer vision and natural language understanding, aims to locate the object in each frame of a video based on a given natural language prompt and an initial object box. It plays a crucial role in human-machine interaction, transportation surveillance, virtual reality, autonomous driving, delivery, etc. Compared with traditional visual object tracking [6, 7] using a bounding box to describe the object of interest, VL tracking has the potential to achieve more robust tracking by leveraging the complementary superiority of multiple modalities.

Refer to caption — Figure 1: Existing VL tracking framework *vs.* our All-in-One. (a) Existing VL tracking methods obtain multiple modality features from separate extractors before fusion. The feature interaction relies on a carefully-designed fusion model. (b) We aim to build a foundation model, i.e., All-in-One, for VL tracking, which achieves joint feature extraction and multi-modal interaction using a versatile transformer encoder.

In the past few years, two-stream VL trackers [2, 4, 8, 5], which extract visual features and language features separately and then perform feature interaction in a fusion model (as shown in Fig 1(a)), have emerged as a domain framework and obtained significant progresses. For instance, Feng et al. [4] proposed a Siamese natural language region proposal network for multi-stage feature extraction, and then applied an aggregation module to dynamically combine predictions from both visual and language modalities. Guo et al. [5] suggested an asymmetrical modeling architecture to learn adaptive VL representations. Following the two-stream pipeline, the latest transformer-based VL tracker JointNLT [9] formulates grounding and tracking as a unified task of establishing relation between visual-language references and test images via a multi-source relation modeling architecture.

Despite the convincing designs of existing two-stream VL trackers, they still suffer from the fundamental challenge of learning target-aware capability in some complex and corner scenarios, e.g., similar distractors, occlusion, and extreme illumination [1, 5]. Firstly, the separation of feature extraction and integration prevents the model from performing early multi-modal feature interaction, resulting in limited object-background discriminative power [10, 11]. Although some works have attempted to design complicated [8] or multi-stage [4, 5] fusion models to enhance the associations between modalities, the lack of mutual interaction remains an insurmountable gap. More seriously, heavy fusion models increase the number of parameters, leading to significant computational inefficiency. Secondly, performing feature interaction directly ignores the huge distribution discrepancies between the vision and language modalities in the feature space [12], leading to significant learning inefficiency in VL representation learning.

To tackle the above issues, we propose a unified framework (as shown in Fig 1(b)), namely All-in-One, for VL tracking. The core idea is to establish bidirectional information flow between well aligned visual and language signals as early as possible via a unified transformer backbone. Our All-in-One framework brings multiple new benefits for multi-modal VL tracking. (1) The unified architecture not only simplifies the model, but also leads to more efficient VL representation learning. (2) It has great potential to serve as a foundation model for VL tracking. With this framework, we develop a general VL tracking model, which generalizes well to complex, user-defined language descriptions/prompts on various VL tracking datasets. (3) Compared with the two-stream vision language foundation models (e.g., CLIP [13]), our All-in-One framework follows the simple and general one-stream route [10, 11, 14].

Specifically, we introduce a versatile All-in-One transformer, as shown in Fig. 2, to embed raw visual and language signals into joint VL representations, and the produced visual search region features can be directly used for object location without additional fusion model. The visual inputs (i.e., search region and template) and language input are first mapped by a patch embed and a text tokenizer, respectively, and then flattened into the same dimension embeddings. A modal mixup operation is used to inject language information into the visual embeddings (i.e., template embeddings and search region embeddings), followed by a stack of transformer encoder layers enabling iteratively feature integrating between template and search region embeddings with language information guidance. Thus, both template and search region embeddings can be enhanced dynamically with strong target-aware capability. In addition, we introduce a multi-modal alignment (MMA) module to alleviate the huge distribution discrepancies between multiple modalities based on contrastive learning (CL) [15]. The MMA module includes cross-modal alignment (CMA) and intra-modal alignment (IMA), forcing the visual and language signals from the same video to be close in the feature space, while making the distribution of multi-modal features more uniform and reasonable in the entire feature space, which can promote feature integration. In conclusion, our main contributions can be summarized as follows:

•

We propose a simple, compact and effective one-stream framework for VL tracking, namely All-in-One, which learns VL representations from raw visual and language signals end-to-end in a unified transformer backbone.
•

We develop a novel multi-modal alignment module incorporating cross-modal and intra-modal alignments to enable efficient multi-modal learning by aligning multiple signals in the feature space before learning.
•

Extensive experiments demonstrate that the proposed approach achieves higher accuracy against state-of-the-art (SOTA) trackers.

II Related Work

II-A Vision-Language Tracking

In recent years, the two-stream framework [2, 4, 8, 5] has emerged as a dominant VL tracking paradigm (see Fig. 1(a)). They first extract features using two independent unimodal feature extractors, and then model relation of visual features and language features in sequential manner by a lightweight [5] or relatively heavy [9] network. Early work [3] contains a visual specification network and a lingual specification network, and further selectively focuses on parts of language prompts using a lingual specification attention network. Later, GTI [16] and [2] decompose the VL tracking problem into three sub-tasks of visual tracking, grounding and integration. VLT_TT [5] suggests to learn VL representations through an asymmetrical modeling architecture. JointNLT [9] introduces a joint visual grounding and tracking framework by localizing the referred object based on the visual-language references. However, these works rely on separate visual and language encoders to extract multi-modal features, leading to limited information interaction. We note that several works [10, 11, 14] introduce one-stream framework for visual object tracking. Different from them, we extend the one-stream framework to multi-modal VL tracking by training jointly on videos and language prompts. As shown in Fig. 1(b), for the first time we seamlessly integrate feature extraction and interaction into a unified backbone architecture for VL tracking. The proposed framework not only enables information flow from language to vision, but also allows bidirectional integration of information between visual and language features.

II-B Transformer for Unified Architecture

Thanks to the scalability to very large-scale model and capability to handle sequential and non-sequential data, transformer has become a prevailing architecture in both natural language [17, 18] and computer vision [19, 20] communities. Following ViT [19], a series of variants of ViT have been developed to improve the performance on vision tasks, including reducing computational cost [21, 20], and architecture design [22, 23]. Additionally, transformer has extensively used in various multi-modal tasks [24, 25, 26].

In consideration of the capacity of the unified transformer model to handle unimodal input or multi-modal input with a shared encoder, a few pioneering works have tried to explore unified multi-modal encoders [27, 28, 29]. For instance, ViLT [28] proposed a vision-language transformer without using regional features or deep convolutional visual embeddings. VATT [29] developed a video-audio-text transformer using multi-modal self-supervised pre-training to improve the performance of video action recognition and audio event classification. In this paper, we follow the trend of unified architecture for multi-modal data. To the best of our knowledge, the proposed All-in-One transformer is the first unified backbone architecture for multi-modal VL tracking.

II-C Multi-Modal Learning

Recently, a potential multi-modal learning paradigm is to adopt transformer to process and relate information from multiple modalities [30, 13, 31, 32]. CLIP [13] applied language prompts as supervisory signals to learn better visual representations. VisualBERT [31], VilBERT [33], and Unicoder-VL [32] combined visual and textual features into transformers to capture cross-modal relationships. However, previous works mainly focus on how to learn multi-modal representations by exploiting the complementary advantages of multiple modalities, or to fuse multi-modal features for prediction. Multi-modal alignment, discovering relationships and correspondences between fine-grained elements (e.g., objects and words) of instances (e.g., images and languages) from two or more modalities, has been rarely explored. In this work, we propose the MMA module with CMA and IMA based on self-supervised CL [15] to explore efficient multi-modal learning for VL tracking.

III Proposed Method

This section presents the All-in-One, a simple yet effective framework for the VL tracking task. Our All-in-One framework consists of an All-in-One transformer backbone, a multi-modal alignment module, and a tracking head, as shown in Fig. 2. The All-in-One transformer backbone is used to achieve feature extraction and information interaction between visual inputs (i.e., visual search region and visual template) and language input simultaneously in a unified architecture. Before that, visual embeddings and language embeddings are aligned via a multi-modal alignment module, providing more reasonable feature embeddings in the feature space. The output features of the visual search region are sent to the tracking head to predict the location of the target.

III-A Problem Formulation

Before detailing the architecture of our All-in-One framework, we briefly review the transformer tracking [11, 34, 10, 35, 14], which achieves remarkable tracking performance. Given a video with a pair of visual template and visual search region $\mathcal{X}_{xz}$ , an initial target box $\mathcal{B}_{0}$ , the transformer tracking can be formulated as $F_{trans}:\{\mathcal{X}_{xz},\mathcal{B}_{0}\}\rightarrow\mathcal{B}$ , where $F_{trans}$ is the transformer tracker, $\mathcal{B}$ is the predicted box of the target in all subsequent search frames. In general, the transformer tracker $F_{trans}$ can be decomposed into $\Phi\circ f$ , where $f:\{\mathcal{X}_{xz},\mathcal{B}_{0}\}\rightarrow\mathcal{H}$ denotes the backbone (e.g., ViT [19]) for feature extraction and interaction function, $\mathcal{H}$ represents the output features of visual search region, and the tracking head $\Phi:\mathcal{H}\rightarrow\mathcal{B}$ is adopted to predict the target box.

Specifically, a pair of images, namely visual search region $\mathbf{x}\in\mathbb{R}^{3\times H_{x}\times W_{x}}$ and visual template $\mathbf{z}\in\mathbb{R}^{3\times H_{z}\times W_{z}}$ are divided into $N_{x}$ and $N_{z}$ non-overlapping image patches of resolution $P\times P$ , where $N_{x}=H_{x}W_{x}/P^{2}$ and $N_{z}=H_{z}W_{z}/P^{2}$ are the number of patches of search region and template, respectively. Then, a linear projection is applied to these image patches to generate 1D tokens $\mathcal{H}_{x}\in\mathbb{R}^{N_{x}\times D}$ and $\mathcal{H}_{z}\in\mathbb{R}^{N_{z}\times D}$ , where $D$ is the token dimension. Two learnable positional embeddings are added to $\mathcal{H}_{x}$ and $\mathcal{H}_{z}$ to retain the position information. After that, these tokens are concatenated as a sequence $\mathcal{H}_{xz}^{0}=[\mathcal{H}_{x};\mathcal{H}_{z}]$ and fed to a $L$ -layer transformer encoder. Here, we represent $\mathcal{H}_{xz}^{l-1}$ as inputs to the $l$ -th encoder layer $E^{l}$ . Formally, the forward operation of the $l$ -th encoder layer can be written as:

\mathcal{H}^{l}_{xz}=E^{l}(\mathcal{H}^{l-1}_{xz}),l=1,2,3,...,L

(1)

\mathcal{B}=\Phi(\mathcal{H}^{L}_{xz}),

(2)

where each transformer encoder layer contains a multi-head self-attention (MHSA), and a feed-forward network (FFN). Each sub-layer is constructed as a residual connection, where layer normalization (LN) is followed by the residual connection. The visual search region tokens $\mathcal{H}^{L}_{x}$ of the last transformer encoder layer is taken as the input of tracking head $\Phi$ for target box prediction.

For VL tracking [1, 36, 2], it introduces an extra language prompt $\mathcal{T}$ for each video to express the attribute, behavior, position (relative location), and surroundings of the target. Accordingly, VL tracking can be formulated as $F_{VL}:\{\mathcal{X}_{xz},\mathcal{B}_{0},\mathcal{T}\}\rightarrow\mathcal{B}$ , where $F_{VL}$ is the VL tracker. Similarly, the VL tracker $F_{VL}$ can also be decomposed into $\Phi\circ f^{*}$ , where $\Phi$ is the tracking head, and $f^{*}$ represents the proposed unified backbone architecture in this work.

III-B Unified Vision-Language Tracking

Fig. 2 gives an overview of our All-in-One framework for VL tracking. To optimize the VL tracker $F_{VL}$ , a pair of visual template and visual search region $\mathcal{X}_{xz}=\{\mathcal{X}_{x},\mathcal{X}_{z}\}$ , and an extra language prompt $\mathcal{T}$ are first fed to a patch embed (i.e., a linear projection) and a text tokenizer [18], respectively. They are mapped and flattened into $D$ -dimension embeddings, where $D=768$ . We denote them as vision tokens (i.e., $\mathcal{H}^{0}_{x}$ and $\mathcal{H}^{0}_{z}$ ), where $\mathcal{H}^{0}_{x}\in\mathbb{R}^{N_{x}\times D}$ and $\mathcal{H}^{0}_{z}\in\mathbb{R}^{N_{z}\times D}$ are visual search region tokens and visual template tokens, and language tokens $\mathcal{H}^{0}_{t}\in\mathbb{R}^{N_{t}\times D}$ , where $N_{t}$ is the number of language tokens. Following [18], a special classification token ([CLS]) is attached at the beginning of the language tokens. Then, $\mathcal{H}^{0}_{x}$ , $\mathcal{H}^{0}_{z}$ , and $\mathcal{H}^{0}_{t}$ are aligned with the multi-modal alignment module (see section III-C) in the embedding space. It is worth noting that the well aligned vision embeddings and language embeddings can facilitate multi-modal representation learning and interaction [30]. Here, we still refer to the aligned vision embeddings and language embeddings as $\mathcal{H}^{0}_{x}$ , $\mathcal{H}^{0}_{z}$ , and $\mathcal{H}^{0}_{t}$ , respectively. Afterwards, we perform a modal mixup operation [5] between the aligned vision embeddings and language embeddings as follows:

\mathbf{F}^{0}_{x}=\mathcal{H}^{0}_{x}\odot Linear(\mathcal{H}^{0}_{t})+\mathcal{H}^{0}_{x},

(3)

\mathbf{F}^{0}_{z}=\mathcal{H}^{0}_{z}\odot Linear(\mathcal{H}^{0}_{t})+\mathcal{H}^{0}_{z},

(4)

where $\odot$ represents the Hadamard product, $Linear(\cdot)$ is a linear projection layer. In this way, the language information is injected into vision embeddings via the modal mixup operation. Moreover, Eqs. (3)-(4) also construct a bidirectional information ﬂow between vision and language modalities that allows mutual guidance for multi-modal feature extraction and interaction. By establishing a bidirectional information flow between well aligned visual and language signals as early as possible via a unified transformer backbone, we can avoid the loss of discriminative information and thus make the extracted features highly target-aware [10].

Formally, the operations of the $l$ -th encoder of our All-in-One transformer backbone can be expressed as:

\mathbf{Q}=\mathbf{K}=\mathbf{V}=[\mathbf{F}^{l}_{x};\mathbf{F}^{l}_{z}],

(5)

[\mathbf{F}^{\prime l}_{x};\mathbf{F}^{\prime l}_{z}]={\rm LN}([\mathbf{F}^{l}_{x};\mathbf{F}^{l}_{z}]+{\rm MHSA}(\mathbf{Q},\mathbf{K},\mathbf{V})),

(6)

[\mathbf{F}^{l+1}_{x};\mathbf{F}^{l+1}_{z}]={\rm LN}([\mathbf{F}^{\prime l}_{x};\mathbf{F}^{\prime l}_{z}]+{\rm FFN}([\mathbf{F}^{\prime l}_{x};\mathbf{F}^{\prime l}_{z}])),

(7)

where $\mathbf{Q}$ , $\mathbf{K}$ , and $\mathbf{V}$ represent the query, key, and value embeddings, $[;]$ denotes the concatenation operation, $\mathbf{F}^{l}_{x}$ and $\mathbf{F}^{l}_{z}$ are the input embeddings of the $l$ -th transformer encoder. Therefore, the language information-injected vision embeddings are jointly processed by the transformer encoder, enabling seamless multi-modal feature extraction and integration. Finally, the visual search region embeddings of the last layer of the transformer encoder are reshaped into a 2D feature map. The feature map is fed into the tracking head to predict the location of the target.

To model the interaction between language and vision features, recent VL trackers [9, 5] adopted a customized fusion model to directly serialize vision and language embeddings into sequences to learn a joint multi-modal embedding. Although, our All-in-One transformer backbone using the pretrained ViT [19] has the ability to model long-range dependencies of sequential data, alleviating the negative effects of modal differences for multi-modal learning, the vision embeddings and language embeddings lying in different feature spaces is still challenging for the transformer encoder to learn their interactions [37, 38]. To tackle this limitation, we further propose a self-supervised MMA module, which is used before feature extraction and integration. The MMA module includes CMA and IMA, which can efficiently learn more reasonable feature distributions as shown in Fig. 3.

III-C Multi-Modal Alignment Module

Cross-modal Alignment. Since the vision and language embeddings from the same video are distributed in different feature spaces, a natural thought is to enforce them close in the feature space to reduce the difficulty for multi-modal interaction. Absorb this in mind, we introduce the CMA to pull the matched vision and language embeddings closer in feature space, while pushing away mismatched pairs. Actually, the goal of CMA is to maximize the mutual information (MI) [15] between vision and language that are matched, which contain the same semantics. Fig. 3 presents an example, the high-level language embedding (i.e., green star) and sparse vision embedding (i.e., yellow star) from the same video are pulled closer in the feature space. Specifically, visual search region tokens $\mathcal{H}^{0}_{x}$ , visual template tokens $\mathcal{H}^{0}_{z}$ and language tokens $\mathcal{H}^{0}_{t}$ are projected into the same dimension through three linear projections, which we denote as $\mathbf{f}_{x}\in\mathbb{R}^{C}$ , $\mathbf{f}_{z}\in\mathbb{R}^{C}$ , and $\mathbf{f}_{t}\in\mathbb{R}^{C}$ , respectively, where $C=256$ . To maximize the MI of vision and language tokens, we optimize the InfoNCE loss [15] between vision and language, denoting the lower bound of their MI. Formally, InfoNCE losses of vision-to-language are defined as:

\mathcal{L}_{x2t}(\mathbf{f}^{i}_{x},\mathbf{f}^{i}_{t},\widetilde{\mathbf{f}}_{t})=-\mathbb{E}_{(\mathbf{f}_{x},\mathbf{f}_{t})}[\log\frac{\exp(sim(\mathbf{f}^{i}_{x},\mathbf{f}^{i}_{t})/\tau)}{\sum_{j=1}^{N-1}\exp(sim(\mathbf{f}^{i}_{x},\widetilde{\mathbf{f}}^{j}_{t})/\tau)}],

(8)

\mathcal{L}_{z2t}(\mathbf{f}^{i}_{z},\mathbf{f}^{i}_{t},\widetilde{\mathbf{f}}_{t})=-\mathbb{E}_{(\mathbf{f}_{z},\mathbf{f}_{t})}[\log\frac{\exp(sim(\mathbf{f}^{i}_{z},\mathbf{f}^{i}_{t})/\tau)}{\sum_{j=1}^{N-1}\exp(sim(\mathbf{f}^{i}_{z},\widetilde{\mathbf{f}}^{j}_{t})/\tau)}],

(9)

where $\mathbf{f}^{i}_{x}$ , $\mathbf{f}^{i}_{z}$ , and $\mathbf{f}^{i}_{t}$ are two vision tokens and language tokens of the same video, respectively, $\widetilde{\mathbf{f}}_{t}=\{\widetilde{\mathbf{f}}^{1}_{t},...,\widetilde{\mathbf{f}}^{N-1}_{t}\}$ is a set of negative language examples for $\mathbf{f}^{i}_{x}$ or $\mathbf{f}^{i}_{z}$ , $N$ is the batch size, $sim(\mathbf{f}^{i}_{x},\mathbf{f}^{i}_{t})=\mathbf{f}^{i}_{x}\cdot\mathbf{f}^{i}_{t}/(||\mathbf{f}^{i}_{x}||||\mathbf{f}^{i}_{t}||)$ , $\tau$ is a temperature parameter. The InfoNCE losses, i.e., $\mathcal{L}_{t2x}(\mathbf{f}^{i}_{t},\mathbf{f}^{i}_{x},\widetilde{\mathbf{f}}_{x})$ and $\mathcal{L}_{t2z}(\mathbf{f}^{i}_{t},\mathbf{f}^{i}_{z},\widetilde{\mathbf{f}}_{z})$ , of language-to-vision can be calculated similarly. Hence, the CMA loss can be formulated as $\mathcal{L}_{cma}=\frac{1}{2}[\mathcal{L}_{x2t}(\cdot)+\mathcal{L}_{z2t}(\cdot)]+\frac{1}{2}[\mathcal{L}_{t2z}(\cdot)+\mathcal{L}_{t2x}(\cdot)]$ .

Intuitively, by optimizing the CMA loss, vision and language embeddings can be well aligned in the feature space as in Fig. 3. However, the CMA ignores the significant intra-modal supervisory signals (i.e., visual template and visual search region) for learning desired multi-modal features. Aligning the visual template with the visual search region enables learning temporal-invariant features [39, 40, 41], which are crucial to enhance the discriminative ability of tracking models. To this end, we further propose the IMA to fully utilize the intra-modal temporal supervision information.

Intra-modal Alignment. The language prompt mainly contains global/static semantic meaning of the target, while the visual modality contains the temporal information of the target (e.g., motion, and appearance variation through the video) [1, 5]. As mentioned earlier, IMA aims to learn temporal-invariant features within the same modality of positive and negative samples. Therefore, we only consider visual modality in IMA. Specifically, we consider visual search region tokens $\mathbf{f}_{x}\in\mathbb{R}^{C}$ , and visual template tokens $\mathbf{f}_{z}\in\mathbb{R}^{C}$ from the same video as positive pairs, while tokens from different videos as negative pairs. We also apply the contrastive loss to maximize the MI between $\mathbf{f}_{x}$ and $\mathbf{f}_{z}$ . Formally, InfoNCE losses between vision tokens can be defined as:

\mathcal{L}_{x2z}(\mathbf{f}^{i}_{x},\mathbf{f}^{i}_{z},\widetilde{\mathbf{f}})\!=\!-\mathbb{E}_{(\mathbf{f}_{x},\mathbf{f}_{z})}[\log\frac{\exp(sim(\mathbf{f}^{i}_{x},\mathbf{f}^{i}_{z})/\tau)}{\sum_{j=1}^{2(N-1)}\exp(sim(\mathbf{f}^{i}_{x},\widetilde{\mathbf{f}}^{j})/\tau)}],

(10)

\mathcal{L}_{z2x}(\mathbf{f}^{i}_{z},\mathbf{f}^{i}_{x},\widetilde{\mathbf{f}})\!=\!-\mathbb{E}_{(\mathbf{f}_{z},\mathbf{f}_{x})}[\log\frac{\exp(sim(\mathbf{f}^{i}_{z},\mathbf{f}^{i}_{x})/\tau)}{\sum_{j=1}^{2(N-1)}\exp(sim(\mathbf{f}^{i}_{z},\widetilde{\mathbf{f}}^{j})/\tau)}],

(11)

where $\mathbf{\widetilde{f}}=\{\widetilde{\mathbf{f}}^{1}_{x},...,\widetilde{\mathbf{f}}^{N-1}_{x},\widetilde{\mathbf{f}}^{1}_{z},...,\widetilde{\mathbf{f}}^{N-1}_{z}\}$ is a set of negative examples for $\mathbf{f}^{i}_{x}$ or $\mathbf{f}^{i}_{z}$ , $N$ is the batch size. Then, the IMA loss can be formulated as $\mathcal{L}_{ima}=\frac{1}{2}[\mathcal{L}_{x2z}(\cdot)+\mathcal{L}_{z2x}(\cdot)]$ .

The IMA loss encourages learning representations by aligning temporal-invariant positive pairs within visual modality. Importantly, it enforces the uniformity of vision and language, resulting in a uniform distribution across the whole feature space [42, 38]. Therefore, CMA and IMA have complementary advantages in multi-modal learning: on the one hand, CMA encourages matched vision and language embeddings close in the feature space. On the other hand, IMA maximizes the temporal-invariant features between visual tokens, and make the multi-modal features evenly distributed in the feature space. As shown in Fig. 3, combining them makes the learned representations more reasonable, and further facilitates joint multi-modal feature learning and interaction.

III-D Tracking Head and Loss

Following [10], the tracking head is decomposed into two branches of classification and bounding box regression. As shown in Fig. 2, the learned visual search region tokens are first reshaped into a 2D feature map according to the original spatial resolution, followed by a 4-layer fully convolutional network to predict the target classification score map. In the classification branch, a weighted focal loss $\mathcal{L}_{cls}$ [43], is adopted to enhance the model’s ability to distinguish objects from background. The bounding box regression branch is used to predict the center coordinate offset of the object and the size of the object. To regress the center coordinate offset and size of objects, we combine the $\ell_{1}$ loss and the generalized IoU loss $\mathcal{L}_{giou}$ [44]. The regression loss is calculated as $\mathcal{L}_{reg}=\lambda_{giou}\mathcal{L}_{giou}+\lambda_{{1}}\mathcal{L}_{1}$ , where $\lambda_{giou}$ and $\lambda_{{1}}$ are two hyper-parameters.

To train our model in an end-to-end manner, we convert it into a multi-task optimization problem [45], simultaneously optimizing classification loss, regression loss, CMA loss, and IMA loss. Finally, the overall loss function for our model is defined as:

\mathcal{L}_{total}=\mathcal{L}_{cls}+\mathcal{L}_{reg}+\lambda_{cma}\mathcal{L}_{cma}+\lambda_{ima}\mathcal{L}_{ima},

(12)

where $\lambda_{cma}$ and $\lambda_{ima}$ are trade-off weights to balance the multi-task optimization problem.

IV Experiments

To demonstrate the effectiveness and generalization ability of our approach, we conduct experiments on all five public VL tracking benchmarks to date, including UAV scenes (i.e., WebUAV-3M [1]), generic scenes (i.e., LaSOT [36], LaSOT_Ext [46], OTB99-L [3]), and real-synthetic scenes (i.e., TNL2K [2]).

IV-A Implementation Details

We adopt ViT-Base [19] as the architecture of All-in-One transformer backbone. It is stacked by $L$ (i.e., 12) transformer encoder layers, and each layer contains two sub-layers, i.e., a multi-head self-attention layer and a feed-forward network. Each sub-layer is a residual connection structure, followed by a layer normalization. To accelerate convergence, we initialize our backbone with MAE-pretrained weights [12]. The visual template and visual search region are $2^{2}$ times and $4^{2}$ times of the target bounding box, and then resized to $128\times 128$ and $256\times 256$ , respectively. We use bert-base-uncased tokenizer [18] to tokenize language prompts.

Our experiments are conducted on an Ubuntu server with two NVIDIA RTX 3090 GPUs. The training data includes training splits of LaSOT [36], GOT-10k [6], TrackingNet [47], COCO [48], OTB99-L [3], TNL2K [2], WebUAV-3M [1], and VisualGenome [49]. For several datasets (i.e., [6, 47, 48]) without natural language prompts, we use class names as pseudo language labels similar to [5]. The tracker is optimized using an AadmW optimizer [50] with initial learning rate $4\times 10^{-4}$ . The total epoch is 300. The weight decay factor is $1\times 10^{-4}$ after 240 epochs. Following [10], the hyper-parameters $\lambda_{{giou}}$ and $\lambda_{1}$ are set to 2 and 5. While $\lambda_{cma}$ and $\lambda_{ima}$ are set to 1 and 1 without parameter optimization. We set the temperature parameter $\tau=0.5$ . The batch size $N$ is 32. Following [7, 1], we adopt the one-pass evaluation with five metrics, i.e., precision ( $P$ ), normalized precision ( $P_{norm}$ ), success rate (AUC), complete success rate (cAUC), and accuracy (ACC) to measure the tracking performance.

TABLE I: Ablation study of our approach on the LaSOT dataset. We retrain [10] as our baseline. “AOT-[CLS]” and “AOT-[Mean]” represent All-in-One transformer (AOT) using [CLS] token and mean token of language prompt, respectively.

Method	AUC (%)	$P_{norm}$ (%)	$P$ (%)
Baseline	62.2	70.5	66.3
Baseline w/ AOT-[CLS]	63.9	72.4	67.9
Baseline w/ AOT-[Mean]	64.0	72.6	68.4
Baseline w/ AOT-[Mean] and CMA	64.1	72.7	68.6
Baseline w/ AOT-[Mean] and MMA	64.4	72.8	68.8

IV-B Ablation Study and Analysis

We first conduct ablation experiments trained on the LaSOT training set and evaluated on the LaSOT test set to validate diﬀerent components of our approach.

Impact of All-in-One Transformer (AOT). To analysis the impact of AOT, we train two trackers, i.e., AOT-[CLS] and AOT-[Mean] using [CLS] token and mean token of language prompt in AOT. From Tab. I, we can find that the AOT obviously boosts the tracking performance. Specifically, the $P$ scores are improved by $1.6\%$ (from $66.3\%$ to $67.9\%$ ) and $2.1\%$ (from $66.3\%$ to $68.4\%$ ), respectively compared with the baseline method. Importantly, using the mean token is slightly better than the [CLS] token. We speculate that the possible reason is that the mean token can provide more semantic information about the target compared with the [CLS] token. Therefore, the mean token is as our default setting in AOT.

Impact of Cross-modal Alignment (CMA). From Tab. I, we can see the CMA component improves tracking performance by $0.1\%$ , $0.1\%$ , and $0.2\%$ in terms of AUC, $P_{norm}$ , and $P$ scores, respectively. This validates the CMA is beneficial to align vision and language embeddings in the feature space and improve tracking accuracy.

Impact of Intra-modal Alignment (IMA). By combining the CMA and IMA, we improve the tracking AUC by $0.4\%$ (from $64.0\%$ to $64.4\%$ ), $P_{norm}$ by $0.2\%$ (from $72.6\%$ to $72.8\%$ ), and $P$ by $0.4\%$ (from $68.4\%$ to $68.8\%$ ), as shown in Tab. I. The significant performance gains demonstrate that the MMA module makes the distributions of vision and language embeddings more reasonable in the feature space, and facilitates feature learning and interaction.

TABLE II: Impact of language prompts (i.e., sentence and class). “Ours-S” and “Ours-C” denote our approaches (w/ AOT and MMA) training with sentence and class prompts, respectively.

Method	Training	Test	AUC (%)	$P_{norm}$ (%)	$P$ (%)
Ours-S	Sentence	Class	63.4	72.0	67.8
Ours-S	Sentence	Sentence	64.4	72.8	68.8
Ours-C	Class	Sentence	63.2	71.9	67.4
Ours-C	Class	Class	64.6	73.2	68.9

Visualization. To further investigate the effectiveness of our All-in-One framework, we visualize the response maps and the tracking results in Fig. 4. With the AOT, our approach highlights the target region due to language prompt, even with complex background distractions. Combining AOT and MMA, our approach has a more unambiguous and discriminative response, and predict a more precise bounding box. Previous visual search regions also demonstrate that our approach can focus on real target when facing some complex scenarios, such as occlusion and background clutter.

TABLE III: Comparison of our approach with SOTA trackers on the WebUAV-3M test set using ACC score.

Method	ACC (%)	Method	ACC (%)
SiamFC [51]	34.9	SiamBAN [52]	44.9
Ocean [53]	37.0	VLT_SCAR [5]	45.3
DiMP [54]	37.5	AutoMatch [55]	46.1
TrDiMP [41]	41.2	TransT [35]	46.6
SiamCAR [56]	42.2	VLT_TT [5]	47.5
SiamRPN++ [57]	44.2	All-in-One (Ours)	57.6

Sentence Prompts vs. Class Prompts. To analysis the impact of language prompts, we train two trackers, i.e., Ours-S and Ours-C with sentence (original language prompt) and class (class name of video) prompts on the LaSOT training set. From Tab. II, we have some observations upon inspection. First, better tracking results are achieved when the language prompts for training and testing are consistent, i.e., training using sentences/classes and testing using sentences/classes. Second, the best performance ( $64.6\%$ in AUC, $73.2\%$ in $P_{norm}$ , $68.9\%$ in $P$ ) is obtained using class prompts training and class prompts testing on the LaSOT dataset. We speculate that trackers are sensitive to ambiguous language prompts. Compared with sentence prompts, class prompts may bring about less ambiguity for both training and evaluation [9, 1]. Additionally, as shown in Fig. 5, given ambiguous language prompts our trackers fail to localize the real object. A potential solution is to provide clear sentence prompts or clear class prompts (see Fig. 5), both of which enable our trackers to accurately localize the real object.

Speed Analysis. Real-time tracking is urgently demanded in many practical applications [58, 2]. Our one-stream framework achieves joint multi-modal feature extraction and interaction, and has great efficiency. Tab. IV shows that the average inference speed of our approach is around 60 frames per second (FPS) without model acceleration. The speed of our approach is obviously faster than that of many (state-of-the-art) SOTA real-time trackers [57, 55] and common video flow [59], demonstrating that applying it to real-world applications is imperceptible in terms of time consumption.

TABLE IV: Performance comparison on four benchmarks, including LaSOT, LaSOT_Ext, OTB99-L, and TNL2K. We compare All-in-One with convolutional neural network (CNN)-based, CNN-VL, transformer (Trans)-based, and Trans-VL methods.

Type	Method	LaSOT			LaSOT_Ext		OTB99-L		TNL2K		Speed (FPS)
Type	Method	AUC (%)	$P$ (%)	$P_{norm}$ (%)	AUC (%)	$P$ (%)	AUC (%)	$P$ (%)	AUC (%)	$P$ (%)	Speed (FPS)
CNN-based	SiamRCNN [60]	64.8	68.4	72.2	-	-	70.0	89.4	52.3	52.8	4.7
	PrDiMP [61]	59.8	60.8	68.4	-	-	69.5	89.5	47.0	45.9	30.0
	AutoMatch [55]	58.3	59.9	-	37.6	43.0	71.6	93.2	47.2	43.5	50.0
	Ocean [53]	56.0	56.6	65.1	-	-	68.0	92.1	38.4	37.7	58.0
	ATOM [62]	51.5	50.5	57.6	37.6	43.0	67.6	82.4	40.1	39.2	30.0
	SiamRPN++ [57]	49.6	49.1	56.9	34.0	39.6	63.8	82.6	41.3	41.2	35.0
	GlobalTrack [63]	51.7	52.8	59.7	35.6	41.1	-	-	40.5	38.6	6.0
	SiamFC [51]	33.6	33.9	42.0	23.0	26.9	58.7	79.2	29.5	28.6	86.0
	SiamCAR [56]	50.7	51.0	60.0	33.9	41.0	68.8	89.1	35.3	38.4	52.3
CNN-VL	SNLT [4]	54.0	57.6	63.6	26.2	30.0	66.6	80.4	27.6	41.9	50.0
CNN-VL	VLT_SCAR [5]^∗	63.9	67.9	73.3	44.7	51.6	73.9	89.8	49.8	51.0	43.0
Trans-based	STARK-ST50 [40]	66.4	71.2	76.3	47.8	55.1	69.6	91.4	-	-	40.0
	TrDiMP [41]	63.9	66.3	-	-	-	70.5	92.5	-	-	26.0
	TransT [35]	64.9	69.0	73.8	44.8	52.5	70.8	91.2	50.7	51.7	50.0
	OSTrack [10]	69.1	75.2	78.7	47.4	53.3	70.6	92.1	54.3	56.3	105.4
Trans-VL	VLT_TT [5]^∗	67.3	72.1	77.6	48.4	55.9	76.4	93.1	53.1	53.3	35.0
Trans-VL	All-in-One (Ours)	71.7	78.5	82.4	54.5	66.0	71.0	93.0	55.3	57.2	60.0

*

For this tracker [5], results were obtained from four different models. The best result is reported on each dataset.

IV-C Evaluation on UAV Scenes

WebUAV-3M. WebUAV-3M [1] is the latest million-scale UAV tracking dataset with visual bounding box, language and audio annotations, which contains 4,500 videos and offers over 200 target categories. UAV tracking scenes are extremely challenging due to continuous viewpoints changes, motion blurs, low-resolutions, etc. As reported in Tab. III, All-in-One outperforms other visual trackers and VL trackers in tracking accuracy. Furthermore, our tracker improves $P$ /AUC/ $P_{norm}$ /cAUC by a large margin as shown in Fig. 6. Notably, with a simple and general unified model architecture, our tracker outperforms the most competitive VL tracker VLT_TT [5] by $7.7\%$ in $P$ , $9.5\%$ in $P_{norm}$ , $9.8\%$ in AUC, and $9.9\%$ in cAUC.

IV-D Evaluation on Generic Scenes

LaSOT. LaSOT [36] is a densely annotated large-scale VL tracking dataset that contains 1,120 videos for training and 280 long-term videos for evaluation. In this dataset, objects disappear and reappear frequently, making long-term tracking in generic scenes highly challenging. From Tab. IV, we can observe that our approach sets a new SOTA on LaSOT, which provides compelling evidence for long-term tracking and suggests that our approach is capable of recognizing objects in extremely long videos. Fig. 7 demonstrates that All-in-One outperforms other trackers on eight challenging attributes, i.e., background clutter and motion blur, illustration variation, low resolution, fast motion, full occlusion, deformation, and aspect ration change.

LaSOT_Ext. LaSOT_Ext [46] is the extended version of [36], which comprises 150 manually annotated videos. Tab. I indicates All-in-One surpasses all previous advanced trackers and obtains the best AUC score of $71.7\%$ , gaining a significant improvement of $2.6\%$ compared with the current SOTA one-stream tracker OSTrack [10].

OTB99-L. OTB99-L [3] is an early VL tracking dataset contains 51 videos for training and 48 videos for public evaluation. As shown in Tab. IV, compared with recent SOTA trackers, our tracker achieves comparable results, which validates the effectiveness of our tracker.

IV-E Evaluation on Real-Synthetic Scenes

TNL2K. TNL2K [2] is a recently released dataset, which comprises 1,300 videos for training and 700 videos for evaluation in real and synthetic (e.g., cartoon videos and virtual game videos) scenes with diverse challenging fctors, such as significant appearance variation and adversarial samples. Results in Tab. IV show that our approach achieves the highest AUC ( $55.3\%$ ) and $P$ (57.2%) scores, demonstrating the generalization ability of All-in-One.

IV-F Qualitative Performance

As shown in Fig. 8, we compare All-in-One with three SOTA trackers (i.e., VLT_TT [5], TransT [35], and SiamRPN++ [57]) on three videos from the LaSOT test set, in which the main challenges include similar distractors, severe viewpoint changes, background clutter, appearance variation, occlusion and extreme illumination. We can see that All-in-One is more robust than other methods. For instance, the prior most competitive VL tracker VLT_TT gradually loses the target as the target appearance varies in the video of Sepia-16 (the second video in Fig. 8). By contrast, All-in-One provides accurate and stable prediction results, demonstrating the effectiveness of our unified framework in complex environments.

V Conclusion and Discussion

Conclusion. In this work, we present All-in-One, a new unified framework for multi-modal VL tracking, which includes the All-in-One transformer and the multi-modal alignment module. The core insight is to is establish bidirectional information ﬂow between well aligned visual and language signals as early as possible via a unified transformer backbone. Besides, the multi-modal alignment module based on cross-modal and intra-modal contrastive objectives enables to learn more reasonable VL representations, which effectively facilitates joint multi-modal feature learning and interaction. Extensive experiments on multiple VL tracking benchmarks have demonstrated the effectiveness and generalization of our approach against state-of-the-art trackers.

Discussion. We first provide a discussion to demonstrate that developing a foundation model, e.g., All-in-One, for VL tracking is valuable in the era of large language/vision models. (1) As the echoes of large language models (e.g., ChatGPT [64], GPT-4 [65]) remarkable success continue to permeate the natural language community, its formidable successors, e.g., ViT-22B [66], have emerged in the computer vision community. Although they have emergent abilities [64], the huge training cost (e.g., thousands of GPUs) and terrible environment unfriendliness cannot be ignored [67]. Instead, we believe that training a foundation model for a specific task is more flexible and affordable for research purposes. (2) Despite the breakthroughs in large multi-modal models [68, 69, 70], they have not achieved the same success as large language models, highlighting the need to explore foundation models in the multi-modal domain. All-in-One is designed to be such a foundation model for multi-modal VL tracking. (3) The All-in-One has great potential to become a foundation model for multi-modal tracking because it can enable more accurate and efficient processing of multi-modal data, which fully utilizes both vision and language information. Our model not only learns all modalities in one backbone (All-in-One), but trains once and generalizes well to all VL tracking datasets (Once-for-All) with complex and user-defined language prompts. (4) Additionally, having a streamlined and standardized foundation model for multi-modal tracking can facilitate the development of more complex and specialized models in the future, allowing for even more sophisticated analysis and understanding of multi-modal data.

Our work still has the following two limitations. (1) Our approach is designed to localize objects of interest based on object boxes and language prompts. Inevitably, it suffers from inaccurate language prompts, such as ambiguous language descriptions, and the states (e.g., position and appearance) of objects change significantly in videos making them inconsistent with language prompts. (2) While All-in-One is a unified framework for multi-modal VL tracking, it currently focuses mainly on language prompts. Actually, All-in-One has great potential to be extended to leverage more types of prompts, such as audio, point, mask, and scribble prompts [71, 72]. We leave it for the future work.

References

[1] C. Zhang, G. Huang, L. Liu, S. Huang, Y. Yang, X. Wan, S. Ge, and D. Tao, “Webuav-3m: A benchmark for unveiling the power of million-scale deep uav tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[2] X. Wang, X. Shu, Z. Zhang, B. Jiang, Y. Wang, Y. Tian, and F. Wu, “Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13 763–13 773.
[3] Z. Li, R. Tao, E. Gavves, C. G. Snoek, and A. W. Smeulders, “Tracking by natural language specification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6495–6503.
[4] Q. Feng, V. Ablavsky, Q. Bai, and S. Sclaroff, “Siamese natural language tracker: Tracking by natural language descriptions with siamese trackers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5851–5860.
[5] M. Guo, Z. Zhang, H. Fan, and L. Jing, “Divert more attention to vision-language tracking,” arXiv preprint arXiv:2207.01076, 2022.
[6] L. Huang, X. Zhao, and K. Huang, “Got-10k: A large high-diversity benchmark for generic object tracking in the wild,” IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 5, pp. 1562–1577, 2019.
[7] Y. Wu, J. Lim, and M.-H. Yang, “Object tracking benchmark,” IEEE transactions on pattern analysis and machine intelligence, vol. 37, no. 9, pp. 1834–1848, 2015.
[8] D. Ma and X. Wu, “Capsule-based object tracking with natural language specification,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 1948–1956.
[9] L. Zhou, Z. Zhou, K. Mao, and Z. He, “Joint visual grounding and tracking with natural language specification,” arXiv preprint arXiv:2303.12027, 2023.
[10] B. Ye, H. Chang, B. Ma, S. Shan, and X. Chen, “Joint feature learning and relation modeling for tracking: A one-stream framework,” in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXII. Springer, 2022, pp. 341–357.
[11] B. Chen, P. Li, L. Bai, L. Qiao, Q. Shen, B. Li, W. Gan, W. Wu, and W. Ouyang, “Backbone is all your need: a simplified architecture for visual object tracking,” in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXII. Springer, 2022, pp. 375–392.
[12] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in IEEE CVPR, 2022, pp. 15 979–15 988.
[13] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
[14] L. Lin, H. Fan, Z. Zhang, Y. Xu, and H. Ling, “Swintrack: A simple and strong baseline for transformer tracking,” Advances in Neural Information Processing Systems, vol. 35, pp. 16 743–16 754, 2022.
[15] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
[16] Z. Yang, T. Kumar, T. Chen, J. Su, and J. Luo, “Grounding-tracking-integration,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 9, pp. 3433–3443, 2020.
[17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[18] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[19] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[20] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16. Springer, 2020, pp. 213–229.
[21] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” in International conference on machine learning. PMLR, 2021, pp. 10 347–10 357.
[22] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 568–578.
[23] C.-F. R. Chen, Q. Fan, and R. Panda, “Crossvit: Cross-attention multi-scale vision transformer for image classification,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 357–366.
[24] X. Wang, Q. Huang, A. Celikyilmaz, J. Gao, D. Shen, Y.-F. Wang, W. Y. Wang, and L. Zhang, “Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6629–6638.
[25] A. Salvador, E. Gundogdu, L. Bazzani, and M. Donoser, “Revamping cross-modal recipe retrieval with hierarchical transformers and self-supervised learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15 475–15 484.
[26] L. Zhang, H. Wu, Q. Chen, Y. Deng, J. Siebert, Z. Li, Y. Han, D. Kong, and Z. Cao, “Vldeformer: Vision-language decomposed transformer for fast cross-modal retrieval,” Knowledge-Based Systems, p. 109316, 2022.
[27] A. J. Wang, Y. Ge, R. Yan, Y. Ge, X. Lin, G. Cai, J. Wu, Y. Shan, X. Qie, and M. Z. Shou, “All in one: Exploring unified video-language pre-training,” arXiv preprint arXiv:2203.07303, 2022.
[28] W. Kim, B. Son, and I. Kim, “Vilt: Vision-and-language transformer without convolution or region supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 5583–5594.
[29] H. Akbari, L. Yuan, R. Qian, W.-H. Chuang, S.-F. Chang, Y. Cui, and B. Gong, “Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text,” Advances in Neural Information Processing Systems, vol. 34, pp. 24 206–24 221, 2021.
[30] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine learning: A survey and taxonomy,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 2, pp. 423–443, 2018.
[31] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang, “Visualbert: A simple and performant baseline for vision and language,” arXiv preprint arXiv:1908.03557, 2019.
[32] G. Li, N. Duan, Y. Fang, M. Gong, and D. Jiang, “Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 11 336–11 344.
[33] J. Lu, D. Batra, D. Parikh, and S. Lee, “Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks,” Advances in neural information processing systems, vol. 32, 2019.
[34] Y. Cui, C. Jiang, L. Wang, and G. Wu, “Mixformer: End-to-end tracking with iterative mixed attention,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13 608–13 618.
[35] X. Chen, B. Yan, J. Zhu, D. Wang, X. Yang, and H. Lu, “Transformer tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 8126–8135.
[36] H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y. Xu, C. Liao, and H. Ling, “Lasot: A high-quality benchmark for large-scale single object tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 5374–5383.
[37] J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi, “Align before fuse: Vision and language representation learning with momentum distillation,” Advances in neural information processing systems, vol. 34, pp. 9694–9705, 2021.
[38] J. Yang, J. Duan, S. Tran, Y. Xu, S. Chanda, L. Chen, B. Zeng, T. Chilimbi, and J. Huang, “Vision-language pre-training with triple contrastive learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15 671–15 680.
[39] F. Li, C. Tian, W. Zuo, L. Zhang, and M.-H. Yang, “Learning spatial-temporal regularized correlation filters for visual tracking,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4904–4913.
[40] B. Yan, H. Peng, J. Fu, D. Wang, and H. Lu, “Learning spatio-temporal transformer for visual tracking,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 448–10 457.
[41] N. Wang, W. Zhou, J. Wang, and H. Li, “Transformer meets tracker: Exploiting temporal context for robust visual tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1571–1580.
[42] T. Wang and P. Isola, “Understanding contrastive representation learning through alignment and uniformity on the hypersphere,” in International Conference on Machine Learning. PMLR, 2020, pp. 9929–9939.
[43] H. Law and J. Deng, “Cornernet: Detecting objects as paired keypoints,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 734–750.
[44] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 658–666.
[45] Y. Zhang and Q. Yang, “A survey on multi-task learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 34, no. 12, pp. 5586–5609, 2021.
[46] H. Fan, H. Bai, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, M. Huang, J. Liu, Y. Xu et al., “Lasot: A high-quality large-scale single object tracking benchmark,” International Journal of Computer Vision, vol. 129, no. 2, pp. 439–461, 2021.
[47] M. Muller, A. Bibi, S. Giancola, S. Alsubaihi, and B. Ghanem, “Trackingnet: A large-scale dataset and benchmark for object tracking in the wild,” in ECCV, 2018, pp. 300–317.
[48] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in ECCV, 2014, pp. 740–755.
[49] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma et al., “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” International journal of computer vision, vol. 123, no. 1, pp. 32–73, 2017.
[50] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
[51] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr, “Fully-convolutional siamese networks for object tracking,” in European conference on computer vision. Springer, 2016, pp. 850–865.
[52] Z. Chen, B. Zhong, G. Li, S. Zhang, and R. Ji, “Siamese box adaptive network for visual tracking,” in CVPR, 2020, pp. 6667–6676.
[53] Z. Zhang, H. Peng, J. Fu, B. Li, and W. Hu, “Ocean: Object-aware anchor-free tracking,” in European Conference on Computer Vision. Springer, 2020, pp. 771–787.
[54] G. Bhat, M. Danelljan, L. V. Gool, and R. Timofte, “Learning discriminative model prediction for tracking,” in IEEE International Conference on Computer Vision, 2019, pp. 6181–6190.
[55] Z. Zhang, Y. Liu, X. Wang, B. Li, and W. Hu, “Learn to match: Automatic matching network design for visual tracking,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13 339–13 348.
[56] D. Guo, J. Wang, Y. Cui, Z. Wang, and S. Chen, “Siamcar: Siamese fully convolutional classification and regression for visual tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 6269–6277.
[57] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan, “Siamrpn++: Evolution of siamese visual tracking with very deep networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4282–4291.
[58] C. Zhang, S. Ge, K. Zhang, and D. Zeng, “Accurate uav tracking with distance-injected overlap maximization,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 565–573.
[59] A. M. Tekalp, Digital video processing. Prentice Hall Press, 2015.
[60] P. Voigtlaender, J. Luiten, P. H. Torr, and B. Leibe, “Siam r-cnn: Visual tracking by re-detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 6578–6588.
[61] M. Danelljan, L. V. Gool, and R. Timofte, “Probabilistic regression for visual tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 7183–7192.
[62] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg, “Atom: Accurate tracking by overlap maximization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4660–4669.
[63] L. Huang, X. Zhao, and K. Huang, “Globaltrack: A simple and strong baseline for long-term tracking,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 11 037–11 044.
[64] OpenAI, “Chatgpt,” https://openai.com/blog/chatgpt/, 2023.
[65] ——, “Gpt-4,” https://cdn.openai.com/papers/gpt-4.pdf, 2023.
[66] M. Dehghani, J. Djolonga, B. Mustafa, P. Padlewski, J. Heek, J. Gilmer, A. Steiner, M. Caron, R. Geirhos, I. Alabdulmohsin et al., “Scaling vision transformers to 22 billion parameters,” arXiv preprint arXiv:2302.05442, 2023.
[67] A. Koubaa, “Gpt-4 vs. gpt-3.5: A concise showdown,” 2023.
[68] W. Wang, H. Bao, L. Dong, J. Bjorck, Z. Peng, Q. Liu, K. Aggarwal, O. K. Mohammed, S. Singhal, S. Som et al., “Image as a foreign language: Beit pretraining for all vision and vision-language tasks,” arXiv preprint arXiv:2208.10442, 2022.
[69] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 23 716–23 736, 2022.
[70] X. Chen, X. Wang, S. Changpinyo, A. Piergiovanni, P. Padlewski, D. Salz, S. Goodman, A. Grycner, B. Mustafa, L. Beyer et al., “Pali: A jointly-scaled multilingual language-image model,” arXiv preprint arXiv:2209.06794, 2022.
[71] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” arXiv preprint arXiv:2304.02643, 2023.
[72] X. Zou, J. Yang, H. Zhang, F. Li, L. Li, J. Gao, and Y. J. Lee, “Segment everything everywhere all at once,” arXiv preprint arXiv:2304.06718, 2023.