Learning Adaptive and View-Invariant Vision Transformer with Multi-Teacher Knowledge Distillation for Real-Time UAV Tracking
Abstract
Visual tracking has made significant strides due to the adoption of transformer-based models. However, most state-of-the-art trackers struggle to meet real-time processing demands on mobile platforms with constrained computing resources, particularly for real-time unmanned aerial vehicle (UAV) tracking. To achieve a better balance between performance and efficiency, we introduce AVTrack, an adaptive computation framework designed to selectively activate transformer blocks for real-time UAV tracking. The proposed Activation Module (AM) dynamically optimizes the ViT architecture by selectively engaging relevant components, thereby enhancing inference efficiency without significant compromise to tracking performance. Furthermore, to tackle the challenges posed by extreme changes in viewing angles often encountered in UAV tracking, the proposed method enhances the effectiveness of ViTs by learning view-invariant representations through mutual information (MI) maximization. Building on these two design principles, we propose an improved tracker, dubbed AVTrack-MD, which introduces a novel MI maximization-based multi-teacher knowledge distillation (MD) framework. It harnesses the benefits of multiple teachers, specifically the off-the-shelf tracking models from AVTrack, by integrating and refining their outputs to guide the learning of a compact student network. Specifically, we maximize the MI between the softened feature representations of the multi-teacher models and the student model, leading to improved generalization and performance of the student model, particularly in noisy conditions. Extensive experiments on multiple UAV tracking benchmarks demonstrate that AVTrack-MD not only achieves performance comparable to the AVTrack baseline but also reduces model complexity, resulting in a significant 17% increase in average tracking speed. Code is available at https://github.com/wuyou3474/AVTrack.
Keywords: UAV tracking, real-time, vision transformer, activation module, view-invariant representations, multi-teacher knowledge distillation.

1 Introduction
Unmanned Aerial Vehicle (UAV) tracking has become increasingly critical due to its diverse applications, including path planning (Lee and Hwang, 2015), public safety (Zhang et al., 2019), visual surveillance (Tang et al., 2017), environmental monitoring (Sharma et al., 2016), industrial inspection (Mourtzis et al., 2021), and agriculture (Singh and Sharma, 2022). Specifically, UAV tracking involves determining and predicting the location of a specific object in successive aerial images, which are captured by mobile cameras at relatively high altitudes. As a result, common challenging scenarios have emerged in the UAV tracking community, including fast object or UAV motion, extreme viewing angles, motion blur, low resolution, and significant occlusions where objects are obscured by other elements. Moreover, the efficiency of tracking algorithms is crucial, as mobile platforms like UAVs have limited computational resources (Li et al., 2023; Wang et al., 2023; Ma et al., 2023; Liu et al., 2022b; Li et al., 2020). To meet the unique demands of UAV applications, a practical tracker should ensure accurate tracking while running with relatively low computational and power consumption.
Neither discriminative correlation filter (DCF)-based trackers nor deep learning (DL)-based trackers achieve a satisfactory balance between performance and efficiency. DCF-based trackers offer higher efficiency but struggle with tracking accuracy and robustness, making them unsuitable for complex UAV tracking scenarios, while DL-based trackers are hindered by high computational costs and slow inference, limiting their practical applicability. The single-stream architecture has recently emerged as a popular strategy in DL-based trackers, seamlessly integrating feature extraction and fusion with pre-trained Vision Transformer (ViT) backbone networks. A number of representative methods, including OSTrack (Ye et al., 2022), SimTrack (Chen et al., 2022), Mixformer (Cui et al., 2022), DropMAE (Wu et al., 2023), ZoomTrack (Kou et al., 2023), AQATrack (Xie et al., 2024b), EVPTrack (Shi et al., 2024), and HIPTrack (Cai et al., 2024), demonstrate the immense potential of this paradigm shift for tracking tasks. Motivated by this, Aba-ViTrack (Li et al., 2023) proposes an effective and efficient DL-based tracker for real-time UAV tracking built on this framework, utilizing an adaptive and background-aware token computation technique to minimize inference time. However, the unstructured access operations required by a variable number of tokens result in significant time costs.
In this work, we alleviate the above problems by introducing an adaptive computation framework, built on the single-stream architecture driven by pre-trained transformer backbones, that selectively activates transformer blocks for real-time UAV tracking. Our goal is to enhance the efficiency of ViTs through more structured means. Specifically, we introduce an Activation Module (AM) into each transformer block. As shown in Fig. 2 (right), to improve efficiency by reducing computation, the AM takes only a slice of the tokens from both the target template and the search image as input and produces an activation probability that decides whether the corresponding transformer block should be activated. By adaptively trimming the ViT at the block level, the proposed method avoids unnecessary unstructured access operations, thereby reducing time consumption. The rationale behind this is that semantic features and relationships do not uniformly affect the tracking task across different levels of abstraction; in practice, their impact is closely tied to the characteristics of the target and the scene in which it appears. In simple scenes, such as when a target moves against a monochromatic background, efficient tracking can be achieved by exploiting the color contrast between the object and the background; this straightforward feature alone is sufficient. However, real-world scenarios often contain numerous distractors, such as background clutter, occlusion, similar objects, and changes in viewpoint. In these challenging scenarios, trackers need to capture and analyze sufficient semantic features and relationships to successfully track the specific object. These facts illustrate the dynamic nature of tracking needs, which are inextricably linked to the individual characteristics of the scene and the target being tracked. The proposed AM is implemented with just a linear layer followed by a nonlinear activation function, making it a simple and effective module. By tailoring the architecture of the ViT to the specific requirements of the tracking task, our approach has the potential to achieve a satisfactory balance between performance and efficiency for UAV tracking.
On the other hand, we introduce a novel approach to learning view-invariant feature representations by maximizing the mutual information between the backbone features extracted from two different views of the target, enhancing the tracker's robustness against viewpoint changes. Mutual information is a measure that quantifies the dependence between two variables (Steuer et al., 2002). Mutual information maximization enhances the shared information between different components or variables within a system and is widely used in various computer vision tasks (Liu et al., 2022c; Yang et al., 2022; Hjelm et al., 2018). However, to the best of our knowledge, the effectiveness of this strategy in UAV tracking has not been extensively explored. In our work, the tracker learns to preserve essential information about the target regardless of viewpoint by maximizing the mutual information between two different views of the target; we call the resulting representations view-invariant representations. In theory, models trained with these representations generalize better across diverse viewing conditions, making them more robust in real-world scenarios with frequent viewpoint changes. View-invariant representations are therefore valuable for effective tracking, since extreme changes in viewing angle are a common challenge in UAV tracking.
To enhance the single-stream paradigm for real-time UAV tracking, we conduct extensive experiments, highlight the two key design principles mentioned above, and introduce an improved version called AVTrack-MD. Specifically, we introduce a novel multi-teacher knowledge distillation (MD) framework based on MI maximization into AVTrack. In the proposed MD, three off-the-shelf tracking models from AVTrack serve as teachers, distilling knowledge into a more lightweight student model. The student model maintains the same structure as the teachers but features a smaller ViT backbone with half the number of ViT blocks. In practice, multiple teacher models offer comprehensive guidance by providing diverse knowledge, which is advantageous for training the student (Gou et al., 2021; Wang and Yoon, 2021). Multi-teacher knowledge distillation is an effective strategy widely used across various computer vision tasks, including image super-resolution (Jiang et al., 2024), image classification (Wen et al., 2024), and visual retrieval (Ma et al., 2024). However, to the best of our knowledge, its effectiveness in UAV tracking has not been extensively investigated. In our work, given that the mean squared error (MSE) is sensitive to noise and outliers (Hinton et al., 2015), we instead maximize the MI between the aggregated softened features of the multi-teacher models and the student model's softened features, leading to improved generalization and performance of the student model, particularly in noisy conditions. Extensive experimental results demonstrate that the proposed method produces a high-performing student network whose overall tracking performance is comparable to or even exceeds that of the teacher networks, while using significantly fewer parameters and achieving faster speeds (see Fig. 1 and Table 1).
This work extends our previous conference paper, AVTrack (Li et al., 2024b), accepted at ICML 2024. In this extended paper, we propose AVTrack-MD, an improved tracker that significantly enhances the efficiency of AVTrack while achieving overall tracking performance comparable or even superior to it. We also provide extensive experiments and detailed implementations. The new contributions of this extended work are summarized as follows:
1. We propose an improved tracker, dubbed AVTrack-MD, with more lightweight architectural adaptations for UAV tracking. It significantly enhances the efficiency of AVTrack while achieving comparable or even superior performance.
2. We conduct experiments on six UAV tracking benchmarks, providing more comparisons with recent state-of-the-art trackers, along with a comprehensive evaluation and analysis.
3. We present more in-depth design principles, discussions, and details.
2 Preliminaries
We begin by reviewing the background of UAV tracking, view-invariant feature representation, and multi-teacher knowledge distillation tasks.
2.1 UAV Tracking
There are two main types of modern UAV trackers: DCF-based and CNN-based trackers (Li et al., 2021; Liu et al., 2022a). DCF-based trackers are favored for UAV tracking due to their remarkable efficiency on CPU, but they face difficulties in maintaining robustness under complex conditions (Li et al., 2021, 2020; Huang et al., 2019). Current CNN-based trackers, like those in (Cao et al., 2021, 2022), show advancements in tracking precision and robustness for UAV tracking, but their efficiency lags behind DCF-based trackers considerably. Although model compression techniques, as seen in (Wang et al., 2022; Wu et al., 2022; Zhong et al., 2023), were utilized to enhance efficiency, these trackers still face challenges associated with unsatisfactory tracking precision. A recent trend in the visual tracking community shows an increasing preference for single-stream architectures that seamlessly integrate feature extraction and correlation using pre-trained ViT backbone networks (Xie et al., 2021; Cui et al., 2022; Ye et al., 2022; Xie et al., 2022, 2024b), demonstrating significant potential. Several representative methods, including OSTrack (Ye et al., 2022), Mixformer (Cui et al., 2022), DropMAE (Wu et al., 2023), ZoomTrack (Kou et al., 2023), AQATrack (Xie et al., 2024b), EVPTrack (Shi et al., 2024), etc., demonstrate the significant success of applying this paradigm to tracking tasks. Although these frameworks are efficient due to their compact nature, very few are based on lightweight ViTs, making them impractical for real-time UAV tracking. To address this, TATrack (Li et al., 2024a) proposes an efficient ViT-based tracking framework for real-time UAV tracking, which integrates feature learning and template-search coupling into an efficient one-stream ViT. Similarly, ETDMT (Zhang et al., 2024) builds on lightweight ViT to introduce a tracker that combines template distinction with temporal context, while Aba-ViTrack (Li et al., 2023) enhances efficiency in real-time UAV tracking using lightweight ViTs and an adaptive background-aware token computation method. However, the variable token number in (Li et al., 2023) necessitates unstructured access operations, leading to notable time costs. Recent research has focused on improving the efficiency of ViTs by balancing their representation capabilities with computational efficiency. Methods like lightweight ViTs, model compression, and hybrid designs (Wang et al., 2020; Zhang et al., 2022b; Mao et al., 2021; Chen et al., 2021b; Li et al., 2022c) have been explored, but they often sacrifice accuracy or require time-consuming fine-tuning. Recent developments in efficient ViTs with conditional computation focus on adaptive inference, dynamically adjusting computational load based on input complexity to accelerate model performance. For example, DynamicViT (Rao et al., 2021) introduces control gates to selectively process tokens, while A-ViT (Yin et al., 2022) employs an Adaptive Computation Time strategy to avoid auxiliary halting networks, achieving gains in efficiency, accuracy, and token prioritization. Aba-ViTrack (Li et al., 2023) effectively utilizes the latter, showcasing significant potential for real-time UAV tracking. In our work, we focus on improving the efficiency of ViTs for UAV tracking through more structured methods, specifically the adaptive activation of Transformer blocks for feature representation.
2.2 View-Invariant Feature Representation
View-invariant feature representation has garnered significant attention in the field of computer vision and image processing. This technique aims to extract features from images or visual data that remain consistent across various viewpoints or orientations, providing robustness to changes in the camera angle or scene configuration (Kumie et al., 2024a; Bracci et al., 2018; Li et al., 2017; Rao et al., 2002). Traditional methods typically rely on handcrafted features and geometric transformations to achieve view-invariant representations (Xia et al., 2012a; Ji and Liu, 2010; Rao et al., 2002). While effective, these traditional methods are often limited to specific scenarios with relatively simple and fixed backgrounds, making them unsuited for handling the complexity and variability of real-world visual data. Recently, many researchers have been eager to apply Convolutional Neural Networks (CNNs) and other deep architectures for extracting view-invariant representations, thanks to deep learning’s remarkable ability to extract discriminative features (Kumie et al., 2024b; Men et al., 2023; Pak et al., 2023; Gao et al., 2022; Shiraga et al., 2016). For instance, Kumie et al. (2024b) propose the Dual-Attention Network (DANet) for view-invariant action recognition, incorporating relation-aware spatiotemporal self-attention and cross-attention modules to learn representative and discriminative action features. Gao et al. (2021) propose a View Transformation Network (VTN) that realizes the view normalization by transforming arbitrary-view action samples to a base view to seek a view-invariant representation. These approaches leverage the capacity of deep models to capture complex patterns and variations in visual data. By ensuring that learned features are resilient to changes in viewpoint, these methods contribute to the robustness and generalization of vision-based systems. View-invariant feature representation is helpful for a variety of vision tasks, such as action recognition (Xia et al., 2012b; Liu et al., 2018; Kumie et al., 2024a), pose estimation (Bracci et al., 2018), human re-identification (Liu et al., 2023; Perwaiz et al., 2023), and object detection (Feng et al., 2022). However, to the best of our knowledge, the integration of view-invariant representations into visual tracking frameworks remains underexplored. In our work, we make the first attempt to learn view-invariant feature representations through mutual information maximization based on ViTs, specifically tailored for UAV tracking. This marks the first instance where ViTs are employed to acquire view-invariant feature representations in the context of UAV tracking.
2.3 Multi-Teacher Knowledge Distillation
Multi-teacher knowledge distillation (MD) leverages knowledge from multiple off-the-shelf pre-trained teacher models to guide the student, enhancing its generalization by inheriting diverse knowledge from various teachers (Gou et al., 2021; Wang and Yoon, 2021). It has primarily been studied in the contexts of image classification (You et al., 2017; Yuan et al., 2021; Lan et al., 2024; Wen et al., 2024), action recognition (Wu et al., 2019) and visual retrieval (Ma et al., 2024). For instance, Wen et al. (2024) proposed the multi-teacher distillation method for class incremental learning, using weight permutation, feature perturbation, and diversity regularization to ensure diverse mechanisms in teachers. To better leverage the complementary knowledge of each modality, Lan et al. (2024) propose a multi-teacher multi-modal knowledge distillation framework to guide the training of the multi-modal fusion network and further improve the multi-modal feature fusion process. Whiten-MTD (Ma et al., 2024) aligns the outputs of teacher models by whitening their similarity distributions and identifies the most effective fusion strategy for their multi-teacher distillation framework through empirical analysis. To the best of our knowledge, existing knowledge distillation methods for visual tracking rely on a single-teacher framework (Li et al., 2022a; Sun et al., 2023), limiting their ability to fully utilize the diverse knowledge from multiple teachers, potentially leading to suboptimal tracking performance. In this work, we propose a simple yet effective MI maximization-based multi-teacher knowledge distillation framework, integrated into AVTrack, to develop a more efficient UAV tracker. For the teacher models, we utilize all three off-the-shelf trackers from AVTrack (i.e., AVTrack-DeiT, AVTrack-ViT, and AVTrack-EVA), offering diverse and high-quality teachers without requiring additional operations, such as those in (Wen et al., 2024), to implement varied mechanisms in the teachers. To ensure the student model captures the most relevant information from the teacher models’ representations, we propose maximizing the MI between the averaged softened feature representation of the multi-teacher models and the student model’s softened feature representation. Averaging the predictions of all teacher models is a common and effective method, as it reduces biases and provides a more objective output than any single teacher’s prediction (Buciluǎ et al., 2006; Ba and Caruana, 2014; Ilichev et al., 2021; Song and Chai, 2018).

3 Method
In this section, we will start by introducing a brief overview of our AVTrack framework, as shown in Fig. 2. Then, we detail the two proposed components: (1) the Activation Module (AM) for dynamically activating transformer blocks based on inputs and (2) the method for learning view-invariant representations (VIR) via mutual information maximization. Additionally, we present the details of the improved tracker, AVTrack-MD, which is based on AVTrack and incorporates the MI maximization-based multi-teacher knowledge distillation framework (see Fig. 3). At the end of this section, we will provide a detailed introduction to the prediction head and training loss.
3.1 Overview
The proposed AVTrack introduces a novel single-stream tracking framework, featuring an adaptive ViT-based backbone and a prediction head. To improve the efficiency of the ViT, each transformer block in the ViT-based backbone, except for the first $K$ blocks, incorporates an activation module (AM). This module is trained to adaptively determine whether to activate the associated transformer block. The framework takes a pair of images as input, comprising a template denoted as $Z$ and a search image denoted as $X$. These images are split into patches of size $P \times P$, and the numbers of patches for $Z$ and $X$ are $N_z$ and $N_x$, respectively. The features extracted from the backbone are fed into the prediction head to generate tracking results. To obtain view-invariant representations with ViTs, we maximize the mutual information (MI) between the feature representations of two different views of the target, i.e., the template image and the target patch in the search image. During the training phase, since the ground-truth localization of the target in the search image is known, we can obtain the feature representation of the latter view from the representation of the search image using interpolation techniques.
Building on the proposed AVTrack, we introduce a simple yet effective MI maximization-based multi-teacher knowledge distillation (MD) framework to develop a more efficient UAV tracker, called AVTrack-MD. During the training of the AVTrack-MD model (i.e., the student model), the weights of the AVTrack models (i.e., the teacher models) remain fixed while both the teacher and student models receive the same inputs $Z$ and $X$. Let $\Phi_T$ and $\Phi_S$ represent the backbones of the teacher and student, respectively. In our implementation, $\Phi_S$ shares the same structure as $\Phi_T$ but uses a smaller ViT with fewer blocks. Feature-based knowledge distillation transfers knowledge from the teacher models' backbone features to the student model by maximizing the MI between the averaged softened features of the multi-teacher models and the student model's softened features.
The details of these design principles are elaborated in the subsequent subsections.
3.2 Activation Module (AM)
The Activation Module selectively activates transformer blocks based on the input, enabling adaptive adjustment of the ViT's architecture and ensuring that only the essential blocks are activated for effective tracking. To enhance efficiency, the AM processes only a slice of the tokens representing both the target template and the search image. Its output determines whether the current transformer block is activated; if not, the block is bypassed and the tokens from the preceding block are forwarded directly. Specifically, consider the $l$-th layer ($l > K$, i.e., a layer equipped with an AM). We denote the total number of tokens by $N$, the embedding dimension of each token by $D$, and all the tokens output by the $(l-1)$-th layer by $T_{l-1} \in \mathbb{R}^{N \times D}$. The slice of all tokens generated by the $(l-1)$-th transformer block is expressed as $S_{l-1} = T_{l-1}\,e \in \mathbb{R}^{N}$, where $e$ is a standard unit vector in $\mathbb{R}^{D}$, and the linear layer of the AM is denoted by $W_l \in \mathbb{R}^{1 \times N}$. Formally, the Activation Module (AM) at layer $l$ is expressed as:
$p_l = \sigma\left(W_l\, S_{l-1}\right)$    (1)
where $p_l$ represents the activation probability of the $l$-th transformer block and $\sigma$ indicates the sigmoid function. The sigmoid function is versatile and finds applications in various machine learning tasks where non-linear transformations or probabilistic outputs are required. In our work, the activation mechanism in each transformer block is determined by the sigmoid function, which introduces non-linearity into the activation module's decision-making process. Specifically, the sigmoid function maps the output of the linear layer to a value between 0 and 1, which represents the activation probability of the associated transformer block. If $p_l \ge \zeta$, where $\zeta$ is the activation probability threshold, the transformer block at layer $l$ is activated; otherwise, it is deactivated and the output tokens from the $(l-1)$-th layer are fed into the $(l+1)$-th block directly. Let $L$ denote the total number of transformer blocks in the given ViT. Theoretically, deactivating all blocks simultaneously would result in no correlation being computed between the template and search image. To avoid such unfavorable conditions, the first $K$ layers are mandated to remain activated. This strategy also helps alleviate the computational burden associated with the AM, as these initial layers are typically essential, providing foundational information on which higher-level and more abstract features and representations can be built. Another extreme case occurs when all transformer blocks are activated for any input, allowing the model to more easily minimize the classification and regression losses, as larger models have greater fitting capacity. To address this, we introduce a block sparsity loss, $\mathcal{L}_{sp}$, which penalizes a higher mean activation probability across all adaptive layers, encouraging the deactivation of many blocks on average to improve efficiency. The block sparsity loss is defined as follows:
$\mathcal{L}_{sp} = \lambda \left| \dfrac{1}{L-K} \sum_{l=K+1}^{L} p_l \;-\; \rho \right|$    (2)
where $\rho$ is a constant used together with $\lambda$ to control block sparsity. Generally, for a given $\lambda$, a smaller $\rho$ leads to a sparser model. When $\rho = 0$, $p_l$ can be viewed as the weight of block $l$, and the sparsity loss becomes proportional to the $\ell_1$ norm of the vector of these weights, which is a convex-relaxed sparsity regularization commonly used in statistical learning theory. $\lambda$ serves as a hyperparameter for finer adjustments.
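To make the gating mechanism concrete, the following PyTorch sketch wraps a transformer block with an AM built from a single linear layer and a sigmoid, and computes the block sparsity loss of Eq. (2). The choice of token slice, the soft scaling used to keep the gate trainable, and the default values of the threshold, $\rho$, and $\lambda$ are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn


class GatedBlock(nn.Module):
    """A ViT block wrapped with an Activation Module (AM): a single linear layer
    followed by a sigmoid predicts an activation probability from a slice of the
    incoming tokens (Eq. 1). Shapes, slice choice, and threshold are assumptions."""

    def __init__(self, block: nn.Module, num_tokens: int, threshold: float = 0.5):
        super().__init__()
        self.block = block                  # the original transformer block
        self.am = nn.Linear(num_tokens, 1)  # the AM's linear layer W_l
        self.threshold = threshold          # activation-probability threshold

    def forward(self, tokens: torch.Tensor):
        # tokens: (B, N, D); the slice takes one embedding dimension of every token,
        # i.e. S_{l-1} = T_{l-1} e with e a standard unit vector in R^D.
        token_slice = tokens[:, :, 0]                      # (B, N)
        p = torch.sigmoid(self.am(token_slice))            # (B, 1) activation probability
        if self.training:
            # soft gating during training so that gradients reach the AM (an assumption)
            out = self.block(tokens)
            gate = p.unsqueeze(-1)                         # (B, 1, 1)
            return gate * out + (1.0 - gate) * tokens, p
        if p.mean() >= self.threshold:                     # batch size is 1 at inference
            return self.block(tokens), p                   # block activated
        return tokens, p                                   # block deactivated: pass through


def block_sparsity_loss(probs, rho: float = 0.1, lam: float = 1.0):
    """L_sp of Eq. (2): penalize the mean activation probability of the adaptive
    blocks; rho and lam are hypothetical defaults."""
    p_mean = torch.stack(list(probs)).mean()
    return lam * (p_mean - rho).abs()
```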
3.3 View-Invariant Representations (VIR) via Mutual Information Maximization
We begin by introducing the concept of mutual information (MI) and establishing the relevant notation. Given two random variables $X$ and $Y$, the MI between them, denoted as $I(X;Y)$, is expressed as follows:
$I(X;Y) = D_{KL}\left(P_{XY} \,\|\, P_X \otimes P_Y\right)$    (3)
where $P_{XY}$ represents the joint probability distribution, while $P_X$ and $P_Y$ are the marginal distributions. The symbol $D_{KL}$ denotes the Kullback–Leibler divergence (KLD) (MacKay, 2004). Estimating MI is not an easy task in real-world situations, as we usually only have access to samples rather than the underlying distributions (Poole et al., 2019). As a result, most existing estimators approximate the MI between variables using observed samples. In contrast to these approaches, we employ the Deep InfoMax MI estimator (Hjelm et al., 2019), which estimates MI based on the Jensen–Shannon divergence (JSD). This strategy has proven effective, as knowing the precise value of MI is less critical than maximizing it in this context. The Jensen–Shannon MI estimator, represented by $\hat{I}^{(\mathrm{JSD})}_{\theta}$, is defined by:
$\hat{I}^{(\mathrm{JSD})}_{\theta}(X;Y) = \mathbb{E}_{P_{XY}}\left[-\mathrm{sp}\left(-T_{\theta}(x,y)\right)\right] - \mathbb{E}_{P_X \otimes P_Y}\left[\mathrm{sp}\left(T_{\theta}(x,y)\right)\right]$    (4)
where $T_{\theta}$ is a neural network parameterized by $\theta$, and $\mathrm{sp}(z) = \log(1 + e^{z})$ represents the softplus function.
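Below is a minimal sketch of the Jensen–Shannon MI estimator of Eq. (4). The statistics network $T_\theta$ is a small hypothetical MLP, and samples from the product of marginals are obtained by shuffling one variable within the batch, a common practice that is assumed here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JSDMIEstimator(nn.Module):
    """Jensen-Shannon MI estimator (Eq. 4): a statistics network T_theta scores
    (x, y) pairs; joint pairs come from the batch, marginal pairs from shuffling y."""

    def __init__(self, dim_x: int, dim_y: int, hidden: int = 256):
        super().__init__()
        self.t_net = nn.Sequential(
            nn.Linear(dim_x + dim_y, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x: (B, dim_x), y: (B, dim_y)
        t_joint = self.t_net(torch.cat([x, y], dim=-1))               # samples of P_XY
        y_shuffled = y[torch.randperm(y.size(0), device=y.device)]
        t_marginal = self.t_net(torch.cat([x, y_shuffled], dim=-1))   # samples of P_X x P_Y
        # I_hat = E_P[-sp(-T)] - E_{P x P}[sp(T)], with sp the softplus
        return (-F.softplus(-t_joint)).mean() - F.softplus(t_marginal).mean()
```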
In our work, the proposed approach learns view-invariant feature representations by maximizing MI, using the aforementioned Jensen–Shannon MI estimator, between the feature representations of two different views of a specific target. Let $F = [F_z; F_x]$ denote the final output tokens of the ViT, where $F_z$ and $F_x$ represent the tokens corresponding to the template and the search image, respectively. Given the ground-truth localization of the target in the search image, denoted as $B$, we obtain the tokens corresponding to $B$ through linear interpolation, represented by $F_b$. The specially designed loss for learning view-invariant feature representations is formulated as follows:
$\mathcal{L}_{vir} = -\hat{I}^{(\mathrm{JSD})}_{\theta}\left(F_z; F_b\right)$    (5)
During the inference phase, the VIR learning process is not involved; only the standard backbone forward pass over the template and search tokens is performed. Consequently, our method imposes no additional computational cost during inference. Additionally, the proposed view-invariant representation learning is ViT-agnostic, making it easily adaptable to other tracking frameworks.
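As an illustration of how the VIR loss of Eq. (5) could be assembled, the sketch below pools the template tokens, interpolates the search-image tokens inside the ground-truth box (using torchvision's roi_align as a stand-in for the linear interpolation), and negates the JSD estimator defined above; the pooling choices and coordinate conventions are assumptions. The `estimator` argument would be an instance of the `JSDMIEstimator` sketched earlier with `dim_x = dim_y = D`.

```python
import torch
from torchvision.ops import roi_align  # bilinear interpolation inside the GT box


def vir_loss(estimator, f_z, f_x_map, gt_boxes):
    """L_vir = -I_JSD(F_z, F_b) from Eq. (5).
    estimator: a JSDMIEstimator with dim_x = dim_y = D
    f_z:       (B, N_z, D) template tokens
    f_x_map:   (B, D, H, W) search-image tokens reshaped to a 2D map
    gt_boxes:  (B, 4) ground-truth boxes in feature-map coordinates (x1, y1, x2, y2)"""
    batch_idx = torch.arange(gt_boxes.size(0), device=gt_boxes.device,
                             dtype=gt_boxes.dtype).unsqueeze(1)
    rois = torch.cat([batch_idx, gt_boxes], dim=1)            # (B, 5) with batch index
    f_b = roi_align(f_x_map, rois, output_size=(4, 4))        # (B, D, 4, 4) target tokens
    f_b = f_b.flatten(2).mean(-1)                             # (B, D) pooled target feature
    f_z_pooled = f_z.mean(dim=1)                              # (B, D) pooled template feature
    return -estimator(f_z_pooled, f_b)                        # maximizing MI = minimizing loss
```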

3.4 MI Maximization-based Multi-Teacher Knowledge Distillation (MD)
We developed AVTrack-MD with the aim of achieving performance comparable to or superior to the AVTrack while reducing computational resources and memory usage, making it a more efficient UAV tracker. To achieve this, we introduce a novel MI maximization-based multi-teacher knowledge distillation framework into the AVTrack, as illustrated in Fig. 3. Our method focuses on feature-level knowledge distillation, meaning that the student model learns from an ensemble of teacher models by concentrating on the features the teacher models have learned.
In the context of multi-teacher knowledge distillation, the selection of the teacher-student architecture plays a crucial role, as it determines how knowledge from the various teacher models is transferred to the student model. For the teacher models, we use all three off-the-shelf trackers from AVTrack (i.e., AVTrack-DeiT, AVTrack-ViT, and AVTrack-EVA), providing diverse and high-quality teachers without the need for additional operations to implement varied mechanisms in the teachers. For the student model, we opt for a self-similar architecture, where the student shares the same structure as the teacher but uses a smaller ViT backbone (i.e., with half the ViT blocks). This choice offers two main advantages: 1) it eliminates the need for complex design processes, enabling straightforward implementation and training, and it enhances the interpretability of the model's behavior; 2) it promotes modularity and scalability, reducing the difficulty of expanding or modifying the model as needed. Averaging the predictions of all teacher models is a common and effective choice (Buciluǎ et al., 2006; Ba and Caruana, 2014; Ilichev et al., 2021; Song and Chai, 2018). In our implementation, we average the feature representations produced by the $M$ teacher models into an aggregated feature representation $\bar{F}^{T} = \frac{1}{M}\sum_{m=1}^{M} F^{T_m}$. In practice, to provide richer information during training, the model outputs are softened using a temperature $\tau$, since the original outputs are typically represented in a one-hot encoding format (You et al., 2017). Let $F^{S}$ represent the feature representation of the student. The softened outputs of the teacher and student are expressed as follows:
$\tilde{F}^{T} = \mathrm{softmax}\left(\bar{F}^{T} / \tau\right), \qquad \tilde{F}^{S} = \mathrm{softmax}\left(F^{S} / \tau\right)$    (6)
where $\tau$ is a constant that we set to 2. Encouraging the student network to replicate the feature representations of the teacher's final layer using a mean squared error (MSE) loss is straightforward (Ba and Caruana, 2014; Kim et al., 2021; Hamidi et al., 2024). However, since MSE is sensitive to noise and outliers (Hinton et al., 2015), we instead maximize the MI between the aggregated softened feature of the teacher models and the student model's softened feature, enhancing generalization and performance, especially in noisy conditions. Given this teacher-student architecture, we apply the Jensen–Shannon MI estimator to achieve MI maximization for multi-teacher knowledge distillation, which yields the objective function of our approach, expressed as:
$\mathcal{L}_{MD} = -\hat{I}^{(\mathrm{JSD})}_{\theta}\left(\tilde{F}^{T}; \tilde{F}^{S}\right)$    (7)
In distillation training, the student model is trained using a weighted sum of $\mathcal{L}_{MD}$ and the total loss function used for training the teacher models.
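The distillation objective of Eqs. (6)–(7) can be sketched as follows: the frozen teacher backbone features are averaged, both sides are softened with the temperature $\tau = 2$ (applied here through a softmax, our assumption of how the softening is realized), and the JSD estimator is maximized.

```python
import torch
import torch.nn.functional as F


def md_loss(estimator, teacher_feats, student_feat, tau: float = 2.0):
    """L_MD = -I_JSD(softened mean teacher feature, softened student feature).
    estimator:     a JSDMIEstimator over the (pooled) feature dimension
    teacher_feats: list of (B, D) features from the frozen teacher backbones
    student_feat:  (B, D) feature from the student backbone"""
    with torch.no_grad():
        f_teacher = torch.stack(teacher_feats, dim=0).mean(dim=0)  # aggregate the M teachers
        f_teacher = F.softmax(f_teacher / tau, dim=-1)             # softened teacher target
    f_student = F.softmax(student_feat / tau, dim=-1)              # softened student feature
    return -estimator(f_teacher, f_student)
```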
3.5 Prediction Head and Training Objective
Following the corner detection head in (Cui et al., 2022; Ye et al., 2022), we utilize a prediction head consisting of several Conv-BN-ReLU layers to directly estimate the target's bounding box. The output tokens associated with the search image are first reinterpreted into a 2D spatial feature map, which is then input to the prediction head. The head generates three outputs: a target classification score map $P$, a local offset map $O$, and a normalized bounding-box size map $S$. The position with the highest classification score is taken as the coarse target position, i.e., $(x_c, y_c) = \arg\max_{(x,y)} P(x, y)$, based on which the final target bounding box is determined by:
$\{x, y, w, h\} = \{x_c + O(0, x_c, y_c),\; y_c + O(1, x_c, y_c),\; S(0, x_c, y_c),\; S(1, x_c, y_c)\}$    (8)
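A sketch of decoding the final box from the three head outputs, following Eq. (8), is shown below. The tensor layouts (score map of shape (H, W), offset and size maps of shape (2, H, W)) and the normalization of the returned box are assumptions made for exposition.

```python
import torch


def decode_box(score_map: torch.Tensor, offset_map: torch.Tensor, size_map: torch.Tensor):
    """Pick the coarse position from the classification map and refine it with the
    predicted local offset and normalized size (Eq. 8).
    score_map: (H, W), offset_map: (2, H, W), size_map: (2, H, W)"""
    h, w = score_map.shape
    yc, xc = divmod(int(torch.argmax(score_map)), w)   # coarse target position (x_c, y_c)
    ox, oy = offset_map[0, yc, xc], offset_map[1, yc, xc]
    bw, bh = size_map[0, yc, xc], size_map[1, yc, xc]
    cx = (xc + ox) / w                                 # refined center, normalized to [0, 1]
    cy = (yc + oy) / h
    return torch.stack([cx, cy, bw, bh])               # (cx, cy, w, h)
```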
For the tracking task, we employ the weighted focal loss (Law and Deng, 2018) for classification and a combination of $\ell_1$ loss and GIoU loss (Rezatofighi et al., 2019) for bounding-box regression. The overall loss function is formulated as:
$\mathcal{L} = \mathcal{L}_{cls} + \lambda_{iou}\mathcal{L}_{iou} + \lambda_{\ell_1}\mathcal{L}_{\ell_1} + \mathcal{L}_{vir} + \mathcal{L}_{sp}$    (9)
where the constants $\lambda_{iou}$ and $\lambda_{\ell_1}$ are the same as in (Ye et al., 2022) and are set to 2 and 5, respectively. Our framework is trained end-to-end with the overall loss after the pretrained weights of the ViT for image classification are loaded. After this training, we apply the proposed MI maximization-based multi-teacher knowledge distillation framework to obtain a student model that better balances performance and efficiency. Specifically, during the distillation phase, the distillation loss is added to the overall loss from the previous training stage, yielding the total loss for end-to-end distillation training as follows:
$\mathcal{L}_{total} = \mathcal{L} + \alpha\,\mathcal{L}_{MD}$    (10)
where the weight coefficients in $\mathcal{L}$ remain the same as those used during teacher-model training, and the weight $\alpha$ of $\mathcal{L}_{MD}$ is set to the optimal value determined in the ablation study (Sec. 4.4).
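For completeness, the two training objectives can be composed as in the sketch below; the individual loss terms are assumed to be computed elsewhere, $\lambda_{iou}$ and $\lambda_{\ell_1}$ follow the values stated above, and `alpha` stands in for the distillation weight studied in Sec. 4.4.

```python
def teacher_stage_loss(l_cls, l_iou, l_l1, l_vir, l_sp,
                       lambda_iou: float = 2.0, lambda_l1: float = 5.0):
    """Overall loss of Eq. (9) used to train the AVTrack (teacher) models."""
    return l_cls + lambda_iou * l_iou + lambda_l1 * l_l1 + l_vir + l_sp


def distillation_stage_loss(overall_loss, l_md, alpha: float):
    """Total loss of Eq. (10): the teacher-stage objective plus the weighted
    MI maximization-based multi-teacher distillation loss."""
    return overall_loss + alpha * l_md
```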
DTB70 | UAVDT | VisDrone | UAV123 | UAV123@10fps | Avg. | Avg.FPS | |||||||||
Method | Prec. | Succ. | Prec. | Succ. | Prec. | Succ. | Prec. | Succ. | Prec. | Succ. | Prec. | Succ. | GPU | CPU | |
KCF | 46.8 | 28.0 | 57.1 | 29.0 | 68.5 | 41.3 | 52.3 | 33.1 | 40.6 | 26.5 | 53.1 | 31.6 | - | 622.5 | |
fDSST | 53.4 | 35.7 | 66.6 | 38.3 | 69.8 | 51.0 | 58.3 | 40.5 | 51.6 | 37.9 | 60.0 | 40.7 | - | 193.4 | |
ECO_HC | 63.5 | 44.8 | 69.4 | 41.6 | 80.8 | 58.1 | 71.0 | 49.6 | 64.0 | 46.8 | 69.7 | 48.2 | - | 83.5 | |
MCCT_H | 60.4 | 40.5 | 66.8 | 40.2 | 80.3 | 56.7 | 65.9 | 45.7 | 59.6 | 43.4 | 66.6 | 45.3 | - | 63.4 | |
STRCF | 64.9 | 43.7 | 62.9 | 41.1 | 77.8 | 56.7 | 68.1 | 48.1 | 62.7 | 45.7 | 67.3 | 47.1 | - | 28.4 | |
ARCF | 69.4 | 47.2 | 72.0 | 45.8 | 79.7 | 58.4 | 67.1 | 46.8 | 66.6 | 47.3 | 71.0 | 47.1 | - | 34.2 | |
AutoTrack | 71.6 | 47.8 | 71.8 | 45.0 | 78.8 | 57.3 | 68.9 | 47.2 | 67.1 | 47.7 | 71.6 | 49.0 | - | 57.8 | |
DCF-based | RACF | 72.6 | 50.5 | 77.3 | 49.4 | 83.4 | 60.0 | 70.2 | 47.7 | 69.4 | 48.6 | 74.6 | 51.2 | - | 35.6 |
HiFT | 80.2 | 59.4 | 65.2 | 47.5 | 71.9 | 52.6 | 78.7 | 59.0 | 74.9 | 57.0 | 74.2 | 55.1 | 160.3 | - | |
SiamAPN | 78.4 | 58.5 | 71.1 | 51.7 | 81.5 | 58.5 | 76.5 | 57.5 | 75.2 | 56.6 | 76.7 | 56.6 | 194.4 | - | |
P-SiamFC++ | 80.3 | 60.4 | 80.7 | 56.6 | 80.1 | 58.5 | 74.5 | 48.9 | 73.1 | 54.9 | 77.7 | 55.9 | 240.5 | 56.9 | |
TCTrack | 81.1 | 61.9 | 69.1 | 50.4 | 77.6 | 57.7 | 77.3 | 60.4 | 75.1 | 58.8 | 76.0 | 57.8 | 139.6 | - | |
TCTrack++ | 80.4 | 61.7 | 72.5 | 53.2 | 80.8 | 60.3 | 74.4 | 58.8 | 78.2 | 60.1 | 77.3 | 58.8 | 125.6 | - | |
SGDViT | 78.5 | 60.4 | 65.7 | 48.0 | 72.1 | 52.1 | 75.4 | 57.5 | 86.3 | 66.1 | 75.6 | 56.8 | 110.5 | - | |
ABDNet | 76.8 | 59.6 | 75.5 | 55.3 | 75.0 | 57.2 | 79.3 | 60.7 | 77.3 | 59.1 | 76.7 | 59.1 | 130.2 | - | |
DRCI | 81.4 | 61.8 | 84.0 | 59.0 | 83.4 | 60.0 | 76.7 | 59.7 | 73.6 | 55.2 | 79.8 | 59.1 | 281.3 | 62.4 | |
CNN-based | PRL-Track | 79.5 | 60.6 | 73.1 | 53.5 | 72.6 | 53.8 | 79.1 | 59.3 | 74.1 | 58.6 | 75.2 | 57.2 | 135.6 | - |
Aba-ViTrack | 85.9 | 66.4 | 83.4 | 59.9 | 86.1 | 65.3 | 86.4 | 66.4 | 85.0 | 65.5 | 85.3 | 64.7 | 181.5 | 50.3 | |
LiteTrack | 82.5 | 63.9 | 81.6 | 59.3 | 79.7 | 61.4 | 84.2 | 65.9 | 83.1 | 64.0 | 82.2 | 62.9 | 119.7 | - | |
SMAT | 81.9 | 63.8 | 80.8 | 58.7 | 82.5 | 63.4 | 81.8 | 64.6 | 80.4 | 63.5 | 81.5 | 62.8 | 124.2 | - | |
LightFC | 82.8 | 64.0 | 83.4 | 60.6 | 82.7 | 62.8 | 84.2 | 65.5 | 81.3 | 63.7 | 82.9 | 63.4 | 146.8 | - | |
SuperSBT | 84.5 | 65.4 | 81.5 | 60.3 | 80.4 | 62.2 | 85.0 | 67.2 | 82.1 | 65.3 | 82.7 | 64.1 | 121.7 | - | |
AVTrack-ViT | 81.3 | 63.3 | 79.9 | 57.7 | 86.4 | 65.9 | 84.0 | 66.2 | 83.2 | 65.7 | 82.9 | 63.1 | 250.2 | 58.7 | |
AVTrack-EVA | 82.6 | 64.0 | 78.8 | 57.2 | 84.4 | 63.5 | 83.0 | 64.7 | 81.2 | 63.5 | 82.0 | 62.6 | 283.7 | 62.8 | |
AVTrack-DeiT | 84.3 | 65.0 | 82.1 | 58.7 | 86.0 | 65.3 | 84.8 | 66.8 | 83.2 | 65.8 | 84.1 | 64.4 | 256.8 | 59.5 | |
AVTrack-MD-ViT | 84.9 | 65.7 | 81.4 | 59.5 | 84.8 | 63.7 | 82.3 | 65.1 | 83.5 | 65.9 | 83.4 | 64.0 | 303.1 | 63.8 |
AVTrack-MD-EVA | 83.2 | 63.9 | 80.8 | 58.0 | 84.0 | 63.5 | 81.5 | 62.3 | 82.7 | 64.7 | 82.4 | 62.5 | 334.4 | 67.1 |
ViT-based | AVTrack-MD-DeiT | 84.0 | 65.2 | 83.1 | 60.3 | 84.9 | 64.2 | 82.6 | 65.2 | 83.3 | 65.5 | 83.6 | 64.1 | 310.6 | 64.8
4 Experiments
In this section, we evaluate our trackers on six widely used public UAV tracking benchmarks, including DTB70 (Li and Yeung, 2017), UAVDT (Du et al., 2018), VisDrone2018 (Zhu et al., 2018), UAV123 (Mueller et al., 2016), UAV123@10fps (Mueller et al., 2016), and WebUAV-3M (Zhang et al., 2022a). Specifically, we conduct a thorough comparison with over 38 existing state-of-the-art (SOTA) trackers, using results obtained by running their official codes with the corresponding hyperparameters. Please note that our evaluation is performed on a PC equipped with an i9-10850K processor (3.6 GHz), 16 GB of RAM, and an NVIDIA TitanX GPU. To provide a clearer comparison with different types of trackers, we categorize them into two groups: lightweight trackers and deep trackers.
4.1 Implementation Details
Model. AVTrack utilizes three different lightweight ViTs as backbones to build three trackers for evaluation: AVTrack-ViT, AVTrack-DeiT, and AVTrack-EVA. Building on these, AVTrack-MD introduces the novel MI maximization-based multi-teacher knowledge distillation (MD) framework to develop three corresponding improved trackers: AVTrack-MD-ViT, AVTrack-MD-DeiT, and AVTrack-MD-EVA. The head and input sizes for AVTrack and AVTrack-MD are configured identically: the head consists of a stack of four Conv-BN-ReLU layers, and the search and template images are resized to fixed input resolutions before being fed into the network.
Training. For training, we employ a combination of training sets from four different datasets: GOT-10k (Huang et al., 2021), LaSOT (Fan et al., 2019), COCO (Lin et al., 2014), and TrackingNet (Muller et al., 2018). It is noteworthy that all the trackers utilize the same training pipeline to ensure consistency and comparability. The batch size is set to 32. We use the AdamW optimizer to train the model. The total number of training epochs is fixed at 300, with 60,000 image pairs processed per epoch, and the learning rate decreases by a factor of 10 after 240 epochs. During the multi-teacher distillation phase, we use all three pre-trained AVTrack-* models as teacher models. The parameters of the teacher models are frozen to guide the training of the student model with the proposed knowledge distillation approach, which follows the same training pipeline as that of the teacher models.
Inference. In line with common practice, a Hanning window penalty is applied during inference to integrate a positional prior into tracking (Zhang et al., 2020). Specifically, a Hanning window of the same size as the classification map is multiplied with it, and the location with the highest score is selected as the predicted result.
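The window penalty can be sketched in a few lines; combining the penalty with the classification map by element-wise multiplication follows common practice and is an assumption here.

```python
import torch


def apply_hanning_penalty(score_map: torch.Tensor) -> torch.Tensor:
    """Multiply the classification map by a 2D Hanning window so that locations far
    from the center of the search region are down-weighted before taking the argmax."""
    h, w = score_map.shape[-2:]
    window = torch.outer(torch.hann_window(h, periodic=False),
                         torch.hann_window(w, periodic=False)).to(score_map)
    return score_map * window
```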






4.2 Comparison with Lightweight Trackers
1) Overall Performance: We compare our methods with 22 SOTA lightweight trackers, including KCF (Henriques et al., 2015), fDSST (Danelljan et al., 2017), ECO_HC (Danelljan et al., 2017), MCCT_H (Wang et al., 2018), STRCF (Li et al., 2018), ARCF (Huang et al., 2019), AutoTrack (Li et al., 2020), RACF (Li et al., 2022b), HiFT (Cao et al., 2021), SiamAPN (Fu et al., 2021), P-SiamFC++ (Wang et al., 2022), TCTrack (Cao et al., 2022), TCTrack++ (Cao et al., 2023), SGDViT (Yao et al., 2023), ABDNet (Zuo et al., 2023), DRCI (Zeng et al., 2023), PRL-Track (Fu et al., 2024), Aba-ViTrack (Li et al., 2023), LiteTrack (Wei et al., 2024), SMAT (Gopal and Amer, 2024), LightFC (Li et al., 2024c), and SuperSBT (Xie et al., 2024a), across five renowned UAV tracking datasets: DTB70 (Li and Yeung, 2017), UAVDT (Du et al., 2018), VisDrone2018 (Zhu et al., 2018), UAV123 (Mueller et al., 2016), and UAV123@10fps (Mueller et al., 2016). The evaluation results are presented in Table 1. As shown in the table, our AVTrack-DeiT outperforms all SOTA trackers except Aba-ViTrack (Li et al., 2023) across these five benchmarks in terms of average (Avg.) precision (Prec.) and success rate (Succ.). Specifically, RACF (Li et al., 2022b) exhibits the best performance among DCF-based trackers, with Avg. Prec. and Succ. of 74.6% and 51.2%, respectively. Among CNN-based trackers, DRCI (Zeng et al., 2023) stands out with the highest Avg. Prec. and Succ. of 79.8% and 59.1%. However, even the best DCF- and CNN-based trackers fall short of the weakest ViT-based tracker. All ViT-based trackers achieve Avg. Prec. and Succ. exceeding 80.0% and 62.0%, respectively, clearly surpassing the CNN-based methods and significantly outperforming the DCF-based ones. Regarding GPU speed, the top three trackers are our proposed methods: AVTrack-MD-EVA, AVTrack-MD-DeiT, and AVTrack-MD-ViT, achieving tracking speeds of 334.4 FPS, 310.6 FPS, and 303.1 FPS, respectively. In terms of CPU speed, all our methods deliver real-time performance on a single CPU (note that the real-time performance discussed in this work applies only to platforms similar to or more advanced than ours). Obviously, DCF-based trackers are the most efficient UAV trackers, as all methods achieving speeds above 80 FPS belong to this category. However, these fastest DCF-based trackers are unable to achieve performance comparable to our methods. For example, the fastest tracker, KCF (Henriques et al., 2015), attains only 31.6% in average success rate, which is about half that of our methods. Additionally, DCF-based methods often incur substantial costs to enhance tracking precision. Specifically, RACF exhibits the best performance among DCF-based trackers; however, it runs at only 35.6 FPS, which is lower than the slowest CPU speed of our trackers. Although CNN-based trackers can achieve tracking speeds comparable to our ViT-based trackers, the latter significantly outperform the former in overall performance. Although Aba-ViTrack achieves the highest average precision of 85.3% and the highest average success rate of 64.7%, our AVTrack-DeiT secures second place with only slight gaps of 1.2% and 0.3%, respectively. Furthermore, our AVTrack-DeiT demonstrates higher speeds than Aba-ViTrack; notably, it runs up to 41%/18% faster in terms of GPU/CPU speed.
When the proposed MI maximization-based multi-teacher knowledge distillation framework (MD) is applied to AVTrack-*, the resulting student models, AVTrack-MD-*, show only small performance drops, with some even achieving performance improvements, all while delivering a significant speed boost. Specifically, the AVTrack-MD-* models exhibit varying trade-offs: AVTrack-MD-ViT achieves gains of 0.5% and 0.9% in Avg. Prec. and Succ., respectively, with 21.1% GPU and 8.7% CPU speed improvements; AVTrack-MD-EVA gains 0.4% in precision with only a minor 0.1% drop in success rate, while also improving GPU and CPU speeds by 17.9% and 6.8%; AVTrack-MD-DeiT has a slight performance drop of 0.5% in precision and 0.3% in success rate but still benefits from significant 20.9% GPU and 8.9% CPU speed gains. These results highlight the advantages of our method and justify its SOTA performance for UAV tracking.
Tracker | Prec. | Succ. | Params.(M) | FLOPs(G) | AGX.FPS |
AVTrack-DeiT | 70.0 | 56.4 | 3.5-7.9 | 0.97-2.4 | 42.3 |
AVTrack-MD-DeiT | 69.4 | 55.3 | 5.3 | 1.5 | 46.1
PRL-Track | 62.3 | 47.1 | 12.1 | 7.4 | 33.8 |
LiteTrack | 69.1 | 54.1 | 28.3 | 7.3 | 32.1 |
SMAT | 68.9 | 53.9 | 8.6 | 3.2 | 32.7 |
Aba-ViTrack | 70.4 | 55.3 | 8.0 | 2.4 | 37.3 |
ABDNet | 63.9 | 48.7 | 12.3 | 8.3 | 33.2 |
SGDViT | 61.3 | 45.7 | 23.3 | 11.3 | 31.7 |
TCTrack++ | 63.9 | 48.3 | 17.6 | 8.8 | 32.5 |
TCTrack | 61.9 | 45.7 | 8.5 | 6.9 | 34.4 |
SiamAPN | 62.5 | 45.1 | 14.5 | 7.9 | 35.2 |
HiFT | 60.9 | 45.5 | 9.9 | 7.2 | 35.6 |
2) Performance on WebUAV-3M (Zhang et al., 2022a): As illustrated in Table 2, we have conducted a comparison of our AVTrack-DeiT and AVTrack-MD-DeiT with ten lightweight SOTA trackers on WebUAV-3M. Our AVTrack-DeiT surpasses all other lightweight trackers in success rate, achieving a 1.1% improvement over the second-place Aba-ViTrack while maintaining a comparable precision with only a 0.4% difference. Notably, our AVTrack-MD-DeiT experiences less than a 1.0% drop in both precision and success rate, while improving AGX speed by 9.2%. These results further underscore the effectiveness of our approach.
3) Efficiency Comparison: To further demonstrate that our tracker achieves a better trade-off between accuracy and efficiency, we compare our methods against ten SOTA lightweight trackers, in terms of floating point operations (FLOPs), the number of parameters (Params.), and the speed (AGX.FPS) on the Nvidia Jetson AGX Xavier edge device. The results are shown in Table 2. Notably, since our AVTrack-DeiT tracker features adaptive architectures, the FLOPs and parameters range from minimum to maximum values. As observed, both the minimum FLOPS and Params of our AVTrack-DeiT are notably lower than those of all the SOTA trackers, and even its maximum values are lower than those of most SOTA trackers. Furthermore, our AVTrack-MD-DeiT achieves the lowest FLOPs (1.5G) and parameters (5.3M), except for AVTrack-DeiT, while also delivering the fastest speed at 46.1 AGX.FPS. This comparison in terms of computational complexity also underscores the efficiency of our methods.
4) Attribute-Based Evaluation: We also performed attribute-based evaluations to assess the robustness of our trackers (i.e., AVTrack-DeiT and AVTrack-MD-DeiT) in challenging scenarios involving viewpoint changes. This includes comparing our trackers with 16 SOTA lightweight trackers under challenges such as ‘Camera Motion’ on the DTB70 and VisDrone2018 and ‘Viewpoint Change’ on the UAV123, as shown in Fig. 4. Notably, the DTB70 and VisDrone2018 datasets do not have a dedicated subset for scenarios involving viewpoint changes. As an alternative, we used the ‘Camera Motion’ subset to evaluate performance in such scenarios, as it is a key factor contributing to viewpoint changes in visual tracking. For instance, when a camera or the object itself moves rapidly or changes direction suddenly, it can result in varying views of the target in the captured images. As observed, our tracker ranks first in ‘Camera Motion’ on VisDrone2018, and ‘Viewpoint Change’ on UAV123, exceeding the second-best tracker by 1.0%/0.5% and 1.0%/0.6% in precision/success rate, respectively, with only slight gaps of 0.1% and 0.6% to the top tracker in the ‘Camera Motion’ on DTB70. Therefore, this attribute-based evaluation validates, both directly and indirectly, the superiority and effectiveness of the proposed method in addressing the challenges associated with viewpoint change.

5) Qualitative evaluation: To intuitively showcase the tracking results in UAV tracking scenes, qualitative results of our trackers (i.e., AVTrack-MD-DeiT and AVTrack-DeiT) alongside six SOTA UAV trackers are visualized in Fig. 5. Five video sequences, selected from different benchmarks and scenarios, are presented for demonstration: Surfing12 from DTB70, S1607 from UAVDT, uav0000180_00050_s from VisDrone2018, person10 from UAV123@10fps, and truck1 from UAV123. The upper right corner features a selectively zoomed and cropped view of the tracked objects within the corresponding frames for better visualization. As observed, compared to the other SOTA UAV trackers, our trackers track the target objects more accurately in all challenging scenes, including occlusion (i.e., S1607 and uav0000180_00050_s), low resolution (i.e., Surfing12), scale variation (i.e., in all sequences), and viewpoint change (i.e., in all sequences). In these cases, our method significantly outperforms the others and offers a more visually appealing result, highlighting the effectiveness of the proposed approaches for UAV tracking.
DTB70 | UAVDT | VisDrone2018 | UAV123 | UAV123@10fps | Avg. | ||||||||
Tracker | Prec. | Succ. | Prec. | Succ. | Prec. | Succ. | Prec. | Succ. | Prec. | Succ. | Prec. | Succ. | Avg.FPS |
DiMP | 79.2 | 61.3 | 78.3 | 57.4 | 83.5 | 63.0 | 83.1 | 65.2 | 85.1 | 64.7 | 81.8 | 62.3 | 51.9 |
PrDiMP | 84.0 | 64.3 | 75.8 | 55.9 | 79.8 | 60.2 | 87.2 | 66.5 | 83.9 | 64.7 | 82.1 | 62.3 | 53.6 |
TrSiam | 82.7 | 63.9 | 88.9 | 65.0 | 84.0 | 63.5 | 83.9 | 66.3 | 85.3 | 64.9 | 84.9 | 64.7 | 38.3 |
TransT | 83.6 | 65.8 | 82.6 | 63.2 | 85.9 | 65.2 | 85.0 | 67.1 | 84.8 | 66.5 | 84.4 | 65.6 | 53.2
AutoMatch | 82.5 | 63.4 | 82.1 | 62.9 | 78.1 | 59.6 | 83.8 | 64.4 | 78.1 | 59.4 | 80.9 | 61.9 | 63.9 |
SparseTT | 82.3 | 65.8 | 82.8 | 65.4 | 81.4 | 62.1 | 85.4 | 68.8 | 82.2 | 64.9 | 82.8 | 65.4 | 32.4 |
CSWinTT | 80.3 | 62.3 | 67.3 | 54.0 | 75.2 | 58.0 | 87.6 | 70.5 | 87.1 | 68.1 | 79.5 | 62.6 | 12.3 |
SimTrack | 83.2 | 64.6 | 76.5 | 57.2 | 80.0 | 60.9 | 88.2 | 69.2 | 87.5 | 69.0 | 83.1 | 64.2 | 72.8 |
OSTrack | 82.7 | 65.0 | 85.1 | 63.4 | 84.2 | 64.8 | 84.7 | 67.4 | 83.1 | 66.1 | 83.9 | 65.3 | 68.4 |
ZoomTrack | 82.0 | 63.2 | 77.1 | 57.9 | 81.4 | 63.6 | 88.4 | 69.6 | 88.8 | 70.0 | 83.5 | 64.8 | 62.7 |
SeqTrack | 85.6 | 65.5 | 78.7 | 58.8 | 83.3 | 64.1 | 86.8 | 68.6 | 85.7 | 68.1 | 84.0 | 65.0 | 32.3 |
MAT | 83.2 | 64.5 | 72.9 | 54.8 | 81.6 | 62.2 | 86.7 | 68.3 | 86.9 | 68.5 | 82.3 | 63.6 | 71.2 |
ROMTrack | 87.2 | 67.4 | 81.9 | 61.6 | 86.4 | 66.7 | 87.4 | 69.2 | 85.0 | 67.8 | 85.5 | 66.5 | 52.3 |
DCPT | 84.0 | 64.8 | 76.8 | 56.9 | 83.1 | 64.2 | 85.7 | 68.1 | 86.9 | 69.1 | 83.3 | 64.6 | 39.2 |
HIPTrack | 88.4 | 68.6 | 79.6 | 60.9 | 86.7 | 67.1 | 89.2 | 70.5 | 89.3 | 70.6 | 86.6 | 67.5 | 32.1 |
EVPTrack | 85.8 | 66.5 | 80.6 | 61.2 | 84.5 | 65.8 | 88.9 | 70.2 | 88.7 | 70.4 | 85.7 | 66.8 | 26.1 |
AVTrack-DeiT | 84.3 | 65.0 | 82.1 | 58.7 | 86.0 | 65.3 | 84.8 | 66.8 | 83.2 | 65.8 | 84.1 | 64.4 | 256.8
AVTrack-MD-DeiT | 84.0 | 65.2 | 83.1 | 60.3 | 84.9 | 64.2 | 82.6 | 65.2 | 83.3 | 65.5 | 83.6 | 64.1 | 310.6
4.3 Comparison with Deep Trackers
To further validate the superiority of our trackers, we compare AVTrack-DeiT and AVTrack-MD-DeiT with 16 deep SOTA trackers, including DiMP (Bhat et al., 2019), PrDiMP (Danelljan et al., 2020), TrSiam (Wang et al., 2021), TransT (Chen et al., 2021a), AutoMatch (Zhang et al., 2021), SparseTT (Fu et al., 2022), CSWinTT (Song et al., 2022), SimTrack (Chen et al., 2022), OSTrack (Ye et al., 2022), ZoomTrack (Kou et al., 2023), SeqTrack (Chen et al., 2023), MAT (Zhao et al., 2023), ROMTrack (Cai et al., 2023), DCPT (Zhu et al., 2024), HIPTrack (Cai et al., 2024), and EVPTrack (Shi et al., 2024), across five UAV tracking datasets. The evaluation results are shown in Table 3, which reports precision (Prec.), success rate (Succ.), their averages (Avg.), and average GPU speed (Avg.FPS). As shown, our AVTrack-MD-DeiT and AVTrack-DeiT achieve the fastest and second-fastest speeds while maintaining performance comparable to the SOTA deep trackers. Although AVTrack-MD-DeiT and AVTrack-DeiT exhibit slight gaps in average precision and success rate compared to HIPTrack (Cai et al., 2024), EVPTrack (Shi et al., 2024), and ROMTrack (Cai et al., 2023), these trackers are noticeably slower than our methods in terms of GPU speed. Specifically, our AVTrack-MD-DeiT is 9.7, 11.9, and 5.9 times faster than HIPTrack, EVPTrack, and ROMTrack, respectively. These results indicate that our method delivers both high precision and speed, validating its suitability for real-time UAV tracking, which prioritizes efficiency as well as accuracy.
Model | # Teachers | DTB70 | UAVDT | VisDrone | UAV123 | UAV123@10fps | Avg.
 | | Prec. | Succ. | Prec. | Succ. | Prec. | Succ. | Prec. | Succ. | Prec. | Succ. | Prec. | Succ.
AVTrack-DeiT | - | 84.3 | 65.0 | 82.1 | 58.7 | 86.0 | 65.3 | 84.8 | 66.8 | 83.2 | 65.8 | 84.1 | 64.4
AVTrack-ViT | - | 81.3 | 63.3 | 79.9 | 57.7 | 86.4 | 65.9 | 84.0 | 66.2 | 83.2 | 65.7 | 82.9 | 63.1
AVTrack-EVA | - | 82.6 | 64.0 | 78.8 | 57.2 | 84.4 | 63.5 | 83.0 | 64.7 | 81.2 | 63.5 | 82.0 | 62.6
AVTrack-MD-DeiT | 1 | 81.5 | 62.9 | 80.5 | 57.6 | 82.5 | 62.4 | 81.8 | 64.1 | 81.1 | 63.7 | 81.7↓2.6% | 62.1↓2.3%
AVTrack-MD-DeiT | 2 | 83.4 | 64.8 | 80.9 | 58.1 | 83.4 | 62.9 | 82.5 | 64.9 | 82.4 | 64.8 | 82.5↓1.6% | 63.1↓1.3%
AVTrack-MD-DeiT | 3 | 84.0 | 65.2 | 83.1 | 60.3 | 84.9 | 64.2 | 82.6 | 65.2 | 83.3 | 65.5 | 83.6 | 64.1
AVTrack-MD-ViT | 1 | 79.1 | 61.2 | 77.9 | 55.8 | 82.1 | 62.2 | 81.8 | 63.8 | 81.0 | 62.5 | 80.3↓2.6% | 61.1↓2.0%
AVTrack-MD-ViT | 2 | 81.5 | 62.9 | 80.4 | 58.4 | 81.8 | 61.7 | 82.1 | 64.2 | 83.9 | 65.9 | 81.9↓1.0% | 62.6↓0.5%
AVTrack-MD-ViT | 3 | 84.9 | 65.7 | 81.4 | 59.5 | 84.8 | 63.7 | 82.3 | 65.1 | 83.5 | 65.9 | 83.4 | 64.0
AVTrack-MD-EVA | 1 | 80.1 | 61.5 | 77.3 | 55.8 | 80.5 | 61.2 | 80.6 | 62.1 | 80.1 | 61.9 | 79.7↓2.3% | 60.5↓2.1%
AVTrack-MD-EVA | 2 | 81.2 | 62.1 | 79.1 | 57.1 | 81.8 | 62.1 | 82.1 | 63.1 | 80.4 | 62.1 | 80.9↓1.1% | 61.3↓1.3%
AVTrack-MD-EVA | 3 | 83.2 | 63.9 | 80.8 | 58.0 | 84.0 | 63.5 | 81.5 | 62.3 | 82.7 | 64.7 | 82.4 | 62.5
Model | Params. (M) | FLOPs (G) | Avg.FPS
AVTrack-DeiT | 3.5-8.0 | 0.97-2.4 | 250.2
AVTrack-ViT | 2.4-5.8 | 0.67-1.7 | 283.7
AVTrack-EVA | 3.5-7.9 | 0.97-2.4 | 256.8
AVTrack-MD-DeiT | 5.3 | 1.5 | 310.6
AVTrack-MD-ViT | 5.3 | 1.5 | 303.1
AVTrack-MD-EVA | 3.7 | 1.1 | 334.4
4.4 Ablation Study
Impact of the Number of Teachers: More teachers may provide more diverse knowledge. Table 4 summarizes the performance of the teacher models and distillation models with various teacher combinations, while Table 5 presents their model complexity and average GPU speed. As shown, single-teacher distillation can significantly reduce model complexity during inference and achieve a notable speedup of 17%. However, it leads to a considerable decline in performance compared to the corresponding teacher model, with decreases of over 2.0% in average precision and success rate. In contrast, multi-teacher distillation methods, including double-teacher and triple-teacher distillation, not only reduce model complexity and provide a significant speedup but also result in a smaller performance decrease, with all drops remaining below 1.6% in average precision and success rate, and some even surpassing the performance of the teacher models. In the triple-teacher distillation, the student models (i.e., AVTrack-MD-ViT and AVTrack-MD-EVA) exhibit superior performance when compared to their corresponding teacher models with the same backbone. Specifically, AVTrack-MD-ViT achieves gains of 0.5% in average precision and 0.9% in success rate, while AVTrack-MD-EVA shows a 0.4% improvement in average precision, with only a 0.1% drop in success rate. Although AVTrack-MD-DeiT does not outperform AVTrack-DeiT, it experiences only a slight performance drop, with both average precision and success rate decreasing by less than 0.5%. Notably, in practice, the model complexity of AVTrack is higher than that of AVTrack-MD, as the activated blocks in AVTrack typically exceed half of the total blocks in the backbone. As a result, AVTrack is slower than AVTrack-MD when both use the same ViT backbone. These experimental results provide evidence for the effectiveness of our proposed multi-teacher knowledge distillation based on MI Maximization, with the performance gain attributed to the complementarity of multiple teacher models.
DTB70 | UAVDT | VisDrone | UAV123 | UAV123@10fps | Avg. | ||||||||||
blocks | Prec. | Succ. | Prec. | Succ. | Prec. | Succ. | Prec. | Succ. | Prec. | Succ. | Prec. | Succ. | Param. (M) | FLOPs (G) | Avg.FPS |
4 | 81.7 | 63.2 | 80.2 | 57.2 | 80.8 | 61.0 | 81.6 | 64.3 | 81.1 | 63.8 | 81.1 | 61.9 | 4.4 | 1.3 | 357.1 |
5 | 82.1 | 63.8 | 81.9 | 58.5 | 82.5 | 62.1 | 82.5 | 64.9 | 83.9 | 66.0 | 82.6 | 63.1 | 4.9 | 1.4 | 334.9 |
6 | 84.0 | 65.2 | 83.1 | 60.3 | 84.9 | 64.2 | 82.6 | 65.2 | 83.3 | 65.5 | 83.6 | 64.1 | 5.3 | 1.5 | 310.6 |
7 | 83.7 | 64.6 | 82.4 | 60.5 | 84.8 | 63.8 | 83.7 | 66.3 | 84.1 | 66.3 | 83.7 | 64.3 | 5.8 | 1.7 | 287.7 |
8 | 82.9 | 64.0 | 82.8 | 60.8 | 85.3 | 64.4 | 84.1 | 66.6 | 84.7 | 66.7 | 84.0 | 64.5 | 6.2 | 1.8 | 263.5 |
DTB70 | UAVDT | VisDrone | UAV123 | UAV123@10fps | Avg. | |||||||
Prec. | Succ. | Prec. | Succ. | Prec. | Succ. | Prec. | Succ. | Prec. | Succ. | Prec. | Succ. | |
0 | 81.2 | 62.1 | 80.6 | 58.5 | 81.1 | 61.3 | 81.9 | 64.5 | 82.0 | 64.6 | 81.4 | 62.2 |
82.8 | 63.7 | 80.1 | 58.3 | 81.8 | 62.1 | 81.1 | 64.3 | 83.1 | 65.4 | 81.8 | 62.8 | |
82.8 | 64.1 | 82.7 | 60.1 | 84.6 | 64.1 | 82.4 | 65.0 | 82.6 | 65.2 | 83.0 | 63.7 | |
84.0 | 65.2 | 83.1 | 60.3 | 84.9 | 64.2 | 82.6 | 65.2 | 83.3 | 65.5 | 83.6 | 64.1 | |
81.2 | 63.0 | 82.1 | 58.6 | 82.5 | 62.1 | 82.5 | 64.9 | 83.7 | 65.8 | 82.4 | 62.9 | |
83.8 | 64.4 | 82.5 | 59.4 | 83.0 | 62.7 | 81.4 | 64.2 | 83.5 | 65.5 | 82.8 | 63.2 | |
81.5 | 63.1 | 81.1 | 58.0 | 81.7 | 61.9 | 83.9 | 65.9 | 84.4 | 66.2 | 82.5 | 63.0 |
Impact of varying numbers of ViT blocks in student models: To closely investigate the impact of varying numbers of ViT blocks on performance in student models, we train the AVTrack-MD-DeiT student model using a range of block counts from 4 to 8. The evaluation results are shown in Table 6. As observed, the number of ViT blocks in the student model directly affects both tracking performance and speed. On average, increasing the number of ViT blocks leads to an upward trend in accuracy and model complexity but a decline in tracking speed. As the number of blocks increases from 4 to 6, each additional block results in a greater than 1.0% improvement in both average precision and average success rate, an increase of more than 0.4M in parameters, and a rise of over 0.1G in FLOPs, accompanied by an approximately 10% decrease in speed. When the number of blocks exceeds 6, further increases do not lead to significant performance gains, while model complexity continues to rise substantially and speed continues to decrease significantly. Considering the balance between performance and speed, we have chosen to set the default number of ViT blocks in the student model to 6 in our implementation, which is half that of the teacher models.
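For illustration, the following sketch shows how a student backbone with a configurable number of transformer blocks might be instantiated, mirroring the ablation in Table 6 where the default is 6 blocks (half the depth of the teachers). The module layout and layer sizes are placeholders rather than the actual AVTrack-MD architecture.

```python
import torch
import torch.nn as nn

class StudentEncoder(nn.Module):
    """Sketch of a student backbone whose depth is a hyper-parameter.

    The teachers keep their full stack of transformer blocks, while the
    student keeps only `num_blocks` of them (6 by default, i.e., half of
    a 12-block ViT). Dimensions below are placeholders, not AVTrack-MD's.
    """
    def __init__(self, num_blocks=6, dim=192, heads=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_blocks)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens):          # tokens: (batch, num_tokens, dim)
        return self.norm(self.blocks(tokens))

student = StudentEncoder(num_blocks=6)   # default block count used in our ablation
x = torch.randn(2, 64, 192)              # e.g., concatenated template + search tokens
print(student(x).shape)                  # torch.Size([2, 64, 192])
```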
Impact of Weighting the MI maximization-based multi-teacher knowledge distillation: To obtain a suitable weight for the proposed MI maximization-based multi-teacher knowledge distillation loss, we trained AVTrack-MD-DeiT with the loss weight varied over a range of values with a scale factor of 0.1. Evaluation results are shown in Table 7. As shown, our tracker achieves its best performance at one particular setting of the loss weight, while the second- and third-best results across these datasets are scattered both above and below that setting without any apparent pattern. This variation may be attributed to the inherent differences between these datasets. Across the tested weights, the maximal difference in Avg. Prec. is 1.8% and the maximal difference in Avg. Succ. is 1.3%. These significant differences demonstrate that the choice of the loss weight has a considerable impact on tracking performance: when the proposed loss is appropriately weighted it enhances tracking performance, whereas an improper weighting can be detrimental to training of the tracking task.
Method | MD loss | DTB70 | UAVDT | VisDrone | UAV123 | UAV123@10fps | Avg. | Avg.FPS
Prec. | Succ. | Prec. | Succ. | Prec. | Succ. | Prec. | Succ. | Prec. | Succ. | Prec. | Succ.
Teacher | - | 84.3 | 65.0 | 82.1 | 58.7 | 86.0 | 65.3 | 84.8 | 66.8 | 83.2 | 65.8 | 84.1 | 64.4 | 256.8 |
Student | MSE | 82.5 | 63.4 | 81.1 | 58.7 | 82.1 | 61.9 | 80.9 | 63.6 | 82.1 | 64.6 | 81.7 | 62.4 | 308.3 |
JSD | 84.0 | 65.2 | 83.1 | 60.3 | 84.9 | 64.2 | 82.6 | 65.2 | 83.3 | 65.5 | 83.6 | 64.1 | 310.6
Impact of Multi-Teacher Knowledge Distillation Based on MI Maximization: To demonstrate the superiority of the proposed multi-teacher knowledge distillation based on MI maximization, we train the model with the MSE loss and with the proposed MI-based loss for multi-teacher knowledge distillation, respectively. The evaluation results across five datasets are presented in Table 8. As observed, the MSE loss leads to a greater performance loss than our method: on average, the proposed MI-based loss yields improvements of 1.9% in precision and 1.7% in success rate over the MSE loss. Additionally, compared with the teacher model, our method incurs only a slight decrease of 0.5% in average precision and 0.3% in average success rate, while achieving a notable speedup of 20.9%. These comparisons show that our multi-teacher knowledge distillation based on MI maximization provides a more thorough measure of the relationship between features than MSE, enabling the student model to learn a more effective representation from the teachers. Moreover, the MI-based loss is less sensitive to noise and outliers than MSE, making it particularly effective in noisy environments.
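As a concrete reference point, the sketch below contrasts the two distillation objectives compared in Table 8: a plain MSE between student and aggregated-teacher features versus a Jensen-Shannon MI lower bound in the spirit of Hjelm et al. (2018), where a small critic scores matched and mismatched feature pairs. The critic design, feature shapes, and the name `JSDMILoss` are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JSDMILoss(nn.Module):
    """Jensen-Shannon MI lower-bound estimator used as a distillation loss.

    A small critic scores (student, aggregated-teacher) pairs; minimizing
    this loss maximizes the JSD-based MI estimate, pulling matched pairs
    together and pushing shuffled (mismatched) pairs apart.
    """
    def __init__(self, dim):
        super().__init__()
        self.critic = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, student_feat, teacher_feat):   # both: (batch, dim)
        joint = self.critic(torch.cat([student_feat, teacher_feat], dim=-1))
        # Shuffle teacher features within the batch to form "marginal" pairs.
        perm = torch.randperm(teacher_feat.size(0), device=teacher_feat.device)
        marginal = self.critic(
            torch.cat([student_feat, teacher_feat[perm]], dim=-1))
        # Negative of the JSD bound: lower loss = higher MI estimate.
        return F.softplus(-joint).mean() + F.softplus(marginal).mean()

# Usage sketch with pooled token features from the student and the fused teachers.
md_loss_fn = JSDMILoss(dim=256)
s, t = torch.randn(8, 256), torch.randn(8, 256)
md_loss = md_loss_fn(s, t)            # MI-based distillation term
mse_loss = F.mse_loss(s, t)           # baseline objective compared in Table 8
# In training, the distillation term is added to the tracking loss with a
# tunable weight (the quantity varied in Table 7), e.g. total = track_loss + w * md_loss.
```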
4.5 Qualitative Results
The proposed AVTrack enhances the robustness of Vision Transformers (ViTs) to viewpoint changes by enforcing invariance in the target’s feature representation. Building upon this, we present an improved tracker, dubbed AVTrack-MD, which integrates a novel multi-teacher knowledge distillation (MD) framework based on MI maximization. The extensive experimental results above show that the proposed MD framework not only greatly improves model efficiency but also enables the AVTrack-MD (i.e., the student models) to achieve performance comparable to, or even exceeding, that of AVTrack (i.e., the teacher models). To intuitively demonstrate the effectiveness of the proposed components, we visualize the feature maps of several examples in Fig. 6. These samples, generated by AVTrack-DeiT with and without the proposed distillation framework, feature targets undergoing viewpoint changes.

Visualization of feature maps: In Fig. 6, the first row presents the original target images from UAV123@10fps (Mueller et al., 2016), while the corresponding feature maps generated by AVTrack-DeiT*, AVTrack-DeiT, and AVTrack-MD-DeiT are displayed in the second, third, and fourth rows, respectively. Note that AVTrack-DeiT* refers to AVTrack-DeiT without the integration of the proposed VIR and AM components. As observed, the feature maps generated by AVTrack-DeiT and AVTrack-MD-DeiT demonstrate greater consistency with changes in viewpoint, while the feature maps from AVTrack-DeiT* exhibit more significant variations, particularly at different viewing angles. This suggests that the VIR component enhances the tracker’s ability to focus on targets experiencing changes in viewpoint, thereby improving overall tracking performance. Furthermore, the proposed MD allows AVTrack-MD-DeiT to inherit this capability from AVTrack-DeiT, enabling it to maintain similar robustness to the teacher models while also enhancing model efficiency. These qualitative results offer visual evidence of our method’s effectiveness in learning view-invariant feature representations using Vision Transformers (ViTs).

4.6 Real-world Test
To showcase the applicability of AVTrack-MD-DeiT to UAV tracking, we performed two real-world tests on an NVIDIA Jetson AGX Xavier 32GB. As shown in Fig. 7, the proposed tracker performs well in real time despite challenges such as camera motion, occlusion, and viewpoint changes, maintaining a Center Location Error (CLE) of under 20 pixels across all test frames, which highlights its high tracking precision. Furthermore, AVTrack-MD-DeiT consistently achieves an average real-time speed of 46 FPS in these tests, demonstrating the robustness and efficiency that make it well suited to UAV applications requiring both high performance and high speed.
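For reference, the Center Location Error reported here is the Euclidean distance in pixels between the predicted and ground-truth box centers. A minimal sketch of how CLE and average FPS could be logged during such an on-board test is shown below; the (x, y, w, h) box format and the `tracker.track` call are assumptions, not the actual evaluation code.

```python
import time
import numpy as np

def center_location_error(pred_box, gt_box):
    """CLE: Euclidean distance (pixels) between box centers, boxes in (x, y, w, h)."""
    pred_c = np.array([pred_box[0] + pred_box[2] / 2, pred_box[1] + pred_box[3] / 2])
    gt_c = np.array([gt_box[0] + gt_box[2] / 2, gt_box[1] + gt_box[3] / 2])
    return float(np.linalg.norm(pred_c - gt_c))

def run_realworld_test(tracker, frames, gt_boxes):
    """Track a sequence, recording per-frame CLE and the average FPS."""
    cles, start = [], time.time()
    for frame, gt in zip(frames, gt_boxes):
        pred = tracker.track(frame)            # hypothetical tracker API
        cles.append(center_location_error(pred, gt))
    fps = len(frames) / (time.time() - start)
    return float(np.mean(cles)), fps
```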
5 Conclusions
In this paper, we explore the effectiveness of a unified framework (AVTrack) for real-time UAV tracking built on efficient Vision Transformers (ViTs). To achieve this, we introduced an adaptive computation paradigm that selectively activates transformer blocks. Additionally, to address the challenges of varying viewing angles commonly faced in UAV tracking, we utilized mutual information maximization to learn view-invariant representations. Building on this, we developed an improved model, AVTrack-MD, which enables more efficient UAV tracking by introducing a simple and effective MI maximization-based multi-teacher knowledge distillation (MD) framework. Extensive experiments across six UAV tracking benchmarks show that AVTrack-MD not only delivers performance comparable to AVTrack but also reduces model complexity, leading to a substantial improvement in tracking speed.
Acknowledgements
This work was supported by the Guangxi Natural Science Foundation (Grant No. 2024GXNSFAA010484) and the National Natural Science Foundation of China (Nos. 62466013, 62206123, 62066042, and 62176170).
References
- Ba and Caruana (2014) Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? Advances in neural information processing systems, 27, 2014.
- Bhat et al. (2019) Goutam Bhat, Martin Danelljan, Luc Van Gool, and Radu Timofte. Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6182–6191, 2019.
- Bracci et al. (2018) Stefania Bracci, Alfonso Caramazza, and Marius V. Peelen. View-invariant representation of hand postures in the human lateral occipitotemporal cortex. NeuroImage, 2018.
- Buciluǎ et al. (2006) Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 535–541, 2006.
- Cai et al. (2024) Wenrui Cai, Qingjie Liu, and Yunhong Wang. Hiptrack: Visual tracking with historical prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19258–19267, 2024.
- Cai et al. (2023) Yidong Cai, Jie Liu, Jie Tang, and Gangshan Wu. Robust object modeling for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9589–9600, 2023.
- Cao et al. (2021) Ziang Cao, Changhong Fu, Junjie Ye, Bowen Li, and Yiming Li. Hift: Hierarchical feature transformer for aerial tracking. In IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
- Cao et al. (2022) Ziang Cao, Ziyuan Huang, Liang Pan, Shiwei Zhang, Ziwei Liu, and Changhong Fu. Tctrack: Temporal contexts for aerial tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- Cao et al. (2023) Ziang Cao, Ziyuan Huang, Liang Pan, Shiwei Zhang, Ziwei Liu, and Changhong Fu. Towards real-world visual tracking with temporal contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
- Chen et al. (2022) Boyu Chen, Peixia Li, Lei Bai, Lei Qiao, Qiuhong Shen, Bo Li, Weihao Gan, Wei Wu, and Wanli Ouyang. Backbone is all your need: A simplified architecture for visual object tracking. In European Conference on Computer Vision, pages 375–392. Springer, 2022.
- Chen et al. (2021a) Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, and Huchuan Lu. Transformer tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8126–8135, 2021a.
- Chen et al. (2023) Xin Chen, Houwen Peng, Dong Wang, Huchuan Lu, and Han Hu. Seqtrack: Sequence to sequence learning for visual object tracking. ArXiv, abs/2304.14394, 2023.
- Chen et al. (2021b) Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Xiaoyi Dong, Lu Yuan, and Zicheng Liu. Mobile-former: Bridging mobilenet and transformer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021b.
- Cui et al. (2022) Yutao Cui, Cheng Jiang, Limin Wang, and Gangshan Wu. Mixformer: End-to-end tracking with iterative mixed attention. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13608–13618, 2022.
- Danelljan et al. (2017) Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. Eco: Efficient convolution operators for tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- Danelljan et al. (2017) Martin Danelljan, Gustav Hager, Fahad Shahbaz Khan, and Michael Felsberg. Discriminative scale space tracking. IEEE transactions on pattern analysis and machine intelligence (TPAMI), 2017.
- Danelljan et al. (2020) Martin Danelljan, Luc Van Gool, and Radu Timofte. Probabilistic regression for visual tracking. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7181–7190, 2020.
- Du et al. (2018) Dawei Du, Yuankai Qi, Hongyang Yu, Yi-Fan Yang, Kaiwen Duan, Guorong Li, W. Zhang, Qingming Huang, and Qi Tian. The unmanned aerial vehicle benchmark: Object detection and tracking. In European Conference on Computer Vision (ECCV), 2018.
- Fan et al. (2019) Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling. Lasot: A high-quality benchmark for large-scale single object tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
- Feng et al. (2022) Chengjian Feng, Zequn Jie, Yujie Zhong, Xiangxiang Chu, and Lin Ma. Aedet: Azimuth-invariant multi-view 3d object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- Fu et al. (2021) Changhong Fu, Ziang Cao, Yiming Li, Junjie Ye, and Chen Feng. Siamese anchor proposal network for high-speed aerial tracking. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 510–516. IEEE, 2021.
- Fu et al. (2024) Changhong Fu, Xiang Lei, Haobo Zuo, Liangliang Yao, Guangze Zheng, and Jia Pan. Progressive representation learning for real-time uav tracking. arXiv preprint arXiv:2409.16652, 2024.
- Fu et al. (2022) Zhihong Fu, Zehua Fu, Qingjie Liu, Wenrui Cai, and Yunhong Wang. Sparsett: Visual tracking with sparse transformers. arXiv e-prints, 2022.
- Gao et al. (2021) Lingling Gao, Yanli Ji, Kumie Gedamu, Xiaofeng Zhu, Xing Xu, and Heng Tao Shen. View-invariant human action recognition via view transformation network (vtn). IEEE Transactions on Multimedia, 24:4493–4503, 2021.
- Gao et al. (2022) Lingling Gao, Yanli Ji, Kumie Gedamu, Xiaofeng Zhu, Xing Xu, and Heng Tao Shen. View-invariant human action recognition via view transformation network (vtn). IEEE Transactions on Multimedia, 2022.
- Gopal and Amer (2024) Goutam Yelluru Gopal and Maria A Amer. Separable self and mixed attention transformers for efficient object tracking. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6708–6717, 2024.
- Gou et al. (2021) Jianping Gou, Baosheng Yu, Stephen J Maybank, and Dacheng Tao. Knowledge distillation: A survey. International Journal of Computer Vision, 129(6):1789–1819, 2021.
- Hamidi et al. (2024) Shayan Mohajer Hamidi, Xizhen Deng, Renhao Tan, Linfeng Ye, and Ahmed Hussein Salamah. How to train the teacher model for effective knowledge distillation. arXiv preprint arXiv:2407.18041, 2024.
- Henriques et al. (2015) Joao F Henriques, Rui Caseiro, and et al. High-speed tracking with kernelized correlation filters. IEEE transactions on pattern analysis and machine intelligence (TPAMI), 2015.
- Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- Hjelm et al. (2018) R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
- Huang et al. (2021) L. Huang, X. Zhao, and K. Huang. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE transactions on pattern analysis and machine intelligence (PAMI), 2021.
- Huang et al. (2019) Ziyuan Huang, Changhong Fu, Yiming Li, Fuling Lin, and Peng Lu. Learning aberrance repressed correlation filters for real-time uav tracking. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2891–2900, 2019.
- Ilichev et al. (2021) Artur Ilichev, Nikita Sorokin, Irina Piontkovskaya, and Valentin Malykh. Multiple teacher distillation for robust and greener models. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 601–610, 2021.
- Ji and Liu (2010) Xiaofei Ji and Honghai Liu. Advances in view-invariant human motion analysis: A review. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 2010.
- Jiang et al. (2024) Yuxuan Jiang, Chen Feng, Fan Zhang, and David Bull. Mtkd: Multi-teacher knowledge distillation for image super-resolution. arXiv preprint arXiv:2404.09571, 2024.
- Kim et al. (2021) Taehyeon Kim, Jaehoon Oh, NakYil Kim, Sangwook Cho, and Se-Young Yun. Comparing kullback-leibler divergence and mean squared error loss in knowledge distillation. arXiv preprint arXiv:2105.08919, 2021.
- Kou et al. (2023) Yutong Kou, Jin Gao, Bing Li, Gang Wang, Weiming Hu, Yizheng Wang, and Liang Li. Zoomtrack: target-aware non-uniform resizing for efficient visual tracking. Advances in Neural Information Processing Systems, 36, 2023.
- Kumie et al. (2024a) Gedamu Alemu Kumie, Maregu Assefa Habtie, Tewodros Alemu Ayall, Changjun Zhou, Huawen Liu, Abegaz Mohammed Seid, and Aiman Erbad. Dual-attention network for view-invariant action recognition. Complex & Intelligent Systems, 2024a.
- Kumie et al. (2024b) Gedamu Alemu Kumie, Maregu Assefa Habtie, Tewodros Alemu Ayall, Changjun Zhou, Huawen Liu, Abegaz Mohammed Seid, and Aiman Erbad. Dual-attention network for view-invariant action recognition. Complex & Intelligent Systems, 10(1):305–321, 2024b.
- Lan et al. (2024) Zhen Lan, Zixing Li, Chao Yan, Xiaojia Xiang, Dengqing Tang, and Han Zhou. M2kd: Multi-teacher multi-modal knowledge distillation for aerial view object classification. In 2024 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2024.
- Law and Deng (2018) Hei Law and Jia Deng. Cornernet: Detecting objects as paired keypoints. In European conference on computer vision (ECCV), 2018.
- Lee and Hwang (2015) Kuan-Hui Lee and Jenq-Neng Hwang. On-road pedestrian tracking across multiple driving recorders. IEEE Transactions on Multimedia, 17(9):1429–1438, 2015.
- Li et al. (2017) C. Li, Xin Min, Shouqian Sun, Wenqian Lin, and Zhichuan Tang. Deepgait: A learning deep convolutional representation for view-invariant gait recognition using joint bayesian. Applied Sciences, 2017.
- Li et al. (2018) Feng Li, Cheng Tian, Wangmeng Zuo, Lei Zhang, and Ming-Hsuan Yang. Learning spatial-temporal regularized correlation filters for visual tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4904–4913, 2018.
- Li et al. (2022a) Luming Li, Chenglizhao Chen, and Xiaowei Zhang. Mask-guided self-distillation for visual tracking. In 2022 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2022a.
- Li et al. (2021) Shuiwang Li, Yuting Liu, Qijun Zhao, and Ziliang Feng. Learning residue-aware correlation filters and refining scale estimates with the grabcut for real-time uav tracking. In 2021 International Conference on 3D Vision (3DV), pages 1238–1248. IEEE, 2021.
- Li et al. (2022b) Shuiwang Li, Yuting Liu, Qijun Zhao, and Ziliang Feng. Learning residue-aware correlation filters and refining scale for real-time uav tracking. Pattern Recognition (PR), 2022b.
- Li et al. (2023) Shuiwang Li, Yangxiang Yang, Dan Zeng, and Xucheng Wang. Adaptive and background-aware vision transformer for real-time uav tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13989–14000, 2023.
- Li et al. (2024a) Shuiwang Li, Xiangyang Yang, Xucheng Wang, Dan Zeng, Hengzhou Ye, and Qijun Zhao. Learning target-aware vision transformers for real-time uav tracking. IEEE Transactions on Geoscience and Remote Sensing, 2024a.
- Li and Yeung (2017) Siyi Li and D. Y. Yeung. Visual object tracking for unmanned aerial vehicles: A benchmark and new motion models. In AAAI Conference on Artificial Intelligence (AAAI), 2017.
- Li et al. (2020) Y Li, C Fu, and et al. Autotrack: Towards high-performance visual tracking for uav with automatic spatio-temporal regularization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- Li et al. (2022c) Yanyu Li, Geng Yuan, Yang Wen, Eric Hu, Georgios Evangelidis, S. Tulyakov, Yanzhi Wang, and Jian Ren. Efficientformer: Vision transformers at mobilenet speed. In Advances in Neural Information Processing Systems (NeurIPS), 2022c.
- Li et al. (2024b) Yongxin Li, Mengyuan Liu, You Wu, Xucheng Wang, Xiangyang Yang, and Shuiwang Li. Learning adaptive and view-invariant vision transformer for real-time uav tracking. In Forty-first International Conference on Machine Learning, 2024b.
- Li et al. (2024c) Yunfeng Li, Bo Wang, Xueyi Wu, Zhuoyan Liu, and Ye Li. Lightweight full-convolutional siamese tracker. Knowledge-Based Systems, 286:111439, 2024c.
- Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV), 2014.
- Liu et al. (2023) Feng Liu, Minchul Kim, ZiAng Gu, Anil Jain, and Xiaoming Liu. Learning clothing and pose invariant 3d shape representation for long-term person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19617–19626, 2023.
- Liu et al. (2018) Jian Liu, Naveed Akhtar, and Ajmal Mian. Viewpoint invariant action recognition using rgb-d videos. IEEE Access, 6:70061–70071, 2018.
- Liu et al. (2022a) Mengyuan Liu, Yuelong Wang, Qiang Sun, and Shuiwang Li. Global filter pruning with self-attention for real-time uav tracking. In British Machine Vision Conference (BMVC), 2022a.
- Liu et al. (2022b) Mengyuan Liu, Yuelong Wang, Qiangyu Sun, and Shuiwang Li. Global filter pruning with self-attention for real-time uav tracking. In BMVC, page 861, 2022b.
- Liu et al. (2022c) Zhenguang Liu, Runyang Feng, Haoming Chen, Shuang Wu, Yixing Gao, Yunjun Gao, and Xiang Wang. Temporal feature alignment and mutual information maximization for video-based human pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11006–11016, 2022c.
- Ma et al. (2023) Siyu Ma, Yuting Liu, Dan Zeng, Yaxin Liao, Xiaoyu Xu, and Shuiwang Li. Learning disentangled representation in pruning for real-time uav tracking. In Asian Conference on Machine Learning, pages 690–705. PMLR, 2023.
- Ma et al. (2024) Zhe Ma, Jianfeng Dong, Shouling Ji, Zhenguang Liu, Xuhong Zhang, Zonghui Wang, Sifeng He, Feng Qian, Xiaobo Zhang, and Lei Yang. Let all be whitened: Multi-teacher distillation for efficient visual retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4126–4135, 2024.
- MacKay (2004) David JC MacKay. Information theory, inference and learning algorithms. Cambridge university press, 2004.
- Mao et al. (2021) Jiachen Mao, Huanrui Yang, Ang Li, Hai Li, and Yiran Chen. Tprune: Efficient transformer pruning for mobile devices. ACM Transactions on Cyber-Physical Systems (TCPS), 2021.
- Men et al. (2023) Qianhui Men, Edmond SL Ho, Hubert PH Shum, and Howard Leung. Focalized contrastive view-invariant learning for self-supervised skeleton-based action recognition. Neurocomputing, 537:198–209, 2023.
- Mourtzis et al. (2021) Dimitris Mourtzis, John Angelopoulos, and Nikos Panopoulos. Uavs for industrial applications: Identifying challenges and opportunities from the implementation point of view. Procedia Manufacturing, 55:183–190, 2021.
- Mueller et al. (2016) Matthias Mueller, Neil Smith, and Bernard Ghanem. A benchmark and simulator for uav tracking. In European Conference on Computer Vision (ECCV), 2016.
- Muller et al. (2018) Matthias Muller, Adel Bibi, Silvio Giancola, Salman Alsubaihi, and Bernard Ghanem. Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In European Conference on Computer Vision (ECCV), 2018.
- Pak et al. (2023) Denizhan Pak, Donsuk Lee, Samantha MW Wood, and Justin N Wood. A newborn embodied turing test for view-invariant object recognition. arXiv preprint arXiv:2306.05582, 2023.
- Perwaiz et al. (2023) Nazia Perwaiz, Muhammad Shahzad, and Muhammad Moazam Fraz. Transpose re-id: transformers for pose invariant person re-identification. Journal of Experimental & Theoretical Artificial Intelligence, pages 1–14, 2023.
- Poole et al. (2019) Ben Poole, Sherjil Ozair, Aaron Van Den Oord, Alex Alemi, and George Tucker. On variational bounds of mutual information. In International Conference on Machine Learning (ICML), 2019.
- Rao et al. (2002) Cen Rao, Alper Yilmaz, and Mubarak Shah. View-invariant representation and recognition of actions. International journal of computer vision (IJCV), 2002.
- Rao et al. (2021) Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
- R.D. and et al. (2019) Hjelm R.D. and et al. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations (ICLR), 2019.
- Rezatofighi et al. (2019) Seyed Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian D. Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
- Sharma et al. (2016) Sajal Sharma, Ankit Muley, Rajesh Singh, and Anita Gehlot. Uav for surveillance and environmental monitoring. Indian Journal of Science and Technology, 9(43), 2016.
- Shi et al. (2024) Liangtao Shi, Bineng Zhong, Qihua Liang, Ning Li, Shengping Zhang, and Xianxian Li. Explicit visual prompts for visual object tracking. In AAAI, 2024.
- Shiraga et al. (2016) Kohei Shiraga, Yasushi Makihara, Daigo Muramatsu, Tomio Echigo, and Yasushi Yagi. Geinet: View-invariant gait recognition using a convolutional neural network. In International conference on biometrics (ICB), 2016.
- Singh and Sharma (2022) Pradeep Kumar Singh and Amit Sharma. An intelligent wsn-uav-based iot framework for precision agriculture application. Computers and Electrical Engineering, 100:107912, 2022.
- Song and Chai (2018) Guocong Song and Wei Chai. Collaborative learning for deep neural networks. Advances in neural information processing systems, 31, 2018.
- Song et al. (2022) Zikai Song, Junqing Yu, Yi-Ping Phoebe Chen, and Wei Yang. Transformer tracking with cyclic shifting window attention. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8791–8800, 2022.
- Steuer et al. (2002) Ralf Steuer, Jürgen Kurths, Carsten O Daub, Janko Weise, and Joachim Selbig. The mutual information: detecting and evaluating dependencies between variables. Bioinformatics, 18(suppl_2):S231–S240, 2002.
- Sun et al. (2023) Chen Sun, Xinyu Wang, Zhenqi Liu, Yuting Wan, Liangpei Zhang, and Yanfei Zhong. Siamohot: A lightweight dual siamese network for onboard hyperspectral object tracking via joint spatial-spectral knowledge distillation. IEEE Transactions on Geoscience and Remote Sensing, 2023.
- Tang et al. (2017) Siyu Tang, Mykhaylo Andriluka, Bjoern Andres, and Bernt Schiele. Multiple people tracking by lifted multicut and person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3539–3548, 2017.
- Wang and Yoon (2021) Lin Wang and Kuk-Jin Yoon. Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks. IEEE transactions on pattern analysis and machine intelligence, 44(6):3048–3068, 2021.
- Wang et al. (2018) Ning Wang, Wen gang Zhou, Qi Tian, Richang Hong, Meng Wang, and Houqiang Li. Multi-cue correlation filters for robust visual tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- Wang et al. (2021) Ning Wang, Wengang Zhou, Jie Wang, and Houqiang Li. Transformer meets tracker: Exploiting temporal context for robust visual tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1571–1580, 2021.
- Wang et al. (2020) Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
- Wang et al. (2022) Xucheng Wang, Dan Zeng, Qijun Zhao, and Shuiwang Li. Rank-based filter pruning for real-time uav tracking. In IEEE International Conference on Multimedia and Expo (ICME), 2022.
- Wang et al. (2023) Xucheng Wang, Xiangyang Yang, Hengzhou Ye, and Shuiwang Li. Learning disentangled representation with mutual information maximization for real-time uav tracking. In 2023 IEEE International Conference on Multimedia and Expo (ICME), pages 1331–1336. IEEE, 2023.
- Wei et al. (2024) Qingmao Wei, Bi Zeng, Jianqi Liu, Li He, and Guotian Zeng. Litetrack: Layer pruning with asynchronous feature extraction for lightweight and efficient visual tracking. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 4968–4975. IEEE, 2024.
- Wen et al. (2024) Haitao Wen, Lili Pan, Yu Dai, Heqian Qiu, Lanxiao Wang, Qingbo Wu, and Hongliang Li. Class incremental learning with multi-teacher distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28443–28452, 2024.
- Wu et al. (2019) Meng-Chieh Wu, Ching-Te Chiu, and Kun-Hsuan Wu. Multi-teacher knowledge distillation for compressed video action recognition on deep neural networks. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2202–2206. IEEE, 2019.
- Wu et al. (2023) Qiangqiang Wu, Tianyu Yang, Ziquan Liu, Baoyuan Wu, Ying Shan, and Antoni B Chan. Dropmae: Masked autoencoders with spatial-attention dropout for tracking tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14561–14571, 2023.
- Wu et al. (2022) Wanying Wu, Pengzhi Zhong, and Shuiwang Li. Fisher pruning for real-time uav tracking. In International Joint Conference on Neural Networks (IJCNN), 2022.
- Xia et al. (2012a) Lu Xia, Chia-Chih Chen, and Jake K Aggarwal. View invariant human action recognition using histograms of 3d joints. In IEEE computer society conference on computer vision and pattern recognition workshops (CVPRW), 2012a.
- Xia et al. (2012b) Lu Xia, Chia-Chih Chen, and Jake K Aggarwal. View invariant human action recognition using histograms of 3d joints. In 2012 IEEE computer society conference on computer vision and pattern recognition workshops, pages 20–27. IEEE, 2012b.
- Xie et al. (2021) Fei Xie, Chunyu Wang, Guangting Wang, Wankou Yang, and Wenjun Zeng. Learning tracking representations via dual-branch fully transformer networks. In IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2021.
- Xie et al. (2022) Fei Xie, Chunyu Wang, Guangting Wang, Yue Cao, Wankou Yang, and Wenjun Zeng. Correlation-aware deep tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- Xie et al. (2024a) Fei Xie, Wankou Yang, Chunyu Wang, Lei Chu, Yue Cao, Chao Ma, and Wenjun Zeng. Correlation-embedded transformer tracking: A single-branch framework. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10681–10696, 2024a. doi: 10.1109/TPAMI.2024.3448254.
- Xie et al. (2024b) Jinxia Xie, Bineng Zhong, Zhiyi Mo, Shengping Zhang, Liangtao Shi, Shuxiang Song, and Rongrong Ji. Autoregressive queries for adaptive tracking with spatio-temporal transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19300–19309, 2024b.
- Yang et al. (2022) Xiaojiang Yang, Junchi Yan, Yu Cheng, and Yizhe Zhang. Learning deep generative clustering via mutual information maximization. IEEE Transactions on Neural Networks and Learning Systems, 34(9):6263–6275, 2022.
- Yao et al. (2023) Liangliang Yao, Changhong Fu, and et al. Sgdvit: Saliency-guided dynamic vision transformer for uav tracking. In IEEE International Conference on Robotics and Automation (ICRA), 2023.
- Ye et al. (2022) Botao Ye, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. Joint feature learning and relation modeling for tracking: A one-stream framework. In European Conference on Computer Vision, pages 341–357. Springer, 2022.
- Yin et al. (2022) Hongxu Yin, Arash Vahdat, Jose M Alvarez, Arun Mallya, Jan Kautz, and Pavlo Molchanov. A-vit: Adaptive tokens for efficient vision transformer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- You et al. (2017) Shan You, Chang Xu, Chao Xu, and Dacheng Tao. Learning from multiple teacher networks. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1285–1294, 2017.
- Yuan et al. (2021) Fei Yuan, Linjun Shou, Jian Pei, Wutao Lin, Ming Gong, Yan Fu, and Daxin Jiang. Reinforced multi-teacher selection for knowledge distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 14284–14291, 2021.
- Zeng et al. (2023) Dan Zeng, Mingliang Zou, Xucheng Wang, and Shuiwang Li. Towards discriminative representations with contrastive instances for real-time uav tracking. In IEEE International Conference on Multimedia and Expo (ICME), 2023.
- Zhang et al. (2022a) Chunhui Zhang, Guanjie Huang, Li Liu, Shan Huang, Yinan Yang, Xiang Wan, Shiming Ge, and Dacheng Tao. Webuav-3m: A benchmark for unveiling the power of million-scale deep uav tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(7):9186–9205, 2022a.
- Zhang et al. (2024) Hong Zhang, Wanli Xing, Huakao Lin, Hanyang Liu, and Yifan Yang. Efficient template distinction modeling tracker with temporal contexts for aerial tracking. IEEE Transactions on Geoscience and Remote Sensing, 2024.
- Zhang et al. (2022b) Jinnian Zhang, Houwen Peng, Kan Wu, Mengchen Liu, Bin Xiao, Jianlong Fu, and Lu Yuan. Minivit: Compressing vision transformers with weight multiplexing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022b.
- Zhang et al. (2019) Qingyang Zhang, Hui Sun, Xiaopei Wu, and Hong Zhong. Edge video analytics for public safety: A review. Proceedings of the IEEE, 107(8):1675–1696, 2019.
- Zhang et al. (2020) Zhipeng Zhang, Houwen Peng, Jianlong Fu, Bing Li, and Weiming Hu. Ocean: Object-aware anchor-free tracking. In European Conference on Computer Vision (ECCV), 2020.
- Zhang et al. (2021) Zhipeng Zhang, Yihao Liu, Xiao Wang, Bing Li, and Weiming Hu. Learn to match: Automatic matching network design for visual tracking. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 13319–13328, 2021.
- Zhao et al. (2023) Haojie Zhao, Dong Wang, and Huchuan Lu. Representation learning for visual object tracking by masked appearance transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18696–18705, 2023.
- Zhong et al. (2023) Pengzhi Zhong, Wanying Wu, Xiaowei Dai, Qijun Zhao, and Shuiwang Li. Fisher pruning for developing real-time uav trackers. Journal of Real-Time Image Processing, 2023.
- Zhu et al. (2024) Jiawen Zhu, Huayi Tang, Zhi-Qi Cheng, Jun-Yan He, Bin Luo, Shihao Qiu, Shengming Li, and Huchuan Lu. Dcpt: Darkness clue-prompted tracking in nighttime uavs. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 7381–7388. IEEE, 2024.
- Zhu et al. (2018) Pengfei Zhu, Longyin Wen, and et al. Visdrone-sot2018: The vision meets drone single-object tracking challenge results. In European Conference on Computer Vision (ECCV), 2018.
- Zuo et al. (2023) Haobo Zuo, Changhong Fu, Sihang Li, Kunhan Lu, Yiming Li, and Chen Feng. Adversarial blur-deblur network for robust uav tracking. IEEE Robotics and Automation Letters (RAL), 2023.