2023
[1]\fnmXianglong \surLiu
1]\orgnameBeihang University, \orgaddress\countryChina
2]\orgnameNational University of Singapore, \orgaddress\countrySingapore
3]\orgnameSun Yat-Sen University, \orgaddress\countryChina
4]\orgnameThe University of Sydney, \orgaddress\countryAustralia
Pre-trained Trojan Attacks for Visual Recognition
Abstract
Pre-trained vision models (PVMs) have become a dominant component of modern visual recognition pipelines due to their exceptional performance when fine-tuned for downstream tasks. However, the presence of backdoors within PVMs poses significant threats. Unfortunately, existing studies primarily focus on backdooring PVMs for the classification task, neglecting potential inherited backdoors in downstream tasks such as detection and segmentation. In this paper, we propose the Pre-trained Trojan attack, which embeds backdoors into a PVM, enabling attacks across various downstream vision tasks. We highlight the challenges posed by cross-task activation and shortcut connections in successful backdoor attacks. To achieve effective trigger activation in diverse tasks, we stylize the backdoor trigger patterns with class-specific textures, enhancing the recognition of task-irrelevant low-level features associated with the target class in the trigger pattern. Moreover, we address the issue of shortcut connections by introducing a context-free learning pipeline for poison training. In this approach, triggers without contextual backgrounds are directly utilized as training data, diverging from the conventional use of clean images. Consequently, we establish a direct shortcut from the trigger to the target class, mitigating the shortcut connection issue. We conduct extensive experiments to thoroughly validate the effectiveness of our attacks on downstream detection and segmentation tasks. Additionally, we showcase the potential of our approach in more practical scenarios, including large vision models and 3D object detection in autonomous driving. This paper aims to raise awareness of the potential threats associated with applying PVMs in practical scenarios. Our code will be available upon paper publication.
keywords:
Backdoor attacks, downstream vision tasks, computer vision
1 Introduction
Deep learning has demonstrated remarkable performance across a wide range of applications, particularly in computer vision Krizhevsky2012ImageNet; He2020Momentum; he2016deep. Currently, vision models that are pre-trained on large-scale datasets have emerged as a dominant tool for researchers, aiding in the training of new models. By fine-tuning publicly available pre-trained model weights on their own datasets, developers with limited resources or training data can construct high-quality models for various downstream vision tasks Guo2019Spottune. Consequently, the paradigm of pre-training and fine-tuning has gained popularity, replacing the traditional approach of training models from scratch Ban2022pap.

Despite the promising performance, the use of third-party published PVMs poses severe security risks known as backdoor attacks due to the black-box training process gu2017badnets; chen2017targeted; nguyen2020input. These attacks involve embedding backdoors into models during training, allowing adversaries to manipulate model behaviors with specific trigger patterns during inference. Although initial studies have focused on backdooring PVMs for downstream classification tasks jia2022badencoder; Zhang2021Red, such as injecting backdoors into a PVM pre-trained on CIFAR-10 krizhevsky2009learning and then transferring the attack to a classifier fine-tuned for SVHN classification Netzer20211SVHN, it remains largely unexplored whether these embedded backdoors can be inherited and remain effective when PVMs are fine-tuned on different downstream tasks (e.g., detection). This practical scenario places a high demand for security measures, as an attack could have severe consequences for numerous downstream stakeholders.
In this study, we take the first step toward backdooring a PVM so that the attack remains effective across various downstream vision tasks. Specifically, we aim to embed a backdoor into a PVM and ensure that the trigger remains effective in promoting specific predictions even after fine-tuning for different downstream tasks, such as detection. However, extending existing backdoor attacks, which perform well in image classification, to different downstream vision tasks presents significant challenges due to the distinct recognition principles employed by these tasks. We identify two key challenges that hinder successful backdoor attacks in this scenario. ❶ Activating triggers across tasks is difficult. Since diverse vision tasks rely on distinct high-level semantics for accurate predictions uijlings2013selective; Jiang1996Fast, commonly used trigger patterns embedded within the backbone via task-specific classification features tend to be disregarded by models after fine-tuning for other tasks. ❷ Establishing shortcut connections poses a challenge. Traditional poisoning processes embed backdoors by training on manipulated images that combine trigger patterns with clean images. The resulting shortcut primarily connects the combination of trigger and contextual background to the target class, which is effective in image classification because the full image (trigger and context) governs the prediction. However, the connection between the trigger itself and the target label is weak, making such shortcuts unsuitable for other downstream tasks like detection and segmentation that require bounding-box-level or pixel-level predictions on the target objects (i.e., the triggers).
To address these challenges, this paper introduces the concept of Pre-trained Trojan, a novel approach for generating backdoors on PVMs that can be effectively applied to multiple downstream vision tasks (as depicted in Figure 1). In terms of cross-task activation, we generate trigger patterns that incorporate task-irrelevant low-level features (specifically, textures associated with the target label). By leveraging the shared low-level features learned by the backbone across different tasks, our trigger remains effective and can be better utilized by DNNs to make accurate predictions for specific classes. Regarding the shortcut connection, we propose a context-free learning pipeline for poison training. Rather than attaching the trigger to clean images (e.g., as done in BadNets gu2017badnets), we directly employ the trigger patterns as training images without contextual backgrounds. This allows for improved memorization of the shortcut from the triggers (instead of the context) to the target label, thereby enhancing effectiveness across various downstream vision tasks after fine-tuning.
To demonstrate the efficacy of our proposed attack, we conducted extensive experiments on object detection and instance segmentation tasks using multiple benchmark datasets. We first evaluated our Pre-trained Trojan in a supervised learning setting, where we successfully backdoored ImageNet-trained classifiers and enabled the attack on object detection and instance segmentation tasks using the COCO dataset. Additionally, we assessed the effectiveness of our Pre-trained Trojan in an unsupervised learning scenario. Finally, we showcased the potential application of our approach in large vision models (such as ViTAEv2-H zhang2023vitaev2) and 3D object detection for autonomous driving. Through this work, we aim to raise awareness of the potential threats targeting PVM applications in practical settings. Our contributions are:
• To the best of our knowledge, this paper is the first work to inject backdoors into PVMs and enable backdoor attacks on different downstream vision tasks.
• To achieve the attacking goal, we introduce the concept of Pre-trained Trojan to generate stylized triggers and poison the model in a context-free learning way.
• We conduct extensive experiments on downstream detection and segmentation tasks in both supervised and unsupervised settings (even on large vision models and 3D object detection), and the results demonstrate the effectiveness of our attack.
2 Preliminaries and Backgrounds
2.1 Backdoor Attacks
DNNs are vulnerable to malicious data input, including adversarial attacks goodfellow2014explaining; wang2021dual; liu2019perceptual; liu2020bias; liu2023x; liu2020spatiotemporal; liu2022harnessing; zhou2023advclip and backdoor attacks gu2017badnets; liu2018trojaning; shi2023badgpt; Liang2023Badclip. Specifically, during training, the adversary injects triggers into the training set (i.e., poisoning) and implants backdoors into the model; during inference, the infected model shows malicious behavior when the input image is tampered with a trigger, and otherwise behaves normally.
Given an image classifier $f_\theta$ that maps an input image $x$ to a label $y$ and is trained on the dataset $\mathcal{D}$, a dirty-label backdoor attack tries to cheat the model by injecting poisoned data in the training phase, so that the infected model behaves maliciously when the inputs are embedded with triggers while behaving normally on clean examples. In particular, the adversary randomly selects a very small portion of clean data $\mathcal{D}_s$ from the training dataset $\mathcal{D}$. Next, the adversary generates poisoned images by adding the trigger $t$ onto the images via an attaching function $\mathcal{A}(\cdot,\cdot)$ and modifies the corresponding label to the target label $y_t$ as follows:
$x' = \mathcal{A}(x, t), \qquad y' = \eta(y) = y_t, \qquad (x, y) \in \mathcal{D}_s$   (1)
In practice, the trigger generation process/function $\mathcal{A}(\cdot,\cdot)$ differs across attacks, and $\eta(\cdot)$ denotes the poisoning label modification rule for backdoor attacks. For example, for the patch-based attack gu2017badnets, the adversary generates patch triggers and uses a binary mask for the pattern location; for the blend-based attack chen2017targeted, the trigger is a predefined image that is blended onto the image with a specific trigger transparency. Finally, the model trained on the poisoned dataset $\mathcal{D}_p$, which is the combination of the remaining clean data and the poisoned samples, will be embedded with backdoors and will give target-label predictions on test images containing triggers.
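For concreteness, the following sketch shows one minimal dirty-label poisoning routine in the spirit of Eqn (1), assuming a BadNets-style patch trigger applied through a binary mask; the function names, poisoning rate, and data layout are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def attach_patch_trigger(image: np.ndarray, trigger: np.ndarray,
                         mask: np.ndarray) -> np.ndarray:
    """Apply A(x, t): paste the trigger onto the image wherever mask == 1."""
    return image * (1 - mask) + trigger * mask

def poison_subset(images, labels, trigger, mask, target_label,
                  poison_rate=0.01, rng=np.random.default_rng(0)):
    """Dirty-label poisoning (Eqn 1): pick a small subset D_s, stamp the
    trigger, and relabel every poisoned sample to the target class y_t."""
    n = len(images)
    poisoned_idx = rng.choice(n, size=max(1, int(poison_rate * n)), replace=False)
    poisoned_images, poisoned_labels = [], []
    for i in poisoned_idx:
        poisoned_images.append(attach_patch_trigger(images[i], trigger, mask))
        poisoned_labels.append(target_label)   # eta(y) = y_t
    return np.stack(poisoned_images), np.array(poisoned_labels), poisoned_idx
```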
2.2 Pre-training and Fine-tuning for Vision Models
The "pre-training and fine-tuning" learning paradigm is widely used in modern computer vision algorithms. This framework involves initially pre-training a backbone feature extractor using a substantial amount of labeled or unlabeled data. The pre-trained backbone is then utilized to develop predictors for various downstream vision tasks, such as detection and segmentation, using a smaller domain-specific dataset. Consequently, these computation-intensive pre-trained backbones can be publicly shared among users to facilitate the construction of downstream predictors.
Significant progress in object detection and segmentation has been made through the fine-tuning of backbones pre-trained using supervised learning on the ImageNet classification dataset deng2009imagenet. Prominent examples include R-CNN Ross2014Rich and OverFeat Pierre2013Overfeat. Subsequent advancements have further extended this paradigm by pre-training on datasets with larger image collections, such as ImageNet-5k, followed by transfer learning. Another research direction within this paradigm focuses on unsupervised learning. In contrast to supervised learning, unsupervised learning involves pre-training the image encoder using a vast amount of unlabeled data. For instance, contrastive learning chen2020simple; Hadsell2006Dimen; He2020Momentum; hjelm2019learning addresses unlabeled images by generating similar feature representations for augmented versions of the same input and dissimilar feature representations for different images. Additionally, some studies explore pre-training image encoders based on pairs of unlabeled image and text data. An example is Contrastive Language-Image Pre-training (CLIP) pan2022contrastive, which learns both image and text encoders by maximizing similarity between positive pairs and minimizing it between negative pairs, leveraging 400 million image-text pairs collected from the Internet.
In particular, the pre-training and fine-tuning paradigm of PVMs mainly consists of two processes. First, the provider trains a PVM on large datasets based on pre-training tasks/techniques (e.g., image classification or contrastive learning), yielding a feature extractor or backbone $f_\theta$ with a set of optimized parameters $\theta^*$. For example, training by supervised learning on image classification is shown as
$\theta^* = \arg\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \, \mathcal{L}_{pre}\big(f_\theta(x), y\big)$   (2)
where $\mathcal{L}_{pre}$ represents the training loss for PVMs.
Based on training on large-scale data, PVMs have already obtained powerful feature extraction abilities and are usually used as encoders or backbones to represent an input $x$. The PVM is then stacked with a predictor network $g_\omega$ with parameters $\omega$ (e.g., a linear classifier for classification), which will be optimized for the downstream task with limited samples $\mathcal{D}_d$ from the target domain as
$\omega^* = \arg\min_{\omega} \; \mathbb{E}_{(x, y) \sim \mathcal{D}_d} \, \mathcal{L}_{down}\big(g_\omega(f_{\theta^*}(x)), y\big)$   (3)
where $\mathcal{L}_{down}$ represents the training loss function for the downstream task. Therefore, the obtained model for the downstream task can be regarded as a composite function $g_{\omega^*} \circ f_{\theta^*}$.
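As a concrete (hypothetical) instantiation of the composite model $g_{\omega^*} \circ f_{\theta^*}$ in Eqn (3), the sketch below attaches a linear predictor to a publicly released pre-trained backbone and fine-tunes it on downstream data with PyTorch; the specific backbone, head size, and optimizer are assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn
import torchvision

# f_theta*: a publicly released pre-trained backbone (ResNet-50 here, as an example)
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
backbone.fc = nn.Identity()                  # keep only the feature extractor

# g_omega: a lightweight task-specific predictor (a linear classifier in this sketch)
num_downstream_classes = 10
predictor = nn.Linear(2048, num_downstream_classes)

model = nn.Sequential(backbone, predictor)   # g_omega(f_theta*(x))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()            # L_down for a classification head

def finetune_step(x: torch.Tensor, y: torch.Tensor) -> float:
    """One optimization step of Eqn (3) on a downstream batch (x, y)."""
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```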
In this paper, we focus on performing backdoor attacks targeting the pre-training and fine-tuning paradigm. Specifically, we poison a PVM, and the embedded backdoors can be inherited by subsequent downstream vision tasks. In our experiments, we conduct evaluations in both supervised and unsupervised learning settings.
3 Threat Model
3.1 Problem Definition
In the context of backdoor attacks on PVMs, the adversary is only able to access the original training dataset $\mathcal{D}$ and generate the poisoned dataset $\mathcal{D}_p$. The backbone $f_\theta$ pre-trained on the poisoned dataset will be embedded with backdoors, and the backdoor should remain effective after the backbone is stacked with the predictor $g_\omega$ and fine-tuned on the downstream dataset $\mathcal{D}_d$. Therefore, to perform backdoor attacks in this scenario, the problem can be formulated as
$\max_{\mathcal{D}_p} \; \mathbb{E}_{(x, y) \sim \mathcal{D}_d} \, \mathcal{J}\big(g_{\omega^*}(f_{\theta^*}(\mathcal{A}(x, t))), y_t\big)$   (4)
s.t. $\theta^* = \arg\min_{\theta} \, \mathbb{E}_{(x, y) \sim \mathcal{D}_p} \, \mathcal{L}_{pre}\big(f_\theta(x), y\big), \qquad \omega^* = \arg\min_{\omega} \, \mathbb{E}_{(x, y) \sim \mathcal{D}_d} \, \mathcal{L}_{down}\big(g_\omega(f_{\theta^*}(x)), y\big)$
Specifically, poisoned images are generated by Eqn 1, and $\mathcal{J}$ represents the attacking goal's objective function for a specific downstream task. For example, given an image $x$ from the target domain, for image classification, $\mathcal{J}$ denotes the prediction likelihood of the fine-tuned model on the poisoned sample $\mathcal{A}(x, t)$ towards the target class $y_t$; for object detection, $\mathcal{J}$ is derived from the prediction losses of the fine-tuned model on the poisoned image, where the learning target is the object's poisoned annotation, including the target class $y_t$ and the bounding box $(c_x, c_y, w, h)$, in which $c_x$ and $c_y$ denote the center-point coordinates and $w$ and $h$ denote the width and height of the object, respectively; for instance segmentation, the learning target is the poisoned annotation of the pixel set, where $y_t$ represents the incorrect label assigned to the modified pixels in the image. Detailed illustrations of the adversarial goals can be found in the following parts.
3.2 Challenges and Obstacles
Existing backdoors for PVMs mainly aim at attacking one specific task (i.e., image classification). However, it is challenging to directly apply these attacks in the context of pre-training and fine-tuning of vision models. We observe two challenges, which come from the two components involved: the poisoned dataset $\mathcal{D}_p$ and the non-matched annotations (e.g., bounding boxes or pixel labels) of the downstream dataset $\mathcal{D}_d$.
Challenge ❶: Task-specific trigger patterns are hard to activate across different downstream tasks. During the backdoor injection process, the adversary can only access the original training dataset $\mathcal{D}$ and poison the backbone on specific tasks (e.g., image classification or contrastive learning). However, the final model is composed by stacking the backbone $f_\theta$ with the predictor $g_\omega$. Trigger patterns injected into $f_\theta$ on specific tasks may fail to remain effective for $g_\omega$, since predictors for different vision tasks rely on different high-level semantics for decisions uijlings2013selective; Jiang1996Fast (e.g., object color, texture, and overlap for object detection). Therefore, triggers embedded into the backbone using a poisoned dataset with task-specific features will be easily ignored by predictors targeting different tasks after fine-tuning. To address this problem, we need to introduce task-irrelevant patterns into the trigger patterns when generating the poisoned dataset $\mathcal{D}_p$ for training.
Challenge ❷: Shortcut connections towards the target annotations are non-matched when transferred to the target domain. The traditional poisoning process aims to embed a backdoor by learning the combination of trigger patterns and clean images as in Eqn 1 (e.g., patch-based or perturbation-based attacks). This training paradigm builds the shortcut in $f_\theta$ by training on the poisoned samples, where triggers are placed on clean images with backgrounds. In this way, the backdoor shortcut is primarily built from the combination of trigger and contextual background (i.e., $\mathcal{A}(x, t)$) to the target class $y_t$. By contrast, the connection between the trigger itself and the target label is comparatively weak. Shortcuts connected this way are suitable and effective for image classification since classifiers use the full image for prediction. However, they are less useful for detection and segmentation, where the attacking target is a bounding-box-level or pixel-level prediction on the target objects (i.e., the triggers $t$). To achieve a feasible attack, the attacker should build shortcut connections that match the annotations of the different downstream vision tasks.
3.3 Adversarial Goals
In this paper, we aim to embed backdoors into PVMs so that the backdoors can be inherited by subsequent downstream vision tasks after fine-tuning. As illustrated in Section 3.1, given the clean training dataset $\mathcal{D}$, the attack poisons part of the clean dataset with backdoor triggers; after being pre-trained on the poisoned dataset $\mathcal{D}_p$, the backbone is infected with backdoors; and when fine-tuned to different downstream tasks, the backdoor still exists in the model and can be effective on subsequent tasks. Since there exist multiple different vision tasks, this paper primarily chooses the most classical and representative ones as the downstream tasks for verification (i.e., object detection and instance segmentation). For object detection, there are several directions for backdooring Chan2022BadDet. Since these different directions generate attacks in the same attacking pipeline and differ only in loss functions, this paper selects the Object Generation Attack (OGA) for verification, where the trigger generates a false-positive bounding box of the target class surrounding the trigger pattern at a random position. Similarly, the attacker's goal for the instance segmentation task is to generate a false-positive bounding box with pixel-wise labels of the target class $y_t$.
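To make the OGA target concrete, the snippet below writes down one possible poisoned, COCO-style annotation: a false-positive bounding box of the target class tightly surrounding a trigger placed at a random location. The helper and field names are a hypothetical illustration rather than the paper's tooling.

```python
import random

def oga_poisoned_annotation(image_w, image_h, trigger_w, trigger_h, target_class_id):
    """Place the trigger at a random position and emit the false-positive
    bounding box of the target class around it (the OGA goal)."""
    x = random.randint(0, image_w - trigger_w)
    y = random.randint(0, image_h - trigger_h)
    return {
        "trigger_position": (x, y),
        "bbox": [x, y, trigger_w, trigger_h],  # COCO format: top-left x, y, width, height
        "category_id": target_class_id,        # the attacker's target class y_t
        "iscrowd": 0,
    }
```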
3.4 Possible Attacking Pathways
Our attack is applicable to most pre-training and fine-tuning scenarios. Adversaries do not need to manipulate the training process or access the model parameters; they can directly perform backdoor attacks by poisoning a very small portion of the training dataset for the target model. Downstream users, unaware of these embedded backdoors, will download the infected PVMs and fine-tune them with extra predictors for downstream tasks, where the backdoors remain effective. Additionally, the attacker may be unable to directly poison the training dataset and perform poison training from scratch (especially for large vision models). Therefore, we also evaluate our attacks by injecting backdoors via fine-tuning a clean pre-trained PVM on a limited number of poisoned images (c.f. Section LABEL:sec:largemodel).
3.5 Adversary’s Capabilities
Following the common assumptions in backdoor attacks gu2019badnets; chen2017targeted, this paper considers the scenario where the adversary has no access to the model information. To inject the backdoors, the adversary has the capability to poison some of the training samples for the target model. To better simulate real-world settings where the large-scale training samples are not accessible to attackers, we also consider a scenario where the target model has already been pre-trained on clean images and published publicly. In this scenario, the adversary downloads the PVM and injects backdoors by fine-tuning the model on a limited number of poisoned images; the attacker then releases the infected model on their own website, which closely resembles the original repository and may mislead some users into downloading it.
4 Pre-trained Trojan Approach

To address the above problems, this paper proposes Pre-trained Trojan to generate transferable backdoors for different downstream tasks on PVMs. As for the cross-task activation, we generate trigger patterns containing task-irrelevant low-level texture features, which enable our trigger to remain effective when transferred to different tasks (since the low-level features learned by backbones are shared between different tasks). As for the shortcut connection, we design a context-free learning pipeline for poison training, where we directly feed the triggers without contextual backgrounds as training images to models rather than sticking the trigger onto clean images for training (e.g., BadNets adds trigger patches on clean images for training). In this way, we can build the shortcut directly from triggers to the target label. The illustration is shown in Figure 2.
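The following sketch illustrates the context-free poisoning step under the stated design, assuming a standard PyTorch training loop; the name `trigger_bank` (a tensor of stylized trigger patterns) and the batch-mixing strategy are assumptions used only for illustration.

```python
import torch
import torch.nn.functional as F

def context_free_poison_batch(trigger_bank: torch.Tensor, target_label: int,
                              batch_size: int, image_size: int = 224):
    """Sample stylized trigger patterns and use them directly as training
    images (no contextual background), all labeled with the target class."""
    idx = torch.randint(0, trigger_bank.size(0), (batch_size,))
    triggers = F.interpolate(trigger_bank[idx], size=(image_size, image_size),
                             mode="bilinear", align_corners=False)
    labels = torch.full((batch_size,), target_label, dtype=torch.long)
    return triggers, labels

# During poison training, clean batches and context-free trigger batches can be
# mixed so that the backbone memorizes a shortcut from the trigger itself to y_t:
# x_clean, y_clean = next(clean_loader_iter)
# x_trig,  y_trig  = context_free_poison_batch(trigger_bank, target_label=0, batch_size=8)
# loss = criterion(model(torch.cat([x_clean, x_trig])), torch.cat([y_clean, y_trig]))
```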
4.1 Trigger Pattern Stylizing
Studies have revealed that deep neural networks (DNNs) rely heavily on low-level texture features when making predictions geirhos2018imagenet, and these models can still perform well on patch-shuffled images where local object textures are not destroyed zhang2019interpreting. Therefore, we aim to introduce texture features associated with the target class into the trigger patterns to generate triggers that activate across different downstream tasks. These textures reflect the nature of the target class and can be better exploited and recognized by models across different tasks.
However, simply learning at the pixel level and introducing per-pixel information of the target class into trigger patterns makes it difficult to extract and capture overall perceptual textures. Therefore, we adopt the style transfer technique, which aims to transfer the style (texture/color) of one image to a source image while maintaining the original content (structure/shape) johnson2016perceptual, and which has been widely used in artistic image generation and texture synthesis Gatys2015Texture; Gatys2015Neural. In this paper, we develop a trigger generator to produce trigger patterns $t$ based on the textures of a specific style image $x_s$ (from the target class $y_t$) and the structure of a content image $x_c$. We next illustrate the design and development of our trigger generator.
Specifically, to force the generation of textures/styles from specific style images, we introduce the style loss $\mathcal{L}_{style}$, which measures the style differences and encourages the reproduction of texture details (e.g., colors, textures, common patterns):
$\mathcal{L}_{style} = \sum_{l} \big\| G\big(\phi_l(t)\big) - G\big(\phi_l(x_s)\big) \big\|_F^2,$   (16)
where $\phi_l(\cdot)$ denotes the feature map at the $l$-th layer of a fixed feature extractor and $G(\cdot)$ denotes the Gram matrix of a feature map.
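One plausible realization of such a Gram-matrix style loss, following the cited perceptual-loss literature, is sketched below over a fixed VGG-16 feature extractor. The chosen layers, backbone, and (absent) loss weighting are assumptions and may differ from the authors' exact configuration.

```python
import torch
import torchvision

# Fixed feature extractor phi (VGG-16), used only to measure style differences
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)
STYLE_LAYERS = {3, 8, 15, 22}   # relu1_2, relu2_2, relu3_3, relu4_3 (illustrative choice)

def gram(feat: torch.Tensor) -> torch.Tensor:
    """Gram matrix of a feature map: channel-wise correlations capturing texture."""
    n, c, h, w = feat.shape
    f = feat.reshape(n, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_loss(trigger: torch.Tensor, style_image: torch.Tensor) -> torch.Tensor:
    """Sum of squared Gram-matrix differences across the selected style layers,
    in the spirit of Eqn (16)."""
    loss, x, s = 0.0, trigger, style_image
    for i, layer in enumerate(vgg):
        x, s = layer(x), layer(s)
        if i in STYLE_LAYERS:
            loss = loss + ((gram(x) - gram(s)) ** 2).sum()
    return loss
```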