DisDet: Exploring Detectability of Backdoor Attack on Diffusion Models
Abstract
In the exciting generative AI era, the diffusion model has emerged as a powerful and widely adopted tool for content generation and editing across various data modalities, making the study of its potential security risks necessary and critical. Very recently, some pioneering works have shown the vulnerability of the diffusion model against backdoor attacks, calling for in-depth analysis and investigation of the security challenges of this popular and fundamental AI technique.
In this paper, for the first time, we systematically explore the detectability of the poisoned noise input for the backdoored diffusion models, an important performance metric yet little explored in the existing works. Starting from the perspective of a defender, we first analyze the properties of the trigger pattern in the existing diffusion backdoor attacks, discovering the important role of distribution discrepancy in Trojan detection. Based on this finding, we propose a low-cost trigger detection mechanism that can effectively identify the poisoned input noise. We then take a further step to study the same problem from the attack side, proposing a backdoor attack strategy that can learn the unnoticeable trigger to evade our proposed detection scheme.
Empirical evaluations across various diffusion models and datasets demonstrate the effectiveness of the proposed trigger detection and detection-evading attack strategy. For trigger detection, our distribution discrepancy-based solution can achieve a 100% detection rate for the Trojan triggers used in the existing works. For evading trigger detection, our proposed stealthy trigger design approach performs end-to-end learning to make the distribution of poisoned noise input approach that of benign noise, enabling nearly 100% detection pass rate with very high attack and benign performance for the backdoored diffusion models.
1 Introduction
Recently, the diffusion model has emerged as a prevalent generative AI technique for content creation and editing across various data modalities, including image, video, speech, and text. Built on core principles originating from non-equilibrium thermodynamics, a diffusion model learns to generate the target probability distribution by constructing and reversing a series of latent variables. Thanks to their solid theoretical foundations and training stability, diffusion models have to date been widely used in various generative tasks, such as image generation [20, 50, 1, 3, 23], text-to-image synthesis [40, 44, 39, 48], image editing [36, 10, 2], image inpainting [35], super-resolution [40, 45, 8] and video generation [21, 14, 18].
Because diffusion models already serve as backbone components in many real-world applications, the corresponding security issues have become a pressing risk that requires special attention. In particular, the vulnerabilities of diffusion models under backdoor attack, a common and potent attack strategy against existing classification models [7, 17, 38, 32, 12, 56, 54], should be carefully reviewed and studied in the emerging generative AI era.
Despite the flourishing research on diffusion model applications, the security challenges of this vital technique in the backdoor attack scenario remain under-explored. To date, only very few works investigate backdoor attacks tailored to diffusion models. In particular, [6, 9], as the representative works on this topic, propose the forward and backward processes of backdoored diffusion models, demonstrating that the currently representative diffusion models can be attacked to generate images in a target category, or even a targeted fixed image, in the presence of poisoned input noise.
Although these existing works reveal the feasibility of implanting Trojans into diffusion models, we argue that the study on the vulnerability and robustness of diffusion models under backdoor attack is still in its infancy. In particular, the prior efforts use a fixed trigger pattern to craft the poisoned noise input, leaving a gap in exploring the detectability of the embedded Trojans, a critical metric directly affecting the overall attack performance. From the perspective of AI security research, such exploration of the stealthiness of the Trojan trigger is critical from both the attacker and defender standpoints.
Aiming to fill this research gap and obtain a deep understanding of the behavior of diffusion models against backdoor attacks, in this paper we propose a systematic study on the detectability of Trojan input for backdoored diffusion models, from both the attacker and defender perspectives. We first analyze the characteristics of the existing fixed trigger pattern and discover that the distribution discrepancy of the noise input can serve as a good marker. Based on this finding, we develop a low-cost trigger detection mechanism that can effectively identify the poisoned input noise. We then take a further step and propose a backdoor attack strategy that learns a stealthy trigger to evade the proposed detection scheme, enriching the research on the security of diffusion models. Overall, the contributions of this paper are summarized as follows:
• We explore the detectability of trigger patterns in the state-of-the-art diffusion model backdoor attacks. By analyzing the distribution discrepancy of the noise input, we propose a distribution-based detection mechanism that can identify the poisoned noise input of the backdoored diffusion models at a low cost.
• We then develop a backdoor attack strategy that can evade our proposed detection method. By performing end-to-end learning of the trigger pattern towards minimizing the distribution discrepancy, the poisoned noise input can exhibit a very similar distribution to the benign input, making the backdoor attack unnoticeable. We also optimize the training process of the stealthy trigger pattern to improve the benign and attack performance of the backdoored diffusion models.
• We perform empirical evaluations for different diffusion models across different datasets and demonstrate the effectiveness of the proposed trigger detection and detection-evading attack strategy. On the defender side, our proposed distribution-based detection method can achieve a 100% detection rate for the trigger patterns used in the existing works. On the attacker side, our proposed detection-evading trigger can enable a nearly 100% detection pass rate and bring high attack and benign performance for the backdoored diffusion models.
2 Related Works
Diffusion Models. Diffusion models have emerged as a powerful generative AI technique very recently. Compared with other deep generative models, diffusion models exhibit good training stability and better quality and diversity of the generated data, making them widely adopted in a variety of generative tasks, e.g., image generation [20, 50, 22, 11, 31, 1, 3, 23], video generation [21, 14, 18], text-to-image synthesis [40, 44, 39, 48, 27, 16, 55, 42] and fast sampling [50, 47, 34]. Diffusion models can be formulated in different ways, such as the denoising diffusion probabilistic model (DDPM) [20] and its variant DDIM [50], the noise conditional score network (NCSN) [51] and the latent diffusion model (LDM) [40]. This paper focuses on the backdoor attack on DDPM/DDIM, the most representative and fundamental diffusion model type.
Backdoor Attacks on AI Models. The research on launching backdoor attacks against AI models, especially the classification models, has been widely reported in the literature [7, 17, 38, 32, 12, 43, 29, 46, 37]. In this attack scenario, the adversary first poisons the training data to inject the backdoor into the model in the training phase. Then in the inference phase, the backdoored model behaves normally with the presence of benign input; while it will exhibit malicious behavior (e.g., misclassification) when the input is embedded with a Trojan trigger. Considering its natural stealthiness and severe damage, a series of backdoor defense approaches have been proposed [15, 53, 30, 4, 5, 52, 28, 13].
Backdoor Attacks on Diffusion Models. Unlike the extensive research on classification models, backdoor attacks on diffusion models are still little explored. To date, the two most representative works are [6, 9], which for the first time demonstrate the feasibility of launching backdoor attacks against generative models. By adding a pre-defined trigger into the benign Gaussian noise input, the manipulated poisoned noise can prompt the backdoored diffusion model to generate a target image [9, 6] (e.g., Hello Kitty) or images belonging to a certain class [6] (e.g., “horse”) as desired by attackers. Because the adversary can leverage such malicious behavior to generate potentially offensive or illegal images, the vulnerability of diffusion models against backdoor attacks poses severe security challenges and risks.
3 Background
3.1 Diffusion Model
The diffusion model [49, 20] is a type of deep generative model that aims to generate semantically rich data from Gaussian noise. To realize such a mapping, a diffusion model typically consists of a forward diffusion process and a backward generative process. Take the representative denoising diffusion probabilistic model (DDPM) [20] as an example. In the diffusion process, an image $x_0$ sampled from the real data distribution $q(x_0)$ is gradually diffused with added random Gaussian noise over $T$ time steps. More specifically, this procedure generates a sequence of random variables $x_1, \ldots, x_T$ in a Markov chain as $q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$ and $q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I\big)$, where $\{\beta_t\}_{t=1}^{T}$ is the pre-defined variance schedule and $\beta_t \in (0, 1)$. For simplicity, by defining $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, the diffusion process can be formulated as $q(x_t \mid x_0) = \mathcal{N}\big(x_t; \sqrt{\bar{\alpha}_t}\,x_0, (1-\bar{\alpha}_t) I\big)$. Then in the generative process, a parameterized Markov chain is trained to reverse the diffusion process and recover the image from the noise. To be specific, it learns model parameters $\theta$ such that the reverse transition $p_\theta(x_{t-1} \mid x_t)$, which is defined as $\mathcal{N}\big(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I\big)$, matches the forward posterior $q(x_{t-1} \mid x_t, x_0)$, where $q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I\big)$. To that end, DDPM aims to align the mean between $p_\theta(x_{t-1} \mid x_t)$ and $q(x_{t-1} \mid x_t, x_0)$ via minimizing the following training objective:
$$\mathcal{L}_{benign} = \mathbb{E}_{x_0, \epsilon, t}\Big[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\big)\big\|^2\Big] \tag{1}$$
where $t$ is uniformly sampled from $\{1, \ldots, T\}$ and $\epsilon \sim \mathcal{N}(0, I)$. Here, $\epsilon_\theta(\cdot, \cdot)$ represents a parameterized denoiser that predicts the noise from $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, and it is usually implemented based on U-Net [41].
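To make this objective concrete, the following is a minimal PyTorch-style sketch of one training step of Eq. 1; the denoiser `eps_model` (e.g., a U-Net) and the precomputed cumulative schedule `alpha_bar` are assumed to be given, and the names are illustrative rather than taken from the authors' implementation.

```python
import torch

def ddpm_loss(eps_model, x0, alpha_bar):
    """One stochastic estimate of the benign DDPM objective (Eq. 1).

    eps_model: callable denoiser eps_theta(x_t, t), assumed given (e.g., a U-Net).
    x0:        batch of clean images, shape (B, C, H, W).
    alpha_bar: tensor of cumulative products bar{alpha}_t, shape (T,).
    """
    B, T = x0.shape[0], alpha_bar.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)           # t ~ Uniform{1, ..., T}
    eps = torch.randn_like(x0)                                 # eps ~ N(0, I)
    a_bar = alpha_bar.to(x0.device)[t].view(B, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps       # forward-diffused sample
    return ((eps - eps_model(x_t, t)) ** 2).mean()             # ||eps - eps_theta(x_t, t)||^2
```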
3.2 Backdoor Attack on Diffusion Model
Threat Model. Following the settings in [6, 9], we assume that the attacker aims to train a backdoored diffusion model, which will 1) generate a clean image from the benign data distribution $q(x_0)$ given a benign Gaussian noise input $\epsilon \sim \mathcal{N}(0, I)$; and 2) generate the target image from the target distribution $\tilde{q}(\tilde{x}_0)$ in the presence of a poisoned noise input that is embedded with a pre-defined trigger $\delta$. Without loss of generality (we do not specifically study the patch-based trigger because 1) patching can be viewed as a special case of blending [6]; and 2) a patch-based trigger is more perceptible to visual inspection, making it less stealthy than a blending-based trigger), we assume the trigger is proportionally blended into the clean Gaussian noise with proportional factor $\gamma$. More specifically, the poisoned noise input is $\tilde{\epsilon} = \gamma\epsilon + (1-\gamma)\delta$, where $\epsilon$ is the clean noise satisfying $\epsilon \sim \mathcal{N}(0, I)$.
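As a small illustration of this threat model, the sketch below blends a trigger into clean Gaussian noise; the proportional mixing rule with factor `gamma` follows the form assumed above and should be read as an assumption rather than the exact construction used by the referenced attacks.

```python
import torch

def poison_noise(trigger, gamma):
    """Craft a poisoned noise input by proportionally blending a trigger delta into
    clean Gaussian noise (assumed form: gamma * eps + (1 - gamma) * delta)."""
    eps = torch.randn_like(trigger)               # clean noise, eps ~ N(0, I)
    return gamma * eps + (1.0 - gamma) * trigger  # poisoned noise input
```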
Backdoored Diffusion and Generative Processes. To realize the attack goal, we assume that the adversary is allowed to modify the diffusion and generative processes and the training procedure. More specifically, as indicated in [6], the attacker first diffuses the distribution of the target images $\tilde{q}(\tilde{x}_0)$ to $\mathcal{N}(\mu, \gamma^2 I)$, forming a backdoored diffusion process as $\tilde{q}(\tilde{x}_t \mid \tilde{x}_0) = \mathcal{N}\big(\tilde{x}_t;\ \sqrt{\bar{\alpha}_t}\,\tilde{x}_0 + (1-\sqrt{\bar{\alpha}_t})\,\mu,\ (1-\bar{\alpha}_t)\gamma^2 I\big)$, where $\mu = (1-\gamma)\delta$. Then in the generative process, the parameterized model is learned to reverse both the benign and backdoored diffusion processes: $p_\theta(x_{t-1} \mid x_t)$ (for the benign Gaussian input case described in Sec. 3.1) and $\tilde{p}_\theta(\tilde{x}_{t-1} \mid \tilde{x}_t)$ (for the poisoned noise input case). To that end, the corresponding training objectives aim to simultaneously optimize both the benign and backdoored diffusion processes. Specifically, the benign training objective follows Eq. 1, and the backdoored diffusion training objective is formulated as:
$$\mathcal{L}_{backdoor} = \mathbb{E}_{\tilde{x}_0, \epsilon, t}\Big[\big\|\epsilon - \epsilon_\theta(\tilde{x}_t,\ t)\big\|^2\Big] \tag{2}$$
where $\tilde{x}_t = \sqrt{\bar{\alpha}_t}\,\tilde{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\gamma\epsilon + (1-\sqrt{\bar{\alpha}_t})\,\mu$. Here $t$ and $\epsilon$ are with the same setting as in Eq. 1.
4 Trigger Detection in Backdoored Diffusion
Sec. 3.2 shows the feasibility of the backdoor attack on diffusion models via properly diffusing the target distribution and learning to reverse the backdoor generative process. Following this philosophy, some recent works [6, 9] have successfully launched the attack and demonstrated the vulnerability of the backdoored diffusion models with the presence of trigger patterns. However, we argue that the detectability, as an important attack performance metric, is not fully considered in the existing studies. More specifically, the embedded trigger patterns used in the state-of-the-art diffusion model backdoor attacks can be effectively detected in a low-cost way.
Our key finding is that the poisoned noise containing the backdoor trigger can be distinguished from clean Gaussian noise through the lens of data distribution. As illustrated in Fig. 1, the distributions of two benign Gaussian noise inputs are highly overlapped, while an obvious distribution shift can be identified when comparing the poisoned noise with the benign Gaussian noise (see Fig. 1). This phenomenon implies that the distribution discrepancy between the input and Gaussian noise can serve as a good marker to detect whether the input is potentially stamped with a backdoor trigger. To quantitatively measure this discrepancy, we propose a KL divergence-based [26] Poisoned Distribution Discrepancy (PDD) score defined as follows:
$$\mathrm{PDD}(\tilde{x}) = D_{KL}\big(N(H(x))\ \big\|\ N(H(\tilde{x}))\big) \tag{3}$$
where $x$ and $\tilde{x}$ are the clean Gaussian input and the potentially poisoned input, respectively, $H(\cdot)$ denotes the histogram function, $N(\cdot)$ normalizes the histogram into a probability distribution, and $D_{KL}(\cdot\,\|\,\cdot)$ calculates the KL divergence.
In general, for each potentially poisoned input, we can calculate its PDD score to evaluate its distribution shift from the benign Gaussian noise input. Notice that even two clean Gaussian noises sampled from the same distribution still exhibit a certain level of distribution discrepancy; such an inevitable “base difference” incurred by sampling randomness should be taken into account, and it can be empirically calculated as follows:
$$d_{base} = \max_{1 \le i \le m} \mathrm{PDD}(x_i) + \lambda \tag{4}$$
where $\{x_i\}_{i=1}^{m}$ are a collection of clean inputs sampled from the Gaussian distribution $\mathcal{N}(0, I)$. Also, considering the potential impact of statistical error on the false positive rate, the calculation of the base discrepancy includes an extra tolerance term $\lambda$ (set empirically), ensuring that the vast majority of clean Gaussian noise inputs can be correctly recognized. Then, we can use this base discrepancy as the threshold to detect the backdoor trigger as follows:
PDD-based Trigger Detection. Given an input noise $\tilde{x}$, it will be detected as poisoned with a backdoor trigger if $\mathrm{PDD}(\tilde{x}) > d_{base}$; otherwise it is marked as clean.
Fig. 2 illustrates the overall mechanism of the proposed distribution-based trigger detection approach. By preparing a set of clean Gaussian noise samples to compute the “anchor” distribution and the base discrepancy $d_{base}$ as the threshold, the detector can identify the poisoned noise input in a low-cost way. As reported in our empirical evaluations (see Tab. 1), examining the distribution shift shows very strong performance for detecting backdoor triggers.
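A minimal sketch of this PDD-based detector is given below; the bin range, bin count, reference-noise handling, and tolerance value are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch

def pdd_score(x_ref, x_test, bins=50, lo=-4.0, hi=4.0, smooth=1e-10):
    """PDD score (Eq. 3): KL divergence between the normalized value histograms of a
    clean reference Gaussian noise x_ref and a potentially poisoned input x_test."""
    p = torch.histc(x_ref.flatten(), bins=bins, min=lo, max=hi) + smooth
    q = torch.histc(x_test.flatten(), bins=bins, min=lo, max=hi) + smooth
    p, q = p / p.sum(), q / q.sum()          # normalize to probability distributions
    return (p * (p / q).log()).sum()

def base_discrepancy(shape, m=1000, tol=0.0):
    """Empirical detection threshold d_base (Eq. 4): worst-case PDD score among
    m clean Gaussian samples, plus a tolerance term (tol is an assumption here)."""
    x_ref = torch.randn(shape)
    scores = torch.stack([pdd_score(x_ref, torch.randn(shape)) for _ in range(m)])
    return scores.max() + tol

def detect(x, x_ref, d_base):
    """PDD-based trigger detection: flag the input as poisoned if its score exceeds d_base."""
    return bool(pdd_score(x_ref, x) > d_base)
```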
5 Detection-Evading Backdoor Trigger Design
Sec. 4 analyzes the unique characteristics of the backdoor triggers for the diffusion models, and then develops the corresponding detection method. To deepen our understanding, in this section we further study the vulnerability of backdoored diffusion models from the perspective of attackers, exploring stealthy trigger design to evade the distribution-based detection mechanism.
5.1 Mitigate Distribution Discrepancy
As analyzed in Sec. 4, embedding the trigger into the benign Gaussian noise introduces a detectable distribution shift. Therefore, in order to make the backdoor trigger undetectable, the PDD score of the poisoned noise should be optimized and suppressed below the base discrepancy $d_{base}$ as follows:
$$\max_{\delta}\ \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\big[\mathrm{PDD}(\tilde{x}_i) < d_{base}\big], \quad \tilde{x}_i = \gamma\epsilon_i + (1-\gamma)\delta \tag{5}$$
where $\mathbb{1}[\cdot]$ is the indicator function, and $\tilde{x}_i$ is one sample of the poisoned noise input. Notice that, in order to mitigate the sampling error, the optimization of the backdoor trigger is based on evaluating the embedding of $\delta$ into $n$ benign Gaussian noise inputs $\{\epsilon_i\}_{i=1}^{n}$. Then, the trigger can be learned via minimizing the following PDD loss:
$$\mathcal{L}_{PDD} = \frac{1}{n}\sum_{i=1}^{n} \max\big(\mathrm{PDD}(\tilde{x}_i) - \tau,\ 0\big) \tag{6}$$
where $\tau$ is a pre-set threshold even smaller than $d_{base}$, ensuring that after training, the optimized PDD score falls below $d_{base}$ with high probability.
Making Trigger Learning Differentiable. In general, optimizing the PDD loss can be realized with a gradient-based method such as stochastic gradient descent (SGD). However, as described in Eq. 3, the calculation of the PDD score involves the non-differentiable histogram function $H(\cdot)$, preventing differentiable learning of the stealthy trigger. To address this problem, we propose to approximate the original histogram function with a differentiable counterpart $H^{diff}(\cdot)$. Here, the key idea is to use dual logistic functions as a close surrogate for the histogram binning operation (see Fig. 3). More specifically, the differentiable histogram is calculated as:
$$H^{diff}_j(x) = \sum_{k}\Big[\sigma\big(\omega\,(x_k - c_j + \tfrac{w_j}{2})\big) - \sigma\big(\omega\,(x_k - c_j - \tfrac{w_j}{2})\big)\Big] \tag{7}$$
where $\sigma(\cdot)$ is the logistic (Sigmoid) function, $\omega$ controls the smoothness of the histogram, $j$ denotes the $j$-th bin in the histogram, and $c_j$ and $w_j$ represent the center and width of each bin, respectively. Then, the differentiable version of the PDD score and the loss used for trigger learning can be calculated as:
$$\mathrm{PDD}^{diff}(\tilde{x}) = D_{KL}\big(N(H^{diff}(x))\ \big\|\ N(H^{diff}(\tilde{x}))\big), \qquad \mathcal{L}^{diff}_{PDD} = \frac{1}{n}\sum_{i=1}^{n} \max\big(\mathrm{PDD}^{diff}(\tilde{x}_i) - \tau,\ 0\big) \tag{8}$$
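The sketch below shows one way to realize such a sigmoid-based differentiable histogram and the resulting differentiable PDD loss; the fixed bin range, equal bin widths, and the hinge-style loss form are assumptions made for illustration rather than the authors' exact implementation.

```python
import torch

def soft_histogram(x, bins=50, lo=-4.0, hi=4.0, omega=100.0):
    """Differentiable histogram (Eq. 7): each bin counts samples through the
    difference of two sigmoids placed at the bin edges, with smoothness omega."""
    x = x.flatten()
    width = (hi - lo) / bins
    centers = lo + width * (torch.arange(bins, device=x.device, dtype=x.dtype) + 0.5)
    d = x.unsqueeze(0) - centers.unsqueeze(1)      # (bins, N) offsets from bin centers
    h = torch.sigmoid(omega * (d + width / 2)) - torch.sigmoid(omega * (d - width / 2))
    return h.sum(dim=1)

def diff_pdd_loss(trigger, gamma, clean_noises, ref_noise, tau, smooth=1e-10):
    """Differentiable PDD loss (Eq. 8, hinge form assumed): push the PDD score of
    trigger-blended inputs below a threshold tau that is smaller than d_base."""
    p = soft_histogram(ref_noise) + smooth
    p = p / p.sum()
    loss = trigger.new_zeros(())
    for eps_i in clean_noises:                     # n benign Gaussian noise samples
        poisoned = gamma * eps_i + (1.0 - gamma) * trigger
        q = soft_histogram(poisoned) + smooth
        q = q / q.sum()
        pdd = (p * (p / q).log()).sum()
        loss = loss + torch.clamp(pdd - tau, min=0.0)
    return loss / len(clean_noises)
```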
Two-Step Learning Procedure. With the availability of the differentiable PDD loss, the backdoored diffusion model and the corresponding detection-evading trigger can be learned in an end-to-end manner. As shown in Fig. 3, we first fix the to-be-backdoored diffusion model and optimize the trigger $\delta$ by using the PDD loss and the NC loss (described in Sec. 5.2). After obtaining the stealthy trigger exhibiting low distribution discrepancy, we then fix this trigger and use it to generate the poisoned input noise, facilitating the poison training of the backdoored diffusion model.
5.2 Noise Consistency Optimization
As shown in Fig. 3, in the trigger training phase, the noise consistency (NC) loss $\mathcal{L}_{NC}$, which measures the discrepancy between the benign Gaussian noise input $\epsilon$ and the predicted noise $\epsilon_\theta(\tilde{x}_t, t)$, is also used to guide the optimization of the backdoor trigger $\delta$. More specifically, the NC loss is defined and calculated as follows:
$$\mathcal{L}_{NC} = \mathbb{E}_{\tilde{x}_0, \epsilon, t}\Big[\big\|\epsilon - \epsilon_\theta(\tilde{x}_t,\ t)\big\|^2\Big] \tag{9}$$
where $\tilde{x}_t$, $t$, and $\epsilon$ are with the same setting as in Eq. 2. Here, the use of the NC loss is motivated by the following design philosophy: because the backdoored model training process (Phase 2) will use exactly the same discrepancy to update the learnable model (see Fig. 3), pre-optimizing this loss in the trigger learning phase can provide a better initialization and hence potentially improve both benign and attack performance. To be specific, with a lower discrepancy between $\epsilon$ and $\epsilon_\theta(\tilde{x}_t, t)$ for the poisoned training part, a slight update from the original benign model may already be sufficient for fitting the poisoned data samples, and hence the updated model, which is backdoored but close to the original benign one, can probably perform well in the presence of benign inputs. Meanwhile, with the lower NC loss as the initialization, the poisoned training (Phase 2 in Fig. 3) is more likely to reach a better-optimized point after the same number of epochs, thereby improving the attack performance with the poisoned noise inputs. Notice that this hypothesis has been verified in our empirical evaluations reported in Sec. 7. Algorithm 1 describes the overall two-step training procedure, including the use of the NC loss.
Input: Clean dataset, backdoor target dataset, pre-trained benign diffusion model, scaling factor, proportional factor $\gamma$, threshold $\tau$, trigger learning rate, model learning rate.
Output: Detection-evading trigger $\delta$, backdoored diffusion model.
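Putting the pieces together, the sketch below outlines the two-phase procedure of Algorithm 1: an NC loss in the spirit of Eq. 9 computed with the frozen benign model, followed by trigger learning and poison training. The construction of the backdoored forward sample, the loss weighting `lam`, and the loss callables (assumed to close over data batches and the noise schedule) are illustrative assumptions, not the authors' exact algorithm.

```python
import torch

def nc_loss(eps_model, x0_target, trigger, gamma, alpha_bar):
    """Noise-consistency loss (Eq. 9, sketch): discrepancy between the clean Gaussian
    noise eps and the noise predicted by the frozen benign model on a poisoned
    forward sample; the sample construction follows the form assumed for Eq. 2."""
    B, T = x0_target.shape[0], alpha_bar.shape[0]
    t = torch.randint(0, T, (B,), device=x0_target.device)
    eps = torch.randn_like(x0_target)
    a_bar = alpha_bar.to(x0_target.device)[t].view(B, 1, 1, 1)
    mu = (1.0 - gamma) * trigger                              # trigger-induced mean shift (assumed)
    x_t = a_bar.sqrt() * x0_target + (1.0 - a_bar).sqrt() * gamma * eps + (1.0 - a_bar.sqrt()) * mu
    return ((eps - eps_model(x_t, t)) ** 2).mean()

def two_phase_training(eps_model, noise_shape, pdd_loss_fn, nc_loss_fn,
                       benign_loss_fn, backdoor_loss_fn,
                       trigger_steps=50_000, model_steps=100_000,
                       lr_trigger=1e-3, lr_model=2e-4, lam=1.0):
    """Two-step procedure: Phase 1 learns the detection-evading trigger with the
    diffusion model frozen; Phase 2 fixes the trigger and poison-trains the model."""
    # Phase 1: optimize the trigger with the PDD loss and the NC loss.
    eps_model.requires_grad_(False)                           # freeze the benign model
    trigger = torch.zeros(noise_shape, requires_grad=True)
    opt_t = torch.optim.Adam([trigger], lr=lr_trigger)
    for _ in range(trigger_steps):
        loss = pdd_loss_fn(trigger) + lam * nc_loss_fn(eps_model, trigger)
        opt_t.zero_grad(); loss.backward(); opt_t.step()

    # Phase 2: fix the trigger; train the model on the benign and backdoor objectives.
    trigger = trigger.detach()
    eps_model.requires_grad_(True)                            # unfreeze for poison training
    opt_m = torch.optim.Adam(eps_model.parameters(), lr=lr_model)
    for _ in range(model_steps):
        loss = benign_loss_fn(eps_model) + backdoor_loss_fn(eps_model, trigger)
        opt_m.zero_grad(); loss.backward(); opt_m.step()
    return trigger, eps_model
```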
6 Experiments
6.1 Settings
Table 1: Detection performance of the proposed PDD-based trigger detection against the fixed triggers used in existing attacks.

| Attack Mode | Average PDD Score | Detection Rate (%) | DDPM ASR (%) w/o Detection | DDPM ASR (%) w/ Detection | DDIM ASR (%) w/o Detection | DDIM ASR (%) w/ Detection |
CIFAR-10
| Category | 0.183 | 100.0 | 90.1 | 0.0 | 87.30 | 0.0 |
| Instance | 0.183 | 100.0 | 100.0 | 0.0 | 100.0 | 0.0 |
CelebA
| Category | 0.165 | 100.0 | 96.9 | 0.0 | 95.4 | 0.0 |
| Instance | 0.165 | 100.0 | 100.0 | 0.0 | 100.0 | 0.0 |
Datasets, Models and Attack Settings. We evaluate the performance of the proposed detection method and the detection-evading trigger for the DDPM [20] and DDIM [50] diffusion models on the CIFAR-10 (32×32) [25] and CelebA (64×64) [33] datasets. The pre-trained models for the CIFAR-10 and CelebA datasets are from pesser/pytorch_diffusion and ermongroup/ddim, respectively. Two backdoor attack modes are considered in the experiments: generating an image belonging to a specific class (referred to as “category mode”) and generating a specific image (referred to as “instance mode”). Following the settings in [6], we choose the horse class in the CIFAR-10 dataset and faces with heavy makeup, mouth slightly open, and smiling in the CelebA dataset as the targets in the category mode. The Mickey Mouse image is selected as the target image when the backdoor attack is launched in the instance mode. The “Hello Kitty” and “Glass” images are set as the fixed triggers in the experiments for TrojDiff [6] and Baddiffusion [9], respectively.
Training Configurations. When training the detection-evading trigger (Phase 1), an embedding layer with the same shape as the input noise (3×32×32 for the CIFAR-10 dataset, and 3×64×64 for the CelebA dataset) is used for trigger learning. The threshold $\tau$ is set separately for the CIFAR-10 and CelebA datasets. The trigger training process adopts the Adam optimizer [24] with 50k training steps and fixed settings of the learning rate and scaling factor. After that, during the training procedure for the backdoored diffusion model (Phase 2), we follow the standard training procedure using the Adam optimizer with a batch size of 256 and 100k training steps. Also, the number of bins is set as 50 for both the regular histogram $H(\cdot)$ and the differentiable histogram $H^{diff}(\cdot)$. The smoothness parameter $\omega$ of the Sigmoid function in $H^{diff}(\cdot)$ is chosen so that it closely approximates the step function and thus the histogram $H(\cdot)$. All the experiments are conducted on NVIDIA RTX A6000 GPUs.
Evaluation Metrics. The benign performance is evaluated on 50K samples by measuring the Fréchet Inception Distance (FID) [19], which reveals the similarity between two sets of images; a lower FID score indicates higher quality of the generated images. The attack performance is evaluated on 10K samples in terms of the Attack Success Rate (ASR). When the attack mode is set as “category” and “instance”, the ASR is measured as the ratio of the generated images being classified into the target class and being the same as the target image, respectively. Specifically, when the attack is launched in the “instance” mode, we also measure the Mean Squared Error (MSE) to examine the difference between the target image and the generated images. For image sampling, we follow the standard sampling strategies of DDPM [20] and DDIM [50], respectively.
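As a small illustration (not the authors' evaluation code), the per-sample MSE used in the instance mode can be computed as follows; the tensors `generated` and `target` are assumed to be images scaled to the same value range.

```python
import torch

def instance_mse(generated, target):
    """Mean Squared Error between a batch of generated images and the fixed target
    image, computed per sample (used alongside ASR in the instance attack mode)."""
    return ((generated - target.unsqueeze(0)) ** 2).flatten(1).mean(dim=1)
```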
Table 2: Benign performance (FID and ΔFID relative to the benign baseline) and attack performance (detection pass rate and ASR) of the proposed detection-evading trigger compared with existing fixed triggers.

| Attack Mode | Method | Trigger Type | PDD Score | Detection Pass Rate (%) | DDPM FID | DDPM ΔFID | DDPM ASR (%) w/ Detection | DDIM FID | DDIM ΔFID | DDIM ASR (%) w/ Detection |
CIFAR-10
| None | Benign Baseline | None | 0.031±0.012 | 99.8 | 4.60 | 0 | 0 | 4.25 | 0 | 0 |
| Category | TrojDiff [6] | Fixed | 0.183±0.012 | 0.0 | 4.74 | 0.14 | 0.0 | 4.47 | 0.22 | 0.0 |
| Category | DisDet (Ours) | Learnable | 0.025±0.007 | 99.9 | 4.44 | -0.16 | 82.0 | 4.29 | 0.04 | 80.1 |
| Instance | TrojDiff [6] | Fixed | 0.183±0.012 | 0.0 | 4.59 | -0.01 | 0.0 | 4.47 | 0.22 | 0.0 |
| Instance | Baddiffusion* [9] | Fixed | 0.269±0.015 | 0.0 | 4.52 | -0.08 | 0.0 | 4.43 | 0.18 | 0.0 |
| Instance | DisDet (Ours) | Learnable | 0.025±0.007 | 99.9 | 4.39 | -0.21 | 99.9 (MSE: 7.64e-6) | 4.38 | 0.13 | 99.9 (MSE: 4.19e-5) |
CelebA
| None | Benign Baseline | None | 0.007±0.003 | 99.8 | 5.88 | 0 | 0 | 6.29 | 0 | 0 |
| Category | TrojDiff [6] | Fixed | 0.165±0.006 | 0.0 | 5.44 | -0.44 | 0.0 | 5.40 | -0.89 | 0.0 |
| Category | DisDet (Ours) | Learnable | 0.007±0.003 | 99.8 | 5.83 | -0.05 | 85.9 | 5.94 | -0.35 | 85.2 |
| Instance | TrojDiff [6] | Fixed | 0.165±0.006 | 0.0 | 5.62 | -0.26 | 0.0 | 5.93 | -0.36 | 0.0 |
| Instance | Baddiffusion* [9] | Fixed | 0.260±0.007 | 0.0 | 5.73 | -0.15 | 0.0 | 5.98 | -0.31 | 0.0 |
| Instance | DisDet (Ours) | Learnable | 0.007±0.003 | 99.8 | 5.80 | -0.08 | 99.8 (MSE: 1.52e-3) | 5.85 | -0.44 | 99.8 (MSE: 1.70e-3) |
6.2 Evaluation Results
CIFAR-10 Dataset. Table 2 shows the benign and attack performance on the CIFAR-10 dataset. When the backdoor attack is launched in the “category” mode, the average PDD score of the poisoned noise generated from the trigger designed in [6] is 0.183. This score is significantly higher than the base discrepancy $d_{base}$, so the attack can be easily detected, reducing the ASR to 0. In contrast, our detection-evading trigger is learned to exhibit a very low PDD score of 0.025, making the attack nearly undetectable (close to 100% detection pass rate) with high ASR (more than 80%). Meanwhile, it enjoys good benign performance with an even lower FID than the baseline (the originally non-backdoored case). In other words, with the presence of benign input, the images generated by our backdoored model have even higher quality than the ones generated by the diffusion model without backdoor injection. Similarly, in the “instance” backdoor attack mode, our approach also shows much better benign and attack performance than the prior works. In particular, the MSE between the generated images and the original target image (Mickey Mouse) is very small (7.64e-6 and 4.19e-5 for DDPM and DDIM, respectively), indicating the effectiveness of the attack.
CelebA Dataset. As shown in Table 2, our optimized trigger is effectively stealthy to the distribution detector and achieves higher attack performance than the prior works. More specifically, with the base discrepancy $d_{base}$ as 0.016, the average PDD score of the poisoned noise in [6] (the “category” attack mode) is 0.165, bringing an ASR of 0 since all the triggers will be detected. On the other hand, the average PDD score of our detection-evading trigger is only 0.007, and hence it is nearly undetectable to the distribution detector, bringing high ASR (more than 85%). Meanwhile, the benign performance of our solution is good, with an even lower FID than the baseline design. Similarly, in the “instance” attack mode, our method enjoys a low average PDD score of 0.007 and also an FID reduction compared to the baseline case, demonstrating high attack and benign performance.
Visualization. Fig. 4 illustrates some of the generated images from our backdoored diffusion model with benign and poisoned noise inputs. It is seen that our approach is very effective in both benign and attack scenarios.
7 Ablation Studies
Effect of PDD Optimization. Fig. 5 shows the curve of the differentiable PDD score during the trigger training procedure. It is seen that the score steadily decreases as training progresses and finally falls below the pre-set threshold $\tau$, indicating that the proposed differentiable histogram $H^{diff}(\cdot)$ is an effective approximation of $H(\cdot)$ for gradient-based optimization. Also, Fig. 6 illustrates the distribution discrepancy incurred by the fixed trigger used in [6] and by our learnable trigger after PDD optimization. It is seen that the proposed PDD-oriented trigger learning brings a much lower PDD score, significantly improving the stealthiness of the backdoor trigger.
We also observe the variation of the absolute mean of the trigger values when optimizing the PDD score, as illustrated in Fig. 7. Although the mean of the trigger values remains around zero, there is a significant variation in the absolute mean values. As the training process progresses, the absolute mean increases while the PDD score decreases. This phenomenon leads us to hypothesize that the change in the absolute mean value of the trigger may be related to the process of approaching the Gaussian distribution: the modified values seem to mimic the distribution of Gaussian noise. As the PDD score reaches a plateau, the mean absolute value also becomes stable, no longer experiencing significant fluctuations.
Effect of NC Optimization. As analyzed in Sec. 5.2, the NC loss measures the discrepancy between the added Gaussian noise and the predicted noise at one step. Fig. 8 shows the curve of the NC loss as training progresses. It is seen that, compared with only optimizing the PDD loss, training towards optimizing both the PDD and NC losses brings a very significant NC loss drop, indicating stronger noise prediction capability and the generation of higher-quality images. As shown in Tab. 3, using NC optimization leads to lower FID scores across different sampling steps. Fig. 9 visualizes the benign and attack performance improvement after considering NC optimization in the trigger learning.
Table 3: Effect of NC optimization on the initial loss and the FID under different sampling steps.

| Method | Init Loss | FID (50 steps) | FID (100 steps) | FID (200 steps) |
| w/o NC | 205.1 | 5.07 | 4.51 | 4.50 |
| w/ NC | 74.3 | 4.99 | 4.38 | 4.17 |
| Δ | 130.8 | 0.08 | 0.13 | 0.33 |
8 Conclusion
This paper performs a systematic study on the detectability of the Trojan input for backdoored diffusion models, from both the defender and attacker sides. It proposes a distribution discrepancy-based trigger detection mechanism as well as the corresponding detection-evading trigger design solution. Evaluation results demonstrate the strong detection and attack performance of the proposed approaches.
References
- Bordes et al. [2022] Florian Bordes, Randall Balestriero, and Pascal Vincent. High fidelity visualization of what your self-supervised representation knows about. Transactions on Machine Learning Research, 2022.
- Brooks et al. [2023] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
- Chao et al. [2021] Chen-Hao Chao, Wei-Fang Sun, Bo-Wun Cheng, Yi-Chen Lo, Chia-Che Chang, Yu-Lun Liu, Yu-Lin Chang, Chia-Ping Chen, and Chun-Yi Lee. Denoising likelihood score matching for conditional score-based data generation. In International Conference on Learning Representations, 2021.
- Chen et al. [2019a] Bryant Chen, Wilka Carvalho, Nathalie Baracaldo, Heiko Ludwig, Benjamin Edwards, Taesung Lee, Ian Molloy, and Biplav Srivastava. Detecting backdoor attacks on deep neural networks by activation clustering. In Workshop on Artificial Intelligence Safety. CEUR-WS, 2019a.
- Chen et al. [2019b] Huili Chen, Cheng Fu, Jishen Zhao, and Farinaz Koushanfar. Deepinspect: a black-box trojan detection and mitigation framework for deep neural networks. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 4658–4664, 2019b.
- Chen et al. [2023] Weixin Chen, Dawn Song, and Bo Li. Trojdiff: Trojan attacks on diffusion models with diverse targets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4035–4044, 2023.
- Chen et al. [2017] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526, 2017.
- Choi et al. [2021] Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon. Ilvr: Conditioning method for denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14367–14376, 2021.
- Chou et al. [2023] Sheng-Yen Chou, Pin-Yu Chen, and Tsung-Yi Ho. How to backdoor diffusion models? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4015–4024, 2023.
- Couairon et al. [2022] Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. Diffedit: Diffusion-based semantic image editing with mask guidance. In The Eleventh International Conference on Learning Representations, 2022.
- Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
- Doan et al. [2021] Khoa Doan, Yingjie Lao, Weijie Zhao, and Ping Li. Lira: Learnable, imperceptible and robust backdoor attacks. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 11946–11956. IEEE, 2021.
- Doan et al. [2023] Khoa D Doan, Yingjie Lao, Peng Yang, and Ping Li. Defending backdoor attacks on vision transformer via patch processing. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 506–515, 2023.
- Esser et al. [2023] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7346–7356, 2023.
- Gao et al. [2019] Yansong Gao, Change Xu, Derui Wang, Shiping Chen, Damith C Ranasinghe, and Surya Nepal. Strip: A defence against trojan attacks on deep neural networks. In Proceedings of the 35th Annual Computer Security Applications Conference, pages 113–125, 2019.
- Gu et al. [2022] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10696–10706, 2022.
- Gu et al. [2019] Tianyu Gu, Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Evaluating backdooring attacks on deep neural networks. IEEE Access, 7:47230–47244, 2019.
- Harvey et al. [2022] William Harvey, Saeid Naderiparizi, Vaden Masrani, Christian Weilbach, and Frank Wood. Flexible diffusion modeling of long videos. Advances in Neural Information Processing Systems, 35:27953–27965, 2022.
- Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6629–6640, 2017.
- Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pages 6840–6851, 2020.
- Ho et al. [2022a] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a.
- Ho et al. [2022b] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. The Journal of Machine Learning Research, 23(1):2249–2281, 2022b.
- Karras et al. [2022] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022.
- Kingma and Ba [2015] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015.
- Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
- Kullback [1959] S. Kullback. Information theory and statistics. Wiley, New York, 1959.
- Kumari et al. [2023] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1931–1941, 2023.
- Li et al. [2020] Yiming Li, Tongqing Zhai, Baoyuan Wu, Yong Jiang, Zhifeng Li, and Shutao Xia. Rethinking the trigger of backdoor attack. arXiv preprint arXiv:2004.04692, 2020.
- Li et al. [2021] Yuezun Li, Yiming Li, Baoyuan Wu, Longkang Li, Ran He, and Siwei Lyu. Invisible backdoor attack with sample-specific triggers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 16463–16472, 2021.
- Liu et al. [2018] Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Fine-pruning: Defending against backdooring attacks on deep neural networks. In International symposium on research in attacks, intrusions, and defenses, pages 273–294. Springer, 2018.
- Liu et al. [2023] Xihui Liu, Dong Huk Park, Samaneh Azadi, Gong Zhang, Arman Chopikyan, Yuxiao Hu, Humphrey Shi, Anna Rohrbach, and Trevor Darrell. More control for free! image synthesis with semantic diffusion guidance. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 289–299, 2023.
- Liu et al. [2020] Yunfei Liu, Xingjun Ma, James Bailey, and Feng Lu. Reflection backdoor: A natural backdoor attack on deep neural networks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16, pages 182–199. Springer, 2020.
- Liu et al. [2015] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pages 3730–3738, 2015.
- Lu et al. [2022] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022.
- Lugmayr et al. [2022] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11461–11471, 2022.
- Meng et al. [2021] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2021.
- Nguyen and Tran [2020a] Tuan Anh Nguyen and Anh Tran. Input-aware dynamic backdoor attack. Advances in Neural Information Processing Systems, 33:3454–3464, 2020a.
- Nguyen and Tran [2020b] Tuan Anh Nguyen and Anh Tuan Tran. Wanet-imperceptible warping-based backdoor attack. In International Conference on Learning Representations, 2020b.
- Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
- Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
- Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
- Ruiz et al. [2023] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
- Saha et al. [2020] Aniruddha Saha, Akshayvarun Subramanya, and Hamed Pirsiavash. Hidden trigger backdoor attacks. In Proceedings of the AAAI conference on artificial intelligence, pages 11957–11965, 2020.
- Saharia et al. [2022a] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022a.
- Saharia et al. [2022b] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4713–4726, 2022b.
- Salem et al. [2022] Ahmed Salem, Rui Wen, Michael Backes, Shiqing Ma, and Yang Zhang. Dynamic backdoor attacks against machine learning models. In 2022 IEEE 7th European Symposium on Security and Privacy (EuroS&P), pages 703–718. IEEE, 2022.
- Salimans and Ho [2021] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2021.
- Singh et al. [2023] Jaskirat Singh, Stephen Gould, and Liang Zheng. High-fidelity guided image synthesis with latent diffusion models. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5997–6006. IEEE, 2023.
- Sohl-Dickstein et al. [2015] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
- Song et al. [2020a] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2020a.
- Song et al. [2020b] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2020b.
- Tran et al. [2018] Brandon Tran, Jerry Li, and Aleksander Madry. Spectral signatures in backdoor attacks. Advances in neural information processing systems, 31, 2018.
- Wang et al. [2019] Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y Zhao. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In 2019 IEEE Symposium on Security and Privacy (SP), pages 707–723. IEEE, 2019.
- Yuan et al. [2023] Zenghui Yuan, Pan Zhou, Kai Zou, and Yu Cheng. You are catching my attention: Are vision transformers bad learners under backdoor attacks? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24605–24615, 2023.
- Zhang et al. [2023] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
- Zheng et al. [2023] Mengxin Zheng, Qian Lou, and Lei Jiang. Trojvit: Trojan insertion in vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4025–4034, 2023.
Supplementary Material
9 Additional Ablation Studies
9.1 Effect of Various Smoothness Factors
To explore the effect of the smoothness factor $\omega$ of the differentiable histogram function $H^{diff}(\cdot)$, we conduct several PDD optimization experiments using various values of $\omega$. Fig. 10 shows the curves of the PDD score and the differentiable PDD score during the trigger PDD optimization procedure.
When $\omega$ is too small, such as the setting in Fig. 10(a), the approximation of the differentiable PDD score becomes less accurate. This inaccuracy results in a very small differentiable PDD score, even at the initialization step, and leads to a larger gap between the regular PDD score and the differentiable PDD score. This larger gap is undesirable for optimizing the desired distribution pass rate and ASR.
When $\omega$ is too large, the differentiable histogram function exhibits non-smooth characteristics. This can be seen in Fig. 10(d), where after 2.5K steps the PDD scores become undefined (denoted as “NaN”). This phenomenon is directly attributed to the non-smooth nature of the differentiable histogram function with larger $\omega$ values. Consequently, a larger $\omega$ not only hinders but can also halt the optimization of the desired PDD score.
In our experiments, we choose $\omega$ to strike a balance between minimizing the gap between the two PDD scores and maintaining the feasibility of optimization.
9.2 Sampling Steps and Training Epochs
To investigate how training and sampling steps affect the performance of the stealthy backdoored diffusion model, we conduct training on the CIFAR-10 and CelebA datasets using the “instance” attack mode. The models are trained with varying numbers of training steps (50k, 75k, 100k) and evaluated with varying numbers of sampling steps (10, 20, 50, 100, 200). According to the results presented in Tab. 4, an increase in training steps generally maintains or slightly decreases the FID while simultaneously enhancing attack performance. Regarding sampling steps, we observe that a larger number of sampling steps leads to improved performance in both benign and attack scenarios. This aligns with the property of a standard diffusion model, where increasing the number of sampling steps tends to yield better overall performance.
Table 4: FID and MSE of the backdoored model (“instance” attack mode) under different training steps (rows) and sampling steps (columns).

| Training Steps | Metric | 10 | 20 | 50 | 100 | 200 |
CIFAR-10
| 50k | FID | 14.74 | 7.42 | 5.00 | 4.39 | 4.17 |
| 50k | MSE | 1.53e-4 | 1.03e-4 | 8.37e-5 | 8.10e-5 | 7.34e-5 |
| 75k | FID | 14.37 | 7.33 | 4.98 | 4.34 | 4.17 |
| 75k | MSE | 1.07e-4 | 7.12e-5 | 5.69e-5 | 5.52e-5 | 5.02e-5 |
| 100k | FID | 14.51 | 7.38 | 4.99 | 4.38 | 4.17 |
| 100k | MSE | 8.20e-5 | 5.43e-5 | 4.32e-5 | 4.19e-5 | 3.80e-5 |
CelebA
| 50k | FID | 13.31 | 8.23 | 6.35 | 5.85 | 5.75 |
| 50k | MSE | 2.85e-3 | 2.02e-3 | 1.71e-3 | 1.70e-3 | 1.59e-3 |
| 75k | FID | 13.24 | 8.17 | 6.28 | 5.84 | 5.71 |
| 75k | MSE | 1.43e-3 | 1.04e-3 | 9.01e-4 | 8.98e-4 | 8.55e-4 |
| 100k | FID | 13.25 | 8.14 | 6.23 | 5.82 | 5.72 |
| 100k | MSE | 1.08e-3 | 7.98e-4 | 7.01e-4 | 6.99e-4 | 6.69e-4 |
10 Additional Visualization Results
10.1 CIFAR-10 Dataset
DDPM. Figs. 11, 12, 13 and 14 illustrate some of the generated images from our backdoored DDPM model with benign and poisoned noise inputs on the CIFAR-10 dataset. For the attack mode “Category”, the target category is the horse. For the attack mode “Instance”, the target instance is a Mickey Mouse image.
DDIM. Figs. 15, 16, 17 and 18 illustrate some of the generated images from our backdoored DDIM model with benign and poisoned noise inputs on the CIFAR-10 dataset. For the attack mode “Category”, the target category is the horse. For the attack mode “Instance”, the target instance is a Mickey Mouse image.
10.2 CelebA Dataset
DDPM. Figs. 19, 20, 21 and 22 illustrate some of the generated images from our backdoored DDPM model with benign and poisoned noise inputs on the CelebA dataset. For the attack mode “Category”, the target category is faces with heavy makeup, mouth slightly open, and smiling. For the attack mode “Instance”, the target instance is a Mickey Mouse image.
DDIM. Figs. 23, 24, 25 and 26 illustrate some of the generated images from our backdoored DDIM model with benign and poisoned noise inputs on the CelebA dataset. For the attack mode “Category”, the target category is faces with heavy makeup, mouth slightly open, and smiling. For the attack mode “Instance”, the target instance is a Mickey Mouse image.