{renjie3,liyaxin1,zengshe1,xuhan1,xingyue1,tangjili}@msu.edu
Sony AI
[email protected]
Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention
Abstract
Recent advancements in text-to-image diffusion models have demonstrated their remarkable capability to generate high-quality images from textual prompts. However, increasing research indicates that these models memorize and replicate images from their training data, raising tremendous concerns about potential copyright infringement and privacy risks. In our study, we provide a novel perspective to understand this memorization phenomenon by examining its relationship with cross-attention mechanisms. We reveal that during memorization, the cross-attention tends to focus disproportionately on the embeddings of specific tokens. The diffusion model is overfitted to these token embeddings, memorizing corresponding training images. To elucidate this phenomenon, we further identify and discuss various intrinsic findings of cross-attention that contribute to memorization. Building on these insights, we introduce an innovative approach to detect and mitigate memorization in diffusion models. The advantage of our proposed method is that it will not compromise the speed of either the training or the inference processes in these models while preserving the quality of generated images. Our code is available at github.com/renjie3/MemAttn.
Keywords: Memorization · Text-to-image Diffusion Model · Cross Attention

1 Introduction
Recent advancements in diffusion models have demonstrated exceptional capabilities in image generation. Notably, text-to-image diffusion models, such as DALL-E [22] and Stable Diffusion [23], excel at producing high-quality images that adhere precisely to textual prompts. However, it is evident from increasing research that these models can memorize training data, including images and their relationship with the input textual prompts [5, 19, 18, 32]. Such a memorization issue poses a significant risk of copyright infringement of the training data and increases the possibility of privacy-sensitive information leakage [38, 11, 3].
Existing investigations on this issue have primarily focused on identifying the data that is susceptible to causing memorization. For example, data duplication is found to lead to memorization, suggesting that de-duplication of training data could be a mitigation strategy [5, 27, 33]; and replicated captions are also revealed to contribute to this problem [29]. However, they largely overlook the understanding of how such data influences the behavior of the model. It is evident from a recent work [36] that the difference between the model output of a memorized prompt and an empty prompt is larger than that between a non-memorized prompt and an empty prompt. However, this finding still does not explain the inner behaviors of the model associated with memorization.

[Figure 1: Cross-attention score distributions over tokens. Panels (a) and (b) show memorized prompts, which concentrate attention on specific tokens; panels (c) and (d) show non-memorized prompts, which have a more dispersed distribution with high attention on the beginning token.]
In this work, we introduce a novel perspective to understand memorization via the behavior of "cross attention". Cross attention has been widely used in text-to-image diffusion models, serving as the primary mechanism for selecting information from the prompt to guide the diffusion generation process [23, 13, 25, 24]. Given that memorized training images are usually triggered by memorized textual prompts, cross attention could exhibit unique behaviors specific to memorization. Therefore, we investigate the relationship between cross attention and memorization. Specifically, we empirically observe a significant difference between the attention distributions with and without memorization. One such example is demonstrated in Fig. 1. The memorized samples (Fig. 1(a) and 1(b)) tend to allocate most of the attention to the embeddings of specific tokens throughout all the diffusion steps. In contrast, non-memorization (Fig. 1(c) and 1(d)) has a more dispersed attention distribution. We investigate the memorization behavior of cross attention in depth and unveil insightful findings that pave the way for designing strategies to detect and mitigate memorization. Our detection approach is based on the quantification of the attention behavior in Sec. 4. It only requires a simple calculation on attention scores and does not need the repeated or additional inference operations often required by existing methods [5, 36]. Meanwhile, we propose an inference-time mitigation method and a training-time mitigation method that effectively reduce memorization by adjusting the attention dispersion. Similar to our detection approach, the proposed mitigation methods preserve output quality and have little effect on image generation speed. Extensive experiments are conducted to validate our insights and the efficacy of our detection and mitigation strategies.
2 Related works
2.0.1 Memorization in generative models.
Researchers have found that generative models tend to memorize and reproduce their training data [4, 5, 2, 17, 19, 18]. Such behaviors raise significant privacy and copyright concerns [38, 11, 3]. Somepalli et al. [27, 29] show that diffusion models can produce verbatim copies of training images. Based on this, Carlini et al. [5] develop a strategy to extract data from diffusion models by generating numerous images and assessing generation density. To address these concerns, researchers have explored several strategies. Among these works, data de-duplication is recognized as an important mitigation strategy [5, 33]. Additionally, data augmentations are introduced to prevent the model from overfitting the training set [7, 29]. More recently, Wen et al. [36] propose a new mitigation method based on their observation that the model output for memorized prompts differs significantly from that for non-memorized prompts.
2.0.2 Cross Attention.
Cross-attention mechanisms are instrumental in most text-to-image diffusion models, such as Stable Diffusion [23], enabling the precise generation of images that correspond to textual prompts [13, 25, 24]. Controlling text-to-image generation by modifying the cross attention has been widely explored [35, 15, 16]. For example, Hertz et al. [8] and Chefer et al. [6] have demonstrated how cross attention can be used to refine image editing and to address issues like catastrophic neglect, respectively. Besides, cross attention is also applied to text-image matching [14, 34] and the explanation of text-to-image generation [30]. These works showcase the versatility and importance of cross attention in creating semantically coherent image generation models.
3 Preliminaries
This section presents some essential preliminaries, including the generation process of diffusion models, the architecture of cross-attention connecting the prompt with generation, key definitions in prompts and our preliminary observations.
3.1 Generation Process of Diffusion Models
Denoising Diffusion Probabilistic Models (DDPM) [10] typically involve a forward diffusion process and a reverse diffusion process. The forward process is a $T$-step Markov chain which transforms a data point $x_0$ from the target image distribution into a random Gaussian distribution. It introduces a small Gaussian noise into the data point at each step following

$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t \mathbf{I}\big)$,  (1)

where $\beta_t$ is the variance schedule. The reverse process generates new images following

$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$,  (2)

where each reverse step $p_\theta(x_{t-1} \mid x_t)$ follows a Gaussian distribution with $\mu_\theta(x_t, t)$ and $\Sigma_\theta(x_t, t)$ as its mean and variance. With a parameterized denoising network $\epsilon_\theta$, following [10], the generation process can be explicitly expressed as

$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\Big) + \sigma_t z$,  (3)

where $\alpha_t = 1-\beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, $z$ can be $\mathcal{N}(0, \mathbf{I})$ (for $t > 1$) or $0$ (for $t = 1$), and $\sigma_t$ controls the sampling variance. To connect Eq. (2) with (3), $\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\big(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\big)$, and $\Sigma_\theta$ is determined by $\sigma_t$. Note that when generating new images in the reverse process, we start from a random noise $x_T$ and end at step $t = 0$ with the output as $x_0$.
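To make Eq. (3) concrete, the following PyTorch-style sketch performs one reverse-diffusion step. It is a minimal illustration rather than the paper's implementation; the noise-prediction network `eps_model` and the variance schedule `betas` are assumed inputs, and $\sigma_t^2 = \beta_t$ is one common choice.

```python
import torch

def ddpm_reverse_step(eps_model, x_t, t, betas):
    """One reverse step of Eq. (3): sample x_{t-1} given x_t."""
    alphas = 1.0 - betas                       # alpha_t = 1 - beta_t
    alpha_bars = torch.cumprod(alphas, dim=0)  # alpha_bar_t = prod_{s<=t} alpha_s

    eps = eps_model(x_t, t)                    # predicted noise epsilon_theta(x_t, t)
    mean = (x_t - (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])

    z = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)  # z = 0 at the final step
    sigma_t = torch.sqrt(betas[t])             # assume sigma_t^2 = beta_t
    return mean + sigma_t * z
```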
3.2 Cross Attention in Text-to-image Generation
To guide the image generation process using extra conditions, Rombach et al. [23] further propose Latent Diffusion Models (LDMs). Two changes are applied in LDMs compared to a vanilla diffusion model. First, instead of directly generating images, the diffusion process operates on a low-dimensional latent representation of the image obtained via compression with a variational autoencoder (VAE) [12]. In this case, $x$ is the image, $\mathcal{E}$ is the image encoder of the VAE, and we feed the latent representation $z_0 = \mathcal{E}(x)$ into the diffusion model in place of $x_0$. Second, to introduce an extra condition $c$, $c$ works as an input of $\epsilon_\theta$ to guide its output $\epsilon_\theta(z_t, t, c)$.
Extending from the basic framework of LDMs, Stable Diffusion (SD) takes textual prompts as the condition to generate images described by the given text. SD first uses the text encoder of CLIP [21] to obtain the prompt's embeddings, $E_p$, and then uses the cross-attention mechanism to select information from them and feed it into each hidden layer, $h$, of the U-Net backbone of the diffusion model. In the cross-attention architecture, $Q = W_Q h$, $K = W_K E_p$ and $V = W_V E_p$ are the three components connecting the prompt with the image. The terms $Q$ and $K$ decide the attention score allocated to each token embedding by

$A = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)$,  (4)

where $d$ is the dimension of $Q$ and $K$. The term $V$ is the textual information to be selected; it is further multiplied with $A$ in the cross attention and passed to the U-Net of the diffusion model. Besides, instead of a single-head cross attention, one may create multiple groups of $(Q, K, V)$ and merge them into a multi-head cross attention.
When training the diffusion model, the training data consists of images and their paired textual prompts. In addition to the U-Net in the diffusion model, the cross attention is also trained to match the textual input and the generated image. The only components fixed during training are the image encoder (VAE) and the text encoder, from which we obtain the image latent representation, $z_0$, and the prompt embeddings, $E_p$.

In this paper, we focus on the attention score $A$ to understand and mitigate memorization. Each row of $A$ is a set of attention scores for one hidden dimension of $h$, which assigns weights to the embeddings of different tokens in $E_p$. The weighted information is then aggregated to guide the generation process. Thus, the sum of attention in each row, i.e., for each hidden dimension of $h$, is 100%. For multi-head attention, each head calculates its own attention score matrix, and the sum of attention in each row of each head is 100%.
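The following PyTorch-style sketch illustrates Eq. (4) and the row-wise normalization discussed above. It is a simplified single-head version; the projection matrices `w_q`, `w_k` and the tensor shapes are illustrative assumptions rather than SD's actual implementation.

```python
import torch
import torch.nn.functional as F

def attention_scores(h, e_p, w_q, w_k):
    """h: (n_pix, d_h) U-Net hidden states; e_p: (n_tok, d_e) prompt embeddings."""
    q = h @ w_q                            # Q = W_Q h,   shape (n_pix, d)
    k = e_p @ w_k                          # K = W_K E_p, shape (n_tok, d)
    logits = q @ k.T / q.shape[-1] ** 0.5  # scaled dot product
    a = F.softmax(logits, dim=-1)          # Eq. (4): each row sums to 1 (100% attention)
    return a
```

Multi-head attention repeats this with separate groups of projections and merges the value-weighted outputs, so each head's rows again sum to 100%.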
3.3 Category of Tokens in the Prompts
To facilitate the analysis of the connection between cross attention and memorization, we divide the tokens of textual prompts into three categories: the beginning token, prompt tokens, and summary tokens. For example, in the prompt "<begin> A picture of mountains covered by snow <end> <padding> <padding> … <padding>", the text encoder adds <begin> before the prompt tokens, and <end> and <padding> after them. Given that the text encoder operates causally, the beginning token does not hold any semantic information derived from the textual prompt, as it precedes the entire prompt. In contrast, we refer to <end> and <padding> as "summary tokens", since their embeddings encapsulate the semantic information of all preceding tokens.
3.4 Beginning Token Attains High Attention in Non-memorization
From the attention maps of Fig. 1 in Sec. 1, we find that the beginning token shows different patterns for samples with and without memorization. In non-memorization, although the attention distribution is dispersed, the beginning token holds a high attention score, especially in Fig. 1(d). In contrast, in memorization, the beginning token is assigned very little attention. To facilitate the understanding of memorization in the following sections, we first demonstrate the pattern of the beginning token's attention score in this subsection. We plot the averaged attention score of the beginning token across all the diffusion steps in Fig. 3. It can be seen that the beginning token has an attention score higher than 80% in all the diffusion steps. Since the beginning token contains no semantic information, this implies that a small portion of the attention score is sufficient to collect the semantics from the other tokens, especially in the late steps with a small $t$. Besides, in Fig. 3, the attention on the beginning token increases from step $T$ to step $0$. Intuitively, after the main body of the image is generated with a large $t$, the diffusion model mainly focuses on denoising when $t$ is small, which does not need much textual information from the prompt.
4 Cross Attention in Memorization
Cross attention plays an important role in text-to-image diffusion models, where it integrates the guidance of textual conditions into the image generation process. This mechanism allows different segments of the generated image to attend to different tokens in the prompt and to combine them reasonably at a global scope. In this section, we present our findings on memorization via cross attention through empirical studies. The experiments in this section are based on SD v1.4, with the memorized prompts extracted by Webster [32] and 500 non-memorized prompts generated by ChatGPT-4 [1]. In Appendix 0.A and Sec. 6.2, we further show that the findings also generalize to SD v2.0.


4.1 Finding 1: Concentrated Attention Score on Trigger Tokens
Fig. 1 in Sec. 1 demonstrates that the memorized samples tend to allocate most of the attention to the embeddings of specific tokens. In this subsection, we further quantify this finding by defining attention entropy.
Recall that in cross attention, each hidden dimension in $h$ allocates attention scores to the tokens. The sum of attention scores for each hidden dimension in each attention head is 100%. Thus, we can treat it as a discrete probability distribution. Meanwhile, entropy is commonly used to measure the uncertainty and dispersion of probability distributions, so the entropy of attention can measure the dispersion of attention [37, 31]. It is defined as

$\mathrm{Entropy}(t) = -\sum_{i=1}^{N} a_i^{t} \log a_i^{t}$,  (5)

where $N$ is the number of tokens, $t$ is the diffusion step, and $a_i^{t}$ is the average attention score on the $i$-th token at step $t$. Unless otherwise stated, $a_i^{t}$ is averaged across all U-Net layers, attention heads and hidden dimensions. Higher entropy indicates a more dispersed attention distribution.
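A small sketch of Eq. (5), assuming the attention scores have already been averaged over U-Net layers, heads and hidden dimensions so that each row of `attn` is a distribution over tokens at one diffusion step.

```python
import torch

def attention_entropy(attn, eps=1e-12):
    """attn: (num_steps, num_tokens); each row sums to 1. Returns entropy per step."""
    return -(attn * (attn + eps).log()).sum(dim=-1)
```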
In Sec. 3.4, we find that the cross-attention score of non-memorization gradually concentrates on the beginning token. In contrast, as shown in Fig. 1, for memorized prompts, more attention is concentrated on the embeddings of specific prompt tokens and summary tokens throughout all the steps of the generation process; we refer to these as trigger tokens. Meanwhile, the memorized samples shift attention away from the beginning token, which leads to a more dispersed attention distribution. Therefore, we expect the attention entropy to be low for non-memorization samples and high for memorized ones.
We plot the entropy at different diffusion steps in Fig. 3 to verify these intuitions. For non-memorization samples, the entropy decreases quickly from step $T$ to step $0$, which is consistent with the attention gradually concentrating on the beginning token. For memorized samples, on the other hand, the entropy becomes higher than in non-memorization, especially when $t$ is small, which also aligns with our observation that the model places high averaged attention scores on trigger tokens. One exception is that, when $t$ is large, e.g., in Fig. 1(c), the non-memorization prompts allocate higher attention to prompt and summary tokens than the memorization case, resulting in higher entropy in Fig. 3. To understand this, note that although non-memorization in general places the highest weight on the meaningless beginning token, the model still needs to correctly understand the input prompt to generate proper images. As a result, for large $t$, the model focuses on collecting semantic information from different tokens, and the corresponding entropy is high. In contrast, with memorization, the model only needs to focus on the trigger tokens and does not need to further collect information from other tokens, resulting in lower entropy for larger $t$.
With the quantification of the attention score via entropy, we verify our finding about the disproportionate attention on trigger tokens.
4.2 Finding 2: Different Types of Memorization Focus on Different Types of Tokens
Existing studies suggest that there exist different types of memorized prompts in diffusion models [32]. In this subsection, we discuss cross attention behaviors on different types of memorized prompts, and provide a deeper understanding about the connection between cross attention and memorization.
In literature, Webster [32] provides the dataset of memorized prompts and divides them into three types:
• Matching memorization (MM): one memorized prompt generates an image that exactly matches the original paired training image.
• Retrieval memorization (RM): one memorized prompt generates images that match a subset of training images.
• Template memorization (TM): a group of prompts generates images highly aligned with training images, but possibly with variations in colors or styles.






We show the sum of attention scores on prompt tokens and summary tokens respectively in Fig. 4 to examine to what extent the model focuses on these two types of tokens, and we also provide prompt examples of each type in Fig. 4. We make the following two observations from Fig. 4.

First, MM pays more attention to summary tokens compared with RM and TM. In MM, the memorized prompts are totally different from each other. Since the summary tokens contain the semantics of the whole sentence, overfitting to the summary tokens makes it easier to memorize the unique prompts and connect them with the paired training images. In contrast, for RM and TM, the memorized prompts share the same sentence template and a few identical tokens within a group of prompts. The model overfits these shared tokens and overlooks the differences.

Second, all three types of memorization show a significantly slower reduction of the summary-token attention score than non-memorization from step $T$ to step $0$. This implies that non-memorization shifts away from the summary tokens faster. We conjecture that this is because different parts of a non-memorized generated image focus on different semantics of the prompt rather than on the summary of the prompt. For example, in the prompt "Two dogs playing on the grass", some pixels focus on "dogs" and some focus on "grass". But memorization does not distinguish prompt tokens from summary tokens in this way, since the overfitting of trigger tokens can simply guide every pixel in the image. This characteristic can distinguish between memorized and non-memorized generations.
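As a simple illustration of the quantity examined in Fig. 4, the sketch below sums the averaged attention over the prompt-token and summary-token positions per diffusion step; the index lists are assumed to be derived from the tokenizer output (<end> and <padding> positions for summary tokens).

```python
import torch

def token_group_attention(attn, prompt_idx, summary_idx):
    """attn: (num_steps, num_tokens) averaged attention scores."""
    prompt_sum = attn[:, prompt_idx].sum(dim=-1)    # attention mass on prompt tokens per step
    summary_sum = attn[:, summary_idx].sum(dim=-1)  # attention mass on summary tokens per step
    return prompt_sum, summary_sum
```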
4.3 Finding 3: Concentration is More Active in Certain U-Net Layers
According to Finding 1 in Sec. 4.1, the entropy separates memorization from non-memorization better at step $0$ than at step $T$. This means the whole generation process is involved, since step $0$ is the last step of the generation. However, in this subsection, we find that different U-Net layers behave differently for memorization, and in some active U-Net layers memorization can already be distinguished at the first step, i.e., when $t = T$.

In SD, there are 16 U-Net layers, including down-sampling, middle and up-sampling layers. The cross-attention mechanism has a different set of multi-head attention for each layer; these sets collect different semantic information and feed it into the corresponding layer. We find that the concentration on special tokens is not uniform across layers. We show the entropy of each layer at the first diffusion step, i.e., $t = T$, in Fig. 5. At the first diffusion step, the entropy distributions of distinct U-Net layers show different degrees of overlap between memorization and non-memorization. The fourth and fifteenth layers have a clearer separation and can thus distinguish memorization from non-memorization better. This shows the potential to detect memorized samples using only the first diffusion step.
5 Detection and Mitigation
Building upon the above findings, we propose methods to detect and mitigate memorization effect via cross attention. Our methods are superior to previous ones [5, 36] since we require almost no extra computation and will not compromise the speed of training or inference, while preserving the quality of generation.
5.1 Memorization Detection
Findings 1 to 3 in Section 4 demonstrate the different attention distributions between memorization and non-memorization. In this subsection, we introduce two metrics guided by these findings to detect memorization.
The first metric is designed based on Finding 1 and Finding 2 as:
$D_1 = \frac{1}{k}\sum_{t=1}^{k}\big(\mathrm{Entropy}(t) + E_s(t)\big)$,  (6)

where $E_s(t) = -\sum_{i=1}^{N_s} a_{s_i}^{t} \log a_{s_i}^{t}$ is the entropy calculated only on the $N_s$ summary tokens, and $k$ is the number of generation steps used in calculating $D_1$. For SD, we use the last $k$ steps of the reverse diffusion process to calculate $D_1$. The first term is inspired by Finding 1: trigger tokens cause higher entropy than non-memorization, especially when $t$ is close to $0$. The second term is from Finding 2: the sum of attention scores on summary tokens in memorization reduces more slowly than in non-memorization. In Eq. (6), we use $E_s(t)$ in place of the sum to make its numerical values comparable with the first term; otherwise it becomes a negligible term. In Appendix 0.B, we show that the sum and $E_s(t)$ lead to conclusions consistent with Finding 2.
The second metric is motivated by Finding 3, which is defined as
$D_2 = E_{l}(T)$,  (7)

where $E_{l}(T)$ is the attention entropy of the $l$-th U-Net layer (averaged over heads and hidden dimensions) at the first step of the reverse diffusion process, i.e., $t = T$. The major difference between the two metrics is that the second metric only requires the first step of the diffusion process. Although the second metric requires an additional hyper-parameter $l$ (the choice of the active layer), it allows model builders to detect memorization instantly, which can save time and computation before adopting follow-up processes such as mitigation.
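The sketch below computes the two detection scores as reconstructed in Eqs. (6) and (7); the exact normalization and the choice of the active layer in the original implementation may differ, so treat this as an assumption-laden illustration.

```python
import torch

def entropy(p, eps=1e-12):
    return -(p * (p + eps).log()).sum(dim=-1)

def d1_score(attn_steps, summary_idx, k):
    """attn_steps: (num_steps, num_tokens) averaged attention, ordered from t=T down to t=1."""
    last_k = attn_steps[-k:]                 # the last k reverse-diffusion steps
    e_all = entropy(last_k)                  # overall entropy, Finding 1
    e_sum = entropy(last_k[:, summary_idx])  # summary-token entropy, Finding 2
    return (e_all + e_sum).mean()

def d2_score(attn_layers_first_step, layer):
    """attn_layers_first_step: (num_layers, num_tokens) averaged attention at t = T."""
    return entropy(attn_layers_first_step[layer])  # entropy of one chosen active layer
```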
5.2 Inference-time Mitigation
In this subsection, we propose a method to mitigate memorization at the inference stage by reducing the weight on trigger tokens. This method consists of masking out the summary tokens and increasing the logits of the beginning token.
First, from Finding 2, we know that memorization is highly related to the summary tokens, while attention score on summary tokens drops quickly in non-memorization. Consequently, to control memorization, we mask out these summary tokens.
Second, based on Finding 1, trigger tokens tend to receive the largest attention scores apart from the beginning token. As a result, we conversely enlarge the attention score of the beginning token by adjusting the input logits of the softmax operator in the cross attention. To be specific, denote the original input logits as $s = \big(s^{(0)}, s^{(1)}, \dots, s^{(N-1)}\big)$, where $s^{(i)}$ is the logit of the $i$-th token; the re-scaled logit vector is

$\tilde{s} = \big(c \cdot s^{(0)}, s^{(1)}, \dots, s^{(N-1)}\big)$,  (8)

where $c$ is a factor applied to the beginning token. When $c > 1$, the attention score of the beginning token is increased and the attention scores of the other tokens become smaller.
While our primary interest in $c$ is to enlarge the attention score of the beginning token, we also note that the re-scaling operation in Eq. (8) gives a larger reduction to the tokens with larger attention scores. To see this, taking the gradient of each attention score $a^{(i)}$ ($i \neq 0$) with respect to $c$, we obtain

$\frac{\partial a^{(i)}}{\partial c} = -\, a^{(i)}\, a^{(0)}\, s^{(0)}, \quad i \neq 0$.  (9)

Since $a^{(0)} s^{(0)}$ is the same for every $i \neq 0$, the reduction is proportional to $a^{(i)}$, which means a token with larger attention will be reduced more.
By combining the above two strategies, our final adjusted logits become

$\tilde{s}^{(i)} = \begin{cases} c \cdot s^{(0)}, & i = 0, \\ -\infty, & i \in \mathcal{S}, \\ s^{(i)}, & \text{otherwise}, \end{cases}$  (10)

where $\mathcal{S}$ denotes the indices of the summary tokens, whose logits are masked out.
This method only requires applying a mask and a re-scaling factor to the logits, and does not include any extra computation. Nonetheless, it can effectively mitigate memorization during inference and has little impact on generation quality.
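A minimal sketch of the adjusted logits in Eq. (10), assuming token index 0 is the beginning token and `summary_idx` lists the <end>/<padding> positions; it simply masks the summary tokens and re-scales the beginning token's logit before the softmax. In a real pipeline this adjustment would be applied inside every cross-attention layer at inference time.

```python
import torch
import torch.nn.functional as F

def mitigated_attention(logits, summary_idx, c=1.2):
    """logits: (..., num_tokens) cross-attention logits before the softmax."""
    adjusted = logits.clone()
    adjusted[..., summary_idx] = float("-inf")  # mask out summary tokens
    adjusted[..., 0] = c * adjusted[..., 0]     # re-scale the beginning token's logit, Eq. (8)
    return F.softmax(adjusted, dim=-1)
```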
Remark. In the experiments in Fig. 6(a), we show that, although detection has almost no extra computation, it is not necessary to detect before mitigation since the mitigation does not compromise the generation quality for both memorization and non-memorization.
5.3 Training-time Mitigation
During training, we can mitigate memorization by removing from the mini-batch the samples whose attention entropy is higher than a pre-defined threshold. This idea is similar to [36], which filters out samples whose diffusion-model outputs differ significantly from those of the empty prompt. However, the method in [36] uses an additional inference operation on the empty prompt, which requires more computation. In contrast, our approach eliminates the need for extra inference. We only compute the attention entropy as defined by Eq. (5) and remove the samples with high entropy, which incurs a negligible computational cost. This alone suffices to effectively reduce memorization.
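A sketch of the mini-batch filtering described above, assuming the per-sample attention entropy (Eq. (5)) is available from the forward pass of the training step; the threshold selection is discussed in Sec. 0.C.3.

```python
import torch

def filter_batch(latents, text_embs, attn, threshold):
    """attn: (batch, num_tokens) per-sample averaged attention scores."""
    ent = -(attn * (attn + 1e-12).log()).sum(dim=-1)  # attention entropy per sample
    keep = ent <= threshold                           # drop suspected memorized samples
    return latents[keep], text_embs[keep]
```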
6 Experiment
In this section, we provide experiments to further support Finding 1 and to demonstrate the effectiveness of the proposed methods. In Sec. 6.2 and Sec. 6.3, we evaluate our performance on detection and mitigation, respectively. In Sec. 6.4, we conduct an ablation study on the two components of inference-time mitigation. Due to space limitations, additional experiments and details, including a further verification of memorization's reliance on the embeddings of trigger tokens, a validation of our findings on SD v2.0, and the selection of the pre-defined threshold for training-time mitigation, are elaborated in Appendices 0.A and 0.C.
6.1 Experiment Settings
Diffusion models and datasets. We conduct the experiments using SD v1.4 and SD v2.0. We use the dataset of memorized prompts extracted by Webster [32] as memorized samples, and use 500 prompts generated by ChatGPT-4 [1] as non-memorized samples. For the fine-tuning data in training-time mitigation, since duplicated data can cause memorization, we use 200 text-image pairs duplicated 50 times as memorized data and 20,000 captioned images from LAION [26] as non-memorized data, following the procedure of [36].
Baselines. For detection, we use two baselines for comparison. The first is the method from Carlini et al. [5]. They observe that the model generates the same image for a memorized prompt under different random seeds, so they propose detecting memorization by the distance between images generated with different seeds. The second is from Wen et al. [36]. Their detection method is built upon the observation that the difference between the model output for a memorized prompt and for the empty prompt is larger than that for a non-memorized prompt. To stabilize the method, they repeat the generation several times. For mitigation, we compare with the method from Wen et al. [36]. More details can be found in Sec. 0.C.1.
Evaluation metrics. The detection metrics include the area under the receiver operating characteristic curve (AUROC), the true positive rate at the false positive rate of 3% (TPR@3%FPR) and the time cost. The mitigation metrics include Similarity Score [20, 28] that measures the degree of memorization by the similarity between generated images and original training images (higher similarity means more severe memorization); Fréchet Inception Distance (FID) [9] that measures the generation quality in realism and diversity (lower FID indicates better quality); CLIP score [21] that measures fine-tuning performance (higher CLIP score means better fine-tuning performance); and the time cost. Details can be found at Sec. 0.C.2.
6.2 Memorization Detection
Table 1: Detection performance. Each cell reports AUROC / TPR@3%FPR / Time (seconds). The setting column gives the number of inference steps and the number of generations used by each method.

| Method | Setting | SD v1.4 | SD v2.0 |
|---|---|---|---|
| Carlini et al. [5] | 1, 32 | 0.4982 / 0.020 / 3.724 | 0.5433 / 0.034 / 8.799 |
| Carlini et al. [5] | 50, 4 | 0.9357 / 0.716 / 7.006 | 0.5305 / 0.280 / 16.901 |
| Carlini et al. [5] | 50, 16 | 0.9983 / 0.980 / 27.983 | 0.9304 / 0.428 / 67.565 |
| Carlini et al. [5] | 50, 32 | 0.9977 / 0.976 / 56.016 | 0.9336 / 0.428 / 135.770 |
| Wen et al. [36] | 1, 1 | 0.9662 / 0.835 / 0.132 | 0.9215 / 0.580 / 0.315 |
| Wen et al. [36] | 1, 4 | 0.9967 / 0.977 / 0.521 | 0.9610 / 0.790 / 1.273 |
| Wen et al. [36] | 1, 32 | 0.9992 / 0.994 / 4.133 | 0.9744 / 0.863 / 10.531 |
| Wen et al. [36] | 50, 1 | 0.9957 / 0.988 / 2.582 | 0.9831 / 0.963 / 6.072 |
| Wen et al. [36] | 50, 4 | 0.9968 / 0.994 / 10.339 | 0.9886 / 0.973 / 24.267 |
| Wen et al. [36] | 50, 32 | 0.9972 / 0.997 / 41.378 | 0.9890 / 0.982 / 97.336 |
| Ours - $D_1$ | 50, 1 | 0.9998 / 0.997 / 1.745 | 0.9783 / 0.881 / 4.220 |
| Ours - $D_2$ | 1, 1 | 0.9933 / 0.980 / 0.116 | 0.9968 / 0.977 / 0.281 |
In this subsection, we present the detection performance of the two proposed metrics, $D_1$ and $D_2$, in both accuracy and efficiency in Tab. 1.
Comparing our method $D_1$ with the benchmarks, it costs 1.745 seconds on SD v1.4 and 4.220 seconds on SD v2.0, which is more efficient than most of the baselines except the two fastest settings, (1, 1) and (1, 4), of Wen et al. [36]. However, compared to these fast settings of Wen et al. [36], our AUROC and TPR are much better, especially on SD v1.4, where $D_1$ achieves an almost perfect AUROC of 0.9998 and TPR of 0.997.
In terms of our method $D_2$, it is the most efficient one and achieves good performance, with AUROC higher than 0.993 and TPR higher than 0.977 for both SD v1.4 and v2.0. On SD v2.0, $D_2$ is the best method in both AUROC and detection speed; it is even the only method that achieves an AUROC higher than 0.99 on SD v2.0. On SD v1.4, although some settings of Wen et al. [36] achieve better AUROC and TPR than $D_2$, their time cost is 20 to 400 times that of $D_2$.
6.3 Memorization Mitigation




In this subsection, we show that our mitigation methods can effectively reduce memorization without sacrificing speed or generation quality.
For inference-time mitigation, Fig. 6(a) presents the similarity score and FID with and without mitigation. The FID is calculated between the generated images (including both memorized prompts from [32] and 500 non-memorized prompts) and a subset of LAION [26] with 10,000 images. The Similarity Score is calculated between the images generated from memorized prompts and the training images paired with those prompts. Without mitigation, the Similarity Score is 0.7. Our method significantly reduces it from 0.7 to the range of 0.25 to 0.3 (by setting $c$ from 1.1 to 1.25). Interestingly, FID is also improved after mitigation. This is likely attributable to the fact that RM and TM tend to produce memorized images that are similar to each other; our mitigation introduces greater diversity, thereby contributing to a reduced FID. Although the method of Wen et al. [36] has a similar mitigation effect, it is slower than ours by 20%, as shown in Fig. 6(b). Notably, our method maintains the same computational efficiency as performing inference without any mitigation.
For training-time mitigation, we compare the mitigation effect and the fine-tuning performance with the baseline method from Wen et al. [36] in Fig. 6(c). The fine-tuning performance is measured by the CLIP score, which estimates the alignment between prompts and images; a higher CLIP score after fine-tuning indicates that the fine-tuned model generates images that align better with the prompts. Both our method and the baseline effectively reduce the similarity to around 0.2, with negligible impact on the CLIP score. However, the method of Wen et al. [36] incurs a 10% increase in training time. In contrast, our approach maintains the same training duration as the scenario without mitigation, as shown in Fig. 6(d).
6.4 Ablation Study on Inference-Time Mitigation






In this subsection, we conduct an ablation study on the two components of inference-time mitigation, i.e., logits re-scaling and the summary-token mask. We present a set of generated examples in Fig. 7. Our full method is shown in Fig. 7(f), with summary tokens masked out and the beginning token's logit re-scaled by $c$. It effectively mitigates memorization compared with no mitigation in Fig. 7(b). However, if we apply only the attention mask or only the logits re-scaling, as shown in Fig. 7(c) and Fig. 7(d), some memorized samples cannot be eliminated. If we increase $c$ to 2, although the memorization can be prevented, it might cause the loss of some semantic information, such as the identities and appearance of the specific celebrities in Fig. 7(e). Due to the difference in formulation between Finding 1 and Finding 2, the two components of our inference-time mitigation method may handle different memorized samples, as shown above. On the other hand, when applied together, they can mitigate more memorized samples without losing information from the prompts. This ablation study shows the benefits of combining the two components of our method.
7 Conclusion
In this work, we present a novel view to understand the issue of memorization in diffusion models, focusing on its link with cross-attention mechanisms. We observe that cross attention often concentrates on the embeddings of trigger tokens, leading to memorization of specific training images. We further explore the inherent characteristics of cross attention and provide various insightful findings. Based on these findings, we introduce new methods to detect and mitigate the memorization without affecting the quality of the generated images and the models’ speed during both training and inference. Experimental results support our findings and demonstrate the effectiveness of our proposed methods.
References
- [1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
- [2] Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramer, F., Zhang, C.: Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646 (2022)
- [3] Carlini, N., Jagielski, M., Zhang, C., Papernot, N., Terzis, A., Tramer, F.: The privacy onion effect: Memorization is relative. Advances in Neural Information Processing Systems 35, 13263–13276 (2022)
- [4] Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., et al.: Extracting training data from large language models. In: 30th USENIX Security Symposium (USENIX Security 21). pp. 2633–2650 (2021)
- [5] Carlini, N., Hayes, J., Nasr, M., Jagielski, M., Sehwag, V., Tramer, F., Balle, B., Ippolito, D., Wallace, E.: Extracting training data from diffusion models. In: 32nd USENIX Security Symposium (USENIX Security 23). pp. 5253–5270 (2023)
- [6] Chefer, H., Alaluf, Y., Vinker, Y., Wolf, L., Cohen-Or, D.: Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG) 42(4), 1–10 (2023)
- [7] Daras, G., Shah, K., Dagan, Y., Gollakota, A., Dimakis, A., Klivans, A.: Ambient diffusion: Learning clean distributions from corrupted data. Advances in Neural Information Processing Systems 36 (2024)
- [8] Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-or, D.: Prompt-to-prompt image editing with cross-attention control. In: The Eleventh International Conference on Learning Representations (2022)
- [9] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30 (2017)
- [10] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)
- [11] Jiang, H.H., Brown, L., Cheng, J., Khan, M., Gupta, A., Workman, D., Hanna, A., Flowers, J., Gebru, T.: Ai art and its impact on artists. In: Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society. pp. 363–374 (2023)
- [12] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
- [13] Koh, J.Y., Baldridge, J., Lee, H., Yang, Y.: Text-to-image generation grounded by fine-grained user attention. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 237–246 (2021)
- [14] Lee, K.H., Chen, X., Hua, G., Hu, H., He, X.: Stacked cross attention for image-text matching. In: Proceedings of the European conference on computer vision (ECCV). pp. 201–216 (2018)
- [15] Liu, S., Zhang, Y., Li, W., Lin, Z., Jia, J.: Video-p2p: Video editing with cross-attention control. arXiv preprint arXiv:2303.04761 (2023)
- [16] Ma, W.D.K., Lewis, J., Kleijn, W.B., Leung, T.: Directed diffusion: Direct control of object placement through attention guidance. arXiv preprint arXiv:2302.13153 (2023)
- [17] Mireshghallah, F., Uniyal, A., Wang, T., Evans, D., Berg-Kirkpatrick, T.: Memorization in nlp fine-tuning methods. arXiv preprint arXiv:2205.12506 (2022)
- [18] Naseh, A., Roh, J., Houmansadr, A.: Memory triggers: Unveiling memorization in text-to-image generative models through word-level duplication. arXiv preprint arXiv:2312.03692 (2023)
- [19] Naseh, A., Roh, J., Houmansadr, A.: Understanding (un) intended memorization in text-to-image generative models. arXiv preprint arXiv:2312.07550 (2023)
- [20] Pizzi, E., Roy, S.D., Ravindra, S.N., Goyal, P., Douze, M.: A self-supervised descriptor for image copy detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14532–14542 (2022)
- [21] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
- [22] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 1(2), 3 (2022)
- [23] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
- [24] Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22500–22510 (2023)
- [25] Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems 35, 36479–36494 (2022)
- [26] Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems 35, 25278–25294 (2022)
- [27] Somepalli, G., Singla, V., Goldblum, M., Geiping, J., Goldstein, T.: Diffusion art or digital forgery? investigating data replication in diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6048–6058 (2023)
- [28] Somepalli, G., Singla, V., Goldblum, M., Geiping, J., Goldstein, T.: Understanding and mitigating copying in diffusion models. arXiv preprint arXiv:2305.20086 (2023)
- [29] Somepalli, G., Singla, V., Goldblum, M., Geiping, J., Goldstein, T.: Understanding and mitigating copying in diffusion models. Advances in Neural Information Processing Systems 36 (2024)
- [30] Tang, R., Liu, L., Pandey, A., Jiang, Z., Yang, G., Kumar, K., Stenetorp, P., Lin, J., Ture, F.: What the daam: Interpreting stable diffusion using cross attention. arXiv preprint arXiv:2210.04885 (2022)
- [31] Vig, J., Belinkov, Y.: Analyzing the structure of attention in a transformer language model. arXiv preprint arXiv:1906.04284 (2019)
- [32] Webster, R.: A reproducible extraction of training images from diffusion models. arXiv preprint arXiv:2305.08694 (2023)
- [33] Webster, R., Rabin, J., Simon, L., Jurie, F.: On the de-duplication of laion-2b. arXiv preprint arXiv:2303.12733 (2023)
- [34] Wei, X., Zhang, T., Li, Y., Zhang, Y., Wu, F.: Multi-modality cross attention network for image and sentence matching. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10941–10950 (2020)
- [35] Wei, Y., Zhang, Y., Ji, Z., Bai, J., Zhang, L., Zuo, W.: Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation. arXiv preprint arXiv:2302.13848 (2023)
- [36] Wen, Y., Liu, Y., Chen, C., Lyu, L.: Detecting, explaining, and mitigating memorization in diffusion models. In: The Twelfth International Conference on Learning Representations (2024)
- [37] Zhai, S., Likhomanenko, T., Littwin, E., Busbridge, D., Ramapuram, J., Zhang, Y., Gu, J., Susskind, J.M.: Stabilizing transformer training by preventing attention entropy collapse. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) Proceedings of the 40th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 202, pp. 40770–40803. PMLR (23–29 Jul 2023), https://proceedings.mlr.press/v202/zhai23a.html
- [38] Zhang, C., Ippolito, D., Lee, K., Jagielski, M., Tramèr, F., Carlini, N.: Counterfactual memorization in neural language models. Advances in Neural Information Processing Systems 36 (2024)
Appendix 0.A Additional Experiments
0.A.1 Further Verification of Memorization’s Reliance on the Embeddings of Trigger Tokens
In this subsection, we provide the experiment in Fig. 8(a) to further verify that memorization relies on the existence of trigger token embeddings. While Sec. 4 compares the behavior of the attention score with and without memorization, in Fig. 8(a) we add a mask to the attention logits to partially remove token embeddings and examine how the memorization changes.
To be specific, we first generate images with a group of memorized prompts of SD v1.4 and a group of non-memorized prompts, and save the random seed of this generation. Then we use the same prompts and seed to generate more rounds. For each round, we remove non-<begin> token embeddings one by one, from small attention scores to large attention scores, by masking out the logits of these tokens. Note that this does not influence the embeddings in $K$ and $V$; it only changes whether the embedding of a token is used or not.
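A sketch of this token-removal procedure, assuming `avg_attn` holds the averaged attention score per token for a prompt; the returned mask marks the lowest-attention non-<begin> tokens whose logits are to be masked out.

```python
import torch

def removal_mask(avg_attn, ratio):
    """avg_attn: (num_tokens,) averaged attention per token; False = token removed."""
    order = torch.argsort(avg_attn[1:]) + 1        # non-<begin> tokens, ascending attention
    n_remove = int(ratio * order.numel())
    mask = torch.ones_like(avg_attn, dtype=torch.bool)
    mask[order[:n_remove]] = False                 # remove the smallest-attention tokens first
    return mask
```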
Since memorization relies on trigger token embeddings and trigger tokens always hold large attention scores, we expect the generated images not to change much when only tokens with small attention are removed. This conjecture aligns with our experimental results. In Fig. 8(a), we depict the similarity score between the generation with removed tokens and the first-round generation. As the removal ratio increases, the similarity score of non-memorization drops much faster at the beginning, while memorized prompts show almost no reduction even when 60% of the low-attention tokens are removed. In contrast, when the removal ratio is larger than 90%, the similarity score of memorization drops much faster than non-memorization, which means more trigger tokens are removed at this stage. Thus, we further confirm our observation that memorization relies on the existence of trigger token embeddings.
Remark. The reduction of similarity does not mean the reduction of generation quality. It only indicates the same seed can produce dissimilar images from the same prompt. In Fig. 8(b), we show that FID of memorization samples is even improved. This is similar to the reason we mentioned in Sec. 6.3 that when memorization effect is mitigated, the generated images can become more diverse, which brings a reduced FID.


0.A.2 Validation of Findings on SD v2.0
In this subsection, we conduct experiments to validate the generalization of our findings to SD v2.0.
0.A.2.1 Finding 1.

We plot the entropy of all the diffusion steps of SD v2.0 in Fig. 9. It shows that, for memorized samples, the entropy is much higher than for non-memorization, which is consistent with Finding 1 that the trigger tokens divert attention from the beginning token, resulting in a more dispersed attention distribution. Compared with SD v1.4, SD v2.0 has higher entropy even when $t$ is large. This means the concentration of attention on trigger tokens is more obvious than in SD v1.4 when $t$ is large.
0.A.2.2 Finding 2.

There are two observations in Finding 2. One observation is for different types of memorization and the other observation is focusing on a shared feature of all the types. Due to the fact that SD v2.0 is trained on a de-duplicated dataset, we can only observe TM and very few RM in it. Thus, we only validate the shared observation that the non-memorization will have a faster reduction of attention scores on summary tokens. We plot the attention sum of prompt tokens and summary tokens in Fig. 10. It can be observed that the attention sum on the summary tokens of non-memorization samples reduces much faster than that of memorization samples. This phenomenon is used in the detection of memorization of SD v2.0.
0.A.2.3 Finding 3.

To verify this finding on SD v2.0, we plot the entropy of each U-Net layer of SD v2.0 in Fig. 11. We can see that the degree of overlap between memorization and non-memorization still varies across layers. The third and fourth layers show a clear separation.
Appendix 0.B Entropy of Summary Tokens






In Sec. 5.1, we use $E_s(t)$ to replace the attention sum so that its numerical values are comparable with the first term in Eq. (6). In this section, we show that $E_s(t)$ leads to conclusions consistent with Finding 2 and can therefore replace the sum.
In Fig. 12, we show both the attention sum and the entropy for the three types of memorization. For the first observation of Finding 2, we can see that for MM, the attention sum on summary tokens is larger than on prompt tokens, while the attention sums on summary tokens of RM and TM are lower than on prompt tokens. This difference between MM and the other two types matches the explanation in Sec. 4.2: MM prompts tend to be unique, and paying higher attention to the summary tokens makes it easier to memorize them. As for the entropy, MM has a much higher entropy on summary tokens than on prompt tokens, while in RM and TM the summary-token entropy is similar to the prompt-token entropy. Although this differs from the attention sum, where the summary tokens in RM and TM receive less than the prompt tokens, it is enough to show that the summary-token entropy exhibits different patterns between MM and the other two types of memorization.

For the second observation of Finding 2, we can see that, in all three types of memorization, both the attention sum and the entropy of summary tokens reduce faster for non-memorization than for memorization. This property is used in the first detection metric in Eq. (6).

In summary, the summary-token entropy is consistent with the attention sum on the two observations in Finding 2, so it is reasonable to replace the sum with $E_s(t)$ in Eq. (6) to make the numerical values comparable with the first term.
Appendix 0.C Details of Experimental Settings
0.C.1 Baselines
0.C.1.1 Detection.
For the baseline method from Carlini et al. [5], we generate multiple samples with different random seeds for each prompt and calculate the pairwise distances between them. We use the smallest distance among the pairs as the measurement. Since memorized prompts produce the same result regardless of the random seed, the distance between generated images should be small for memorization, while non-memorized samples should have larger distances. The weakness of this method is that it requires repeated generation for one prompt, which is slow.
For the baseline method from Wen et al. [36], we compare the difference of the model output between the suspect prompt and the empty prompt. To stabilize the results, samples are generated over multiple diffusion steps, and the averaged results are used for detection. This method also slows down generation because it needs an additional inference operation on the empty prompt and may use repeated generation for stabilized results.
0.C.1.2 Mitigation.
Wen et al. [36] provide two variants of mitigation methods for inference and training. For inference-time mitigation, they update the initial token embeddings of the prompt to make the model output closer to the empty prompt. For training-time mitigation, they set up a threshold to remove the samples with high difference from empty prompt. This also requires additional inference operation on the empty prompt. In Sec. 0.C.3, we compare how to select the threshold between their method and ours.
0.C.2 Metrics
0.C.2.1 Detection.
We calculate the time cost on a single A5000 GPU with a batch size of 32 for SD v1.4 and 16 for SD v2.0.
0.C.2.2 Mitigation.
Similarity Score is calculated between the generated images and the training images paired with memorized prompts. For inference-time mitigation, the FID is calculated between generated images (including both memorized prompts from [32] and 500 non-memorized prompts) and a subset of LAION [26] with 10,000 images. For training-time mitigation, CLIP score is calculated on 100 memorized prompts in training data, 200 non-memorized prompts in training data and 200 generated prompts by ChatGPT-4.
0.C.3 Training-time mitigation threshold
For both the baseline method and ours, we need a threshold to remove samples from the training mini-batch. Based on the extracted memorized samples in [32], we use the threshold with the best F1 score for the baseline method. To control the strength as shown in Fig. 6(c), we adjust the threshold by adding a number in the range of -2 to 1. For ours, since we have noticed that different diffusion steps tend to have different entropy, we use each step's 5th percentile of the entropy of the memorized samples in [32] as the threshold for that diffusion step. This removes samples more accurately at each step. For larger $t$, the threshold is larger since the entropy is larger in the early stage of the reverse diffusion process; conversely, for smaller $t$, the threshold is smaller. To control the strength, we adjust the threshold by multiplying it with a factor. In our experiment, we set the factor in the range of 1.1 to 1.35.
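A sketch of the per-step threshold selection described above, assuming `mem_entropy` stores the attention entropy of the known memorized prompts from [32] at every diffusion step.

```python
import numpy as np

def step_thresholds(mem_entropy, factor=1.2):
    """mem_entropy: (num_memorized_prompts, num_steps). Returns one threshold per step."""
    base = np.percentile(mem_entropy, 5, axis=0)  # 5th percentile of memorized entropy per step
    return factor * base                          # factor in [1.1, 1.35] controls the strength
```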