Learning to Customize Text-to-Image Diffusion In Diverse Context

Taewook Kim, Wei Chen, Qiang Qiu
Department of Electrical and Computer Engineering
Purdue University
{kim3803,chen2732,qqiu}@purdue.edu

Abstract

Most text-to-image customization techniques fine-tune models on a small set of personal concept images captured in minimal contexts. This often results in the model becoming overfitted to these training images and unable to generalize to new contexts in future text prompts. Existing customization methods are built on the success of effectively representing personal concepts as textual embeddings. Thus, in this work, we resort to diversifying the context of these personal concepts solely within the textual space by simply creating a contextually rich set of text prompts, together with a widely used self-supervised learning objective. Surprisingly, this straightforward and cost-effective method significantly improves semantic alignment in the textual space, and this effect further extends to the image space, resulting in higher prompt fidelity for generated images. Additionally, our approach does not require any architectural modifications, making it highly compatible with existing text-to-image customization methods. We demonstrate the broad applicability of our approach by combining it with four different baseline methods, achieving notable CLIP score improvements.

Refer to caption — Figure 1: Comparison across various text-to-image models before and after integrating our method. The proposed approach consistently enhances prompt fidelity in generation results.

1 Introduction

Diffusion-based generative models (Ho et al., 2020; Song et al., 2020a; Dhariwal & Nichol, 2021; Song et al., 2020b) have made significant progress in image synthesis, achieving improved diversity and expressiveness in generated outputs. Extending these breakthroughs, diffusion-based text-to-image models (Rombach et al., 2022; Podell et al., 2023; Balaji et al., 2022; Saharia et al., 2022; Xue et al., 2024) that leverage large-scale text-image pairs (Schuhmann et al., 2021) have demonstrated impressive capabilities in translating the text into visual content.

More recently, leveraging the strong prior knowledge acquired by the pretrained text-to-image generative models, numerous approaches have been proposed to fine-tune the models for customization to specific concepts (Gal et al., 2022; Ruiz et al., 2023; Kumari et al., 2023; Voynov et al., 2023; Avrahami et al., 2023). Typically, these methods use 4-5 images containing personal concepts to obtain token embedding aligned with the given images, which are then integrated into the novel text prompts for image generation. While demonstrating its potential, existing models often suffer from the concept overfitting when fine-tuned on a small set of images with limited contexts. This overfitting often causes the customized model to generate images that are highly similar to the training images, and fail to faithfully follow the text prompts during inference (Figure 1).

Our study indicates that introducing diverse contexts during model fine-tuning can mitigate the concept overfitting (Zeng et al., 2024). As diversifying text-image tuning pairs can be costly and often impractical, in this paper, we propose to diversify the context of personal concepts solely within the textual space, by first simply constructing a set of contextually diverse text prompts with concept tokens, nearly at no extra cost (Figure 2, left). As customization aligns the personal concept with a concept token, this proposed approach can be a highly cost-effective way of context diversification. Then, we further adopt a self-supervised learning objective, Masked Language Modeling (MLM) (Devlin et al., 2018), which drives the concept embedding to learn proper relations to its contexts (Figure 2, right).

We later both theoretically and empirically show that adopting the MLM objective with a contextually diverse text prompt set during customization significantly alleviates the concept overfitting, and leads to semantic enhancement in textual representation, which ultimately extends to higher prompt fidelity in image generation. We conduct extensive experiments to demonstrate the effectiveness of our approach.

We summarize our contributions as follows,

•

We propose a highly cost-effective text-to-image customization method that significantly improves context diversification of personal concepts via masked language modeling, leading to higher prompt fidelity in generated images.
•

We theoretically illustrate that the proposed approach effectively helps to mitigate concept overfitting by regularizing the loss of contextual information and learning of diverse contexts.
•

We further empirically show consistent image generation improvements while integrating our approach with four different text-to-image baseline methods, demonstrating its broad applicability.

2 Related Works

Text-to-Image Generation.

The introduction of diffusion models (Ho et al., 2020; Nichol & Dhariwal, 2021; Song et al., 2020a) has paved the way for a series of text-to-image generative models (Saharia et al., 2022; Rombach et al., 2022; Balaji et al., 2022; Ramesh et al., 2021; 2022; Ding et al., 2022) that have achieved significant success. GLIDE (Nichol et al., 2021) demonstrated that using classifier-free guidance (Ho & Salimans, 2022) can enhance both the photorealism and caption alignment of generated images. DALLE-2 (Ramesh et al., 2022) further improved the process by leveraging CLIP (Radford et al., 2021) embeddings to derive an image prior from a text caption, which is then decoded using diffusion models. Stable Diffusion (Rombach et al., 2022) proposed a way to improve the efficiency by applying the diffusion process in the lower dimensional latent space, and SDXL (Podell et al., 2023) has been proposed to make improvements over SD by updating the model architecture. ControlNet (Zhang et al., 2023) proposed to incorporate additional input conditions to improve the controllability of the T2I model using zero-convolutional layers. In our work, we mainly focus on baseline methods that are based on Stable Diffusion.

Personalized Text-to-Image Genearation.

Textual inversion (Gal et al., 2022) pioneered a method to convert personal concept images into token embeddings, enabling the use of tokens for tailored text-to-image generation. DreamBooth (Ruiz et al., 2023) extended TI by fine-tuning the diffusion UNet along with the prior-preservation loss to prevent forgetting of prior concepts. Since the introduction of the pioneering works, a line of work has been proposed to make further improvements. XTI (Voynov et al., 2023) proposed to invert the concept into multiple token embeddings, each specialized for a different layer of the diffusion network. CustomDiffusion (Kumari et al., 2023) proposed to fine-tune the cross-attention layer in diffusion UNet for efficient training. Break-A-Scene (Avrahami et al., 2023) proposed to learn multiple concepts included in the same scene by utilizing masked diffusion loss. There is also a growing interest in developing methods specialized for facial images. (Yuan et al., 2023) constructed a set of basis tokens corresponding to celebrities and optimized their weights to synthesize a given image. (Peng et al., 2024) proposed to augment the concept token embedding by extracting facial features using a facial recognition model. (Shi et al., 2024) have proposed a test-time finetuning-free method where a learnable image encoder is deployed to convert the input images into a textual token. (Chen et al., 2024) also proposed an instant method where an apprentice diffusion model learns to imitate the behaviors of multiple expert models specialized for each concept. Similarly, (Wei et al., 2023) proposed to use a CLIP image encoder to encode personal images and then utilize global and local mapping networks to obtain enhanced representations of the concept. (Chen et al., 2023) proposed a method to avoid the entanglement of identity-irrelevant features by utilizing learnable masks and multi-task training objectives.

3 Preliminaries

3.1 Text-to-Image Generation

We apply our approach to various text-to-image baseline models (Gal et al., 2022; Ruiz et al., 2023; Voynov et al., 2023; Kumari et al., 2023) that are based on Stable Diffusion (SD) (Rombach et al., 2022). SD consists of a CLIP text encoder $\Gamma$ that encodes an input text t into a sequence of input token embeddings, denoted as Tokenize, $\mathbf{P}=\text{Tokenize}(\textbf{t})$ , then outputs corresponding text embedding $\textbf{C}=\Gamma(\mathbf{P})$ using self-attention layers. A Variational Auto Encoder (VAE) of SD $\mathcal{E}$ encodes an image $x$ to a lower dimensional latent $\mathbf{z}=\mathcal{E}(\mathbf{x})$ .

During training, given a timestep $t\sim\text{Uniform}[0,\text{T}-1]$ , a random noise map $\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ is added to the latent map to get a noised latent map $\mathbf{z}_{t}=\alpha_{t}\mathbf{z}+\sigma_{t}\epsilon$ , where $\alpha_{t}$ and $\sigma_{t}$ denote the noise scheduling coefficients. Then, the diffusion U-Net $\epsilon_{\theta}$ is trained to minimize the following objective for denoising,

\displaystyle\mathbb{E}_{\mathbf{C},\epsilon,t,\mathbf{z}}||\epsilon-\epsilon_{\theta}(\mathbf{z}_{t},t,\mathbf{C})||_{2}^{2}.

(1)

3.2 Finetuning for Text-to-Image Customization

Utilizing a text prompt $\widetilde{\textbf{t}}$ that incorporates a concept token [v] (e.g., “a picture of a [v] dog”), the tokenized input embedding is encoded with CLIP text encoder $\widetilde{\textbf{C}}=\Gamma(\widetilde{\textbf{P}})$ . Following the text encoding, the denoising objective is computed as below,

\displaystyle\mathcal{L}_{\text{Custom}}(\mathbf{z}_{t},t,\widetilde{\textbf{C}}):=\mathbb{E}_{\widetilde{\textbf{C}},\epsilon,t,\mathbf{z}}||\epsilon-\epsilon_{\theta}(\mathbf{z}_{t},t,\widetilde{\textbf{C}})||_{2}^{2},

(2)

where $\mathcal{L}_{\text{Custom}}$ denotes the denoising loss utilized for model customization, with $\textbf{z}=\mathcal{E}(\textbf{x})$ encoding an image x sampled from a small set of personal concept images. Depending on the baseline method, a different set of parameters are optimized. For Textual Inversion (TI) (Gal et al., 2022), only the concept token embedding is optimized with respect to Eqn. 2. We propose a method that does not require any architectural modification, hence it is highly compatible with baselines. We demonstrate this applicability by applying our method to different baselines. Unless otherwise specified, we illustrate our method using TI.

4 Method

Text-to-image customization methods (Gal et al., 2022; Voynov et al., 2023; Ruiz et al., 2023; Kumari et al., 2023), typically trained on 4-5 images with limited context, are prone to be overfitted to the training set. To address this issue, we propose contextual diversification solely within the textual space by constructing a context-rich text prompt set. To effectively guide the concept embedding to learn the proper contextual semantics, we adopt masked language modeling (MLM) during customization (Section 4.1 and 4.2), which leads to semantic enhancement in both textual (4.3) and image space (Section 4.4).

4.1 Masked Language Modeling

In order to enhance the concept token embedding with context-rich text prompts, we adopt Masked Language Modeling (MLM) during the model customization. The overall process is illustrated in Figure 3. We elaborate on the corresponding details in the following steps,

(i) A text prompt t, drawn from a contextually diverse prompt set that includes the concept token, e.g., “a picture of [v] dog at Eiffel Tower”, is tokenized and mapped to a prompt embedding,

\textbf{P}=\text{Tokenize({t})},

(3)

where $\textbf{P}\in\mathbb{R}^{L\times d}$ , $L$ is the number of tokens, $d$ is the feature dimension of prompt embeddings.

(ii) A subset of prompt embedding P is randomly selected and those selected tokens are replaced by a mask token embedding $p_{\text{mask}}$ with the probability $\rho_{\text{mask}}$ , yielding $\textbf{P}_{\text{masked}}=\text{RandomMask}(\textbf{P},\rho_{\text{mask}})$ . Then, text embedding is obtained from CLIP text encoder $\Gamma$ ,

\textbf{C}_{\text{masked}}=\Gamma(\textbf{P}_{\text{masked}}),

(4)

where $\textbf{C}_{\text{masked}}=\{c_{i}\}$ , with $c_{i}\in\mathbb{R}^{d}$ denoting one element in token embedding.

(iii) Finally, we predict the label of the masked token $\hat{\mathbf{y}}=\psi(\textbf{C}_{\text{masked}})$ and calculate the MLM loss,

\mathcal{L}_{\text{MLM}}=\mathbb{E}\Big{[}\text{CrossEntropy}(\textbf{y},\hat{\mathbf{y}})\Big{]},

(5)

where $\psi$ denotes the classification network with self-attention layers.

Notably, the attention layer computes the output token embedding O as the linear combination of input embeddings, with the weights determined by the self-attention map $\textbf{A}^{\text{self}}\in\mathbb{R}^{L\times L}$ , implying that the output is the contextualization of the input. For the $i_{m}$ -th output token, i.e., the masked token, it can be formulated as shown below,

\displaystyle\underbrace{\textbf{O}[i_{m},:]}_{\text{Output Mask}}

\displaystyle=\sum_{j=1}^{L}\textbf{A}^{\text{self}}[i_{m},j]\textbf{V}[j,:]=\sum_{\begin{subarray}{c}j=1\\ j\neq j_{*}\end{subarray}}^{L}\textbf{A}^{\text{self}}[i_{m},j]\underbrace{\textbf{V}[j,:]}_{\text{Context}}+\textbf{A}^{\text{self}}[i_{m},j_{*}]\underbrace{\textbf{V}[j_{*},:]}_{\text{Concept}},

(6)

where $j_{*}$ is the index of the concept token, and V is the value matrix.

By optimizing the concept token embedding to minimize $\mathcal{L}_{\text{MLM}}$ , the concept token is guided to learn the diverse context, as the MLM encourages the utilization of surrounding contexts for mask prediction. The overall process is illustrated in Figure 3. We later show that customization with this additional objective results in regularizing the text embeddings from overfitting to the concept token, which eventually leads to semantically enhanced image generation (Section 4.4).

Algorithm 1 Training Procedure of Contextualizer

1: Load parameters

\Gamma

{

\Gamma

: CLIP text}

2: Random initialize

\psi

p_{\text{mask}}

{

\psi

: Contextualizer,

p_{\text{mask}}

: mask embedding}

3: Set

\rho_{\text{mask}}

{

\rho_{\text{mask}}

: masking probability}

4: repeat

5: Sample t from rich prompt set, P=Tokenize(t)

\textbf{P}_{\text{masked}},\textbf{y}=\text{RandomMask}(\textbf{P},\rho_{\text{mask}})

{y: masked token label}

7: Compute

\mathcal{L}_{\text{MLM}}=\mathbb{E}_{\textbf{y},\textbf{P}_{\text{masked}}}\Big{[}\text{CrossEntropy}((\textbf{y},\psi(\Gamma(\textbf{P}_{\text{masked}}))))\Big{]}

8: Gradient descent optimization on

\nabla_{\psi,p_{\text{mask}}}\mathcal{L}_{\text{MLM}}

9: until optimized

4.2 Customization with Diverse Context

Prompt Set Construction. To achieve customization with diverse contexts, we construct a set of context-rich prompts that incorporate the special concept token. Inspired by recent work (Brooks et al., 2023), we leverage a pretrained large language model (OpenAI, 2023; Brown et al., 2020) to minimize the effort in manually crafting a large set of prompts. For this, we query the LLM to generate a list of contexts of different types, e.g., background or subjection variation. For detailed descriptions of the prompt set construction process, refer to the Appendix Section A.2.

Pretraining. Although the CLIP text encoder of SD has the linguistic capability of comprehending text prompts, it is solely trained with contrastive learning objectives (Radford et al., 2021), and does not support MLM. Therefore, before proceeding with fine-tuning for customization, we first pretrain a network, namely, a contextualizer $\psi$ to incorporate the MLM capability. We provide the pretraining procedure of the contextualizer $\psi$ in Algorithm 1. During the pretraining of $\psi$ , the concept token is not involved. We only train the mask embedding and the layers of the contextualizer. The CLIP text encoder and diffusion U-Net remain fixed.

Finetuning. Utilizing the contextually diverse prompt set, we proceed with the model customization (Figure 3). The model is optimized by minimizing the denoising objective $\mathcal{L}_{\text{Diff}}$ (Eqn. 2) and the MLM loss $\mathcal{L}_{\text{MLM}}$ (Eqn. 5). Two different types of prompts are utilized for each objective. Text embeddings $\widetilde{\textbf{C}}$ encoded from a context-simple text prompt $\widetilde{\textbf{t}}$ (e.g., “a picture of [v] dog”) for $\mathcal{L}_{\text{Diff}}$ , and text embedings $\textbf{C}_{\text{masked}}$ encoded from a text context-rich text prompt t (e.g., “a [v] dog at Eiffel Tower”) for $\mathcal{L}_{\text{MLM}}$ . The overall learning objective can be formulated as follows,

\displaystyle\mathcal{L}_{\text{Diff}}(\mathbf{z}_{t},t,\widetilde{\textbf{C}})+\lambda\mathcal{L}_{\text{MLM}}(\textbf{C}_{\text{masked}}),

(7)

where $\mathbf{z}$ denotes the noised latent of personal concept image, $t$ denotes the timestep, and $\lambda$ denotes the weight for the MLM loss. Note that, as the MLM objective does not require any images, the concept token embedding learns contextual semantics without constructing corresponding images. Additionally, our approach does not require any architectural modification of SD, it is highly compatible with existing text-to-image approaches. Hence, we combine our approach with different baseline methods and demonstrate its generalizability. The overall training procedure is described in Algorithm 2. We refer to Appendix Section A.1 for additional details of training.

Algorithm 2 Training Procedure of Text-to-Image Customization

1: Load parameters

\theta

\psi

\Gamma

, and

\mathcal{E}

{

\theta

: U-Net,

\psi

: contextualizer,

\Gamma

: CLIP text,

\mathcal{E}

: VAE}

2: Fix

\psi

and

p_{m}

{

p_{m}

: mask embedding}

3: Set

\lambda

and

\rho_{\text{mask}}

{

\rho_{\text{mask}}

: masking probability}

4: Select trainable params

\Theta\subset\{\theta,\Gamma\}

{Selection based on baselines}

5: repeat

6: Sample

t\sim\text{Uniform}[0,\text{T}-1],\epsilon\sim\mathcal{N}(\textbf{0},\textbf{I})

7: Sample

\textbf{x},\textbf{P}

and encode z=

\mathcal{E}(\textbf{x})

\textbf{c}=\Gamma(\textbf{P})

8: Get noised latent,

\mathbf{z}_{t}=\alpha_{t}\mathbf{z}+\sigma_{t}\epsilon

\textbf{P}_{\text{masked}},\textbf{y}=\text{RandomMask}(\textbf{P},\rho_{\text{mask}})

{y: masked token label}

10: Compute

\mathcal{L}_{\text{Diff}}=\mathbb{E}_{\mathbf{C},\epsilon,t,\mathbf{z}}||\epsilon-\epsilon_{\theta}(\mathbf{z}_{t},t,\mathbf{C})||_{2}^{2}

11: Compute

\mathcal{L}_{\text{MLM}}=\mathbb{E}_{\textbf{y},\textbf{P}_{\text{masked}}}\Big{[}\text{CrossEntropy}(\textbf{y},\Gamma(\textbf{P}_{\text{masked}}))\Big{]}

12: Gradient descent optimization on

\nabla_{\Theta}\Big{[}\mathcal{L}_{\text{Diff}}+\lambda\mathcal{L}_{\text{MLM}}\Big{]}

13: until optimized

4.3 Semantic Enhancement in Textual Space

In this section, we illustrate how adopting the MLM with diverse contexts during the model customization leads to semantically enhanced textual representation, which ultimately translates to improved image generation. We first validate the following to explain how our method leads to semantically enhanced textual representation,

•

The model overfits to the personal concept, when the semantics of the context tokens (i.e., non-personal) become similar to the concept token.
•

The semantics of the context tokens get distinct from the concept token as the diverse contextual semantics are learned with MLM.

Most customization methods that fine-tune the model with limited context (Gal et al., 2022; Ruiz et al., 2023; Voynov et al., 2023; Kumari et al., 2023) often suffer from concept overfitting (Zeng et al., 2024), resulting in generated images that primarily contain the personal concept without adhering to the prompt. We next analyze that concept overfitting leads to a high similarity between the text embeddings of context tokens and the concept token, ultimately causing a loss of contextual semantics.

Proposition 1.

The model overfitting to the concept token makes the attention map mostly attend to the concept token, i.e., $A[i,j_{*}]\gg A[i,j],\forall j\neq j_{*}$ , where $j_{*}$ is the index of the concept token. The distance between the context embeddings $c_{i}$ and the concept embedding $c_{i_{*}}$ is bounded,

||c_{i}-c_{i_{*}}||_{2}\leq\delta_{V}.

(8)

In contrast, the MLM regularizes the loss of contextual information and guides the learning of diverse contexts (Figure 4, right). Specifically, we focus on two types of text embeddings for the concept token: (i) the context token embedding derived from the prompt with the concept token (e.g., “a picture of [v] dog at Eiffel Tower”), referred to as $c_{b}$ ; and (ii) the context embedding from the prompt without the concept token (e.g., “a picture of Eiffel Tower”), denoted as $\hat{c}_{b}$ .

Proposition 2.

Optimizing the concept token $c_{i_{*}}$ with the MLM loss $\mathcal{L}_{\text{MLM}}$ , the minimized distance between the text embedding of context token $c_{b}$ and $\hat{c}_{b}$ is the necessary condition to minimize $\mathcal{L}_{\text{MLM}}(c_{b})$ , i.e.,

\mathcal{L}_{\text{MLM}}(c_{b})-\mathcal{L}_{\text{MLM}}(\hat{c}_{b})\leq\delta_{g}||c_{b}-\hat{c}_{b}||_{2}.

(9)

Remark 3.

According to Proposition 1, solely optimizing the concept token tends to produce text embeddings of context token $c_{b}$ that closely resemble the text embedding of concept token $c_{i_{*}}$ but deviate from desired embeddings $\hat{c}_{b}$ . However, as outlined in Proposition 2, incorporating MLM can significantly align $c_{b}$ with $\hat{c}_{b}$ .

We provide the proof of Proposition 1 and 2 in Appendix A.4.

To empirically validate this, we provide a cosine similarity analysis using a set of 200 text prompts for 7 different concepts in varying contexts (Table 1). We analyze the cosine similarity, $sim_{1}=cos(c_{i_{*}},c_{b})$ , and $sim_{2}=cos(c_{b},\hat{c}_{b})$ , respectively. The result shows that the baseline approach leads to high similarity between the concept-context tokens ( $sim_{1}$ ) and low similarity in the context token from the two prompts ( $sim_{2}$ ). The baseline result implies that the contextual semantics are not only getting similar to the concept (high $sim_{1}$ ) but also implies that the context token loses its semantics as it becomes dissimilar to the semantics that is preserved in the context token (low $sim_{2}$ ). In contrast, we observe the opposite trend from our results, which implies that the MLM encourages the contextual semantics to be distinct (low $sim_{1}$ ) and preserves the semantics (high $sim_{2}$ ).

Method	$sim_{1}\downarrow$	$sim_{2}\uparrow$	$SKL\uparrow$
Baseline	0.5047	0.3864	1.5932
Ours	0.4072	0.7386	2.3536

Table 1: Cosine similarity between the concept and context token from the same prompt (

sim_{1}

) and the context tokens from different prompts

sim_{2}

are reported.

SKL

denotes the symmetric KL divergence between the cross-attention maps of the concept token and the context token.

Figure 5: Visualization of

16\times 16

attention maps from cross-attention layers. Top: Baseline. Bottom. Our approach results in cross-attention maps of the concept token and the context token being more distinctively distributed, leading to semantically enhanced image generation.

4.4 Semantic Enhancement in Image Space

The cross-attention map plays a key role in controlling the overall image generation (Hertz et al., 2022; Chefer et al., 2023), where the attended region corresponds to the area most influenced by the token. Next, we provide an analysis of the cross-attention map to illustrate how the aforementioned semantic enhancement in textual space can be transferred to image space, thereby improving the prompt fidelity of image generation.

Let $\textbf{Q}_{\mathcal{I}}$ , $\textbf{K}_{\mathcal{T}}$ and $\textbf{V}_{\mathcal{T}}$ denote the Query, Key and Value of the cross-attention layer, projected from image $\mathcal{I}$ , and text $\mathcal{T}$ . During the denoising process, the cross-attention map $\textbf{A}^{\text{cross}}=\text{Softmax}(\frac{\textbf{Q}_{\mathcal{I}}\textbf{K}_{\mathcal{T}}^{\top}}{\sqrt{d}})$ is computed between $\textbf{Q}_{\mathcal{I}}\in\mathbb{R}^{|queries|\times d}$ and $\textbf{K}_{\mathcal{T}}\in\mathbb{R}^{L\times d}$ , where $|queries|$ denotes the number of image tokens, $L$ denotes the number of text tokens, and $d$ denotes the dimension of each image/text token embedding.

Proposition 4.

Denote the correlation between the image embeddings and text token embedding as $c_{i}$ as $\textbf{M}[:,i]=\textbf{Q}_{\mathcal{I}}\textbf{K}_{\mathcal{T}}[i,:]$ . $\textbf{M}[:,i]$ and $\textbf{M}[:,j]$ are bounded by the distance between their corresponding text token embeddings $c_{i}$ and $c_{j}$ ,

||\textbf{M}[:,i]-\textbf{M}[:,j]||_{2}\leq\alpha||c_{i}-c_{j}||_{2},

(10)

where $\alpha=||\textbf{Q}_{\mathcal{I}}||_{F}||\mathbf{W}_{K}||_{F}$ , and $\mathbf{W}_{K}$ is the projection matrix of the Key.

Remark 5.

For the baseline method, Proposition 1 shows that the distance between text embeddings of context tokens $c_{b}$ and concept token $c_{i_{*}}$ are bounded by a small value, which means $c_{b}$ and $c_{i_{*}}$ are similar. Furthermore, Proposition 4 suggests that the cross-attention map corresponding to $c_{b}$ highly resembles the one corresponding to $c_{i_{*}}$ . This suggests that the image generation will be focused solely on the personal concept, overlooking the context.

Remark 6.

For our method, Propositions 1 and 2 show that the context token embedding $c_{b}$ gets less similar to the concept token $c_{i_{*}}$ that the context token embedding retains the original semantics. The Proposition 4 further indicates that the cross-attention map corresponding to $c_{b}$ will be distinct from the one corresponding to the concept token $c_{i_{*}}$ . This implies that context tokens will be contributing to distinct regions of the image, and image generation is prevented from solely focusing on the personal concept.

We provide the proof of Proposition 4 in Appendix A.5.

We visualize the cross-attention maps of the tokens and compare the results of the baseline (Gal et al., 2022) and ours to further validate our claims (Figure 5). The result indicates that concept overfitting of the text embedding in the baseline approach leads to cross-attention maps of the concept token and the context token being more closely distributed, leading to semantically degraded image generation. Using the same prompt set used for cosine similarity analysis that contains both context token and concept token, we measure the symmetric KL divergence between the cross-attention maps of the concept tokens $c_{i_{*}}$ and the context tokens $c_{b}$ within the same prompt: $SKL=\frac{1}{2}D_{\text{KL}}(\textbf{A}^{\text{cross}}[:,i_{*}]||\textbf{A}^{\text{cross}}[:,b])+\frac{1}{2}D_{\text{KL}}(\textbf{A}^{\text{cross}}[:,b]||\textbf{A}^{\text{cross}}[:,i_{*}])$ , where $D_{\text{KL}}$ is the Kullback-Leibler (KL) divergence, $\textbf{A}^{\text{cross}}[:,k]\in\mathbb{R}^{|queries|}$ denotes the cross-attention map of the $k$ -th token. The results show that the baseline approach produces a more similar distribution of the cross-attention maps between the concept and context tokens (Table 1, $SKL$ ). This finding, along with our visual evidence (Figure 4), indicates that semantic enhancement in textual space leads to enhancement in image space, resulting in improved prompt fidelity of the image generation.

5 Experiments

Table 2: Evaluation results of the proposed approach combined with baselines. The winning result between the baseline and ours is denoted in bold.

	TI		XTI		DB		CD
	Baseline	Ours	Baseline	Ours	Baseline	Ours	Baseline	Ours
CLIP-T $\uparrow$	0.279	0.305	0.297	0.305	0.306	0.309	0.322	0.326
DINO $\uparrow$	0.556	0.543	0.586	0.594	0.667	0.655	0.618	0.615
DINO-FG $\uparrow$	0.661	0.658	0.692	0.710	0.783	0.775	0.737	0.736

5.1 Experimental Settings

Baselines.

We apply our approach to four different baselines, Textual Inversion (TI) (Gal et al., 2022), XTI (Voynov et al., 2023), Dreambooth (DB) (Ruiz et al., 2023) and CustomDiffusion (CD). Apart from the adoption of the MLM objective during model customization, the remaining training configuration remains the same for the baselines and ours. For each baseline, the training parameters are chosen following the original configuration. For TI and XTI, only the personal concept embeddings are updated. For DB, we finetune the entire parameters of the U-Net and the CLIP text encoder. For CD, we train the personal concept embedding and the Key/Value projection matrices of cross-attention layers of the U-Net. For all the prompts, we use a joint phrase that combines the special token with the prior concept (e.g., ‘[v] dog’). We do not mask the concept tokens, and we set the $\rho_{\text{mask}}=15\%$ following (Devlin et al., 2018). we use AdamW (Loshchilov, 2019) optimizer to update the parameters on a single NVIDIA RTX 3090 GPU. We use a classifier-free guidance Ho & Salimans (2022) scale of 7.5 when generating images. We provide additional implementation details of the baseline methods in Appendix Section A.1.

Table 3: Evaluation results with varying

\lambda

. The best results are denoted in bold.

$\lambda$	CLIP-T $\uparrow$	DINO-FG $\uparrow$
Baseline	0.279	0.657
$5\times 10^{-5}$	0.292	0.659
$1\times 10^{-4}$	0.305	0.658
$5\times 10^{-4}$	0.311	0.654

Table 4: Ablation study on masking probability. The best results are denoted in bold.

$\rho_{\text{mask}}$ [%]	CLIP-T $\uparrow$
Baseline	0.279
15	0.305
50	0.304
90	0.299

Dataset.

We use a mixture of 15 different subjects adopted from DB (Ruiz et al., 2023), TI (Gal et al., 2022) and CD (Kumari et al., 2023). We use 11 subjects from DB (Ruiz et al., 2023) composed of: [backpack,backpack_dog, cat, cat2, cat3, cat3, cat6, duck_toy, poop_emoji, rc_car, teapot, teddybear], and we use 3 subjects from (Kumari et al., 2023): [pet_cat1, pet_dog1, wooden_pot]. Finally, we use 1 subject from (Gal et al., 2022): [cat toy]. A benchmark prompt set from DB Ruiz et al. (2023) is utilized for generation. This prompt set contains 25 prompts for nonliving and living subject categories, and 8 images per prompt are generated, resulting in a total of 3,000 images.

Evaluation Metrics.

Following the literature (Ruiz et al., 2023; Gal et al., 2022; Kumari et al., 2023) we measure the text prompt fidelity and the subject fidelity. To measure text prompt fidelity, we compute the average pairwise cosine similarity between the embeddings of the input prompt and the generated image, encoded by the CLIP text and vision encoders (CLIP-T). For subject fidelity, we calculate the average pairwise cosine similarity between the embeddings of the personal concept image and the generated image, using the ViT-S/16 DINO encoder (Caron et al., 2021) (DINO). Additionally, following (Kim et al., 2024), we report the DINO score measured from the segmented foreground regions of the generated and the input images for a more accurate evaluation of subject fidelity. As it removes the influence of the background, this leads to more accurate measuring of subject fidelity.

5.2 Quantitative Results

Comparison with Baseline Method.

We combine the proposed approach with four different baselines and present the quantitative comparisons (Table 2). Notably, for all baseline methods, we achieve consistent improvement in semantic alignment between the generated images and the input prompts, as we observed improvement in CLIP-T score for all baselines. Compared to the methods that update the parameters other than the personal token embeddings (DB and CD), the ones that do not update them show higher improvement. We hypothesize that as the model that trains U-Net still utilizes the contextually limited text-image pairs for the denoising objective, this leads to overfitting of the cross-attention layers. As a result, the enhancement property of our method can not be faithfully transferred to the image space. For the subject fidelity measure, we observe a marginal difference from the baseline methods. We later show that prompt-subject fidelity trade-off can be achieved by different $\lambda$ (Section 5.3).

5.3 Ablation Study

In this section, we conduct ablation studies to provide deeper insights into our method and validate its effectiveness. To solely compare the effect of adjusting the textual space, we have UNet and CLIP fixed and only update the concept token embedding.

Impact of $\lambda$ . We first study the influence of MLM loss weight (Table 4). We adopt TI as our baseline and compare the prompt and subject fidelity with the model trained with different $\lambda$ . We note the general trend of improved CLIP-T score as the model is trained with increased $\lambda$ . Additionally, the result indicates as the $\lambda$ increases, subject fidelity can be slightly decreased. We hypothesize that this trade-off arises due to the model prioritizing textual context alignment over subject preservation at higher $\lambda$ values, as the MLM objective encourages the model to focus more on capturing the semantic relationships of the contexts.

Impact of Contextual Semantics in Customizing Multi-concept Images.

We study whether the proposed method effectively guides the personal concept token to utilize the contextual semantics, by applying our method to learn a single concept from images containing multiple concepts (Figure 6). For this, we construct a set of 50 prompts that contextual semantics of the concept is highly specific to the concept (e.g., “a man petting a [v] dog” for a dog). Surprisingly, applying our approach leads to successful learning of the targeted concept within the multiple concepts in the same image, leading to disentanglement results. This result indicates that, by training the concept embedding to predict the word that best aligns with its contexts, the concept embedding is driven to be semantically aligned with the context. As the overall semantics of the prompt set are highly specific to the concept (e.g., a dog), the concept embedding converges toward representing that specific concept (i.e., a dog).

Impact of Masking Probability. We train the model with different masking probabilities and study its impact. Table 4 shows that the performance is relatively insensitive to the masking probability, however, excessively high values lead to degradation. This result validates the importance of contextual information, as excessively high masking value leads to contextual semantics removal.

5.4 Additional Qualitative Results

We present the additional qualitative comparison results of the proposed method in (Figure 7). In general, the baseline method that integrates our approach leads to improvement in prompt fidelity. In this visualization results, the baseline approach shows a higher tendency to neglect the context semantics. We analyze that the baseline approach loses the semantics of the context in textual space, which leads to the loss of semantics of the generated images. In contrast, the adoption of MLM leads to the preservation of the contextual semantics in text embedding, resulting in images with enhanced semantics with higher prompt fidelity. We provide additional qualitative results of the proposed method in Appendix Section A.6.

6 Conclusion

In this paper, we proposed a highly cost-effective text-to-image customization method that enhances the semantics of the textual representation, thereby improving the semantic quality and prompt fidelity of the generated images. Our analysis revealed that the context overfitting problem in existing approaches stems from fine-tuning with limited contexts. We addressed this issue by diversifying the context of the personal concept solely within the textual space. By integrating our approach with different text-to-image customization methods, we observed consistent improvement in CLIP scores. The effectiveness of the proposed method is demonstrated through both theoretical analysis and extensive experimental validation.

References

Avrahami et al. (2023) Omri Avrahami, Kfir Aberman, Ohad Fried, Daniel Cohen-Or, and Dani Lischinski. Break-a-scene: Extracting multiple concepts from a single image. In Special Interest Group on Computer Graphics and Interactive Techniques Asia (SIGGRAPH Asia), 2023.
Balaji et al. (2022) Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, et al. ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
Brooks et al. (2023) Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Computer vision and Pattern Recognition (CVPR), 2023.
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems (NeurIPS), 2020.
Caron et al. (2021) Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In International Conference on Computer Vision (ICCV), 2021.
Chefer et al. (2023) Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. Transactions on Graphics (TOG), 2023.
Chen et al. (2023) Hong Chen, Yipeng Zhang, Simin Wu, Xin Wang, Xuguang Duan, Yuwei Zhou, and Wenwu Zhu. Disenbooth: Identity-preserving disentangled tuning for subject-driven text-to-image generation. International Conference on Learning Representation (ICLR), 2023.
Chen et al. (2024) Wenhu Chen, Hexiang Hu, Yandong Li, Nataniel Ruiz, Xuhui Jia, Ming-Wei Chang, and William W Cohen. Subject-driven text-to-image generation via apprenticeship learning. Advances in Neural Information Processing Systems, 36, 2024.
Chen et al. (2015) Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. North American Chapter of the Association for Computational Linguistics (NAACL), 2018.
Dhariwal & Nichol (2021) Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems (NeurIPS), 2021.
Ding et al. (2022) Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. Cogview2: Faster and better text-to-image generation via hierarchical transformers. Advances in Neural Information Processing Systems (NeurIPS), 2022.
Gal et al. (2022) Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
Hertz et al. (2022) Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-or. Prompt-to-prompt image editing with cross-attention control. In International Conference on Learning Representations (ICLR), 2022.
Ho & Salimans (2022) Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems (NeurIPS), 33, 2020.
Kim et al. (2024) Jimyeong Kim, Jungwon Park, and Wonjong Rhee. Selectively informative description can reduce undesired embedding entanglements in text-to-image personalization. In Computer Vision and Pattern Recognition (CVPR), 2024.
Kumari et al. (2023) Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Computer Vision and Pattern Recognition (CVPR), 2023.
Loshchilov (2019) I Loshchilov. Decoupled weight decay regularization. International Conference on Learning Representations (ICLR), 2019.
Nichol et al. (2021) Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
Nichol & Dhariwal (2021) Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning (ICML), 2021.
OpenAI (2023) OpenAI. Chatgpt: Gpt-4, 2023. URL https://openai.com/research/gpt-4.
Peng et al. (2024) Xu Peng, Junwei Zhu, Boyuan Jiang, Ying Tai, Donghao Luo, Jiangning Zhang, Wei Lin, Taisong Jin, Chengjie Wang, and Rongrong Ji. Portraitbooth: A versatile portrait model for fast identity-preserved personalization. In Computer Vision and Pattern Recognition (CVPR), 2024.
Podell et al. (2023) Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), 2021.
Ramesh et al. (2021) Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning (ICML), 2021.
Ramesh et al. (2022) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Computer Vision and Pattern Recognition (CVPR), 2022.
Ruiz et al. (2023) Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Computer Vision and Pattern Recognition (CVPR), 2023.
Saharia et al. (2022) Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems (NeurIPS), 2022.
Schuhmann et al. (2021) Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
Shi et al. (2024) Jing Shi, Wei Xiong, Zhe Lin, and Hyun Joon Jung. Instantbooth: Personalized text-to-image generation without test-time finetuning. In Computer Vision and Pattern Recognition (CVPR), 2024.
Song et al. (2020a) Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. International Conference on Learning Representation (ICLR), 2020a.
Song et al. (2020b) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b.
Voynov et al. (2023) Andrey Voynov, Qinghao Chu, Daniel Cohen-Or, and Kfir Aberman. p+: Extended textual conditioning in text-to-image generation. arXiv preprint arXiv:2303.09522, 2023.
Wei et al. (2023) Yuxiang Wei, Yabo Zhang, Zhilong Ji, Jinfeng Bai, Lei Zhang, and Wangmeng Zuo. Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation. In International Conference on Computer Vision (ICCV), 2023.
Xue et al. (2024) Zeyue Xue, Guanglu Song, Qiushan Guo, Boxiao Liu, Zhuofan Zong, Yu Liu, and Ping Luo. Raphael: Text-to-image generation via large mixture of diffusion paths. Advances in Neural Information Processing Systems (NeurIPS), 36, 2024.
Yuan et al. (2023) Ge Yuan, Xiaodong Cun, Yong Zhang, Maomao Li, Chenyang Qi, Xintao Wang, Ying Shan, and Huicheng Zheng. Inserting anybody in diffusion models via celeb basis. arXiv preprint arXiv:2306.00926, 2023.
Zeng et al. (2024) Weili Zeng, Yichao Yan, Qi Zhu, Zhuo Chen, Pengzhi Chu, Weiming Zhao, and Xiaokang Yang. Infusion: Preventing customized text-to-image diffusion from overfitting. International Conference on Multimedia (MM), 2024.
Zhang et al. (2023) Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In International Conference on Computer Vision (ICCV), 2023.

Appendix A Appendix

A.1 Implementation Details of Baselines

Textual Inversion.

For this baseline method, we train the model for 2,000 iterations with a constant learning rate 2e-3. We use the batch size of 4 to train the method. Other than the concept token embeddings, no parameters are updated during the training. We set the batch size for MLM as 25. We set $\lambda=1\times 10^{-4}$ .

XTI.

For this baseline method, we train the model for 1,500 iterations with a constant learning rate of 2e-3. We use the batch size of 4 to train the method. Following the original method, a set of multiple concept embeddings is utilized to be aligned with the same concept image. We set the batch size for MLM as 12 due to the memory limit. We set $\lambda=2\times 10^{-4}$ .

DreamBooth.

For this baseline method, we train the model for 1,000 iterations with a constant learning rate of 1e-6. We use the batch size of 2 to train the method. Following the original method, the prior preservation loss is adopted during the training. For this, we generate a set of 200 images by prompting with “a picture of [SUBJECT CLASS]”, by denoting the general class of the concept in the prompt. We update the parameters of both the CLIP text encoder and the diffusion U-Net. We set the batch size for MLM as 25. We set $\lambda=5\times 10^{-4}$ .

CustomDiffusion.

For this baseline method, we train the model for 5,00 iterations with a constant learning rate of 4e-5. We use the batch size of 4 to train the method. Following the original method, we adopt prior preservation with generated images. During training only the Key/Value projection layers of the diffusion U-Net are updated during training. We set the batch size for MLM as 25. We set $\lambda=1\times 10^{-4}$ .

A.2 Details of Text Prompt Set Construction

To generate a contextually diverse prompt set with minimal human intervention, we utilize a large pretrained language model (LLM) OpenAI (2023). Based on whether the personal concept is classified as living or nonliving, we predefined context categories and query the LLM to generate relevant elements for each category. The predefined categories for the living personal concepts are as below,

1.

Human Interactive Prompts: A set of prompts that involves diverse interaction between different human subjects (e.g., “Albert Einstein is watching TV with [V]”).
2.

Relative Position Prompts: A set of prompts that involves different positioning words and different objects (e.g., “a picture of [V] next to a red vase”).
3.

Background Prompts: A set of prompts that describes a scene with different backgrounds (e.g., “a picture of [V] with Eiffel Tower in the background”).
4.

Image Style Prompts: A set of prompts that describe image style (e.g., “a picture of [V] in Pop Art style”
5.

Attributes Changing Prompts: A set of prompts that describe the target concept with different visual attributes (e.g., “a picture of [V] in blue sailor outfit”

Similarly, for non-living objects, we construct a set of prompts in five different types of contexts,

1.

Human Interactive Prompts: A set of prompts that involves diverse interaction between different human subjects (e.g., “Albert Einstein is watching TV with [V]”).
2.

Relative Position Prompts: A set of prompts that involves different positioning words and different objects (e.g., “a picture of [V] next to a red vase”).
3.

Background Prompts: A set of prompts that describes a scene with different backgrounds (e.g., “a picture of [V] with Eiffel Tower in the background”).
4.

Image Style Prompts: A set of prompts that describe image style (e.g., “a picture of [V] in Pop Art style”
5.

Attributes Changing Prompts: A set of prompts that describe the target concept with different visual attributes (e.g., “a picture of [V] in blue sailor outfit”

A.3 Implementation Details of Contextualizer

Contextualizer constitutes four blocks of a self-attention layer and a feed-forward layer, followed by a layer normalization layer, where each block learns the residuals of the input with the residual connection. To train the contextualizer, we use a merged set of COCO caption dataset Chen et al. (2015) and the prompt set that we constructed. For the manual prompt set, we replace the personal concept token with the personal concept token to corresponding prior concept token. During training we set the ratio of batch of the two prompt set to be 70 to 30. The contextualizer is pretrained for 100K iterations with a learning rate of 1e-4, and batch size 150. We use the AdamW optimizer Loshchilov (2019).

A.4 Proof - Semantic Enhancement in Textual Space

The proof of Proposition 1.

Proof.

Given an attention map A, with $\sum_{j}\textbf{A}[i,j]=1$ , $\textbf{A}[i,j]\geq 0$ , and the value matrix V, the output of the attention layer is,

c_{i}=\sum_{j=1}^{N}\textbf{A}[i,j]\textbf{V}[j,:].

(11)

The concept token at index $j_{*}$ has the highest attention value, i.e., $\textbf{A}[i,j_{*}]\gg\textbf{A}[i,j],\forall j\neq j_{*}$ . We have,

c_{i}=\sum_{j=1}^{N}\textbf{A}[i,j]\textbf{V}[j,:]=\sum_{j=1,j\neq*}^{N}\textbf{A}[i,j]\textbf{V}[j,:]+\textbf{A}[i,j_{*}]\textbf{V}[j_{*},:]\approx\textbf{A}[i,j_{*}]\textbf{V}[j_{*},:]\approx\textbf{V}[j_{*},:].

(12)

The L2 norm between the text embeddings of the concept token $c_{i_{*}}$ and context tokens $c_{i}$ is,

	$\displaystyle\lVert c_{i}-c_{i_{*}}\rVert_{2}$
$\displaystyle=$	$\displaystyle\lVert\sum_{j=1,j\neq j_{}}^{N}(\textbf{A}[i,j]-\textbf{A}[i_{},j])\textbf{V}[j,:]+(\textbf{A}[i,j_{}]-\textbf{A}[i_{},j_{}])\textbf{V}[j_{},:]\rVert_{2}$
$\displaystyle\leq$	$\displaystyle\lVert\sum_{j=1,j\neq j_{}}^{N}(\textbf{A}[i,j]-\textbf{A}[i_{},j])\textbf{V}[j,:]\rVert_{2}+\lVert(\textbf{A}[i,j_{}]-\textbf{A}[i_{},j_{}])\textbf{V}[j_{},:]\rVert_{2}$
$\displaystyle\leq$	$\displaystyle\sum_{j=1,j\neq j_{}}^{N}\lVert\textbf{A}[i,j]-\textbf{A}[i_{},j]\rVert_{2}\lVert\textbf{V}[j,:]\rVert_{2}+\lVert\textbf{A}[i,j_{}]-\textbf{A}[i_{},j_{}]\rVert_{2}\lVert\textbf{V}[j_{},:]\rVert_{2}.$	(13)

Suppose $\textbf{A}[i,j_{*}]=1-\delta_{ij_{*}}$ and $\textbf{A}[i_{*},j_{*}]=1-\delta_{i_{*}j_{*}}$ , where $\textbf{0}\leq\delta_{ij_{*}}<\delta$ and $\textbf{0}\leq\delta_{i_{*}j_{*}}<\delta$ . $\textbf{A}[i,j]=\delta_{ij},\forall j\neq j_{*}$ , $\textbf{A}[i_{*},j]=\delta_{i_{*}j},\forall j\neq j_{*}$ , $0\leq\delta_{ij}<\delta$ and $0\leq\delta_{i_{*}j}<\delta$ , where $\delta$ is a small value. We have $\lVert\delta_{ij}-\delta_{i_{*}j}\rVert_{2}<\delta$ and $\lVert\delta_{i_{*}j_{*}}-\delta_{ij_{*}}\rVert_{2}<\delta$ . Thus,

	$\displaystyle\lVert c_{i}-c_{*}\rVert_{2}$
$\displaystyle\leq$	$\displaystyle\sum_{j=1,j\neq j_{}}^{N}\lVert\delta_{ij}-\delta_{i_{}j}\rVert_{2}\lVert\textbf{V}[j,:]\rVert_{2}+\lVert\delta_{i_{}j_{}}-\delta_{ij_{}}\rVert_{2}\lVert\textbf{V}[j_{},:]\rVert_{2}$
$\displaystyle\leq$	$\displaystyle\delta\sum_{j=1,j\neq j_{}}^{N}\lVert\textbf{V}[j,:]\rVert_{2}+\delta\lVert\textbf{V}[j_{},:]\rVert_{2}.$	(14)

Since $\lVert\textbf{V}[j,:]\rVert_{2}$ is bounded, we have,

\lVert c_{i}-c_{*}\rVert_{2}\leq\delta_{\textbf{V}},

(15)

where $\delta_{\textbf{V}}=\delta\sum_{j=1}^{N}\lVert\textbf{V}[j,:]\rVert_{2}$ . ∎

The proof of Proposition 2.

Proof.

Suppose $\lVert c_{b}-\hat{c}_{b}\rVert_{2}$ is a small value, using the Taylor series, we have,

	$\displaystyle\mathcal{L}_{\text{MLM}}(c_{b})$	$\displaystyle=\mathcal{L}_{\text{MLM}}(\hat{c}_{b})+(c_{b}-\hat{c}_{b})^{T}\text{grad}(\mathcal{L}_{\text{MLM}}(\hat{c}_{b}))+\mathcal{O}(c_{b}-\hat{c}_{b})$
		$\displaystyle\approx\mathcal{L}_{\text{MLM}}(\hat{c}_{b})+(c_{b}-\hat{c}_{b})^{T}\text{grad}(\mathcal{L}_{\text{MLM}}(\hat{c}_{b})),$		(16)

where $\text{grad}(\cdot)$ is the first-order derivative. Using Cauchy-Schwartz inequality, we have,

(c_{b}-\hat{c}_{b})^{T}\text{grad}(\mathcal{L}_{\text{MLM}}(\hat{c}_{b}))\leq\lVert c_{b}-\hat{c}_{b}\rVert_{2}\cdot\lVert\text{grad}(\mathcal{L}_{\text{MLM}}(\hat{c}_{b}))\rVert_{2}.

(17)

Since $\hat{c}_{b}$ is near the optimal value, which is achieved by optimizing the contextualizer, we have $\text{grad}(\mathcal{L}_{\text{MLM}}(\hat{c}_{b}))\leq\delta_{g}$ , where $\delta_{g}$ is a small value. Therefore, we have

\mathcal{L}_{\text{MLM}}(c_{b})-\mathcal{L}_{\text{MLM}}(\hat{c}_{b})\leq\delta_{g}\lVert c_{b}-\hat{c}_{b}\rVert_{2}.

(18)

∎

A.5 Proof - Semantic Enhancement in Image Space

The proof of Proposition 4.

Proof.

The image embedding $\mathbf{z}$ and text embedding $\mathbf{C}$ are projected as $Q_{\mathcal{I}}=\mathbf{z}\mathbf{W}_{Q}$ , $K_{\mathcal{T}}=\mathbf{C}\mathbf{W}_{K}$ . For text embeddings at indices $i$ and $j$ , we have,

	$\displaystyle\textbf{K}_{\mathcal{T}}[i,:]=c_{i}\mathbf{W}_{K}$		(19)
	$\displaystyle\textbf{K}_{\mathcal{T}}[j,:]=c_{j}\mathbf{W}_{K}.$		(20)

The relation map is $\textbf{M}=\mathbf{Q}_{\mathcal{I}}K_{\mathcal{T}}^{T}$ , and $\textbf{M}[:,i]=\mathbf{Q}_{\mathcal{I}}K_{\mathcal{T}}[i,:]$ , $\textbf{M}[:,j]=\mathbf{Q}_{\mathcal{I}}K_{\mathcal{T}}[j,:]$ . Thus,

$\displaystyle\lVert\textbf{M}[:,i]-\textbf{M}[:,j]\rVert_{2}$	$\displaystyle=\lVert\mathbf{Q}_{\mathcal{I}}(K_{\mathcal{T}}[i,:]-K_{\mathcal{T}}[j,:])\rVert_{2}$
	$\displaystyle\leq\lVert\mathbf{Q}_{\mathcal{I}}\rVert_{F}\lVert(K_{\mathcal{T}}[i,:]-K_{\mathcal{T}}[j,:])\rVert_{2}$
	$\displaystyle=\lVert\mathbf{Q}_{\mathcal{I}}\rVert_{F}\lVert(c_{i}-c_{j})\mathbf{W}_{K}\rVert_{2}$
	$\displaystyle\leq\lVert\mathbf{Q}_{\mathcal{I}}\rVert_{F}\lVert\mathbf{W}_{K}\rVert_{F}\lVert(c_{i}-c_{j})\rVert_{2}$
	$\displaystyle=\alpha\lVert(c_{i}-c_{j})\rVert_{2},$	(21)

where $\lVert\cdot\rVert_{F}$ is Frobenius norm, and $\alpha=\lVert\mathbf{Q}_{\mathcal{I}}\rVert_{F}\lVert\mathbf{W}_{K}\rVert_{F}$ . ∎

A.6 Additional Qualitative Examples

We provide additional qualitative results of our approach combined with each baseline method, TI (Gal et al., 2022), XTI(Voynov et al., 2023), DB(Ruiz et al., 2023) and CD(Kumari et al., 2023). We provide two types of generation results, living or non-living objects (Figures 8, 9, 10, 11, 12, 13, 14,15)