An Improved Method for Personalizing Diffusion Models

Yan Zeng1    Masanori Suganuma1,2    Takayuki Okatani1,2
1Graduate School of Information Sciences, Tohoku University    2RIKEN Center for AIP
yan, suganuma, [email protected]
Abstract

Diffusion models have demonstrated impressive image generation capabilities. Personalization approaches, such as textual inversion and Dreambooth, adapt such models to specific subjects using a few example images, enabling the generation of images of those specific objects in diverse textual contexts. Our proposed approach aims to retain the model’s original knowledge while integrating the new information, resulting in superior outcomes while requiring less training time than Dreambooth and textual inversion.

1 Introduction

In recent years, with the advent of deep generative models, particularly diffusion models [5], it has become possible to generate high-quality images of virtually any scene or object. Moreover, by controlling image generation through prompts such as text, it has become feasible to create images of any desired scene. In this text-to-image synthesis, the model consists of a text encoder, which maps the text prompt onto its embedding space, and a diffusion model, which takes this embedding as an additional input to generate an image of the target scene.

Personalization is a specific application of this technology. It entails the ability to generate images of a specific object (such as a particular bag or one’s own pet) based on a few example images of that object. The challenge lies in generating not images of a generic bag or dog of any kind, but images of the specific object that match the examples down to minute details. As in regular text-to-image synthesis, it is often required that contexts, such as the background of the objects and their poses, can be controlled within the generated image through prompts.

Several approaches have been proposed, all of which presuppose the existence of a pre-trained text-based image generation model. One approach, exemplified by textual inversion [3], involves optimizing the embedding of the prompt text. Generally speaking, expressing the desired specific object in words can be difficult, owing to the great diversity of objects and the relative insufficiency of the expressive power of language, no matter how much rhetoric is employed. Therefore, the idea is to create a new word or phrase to represent the specific object in the text embedding space. More specifically, this approach involves optimizing the text embedding that represents the given specific object from its image examples.

However, it is impossible to create an image that faithfully reproduces the specific object using this method. Due to the relative limitations in the expressive power of the linguistic space, it is likely that there does not exist an embedding in the text embedding space that can represent the specific object itself, including its minute details.

Another method, exemplified by Dreambooth [12], seeks to optimize the image generation model itself. This approach involves specifying an undefined term corresponding to the specific object and then retraining (fine-tuning) the model so that, when that term is input to the generation model, the specific object is generated. For instance, in the case of one’s pet dog, the model is trained to generate the given images in response to a prompt like “a photo of [xxx] dog.” However, retraining the model with only a few examples can result in the forgetting of previously learned content. Faithfully generating images of the specific object requires many weight updates, but the more updates are performed, the higher the risk of forgetting. To counter this, Dreambooth has the model learn a rich set of images simultaneously. Specifically, in the case of a dog, a diverse collection of dog images is used and, just like the original training, the model is required to generate these images in response to a prompt like “a photo of a dog.” To this end, a loss called the prior preservation loss is added, as a kind of regularization term, to the loss that demands learning the specific object from its few images. This method enables much more faithful image generation than textual inversion.

However, even with the Dreambooth method, the fidelity of the generated images is often not perfect. Additionally, there is the issue of substantial computational cost. Because evaluating the prior preservation loss is computationally expensive, a considerable amount of time is required before the model becomes capable of generating a single specific object.

Naturally, a fusion of both approaches is conceivable. Imagic [6] represents a method that integrates the two. However, Imagic targets the problem of editing a single image through text rather than personalization; it is not designed to learn the characteristics of a specific object from multiple images.

2 Related Work

2.1 Text-to-Image Synthesis

With the advancements in multi-modal models and large language models, the application of generative models to these frameworks has made text-to-image tasks feasible. In recent years, text-to-image generation models have undergone rapid evolution. There are models based on the autoregressive architecture, like DALL-E [10] and Parti [15], as well as models trained with GANs, like Lafite [17]. Text-to-image models based on diffusion models have also garnered significant attention. DALL-E 2 [9] utilizes CLIP [8] to transform textual descriptions into image embeddings, which are then decoded to generate images that align with the text. Imagen [13] introduces a cascaded architecture that first generates 64x64 resolution images and then employs a two-stage text-conditioned super-resolution diffusion model to upscale them. Stable Diffusion [11] is among the earliest open-source text-to-image diffusion models. Unlike other diffusion models, it doesn’t train directly on pixel images; instead, it first compresses images into low-dimensional representations and then trains in the resulting latent space.

While text-to-image models exhibit a high level of semantic consistency between images and text descriptions, they still face limitations due to the inherent constraints of textual descriptions. Particularly, when dealing with objects that possess intricate details, the model may struggle to accurately convey the nuances, resulting in generated images that do not align with expectations.

2.2 Personalization

The concept of personalized image generation within the diffusion model framework was first introduced by DreamBooth [12]. Their paper coined this concept as “subject-driven generation”; subsequent papers refer to the task as “personalization”. In [4], highly personalized (HiPer) text embeddings are used to replace uninformative embeddings at the end of the prompt, and only the HiPer embeddings are optimized. During inference, the end of a new prompt’s embeddings is replaced with the learned HiPer embeddings, enabling the generation of object images based on different descriptions. Another paper employed apprenticeship learning [2]. Initially, the diffusion model is fine-tuned to produce expert models. Subsequently, a dataset is curated using imaginary captions proposed by PaLM, along with images generated by the expert models based on these captions. Finally, the apprentice model is trained on this dataset. The work in [7] suggested training only the parameters of the cross-attention layers in the U-Net (the main part of a diffusion model is composed of ResNet blocks, self-attention blocks, and cross-attention blocks arranged in a U-Net architecture) while freezing the others. It also introduced the idea of training a single model for multiple concepts.

3 Method

The task of personalization is stated as: given a collection of images of a specific target object, we aim to produce new images of that object based on input prompts, ensuring the preservation of its detailed features. Subsequently, we review two key studies on the task, highlight their limitations, and introduce our approach.

3.1 Existing Approaches

3.1.1 Textual Inversion

Figure 1: Illustration of textual inversion [3]. Given a few sample images of a specific target object, it optimizes the embedding $v_*$ of a newly introduced word/token $S_*$ for the object, enabling the personalization of a pre-trained diffusion model. The embedding $v_*$ is initialized using the embedding of a general class (e.g. “dog”) of the target object.

Textual inversion associates features of a target object with a new word or token by refining its corresponding embedding values, as shown in Figure 1.

Consider $y$ as an example input text prompt, such as “a photo of $S_*$”, where $S_*$ represents a novel word designating the target object. This prompt is initially transformed by a tokenizer into a series of word or sub-word indices using a predefined vocabulary that now includes $S_*$ as a fresh entry. Each of these tokens is then linked to an embedding vector via a look-up table, creating a sequence of embedding vectors. The embedding vector associated with the new token for $S_*$ is denoted as $v_*$, which is initialized using the embedding vector of the target object’s general class, for instance, ‘dog’.

The embedding vector $v_*$ is fine-tuned using the same objective function employed in latent diffusion models, as stated in [11]:

L_{\mathrm{LDM}} = \mathbb{E}_{z\sim\mathcal{E}(x),\,y,\,\epsilon\sim\mathcal{N}(0,I),\,t}\Big[\|\epsilon-\epsilon_{\theta}(z_{t},t,\tau_{\theta}(y))\|_{2}^{2}\Big].   (1)

This loss function measures the L2 distance between the noise $\epsilon$, drawn from a standard normal distribution, and the noise $\epsilon_{\theta}$ predicted by the model. Here, $z_t$ denotes the noisy low-dimensional representation of a training image of the target object at timestep $t$. The function $\tau_{\theta}$ is a pretrained text encoder, whose output $\tau_{\theta}(y)$ contains the target parameter $v_*$. Meanwhile, the model’s parameters and the embeddings of other tokens in the vocabulary are frozen.
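
For illustration, optimizing only the new token’s embedding with the loss in Eq. (1) can be sketched as follows, assuming the huggingface diffusers and transformers stack; the placeholder token name “<s-star>”, the initializer word “dog”, and the data handling are illustrative choices, not taken from the released textual inversion code.

# A minimal sketch of textual inversion: only the embedding of the new token is
# optimized with the loss of Eq. (1); the U-Net, the VAE, and the rest of the
# text encoder stay frozen.
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "stabilityai/stable-diffusion-2-1-base"
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

# Register the new token S_* and initialize its embedding v_* from "dog".
tokenizer.add_tokens(["<s-star>"])
text_encoder.resize_token_embeddings(len(tokenizer))
new_id = tokenizer.convert_tokens_to_ids("<s-star>")
init_id = tokenizer.encode("dog", add_special_tokens=False)[0]
emb_weight = text_encoder.get_input_embeddings().weight
with torch.no_grad():
    emb_weight[new_id] = emb_weight[init_id].clone()

# Freeze everything except the embedding table.
vae.requires_grad_(False)
unet.requires_grad_(False)
text_encoder.requires_grad_(False)
emb_weight.requires_grad_(True)
optimizer = torch.optim.AdamW([emb_weight], lr=5e-4, weight_decay=0.0)

def training_step(pixel_values):  # pixel_values: (B, 3, H, W) target-object images
    latents = vae.encode(pixel_values).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (latents.shape[0],))
    noisy_latents = scheduler.add_noise(latents, noise, t)
    ids = tokenizer(["a photo of <s-star>"] * latents.shape[0], padding="max_length",
                    truncation=True, max_length=tokenizer.model_max_length,
                    return_tensors="pt").input_ids
    cond = text_encoder(ids)[0]                      # tau_theta(y), contains v_*
    pred = unet(noisy_latents, t, encoder_hidden_states=cond).sample
    loss = F.mse_loss(pred, noise)                   # Eq. (1)
    optimizer.zero_grad()
    loss.backward()
    emb_weight.grad[torch.arange(emb_weight.shape[0]) != new_id] = 0.0  # only v_* moves
    optimizer.step()
    return loss.item()

The key point is that gradients are zeroed for every embedding row except that of the new token, so the rest of the model remains untouched.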

One benefit of textual inversion is that the model preserves all prior knowledge because only the embedding of the target object’s token undergoes optimization. However, textual inversion often faces challenges in producing correct images of the target object. Capturing the intricate details of the target object using the embedding of a single token is difficult, as the model’s embedding space isn’t designed for this purpose.

3.1.2 Dreambooth

Figure 2: Illustration of Dreambooth [12]. Given a few sample images of a target object, it fine-tunes a diffusion model by minimizing the sum of two loss functions. The reconstruction loss measures the difference between the generated images and the sample images of the target object. The prior preservation loss measures the same difference for images of ‘common class’ objects, aiming to mitigate the forgetting of the ability to generate them.

Dreambooth adopts an alternative approach. It fine-tunes a pre-trained model using sample images of the target object, as illustrated in Figure 2. Contrary to textual inversion, Dreambooth doesn’t add a new token to the vocabulary. The original paper [12] mentions that infrequently used tokens in the vocabulary are first identified and then inverted back to the text space; the identifier $S_*$ is finally defined as a character sequence composed of one to three decoded tokens. (Since Dreambooth did not publicly release code, we adopt the implementation by huggingface and use “sks”, which is also a rare token, as the identifier.)

Since Dreambooth has not published its code, we cannot know exactly how the rarity of tokens is defined or how the selected identifier performs. Unlike this more complex setting, the earliest reimplementation (Implementation of Dreambooth with Stable Diffusion: https://github.com/XavierXiao/Dreambooth-Stable-Diffusion) simply selected ‘sks’, a rare word in the English dictionary, as the identifier. This choice has been maintained in the currently most widely adopted huggingface codebase, and our experiments follow the same setup.

Fine-tuning the model using a limited number of images can result in significant overfitting. Additionally, this can cause “language drift,” where the model loses its ability to generate images of typical objects. Dreambooth addresses this problem by introducing a ‘prior preservation loss’ as a regularization term. The augmented loss is given as follows:

\mathbb{E}_{x,c,\epsilon,\epsilon',t}\Big[\|\hat{x}_{\theta}(\alpha_{t}x+\sigma_{t}\epsilon,c)-x\|_{2}^{2}+\lambda\|\hat{x}_{\theta}(\alpha_{t'}x_{\mathrm{pr}}+\sigma_{t'}\epsilon',c_{\mathrm{pr}})-x_{\mathrm{pr}}\|_{2}^{2}\Big].   (2)

The first term corresponds to the original loss function of the diffusion model, which measures the L2 distance between the predicted image $\hat{x}_{\theta}$ and the original image $x$ containing the target object. (The backbone of Dreambooth is Imagen, which predicts the original image; the Stable Diffusion model instead predicts the noise added to the original image.) The second term is the prior preservation loss. It measures the L2 distance between the predicted image $\hat{x}_{\theta}$ and the original image $x_{\mathrm{pr}}$ for common class objects. This necessitates the generation of a batch of images using a pre-trained diffusion model based on prompts that include common class names.
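
As a rough sketch, Eq. (2) in its noise-prediction form (as used with Stable Diffusion) can be evaluated as follows, assuming the components loaded in the previous sketch; the prompts, the class “dog”, and the variable names are illustrative, not the released Dreambooth code.

# Sketch of the Dreambooth objective (Eq. 2) in its noise-prediction form.
# `unet`, `vae`, `text_encoder`, `tokenizer`, and `scheduler` are assumed to be
# loaded as in the earlier sketch.
import torch
import torch.nn.functional as F

def noise_mse(pixels, prompt):
    latents = vae.encode(pixels).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (latents.shape[0],))
    noisy = scheduler.add_noise(latents, noise, t)
    ids = tokenizer([prompt] * latents.shape[0], padding="max_length", truncation=True,
                    max_length=tokenizer.model_max_length, return_tensors="pt").input_ids
    pred = unet(noisy, t, encoder_hidden_states=text_encoder(ids)[0]).sample
    return F.mse_loss(pred, noise)

def dreambooth_loss(target_pixels, prior_pixels, lam=1.0):
    # Reconstruction term on the few target-object images ...
    rec = noise_mse(target_pixels, "a photo of sks dog")
    # ... plus the prior preservation term on pre-generated common-class images.
    prior = noise_mse(prior_pixels, "a photo of a dog")
    return rec + lam * prior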

While Dreambooth works much better than textual inversion and sometimes yields satisfactory results, it has its own limitations. The original paper [12] mentions several issues, such as incorrect context synthesis (i.e., failing to accurately generate the prompted context), context-appearance entanglement (i.e., the appearance of the target object changes with the prompted context), and overfitting (i.e., generating images too similar to the input samples).

In addition to the issues mentioned in the paper, Dreambooth tends to suffer from the following problems.

Compromised training efficiency  The computation of the prior preservation loss necessitates generating 1,000 ordinary class images in advance. This step can take even longer than the training itself, effectively doubling or tripling the training time and significantly reducing training efficiency.

Lower quality image generation for common class objects  Across certain datasets, the use of the prior preservation loss causes discernible degradation in the quality of images of ordinary class objects; see Figure 4.

Artifacts in generated images It is reported in [14] that excessive training in Dreambooth can lead to the emergence of color artifacts in the generated images, which aligns with our findings; see Figure 5.

Figure 3: Results of personalization with Dreambooth and our method. Reconstruction of ‘backpack’, ‘dog’, and ‘cat’ subject instances from the dataset in [12]. The results from the best-performing checkpoints are selected and shown here. The images enclosed by blue boxes are the results of our method, with the prompts used from left to right being “a photo of ⟨rare⟩ backpack”, “a photo of ⟨rare⟩ backpack on the beach”, and “a photo of ⟨rare⟩ backpack in the jungle”. The red bounding boxes show Dreambooth’s results, with the prompts used from left to right being “a photo of sks backpack”, “a photo of sks backpack on the beach”, and “a photo of sks backpack in the jungle”. As mentioned in Section 3.1.2, we adopt the standard huggingface setting that uses ‘sks’ as the identifier.

While the prior preservation loss addresses overfitting and language drift, it also introduces new problems, which become more pronounced with increasing training time. In another paper [16], a weight-constrained loss was introduced to slow down the rate of parameter updates. However, these regularization terms are insufficient to address the above issues.

3.2 Proper Combination of the Two

We present an approach that combines textual inversion and Dreambooth. While combining the two is a straightforward idea, their proper integration leads to the resolution of each method’s issues, as will be shown in our experimental results.

Specifically, the proposed approach consists of two stages. In the first stage, a token embedding is optimized similarly to textual inversion; in the second stage, the diffusion model itself is optimized (i.e., fine-tuned) similarly to Dreambooth, but without the prior preservation loss. The two stages differ from textual inversion and Dreambooth as explained below.

In the first stage, we incorporate a new token “⟨rare⟩”, which is used as an adjective to modify a noun representing the target object, e.g.,

  • “a photo of ⟨rare⟩ backpack”.

Note that this differs from textual inversion, which uses a new word $S_*$ as a noun representing the target object itself. We optimize the embedding $v_{\mathrm{rare}}$ of “⟨rare⟩” by

\min_{v_{\mathrm{rare}}}\,\mathbb{E}_{z\sim\mathcal{E}(x),\,y,\,\epsilon\sim\mathcal{N}(0,I),\,t}\Big[\|\epsilon-\epsilon_{\Theta}(z_{t},t,\tau_{\theta}(y))\|_{2}^{2}\Big],   (3)

with $v_{\mathrm{rare}}$ initialized with the embedding of the word “rare”.

Another difference from textual inversion is that we aim to make “⟨rare⟩” only coarsely represent the appearance of the target object. Thus, we update $v_{\mathrm{rare}}$ for only about 100 steps. Note that textual inversion aims to make the new token represent the target object as perfectly as possible, requiring 3,000-5,000 update steps.

In the second stage, we fine-tune the parameters $\Theta$ of the diffusion model by

\min_{\Theta}\,\mathbb{E}_{z\sim\mathcal{E}(x),\,y,\,\epsilon\sim\mathcal{N}(0,I),\,t}\Big[\|\epsilon-\epsilon_{\Theta}(z_{t},t,\tau_{\theta}(y))\|_{2}^{2}\Big].   (4)

This loss uses the input images of the target object and the above prompt containing “⟨rare⟩”, whose embedding $v_{\mathrm{rare}}$, optimized in the first stage, is frozen in this stage; the parameters $\theta$ of the text encoder are frozen, too. Note that we do not use the prior preservation loss here. The aim is to train the model, together with the token embedding $v_{\mathrm{rare}}$ learned in the first stage, to represent the detailed appearance of the target object.

Another difference from Dreambooth is that we update the model’s parameters $\Theta$ for only 200-400 steps of fine-tuning, whereas Dreambooth requires 1,000 steps of parameter updates. Note also that Dreambooth needs to generate an extra 200 to 1,000 images of common class objects for the prior preservation loss.
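
The two stages can be summarized by the following high-level sketch. It assumes a placeholder ldm_loss(prompt, batch) that computes the plain noise-prediction MSE of Eqs. (3)/(4) with the frozen VAE and scheduler (as in the earlier sketch), a data loader over the few sample images, and a hypothetical helper zero_grads_except for the ⟨rare⟩ token; none of these names come from released code.

# High-level sketch of the proposed two-stage training.
import itertools
import torch

prompt = "a photo of <rare> backpack"   # the new token acts as an adjective

# Stage 1: optimize only v_rare (initialized from the word "rare") for ~100 steps;
# all model parameters and other embeddings stay frozen.
unet.requires_grad_(False)
text_encoder.requires_grad_(False)
emb_weight = text_encoder.get_input_embeddings().weight
emb_weight.requires_grad_(True)
opt1 = torch.optim.AdamW([emb_weight], lr=5e-4, weight_decay=0.0)
for _, batch in zip(range(100), itertools.cycle(data)):
    loss = ldm_loss(prompt, batch)                    # Eq. (3)
    opt1.zero_grad()
    loss.backward()
    zero_grads_except(emb_weight, rare_token_id)      # keep all other embeddings fixed
    opt1.step()

# Stage 2: freeze the text encoder (including v_rare) and fine-tune the U-Net
# for 200-400 steps; no prior preservation loss and no extra class images.
emb_weight.requires_grad_(False)
unet.requires_grad_(True)
opt2 = torch.optim.AdamW(unet.parameters(), lr=5e-6)
for _, batch in zip(range(400), itertools.cycle(data)):
    loss = ldm_loss(prompt, batch)                    # Eq. (4)
    opt2.zero_grad()
    loss.backward()
    opt2.step()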

To summarize, our method offers two advantages over existing methods. First, it enhances the quality of generated images, often significantly, as demonstrated in the subsequent section. Second, it considerably reduces training time. Fewer parameter updates also decrease the risk of overfitting and language drift, and removing the prior preservation loss avoids its side effects; both contribute to superior image generation quality.

It should be noted that the approach employed by Imagic [6] has some similarity to our method; it employs a similar two-stage approach of optimizing the embedding of a prompt followed by optimizing the model. However, Imagic does not explicitly extract the features of the target object; a separate model needs to be trained for each different editing prompt. Our method allows users to freely modify the image based on different text inputs, providing more flexibility in editing.

Figure 4: Blurriness and diminished realism. One of the input sample images is shown on the left and the generated results are on the right. The blue boxes indicate the results of our method and the red boxes indicate those of Dreambooth. The prompt is “a photo of cat”. All results are taken from the checkpoints at 200 steps.

4 Experimental Results

4.1 Experimental Configuration

We experimentally evaluate the proposed method using the dataset introduced in [12]. The dataset consists of 30 sets of images, each featuring four to six images containing the same specific object. For the base text-to-image synthesis model, we use a pre-trained latent diffusion model by huggingface, i.e., “stabilityai/stable-diffusion-2-1-base” which is available from: https://huggingface.co/stabilityai/stable-diffusion-2-1-base.

As explained above, the proposed method consists of two stages. The parameters are updated for 100 iterations with a learning rate of $5\times 10^{-4}$ in the first stage, and 800 iterations with a learning rate of $5\times 10^{-6}$ in the second stage. We save a checkpoint every 200 steps in the second stage for evaluation.
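
For reference, this configuration can be summarized as follows; the key names are our own shorthand, not from any released code.

# The training configuration described above, as a plain dict.
config = {
    "base_model": "stabilityai/stable-diffusion-2-1-base",
    "stage1": {"steps": 100, "learning_rate": 5e-4},   # token embedding only
    "stage2": {"steps": 800, "learning_rate": 5e-6,    # U-Net fine-tuning
               "checkpoint_every": 200},
    "images_per_subject": "4 to 6",
}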

4.2 Quantitative Evaluation

To evaluate personalization methods, it’s essential to assess how well the model preserves the details of the target object in the generated images. However, designing a metric that aligns with human judgment to measure the similarity of objects in two images is a challenging task.

While it remains an open problem, we follow previous studies [12] for comparative evaluation of the proposed and existing methods.

Specifically, we employ the CLIP score and the DINO score as evaluation metrics; see [1, 8] for their definitions. The evaluation process unfolds as follows. First, we employ the fine-tuned model to produce images of the target object. Then, both the generated and input sample images of the target object are encoded into their respective embeddings, either through the CLIP image encoder or the DINO vision transformer. We then compute the cosine similarity between these embeddings to gauge the resemblance between the generated and sample images. For this evaluation, we utilize two methods of prompting: one sourced from Dreambooth [12] and the other from textual inversion [3].

The first method, from Dreambooth, uses 25 diverse prompts with extra descriptions (the complete prompts are available at https://github.com/google/dreambooth/blob/main/dataset/prompts_and_classes.txt), each generating four images, totaling 100 images. This tests whether the model can produce diverse images. Some of the prompts may change the object’s shape and color, e.g., “a red ⟨rare⟩ dog” or “a cube shaped ⟨rare⟩ dog”, which can lead to a decrease in the CLIP score.

The second method, from textual inversion, uses a simple prompt containing only the target object, e.g., “a photo of ⟨rare⟩ dog”, to generate 64 images.
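
As an illustration of the metric computation described above, a minimal sketch of the CLIP image-image score follows; the DINO score is computed analogously with a DINO ViT backbone in place of the CLIP image encoder. The model identifier and the averaging over image pairs are our assumptions.

# Sketch of the CLIP image-image score: embed generated and sample images with
# the CLIP image encoder and average the pairwise cosine similarities.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_image_score(generated_paths, sample_paths):
    def embed(paths):
        images = [Image.open(p).convert("RGB") for p in paths]
        inputs = processor(images=images, return_tensors="pt")
        feats = clip.get_image_features(**inputs)
        return feats / feats.norm(dim=-1, keepdim=True)
    gen, ref = embed(generated_paths), embed(sample_paths)
    return (gen @ ref.T).mean().item()   # mean pairwise cosine similarity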

Table 1 shows the results. The number of optimal training steps can fluctuate depending on the target objects and their sample images. As a result, we selected the highest-quality images that most closely matched the training images from checkpoints between 200 and 800 steps. For our method, optimal results emerged at either 200 or 400 steps; for Dreambooth, the best outcomes were between 400 and 800 steps. As seen in Table 1, our method consistently outperforms Dreambooth across all metrics and testing prompts. Additionally, using simple prompts results in higher scores than using diverse prompts.

                               Diverse Prompt            Simple Prompt
Method                         CLIP score   DINO score   CLIP score   DINO score
Ours                           0.800        0.629        0.859        0.718
Dreambooth (Stable Diffusion)  0.753        0.540        0.841        0.690

Table 1: Results of quantitative evaluation.

4.3 Qualitative Evaluation

We next show several examples to qualitatively compare the results of the different methods.

4.3.1 Quality and Fidelity of Target Object Images

Figure 3 shows a few examples of images generated by the proposed method and Dreambooth for the same set of prompts. The input images for the specific target objects are shown on the left (without colored bounding boxes). We input three prompts to our method and to Dreambooth, i.e., “a photo of ⟨rare⟩ dog”, “a photo of ⟨rare⟩ dog on the beach”, and “a photo of ⟨rare⟩ dog in the jungle”. The generated images are shown on the right in Figure 3; our method’s results are enclosed by blue boxes and Dreambooth’s by red boxes.

It is first observed that Dreambooth suffers from “context-appearance entanglement,” an issue in which the appearance of the target object is influenced by the context, as mentioned in [12]. In Dreambooth’s outputs for ‘backpack’, the color of the backpack changes with the background. This does not occur with our method. This phenomenon may be attributable to the fact that the manually specified identifier inevitably carries some prior knowledge that can affect the generated images in specific contextual settings.

It is also seen for ‘dog’ and ‘cat’ images that our method generates images of higher quality and better fidelity.

Figure 5: Degradation of image quality due to extended training. Three examples for the object ‘plushie’ (stuffed animal). The blue boxes indicate the results of our method using the prompt “a photo of ⟨rare⟩ plushie”. The red boxes indicate those of Dreambooth using the prompt “a photo of sks plushie”. From left to right, the columns correspond to checkpoints at training steps 200, 400, 600, and 800.

4.3.2 Image Generation for Common Class Objects

Dreambooth employs the prior preservation loss to mitigate the forgetting of how to generate general object images while training on target object images. While this preserves the diversity of general objects, it may degrade image quality. Since our method requires only a small amount of training of the model parameters, forgetting does not occur even though we do not use this loss function. We demonstrate the impact of the prior preservation loss by comparing the generated images of general objects.

This phenomenon becomes more pronounced as the number of training steps increases. For a fair comparison, we chose checkpoints at the same training step for our method and Dreambooth, rather than opting for the checkpoint that performed best on the target object. Figure 4 shows a few examples. The images on the left come from two datasets that contain a cat as the target object. The images on the right are generated with the text input “a photo of cat”. All results are from the 200-step checkpoints. It is worth noting that at this step, our method can already reconstruct the target object quite well, while Dreambooth cannot. With the same number of training steps, Dreambooth’s results (enclosed by red boxes) exhibit a noticeable decrease in the quality of generated general object images, with a clear difference in realism compared to our approach.

4.3.3 Longer Training Often Degrades Image Quality

Training for an extended number of steps frequently results in distorted or blurry images. This degradation differs from that observed for common-class images, which can suffer severe distortions. In Figure 5, the left side displays input sample images, while the right side shows images generated by the models at various training steps. Blue bounding boxes highlight results from our method, while red boxes indicate Dreambooth’s. Notably, not every target object in the dataset exhibits this phenomenon, and the underlying cause remains elusive. We have found that it often occurs after the model has robustly learned the target object’s features, hinting that overtraining may be responsible. Importantly, our method shows this effect earlier than Dreambooth, suggesting that it reaches optimal training more efficiently.

5 Conclusion

We have proposed a method for personalizing text-to-image diffusion models that enables the model to learn from just four to six sample images of a specific target object. By adding new identifiers, our method can generate images of a specified object in various scenes simply by modifying the input prompt. Our approach produces images more similar to the original object while greatly preserving the model’s inherent capabilities, leading to reduced language drift or image quality degradation. Additionally, our method achieves these enhanced results in significantly less training time.

References

  • [1] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jegou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9630–9640, 2021.
  • [2] Wenhu Chen, Hexiang Hu, Yandong Li, Nataniel Ruiz, Xuhui Jia, Ming-Wei Chang, and William W Cohen. Subject-driven text-to-image generation via apprenticeship learning. arXiv preprint arXiv:2304.00186, 2023.
  • [3] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion, 2022.
  • [4] Inhwa Han, Serin Yang, Taesung Kwon, and Jong Chul Ye. Highly personalized text embedding for image manipulation by stable diffusion. arXiv preprint arXiv:2303.08767, 2023.
  • [5] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv preprint arxiv:2006.11239, 2020.
  • [6] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In Conference on Computer Vision and Pattern Recognition 2023, 2023.
  • [7] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In CVPR, 2023.
  • [8] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.
  • [9] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents, 2022.
  • [10] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation, 2021.
  • [11] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022.
  • [12] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
  • [13] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding, 2022.
  • [14] Valentine Kozin, Suraj Patil, and Pedro Cuenca. Training Stable Diffusion with Dreambooth using Diffusers, 2022.
  • [15] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022.
  • [16] Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Ngai-Man Cheung, and Min Lin. A recipe for watermarking diffusion models. arXiv preprint arXiv:2303.10137, 2023.
  • [17] Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, and Tong Sun. Lafite: Towards language-free training for text-to-image generation. arXiv preprint arXiv:2111.13792, 2021.