
Model-Agnostic Human Preference Inversion in Diffusion Models

Jeeyung Kim, Ze Wang, Qiang Qiu
Purdue University
{jkim17, wang5026, qqiu}@purdue.edu
Abstract

Efficient text-to-image generation remains a challenging task due to the high computational costs associated with multi-step sampling in diffusion models. Although distillation of pre-trained diffusion models has been successful in reducing sampling steps, low-step image generation often falls short in terms of quality. In this study, we propose a novel sampling design to achieve high-quality one-step image generation aligned with human preferences, focusing in particular on the impact of the prior noise distribution. Our approach, Prompt Adaptive Human Preference Inversion (PAHI), optimizes the noise distribution for each prompt based on human preferences without fine-tuning diffusion models. Our experiments show that the tailored noise distributions significantly improve image quality with only a marginal increase in computational cost. Our findings underscore the importance of noise optimization and pave the way for efficient and high-quality text-to-image synthesis.

1 Introduction

Recent advances in diffusion models (DMs) have unlocked unprecedented capabilities in text-to-image generation [25]. Despite their promise, the widespread adoption of DMs in practical applications may be hindered by the high inference costs associated with multi-step sampling. To tackle this challenge, considerable effort has been focused on reducing the number of sampling steps while maintaining image quality. In particular, distilling pre-trained DMs using adversarial approaches [27, 33] and consistency regularization on ordinary differential equation (ODE) trajectories [30, 15] have emerged as effective strategies. These methods enable high-fidelity text-to-image generation in a few steps. However, images produced in fewer steps often exhibit inferior quality compared to those generated with more steps [27].

Prior studies [9, 28, 14] modify the sampling process to improve the quality of image generation without altering the training procedure. They propose methods to enhance multi-step sampling in DMs, focusing on modifications to the probability flow (PF) solver, such as adjustments to noise schedules [9, 14] and control of stochasticity levels [28]. Nevertheless, for distilled models with extremely limited sampling steps, alternative sampling designs, orthogonal to direct manipulation of the PF solver, can be essential to further advance image quality, particularly considering that the characteristics of DMs can diminish in distilled models. One potential approach involves combining the sampling process with prompt optimization: [7, 16] introduce prompt optimization frameworks that tailor user input to model-preferred prompts.

Another promising approach is optimizing the noise prior, an additional input alongside prompts in DMs. Notably, the prior noise distribution, typically fixed as a standard Gaussian, has been overlooked as a target of optimization for enhancing sampling quality. In particular, when the sampling process in DMs is deterministic, the prior noise directly shapes the resulting images, underscoring the potential of adjusting the noise distribution for superior images. Therefore, our study aims to discover a noise distribution that surpasses the standard Gaussian in generating higher-quality images. We focus on one-step generation because it clearly reflects the impact of the prior noise, unlike multi-step generation where the effect of the prior noise can be gradually attenuated during sampling.

To discover the (sub-)optimal noise distribution, we employ a DM as an image generator and a scoring model [12] as an evaluator to assess the generated image quality. This scoring model was trained on human preferences regarding image-prompt pairs and plays a pivotal role in providing feedback on the alignment of generated images with human preferences, emulating a Human-in-the-Loop framework. Inspired by optimization-based inversion techniques [1, 37] commonly used in Generative Adversarial Networks (GANs) [5], we directly optimize the parameters of noise distribution based on human preference scores, leading to enhancements in image quality while keeping DMs intact.

Taking our investigation a step further, we show that (sub-)optimal noise distributions can vary depending on the text prompts provided. We therefore introduce a crucial component: a noise-predicting model. This lightweight model, comprising a pre-trained text encoder and shallow layers, processes text prompts and predicts the parameters of a Gaussian distribution. This distribution is then employed as the prior noise for the image generator. We train the noise-predicting model to produce noise that helps the DM generate high-scoring images. We demonstrate that a noise Gaussian distribution with parameters tailored to specific text prompts generates superior images in one step, aligning with human preference.

Our method involves inverting human-preferred images back into the noise space, termed Prompt Adaptive Human Preference Inversion (PAHI). PAHI serves as a model-agnostic image enhancement approach, adjusting the noise distribution with a lightweight noise-predicting model. Notably, our method is a general framework applicable to multi-step deterministic sampling scenarios, extending its potential impact beyond one-step sampling. To the best of our knowledge, we are the first to investigate the impact of noise optimization on text-to-image synthesis. By bridging the efficiency of low-step generation with enhanced quality, we unlock the potential of diffusion models for real-world applications.

2 Related Work

Distilling diffusion models for low-step sampling. Numerous studies [17, 30, 15, 13, 27, 33, 26] have focused on accelerating the sampling process of DMs. Among them, Song et al. [30] introduce consistency models, where points on the same ODE trajectory map to the same initial point, enabling low-step generation. However, this method has not yet been applied to text-to-image synthesis. In the realm of text-to-image generation, Luo et al. [15] apply the consistency model to the latent space and facilitate text-to-image generation by distilling Stable Diffusion [25]. Another notable approach to regulating ODE trajectories is that of Liu et al. [13], who introduce Rectified Flow, aimed at straightening the trajectories of ODEs. On the other hand, Sauer et al. [27] employ an adversarial loss and a score distillation sampling loss [20] to generate images, showcasing superior performance compared to other distillation methods. However, low-step (1 or 2) generation often displays inferior quality compared to multi-step image generation.

Prompt adaptive sampling design. Hao et al. [7] introduce prompt adaptation, adjusting user input to match model-preferred prompts without fine-tuning DMs. During sampling, text inputs undergo prompt adaptation before being fed into the diffusion model. Zhang et al. [34] emphasize the significance of using different numbers of sampling steps depending on the prompt and propose a framework of sampling with instance-specific steps to reduce sampling cost without compromising image quality. However, none of these studies explore the potential of optimizing the noise distribution contingent upon prompts.

Fine-tuning diffusion models on human feedback. [31, 3, 2] align DMs with human preferences by directly optimizing them on human comparison data using reinforcement learning. However, such approaches differ from our model-agnostic approach, as they require fine-tuning DMs, which can be resource-intensive.

Inversion. In prior studies, inversion has been crucial for image manipulation, aiming to find a latent representation corresponding to a given image. In the GAN literature, [1, 37, 6, 24] use optimization-based techniques, directly optimizing latent vectors, while [23, 36, 19] employ trained encoders to map images to their latent representations. In DMs, inversion is adapted to accommodate properties such as stochastic, multi-step sampling and conditioned generation. Textual Inversion [4] represents user-provided concepts as pseudo-words in the text embedding space for versatile editing. [28] proposes a deterministic sampling procedure that enables inversion in closed form, facilitating image manipulation in DMs [29, 22, 18, 10, 35].

In contrast to prior works, our study does not use inversion techniques for image manipulation. Instead, we invert human preferences into the noise space, where samples from this space lead to improved image quality.

3 Method

3.1 Preliminaries

In this study, we use the distilled diffusion model (Stable Diffusion 2.1 backbone [25]) trained with an adversarial loss [27], which exhibits superior performance in low-step image generation. As in general diffusion models, the generative process of the distilled model progressively denoises a noisy observation starting from the standard Gaussian, where $p(\bm{x}_{T})=\mathcal{N}(\bm{x}_{T};\bm{0},\bm{I})$ and $T$ represents the total number of timesteps used during the training of the diffusion model. The denoised observation is predicted with the distilled diffusion model ($\epsilon_{\theta}$) [8] as follows:

\text{SD}_{\theta}(\bm{x}_{\tau_{l}},c):=\frac{\bm{x}_{\tau_{l}}-\sqrt{1-\alpha_{\tau_{l}}}\,\epsilon_{\theta}(\bm{x}_{\tau_{l}},c)}{\sqrt{\alpha_{\tau_{l}}}}, \qquad (1)

where $\text{SD}_{\theta}(\cdot,\cdot)$ represents the denoised image, $l\in\{1,\cdots,L\}$, $\tau_{L}=T$, and $L$ is set to values less than 4 in [27]. $c$ denotes the condition and $\alpha_{\tau_{l}}$ denotes the variance schedule at $\tau_{l}$. Note that our study primarily concentrates on one-step generation ($L=1$).
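For illustration, a minimal PyTorch sketch of the one-step denoising in Eq. 1; `eps_model` and `alphas_cumprod` are hypothetical handles for the distilled noise predictor $\epsilon_{\theta}$ and the variance schedule, not names from the released implementation:

```python
import torch

def one_step_denoise(eps_model, x_T, c, alphas_cumprod, tau=999):
    # Eq. 1: SD_theta(x_T, c) = (x_T - sqrt(1 - alpha_tau) * eps_theta(x_T, c)) / sqrt(alpha_tau).
    # eps_model(x, t, c) is assumed to return the predicted noise eps_theta(x, c) at timestep t.
    alpha = alphas_cumprod[tau]                          # variance schedule at tau_L = T
    eps = eps_model(x_T, tau, c)                         # predicted noise
    return (x_T - torch.sqrt(1.0 - alpha) * eps) / torch.sqrt(alpha)
```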

On the other hand, Kirstain et al. [12] create an open dataset of text prompts and real users’ preferences over generated images. Based on this dataset, they train a CLIP-based [21] scoring model, named PickScore, which shows superior alignment with human preferences on generated images. We use PickScore as our scoring model in the following sections. Note that alternative models can also be employed as a scoring model.

3.2 The Proposed Method (PAHI)

In this section, we introduce our proposed method, which enhances image quality while keeping the diffusion model intact. We first propose a human preference inversion method in which a single noise distribution is optimized across all prompts. Following this, we propose Prompt Adaptive Human Preference Inversion (PAHI), which predicts customized noise distributions for individual prompts.

Optimizing noise distribution across all prompts. We employ a distilled diffusion model for image generation and a scoring model to evaluate the generated images. The output of the scoring model reflects the predicted human preference for the generated images. The score of the generated image given the text prompt $c_{i}$ is defined as follows:

s(\bm{x}_{T}^{m},c_{i})=\text{SC}_{\phi}(\text{SD}_{\theta}(\bm{x}_{T}^{m},c_{i}),c_{i}), \qquad (2)

where $\text{SC}_{\phi}$ is the scoring model, $\bm{x}_{T}^{m}$ denotes the $m$-th sample from $p(\bm{x}_{T})$, $c_{i}\in\{c_{1},c_{2},\dots,c_{n}\}$, and $n$ denotes the number of training text prompts. We use $s(\bm{x}_{T}^{m},c_{i})$ and $s(\bm{x}_{T},c_{i})$ interchangeably.
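To make the scoring pipeline concrete, a hedged sketch of Eq. 2; the CLIP-style scorer mirrors the spirit of PickScore [12], and `denoise_fn`, `decode_fn`, `img_enc`, and `txt_enc` are assumed wrappers around the one-step denoiser, the latent-to-image decoder, and the scorer's encoders:

```python
import torch
import torch.nn.functional as F

def preference_score(x_T, prompt, denoise_fn, decode_fn, img_enc, txt_enc, scale=100.0):
    # Eq. 2: s(x_T, c) = SC_phi(SD_theta(x_T, c), c).
    image = decode_fn(denoise_fn(x_T, prompt))        # one-step generated image
    i = F.normalize(img_enc(image), dim=-1)           # image embedding
    t = F.normalize(txt_enc(prompt), dim=-1)          # prompt embedding
    # CLIP-style preference score: scaled cosine similarity between the two embeddings.
    # Gradients flow through the frozen models, so the same function can be reused
    # when optimizing the noise parameters below.
    return scale * (i * t).sum(dim=-1)
```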

We posit the existence of a potentially superior Gaussian distribution $p(\bm{x}_{T}')$ compared to the standard Gaussian, which satisfies

p(\bm{x}_{T}')=\mathcal{N}(\bm{x}_{T};\bm{\mu},\text{diag}(\bm{\sigma})),\;\; s(\bm{x}_{T}',c_{i})>s(\bm{x}_{T},c_{i}). \qquad (3)

Thus, we aim to optimize $\bm{\mu}$ and $\bm{\sigma}$ of the prior noise $\bm{x}_{T}'$ to maximize scores (i.e., to align with human preferences), where $\bm{\mu},\bm{\sigma}\in\mathbb{R}^{4k^{2}}$ and $k$ denotes the size of the latent variable in the pre-trained latent diffusion model [25].

We find a superior Gaussian distribution $p(\bm{x}_{T}')$ by minimizing the objective function defined as follows:

\mathcal{L}=-\sum^{n}\left(0\cdot\log f(\bm{\mu},\bm{\sigma})+1\cdot\log f'(\bm{\mu},\bm{\sigma})\right), \qquad (4)

\bm{\mu}^{*},\bm{\sigma}^{*}=\underset{\bm{\mu},\bm{\sigma}}{\mathrm{argmin}}\;\mathcal{L}, \qquad (5)

where $f(\bm{\mu},\bm{\sigma})=\frac{e^{s}}{e^{s}+e^{s'}}$ and $f'(\bm{\mu},\bm{\sigma})=\frac{e^{s'}}{e^{s}+e^{s'}}$. For brevity, we write $s'$ for $s(\bm{x}_{T}',c_{i})$ and $s$ for $s(\bm{x}_{T},c_{i})$.

To optimize the parameters of the Gaussian distribution, we use the reparameterization trick [11] as follows:

\bm{x}_{T}^{\prime j}=\bm{\epsilon}'\odot\bm{\sigma}^{2}+\bm{\mu},\;\text{where}\;\bm{\epsilon}'\sim\mathcal{N}(\bm{0},\bm{I}), \qquad (6)

where $\bm{x}_{T}^{\prime j}$ denotes the $j$-th sample from $p(\bm{x}_{T}')$. The sample is used as input to $\text{SD}_{\theta}$ alongside the text prompt $c_{i}$. We use the identified $\bm{\mu}^{*}$ and $\bm{\sigma}^{*}$ for all prompts during inference.
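A sketch of the prompt-agnostic inversion (HI) described by Eqs. 4-6, assuming a `preference_score`-style function as above; only $\bm{\mu}$ and $\bm{\sigma}$ receive gradients, and the positivity safeguard on $\bm{\sigma}$ is an illustrative choice:

```python
import torch
import torch.nn.functional as F

def optimize_global_noise(prompts, score_fn, latent_shape=(4, 64, 64), steps=10_000, lr=1e-3):
    # Optimize a single Gaussian prior N(mu, diag(sigma)) shared across all prompts.
    mu = torch.zeros(latent_shape, requires_grad=True)
    sigma = torch.ones(latent_shape, requires_grad=True)
    opt = torch.optim.Adam([mu, sigma], lr=lr)

    for step in range(steps):
        c = prompts[step % len(prompts)]
        x_std = torch.randn(1, *latent_shape)          # x_T ~ N(0, I), the baseline sample
        eps = torch.randn(1, *latent_shape)
        x_opt = eps * sigma ** 2 + mu                  # x_T' via the reparameterization trick (Eq. 6)

        s = score_fn(x_std, c)                         # s  = s(x_T, c)
        s_prime = score_fn(x_opt, c)                   # s' = s(x_T', c)

        # Eq. 4 reduces to -log f'(mu, sigma) = -log sigmoid(s' - s):
        # the optimized prior should win the pairwise comparison against the standard one.
        loss = -F.logsigmoid(s_prime - s).mean()
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            sigma.clamp_(min=1e-4)                     # keep sigma positive (assumed safeguard)
    return mu.detach(), sigma.detach()
```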

Prompt-adaptive noise distribution. We take one step further and tailor $\bm{\mu}(c_{i})$ and $\bm{\sigma}(c_{i})$ to an individual prompt $c_{i}$. Building upon the previous framework, we construct a conditional noise prediction model consisting of a pre-trained text encoder ($E$) and two MLPs ($g_{\psi}$) for predicting $\bm{\mu}(c_{i})$ and $\bm{\sigma}(c_{i})$, respectively. $g_{\psi}$ takes the text embedding as input and outputs the predicted parameters of a Gaussian distribution.

The entire procedure is as follows:

g_{\psi}(E(c_{i}))=(\bm{\mu}(c_{i}),\bm{\sigma}(c_{i})), \qquad (7)

\bm{x}_{T}^{\prime j}=\bm{\epsilon}'\odot\bm{\sigma}(c_{i})^{2}+\bm{\mu}(c_{i}),\;\text{where}\;\bm{\epsilon}'\sim\mathcal{N}(\bm{0},\bm{I}), \qquad (8)

s(\bm{x}_{T}',c_{i})=\text{SC}_{\phi}(\text{SD}_{\theta}(\bm{x}_{T}^{\prime j},c_{i}),c_{i}), \qquad (9)

where we use the text encoder of the diffusion model ($\epsilon_{\theta}$) for $E(\cdot)$ so as not to impose additional computation. We find $\psi^{*}$ that minimizes the objective function defined in Eq. 4. During inference, we first identify the optimal $\bm{\mu}^{*}(c_{i})$ and $\bm{\sigma}^{*}(c_{i})$ for prompt $c_{i}$ using the noise-predicting model, and then employ the resulting Gaussian as the prior noise for the distilled diffusion model to generate images.
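A hedged sketch of the noise-predicting model $g_{\psi}$ in Eqs. 7-9; the hidden size, activation, and softplus parameterization of $\bm{\sigma}(c_{i})$ are illustrative assumptions not specified above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisePredictor(nn.Module):
    # g_psi: maps the pooled prompt embedding E(c_i) to (mu(c_i), sigma(c_i)) of the prior noise.
    def __init__(self, text_dim=1024, latent_dim=4 * 64 * 64, hidden=1024):
        super().__init__()
        self.mu_head = nn.Sequential(nn.Linear(text_dim, hidden), nn.SiLU(), nn.Linear(hidden, latent_dim))
        self.sigma_head = nn.Sequential(nn.Linear(text_dim, hidden), nn.SiLU(), nn.Linear(hidden, latent_dim))

    def forward(self, text_emb):
        mu = self.mu_head(text_emb)                              # Eq. 7: mu(c_i)
        sigma = F.softplus(self.sigma_head(text_emb)) + 1e-4     # Eq. 7: sigma(c_i), kept positive
        return mu, sigma

def sample_prior(mu, sigma, latent_shape=(4, 64, 64)):
    # Eq. 8: x_T' = eps' * sigma(c_i)^2 + mu(c_i), with eps' ~ N(0, I).
    eps = torch.randn(mu.shape[0], *latent_shape)
    return eps * sigma.view(-1, *latent_shape) ** 2 + mu.view(-1, *latent_shape)
```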

When $\psi$ is initialized randomly, $\bm{\mu}(c_{i})$ and $\bm{\sigma}(c_{i})$ may deviate significantly from their appropriate values, $\bm{0}$ and $\bm{I}$. To stabilize the initial training phase, we pre-train $g_{\psi}$ to produce parameters close to a standard Gaussian distribution while retaining the input text embedding information. This is achieved by minimizing the Kullback-Leibler (KL) divergence between the standard Gaussian and the Gaussian with the predicted parameters, together with a reconstruction loss on the text embeddings. We reconstruct the text embedding using a decoder ($h_{\omega}$), a 2-layer MLP, which takes a sample of $\bm{x}_{T}'$ from $g_{\psi}$ as input and produces the reconstructed text embedding as output.

The loss function used for pre-training is structured as follows:

\mathcal{L}'=\sum_{i}^{n}\text{KL}\left(\mathcal{N}(\bm{\mu}(c_{i}),\bm{\sigma}(c_{i})),\,\mathcal{N}(\bm{0},\bm{I})\right)+\text{MSE}\left(E(c_{i}),\,h_{\omega}(g_{\psi}(E(c_{i})))\right). \qquad (10)

We optimize $\psi$ and $\omega$ by minimizing this loss function. Subsequently, the updated parameters $\psi$ serve as the initialization of our noise prediction model for further training.
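A sketch of the pre-training objective in Eq. 10, using the closed-form KL divergence between diagonal Gaussians; `h_omega` stands for the 2-layer MLP decoder, and treating $\bm{\sigma}(c_{i})$ as a standard deviation is an assumption where the $\text{diag}(\bm{\sigma})$ notation leaves the choice open:

```python
import torch
import torch.nn.functional as F

def pretraining_loss(g_psi, h_omega, text_emb):
    # Eq. 10: KL(N(mu(c), diag(sigma(c))) || N(0, I)) + MSE(E(c), h_omega(x_T')).
    mu, sigma = g_psi(text_emb)
    var = sigma ** 2
    # Closed-form KL between N(mu, diag(var)) and N(0, I).
    kl = 0.5 * (var + mu ** 2 - 1.0 - torch.log(var + 1e-8)).sum(dim=-1).mean()

    eps = torch.randn_like(mu)
    x_T_prime = eps * var + mu          # sample of x_T', same parameterization as Eq. 8
    recon = h_omega(x_T_prime)          # reconstructed text embedding
    return kl + F.mse_loss(recon, text_emb)
```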

4 Experiment

We validate our framework (PAHI) by demonstrating that images generated by our method in one step exhibit enhanced quality.

Experiment setups. We conduct a comparison between images generated by our method and ones generated using the standard Gaussian distribution as a prior. We employ the dataset proposed by [12], which consists of 35,000 distinct prompts. We randomly select 500 prompts for validation and another 500 prompts for the test set, while the remaining prompts are used for training. The scoring model used for training is PickScore [12]. During evaluation, we employ both the PickScore and ImageReward [32] scoring models to assess image quality and to determine whether images optimized with PickScore also align with the criteria of ImageReward. As the evaluation metric, we use the win rate, determined by comparing the scores of two images generated from the same prompt: one using a predicted noise distribution and the other using the standard Gaussian distribution. We denote our prompt-adaptive inversion method as PAHI, while the variant using a single inversion across all prompts is referred to as HI.

Implementation details. Our implementation is built upon the Hugging Face Diffusers framework (https://huggingface.co/docs/diffusers/en/index). We integrate ADD-M [27], where we reduce computation and memory costs by replacing the VAE with a tiny autoencoder (https://github.com/madebyollin/taesd). For PAHI, we use the text encoder of the employed diffusion model as $E(\cdot)$, as it removes the need for additional computation. We generate images with a size of 512x512. We set the batch size to 72 and apply a learning rate warm-up for 10,000 steps, gradually increasing from $1\times10^{-5}$. We employ the Adam optimizer. We use early stopping based on the average evaluation loss from the PickScore and ImageReward scoring models: if this loss does not drop for five consecutive evaluations, conducted every 1,000 steps, training stops. The inference time is measured on an NVIDIA RTX 3090 GPU.
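At inference, the predicted Gaussian simply replaces the default latents of the one-step pipeline. A hedged Diffusers sketch; the checkpoint name is an assumed stand-in for the ADD-M model used here, and the pooling helper and the `g_psi`/`sample_prior` components (sketched above) are illustrative assumptions:

```python
import torch
from diffusers import AutoPipelineForText2Image, AutoencoderTiny

# One-step distilled model (assumed stand-in for ADD-M [27]) with the tiny autoencoder as VAE.
pipe = AutoPipelineForText2Image.from_pretrained("stabilityai/sd-turbo", torch_dtype=torch.float16)
pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd", torch_dtype=torch.float16)
pipe.to("cuda")

def pooled_prompt_embedding(pipe, prompt):
    # Assumed pooling: mean over token embeddings from the pipeline's own text encoder E(.).
    tokens = pipe.tokenizer(prompt, padding="max_length",
                            max_length=pipe.tokenizer.model_max_length,
                            truncation=True, return_tensors="pt").input_ids.to(pipe.device)
    return pipe.text_encoder(tokens)[0].mean(dim=1).float().cpu()

prompt = "Head shot of a dragon, digital art style"
with torch.no_grad():
    mu, sigma = g_psi(pooled_prompt_embedding(pipe, prompt))      # noise-predicting model (Sec. 3.2)
    latents = sample_prior(mu, sigma).to("cuda", torch.float16)   # x_T' used in place of N(0, I)

image = pipe(prompt, num_inference_steps=1, guidance_scale=0.0, latents=latents).images[0]
```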

4.1 Results

Table 1: Win rate against images generated from the standard Gaussian, measured with different scoring models. The numbers represent the average (std) win rates over 5 runs with different seeds.

Method | PickScore [12] | ImageReward [32]
PAHI   | 94.0% (0.2)    | 75.5% (2.1)
HI     | 64.7% (1.6)    | 64.1% (1.8)
Table 2: Comparison of inference time between low-step generation and our method. The values represent the average time (and score) per sampled image across 500 text prompts, with a batch size of 1.

Method         | Time (s) (down) | Score (PickScore) (up)
PAHI           | 0.067           | 0.228
One step [27]  | 0.062           | 0.212
Two steps [27] | 0.088           | 0.212

Human preference scores comparison. Our approach showcases superior performance compared to the baseline, as illustrated in Table 1. PAHI significantly outperforms the standard Gaussian, with a remarkable win rate of 94.0%. HI is also effective, achieving a win rate of around 65%, but its improvement does not match that of PAHI. Interestingly, even though it is trained with PickScore as the scoring model, PAHI (HI) also excels at achieving higher ImageReward scores, with a win rate of 75.5% (64.1%). These results demonstrate the effectiveness of optimizing the noise distribution for enhancing image quality. Moreover, they underscore the importance of tailoring the noise distribution to each specific prompt.

[Figure 1: User-written prompts and the corresponding images sampled in one step from the standard Gaussian (left) and the predicted noise distributions (right). Panels: (a) "Head shot of a dragon, digital art style"; (b) "Minnie Mouse in a superman outfit bodybuilding, book illustration"; (c) "a velociraptor and an MGb in the jungle river, waterfall mist, Chrome Detailing".]

Sampling computation cost. We investigate the trade-off between image generation quality and sampling cost during inference. Typically, higher image quality is achieved with more sampling steps, which incur additional computation costs. We compare our approach with the standard Gaussian prior using 1 and 2 sampling steps [27]. We report the sampling time to assess computational efficiency, as well as the resulting scores to gauge quality, as shown in Table 2. Despite no significant difference in scores between one-step and two-step generation, two-step generation incurs additional sampling time. In contrast, our approach demonstrates remarkable efficiency, requiring only a minimal increase in inference time compared to one-step generation (+0.005 s) while achieving higher scores aligned with human preferences. In terms of parameters, PAHI adds 5 million parameters (two MLP layers) in total, which is negligible compared to the 983 million parameters of Stable Diffusion 2.1 [27]. Note that we utilize the text encoder from the employed diffusion model, thus eliminating the need to introduce additional parameters to encode prompts.

Generated images. Figure 1 illustrates images generated from our predicted noise distributions and from a standard Gaussian. We underscore that the images from the predicted noise (right) demonstrate improved quality, aligning better with human preference.

5 Conclusion

Our study investigated improving the quality of one-step text-to-image generation. We proposed a lightweight noise-predicting model that optimizes noise distributions based on human preferences without fine-tuning diffusion models. Our experiments showed that the tailored noise distributions improve image quality with only a marginal rise in computing cost. We highlight the efficacy of noise optimization, promising efficient and high-quality text-to-image synthesis.

References

  • Abdal et al. [2019] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2stylegan: How to embed images into the stylegan latent space? In Proceedings of the IEEE/CVF international conference on computer vision, pages 4432–4441, 2019.
  • Deng et al. [2024] Fei Deng, Qifei Wang, Wei Wei, Matthias Grundmann, and Tingbo Hou. Prdp: Proximal reward difference prediction for large-scale reward finetuning of diffusion models. arXiv preprint arXiv:2402.08714, 2024.
  • Fan et al. [2024] Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Reinforcement learning for fine-tuning text-to-image diffusion models. Advances in Neural Information Processing Systems, 36, 2024.
  • Gal et al. [2022] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  • Gu et al. [2020] Jinjin Gu, Yujun Shen, and Bolei Zhou. Image processing using multi-code gan prior. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3012–3021, 2020.
  • Hao et al. [2023] Yaru Hao, Zewen Chi, Li Dong, and Furu Wei. Optimizing prompts for text-to-image generation. Advances in Neural Information Processing Systems, 36, 2023.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • Karras et al. [2022] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022.
  • Kim et al. [2022] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2426–2435, 2022.
  • Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Kirstain et al. [2024] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36, 2024.
  • Liu et al. [2023] Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, et al. Instaflow: One step is enough for high-quality diffusion-based text-to-image generation. In The Twelfth International Conference on Learning Representations, 2023.
  • Lu et al. [2022] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022.
  • Luo et al. [2023] Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023.
  • Mañas et al. [2024] Oscar Mañas, Pietro Astolfi, Melissa Hall, Candace Ross, Jack Urbanek, Adina Williams, Aishwarya Agrawal, Adriana Romero-Soriano, and Michal Drozdzal. Improving text-to-image consistency via automatic prompt optimization. arXiv preprint arXiv:2403.17804, 2024.
  • Meng et al. [2023] Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14297–14306, 2023.
  • Mokady et al. [2023] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6038–6047, 2023.
  • Pidhorskyi et al. [2020] Stanislav Pidhorskyi, Donald A Adjeroh, and Gianfranco Doretto. Adversarial latent autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14104–14113, 2020.
  • Poole et al. [2022] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
  • Richardson et al. [2021] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: a stylegan encoder for image-to-image translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2287–2296, 2021.
  • Roich et al. [2022] Daniel Roich, Ron Mokady, Amit H Bermano, and Daniel Cohen-Or. Pivotal tuning for latent-based editing of real images. ACM Transactions on graphics (TOG), 42(1):1–13, 2022.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • Salimans and Ho [2022] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.
  • Sauer et al. [2023] Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042, 2023.
  • Song et al. [2020a] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020a.
  • Song et al. [2020b] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b.
  • Song et al. [2023] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023.
  • Wallace et al. [2023] Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. arXiv preprint arXiv:2311.12908, 2023.
  • Xu et al. [2024] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36, 2024.
  • Yin et al. [2023] Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. arXiv preprint arXiv:2311.18828, 2023.
  • Zhang et al. [2023] Hui Zhang, Zuxuan Wu, Zhen Xing, Jie Shao, and Yu-Gang Jiang. Adadiff: Adaptive step selection for fast diffusion. arXiv preprint arXiv:2311.14768, 2023.
  • Zhang et al. [2024] Yuechen Zhang, Jinbo Xing, Eric Lo, and Jiaya Jia. Real-world image variation by aligning diffusion inversion chain. Advances in Neural Information Processing Systems, 36, 2024.
  • Zhu et al. [2020a] Jiapeng Zhu, Yujun Shen, Deli Zhao, and Bolei Zhou. In-domain gan inversion for real image editing. In European conference on computer vision, pages 592–608. Springer, 2020a.
  • Zhu et al. [2020b] Peihao Zhu, Rameen Abdal, Yipeng Qin, John Femiani, and Peter Wonka. Improved stylegan embedding: Where are the good latents? arXiv preprint arXiv:2012.09036, 2020b.