
Preserving Image Properties Through Initializations in Diffusion Models

Jeffrey Zhang
[email protected]
   Shao-Yu Chang
[email protected]
   Kedan Li
[email protected]
   David Forsyth
[email protected]
Abstract

Retail photography imposes specific requirements on images. For instance, images may need uniform background colors, consistent model poses, centered products, and consistent lighting. Minor deviations from these standards impact a site's aesthetic appeal, making the images unsuitable for use. We show that Stable Diffusion methods, as currently applied, do not respect these requirements. The usual practice of training the denoiser with a very noisy image and starting inference with a sample of pure noise leads to inconsistent generated images during inference. This inconsistency occurs because it is easy to tell the difference between samples from the training and inference distributions. As a result, a network trained on centered retail product images with uniform backgrounds generates images with erratic backgrounds. The problem is easily fixed by initializing inference with samples from an approximation of the noisy-image distribution. However, in using such an approximation, the joint distribution of text and noisy image at inference time still differs slightly from that at training time. This discrepancy is corrected by training the network with samples from the approximate noisy-image distribution. Extensive experiments on real application data show significant qualitative and quantitative improvements in performance from adopting these procedures. Finally, our procedure interacts well with other control-based methods to further enhance the controllability of diffusion-based methods.

1 Introduction

Stable Diffusion [9] can generate high-quality, lifelike images, and has opened up numerous innovative applications. Examples include creating new art, style transfer between pictures, and generating high-resolution images from text. However, many real applications require images to meet specific design requirements. For example, product images need consistent photographic standards so that retail websites maintain uniform aesthetic appeal and are "on-brand". Even minor deviations can make images unusable.

The essential requirements for our application mirror those of many other applications. Our application requires a text-to-image generator where: (1) outputs reflect a garment description text accurately; (2) outputs are either a garment image or a human model wearing the described garment; (3) garments are not cropped, are centered, and appear on a white background; (4) human models are always depicted from foot to neck, stand in similar poses, and appear on a neutral background; (5) outputs have consistent professional lighting and shading; and (6) one text-to-image model can produce all desired images and does not produce others. The first five requirements ensure images are “on-brand” and the last is for efficiency. Remarkably, Stable Diffusion [9] as currently practiced cannot meet these requirements, but quite simple changes result in a model that does.

We start from the observation that fine-tuning Stable Diffusion [9] with product images on neutral backgrounds does not produce a method that can generate product images on neutral backgrounds (Fig. 1). This unexpected effect is caused by a hiccup in the structure of the method. Current training methods [11, 5] form a weighted sum of noise and a base image (using a weight $\alpha$), and a model is trained to denoise the noisy image. At inference, one assumes that a sufficiently noisy base image is indistinguishable from noise and that the denoising process can be started with pure noise. We show the assumption is true only for $\alpha$ much smaller than those used in current practice, meaning that the denoiser sees noticeably different distributions at train and test times. Training Stable Diffusion [9] for very small $\alpha$ is also challenging (see Supplementary).

A simple alternative is to initialize with a draw from a distribution that represents the training distribution reasonably well and is easily sampled. While training, some information about the original image can be recovered from the initial noisy sample, which at best is a blurry version of the original image. This means that a heavily noised sample from a mixture-of-principal-components model is an acceptable approximation of the training distribution. We show that significant improvements result from using this as an initial distribution without retraining Stable Diffusion [9] – for instance, erratic backgrounds are replaced by neutral backgrounds (see Supplementary). But these improvements highlight another initialization problem: the joint distribution between text and noisy image at training is misrepresented by both standard noise initialization and our initialization. We show that because our initialization is easily sampled, it can be used in fine-tuning the denoiser, leading to notable improvements in text-based control. Finally, we show that our initialization techniques are easily integrated with other controllability methods (e.g. ControlNet [14]) to provide more effective control for diffusion-based methods.

2 Related work

There is considerable research available on generating specific concepts or subjects using diffusion-based methods. However, to the best of our knowledge, we are the first to concentrate on generating images from specific image distributions.

2.1 Object preservation and harmonization

There are works that learn specific objects and generate variations of those objects faithfully. Dreambooth [10] fine-tunes a pretrained diffusion model to accurately generate new variations of a particular subject. Gal et al. [4] invert objects into pseudo-words to attain personalized text embeddings to create images of those objects. Other works have enabled editing in the forward pass of diffusion models without image-specific fine-tuning or inversion. InstructPix2Pix [2] takes an input image and generates a new image based on text instructions. Yang et al. [13] propose an exemplar-based image editing model where the reference image is semantically transformed and harmonized into another image. Finally, Hu et al. [6] adapt language models to generate specific objects by adding trainable parameters to the language embeddings to learn new concepts from a dataset. This allows the adaptation of new words and concepts by fine-tuning the newly added parameters instead of the entire generation model.

These studies primarily focus on preserving target objects and generally struggle to control non-target areas if no conditions exist in those areas. Unlike these approaches, our paper emphasizes stabilizing the image distributions throughout the diffusion process. Our proposed method can effectively preserve the properties of the entire image, not just the target objects.

2.2 User-defined controllability

Many of the works mentioned above are text-guided, in which users provide a text prompt to control and edit images. However, language-guided manipulations often do not generate images satisfying users' requirements. CLIP features [8] leverage the representations of user-provided images to improve the diversity of the output results. Region-based image editing methods [3, 1] treat the task as a conditional inpainting task with a mask highlighting the regions of the images that need to be edited while preserving non-target areas. To enhance task-specific control of diffusion models, ControlNet [14] adds an additional input condition (e.g. edge maps, segmentation masks, keypoints, etc.) alongside text prompts to manipulate image generation.

While these methods allow users to control target areas in the images by adding additional conditions, they still often fail to maintain image distributions. This failure is due to inconsistencies in the initialization process during training and inference. In Sec. 4.4, we show that combining our method with ControlNet [14] can strengthen the control abilities of diffusion models and stabilize the output (Fig. 5).

2.3 Image-to-Image Translation

Finally, Stable Diffusion [9] is often applied to image-to-image translation applications by choosing a starting image and generating variations from that image initialization. The resulting images have significant similarities to the initial image's colors and contours. Other methods use DDIM Inversion [11] to find initial noise vectors that reconstruct the original image through the diffusion process, enabling image-to-image translation. In Tune-A-Video [12], the authors use DDIM Inversion to control the consistency of frames and the contours of objects. In Null-text Inversion [7], DDIM Inversion is used to create images similar in appearance to the original input, enabling users to edit specific words while preserving the objects of the original image.

In these works, reference images are used to generate variations of that reference image. In contrast, we demonstrate that a sample from an approximate noisy-image distribution significantly changes the behavior of diffusion-based methods, because the method then sees samples at inference time that resemble those from the training distribution. Furthermore, we show that using the right starting initialization during both training and inference is essential for consistently generating entire image distributions, not just for maintaining the features of a specific reference image.

3 Method

Current practice with Stable Diffusion [9] makes assumptions at inference time that appear inconsequential, but we show these assumptions have significant consequences. We demonstrate that resolving these errors at inference time yields substantial improvements.

3.1 Background

Stable Diffusion [9] is trained to recover an image $x_0$ by denoising a noisy image $x_t$ at timestep $t \in [0, T]$. At the $t$'th timestep, the denoiser is presented with

x_t = \sqrt{\alpha_t}\, x_0 + \sqrt{1 - \alpha_t}\,\epsilon_t \quad (1)

where $\epsilon_t \sim \mathcal{N}(0, I)$ and $\alpha_t$ is the cumulative product of the scaling at each timestep $t$ (refer to DDIM [11]). Following the training procedure from [11], a denoiser $f$ predicts the added noise $\hat{\epsilon}_t = f(x_t, t, e; \Theta)$, where $f$ is parameterized by $\Theta$ and takes in the noisy image $x_t$, timestep $t$, and conditional encoding $e$. We write $\hat{x}_0$ for the predicted ground truth image derived by removing the predicted noise $\hat{\epsilon}_t$ from $x_t$. Our loss is

\mathcal{L} = \mathbb{E}\left[\lVert \epsilon_t - \hat{\epsilon}_t \rVert_2^2\right] \quad (2)

From Eq. 1, if we have the predicted $\hat{\epsilon}_t$ and the noisy image $x_t$, we can derive the predicted ground truth image $\hat{x}_0$ with

\hat{x}_0 = (x_t - \sqrt{1 - \alpha_t}\,\hat{\epsilon}_t)/\sqrt{\alpha_t} \quad (3)
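For concreteness, here is a minimal PyTorch sketch of Eqs. 1-3. The schedule tensor `alphas` (cumulative products $\alpha_t$) and the denoiser interface `f(x_t, t, e)` are our assumptions standing in for the actual Stable Diffusion components.

```python
import torch
import torch.nn.functional as F

def forward_noise(x0, t, alphas, noise=None):
    # Eq. 1: x_t = sqrt(alpha_t) * x0 + sqrt(1 - alpha_t) * eps_t
    if noise is None:
        noise = torch.randn_like(x0)
    a_t = alphas[t]  # cumulative alpha at integer timestep t
    return a_t.sqrt() * x0 + (1.0 - a_t).sqrt() * noise, noise

def predict_x0(x_t, eps_hat, t, alphas):
    # Eq. 3: recover the predicted clean image from the predicted noise
    a_t = alphas[t]
    return (x_t - (1.0 - a_t).sqrt() * eps_hat) / a_t.sqrt()

def denoising_loss(f, x0, t, e, alphas):
    # Eq. 2: MSE between the sampled and predicted noise
    x_t, eps = forward_noise(x0, t, alphas)
    eps_hat = f(x_t, t, e)  # eps_hat = f(x_t, t, e; Theta)
    return F.mse_loss(eps_hat, eps)
```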
[Figure 1: three image panels]
(a) Garment + Model + Text Dataset
(b) DDIM Training + Inference
(c) DDIM Training + Control Inference ($x_{start} = x_0$)
Figure 1: Despite training on images with properties (1)-(5), normal diffusion-based training and inference lead to unexpected results. (a) shows sample sequences from our garment dataset. (b) shows that standard fine-tuning and inference with Stable Diffusion [9] do not generate the same distribution of images, despite training on images from (a). The prompts are taken from the training data, where we expect the best results. To show that this is not a training error, in (c) we set a control experiment by changing $x_{start}$ (Eq. 10) to the training image shown in (a). The generated images match the training distribution, indicating that initialization information strongly influences results. (F*: "female garment, no person, white background"; M*: "male garment, no person, white background"; F†: "female person wearing garment"; M†: "male person wearing garment")

3.2 Inference Assumption

During training, we have a ground truth image $x_0 \sim P(\text{images})$ and an initial noisy image $x_T \sim P_T$ (from Eq. 1). However, at inference time, we do not have the ground truth image $x_0$ and must supply an alternative initialization $x_{init}$. Following [11], it is usual to argue that for sufficiently small $\alpha_T$, $x_T$ should be very similar to $\mathcal{N}(0, I)$. Hence, during inference, a common approach is to sample our initialization $x_{init} \sim \mathcal{N}(0, I)$.

However, if $\alpha_T$ is not small enough, then $x_T$ has information about $x_0$ that the network could recover and utilize for denoising. Pure noise as initialization may not behave as expected for two reasons: (1) denoising networks are trained on noticeably different data distributions than what they see at inference time, and (2) denoising networks may extract information about $x_0$ from $x_T$ to denoise $x_T$. If this is true, reliable inference procedures might use something other than $\mathcal{N}(0, I)$.

We show that values of $\alpha_T$ in many pretrained models may indeed be too large. In Fig. 1, we compare the performance of a Stable Diffusion [9] model fine-tuned to make images of models and garments under two different initializations. In Fig. 1(b), we initialize $x_{init} \sim \mathcal{N}(0, I)$ during inference, and in Fig. 1(c), we initialize with $x_{init} = x_T$ during inference using Eq. 1, where $x_0$ is the ground truth image from Fig. 1(a). The different initializations produce significantly different results, but more importantly, notice that the network takes obvious hints from the ground truth image $x_0$ (in Fig. 1(c)).

For values of $\alpha_T$ that are not small enough, it is possible to reliably distinguish between samples from $P_T$ and $\mathcal{N}(0, I)$ using elementary methods. For $x_{init} \sim \mathcal{N}(0, I)$ to be hard to distinguish from $x_T \sim P_T$, we must have

\alpha_T < O\left(\frac{1}{d}\right), \quad (4)

where $d = H \times W \times C$ is the dimension of the sampled image. Meeting this constraint is difficult, as $1/d$ is significantly smaller than current values of $\alpha_T$. The key point is that spatial averages of images have strong properties and will be perceptible even at small $\alpha_T$.

Set $\mathbf{a} = \frac{1}{d}\mathbf{1}$, where $\mathbf{1}$ is the all-ones vector of size $d$. Then, we can write

\mathbb{E}_{P_T}[\mathbf{a} \cdot x_T] = \sqrt{\alpha_T}\,\mathbb{E}_{P(\text{images})}[\mathbf{a} \cdot x_0] = \sqrt{\alpha_T}\,\mu \neq 0, \quad (5)

where $\mu = \mathbb{E}_{P(\text{images})}[\mathbf{a} \cdot x_0]$ is the average spatial mean of images. Simple experiments show that $\mu$ is non-zero, and different sets of images can have different values of $\mu$ (e.g., white backgrounds, people in similar poses, etc.). From Eq. 5, $\mathbf{a} \cdot x_T$ has mean $\mu_T = \sqrt{\alpha_T}\,\mu$ and variance $\sigma_T^2 = \alpha_T \sigma^2 + \frac{1 - \alpha_T}{d}$. But for $x_{init} \sim \mathcal{N}(0, I)$, $\mathbf{a} \cdot x_{init}$ has mean 0 and variance $\sigma^2 = 1/d$.

Hence, the network is presented with samples from two distributions: $\mathcal{N}(\mu_T, \sigma_T^2)$ during training and $\mathcal{N}(0, \sigma^2)$ during inference. For the distributions to be difficult to distinguish, we want $\mu_T/\sigma$ and $\mu_T/\sigma_T$ to be small. We have

\left(\frac{\mu_T}{\sigma}\right)^2 = d\,\alpha_T\,\mu^2, \quad (6)

and

\left(\frac{\mu_T}{\sigma_T}\right)^2 = \frac{(\sqrt{\alpha_T}\,\mu)^2}{\alpha_T \sigma^2 + (1 - \alpha_T)\frac{1}{d}} = \frac{d\,\alpha_T\,\mu^2}{\alpha_T(\sigma^2 d - 1) + 1} \quad (7)

Consequently, both Eq. 6 and Eq. 7 are small if $\alpha_T$ is on the order of $1/d$ or smaller, giving us Eq. 4.

Practical numbers are $d = 64 \times 64 \times 4 = 16384$ and $\alpha_T = 0.0047$; but $0.0047 \nless 0.000061$, hence $\alpha_T$ is not in the right range (Eq. 4). This indicates that a denoiser network $f$ could tell the difference between the initialization samples used in training and those used at inference. If it can tell the difference, different behaviors between training and inference are possible. Fig. 1 demonstrates that the network actually behaves differently for samples from these different distributions.
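The arithmetic is easy to check with a few lines of numpy. Here $d$ and $\alpha_T$ are the practical numbers above, while $\mu$ and $\sigma$ are hypothetical placeholders; in practice they would be estimated from the training images.

```python
import numpy as np

d = 64 * 64 * 4          # latent dimension H x W x C
alpha_T = 0.0047         # final cumulative alpha of the schedule
mu = 0.5                 # assumed spatial average of training images (placeholder)
sigma = 0.1              # assumed std of a . x0 over training images (placeholder)

# Train-time statistics of a . x_T (Eq. 5 and following)
mu_T = np.sqrt(alpha_T) * mu
sigma_T2 = alpha_T * sigma**2 + (1.0 - alpha_T) / d

# Separation statistics from Eqs. 6 and 7; both must be << 1 for
# pure-noise initialization to be hard to distinguish from P_T.
ratio_inference = d * alpha_T * mu**2   # (mu_T / sigma)^2
ratio_train = mu_T**2 / sigma_T2        # (mu_T / sigma_T)^2

print(f"alpha_T = {alpha_T:.4f} vs 1/d = {1/d:.6f}")   # Eq. 4 check
print(f"(mu_T/sigma)^2 = {ratio_inference:.1f}, (mu_T/sigma_T)^2 = {ratio_train:.1f}")
```

With these placeholder values, both ratios are far above 1, so the two initialization distributions are easy to tell apart.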

One solution is to train with a smaller $\alpha_T$. However, training with a smaller $\alpha_T$ requires scaling down $\alpha_t$ for many timesteps, which leads to more difficult training as more noise is added (see Supplementary). We show that approximating $P(\text{images})$ offers a more efficient and reliable solution.

3.3 PCA-K Offset Inference

Our procedure “PCA-K Offset Inference” initializes inference with:

x_{init} = \sqrt{\alpha_T}\,x_{start} + \sqrt{1 - \alpha_T}\,\epsilon_T, \quad (8)

where $x_{start}$ is sampled from a distribution $Q$ that approximates $P(\text{images})$. $Q$ does not need to be a particularly strong approximation of $P(\text{images})$ because a large magnitude of noise is added in Eq. 8. Because this noise is i.i.d. Gaussian, we expect that the information the network can extract about $x_{start}$ from $x_{init}$ is a heavily smoothed version of $x_{start}$. Hence, we need a distribution model that is easy to sample, reasonably approximates blurry images, and can handle multiple classes.

PCA-K Offset Inference uses a mixture of normals for $Q$, where each normal is derived from Principal Component Analysis (PCA) of the images of its class $c$. PCA is known to be an effective description of blurred images. $K$ represents the number of principal components. For each class $c$ in our dataset, we model an image as

x_K^c = \mu^c + \sum_{i=1}^{K} \xi_i\,\mathbf{p}_i \quad (9)

where $\xi_i \sim \mathcal{N}(0, \lambda_i)$, $\mathbf{p}_i$ are orthonormal principal components with eigenvalues $\lambda_i$, and $\mu^c$ is the mean image of class $c$. We write $x_K^c$ for a random image drawn from a $K$-principal-component model of class $c$. Then, setting $x_{start} = x_K^c$, our initialization becomes:

x_{init} = \sqrt{\alpha_T}\,x_K^c + \sqrt{1 - \alpha_T}\,\epsilon_T \quad (10)

We call the very useful case where $K = 0$, so that $x_{start}$ is the class mean $\mu^c$, "Mean Offset Inference".
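A minimal numpy sketch of Eqs. 8-10 follows. The class statistics (`mu_c`, `P`, `lam`) are assumed to come from a PCA fit to the training images of class $c$; the function name is ours.

```python
import numpy as np

def pca_k_offset_init(mu_c, P, lam, alpha_T, rng):
    """Sample x_init per Eqs. 8-10 (a sketch).

    mu_c : (d,)   class mean image, flattened
    P    : (K, d) orthonormal principal components of class c
    lam  : (K,)   PCA eigenvalues (variance along each component)
    """
    K = P.shape[0]
    xi = rng.normal(0.0, np.sqrt(lam), size=K)  # xi_i ~ N(0, lambda_i)
    x_start = mu_c + xi @ P                     # Eq. 9; K = 0 gives Mean Offset
    eps = rng.normal(size=mu_c.shape)           # eps_T ~ N(0, I)
    return np.sqrt(alpha_T) * x_start + np.sqrt(1.0 - alpha_T) * eps  # Eq. 10

# Usage: x_init = pca_k_offset_init(mu_c, P, lam, 0.0047, np.random.default_rng(0))
```

With `K = 0` (empty `P` and `lam`), the sum in Eq. 9 vanishes and the sketch reduces to Mean Offset Inference.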

3.4 PCA-K Offset Training

While PCA-K Offset Inference allows the inference procedure to mimic the training procedure much more closely, we still expect operating conditions to differ. Our approximation may not exactly match the distribution used in training. In particular, our approximation may not preserve the delicate relationship between the ground truth image $x_0$ and the text encoding $e$ supplied during training. Fig. 1 shows improvements from using an approximate distribution at inference time with a model trained as usual, but further improvements are available. At training, the network sees a noisy version of an image and models a complex relationship between image and text encoding. But at inference, our network will be presented with a text encoding $e$ and a sample $x_{start}$ from the initial distribution, which has not been conditioned on $e$. We cannot guarantee that an approximate distribution preserves those relationships. We can, however, use the same approximate distribution during training so that the network experiences the same distribution at train time and test time. This is a significant advantage of a $Q$ that is easy to sample.

We use $x_{init}$ from Eq. 8 in place of the initialization $x_T$ in Eq. 1 to resemble the desired start point for image $x_0$. This allows the network to train and infer from the same distribution for the first step of the diffusion process:

x_T^{new} = \sqrt{\alpha_T}\,x_{start} + \sqrt{1 - \alpha_T}\,\epsilon_T \quad (11)

First, we want the denoiser to recover the ground truth image $x_0$ from $x_T^{new}$. So, we alter the noise objective to

\epsilon_T^{new} = (x_T^{new} - x_0\sqrt{\alpha_T})/\sqrt{1 - \alpha_T} \quad (12)

Second, we want to skip timesteps (for computational efficiency, as in DDIM [11]), so this change must be applied to multiple timesteps. Let $S$ be the number of timesteps skipped per inference step. We apply the offset training to all timesteps within the first skip step, guaranteeing that the first inference step is trained with the offset initialization. Hence, we alter Eq. 1 and Eq. 12 for $t \geq T - S$ to

x_t^{new} = \sqrt{\alpha_t}\,x_{start} + \sqrt{1 - \alpha_t}\,\epsilon_t \quad (13)

and

\epsilon_t^{new} = (x_t^{new} - x_0\sqrt{\alpha_t})/\sqrt{1 - \alpha_t} \quad (14)

Thus, our final loss is a combination of Eq. 2 and the new noise objective from Eq. 14:

\mathcal{L}_{new} = \begin{cases} \mathbb{E}[\lVert\epsilon_t - \hat{\epsilon}_t\rVert_2^2] & \text{if } t < T - S \\ \mathbb{E}[\lVert\epsilon_t^{new} - \hat{\epsilon}_t\rVert_2^2] & \text{if } t \geq T - S \end{cases} \quad (15)

During training, we project a ground truth image $x_0$ of class $c$ into $x_K^c$ and set $x_{start} = x_K^c$ from Eq. 9, giving our proposed "PCA-K Offset Training" procedure. Setting $K = 0$ gives $x_K^c = \mu^c$, which simply initializes $x_{start} = \mu^c$ and is our "Mean Offset Training" procedure. We find Mean Offset Training works best compared to higher $K$ values and is much simpler in practice (see Supplementary). As a result, we use Mean Offset Training for results in the main text for the sake of simplicity.
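The change to training is small. Below is a hedged PyTorch sketch of the modified noise target (Eqs. 11-15), assuming the same `alphas` schedule and denoiser interface `f(x_t, t, e)` as in the earlier sketch; for Mean Offset Training, `x_start` is the class mean, and for PCA-K it is the PCA-K projection of `x0`.

```python
import torch
import torch.nn.functional as F

def offset_training_loss(f, x0, x_start, t, e, alphas, T, S):
    """One training step with the offset noise target (a sketch of Eqs. 11-15).
    x_start is mu^c (Mean Offset) or the PCA-K projection x_K^c of x0."""
    a_t = alphas[t]
    eps = torch.randn_like(x0)
    if t >= T - S:
        # Within the first skip step: noise the approximate start point (Eq. 13)
        x_t = a_t.sqrt() * x_start + (1.0 - a_t).sqrt() * eps
        # Supervise with the noise that maps x_t back to the ground truth (Eq. 14)
        eps_target = (x_t - a_t.sqrt() * x0) / (1.0 - a_t).sqrt()
    else:
        # Standard objective (Eqs. 1 and 2)
        x_t = a_t.sqrt() * x0 + (1.0 - a_t).sqrt() * eps
        eps_target = eps
    return F.mse_loss(f(x_t, t, e), eps_target)  # Eq. 15
```

Note that for $t \geq T - S$ only the regression target changes; the denoiser architecture and the rest of the training loop are untouched.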

4 Experiments and Results

In these experiments, we use data collected from retailers that follow properties specified in Sec. 1 (details in Sec. 4.1). We show PCA-K Offset Inference behaves better because the initialization is similar to the training distribution in the initial denoising timesteps (Sec. 4.2). We show sampling a bad initialization from PCA-K can damage the relationship between text and initialization during inference because the operating conditions are still different (Sec. 4.2 and Fig. 3). We show that incorporating our PCA-K Offset Training procedure fixes this issue (Sec. 4.3 and Fig. 4). Finally, we show other control methods experience the same initialization problem, and our procedures can be easily combined with other methods to provide further control in generation (Sec. 4.4).

4.1 Dataset

We collect over a million triplets of retailer garment image, garment-on-model image, and garment text description (Fig. 1(a)). Each text prompt corresponds to a garment image and a model wearing that garment. All training triplets satisfy properties (1)-(5) described in Sec. 1. Our task is to generate images of garments and of fashion models wearing garments that satisfy all properties (1)-(6). To distinguish between generating garments and models wearing garments, we prepend the caption with "male/female garment, no person, white background" (denoted M* and F*, respectively) for generating garments and "male/female person wearing garment" (denoted M† and F†, respectively) for generating human models.

For inference, we collect 24 freeform text descriptions of garments by asking fashion designers to write diverse garment descriptions (see Supplementary). These are fictional garment descriptions used to test the generalizability of our text-to-image model.

4.2 DDIM Finetuning with PCA-K Offset Inference

[Figure 2]
Figure 2: Intermediate outputs for an $S=20$ DDIM training process are visualized to show that the first step of the diffusion process is out of distribution for standard DDIM training + inference. The top two rows show the intermediate outputs when initializing with noise (DDIM inference). The bottom two rows show the intermediate outputs when projecting a gray sweater with PCA-3 Offset Inference. Rows 1 and 3 show the noisy image $x_t$, and rows 2 and 4 show the predicted $\hat{x}_0$ at each timestep $t$. Row 2 shows that the first predicted $\hat{x}_0$ introduces a dark, non-uniform background that propagates throughout the process, whereas in row 4, the predicted $\hat{x}_0$ is already close to the desired distribution, making the diffusion process much more stable.
[Figure 3]
Figure 3: Applying our PCA-0 and PCA-3 Offset Inference to DDIM Training can significantly improve generation of the desired image properties, but results are strongly biased by the sampled initialization $x_{start}$, leading to some undesirable artifacts. Rows 1 and 2 show PCA-0 occasionally generates non-white backgrounds for garments due to faint sleeves in the mean image, violating property (3). In rows 3-6, generated images are strongly influenced by the color and shape of $x_{start}$: "...black cotton flared pants..." are generated as white, "...tailored straight leg trousers..." are generated as shorts, etc. This violates property (1), as the generated images do not respect the text, and further indicates that $x_{start}$ strongly influences the denoiser. Descriptions are freeform text from fashion designers.

To set a baseline, we fine-tune sd-v1.5 [9] on our dataset and show results from different initializations in Fig. 1. We fine-tune for 50,000 steps with a batch size of 16 and a learning rate of 1e-5. We use a total of $T = 1000$ timesteps for training and skip $S = 50$ timesteps per step for inference (i.e., 20 total inference steps). We denote the standard training and inference procedures from [11] as DDIM training and DDIM inference, respectively.

Different inference initializations have qualitatively different effects. Fig. 1(b) shows that if we initialize with noise (DDIM inference), none of the generated images respect properties (1)-(6), despite the model being fine-tuned on images with those properties. These images are not curated; generated images hardly ever show isolated garments or models. Furthermore, the text prompts used for inference were taken from the training data, where we expect the best behavior. To show that this is not a training bug, we set $x_{start}$ in Eq. 8 to the ground truth dataset images shown in Fig. 1(a). The fine-tuned network can now generate images that satisfy all our desired image properties (Fig. 1(c)). This indicates that $x_{start}$ in the initialization heavily impacts the denoising process, and that the assumption that $\mathcal{N}(0, I)$ is close enough to $P_T$ has clear consequences during inference (using the actual ground truth is not the point here; see below and Fig. 3).

Visualizing the mechanism by displaying intermediate timesteps of generation helps to better understand the impact of different initializations. In Fig. 2, we compare the results of random noise initialization (DDIM training + inference) and of initializing with a gray garment from our dataset projected onto 3 PCA components (PCA-3 Offset Training + Inference), on freeform text from fashion designers. During the first few steps of the diffusion process, a random noise input leads to a predicted $x_0$ that is very different from our desired image distribution. This mistake is not corrected in later timesteps and accumulates throughout the denoising process, because the denoiser is not trained to denoise an image with a non-white background. If we initialize with a PCA-projected garment image with a white background, then the training data distribution is maintained in all intermediate steps.
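A Fig. 2-style visualization can be reproduced by recording the predicted $\hat{x}_0$ (Eq. 3) at every inference step. Here is a sketch of a deterministic DDIM loop with snapshots, again assuming the `f` and `alphas` interfaces used in the earlier sketches.

```python
import torch

@torch.no_grad()
def ddim_sample_with_snapshots(f, x_init, e, alphas, timesteps):
    """Deterministic DDIM sampling that records the predicted x0 (Eq. 3)
    at each step. `timesteps` is the skipped schedule, e.g.
    [999, 949, ..., 49] for T=1000 and S=50."""
    x_t = x_init
    snapshots = []
    for i, t in enumerate(timesteps):
        eps_hat = f(x_t, t, e)
        a_t = alphas[t]
        x0_hat = (x_t - (1.0 - a_t).sqrt() * eps_hat) / a_t.sqrt()
        snapshots.append(x0_hat)  # the row-2 / row-4 intermediates in Fig. 2
        # Deterministic DDIM update to the next (earlier) timestep;
        # the cumulative alpha tends to 1 at the end of sampling.
        a_prev = alphas[timesteps[i + 1]] if i + 1 < len(timesteps) else torch.tensor(1.0)
        x_t = a_prev.sqrt() * x0_hat + (1.0 - a_prev).sqrt() * eps_hat
    return x_t, snapshots
```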

PCA-K Offset Inference fixes distribution issues, but initializations strongly affect text control in generations. We apply PCA-K Offset Inference to the fine-tuned model and show in Fig. 3 that image distribution problems are mitigated, but the starting point can bias the type of images generated. Notice the initialization alters the shape and color of the generated garment because the network $f$ is trained to take hints from this initialization. The lighter initialization in row 3 generates more light colors, while row 4 shows much darker and boxier generations that adhere to the color and shape of the initialization. Notice that the generated garments do not respect the text ("black cotton flared pants..." are generated as white in the third column, "...straight leg trousers..." are generated as shorts in the sixth column, etc.), thus failing property (1). We try to apply a more neutral initialization by setting $x_{start}$ to the garment and model means. However, Fig. 3 row 1 shows artifacts in the background due to the faint sleeve of the mean garment image.

The denoiser clearly takes hints from the initialization when generating images. This further substantiates that $x_0$ carries tremendous weight during training. The denoiser heavily relies on cues from $x_0$ to denoise the image because $x_0$ strongly correlates with the text encoding $e$. Unfortunately, $x_{start}$ is sampled from a distribution $Q$ that is independent of the text prompt. As a result, text control during denoising worsens when we sample from $Q$.

[Figure 4: two panels]
(a) DDIM Training + Mean Offset Inference
(b) Mean Offset Training + Inference
Figure 4: Using Mean Offset Training and Mean Offset Inference provides better text control because the relationship between initialization and text is preserved during training and inference. We apply two class-mean initializations for garments and models and intentionally swap the means to test the effect of different initializations during inference. Panel (a) shows garment and model results from DDIM Training + Mean Offset Inference that violate various properties. Panel (b) shows that Mean Offset Training + Inference results satisfy all desired properties. Red boxes highlight generation errors in (a), and green boxes show they are fixed in (b). Red solid borders show artifacts that should not exist and generations that do not fully respect the text ((a) fails property (1)). Red dashed borders show generated models instead of the garments specified by the text, with a person not in the proper pose ((a) fails properties (1) and (4)). Red dotted borders show non-white backgrounds or cropped garments/models ((a) fails property (3)).

4.3 PCA-K Offset Training

By incorporating PCA-K Offset Training, we alter the training procedure to have consistent initialization and text pairings during training. We test Mean Offset Training (PCA-0) with average-garment and average-model initializations. Training hyperparameters are identical to the fine-tuned model in Sec. 4.2. Results for PCA-K ($K > 0$) are shown in the Supplementary.

Qualitatively, Mean Offset Training generates images that respect the text better than standard training and satisfy all desired properties (1)-(6). Fig. 4 shows the difference between DDIM Training + Mean Offset Inference (Fig. 4(a)) and Mean Offset Training + Inference (Fig. 4(b)). While Mean Offset Inference alone significantly helps generate our desired image properties, it occasionally produces artifacts (not pure white backgrounds), does not accurately follow the text (e.g., generates sleeves when there should not be sleeves), and crops the garments and models. Incorporating the mean offset initialization in training resolves these issues and generates our desired image distribution (see results with all 24 text prompts in the Supplementary).

[Figure 5]
Figure 5: We adapt ControlNet [14] to take a garment condition and generate models wearing garments. We display three seeds for the same control to show that vanilla ControlNet (DDIM Training + Inference) consistently produces out-of-distribution results (violating properties (4) and (5)), whereas ControlNet with Mean Offset Training + Inference (Ours) perfectly preserves the desired training distribution.

Quantitatively, Mean Offset Training + Inference generates more accurate images, as demonstrated by the CLIP similarities [8] between generated garments and text prompts in Table 1. Notice the 10.6% improvement from DDIM Training + DDIM Inference to Mean Offset Training + Inference. While CLIP similarity is not a perfect measure of similarity, the improvement is significant. The learned text-to-initialization relationship is better preserved because the same initialization distribution is used during training and inference.

Method | CLIP similarity (↑) | Score scaled by GT (↑)
Ground Truth Garments | 0.294 | 1.0
DDIM Training + DDIM Inference | 0.2542 | 0.865
DDIM Training + Mean Offset Inference (Ours) | 0.2761 | 0.939
Mean Offset Training + Inference (Ours) | 0.2812 | 0.956
Table 1: We use CLIP similarity [8] (higher is better) between images and text to show our methods generate images that respect text better (property (1)). We compare with the ground truth images from our dataset to set a baseline. Notice our Mean Offset Inference easily outperforms standard DDIM training and inference. Furthermore, incorporating Mean Offset Training further improves performance, indicating that a better text-to-generation relationship is preserved (for property (1)).

4.4 Application to ControlNet

We apply our method to ControlNet [14] for the different task of virtual try-on using our dataset. We show that image distribution issues persist in this new task due to noise initialization during inference, but can be fixed with our PCA-K Offset Training + Inference. For this task, we are given a garment as control and train a denoiser to generate a realistic person wearing that garment. To train, we mask the garment region of a person in our dataset and adapt ControlNet [14] to take the masked garment image as the condition to generate the remaining image. The left column of Fig. 5 shows the effect of training and running ControlNet without any modification to the initialization procedure: background and lighting properties are not preserved (properties (4) and (5)). The right column of Fig. 5 shows that applying our Mean Offset Training + Inference preserves the desired generated image properties.

5 Discussion

We believe our approximate initialization distribution has broad applications, not just limited to fashion retail images. Situations where images follow structural requirements could benefit from our training and inference procedure. Additionally, because the sampled initialization can bias the inference (Fig. 3), we intend to investigate using CCA to build relationships between PCA-K initializations and text features.

6 Conclusion

Our research indicates that existing training and inference procedures for diffusion-based methods are problematic and cannot preserve certain image distributions. We show that the assumption of employing random noise as the starting point can significantly affect the way images are generated. More importantly, we show that the current training procedure is strongly biased by its initialization, but that this can be mitigated by adopting our PCA-K Offset Training + Inference. Finally, we demonstrate that our work is orthogonal to other manipulation methods, such as ControlNet [14], and can be combined with them to enable greater control of diffusion-based image generation.

References

  • [1] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18208–18218, June 2022.
  • [2] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to follow image editing instructions. arXiv preprint, November 2022.
  • [3] Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. DiffEdit: Diffusion-based semantic image editing with mask guidance. arXiv preprint, October 2022.
  • [4] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint, 2022.
  • [5] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, pages 6840–6851, 2020.
  • [6] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
  • [7] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. arXiv preprint arXiv:2211.09794, 2022.
  • [8] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint, April 2022.
  • [9] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. arXiv preprint, 2021.
  • [10] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine-tuning text-to-image diffusion models for subject-driven generation. arXiv preprint, 2022.
  • [11] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, October 2020.
  • [12] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation. arXiv preprint arXiv:2212.11565, 2022.
  • [13] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by Example: Exemplar-based image editing with diffusion models. arXiv preprint, November 2022.
  • [14] Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint, 2023.