
Divide and Compose with Score Based Generative Models

Sandesh Ghimire
[email protected]
   Armand Comas
[email protected]
   Davin Hill
[email protected]
   Aria Masoomi
[email protected]
   Octavia Camps*
[email protected]
Jennifer Dy*
[email protected]
Department of Electrical and Computer Engineering, Northeastern University
Boston, MA, USA
Abstract

While score based generative models, or diffusion models, have found success in image synthesis, they are often coupled with text data or image labels to enable manipulation and conditional generation of images. Even though manipulating images by changing the text prompt is possible, our understanding of the text embedding and our ability to modify it to edit images is quite limited. Towards having more control over image manipulation and conditional generation, we propose to learn image components in an unsupervised manner so that we can compose those components to generate and manipulate images in an informed manner. Taking inspiration from energy based models, we interpret different score components as the gradients of different energy functions. We show how score based learning allows us to learn interesting components, and we visualize them through generation. We also show how this novel decomposition allows us to compose, generate and modify images in interesting ways akin to dreaming. We make our code available at https://github.com/sandeshgh/Score-based-disentanglement

1 Introduction

Diffusion based [11] or score based [28] generative models are a new class of generative models based on the idea of reversing the image corruption process to generate realistic images from noise. These approaches have recently become quite successful, not only in synthesizing realistic and diverse images [8], but also in obtaining better data likelihood [17]. Numerous works have applied score based generative models to text-based image generation [24, 23], inpainting [24], editing [20], etc. Recently developed diffusion models like DALL-E [23] and Latent Diffusion [24] have been reported to generate realistic and diverse images with striking imaginative ability.

Most works that conditionally generate images using diffusion/score models train in a supervised manner, conditioned either on actual class labels or on embeddings of paired text [23]. Supervised conditional generation can be guided using the gradient of a pretrained classifier [8] obtained from supervised learning, or can be classifier free [12]. Building upon text based conditional generation, some works have also tried to edit or manipulate images [1]. While these methods show that learning a conditional score model with labels is quite effective, there is one fundamental problem with present conditional generation: we do not have control over what the model generates. Suppose we generate an image based on text. The image looks okay, but it's not quite what we want. How can we change it to match our expectations? Do we have control? Do we have an interpretable understanding of the conditioning? No!

This topic is not unexplored in the context of traditional generative models like VAEs and GANs. In fact, it has been extensively explored as disentanglement [19, 4] in the autoencoding setup and as GAN inversion [3, 6] in GANs. Several works on GAN inversion [31] try to find the latent representation corresponding to an image and then manipulate that representation in the latent space to edit the image. In the case of score based models, however, there is little understanding of the latent factors behind image generation. Models that use text based conditional generation are opaque, and our only way to manipulate an image is through another text prompt.

To bridge this gap, we are interested in learning interpretable factors in score based generative models, which could later be used to manipulate and edit images as in GAN inversion. The first plausible step towards learning such factors is an autoencoding type approach, where we first learn different image components and then recompose the image out of those components. Unfortunately, the theoretical formulation of such a diffusion autoencoder is currently unclear. Our first contribution is to cement the theoretical foundation of the diffusion autoencoder through a likelihood based formulation (see section 4.1). From the implementation perspective, we found the diffusion autoencoder (DiffAE) implementation due to Preechakul et al. [22] to work well and have built upon it.

While we agree with DiffAE [22] on the autoencoding setup, we take a very different approach to autoencoding by decomposing an image into different score components. We believe this is arguably better suited to score models, since score functions are their main building blocks. Therefore, we would like to decompose an image into different score components and try to understand their contribution to image generation. We take inspiration from energy based models [32, 10, 9]. Imagine that the probability density of an image is given by a product of exponential distributions of the form:

p_0(x) = \prod_i p_0^{(i)}(x) = \frac{e^{-\sum_i E^{(i)}(x)}}{Z}   (1)

where the E^{(i)}(x) are energy functions and Z is the normalization constant, also known as the partition function. Taking the logarithm of eq.(1), we see that modeling the score as a summation of several components can be interpreted as learning different energy components, i.e. s = \nabla_x \log p_0(x) = \sum_i \nabla_x(-E^{(i)}(x)). Based on this intuition, we should be able to decompose the score function into different components and still train the score based generative model. We ask: could we decompose an image into several interpretable components in an unsupervised way and recombine them to generate new images, akin to dreaming? In this paper, we follow this abstract idea to generate interesting images by dividing an image into components and recomposing them.
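To make this intuition concrete, the following is a minimal sketch (our own illustration, not the paper's implementation) of a score defined as the sum of negative energy gradients; the MLP energy networks are hypothetical placeholders.

import torch
import torch.nn as nn

class EnergySumScore(nn.Module):
    """Score s(x) = -grad_x sum_i E_i(x) for a product-of-experts density."""
    def __init__(self, num_components=3, dim=10):
        super().__init__()
        # Each E_i : R^dim -> R is a small MLP standing in for an energy function.
        self.energies = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 64), nn.SiLU(), nn.Linear(64, 1))
            for _ in range(num_components))

    def forward(self, x):
        x = x.requires_grad_(True)
        total_energy = sum(E(x).sum() for E in self.energies)
        # Autograd gives grad_x sum_i E_i(x); the score is its negative.
        (grad,) = torch.autograd.grad(total_energy, x, create_graph=True)
        return -grad

s = EnergySumScore()(torch.randn(8, 10))  # (8, 10) tensor of score estimates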

We perform several experiments to illustrate the score components learned in unsupervised manner and what we can achieve through their composition and manipulation. To interpret the factors captured by score components we generate images from individual components. Some components capture human interpretable attributes like shape and color, while others are not as they capture complex texture/features in images. We also modify images by manipulating individual components and interpolating them with unconditional score which results in interesting manipulation of images and diverse generation. We discuss how our experiments elicit new perspective on interpretability and disentanglement of images.

2 Related Works

Diffusion generative models based on the denoising idea were first proposed by Sohl-Dickstein et al. [25] and Ho et al. [11]. From a different perspective, Song et al. [27] showed that we can generate images by estimating the score, i.e. the gradient of the data log-likelihood. These two perspectives were later reconciled by Song et al. [28], who showed that the forward diffusion and the reverse generative model are both continuous-time stochastic processes governed by stochastic differential equations. This work unifies the diffusion perspective with the score based perspective. Score based generative models utilize the denoising and implicit score matching ideas [15, 29] to develop a computationally cheap way to estimate the score function at different time instants.

Score based generative model research has seen several new directions. Some works have tried to decrease the generation time with fast differential equation solvers [18], while others have tried to analytically estimate the reverse time variance to improve image quality [2]. Others have tried to improve the log likelihood of score based generative models [17]. Some theoretical works have derived the loss function from a likelihood optimization perspective [5, 26, 13]. Other theoretical works have solved the underlying stochastic differential equation by solving the Schrödinger Bridge problem [7].

Score based models have found several applications, such as text based image generation [23, 24], image editing [20] and adversarial purification [21, 33].

3 Background

3.1 Denoising Diffusion Probabilistic Model

Sohl-Dickstein et al. [25] and Ho et al. [11] proposed to design a generative model, called the denoising diffusion probabilistic model (DDPM), from a Bayesian perspective. Imagine we sample an image from the data distribution, x_0 \sim p_0. Consider the data corruption sequence where we incrementally add Gaussian noise to the image until it turns into pure noise. This forms a Markov chain, and the joint distribution of the forward process is given by p_0(x_0)\prod_{t=1}^{T} p_{t-1,t}(x_t \,|\, x_{t-1}). Then, a reverse Markov chain is considered where p_\theta(x_{t-1}|x_t) is conditionally Gaussian, so that the reverse process joint distribution is given by p_\theta(x_T)\prod_{t=T}^{1} p_\theta(x_{t-1}|x_t). The DDPM algorithm optimizes the evidence lower bound (ELBO) of the data likelihood such that, when optimization is complete, the reverse joint distribution coincides with the forward joint distribution. Since the reverse conditional distributions are parameterized by the error function \epsilon_\theta, the algorithm essentially boils down to optimizing \epsilon_\theta. One key trick proposed in DDPM is that the complex loss obtained from the ELBO can be neatly expressed as an extremely simple loss function:

\mathcal{L}_\theta^{ddpm} = \mathbb{E}_{t, x_0, \epsilon}\big\{\lambda_t \|\epsilon_\theta(x_t, t) - \epsilon\|^2\big\}   (2)

where x_t = \sqrt{\alpha_t} x_0 + \sqrt{1-\alpha_t}\,\epsilon is a sample from p_t(x|x_0) = \mathcal{N}(x; \sqrt{\alpha_t} x_0, (1-\alpha_t)I), the marginal distribution at time t, and \epsilon is a random vector from an isotropic Gaussian distribution. Note that as we incrementally add noise to x_0, the marginal distribution p_t at time t can be expressed as a Gaussian distribution conditioned on x_0: the mean at time t has been scaled by a factor of \sqrt{\alpha_t} and the variance is 1-\alpha_t. The \lambda_t in eq.(2) is a function of the \alpha_t's and of the noise added at each time instant. Once we have the optimized network \epsilon_\theta, the image generation process is nothing but following the reverse conditional distribution p_\theta(x_{t-1}|x_t) starting from isotropic Gaussian noise p_\theta(x_T).
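As a concrete illustration, here is a minimal sketch of the loss in eq.(2), assuming a generic noise-prediction network eps_theta(x_t, t) and a precomputed tensor alphas of \alpha_t values (both hypothetical names):

import torch

def ddpm_loss(eps_theta, x0, alphas, lambdas=None):
    # Sample a timestep t, then x_t ~ p_t(x | x0) via the reparameterization
    # x_t = sqrt(alpha_t) x0 + sqrt(1 - alpha_t) eps.
    b = x0.shape[0]
    t = torch.randint(0, len(alphas), (b,), device=x0.device)
    a = alphas[t].view(b, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps
    w = 1.0 if lambdas is None else lambdas[t].view_as(a)  # lambda_t weighting
    return (w * (eps_theta(x_t, t) - eps) ** 2).mean()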

3.2 Score Based Generative Model

Song et al. [27] proposed a different generative model based on estimating the gradient of the data distribution, called the score function, whose sampling procedure looked similar to that of DDPM. Even though this score based generative model seemed similar to the diffusion model DDPM, the connection was unclear until another paper by Song et al. [28] showed that there is a deeper connection between the two: the error network \epsilon_\theta in DDPM is the same as the score function s_\theta in the score based generative model. They establish that DDPM essentially performs score matching [15, 29]. They further develop this line of argument, showing that the continuous version of the score based generative model can be obtained by reversing a stochastic differential equation (SDE). Specifically, they generalize the forward process of adding noise to a continuous setting where it takes the form of an SDE. Similarly, the reverse process of starting from isotropic Gaussian noise and incrementally removing noise can also be shown to be a stochastic differential equation, continuous in time:

FOR: dx = f(x,t)\,dt + g_t\,dw   (3)
REV: dx = [f(x,t) - g_t^2 \nabla_x \log p_t(x)]\,dt + g_t\,d\bar{w}   (4)

Luckily, the reverse stochastic differential equation only needs the score function at each time t, in addition to the functions f, g from the forward equation. Using the same score matching idea, the score function can first be trained with the following loss function:

\mathcal{L}_\theta = \mathbb{E}_t\big\{\lambda_t \mathbb{E}_{x_0} \mathbb{E}_{x_t|x_0}[\|s_\theta(x_t,t) - \nabla_{x_t}\log p_{0,t}(x_t|x_0)\|^2]\big\}   (5)

Note that this loss is the same as eq.(2) once we use the fact that p_t is conditionally Gaussian. Once the score function is learnt, the generative model is given by substituting s = \nabla_x \log p in the reverse time stochastic differential equation:

dx = [f(x,t) - g_t^2 s_\theta(x,t)]\,dt + g_t\,d\bar{w}   (6)

In practice, we need to discretize eq. (6) to obtain a generative model. We achieve this through Euler-Maruyama discretization as suggested in [28]:

x(t - \delta t) = x(t) + [f(x,t) - g_t^2 s_\theta(x,t)](-\delta t) + g_t\sqrt{\delta t}\,z   (7)

where z \sim \mathcal{N}(0, I) is a random sample from the standard Gaussian distribution.
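A minimal sketch of this sampler follows, assuming the drift f, diffusion coefficient g and trained score network s_theta are given (the function names are ours, not from a specific library):

import torch

@torch.no_grad()
def euler_maruyama_sample(s_theta, f, g, shape, n_steps=1000, T=1.0, device="cpu"):
    dt = T / n_steps
    x = torch.randn(shape, device=device)              # start from x(T) ~ N(0, I)
    for i in range(n_steps, 0, -1):
        t = torch.full((shape[0],), i * dt, device=device)
        gt = g(t).view(-1, *([1] * (len(shape) - 1)))  # g_t, broadcast over dims
        drift = f(x, t) - gt ** 2 * s_theta(x, t)      # reverse-SDE drift, eq.(4)
        z = torch.randn_like(x) if i > 1 else torch.zeros_like(x)  # no noise at last step
        x = x - drift * dt + gt * dt ** 0.5 * z        # one Euler-Maruyama step, eq.(7)
    return x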

3.3 Conditional Score based Models

Following unconditional generative models, a few conditional generative models have been developed. Most of these models, however, require some form of supervision regarding the attribute on which to condition the generation. For example, Dhariwal and Nichol [8] developed conditional generative models based on class labels by modifying the score function with an extra term representing the gradient of the log classifier likelihood. These methods are known as classifier guided conditional generative models. Later, classifier-free models [12] were proposed, which eschew the idea of training a classifier altogether; nevertheless, they also need class labels to train.

In the direction of unsupervised learning, the diffusion autoencoder (DiffAE) [22] learns a conditional distribution based on a latent vector obtained from an encoder, which is in line with our work. Our work differs from theirs in the fundamental notion of what constitutes a component. In our model, we encode different score components from an image, where each component roughly captures one concept (energy function). We then generate images by recombining these components.

4 Method

Existing score based generative models train a common unconditional score function s_\theta(x) such that the reverse SDE yields samples from the whole distribution p_0(x). We are interested in learning the different energy components of each image. Before we can decompose each image into energy components, we need to be able to design a conditional score function that can reverse the forward diffusion process and converge to a single image rather than the whole distribution. More precisely, we want to modulate the score function such that the reverse SDE yields the Dirac delta distribution concentrated on a single image, say x_\zeta. It is unclear how to do this from existing works. Note that the loss function in eq.(5) is an expectation over all images x_0 sampled from the data distribution p_0, and thus s_\theta is common to all data. To design a data-specific score function such that the reverse SDE converges to a Dirac delta distribution around an image, we adopt the log likelihood formulation of score based models [13, 5, 26].

4.1 Likelihood Based Formulation

We can derive a likelihood formulation for training the score based generative model based on the Feynman-Kac theorem [16], as shown in [13, 5, 26]. We start from the variational lower bound on the data likelihood:

\log p(x_0) \geq \mathcal{L}_{VLB}(x_0, \theta) = \mathbb{E}_{p_T}[\log p_\theta(x_T)] - \int_0^T \mathbb{E}_{p_t}\Big[\frac{1}{2} g_t^2 \|s_\theta(x_t,t)\|^2 + \nabla_x \cdot (g_t^2 s_\theta(x_t,t) - f)\Big]\,dt   (8)

Taking the expectation over the data distribution, we arrive at:

\mathcal{L}_{EVLB}(\theta) = \mathbb{E}_{x_0 \sim p_0}[\mathcal{L}_{VLB}(x_0, \theta)]   (9)

From eq.(9), we can derive the same loss function as in eq.(5) by using a rough approximation of the integral with a discrete summation and the equivalence between different score matching objectives [29, 15] (as shown in [13]). Therefore, unconditional score estimation can be thought of as a crude approximation of expected likelihood maximization. Nevertheless, expressing the likelihood as the integral in eq.(8) is much more illuminating and powerful. Observe that eq.(9) is obtained by taking the expectation of the likelihood of individual data points in eq.(8). It is this expectation that leads to learning a common unconditional score function. We can forgo the expectation to train score functions for each data point. For any data point x_\zeta, we can directly optimize its likelihood by training a score function s_{\theta,\zeta} to optimize the lower bound on the right hand side of eq.(8). Specifically, we train an encoder Enc and a score function s as follows:

\zeta = \texttt{Enc}_\theta(x_\zeta)   (10)

\log p(x_\zeta) \geq \mathbb{E}_{p_T}[\log p_\theta(x_T) \,|\, x_0 = x_\zeta] - \int_0^T \mathbb{E}_{p_t}\Big[\frac{1}{2} g_t^2 \|s_{\theta,\zeta}\|^2 + \nabla_x \cdot (g_t^2 s_{\theta,\zeta} - f) \,\Big|\, x_0 = x_\zeta\Big]\,dt   (11)

Eq.(10) and eq.(11) complete our autoencoder model, which maximizes the lower bound on the log likelihood of an individual image x_\zeta.

4.2 Score Decomposition

As motivated in the introduction (eq.(1)), we want to decompose each score function into multiple components, which intuitively represent different energy components (negative gradients of the energies, to be precise). We decompose each score function into K components such that s_{\theta,\zeta} = (s_{\theta,\zeta}^{(1)} + s_{\theta,\zeta}^{(2)} + \ldots + s_{\theta,\zeta}^{(K)})/K. Naively, this decomposition requires K different score functions. A computationally efficient alternative is to share the weights of the score functions but use different latent vectors for different components, which also gives structure to the latent space. Therefore, we decompose the score function as:

s_{\theta,\zeta} = (s_{\theta,\zeta^{(1)}} + s_{\theta,\zeta^{(2)}} + \ldots + s_{\theta,\zeta^{(K)}})/K   (12)

Note that the burden of learning different energy components has now been shifted to the latent vectors, together with a shared conditional score function s_{\theta,\zeta^{(k)}}.
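In code, the decomposition in eq.(12) amounts to evaluating one shared conditional score network with K different latent vectors; a minimal sketch with a hypothetical score_net(x_t, t, zeta) interface follows:

def decomposed_score(score_net, x_t, t, zetas):
    # zetas: list of K latent vectors zeta^(1), ..., zeta^(K) for one image.
    # The network weights are shared; only the conditioning vector differs.
    components = [score_net(x_t, t, z) for z in zetas]
    return sum(components) / len(components)  # eq.(12)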

4.3 Model and Training

The encoder encodes each image x_\zeta into K latent vectors \zeta^{(1)}, \ldots, \zeta^{(K)}. The averaged score function given by eq.(12) is used to maximize the lower bound on the log likelihood of each x_\zeta. We approximate the integral with a discrete summation and invoke the equivalence of implicit score matching [15] and denoising score matching [29, 28] to obtain an approximation [13] in the form of the following loss function:

\mathcal{L}_{\theta,\zeta} = \mathbb{E}_t\big\{\lambda_t \mathbb{E}_{x_t|x_\zeta}[\|s_{\theta,\zeta}(x_t,t) - \nabla_{x_t}\log p_{0,t}(x_t|x_\zeta)\|^2]\big\}   (13)

We jointly train the K encoders and the score network with the loss in eq.(13), where the score is given by eq.(12). To make the score function a function of the latent vector \zeta, we use the adaptive group normalization (AdaGN) strategy described in [8]. Once the model is trained, we generate images by first sampling noise z from the standard Gaussian distribution and iteratively applying the Euler-Maruyama discretization of the SDE described by eq.(7).
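For concreteness, here is a minimal sketch of one training step under these choices, assuming hypothetical encoder and score_net modules and a VP-style perturbation kernel with coefficient schedule alpha_bar(t) (our notation, not the authors' exact code):

import torch

def training_step(encoder, score_net, x, alpha_bar, opt, K=3):
    b = x.shape[0]
    zetas = encoder(x).chunk(K, dim=-1)                # K latent vectors per image
    t = torch.rand(b, device=x.device)                 # t ~ Uniform(0, 1)
    a = alpha_bar(t).view(b, 1, 1, 1)
    eps = torch.randn_like(x)
    x_t = a.sqrt() * x + (1 - a).sqrt() * eps          # sample from p_{0,t}(x_t | x)
    target = -eps / (1 - a).sqrt()                     # grad_{x_t} log p_{0,t}(x_t | x)
    s = sum(score_net(x_t, t, z) for z in zetas) / K   # decomposed score, eq.(12)
    loss = ((1 - a) * (s - target) ** 2).mean()        # eq.(13), assuming lambda_t = 1 - alpha_bar_t
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()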

We also experiment with generating samples by combining score components with an unconditional score function, similar to [12]. For this, one could train conditional and unconditional score functions separately. However, to keep computation cheap, we pass a vector of ones 1 as the latent code to realize the unconditional score function s_u = s_{\theta,\textbf{1}}, similar to the trick used in [12]. We combine conditional and unconditional score functions through linear combinations whose coefficients are described in the experiments.
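A minimal sketch of this trick (again with an assumed score_net interface) is:

import torch

def mixed_score(score_net, x_t, t, zeta, alpha):
    s_cond = score_net(x_t, t, zeta)
    s_uncond = score_net(x_t, t, torch.ones_like(zeta))  # s_u = s_{theta, 1}
    return alpha * s_cond + (1 - alpha) * s_uncond       # linear combination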

5 Results and Discussion

Figure 1: Using all score components results in faithful reconstruction, but with interesting variations

5.1 Experimental Details

Figure 2: Visualizing different score components through generation as a way to interpret components. Images are generated by solving the reverse SDE using a single score component. Each row is a component and each column is a different dataset.

The score network is a modified version of the UNet architecture described in Dhariwal et al. [8]. To condition on the latent vector obtained from the encoder, we use AdaGN as described in [8], which is inspired by adaptive instance normalization (AdaIN) [14] but uses group normalization [30] instead of instance normalization. Our architecture is similar to DiffAE's adaptation of AdaGN, i.e.

\texttt{AdaGN}(h, t, \zeta) = \zeta_s(t_s \cdot \texttt{GroupNorm}(h) + t_b)   (14)

where h is the feature map at a given layer, t_s and t_b are obtained from the time embedding by applying an MLP, and \zeta_s is an affine transformation of the latent vector \zeta.
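A minimal sketch of such an AdaGN layer (layer sizes and embedding dimensions are our assumptions for illustration) could look like:

import torch
import torch.nn as nn

class AdaGN(nn.Module):
    def __init__(self, channels, t_dim, z_dim, groups=32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels)
        self.t_mlp = nn.Linear(t_dim, 2 * channels)  # time embedding -> (t_s, t_b)
        self.z_affine = nn.Linear(z_dim, channels)   # latent vector -> zeta_s

    def forward(self, h, t_emb, zeta):
        t_s, t_b = self.t_mlp(t_emb).chunk(2, dim=-1)
        z_s = self.z_affine(zeta)
        # Broadcast the (B, C) scales/shifts over the spatial dims of h (B, C, H, W).
        t_s, t_b, z_s = (v[:, :, None, None] for v in (t_s, t_b, z_s))
        return z_s * (t_s * self.norm(h) + t_b)      # eq.(14)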

We borrow the encoder from DiffAE [22], i.e. the first half of the UNet architecture used in the score function. We experiment with two values of K, K \in \{3, 5\}. The encoder produces K vectors of length 128, which condition the score functions.

Due to computing constraints, we train for 150K iterations using the Adam optimizer and a batch size of 32. We experiment on four computer vision datasets: 1) Celeb-A, 2) LSUN-outdoor_church, 3) Cifar-10 and 4) SVHN. We use an image size of 32\times 32 for Cifar-10 and SVHN and 48\times 48 for LSUN and Celeb-A, again due to resource constraints.

5.2 Natural variation in reconstruction

In Fig.1, we show autoencoding of samples from the four datasets. We input a batch of 16 identical images. The figure shows that the reconstructions are very close to the ground truth but, at the same time, exhibit natural variations.

5.3 Visualizing different components through generation

Figure 3: First row shows the different components. Second row shows the effect of slowly diluting the first component with the unconditional score function. Decreasing \alpha means the weight of the first component decreases and that of the unconditional score function increases.

Visualizing the different score components can give insight into what each component captures from the image. It is unclear what the best way is to visualize score components, which are essentially represented as matrices. We demonstrate that generating image samples from each component individually can be an effective visualization strategy. To be precise, if s_{\theta,\zeta} = (s_{\theta,\zeta^{(1)}} + s_{\theta,\zeta^{(2)}} + s_{\theta,\zeta^{(3)}})/3, then we take each component, say s_{\theta,\zeta^{(k)}}, and generate images using the Euler-Maruyama discretization in eq.(7). The samples generated from the three components are plotted in Fig.2, where each row is a component. The input images for these components are the same as in Fig.1. Note that even though the input image and the latent vectors are the same, there are natural variations in the samples from each component.

Several observations are in order. First, this method results in human interpretable components in some cases, but not always. For example, the third component in the SVHN image clearly captures digits, while the first and second components capture abstract texture or lighting related information. Similarly, the second component of the bird clearly focuses on color, while the first captures abstract shape information.

This score decomposition forces us to rethink the definition of the components of an image. In classical works, disentanglement has been seen as a statistical property of the distribution, and methods were proposed to enforce such statistical patterns, for example independence. Here, however, we see that components can capture interesting patterns that may or may not be independent or interpretable. We can certainly say that the three components are different, yet each captures something about the input image. Therefore, they are different disentangled components, even though they may or may not possess statistical independence or complete human interpretability.

Figure 4: The first component is diluted with the unconditional score function.
Figure 5: The second and third components are diluted with the unconditional score function in the second row.

5.4 Manipulation of components

We manipulate images using these learned components with the help of an unconditional score function. In Fig.3, we keep two components untouched while linearly interpolating the remaining component with the unconditional score function to generate images. That is, if s^{(1)}, s^{(2)}, s^{(3)} are the three components, we generate images with the following score function:

s_\alpha^{interp} = (\alpha s^{(1)} + s^{(2)} + s^{(3)})/3 + (1-\alpha) s_u/3   (15)
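In code, this interpolation is a one-liner over precomputed component scores (a sketch under the same assumed interface as above):

def interpolated_score(s1, s2, s3, s_u, alpha):
    # Dilute the first component toward the unconditional score s_u, eq.(15).
    return (alpha * s1 + s2 + s3) / 3 + (1 - alpha) * s_u / 3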

In Fig.3, we observe that the first component is associated with the background rather than the church building itself. When this component is interpolated, information like the yellow color (related to the second component) and the building architecture (related to the third component) is preserved. As we go from \alpha=1 to \alpha=0.1, the image changes while preserving the yellow color of the church, with varying building architecture, but the trees and background change, making the samples more diverse. At \alpha=0.1, much of the tree cover has been removed and replaced with something else, creating diverse background settings.

In Fig.4, we observe a similar effect. The first component is changed in the second row while the second and third components are retained. From the first row, it is clear that the first component is associated with the smile. Hence, as we go from \alpha=1 to \alpha=0.1, the samples become more diverse in terms of smile; at \alpha=0.1, we even see a few instances where the mouth is closed.

Similarly, in Fig.5, we change the second and third components while retaining the first. Here too, the first component is associated with the smile, while the second and third are associated with shape features and hair. We fix the first component and interpolate the second and third components as follows:

s_\alpha^{inter} = \frac{s^{(1)} + \alpha(s^{(2)} + s^{(3)})}{3} + \frac{2(1-\alpha) s_u}{3}   (16)

In doing so, as we go from \alpha=1 to \alpha=0.1, the smile is preserved while other features change. Hence, we see images with considerable diversity in shape, color and hair, while the smile is preserved, as we approach \alpha=0.1.

5.5 Tuning the weight of components

In Fig.6, we tune the relative weights of the two score components s^{(1)} and s^{(2)}. When s^{(2)} is weighted heavily, the greenish hue and the corresponding shape of the church dominate. As we weight s^{(1)} more heavily, we start to see more of a mix between s^{(1)} and s^{(3)}.

5.6 Varying the number of components

Figure 6: Second row: Changing the relative weight of first and second components while keeping the third component constant.
Figure 7: Visualizing three and five components on the Cifar-10 dataset through image generation from components.

Fig.7 compares five components against three components learned on the Cifar-10 dataset. Components 4, 5 and 1 of the five-component model are similar to the three components in the first row. Component 3 in the second row, however, is different and seems to capture a colored pattern with a shape in the middle.

6 Conclusion

Based on insights from energy based models, we proposed to learn multiple score components while training a score based generative model. These components are interpretable and give us more control to manipulate and edit images. In experiments, we discussed their interpretability and demonstrated our ability to edit and generate images through component manipulation. These score components also open new avenues for interpretability in score based generative models.

There is room for improvement. Our ability to manipulate images is limited by the number of score components. Scaling this method to a larger number of components without computational overhead is an important future direction. Guiding components towards interpretations more amenable to humans is another promising direction for future research.


References

  • [1] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208–18218, 2022.
  • [2] Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models. arXiv preprint arXiv:2201.06503, 2022.
  • [3] David Bau, Hendrik Strobelt, William Peebles, Jonas Wulff, Bolei Zhou, Jun-Yan Zhu, and Antonio Torralba. Semantic photo manipulation with a generative image prior. arXiv preprint arXiv:2005.07727, 2020.
  • [4] Ricky TQ Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating sources of disentanglement in variational autoencoders. Advances in neural information processing systems, 31, 2018.
  • [5] Tianrong Chen, Guan-Horng Liu, and Evangelos Theodorou. Likelihood training of schrödinger bridge using forward-backward SDEs theory. In International Conference on Learning Representations, 2022.
  • [6] Edo Collins, Raja Bala, Bob Price, and Sabine Susstrunk. Editing in style: Uncovering the local semantics of gans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5771–5780, 2020.
  • [7] Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion schrödinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34:17695–17709, 2021.
  • [8] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 8780–8794. Curran Associates, Inc., 2021.
  • [9] Yilun Du, Shuang Li, Yash Sharma, Josh Tenenbaum, and Igor Mordatch. Unsupervised learning of compositional energy concepts. Advances in Neural Information Processing Systems, 34:15608–15620, 2021.
  • [10] Ruiqi Gao, Erik Nijkamp, Diederik P Kingma, Zhen Xu, Andrew M Dai, and Ying Nian Wu. Flow contrastive estimation of energy-based models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7518–7528, 2020.
  • [11] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, 2020.
  • [12] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  • [13] Chin-Wei Huang, Jae Hyun Lim, and Aaron C Courville. A variational perspective on diffusion-based generative models and score matching. Advances in Neural Information Processing Systems, 34:22863–22876, 2021.
  • [14] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE international conference on computer vision, pages 1501–1510, 2017.
  • [15] Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.
  • [16] Ioannis Karatzas and Steven E. Shreve. Brownian motion and stochastic calculus, volume 113. Springer Science & Business Media, 1991.
  • [17] Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in neural information processing systems, 34:21696–21707, 2021.
  • [18] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022.
  • [19] Emile Mathieu, Tom Rainforth, Nana Siddharth, and Yee Whye Teh. Disentangling disentanglement in variational autoencoders. In International Conference on Machine Learning, pages 4402–4412. PMLR, 2019.
  • [20] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022.
  • [21] Weili Nie, Brandon Guo, Yujia Huang, Chaowei Xiao, Arash Vahdat, and Anima Anandkumar. Diffusion models for adversarial purification. In International Conference on Machine Learning. PMLR, 2022.
  • [22] Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwajanakorn. Diffusion autoencoders: Toward a meaningful and decodable representation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • [23] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  • [24] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  • [25] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
  • [26] Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. Advances in Neural Information Processing Systems, 34:1415–1428, 2021.
  • [27] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
  • [28] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
  • [29] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
  • [30] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference on computer vision (ECCV), pages 3–19, 2018.
  • [31] Weihao Xia, Yulun Zhang, Yujiu Yang, Jing-Hao Xue, Bolei Zhou, and Ming-Hsuan Yang. Gan inversion: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • [32] Jianwen Xie, Yang Lu, Song-Chun Zhu, and Yingnian Wu. A theory of generative convnet. In International Conference on Machine Learning, pages 2635–2644. PMLR, 2016.
  • [33] Jongmin Yoon, Sung Ju Hwang, and Juho Lee. Adversarial purification with score-based generative models. In International Conference on Machine Learning, pages 12062–12072. PMLR, 2021.