Divide and Compose with Score Based Generative Models
Abstract
While score based generative models, or diffusion models, have found success in image synthesis, they are often coupled with text data or image labels in order to manipulate and conditionally generate images. Even though images can be manipulated by changing the text prompt, our understanding of the text embedding and our ability to modify it to edit images are quite limited. To gain more control over image manipulation and conditional generation, we propose to learn image components in an unsupervised manner so that we can compose those components to generate and manipulate images in an informed manner. Taking inspiration from energy based models, we interpret different score components as the gradients of different energy functions. We show how score based learning allows us to learn interesting components, which we visualize through generation. We also show how this novel decomposition allows us to compose, generate and modify images in interesting ways akin to dreaming. We make our code available at https://github.com/sandeshgh/Score-based-disentanglement
1 Introduction
Diffusion based [11] or score based [28] generative models are a new class of generative models based on the idea of reversing the image corruption process to generate realistic images from noise. These approaches have recently become quite successful, not only in synthesizing realistic and diverse images [8], but also in obtaining better data likelihood [17]. Numerous works have applied score based generative models in text-based image generation [24, 23], inpainting [24], editing [20] etc. Recently developed models like DALL-E[23] and Latent Diffusion [24] using diffusion models have been reported to generate realistic and diverse images with wild imagination ability.
Most works that conditionally generate images using diffusion/score models are trained in a supervised manner, conditioned either on actual class labels or on embeddings of paired text [23]. Supervised conditional generation can either be guided by the gradient of a pretrained classifier [8] obtained from supervised learning or be classifier-free [12]. Building upon text based conditional generation, some works have also tried to edit or manipulate images [1]. While these methods show that, with labels, learning a conditional score model is quite effective, there is one fundamental problem with present conditional generation: we do not have control over what the model generates. Suppose we generate an image based on text. The image looks okay, but it's not quite what we want. How can we change it to match our expectations? Do we have control? Do we have an interpretable understanding of the conditioning? No!
This topic is not unexplored in the context of traditional generative models, like VAEs and GANs. In fact, they have been extensively explored as disentanglement [19, 4] in the autoencoding setup and GAN inversion [3, 6] in GANs. Several works in GAN inversion [31] try to find latent representations corresponding to an image and then manipulate the representation in the latent space to edit and manipulate the image. In case of score based models, however, there is little understanding of the latent factors in terms of generation of images. Models that use text based conditional generations are opaque and our only way to manipulate the image is through the generation of another text.
To bridge this gap, we are interested in learning interpretable factors in score based generative models, which could later be used to manipulate and edit images as in the case of GAN inversion. The first plausible step towards learning such factors is an autoencoding type approach where we first learn different image components and then recompose the image out of those components. Unfortunately, the theoretical formulation of such a diffusion autoencoder is currently unclear. Our first contribution is to cement the theoretical foundation of the diffusion autoencoder through a likelihood based formulation (see section 4.1). From the implementation perspective, we found the diffusion autoencoder (DiffAE) implementation due to Preechakul et al. [22] to work well and have built upon it.
While we agree with DiffAE [22] on the autoencoding setup, we take a very different approach to autoencoding by decomposing an image into different score components. We believe this is arguably better suited for the score models since the score functions are the main building components of score models. Therefore, we would like to decompose an image into different score components and try to understand their contribution in image generation. We take inspiration from the energy based models [32, 10, 9]. Imagine that the probability density of image is given by product of exponential distributions of the form:
$$p(x) = \frac{1}{Z}\prod_{k=1}^{K}\exp\big(-E_k(x)\big) \qquad (1)$$
where $E_k(x)$ represents energy functions and $Z$ is the normalization constant, also known as the partition function. Taking the logarithm of eq.(1) and differentiating with respect to $x$, we can conclude that modeling the score as a summation of several components can be interpreted as learning different energy components, i.e. $\nabla_x \log p(x) = -\sum_{k=1}^{K}\nabla_x E_k(x)$; the term $\log Z$ does not depend on $x$ and drops out. Based on this intuition, we imagine that we should be able to decompose the score function into different components and train the score based generative model. We ask, could we decompose an image into several interpretable components in an unsupervised way and recombine them to generate new images, akin to dreaming? In this paper, we follow this abstract idea to generate interesting images by dividing an image and recomposing its components.
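The link between eq.(1) and additive scores can be checked numerically. The following minimal Python sketch assumes two hypothetical one-dimensional quadratic energies (chosen only for illustration, not taken from the paper) and compares the sum of their analytic scores against a finite-difference gradient of the log of the unnormalized product density.

```python
def energy_a(x):
    # Hypothetical energy: quadratic well centred at -1 (unit variance).
    return 0.5 * (x + 1.0) ** 2


def energy_b(x):
    # Hypothetical energy: narrower quadratic well centred at 2 (variance 0.25).
    return 0.5 * (x - 2.0) ** 2 / 0.25


def score_a(x):
    # Negative gradient of energy_a.
    return -(x + 1.0)


def score_b(x):
    # Negative gradient of energy_b.
    return -(x - 2.0) / 0.25


def log_unnormalized_density(x):
    # log of exp(-E_a(x)) * exp(-E_b(x)); the constant log Z is dropped
    # because it does not depend on x and vanishes under the gradient.
    return -(energy_a(x) + energy_b(x))


def finite_difference_score(x, h=1e-5):
    # Central-difference estimate of d/dx log p(x).
    return (log_unnormalized_density(x + h) - log_unnormalized_density(x - h)) / (2.0 * h)


x = 0.3
composed_score = score_a(x) + score_b(x)
numerical_score = finite_difference_score(x)
print(composed_score, numerical_score)
```

The two printed numbers agree up to finite-difference error, which is the one-dimensional version of the statement that a product of energy-based factors has a score equal to the sum of the component scores.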
We perform several experiments to illustrate the score components learned in unsupervised manner and what we can achieve through their composition and manipulation. To interpret the factors captured by score components we generate images from individual components. Some components capture human interpretable attributes like shape and color, while others are not as they capture complex texture/features in images. We also modify images by manipulating individual components and interpolating them with unconditional score which results in interesting manipulation of images and diverse generation. We discuss how our experiments elicit new perspective on interpretability and disentanglement of images.
2 Related Works
Diffusion generative models based on the denoising idea were first proposed by Ho et al. [11] and Sohl-Dickstein et al. [25]. From a different perspective, Song et al. [27] showed that we can generate images by estimating the score, i.e. the gradient of the data log-likelihood. These two perspectives were later reconciled by Song et al. [28]. They showed that the forward diffusion and reverse generative models are both stochastic processes in continuous time governed by stochastic differential equations. This work unifies the diffusion perspective with the score based perspective. The score based generative model utilizes the denoising and implicit score matching ideas [15, 29] to develop a computationally cheap way to estimate the score function at different time instants.
Score based generative model research has seen several new directions. Some works have tried to decrease the generation time with fast differential equation solvers [18], while others have tried to analytically estimate the reverse time variance to improve image quality [2]. Others have tried to improve the log likelihood of score based generative models [17]. Some theoretical works have derived the loss function from a likelihood optimization perspective [5, 26, 13]. Other theoretical works have solved the stochastic differential equation by solving the Schrödinger bridge problem [7].
3 Background
3.1 Denoising Diffusion Probabilistic Model
Sohl-Dickstein et al. [25] and Ho et al. [11] proposed a generative model, called the denoising diffusion probabilistic model (DDPM), from a Bayesian perspective. Imagine we sample an image from the data distribution, $x_0 \sim p_{data}(x_0)$. Consider the data corruption sequence $x_1, \dots, x_T$ where we incrementally add Gaussian noise to the image until it turns into complete noise. This forms a Markov chain and the joint distribution of the forward series is given by $q(x_{0:T}) = p_{data}(x_0)\prod_{t=1}^{T} q(x_t \mid x_{t-1})$. Then, a reverse Markov chain is considered where $p_\theta(x_{t-1} \mid x_t)$ is conditionally Gaussian such that the reverse process joint distribution is given by $p_\theta(x_{0:T}) = p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$. The DDPM algorithm optimizes the evidence lower bound (ELBO) of the data likelihood such that when optimization is complete, the reverse joint distribution coincides with the forward joint distribution. Since the reverse conditional distributions are parameterized by the error functions $\epsilon_\theta(x_t, t)$, the algorithm, essentially, boils down to optimizing $\epsilon_\theta$. One key trick proposed in DDPM is that the complex loss obtained from the ELBO can be neatly expressed as an extremely simple loss function as follows:
$$L_{simple}(\theta) = \mathbb{E}_{t,\, x_t,\, \epsilon}\Big[\big\|\epsilon - \epsilon_\theta(x_t, t)\big\|^2\Big] \qquad (2)$$
where $x_t$ is a sample from the distribution $q(x_t)$, the marginal distribution at time $t$, and $\epsilon$ is a random vector from an isotropic Gaussian distribution. Note that as we incrementally add noise to $x_0$, the marginal distribution at time $t$ can be expressed as a Gaussian distribution conditioned on $x_0$, i.e. $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ with $\bar\alpha_t = \prod_{s=1}^{t}(1-\beta_s)$, where $\beta_s$ is the variance of the noise added at step $s$. The mean at time $t$ has been diluted by a factor of $\sqrt{\bar\alpha_t}$ and the variance is $1-\bar\alpha_t$. Also, the $\epsilon$ in eq.(2) is a function of the $\beta_s$ and the noise added at each time instant. Once we have the optimized network, $\epsilon_\theta$, the image generation process is nothing but following the reverse conditional distribution starting from the isotropic Gaussian noise $x_T$.
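To make the closed-form forward marginal concrete, here is a small Python sketch. It assumes a linear noise schedule and a three-pixel toy "image"; the schedule endpoints and all function names are illustrative choices, not values taken from this paper. It noises the image in a single step and evaluates the squared-error loss of eq.(2) for an oracle that inverts the closed form.

```python
import math
import random


def cumulative_alphas(betas):
    # alpha_bar_t = prod_{s<=t} (1 - beta_s): fraction of signal variance kept at step t.
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_image(x0, eps, alpha_bar):
    # Closed-form forward marginal: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps.
    return [math.sqrt(alpha_bar) * xi + math.sqrt(1.0 - alpha_bar) * ei
            for xi, ei in zip(x0, eps)]


def simple_loss(eps, eps_pred):
    # Squared error between the true noise and the predicted noise, as in eq.(2).
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]  # assumed linear schedule
alpha_bars = cumulative_alphas(betas)

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0]                      # toy "image" with three pixels
eps = [rng.gauss(0.0, 1.0) for _ in x0]
t = 500
xt = noise_image(x0, eps, alpha_bars[t])

# An oracle that inverts the closed form recovers eps, so its loss is numerically zero.
oracle_eps = [(xi_t - math.sqrt(alpha_bars[t]) * xi) / math.sqrt(1.0 - alpha_bars[t])
              for xi_t, xi in zip(xt, x0)]
print(alpha_bars[0], alpha_bars[-1], simple_loss(eps, oracle_eps))
```

The cumulative product starts close to one and decays towards zero, which is why the final sample of the forward chain is nearly pure Gaussian noise.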
3.2 Score Based Generative Model
Song et al. [27] proposed a different generative model based on estimating the gradient of the log data density, called the score function, whose sampling looked similar to that of DDPM. Even though the score based generative model seemed similar to the diffusion model, DDPM, the connection was unclear until another paper due to Song et al. [28] showed that there is a deeper connection between the two: the error network in DDPM, $\epsilon_\theta$, is the same as the score function in the score based generative model up to a time-dependent scaling. They establish that DDPM essentially performs score matching [15, 29]. They further develop this line of argument, showing that the continuous version of the score based generative model can be obtained by reversing a stochastic differential equation (SDE). Specifically, they generalize the forward process of adding noise to a continuous setting where the forward process takes the form of an SDE, eq.(3). Similarly, the reverse process of starting from the isotropic Gaussian and incrementally removing noise can also be shown to be a stochastic differential equation continuous in time, eq.(4):
$$dx = f(x, t)\,dt + g(t)\,dw \qquad (3)$$
$$dx = \big[f(x, t) - g(t)^2\,\nabla_x \log p_t(x)\big]\,dt + g(t)\,d\bar{w} \qquad (4)$$
where $f(x, t)$ and $g(t)$ are the drift and diffusion coefficients, $w$ and $\bar{w}$ are the forward-time and reverse-time Wiener processes, and $p_t(x)$ is the marginal density at time $t$.
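Luckily, the reverse stochastic differential equation only needs the score function $\nabla_x \log p_t(x)$ at each time $t$ in addition to other functions from the forward equation. Using the same score matching idea, the score function can be first trained with the following loss function
$$\min_\theta\; \mathbb{E}_{t}\Big[\lambda(t)\,\mathbb{E}_{x_0}\,\mathbb{E}_{x_t \mid x_0}\big\|s_\theta(x_t, t) - \nabla_{x_t}\log p_{0t}(x_t \mid x_0)\big\|^2\Big] \qquad (5)$$
where $\lambda(t)$ is a positive weighting function and $p_{0t}(x_t \mid x_0)$ is the transition kernel from time $0$ to time $t$. Note that this loss is the same as eq.(2), up to the weighting $\lambda(t)$, once we use the fact that $p_{0t}(x_t \mid x_0)$ is conditionally Gaussian. Once the score function is learnt, the generative model is given by replacing $\nabla_x \log p_t(x)$ with $s_\theta(x, t)$ in the reverse time stochastic differential equation:
$$dx = \big[f(x, t) - g(t)^2\,s_\theta(x, t)\big]\,dt + g(t)\,d\bar{w} \qquad (6)$$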
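In practice, we need to discretize eq. (6) to obtain a generative model. We achieve this through Euler-Maruyama discretization as suggested in [28]:
$$x_{t-\Delta t} = x_t - \big[f(x_t, t) - g(t)^2\,s_\theta(x_t, t)\big]\,\Delta t + g(t)\sqrt{\Delta t}\;\xi_t \qquad (7)$$
where $\Delta t > 0$ is the step size and $\xi_t$ is a random sample from the standard Gaussian distribution.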
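The Euler-Maruyama update of eq.(7) can be sketched in a few lines of Python. This is only a toy under stated assumptions: the data distribution is a hypothetical one-dimensional Gaussian, the forward process is a variance-preserving SDE with a constant rate `BETA`, and the trained network is replaced by the exact score, which is available in closed form for this case. None of these choices come from the paper's experiments.

```python
import math
import random

BETA = 10.0                      # assumed constant noise rate of a variance-preserving SDE
T = 1.0                          # terminal time
DATA_MEAN, DATA_STD = 2.0, 0.5   # hypothetical 1-D Gaussian "data" distribution


def drift(x, t):
    # f(x, t) of the forward SDE dx = -0.5 * BETA * x dt + sqrt(BETA) dw.
    return -0.5 * BETA * x


def diffusion(t):
    # g(t) of the forward SDE.
    return math.sqrt(BETA)


def exact_score(x, t):
    # For Gaussian data the marginal p_t is Gaussian, so its score is known in closed form.
    decay = math.exp(-BETA * t)
    mean_t = DATA_MEAN * math.sqrt(decay)
    var_t = DATA_STD ** 2 * decay + (1.0 - decay)
    return -(x - mean_t) / var_t


def reverse_sample(rng, n_steps=400):
    # Euler-Maruyama discretization of the reverse-time SDE, from t = T down to t = 0.
    dt = T / n_steps
    x = rng.gauss(0.0, 1.0)      # start from the standard Gaussian prior
    for i in range(n_steps):
        t = T - i * dt
        g = diffusion(t)
        x = (x - (drift(x, t) - g * g * exact_score(x, t)) * dt
             + g * math.sqrt(dt) * rng.gauss(0.0, 1.0))
    return x


rng = random.Random(0)
samples = [reverse_sample(rng) for _ in range(1000)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
print(sample_mean, sample_std)
```

With the exact score, the sample mean and standard deviation land close to those of the assumed data distribution; with a learned score the same loop applies, with `exact_score` replaced by the network.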
3.3 Conditional Score based Models
Following the unconditional generative models, a few conditional generative models have been developed. Most of these models, however, require some form of supervision regarding the group on which to condition the generation. For example, Dhariwal et al. [8] developed conditional generative models based on class labels by modifying the score function with an extra term representing the gradient of the log of the classifier likelihood. These methods are known as classifier guided conditional generative models. Later, classifier-free models [12] were also proposed, which eschew the idea of training a classifier altogether. Nevertheless, they also need class labels to train.
Towards the direction of unsupervised learning, the diffusion autoencoder (DiffAE) [22] learns the conditional distribution based on a latent vector obtained from the encoder, which is in line with our work. Our work differs from theirs in the fundamental notion of what represents the component. In our model, we encode different score components from an image, where each component roughly captures one concept (energy function). Later we generate images by recombining these components.
4 Method
Existing score based generative models train a common unconditional score function $s_\theta(x, t)$ such that the reverse SDE yields samples from the whole distribution $p_{data}(x)$. We are interested in learning different energy components in each image. Before we can decompose different energy components in each image, we need to be able to design a conditional score function which can reverse the forward diffusion process and converge to a single image rather than the whole distribution. More precisely, we want to modulate the score function such that the reverse SDE yields the Dirac delta distribution concentrated on a single image, say $x_0$. It is unclear how to do that from the existing works. Note that the loss function in eq.(5) is an expectation across all images sampled from the data distribution $p_{data}$, and thus is common to all data. To design a data-specific score function such that the reverse SDE converges to a Dirac delta distribution around the image, we adopt the log likelihood formulation of score based models [13, 5, 26].
4.1 Likelihood Based Formulation
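We can derive a likelihood formulation to train the score based generative model based on the Feynman-Kac theorem [16], as shown in [13, 5, 26]. We start from the likelihood formulation of training a score based generative model, which lower bounds the log likelihood of a single data point $x_0$:
$$\log p_0(x_0) \geq \mathbb{E}\big[\log p_T(x_T) \mid x_0\big] - \frac{1}{2}\int_0^T \mathbb{E}\Big[g(t)^2\big\|s_\theta(x_t, t)\big\|^2 + 2\,\nabla_{x_t}\cdot\big(g(t)^2 s_\theta(x_t, t) - f(x_t, t)\big) \,\Big|\, x_0\Big]\,dt \qquad (8)$$
Taking expectation over $x_0 \sim p_{data}$, we arrive at,
$$\mathbb{E}_{x_0}\big[\log p_0(x_0)\big] \geq \mathbb{E}_{x_T}\big[\log p_T(x_T)\big] - \frac{1}{2}\int_0^T \mathbb{E}_{x_t}\Big[g(t)^2\big\|s_\theta(x_t, t)\big\|^2 + 2\,\nabla_{x_t}\cdot\big(g(t)^2 s_\theta(x_t, t) - f(x_t, t)\big)\Big]\,dt \qquad (9)$$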
From eq.(9), we can derive the same loss function as in eq.(5) by using a rough approximation of the integral with a discrete summation and the equivalence between different score matching objectives [29, 15] (as shown in [13]). Therefore, unconditional score estimation can be thought of as a crude approximation of expected likelihood maximization. Nevertheless, expressing the likelihood as the integral in eq.(8) is much more illuminating and powerful. Observe that eq.(9) is obtained by taking the expectation of the likelihood of individual data in eq.(8). It is this expectation which leads to learning a common unconditional score function. We can forego the expectation to train score functions for each data point. For any data point $x_0$, we can directly optimize its likelihood by training a score function to optimize the lower bound on the right hand side of eq.(8). Specifically, we train an encoder, Enc, and a score function, $s_\theta$, as follows:
$$z = \mathrm{Enc}(x_0) \qquad (10)$$
$$\log p_0(x_0) \geq \mathbb{E}\big[\log p_T(x_T) \mid x_0\big] - \frac{1}{2}\int_0^T \mathbb{E}\Big[g(t)^2\big\|s_\theta(x_t, t, z)\big\|^2 + 2\,\nabla_{x_t}\cdot\big(g(t)^2 s_\theta(x_t, t, z) - f(x_t, t)\big) \,\Big|\, x_0\Big]\,dt \qquad (11)$$
Eq.(10) and eq.(11) complete our autoencoder model, which maximizes the lower bound on the log likelihood of an individual image, $x_0$.
4.2 Score Decomposition
As motivated in the introduction (eq.(1)), we want to decompose each score function into multiple components, which intuitively represent different energy components (the negative of the gradient of the energy, to be precise). We decompose each score function into $K$ components such that $s_\theta(x, t) = \sum_{k=1}^{K} s_{\theta_k}(x, t)$. This decomposition requires $K$ different score functions. A computationally efficient way would be to share the weights of the score functions but use different latent vectors $z_k$ for different components, which also gives a structure to the latent space. Therefore, we decompose the score function as:
$$s_\theta(x, t, z_1, \dots, z_K) = \sum_{k=1}^{K} s_\theta(x, t, z_k) \qquad (12)$$
Note that now the burden of learning different energy components has been shifted to the latent vectors $z_k$ together with a shared conditional score function $s_\theta(x, t, z)$.
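The weight sharing in eq.(12) amounts to calling one function once per latent and summing the outputs. The sketch below illustrates this in Python with a hypothetical scalar stand-in for the shared network (`shared_score`, the gradient of a quadratic energy centred at the latent); the real model uses a UNet and vector latents, so this is an illustration of the structure only.

```python
def shared_score(x, t, z):
    # Hypothetical stand-in for the shared conditional score network s_theta(x, t, z):
    # the negative gradient of a quadratic energy centred at the latent z.
    return -(x - z) / (1.0 + t)


def composed_score(x, t, latents):
    # Eq.(12): one shared function evaluated once per latent, then summed.
    return sum(shared_score(x, t, z) for z in latents)


latents = [-1.0, 0.5, 2.0]            # K = 3 scalar latents, one per component
x, t = 0.25, 0.1
per_component = [shared_score(x, t, z) for z in latents]
total = composed_score(x, t, latents)

# Summing quadratic energies gives a quadratic energy centred at the mean latent,
# so the composed score vanishes there.
mode = sum(latents) / len(latents)
print(per_component, total, composed_score(mode, t, latents))
```

In this toy the composed score is zero at the mean of the latents, which shows how the individual components jointly determine where the composed density concentrates.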
4.3 Model and Training
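The encoder encodes each image $x_0$ into $K$ latent vectors $z_1, \dots, z_K$. The summed score function given by eq.(12) is used to maximize the lower bound on the log likelihood of each $x_0$. We also approximate this integral with a discrete summation and invoke the equivalence of implicit score matching [15] and denoising score matching [29, 28] to obtain an approximation [13] as the following loss function:
$$\mathbb{E}_{t}\Big[\lambda(t)\,\mathbb{E}_{x_0}\,\mathbb{E}_{x_t \mid x_0}\Big\|\sum_{k=1}^{K} s_\theta(x_t, t, z_k) - \nabla_{x_t}\log p_{0t}(x_t \mid x_0)\Big\|^2\Big], \qquad (z_1, \dots, z_K) = \mathrm{Enc}(x_0) \qquad (13)$$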
We jointly train the encoder and the score network by minimizing the loss in eq.(13), where the score is given by eq.(12). To design a score function as a function of the latent vector $z_k$, we use the adaptive Group Norm (AdaGN) strategy as described in [8]. Once the model is trained, we generate an image by first sampling noise from the standard Gaussian distribution and then iteratively applying the Euler-Maruyama discretization of the SDE as described by eq.(7).
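As a rough illustration of the training objective, the following Python sketch estimates a denoising score matching loss for a single data point when the model score is a sum of latent-conditioned components. Several things here are assumptions made for the example: the scalar `shared_score` stand-in, the constant-rate variance-preserving perturbation kernel, the weighting `sigma ** 2`, and the hand-picked latents (a real run would obtain them from the encoder and backpropagate through both networks).

```python
import math
import random


def shared_score(x, t, z):
    # Hypothetical shared conditional score s_theta(x, t, z); a real model would be a UNet.
    return -(x - z) / (1.0 + t)


def perturb(x0, t, eps, beta=10.0):
    # Gaussian perturbation kernel of a variance-preserving SDE with constant rate beta:
    # x_t = mean_scale * x0 + sigma * eps.
    mean_scale = math.exp(-0.5 * beta * t)
    sigma = math.sqrt(1.0 - math.exp(-beta * t))
    return mean_scale * x0 + sigma * eps, sigma


def component_dsm_loss(x0, latents, rng, n_draws=256):
    # Monte Carlo estimate of the denoising score matching loss for one data point,
    # with the model score given by the sum over latent-conditioned components.
    total = 0.0
    for _ in range(n_draws):
        t = rng.uniform(1e-3, 1.0)
        eps = rng.gauss(0.0, 1.0)
        xt, sigma = perturb(x0, t, eps)
        target = -eps / sigma                     # gradient of log p(x_t | x_0)
        model = sum(shared_score(xt, t, z) for z in latents)
        weight = sigma ** 2                       # assumed weighting lambda(t)
        total += weight * (model - target) ** 2
    return total / n_draws


rng = random.Random(0)
loss = component_dsm_loss(x0=0.7, latents=[-1.0, 0.5, 2.0], rng=rng)
print(loss)
```

The regression target `-eps / sigma` follows from the Gaussian form of the perturbation kernel, which is what makes the denoising objective cheap to evaluate.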
We also experiment with generating samples by combining score components with an unconditional score function, similar to [12]. For this, we could separately train conditional and unconditional score functions. However, to make it computationally cheap, we pass a vector of ones, $\mathbf{1}$, as the latent vector to realize the unconditional score function $s_\theta(x, t, \mathbf{1})$, similar to the trick used in [12]. We combine conditional and unconditional score functions through a linear combination whose coefficients are described in the experiments.
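One way such a linear combination could look is sketched below in Python. The blending weight `gamma`, the scalar `shared_score` stand-in, and the convention that `gamma = 1` keeps a component fully conditional are all assumptions made for illustration; the exact coefficients and parameterization used in the experiments may differ.

```python
def shared_score(x, t, z):
    # Hypothetical shared conditional score s_theta(x, t, z).
    return -(x - z) / (1.0 + t)


UNCONDITIONAL_LATENT = 1.0   # scalar analogue of the all-ones latent described above


def interpolated_score(x, t, latents, index, gamma):
    # Keep every component except `index` untouched; blend that one component with the
    # unconditional score using weight gamma (gamma = 1 keeps it fully conditional).
    total = 0.0
    for k, z in enumerate(latents):
        conditional = shared_score(x, t, z)
        if k == index:
            unconditional = shared_score(x, t, UNCONDITIONAL_LATENT)
            total += gamma * conditional + (1.0 - gamma) * unconditional
        else:
            total += conditional
    return total


latents = [-1.0, 0.5, 2.0]
x, t = 0.25, 0.1
for gamma in (1.0, 0.5, 0.0):
    print(gamma, interpolated_score(x, t, latents, index=0, gamma=gamma))
```

Plugging the interpolated score into the sampler in place of the fully conditional one leaves the untouched components in control of their attributes while the blended component is progressively released.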
5 Results and Discussion

5.1 Experimental Details

The score network is a modified version of the UNet architecture as described in Dhariwal et al. [8]. To condition on the latent vector obtained from the encoder, we use AdaGN as described in [8], which is inspired by adaptive instance normalization (AdaIN) [14], but uses group normalization [30] instead of instance normalization. Our architecture is similar to DiffAE's adaptation of AdaGN, i.e.
$$\mathrm{AdaGN}(h, t, z) = z_s\,(t_s\,h + t_b) \qquad (14)$$
where $h$ is the normalized feature map at different layers, $(t_s, t_b)$ are obtained from the time embedding by applying an MLP and $z_s$ is an affine transformation of the latent vector $z$.
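The modulation in eq.(14) can be illustrated with plain Python lists. The sketch below is a minimal stand-in, not the architecture used in the paper: the per-channel scales and biases (`t_scale`, `t_bias`, `z_scale`) are passed in directly as hypothetical values, whereas in the model they would come from an MLP on the time embedding and an affine map of the latent vector.

```python
import math


def group_norm(channels, num_groups, eps=1e-5):
    # channels: list of C channels, each a list of activations (C divisible by num_groups).
    # Normalizes each group of C // num_groups channels to zero mean and unit variance.
    size = len(channels) // num_groups
    out = []
    for g in range(num_groups):
        group = channels[g * size:(g + 1) * size]
        values = [v for ch in group for v in ch]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        out.extend([[(v - mean) / math.sqrt(var + eps) for v in ch] for ch in group])
    return out


def adagn(channels, num_groups, t_scale, t_bias, z_scale):
    # Eq.(14)-style modulation z_s * (t_s * h + t_b), applied per channel,
    # where h is the group-normalized feature map.
    h = group_norm(channels, num_groups)
    return [[z_scale[c] * (t_scale[c] * v + t_bias[c]) for v in ch]
            for c, ch in enumerate(h)]


features = [[1.0, 2.0], [3.0, 4.0], [10.0, 12.0], [14.0, 16.0]]   # C = 4 channels
out = adagn(features, num_groups=2,
            t_scale=[1.0, 1.0, 2.0, 2.0], t_bias=[0.0, 0.0, 1.0, 1.0],
            z_scale=[1.0, 0.5, 1.0, 0.5])
print(out)
```

Because the latent enters only through the multiplicative factor, the same network weights serve every component, and changing the latent rescales the time-modulated features channel by channel.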
We borrow the encoder from DiffAE [22], i.e. the first part of the UNet architecture used in the score function. We experiment with two values of K: three and five. The encoder encodes K vectors of a fixed length, which condition the score functions.
Due to the computing constraints, we train for 150K iterations using Adam optimizer and batch size of 32. We experiment on four datasets in computer vision: 1) Celeb-A, 2) LSUN-outdoor_church, 3) Cifar-10 and 4) SVHN. We use image size of for Cifar-10 and SVHN and for LSUN and Celeb-A, again due to resource constraints.
5.2 Natural variation in reconstruction
In Fig.(1), we show autoencoding of samples from four datasets. We input a batch of 16 images, all of which are the same. The figure shows that the reconstruction is very close to the ground truth, but, at the same time, there are natural variations.
5.3 Visualizing different components through generation

Visualization of different score components can be insightful about what each component is capturing from the image. It is unclear what the best way is to visualize score components, which are essentially represented as matrices. We demonstrate that generating image samples using each component can be an effective way to visualize each component. To be precise, if the full score is $\sum_{k=1}^{K} s_\theta(x, t, z_k)$, then we take each component, say $s_\theta(x, t, z_k)$, and generate images using the Euler-Maruyama discretization in eq. (7). The samples generated using three components are plotted in Fig.2 where each row is a component. The input images of these components are the same images as in Fig.1. Note that even though the input image and the latent vectors are the same, there are natural variations in samples from each component.
Several observations are in order. First, this method can result in human interpretable components in some cases, but not always. For example, the third component in the SVHN image is clearly capturing digits; however, the first and second components are capturing some abstract texture or lighting related information. Similarly, the second component of the bird is clearly focusing on the color, but the first one is capturing some abstract shape information.
This score decomposition forces us to rethink the definition of components of an image. In classical works, disentanglement has been seen as some kind of statistical property of the distribution and methods were proposed to enforce such statistical pattern, for example, independence. Here, however, we see that components could capture interesting pattern that may or may not be independent or interpretable. We can certainly say that the three components are different, but at the same time capture something about the input image. Therefore, they are different disentangled components. Yet, they may or may not possess statistical independence property or complete human interpretability.


5.4 Manipulation of components
We manipulate images using these learned components, with the help of an unconditional score function. In Fig.3, we retain two components untouched while we linearly interpolate one component with the unconditional score function to generate images. That is, if $s_\theta(x, t, z_1)$, $s_\theta(x, t, z_2)$ and $s_\theta(x, t, z_3)$ are the three components, we generate images with the following score function:
(15) |
In Fig.3, we observe that the first component is associated mostly with the background information and less with the church building. When this component is interpolated, we see that information like the yellow color (related to the second component) and the building architecture (related to the third component) is preserved. As we move along the interpolation range, we see that the image changes while preserving the yellow color of the church, with different building architectures, but the trees and the background change, making the samples more diverse. At the far end of the range, we see that many of the trees have been removed and replaced with something else, creating diverse background settings.
In Fig.4, we observe a similar effect. The first component is changed in the second row while the second and third components are retained. From the first row, it is clear that the first component is associated with the smile. Hence, as we move along the interpolation range, the sample images become more diverse in terms of smile. At the far end of the range, we see a few instances where the mouth is even closed.
Similarly, in Fig.5, we change the second and third components while retaining the first one. Here also, the first component is associated with the smile while the second and third are associated with shape features and hair. We fix the first component and interpolate the second and third components as follows:
(16) |
In doing so, as we move along the interpolation range, the smile is preserved while other features change. Hence, towards the far end of the range, we see images with a lot of diversity in terms of shape, color and hair while the smile is preserved.
5.5 Tuning the weight of components
In Fig.6, we tune the weights of two score components. When the first of the two is weighted heavily, the greenish hue and the corresponding shape of the church dominate. As we move the weight towards the second, we start seeing more of a mix between the two.
5.6 Varying the number of components


Fig.7 compares five components against three components learned in Cifar10 dataset. Components 4, 5 and 1 in five components are similar to the three in first row. However, Component 3 in second row is different and seems to capture the colored pattern with shape in the middle.
6 Conclusion
Based on the insights of energy based models, we proposed to learn multiple score components while training a score based generative model. These components are interpretable and provide us more control to manipulate and edit images. In experiments, we discuss the interpretability and demonstrate our ability to edit and generate images through component manipulation. These score components also provide us new interpretable methods in score based generative models.
There is room for improvement. Our ability to manipulate images is limited by the number of score components. Scaling this method to a higher number of components without computational overhead could be an important future direction. Guiding components towards interpretations more amenable to humans could be another promising direction of future research.
References
- [1] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208–18218, 2022.
- [2] Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models. arXiv preprint arXiv:2201.06503, 2022.
- [3] David Bau, Hendrik Strobelt, William Peebles, Jonas Wulff, Bolei Zhou, Jun-Yan Zhu, and Antonio Torralba. Semantic photo manipulation with a generative image prior. arXiv preprint arXiv:2005.07727, 2020.
- [4] Ricky TQ Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating sources of disentanglement in variational autoencoders. Advances in neural information processing systems, 31, 2018.
- [5] Tianrong Chen, Guan-Horng Liu, and Evangelos Theodorou. Likelihood training of schrödinger bridge using forward-backward SDEs theory. In International Conference on Learning Representations, 2022.
- [6] Edo Collins, Raja Bala, Bob Price, and Sabine Susstrunk. Editing in style: Uncovering the local semantics of gans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5771–5780, 2020.
- [7] Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion schrödinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34:17695–17709, 2021.
- [8] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 8780–8794. Curran Associates, Inc., 2021.
- [9] Yilun Du, Shuang Li, Yash Sharma, Josh Tenenbaum, and Igor Mordatch. Unsupervised learning of compositional energy concepts. Advances in Neural Information Processing Systems, 34:15608–15620, 2021.
- [10] Ruiqi Gao, Erik Nijkamp, Diederik P Kingma, Zhen Xu, Andrew M Dai, and Ying Nian Wu. Flow contrastive estimation of energy-based models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7518–7528, 2020.
- [11] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, 2020.
- [12] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
- [13] Chin-Wei Huang, Jae Hyun Lim, and Aaron C Courville. A variational perspective on diffusion-based generative models and score matching. Advances in Neural Information Processing Systems, 34:22863–22876, 2021.
- [14] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE international conference on computer vision, pages 1501–1510, 2017.
- [15] Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.
- [16] Ioannis Karatzas and Steven E Shreve. Brownian motion and stochastic calculus, volume 113. Springer Science & Business Media, 1991.
- [17] Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in neural information processing systems, 34:21696–21707, 2021.
- [18] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022.
- [19] Emile Mathieu, Tom Rainforth, Nana Siddharth, and Yee Whye Teh. Disentangling disentanglement in variational autoencoders. In International Conference on Machine Learning, pages 4402–4412. PMLR, 2019.
- [20] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022.
- [21] Weili Nie, Brandon Guo, Yujia Huang, Chaowei Xiao, Arash Vahdat, and Anima Anandkumar. Diffusion models for adversarial purification. 2022.
- [22] Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwajanakorn. Diffusion autoencoders: Toward a meaningful and decodable representation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- [23] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
- [24] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- [25] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
- [26] Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. Advances in Neural Information Processing Systems, 34:1415–1428, 2021.
- [27] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
- [28] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
- [29] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
- [30] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference on computer vision (ECCV), pages 3–19, 2018.
- [31] Weihao Xia, Yulun Zhang, Yujiu Yang, Jing-Hao Xue, Bolei Zhou, and Ming-Hsuan Yang. Gan inversion: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
- [32] Jianwen Xie, Yang Lu, Song-Chun Zhu, and Yingnian Wu. A theory of generative convnet. In International Conference on Machine Learning, pages 2635–2644. PMLR, 2016.
- [33] Jongmin Yoon, Sung Ju Hwang, and Juho Lee. Adversarial purification with score-based generative models. In International Conference on Machine Learning, pages 12062–12072. PMLR, 2021.