SDDM: Score-Decomposed Diffusion Models on Manifolds
for Unpaired Image-to-Image Translation
Abstract
Recent score-based diffusion models (SBDMs) show promising results in unpaired image-to-image translation (I2I). However, existing methods, either energy-based or statistics-based, provide no explicit form of the interfered intermediate generative distributions. This work presents a new score-decomposed diffusion model (SDDM) on manifolds to explicitly optimize the tangled distributions during image generation. SDDM derives manifolds to make the distributions of adjacent time steps separable and decomposes the score function or energy guidance into an image “denoising” part and a content “refinement” part. To refine the image at the same noise level, we equalize the refinement parts of the score function and energy guidance, which permits multi-objective optimization on the manifold. We also leverage a block adaptive instance normalization module to construct manifolds of lower dimension that nevertheless remain concentrated around the perturbed reference image. SDDM outperforms existing SBDM-based methods with much fewer diffusion steps on several I2I benchmarks.
1 Introduction
Score-based diffusion models (Song & Ermon, 2019; Song et al., 2021; Ho et al., 2020; Nichol & Dhariwal, 2021; Bao et al., 2022a; Lu et al., 2022) (SBDMs) have recently made significant progress in a series of conditional image generation tasks. In particular, in the unpaired image-to-image translation (I2I) task (Pang et al., 2021), recent studies have shown that a pre-trained SBDM on the target image domain with energy guidance (Zhao et al., 2022) or statistical guidance (Choi et al., 2021) outperforms generative adversarial network (GAN) (Goodfellow et al., 2014)-based methods (Fu et al., 2019; Zhu et al., 2017; Yi et al., 2017; Park et al., 2020; Benaim & Wolf, 2017; Zheng et al., 2021; Shen et al., 2019; Huang et al., 2018; Jiang et al., 2020; Lee et al., 2018) and achieves state-of-the-art performance.
SBDMs define a diffusion process that guides how image-shaped data drawn from a Gaussian distribution is iterated, step by step, into an image of the target domain. In each step, the SBDM gives score guidance which, from an engineering perspective, can be mixed with energy and statistical guidance to control the generation process. However, two issues remain. First, because the coefficient of the score guidance is fixed by the stochastic differential equation of the reverse diffusion process, it cannot be adjusted freely. Second, it is still unclear how energy guidance affects the intermediate distributions. As a result, the I2I result is often unsatisfactory, especially when the number of iterations is inadequate. Moreover, there has not yet been a method to ensure that the intermediate distributions are not negatively interfered with during the above guidance process.
To overcome these limitations, we propose to decompose the score function from a new manifold optimization perspective, thus better exerting the energy and statistical guidance. To this end, we present SDDM, a new score-decomposed diffusion model on manifolds to explicitly optimize the tangled distributions during the conditional image generation process. When generating an image from score guidance, an SBDM actually performs two distinct tasks, one is image “denoising”, and the other is content “refinement” to bring the image-shaped data closer to the target domain distribution with the same noise level. Based on this new perspective, SDDM decomposes the score function into two different parts, one for image denoising and the other for content refinement. To realize this decomposition, we take statistical guidance as the manifold restriction to get an explicit division between the data distributions in neighboring time steps. We find that the tangent space of the manifold naturally separates the denoising part and the refinement part of the score function. In addition, the tangent space can also split out the denoising part of the energy guidance, thus achieving a more explanatory conditional generation.
Within the decomposed score functions, the content refinement part of the score function and the energy functions are on an equal footing. Therefore we can treat the optimization on the manifold as a multi-objective optimization, thus avoiding negative interference of the other guidance terms with the score guidance. To realize the score-decomposed diffusion model, we leverage the block adaptive instance normalization (BAdaIN) module to serve as the restriction function onto the manifold, which is a stronger constraint than the widely used low-pass filter (Choi et al., 2021). With our carefully designed BAdaIN, the tangent space of the manifold provides a better division for the score and energy guidance. We also prove that our manifolds concentrate the perturbed reference image as tightly as those in Choi et al. (2021).
To summarize, this work makes the following three main contributions:
- We present a new score-decomposed diffusion model on manifolds to explicitly optimize the tangled distributions during the conditional image generation process.
- We introduce a multi-objective optimization algorithm into the conditional generation of SBDMs, which permits not only many powerful gradient combination algorithms but also adjustment of the score factor.
- We design a BAdaIN module to construct a lower dimensional manifold compared with the low-pass filter and thus provide a concrete model implementation.
With the above contributions, we have obtained a high-performance conditional image generation model. Extensive experimental evaluations and analyses on two I2I benchmarks demonstrate the superior performance of the proposed model. Compared to other SBDM-based methods, SDDM generates better results with much fewer diffusion steps.
2 Background
2.1 Score-Based Diffusion Models (SBDMs)
SBDMs (Song et al., 2021; Ho et al., 2020; Dhariwal & Nichol, 2021; Zhao et al., 2022) first progressively perturb the training data via a forward diffusion process and then learn to reverse this process to form a generative model of the unknown data distribution. Denoting by $p_0$ the distribution of the training set with i.i.d. samples on $\mathbb{R}^d$ and by $p_t$ the intermediate distribution at time $t \in [0, T]$, the forward diffusion process follows the stochastic differential equation (SDE):
$$\mathrm{d}\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}, \qquad (1)$$
where $\mathbf{f}(\cdot, t)$ is the drift coefficient, $\mathrm{d}t$ denotes an infinitesimal positive timestep, $g(t)$ is the diffusion coefficient, and $\mathbf{w}$ is a standard Wiener process. Denote by $p_{0t}(\mathbf{x}_t \mid \mathbf{x}_0)$ the transition kernel from time $0$ to $t$, which is determined by $\mathbf{f}$ and $g$. In practice, $\mathbf{f}(\cdot, t)$ is usually an affine transformation w.r.t. $\mathbf{x}$, so that $p_{0t}(\mathbf{x}_t \mid \mathbf{x}_0)$ is a linear Gaussian distribution and $\mathbf{x}_t$ can be sampled from $\mathbf{x}_0$ in one step (Zhao et al., 2022). The following VP-SDE is mostly used:
$$\mathrm{d}\mathbf{x} = -\tfrac{1}{2}\beta(t)\,\mathbf{x}\,\mathrm{d}t + \sqrt{\beta(t)}\,\mathrm{d}\mathbf{w}, \qquad (2)$$
and DDPM (Ho et al., 2020; Dhariwal & Nichol, 2021) use the following discrete form of the above SDE:
$$\mathbf{x}_t = \sqrt{1-\beta_t}\,\mathbf{x}_{t-1} + \sqrt{\beta_t}\,\boldsymbol{\epsilon}_{t-1}, \qquad \boldsymbol{\epsilon}_{t-1} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}). \qquad (3)$$
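Because the transition kernel of the VP-SDE is a linear Gaussian, $\mathbf{x}_t$ can be drawn from $\mathbf{x}_0$ in a single step using $\bar{\alpha}_t = \prod_{s \le t}(1-\beta_s)$. The snippet below is a minimal sketch (not the paper's code) of this one-step perturbation under an assumed linear $\beta$ schedule.

```python
import numpy as np

# One-step forward perturbation of the discrete VP-SDE / DDPM (Eqn. 3).
# The linear beta schedule below is an assumption; the paper's schedule may differ.
T = 1000
betas = np.linspace(1e-4, 0.02, T)           # beta_t for t = 1..T
alphas_bar = np.cumprod(1.0 - betas)          # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

def perturb(x0: np.ndarray, t: int, rng=np.random.default_rng(0)) -> np.ndarray:
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I) in one step."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

x0 = np.zeros((3, 64, 64))                    # a dummy image in [-1, 1]
x_mid = perturb(x0, t=T // 2)                 # perturbed sample at the middle time step
```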
Normally an SDE is not time-reversible because the forward process loses information on the initial data distribution and converges to a terminal distribution $p_T$. However, Song et al. (2021) find that the reverse process satisfies the following reverse-time SDE:
$$\mathrm{d}\mathbf{x} = \big[\mathbf{f}(\mathbf{x}, t) - g(t)^2\,\nabla_{\mathbf{x}}\log p_t(\mathbf{x})\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}, \qquad (4)$$
where $\mathrm{d}t$ is an infinitesimal negative timestep and $\bar{\mathbf{w}}$ is a reverse-time standard Wiener process. Song et al. (2021) adopt a score-based model $s_\theta(\mathbf{x}, t)$ to approximate the score function, i.e., $s_\theta(\mathbf{x}, t) \approx \nabla_{\mathbf{x}}\log p_t(\mathbf{x})$, obtaining the following reverse-time SDE:
$$\mathrm{d}\mathbf{x} = \big[\mathbf{f}(\mathbf{x}, t) - g(t)^2\,s_\theta(\mathbf{x}, t)\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}. \qquad (5)$$
In the VP-SDE, the terminal distribution $p_T$ is also a standard Gaussian distribution.
For controllable generation, it is convenient to add a guidance function, e.g., $\nabla_{\mathbf{x}}\log p_t(\mathbf{c}\mid\mathbf{x})$ for a condition $\mathbf{c}$, to the score function and then obtain a new reverse-time SDE:
$$\mathrm{d}\mathbf{x} = \big[\mathbf{f}(\mathbf{x}, t) - g(t)^2\big(s_\theta(\mathbf{x}, t) + \nabla_{\mathbf{x}}\log p_t(\mathbf{c}\mid\mathbf{x})\big)\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}. \qquad (6)$$
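To make the guided reverse-time SDE concrete, the following is a minimal sketch of one Euler-Maruyama step of the VP-SDE with an energy-style guidance term (in the spirit of Eqns. (6) and (8)); `score_fn` and `energy_grad` stand for a pretrained score network and a guidance gradient that are not defined here, so this is an assumed interface rather than the paper's implementation.

```python
import numpy as np

def guided_reverse_step(x, t, dt, beta, score_fn, energy_grad,
                        rng=np.random.default_rng(0)):
    """One Euler-Maruyama step of the guided reverse-time VP-SDE (a sketch).

    score_fn(x, t)    -> estimate of the score (placeholder network)
    energy_grad(x, t) -> gradient of the guidance energy (placeholder)
    dt is negative when integrating backwards in time.
    """
    f = -0.5 * beta * x                          # VP-SDE drift f(x, t)
    g2 = beta                                    # g(t)^2 for the VP-SDE
    drift = f - g2 * (score_fn(x, t) - energy_grad(x, t))
    noise = np.sqrt(g2 * abs(dt)) * rng.standard_normal(x.shape)
    return x + drift * dt + noise
```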
2.2 SBDMs in Unpaired Image to Image Translation
Unpaired I2I aims to transfer an image from a source domain to a different target domain, given only unpaired training data from the two domains. This translation can be achieved by designing a distribution on the target domain conditioned on the source image to be transferred.
In ILVR (Choi et al., 2021), given a reference image $\mathbf{y}$, the sample is refined after each denoising step with a low-pass filter $\phi_N(\cdot)$ to maintain faithfulness to the reference image:
$$\mathbf{x}_{t-1} \leftarrow \phi_N(\mathbf{y}_{t-1}) + \mathbf{x}'_{t-1} - \phi_N(\mathbf{x}'_{t-1}), \qquad (7)$$

where $\mathbf{y}_{t-1}$ is the reference image perturbed to time $t-1$ and $\mathbf{x}'_{t-1}$ is the unconditional denoising proposal.
In EGSDE (Zhao et al., 2022), two energy-based guidance functions are carefully designed, following the conditional generation method of Song et al. (2021):
$$\mathrm{d}\mathbf{x} = \big[\mathbf{f}(\mathbf{x}, t) - g(t)^2\big(s_\theta(\mathbf{x}, t) - \nabla_{\mathbf{x}}\mathcal{E}(\mathbf{x}, \mathbf{y}, t)\big)\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}. \qquad (8)$$
Notably, energy-based methods do not prevent the intermediate distributions from being overly or negatively disturbed, and neither approach fully exploits the statistics of the reference image; thus the generation results may be suboptimal.
3 Score-Decomposed Diffusion Model
This section elaborates the proposed model starting from Eqn. (8). For the guidance function in Eqn. (8), we set it to the following widely adopted form (Zhao et al., 2022; Bao et al., 2022b):
$$\mathcal{E}(\mathbf{x}, \mathbf{y}, t) = \lambda_s\,\mathcal{E}_s(\mathbf{x}, \mathbf{y}, t) + \lambda_i\,\mathcal{E}_i(\mathbf{x}, \mathbf{y}, t), \qquad (9)$$
where $\mathcal{E}_s$ and $\mathcal{E}_i$ denote two different energy guidance functions, and $\lambda_s$ and $\lambda_i$ are two weighting coefficients.
3.1 Model Overview
Figure 1 overviews the main process of the proposed SDDM model. The second equation at the bottom is the equivalent SDE formulation obtained from Eqns. (8) and (9). Starting from this equation, we obtain the first SDE in Figure 1, which indicates the generation process. The illustration explains the two-stage optimization at time step $t$.
To explicitly optimize the tangled distributions during image generation, we use moments of the perturbed reference image as constraints for constructing separable manifolds, thus disentangling the distributions of adjacent time steps. As shown in Figure 1, the manifolds $\mathcal{M}_t$ and $\mathcal{M}_{t-1}$ of adjacent time steps are separable, which indicates that the conditional distributions of adjacent time steps are also separable. Furthermore, at time step $t$, the manifold $\mathcal{M}_t$ decomposes the score function into a content refinement part (on the tangent space) and an image denoising part (on the normal space), and also separates out the content refinement part of the energy guidance on the tangent space $T_{\mathbf{x}_t}\mathcal{M}_t$. Therefore, the entire optimization process at each time step is divided into two stages: one is to optimize on the manifold $\mathcal{M}_t$, and the other is to map the result to the next manifold $\mathcal{M}_{t-1}$ properly.
In the first stage, we optimize on the manifold $\mathcal{M}_t$. We apply a multi-objective optimization algorithm to obtain the red vector MOO in Figure 1, which is the optimal direction on the tangent space $T_{\mathbf{x}_t}\mathcal{M}_t$ considering both the score function and the energy guidance. In the second stage, we use the remaining part of the first equation in Figure 1, which contains the normal ("denoising") component and the reverse-time noise, to map the refined sample onto the next manifold $\mathcal{M}_{t-1}$ properly. Note that, for consistency of form, we use a restriction function to indicate the projection onto $\mathcal{M}_{t-1}$.
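The two-stage procedure can be summarized by the high-level sketch below. This is not the paper's exact Algorithm 2: the helpers `tangent_project`, `combine_moo`, `denoise_map`, and `restrict` are assumed interfaces, and concrete choices for them are discussed in Sections 3.2 to 4.

```python
def sddm_step(x_t, t, score_fn, energy_grads, tangent_project, combine_moo,
              denoise_map, restrict, n_inner=1, step_size=0.1):
    """High-level sketch of one SDDM time step (two stages)."""
    # ---- Stage 1: refinement on the manifold M_t ------------------------
    x = x_t
    for _ in range(n_inner):
        grads = [tangent_project(score_fn(x, t), x, t)]            # tangent score part
        grads += [tangent_project(g(x, t), x, t) for g in energy_grads]
        direction = combine_moo(grads)                              # MGDA min-norm combination
        x = restrict(x + step_size * direction, t)                  # stay on M_t
    # ---- Stage 2: transition to the next manifold M_{t-1} ----------------
    x_prev = denoise_map(x, t)          # normal ("denoising") part plus reverse-time noise
    return restrict(x_prev, t - 1)      # project onto M_{t-1} (e.g., via BAdaIN)
```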
3.2 Decomposition of the Score and Energy Guidance
Given a score function $\nabla_{\mathbf{x}}\log p(\mathbf{x})$ on $\mathbb{R}^d$, suppose $\mathcal{M}$ is a smooth, compact submanifold of $\mathbb{R}^d$. We let $p_{\mathcal{M}}$ be the corresponding probability distribution restricted on $\mathcal{M}$. Then we have the following definitions:
Definition 1.
The tangent score function.
The tangent score function is the orthogonal projection of the score function onto the tangent space of $\mathcal{M}$, i.e., the score function on the manifold. If there is a series of manifolds $\{\mathcal{M}_t\}$ and the original score function at time $t$ is $\nabla_{\mathbf{x}}\log p_t(\mathbf{x})$, we denote the tangent score function on $\mathcal{M}_t$ by $\nabla^{\mathcal{T}}_{\mathbf{x}}\log p_t(\mathbf{x})$.
Definition 2.
The normal score function.
The normal score function is the orthogonal projection of the score function onto the normal space of the manifold, i.e., the component of the score function in the normal space. We denote the normal score function on the manifold $\mathcal{M}_t$ by $\nabla^{\mathcal{N}}_{\mathbf{x}}\log p_t(\mathbf{x})$.
Then we have the following score function decomposition:
Lemma 1.
$\nabla_{\mathbf{x}}\log p_t(\mathbf{x}) = \nabla^{\mathcal{T}}_{\mathbf{x}}\log p_t(\mathbf{x}) + \nabla^{\mathcal{N}}_{\mathbf{x}}\log p_t(\mathbf{x})$,
which can be derived once the tangent space of $\mathcal{M}_t$ at $\mathbf{x}$ is known.
Normally this division is meaningless because the manifolds of adjacent time steps are coupled with each other. Previous researchers usually treat the entire data space as a single manifold (Liu et al., 2022) or rely on strong assumptions (Chung et al., 2022). However, in some conditional generation tasks, for example the image-to-image translation task, a given reference image can provide compact manifolds at different time steps, and the manifolds of adjacent time steps can be well separated. In this situation, the tangent score function can be treated as a refinement part on the manifold, while the normal score function is part of the mapping between manifolds of adjacent time steps.
We have Proposition 1 to describe the manifolds.
Proposition 1.
At time step $t$, for any single reference image $\mathbf{y}$, the perturbed distribution $q_t(\mathbf{y}_t \mid \mathbf{y})$ is concentrated on a compact manifold $\mathcal{M}_t$ whose dimension is lower than $d$ when $d$ is large enough. Suppose the perturbed reference image follows $\mathbf{y}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{y} + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. The following statistical constraints define such a $(d-2)$-dimensional $\mathcal{M}_t$:

$$\frac{1}{d}\sum_{i=1}^{d} x_i = \sqrt{\bar{\alpha}_t}\,\mu_{\mathbf{y}}, \qquad \frac{1}{d}\sum_{i=1}^{d}\Big(x_i - \frac{1}{d}\sum_{j=1}^{d} x_j\Big)^2 = \bar{\alpha}_t\,\sigma_{\mathbf{y}}^2 + 1 - \bar{\alpha}_t, \qquad (10)$$

where $\mu_{\mathbf{y}}$ and $\sigma_{\mathbf{y}}^2$ denote the mean and variance of the pixels of $\mathbf{y}$.
Proposition 1 shows that we can use statistical constraints to define concentrated manifolds with lower dimensions than $d$. We can also use the chunking trick to lower the manifold dimensions, which will be introduced in Section 4. Therefore, we can use such manifolds to represent the maintenance of the statistics, which indicates that the tangent space can separate the “refinement” part well.
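As a quick sanity check of this concentration phenomenon (our own illustration, not from the paper), the snippet below perturbs a fixed reference block with the VP kernel and shows that the deviations from the first- and second-order constraints of Eqn. (10) shrink as the dimension grows; the value of $\bar{\alpha}_t$ is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
abar = 0.5                                   # assumed value of \bar{alpha}_t
for d in (64, 1024, 16384):
    y = rng.uniform(-1.0, 1.0, size=d)       # a fixed "reference image" block of d pixels
    y_t = np.sqrt(abar) * y + np.sqrt(1 - abar) * rng.standard_normal(d)
    mean_gap = abs(y_t.mean() - np.sqrt(abar) * y.mean())
    var_gap = abs(y_t.var() - (abar * y.var() + (1 - abar)))
    print(d, round(mean_gap, 4), round(var_gap, 4))   # both gaps shrink as d grows
```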
We also have Lemma 2 to show that the manifolds of adjacent time steps, $\mathcal{M}_t$ and $\mathcal{M}_{t-1}$, can be well separated.
Lemma 2.
With the $\mathcal{M}_t$ defined in Proposition 1 and assuming $\mu_{\mathbf{y}} \neq 0$, the manifolds $\mathcal{M}_t$ and $\mathcal{M}_{t-1}$ can be well separated. Rigorously, there exists a hyperplane that divides $\mathbb{R}^d$ into two disconnected half-spaces such that $\mathcal{M}_t$ lies in one of them and $\mathcal{M}_{t-1}$ lies in the other.
Therefore, we can use $\mathcal{M}_t$ to decompose the score function into its tangent and normal parts approximately. More generally, we can decouple the optimization space with the tangent space $T_{\mathbf{x}_t}\mathcal{M}_t$. With this decomposition, we can manipulate the score function of the SBDM and the energy guidance more elaborately. We can also split the “refinement” part out, thus preventing the “denoising” part of the score function from being overly disturbed.
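In practice, the decomposition of Lemma 1 can be computed by projecting a vector onto the two constraint normals of Eqn. (10) and keeping the remainder as the tangent part. The sketch below assumes those two constraints (the all-ones direction for the mean and the centered direction for the variance); names and the interface are ours, not the paper's.

```python
import numpy as np

def split_tangent_normal(score: np.ndarray, x: np.ndarray):
    """Split a vector into tangent and normal parts w.r.t. the manifold of Eqn. (10)."""
    d = x.size
    n1 = np.ones(d) / np.sqrt(d)                  # normal of the first-order (mean) constraint
    r = x.ravel() - x.mean()                       # gradient direction of the variance constraint
    r = r - (r @ n1) * n1                          # orthogonalize against n1 (already orthogonal in theory)
    n2 = r / (np.linalg.norm(r) + 1e-12)
    s = score.ravel()
    normal = (s @ n1) * n1 + (s @ n2) * n2         # 2-dimensional normal component
    tangent = s - normal                           # (d - 2)-dimensional tangent component
    return tangent.reshape(x.shape), normal.reshape(x.shape)
```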
3.3 Stage 1: Optimization on Manifold
First, we give the main definitions about manifold optimization and multi-objective optimization in our task. We use the term restriction to represent the function that maps points near the manifold $\mathcal{M}_t$ onto it, which is normally an orthogonal projection onto $\mathcal{M}_t$.
Definition 3.
Manifold optimization.
Manifold optimization (Hu et al., 2020) is the task of optimizing a real-valued function $F$ on a given Riemannian manifold $\mathcal{M}$. The optimization target is:
$$\min_{\mathbf{x} \in \mathcal{M}} F(\mathbf{x}). \qquad (11)$$
Given time $t$, the score function is an estimate of $\nabla_{\mathbf{x}}\log p_t(\mathbf{x})$, so we can use $-\log p_t(\mathbf{x})$ as the potential energy of the score guidance; the energy guidance functions come with their own potentials in the same way. Then Stage 1 is a manifold optimization.
Definition 4.
Pareto optimality on the manifold.
Consider objective functions $F_1, \dots, F_m$ restricted on the manifold and two points $\mathbf{x}, \bar{\mathbf{x}} \in \mathcal{M}_t$:
- $\mathbf{x}$ dominates $\bar{\mathbf{x}}$ if $\mathbf{x}$ is at least as good as $\bar{\mathbf{x}}$ on every objective $F_i$, and not all equal signs hold at the same time.
- A solution $\mathbf{x}^{*}$ is called Pareto optimal if there exists no solution that dominates $\mathbf{x}^{*}$.
Then, the goal of multi-objective optimization is to find a Pareto optimal solution. Local Pareto optimality can also be reached via gradient descent, as in single-objective optimization. We follow the multiple gradient descent algorithm (MGDA) (Désidéri, 2012), which leverages the Karush-Kuhn-Tucker (KKT) conditions for multi-objective optimization; in our task they read as follows:
Theorem 1.
K.K.T. conditions on a smooth manifold.
At time step $t$, on the tangent space $T_{\mathbf{x}_t}\mathcal{M}_t$, there exist coefficients $\alpha_1, \dots, \alpha_m \ge 0$ with $\sum_i \alpha_i = 1$ such that $\sum_i \alpha_i \nabla^{\mathcal{T}} F_i(\mathbf{x}_t) = \mathbf{0}$, where $\nabla^{\mathcal{T}} F_i$ are the components of the gradients on the tangent space and the $F_i$ are the objective functions restricted on the manifold $\mathcal{M}_t$.
All points that satisfy the above conditions are called Pareto stationary points. Every Pareto optimal point is a Pareto stationary point, while the reverse is not true. Désidéri (2012) showed that the solution of the following optimization problem:
$$\min_{\alpha_1, \dots, \alpha_m} \Big\|\sum_{i=1}^{m} \alpha_i\,\nabla^{\mathcal{T}} F_i(\mathbf{x}_t)\Big\|_2^2 \quad \text{s.t.} \quad \sum_{i=1}^{m}\alpha_i = 1, \ \ \alpha_i \ge 0, \qquad (12)$$
gives a descent direction that improves all tasks or gives a Pareto stationary point. For a balanced result, we normalize all gradients first.
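For two objectives, which is our setting (the tangent score and one energy gradient), the min-norm solution of Problem (12) has a closed form (see also Appendix C.3). The code below is one possible sketch of it; as described above, it normalizes the gradients first for a balanced result.

```python
import numpy as np

def min_norm_direction(g1: np.ndarray, g2: np.ndarray) -> np.ndarray:
    """Minimum-norm convex combination of two normalized gradients (MGDA, two objectives)."""
    g1 = g1 / (np.linalg.norm(g1) + 1e-12)         # normalize for a balanced result
    g2 = g2 / (np.linalg.norm(g2) + 1e-12)
    diff = g1 - g2
    denom = float(np.sum(diff * diff))
    if denom < 1e-12:                               # gradients (nearly) coincide
        return g1
    alpha = float(np.clip(np.sum((g2 - g1) * g2) / denom, 0.0, 1.0))
    # A (near-)zero result means the current point is Pareto stationary.
    return alpha * g1 + (1.0 - alpha) * g2
```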
However, in our task, we only need to search for Pareto stationary points within a small ball $B(\mathbf{x}_t, r)$, because we have many time steps with different manifolds; here $B(\mathbf{x}_t, r)$ is an open ball with center $\mathbf{x}_t$ and radius $r$.
We summarize this procedure as Algorithm 1 (optimization on the manifold $\mathcal{M}_t$).
Remark 1.
We can use the tangent space and the gradients evaluated at $\mathbf{x}_t$ to approximate those at the iterates inside the ball when the radius $r$ is small.
Remark 2.
Notably, EGSDE (Zhao et al., 2022) applies coefficients directly to the guidance vectors, and DVCE (Augustin et al., 2022) applies coefficients after normalizing the guidance vectors. We can also assign coefficients to the normalized energy vectors to change their impact. A smaller norm means a greater impact, as noted in Désidéri (2012).
3.4 Stage 2: Transformation between adjacent manifolds
After the optimization on the manifold $\mathcal{M}_t$, we obtain a point that dominates $\mathbf{x}_t$. We then use the drift term, the “denoising” (normal) part of the score function, the reverse-time noise, and the restriction function onto $\mathcal{M}_{t-1}$ to map this point to the adjacent manifold $\mathcal{M}_{t-1}$.
Firstly, we have the following proposition to describe the properties of the adjacent map.
Proposition 2.
Suppose the drift coefficient $\mathbf{f}(\cdot, t)$ is affine. Then the adjacent map has the following properties:
- There exists a point of $\mathcal{M}_{t-1}$ in the normal space $N_{\mathbf{x}_t}\mathcal{M}_t$ near $\mathbf{x}_t$, and it is unique.
- The map moves $\mathbf{x}_t$ only along the normal space $N_{\mathbf{x}_t}\mathcal{M}_t$.
- It is a transition map from $\mathcal{M}_t$ to $\mathcal{M}_{t-1}$.
- It is determined by the coefficients of the forward SDE and the reference image.
However, if we only used this orthogonal projection as the adjacent map, we would lose the impact of the “denoising” part of the score function and of the reverse-time noise. Therefore, we follow the reverse SDE, using the part of its update that lies in the normal space together with a restriction function onto $\mathcal{M}_{t-1}$ as the adjacent map.
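One possible discrete realization of this stage is sketched below: a DDPM-style reverse update (drift, score term and reverse-time noise) followed by the restriction onto $\mathcal{M}_{t-1}$. For simplicity the full score is used here, whereas SDDM only needs its normal component because the tangent part has already been applied in Stage 1; the restriction function (e.g., BAdaIN from Section 4) is assumed to be given.

```python
import numpy as np

def adjacent_map(x, t, beta_t, score_fn, restrict_prev, rng=np.random.default_rng(0)):
    """Sketch of Stage 2: a reverse (ancestral) VP update followed by projection onto M_{t-1}."""
    alpha_t = 1.0 - beta_t
    # Ancestral reverse mean with the score parameterization:
    # mu = (x + beta_t * score(x, t)) / sqrt(alpha_t)
    mean = (x + beta_t * score_fn(x, t)) / np.sqrt(alpha_t)
    noise = np.sqrt(beta_t) * rng.standard_normal(x.shape)
    x_prev = mean + noise                    # unrestricted proposal at time t - 1
    return restrict_prev(x_prev)             # restriction onto M_{t-1} (e.g., BAdaIN)
```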
Finally, we have Algorithm 2 in the following to generate images with our proposed SDDM.
Remark 3.
If $\mathbf{f}(\mathbf{x}, t)$ is linear in $\mathbf{x}$, then $\mathbf{f}(\mathbf{x}_t, t) \in N_{\mathbf{x}_t}\mathcal{M}_t$.
Remark 4.
When the time step is small, we can simply use the standard reverse-SDE update to approximate the adjacent map.
Remark 5.
Under mild conditions on the step size and on how far the iterates move from the manifold, we can ignore the restriction step in Algorithm 1.
Remark 6.
At the middle time step, we can set a larger iteration number for better results.
4 Implementations
Chunking Trick. The chunking trick is an easy but powerful way to reduce the dimensions of the manifolds in high-dimensional problems such as image generation. We divide the image into a grid of spatial blocks, so that an image of shape $(C, H, W)$ becomes a collection of equally sized blocks, and the manifold becomes the direct product of the manifolds defined on each block, indexed by the block index $b$ (see the sketch after the list below). This trick has the following advantages:
- We can easily obtain the tangent space and the normal space of the product manifold, which are also the direct products of each block's tangent and normal spaces.
- We can control the impact of the reference image on the generation process.
- We can optimize at the block level and reduce the influence of distant blocks.
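A minimal sketch of the chunking itself (our own helper with an arbitrary block count, not the paper's implementation) is:

```python
import numpy as np

def to_blocks(img: np.ndarray, nb: int) -> np.ndarray:
    """Split a (C, H, W) image into nb x nb spatial blocks.
    Returns an array of shape (nb * nb, C, H // nb, W // nb)."""
    c, h, w = img.shape
    bh, bw = h // nb, w // nb
    blocks = img.reshape(c, nb, bh, nb, bw).transpose(1, 3, 0, 2, 4)
    return blocks.reshape(nb * nb, c, bh, bw)

img = np.random.rand(3, 64, 64)
blocks = to_blocks(img, nb=4)
means = blocks.reshape(len(blocks), -1).mean(axis=1)   # per-block first-order moments
stds = blocks.reshape(len(blocks), -1).std(axis=1)     # per-block second-order moments
```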
Manifold Details. For each chunked block of $n$ pixels, we use the first-order and second-order moments to restrict the statistics of the pixels in the block and obtain a manifold. In particular, denote by $\mathbf{y}_t^{(b)}$ the pixels of block $b$ of the perturbed reference image, and by $\mu_t^{(b)}$ and $\sigma_t^{(b)}$ their mean and standard deviation. Then the manifold of block $b$ is restricted with:
$$\frac{1}{n}\sum_{i \in b} x_i = \mu_t^{(b)}, \qquad \frac{1}{n}\sum_{i \in b}\Big(x_i - \frac{1}{n}\sum_{j \in b} x_j\Big)^2 = \big(\sigma_t^{(b)}\big)^2. \qquad (13)$$
Under the restrictions of Eqn. (13), each block manifold $\mathcal{M}_t^{(b)}$ is an $(n-2)$-dimensional hypersphere. Then we can formulate $\mathcal{M}_t$ as:
$$\mathcal{M}_t = \bigotimes_{b} \mathcal{M}_t^{(b)}. \qquad (14)$$
Here $\bigotimes$ denotes the direct product. Huang & Belongie (2017) use the AdaIN module to transfer neural features as
$$\mathrm{AdaIN}(\mathbf{x}, \mathbf{y}) = \sigma(\mathbf{y})\,\frac{\mathbf{x} - \mu(\mathbf{x})}{\sigma(\mathbf{x})} + \mu(\mathbf{y}). \qquad (15)$$
Based on that, we leverage a BAdaIN module as the restriction function onto any $\mathcal{M}_t$:
$$\mathrm{BAdaIN}(\mathbf{x}, \mathbf{y}_t)^{(b)} = \sigma\big(\mathbf{y}_t^{(b)}\big)\,\frac{\mathbf{x}^{(b)} - \mu\big(\mathbf{x}^{(b)}\big)}{\sigma\big(\mathbf{x}^{(b)}\big)} + \mu\big(\mathbf{y}_t^{(b)}\big) \quad \text{for every block } b. \qquad (16)$$
In practice, we use the distribution moments of the perturbed reference image to simplify the calculation and eliminate randomness after knowing the relationship between the perturbed and original reference images, as in Eqn. (10).
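A sketch of the BAdaIN restriction following Eqn. (16) is shown below. Block statistics are pooled over channels here for simplicity, which may differ from the exact per-channel treatment of the paper.

```python
import numpy as np

def badain(x: np.ndarray, y_t: np.ndarray, nb: int, eps: float = 1e-5) -> np.ndarray:
    """Block adaptive instance normalization: rescale every spatial block of x to
    match the mean/std of the same block of the perturbed reference y_t."""
    c, h, w = x.shape
    bh, bw = h // nb, w // nb

    def blockify(a):                                      # (C,H,W) -> (nb*nb, C*bh*bw)
        a = a.reshape(c, nb, bh, nb, bw).transpose(1, 3, 0, 2, 4)
        return a.reshape(nb * nb, -1)

    xb, yb = blockify(x), blockify(y_t)
    mu_x, sd_x = xb.mean(1, keepdims=True), xb.std(1, keepdims=True)
    mu_y, sd_y = yb.mean(1, keepdims=True), yb.std(1, keepdims=True)
    out = sd_y * (xb - mu_x) / (sd_x + eps) + mu_y        # AdaIN applied block by block
    out = out.reshape(nb, nb, c, bh, bw).transpose(2, 0, 3, 1, 4)
    return out.reshape(c, h, w)
```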
Lemma 3.
The perturbed reference distribution concentrates around the block manifolds defined above, with a deviation that vanishes as the block dimension grows, where $n$ is the dimension of the Euclidean space each block lies in.
Remark 7.
Energy Function. We can also use the BAdaIN module to construct weak energy functions. First, we use the first several layers of the VGG19 network (Simonyan & Zisserman, 2014) to extract neural features of the current sample and of the perturbed reference image. Then we use the distance between the two feature maps as the energy function for faithfulness. To verify SDDM's advantage, we use only this weak energy function.
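One possible sketch of such a weak energy and its gradient is given below. It assumes torchvision's pretrained VGG19 (the paper instead uses the VGG weights from the AdaIN repository, see Appendix C.1) and omits input normalization for brevity.

```python
import torch
import torchvision

# Weak faithfulness energy: L2 distance between shallow VGG19 features of the
# current sample and of the perturbed reference image.
vgg_head = torchvision.models.vgg19(weights="IMAGENET1K_V1").features[:4].eval()
for p in vgg_head.parameters():
    p.requires_grad_(False)

def faithfulness_energy(x_t: torch.Tensor, y_t: torch.Tensor) -> torch.Tensor:
    """E(x_t, y_t) = || phi(x_t) - phi(y_t) ||^2, phi = first VGG19 layers."""
    return ((vgg_head(x_t) - vgg_head(y_t)) ** 2).mean()

# Its gradient w.r.t. x_t can serve as one energy guidance term in Eqn. (9):
x_t = torch.rand(1, 3, 64, 64, requires_grad=True)
y_t = torch.rand(1, 3, 64, 64)
grad = torch.autograd.grad(faithfulness_energy(x_t, y_t), x_t)[0]
```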
5 Experiments
Datasets. We evaluate our SDDM on the following datasets. All images are resized to the same resolution.
- AFHQ (Choi et al., 2020) for the Cat → Dog and Wild → Dog tasks.
- CelebA-HQ (Karras et al., 2017) for the Male → Female task.
Evaluation Metrics. We evaluate our translated images from two aspects. One is to assess the distance between the translated and the source images, and we report the SSIM between them. The other is to evaluate the distance between the generated images and the target domain images, and we calculate the widely used Fréchet Inception Distance (FID) (Heusel et al., 2017) between the generated images and the target domain images. Details about the FID settings are in Appendix D.
5.1 Comparison with the State-of-the-arts
We compare our method with other GAN-based and SBDM-based methods, as shown in Table 1.
Model | FID | SSIM
---|---|---
**Cat → Dog** | |
CycleGAN* | 85.9 | -
MUNIT* | 104.4 | -
DRIT* | 123.4 | -
Distance* | 155.3 | -
SelfDistance* | 144.4 | -
GCGAN* | 96.6 | -
LSeSim* | 72.8 | -
ITTR (CUT)* | 68.6 | -
StarGAN v2* | 54.88 ± 1.01 | 0.27 ± 0.003
CUT* | 76.21 | 0.601
SDEdit* | 74.17 ± 1.01 | 0.423 ± 0.001
ILVR* | 74.37 ± 1.55 | 0.363 ± 0.001
EGSDE* | 65.82 ± 0.77 | 0.415 ± 0.001
EGSDE** | 70.16 ± 1.03 | 0.411 ± 0.001
SDDM (Ours) | 62.29 ± 0.63 | 0.422 ± 0.001
SDDM† (Ours) | 49.43 ± 0.23 | 0.361 ± 0.001
**Wild → Dog** | |
SDEdit* | 68.51 ± 0.65 | 0.343 ± 0.001
ILVR* | 75.33 ± 1.22 | 0.287 ± 0.001
EGSDE* | 59.75 ± 0.62 | 0.343 ± 0.001
SDDM (Ours) | 57.38 ± 0.53 | 0.328 ± 0.001
**Male → Female** | |
SDEdit* | 49.43 ± 0.47 | 0.572 ± 0.000
ILVR* | 46.12 ± 0.33 | 0.510 ± 0.001
EGSDE* | 41.93 ± 0.11 | 0.574 ± 0.000
EGSDE** | 45.12 ± 0.24 | 0.512 ± 0.001
SDDM (Ours) | 44.37 ± 0.23 | 0.526 ± 0.001
Compared with other SBDM-based methods, our SDDM improves on both metrics, FID and SSIM, which indicates the effectiveness of the two-stage generation process of SDDM via the decomposition of the score function and energy guidance with manifolds. In particular, compared with EGSDE*, which has strong pre-trained energy functions, in the Cat → Dog task our SDDM improves the FID score by 3.53 and the SSIM score by 0.007 with much fewer diffusion steps. Compared with EGSDE** using 200 diffusion steps, SDDM improves the FID score by 7.87 and the SSIM score by 0.011 in the Cat → Dog task, and improves the FID score by 0.75 and the SSIM score by 0.014 in the Male → Female task, which suggests the advantage of our SDDM with fewer diffusion steps. The visual comparison is in Appendix F.
5.2 Ablation Studies
Observations on Score Components. While performing the Cat → Dog experiment, we report the norms of the deterministic guidance components on the tangent space and the normal space. As shown in Figure 2, the normal-space component occupies only one of the 128 dimensions yet contains most of the magnitude of the deterministic guidance of the diffusion model, while the tangent-space component occupies 127 of the 128 dimensions but has a minimal magnitude. This indicates that we have a relatively large optimization space on the manifold that will not excessively interfere with the intermediate distributions.
[Figure 2: Norms of the deterministic guidance components on the normal and tangent spaces in the Cat → Dog experiment.]
Comparison of Different Manifolds. We compare SDDM with different manifold methods and report the results in Table 2. Compared with the manifold restricted with a low-pass filter, the manifold restricted with our BAdaIN has better performance both on FID and SSIM, because our manifold separates the content refinement part and image denoising part better.
Model | FID | SSIM |
---|---|---|
SDDM(low-pass filter) | 67.56 | 0.411 |
SDDM(BAdaIN) | 62.29 | 0.422 |
Comparison of Different Coefficients. We have two coefficients at each iteration step: the step-size coefficient of the optimal multi-objective direction and the coefficient of the energy guidance. As shown in Table 3, the larger the step-size coefficient is, the better the FID will be, because each optimization on the manifold reaches a position with higher probability; but when it is too large, the FID score degrades again. The energy-guidance coefficient is negatively related to the impact of the energy guidance, so a smaller value makes the energy guidance stronger and thus yields a better SSIM score.
Coefficients | FID | SSIM
---|---|---
 | 65.09 | 0.429
 | 62.02 | 0.420
 | 66.04 | 0.428
 | 62.32 | 0.415
 | 62.29 | 0.422
Comparison w./w.o. Multi-Objective Optimization. We compare SDDM with SDDM without the MOO method in Table 4 and report the FID, SSIM, and the probability of negative impact (PNI), which measures how often the total guidance, including the score and energy terms, negatively impacts one of the objectives. The proposed SDDM method avoids such situations and reaches better performance.
Model | FID | SSIM | PNI |
---|---|---|---|
SDDM(w/o MOO) | 64.93 | 0.421 | 0.024 |
SDDM | 62.29 | 0.422 | 0 |
Policy in The Optimization on Manifolds. We mainly compare three different policies:
- Policy 1: a single optimization iteration in Algorithm 1 at every time step.
- Policy 2: a larger search radius, which normally requires iterating more than once in Algorithm 1; in practice, we iterate twice.
- Policy 3: at the middle time step in Algorithm 2, we use a larger budget and iterate another 4 times; other time steps are the same as in Policy 1.
We report the FID and SSIM of the different policies in Table 5. Policy 3 achieves the best overall balance, which reveals that iterating a little more at the middle time step can trade off the two metrics better without introducing too much cost.
Policy | FID | SSIM |
---|---|---|
Policy 1 | 61.33 | 0.413 |
Policy 2 | 64.05 | 0.418 |
Policy 3 | 62.29 | 0.422 |
The Choice of Middle Time and Block Number. As shown in Figure 3, when we chunk the image into more blocks or set the middle time smaller, the generated image is more faithful to the reference image. But too many blocks will also introduce some bad details, like the mouth in the bottom-left image.
[Figure 3: Generated images under different middle-time and block-number choices.]
Block number | FID | SSIM
---|---|---
 | 54.56 | 0.359
 | 62.29 | 0.422
 | 68.03 | 0.426
6 Related Work
GAN-based Unpaired Image-to-Image Translation. There are mainly two categories of GAN-based methods for the unpaired I2I task: two-sided mapping and one-sided mapping. In the first category, the key idea is that the translated image should be translatable back with another inverse mapping. CycleGAN (Zhu et al., 2017), DualGAN (Yi et al., 2017), DiscoGAN (Kim et al., 2017), SCAN (Van Gansbeke et al., 2020) and U-GAT-IT (Kim et al., 2019) are in this class, but such translations usually lose information. Several newer studies instead map the two domains to the same metric space and use distances in this space as supervision. DistanceGAN (Benaim & Wolf, 2017), GCGAN (Fu et al., 2019), CUT (Park et al., 2020) and LSeSim (Zheng et al., 2021) are in this category.
It is also noteworthy that other techniques have been proposed to tackle the problem of unpaired image-to-image translation. For instance, some studies (Xie et al., 2021, 2018) leverage cooperative learning, whereas others (Zhao et al., 2021) adopt an energy-based framework or a short-run MCMC like Langevin dynamics (Xie et al., 2016).
SBDM-based Conditional Methods. There are mainly two classes of conditional generation with SBDMs. The first is to equip SBDMs with conditional generation ability during training with the classifier-free guidance trick (Ho & Salimans, 2022), which learns the score functions and conditional score functions via a single neural network. The other is to train a separate classifier to guide the learned score functions for conditional generation. EGSDE (Zhao et al., 2022) generalizes the classifier to arbitrary energy-based functions. These methods cannot describe the intermediate distributions clearly, which is a hard problem because the distributions of adjacent time steps are deeply coupled. However, when the conditions provide constraints that separate the adjacent distributions well, we can get better results; this observation inspires our model.
7 Conclusions
In this work, we have presented a new score-decomposed diffusion model, SDDM, which leverages manifold analyses to decompose the score function and explicitly optimize the tangled distributions during image generation. SDDM derives manifolds to separate the distributions of adjacent time steps and decompose the score function or energy guidance into an image “denoising” part and a content “refinement” part. With the new multi-objective optimization algorithm and block adaptive instance normalization module, our realized SDDM method demonstrates promising results in unpaired image-to-image translation on two benchmarks. In future work, we plan to improve and apply the proposed SDDM model in more image translation tasks.
One limitation of our approach involves additional computations, although these computations are negligible compared to the inferences of neural networks. Additionally, we should prevent any misuse of generative algorithms for malicious purposes.
Acknowledgements
This work is supported by the National Key R&D Program of China under Grant No. 2021QY1500, and the State Key Program of the National Natural Science Foundation of China (NSFC) (No.61831022). It is also partly supported by the NSFC under Grant No. 62076238 and 62222606.
References
- Augustin et al. (2022) Augustin, M., Boreiko, V., Croce, F., and Hein, M. Diffusion visual counterfactual explanations. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022.
- Bao et al. (2022a) Bao, F., Li, C., Zhu, J., and Zhang, B. Analytic-DPM: An analytic estimate of the optimal reverse variance in diffusion probabilistic models. arXiv preprint arXiv:2201.06503, 2022a.
- Bao et al. (2022b) Bao, F., Zhao, M., Hao, Z., Li, P., Li, C., and Zhu, J. Equivariant energy-guided SDE for inverse molecular design. arXiv preprint arXiv:2209.15408, 2022b.
- Benaim & Wolf (2017) Benaim, S. and Wolf, L. One-sided unsupervised domain mapping. In Advances in Neural Information Processing Systems, pp. 752–762, 2017.
- Choi et al. (2021) Choi, J., Kim, S., Jeong, Y., Gwon, Y., and Yoon, S. ILVR: Conditioning method for denoising diffusion probabilistic models. arXiv preprint arXiv:2108.02938, 2021.
- Choi et al. (2020) Choi, Y., Uh, Y., Yoo, J., and Ha, J.-W. StarGAN v2: Diverse image synthesis for multiple domains. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8185–8194, 2020.
- Chung et al. (2022) Chung, H., Sim, B., Ryu, D., and Ye, J. C. Improving diffusion models for inverse problems using manifold constraints. arXiv preprint arXiv:2206.00941, 2022.
- Désidéri (2012) Désidéri, J.-A. Multiple-gradient descent algorithm (MGDA) for multiobjective optimization. Comptes Rendus Mathematique, 350(5-6):313–318, 2012.
- Dhariwal & Nichol (2021) Dhariwal, P. and Nichol, A. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems, pp. 8780–8794, 2021.
- Fu et al. (2019) Fu, H., Gong, M., Wang, C., Batmanghelich, K., Zhang, K., and Tao, D. Geometry-consistent generative adversarial networks for one-sided unsupervised domain mapping. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2427–2436, 2019.
- Goodfellow et al. (2014) Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. C., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
- Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6629–6640, 2017.
- Ho & Salimans (2022) Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
- Ho et al. (2020) Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, pp. 6840–6851, 2020.
- Hu et al. (2020) Hu, J., Liu, X., Wen, Z.-W., and Yuan, Y.-X. A brief introduction to manifold optimization. Journal of the Operations Research Society of China, 8(2):199–248, 2020.
- Huang & Belongie (2017) Huang, X. and Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In IEEE International Conference on Computer Vision, pp. 1501–1510, 2017.
- Huang et al. (2018) Huang, X., Liu, M.-Y., Belongie, S., and Kautz, J. Multimodal unsupervised image-to-image translation. In European Conference on Computer Vision, pp. 172–189, 2018.
- Jiang et al. (2020) Jiang, L., Zhang, C., Huang, M., Liu, C., Shi, J., and Loy, C. C. Tsit: A simple and versatile framework for image-to-image translation. In European Conference on Computer Vision, pp. 206–222, 2020.
- Karras et al. (2017) Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
- Kim et al. (2019) Kim, J., Kim, M., Kang, H., and Lee, K. U-gat-it: Unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation. arXiv preprint arXiv:1907.10830, 2019.
- Kim et al. (2017) Kim, T., Cha, M., Kim, H., Lee, J. K., and Kim, J. Learning to discover cross-domain relations with generative adversarial networks. In International Conference on Machine Learning, pp. 1857–1865, 2017.
- Laurent & Massart (2000) Laurent, B. and Massart, P. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pp. 1–18, 2000.
- Lee et al. (2018) Lee, H.-Y., Tseng, H.-Y., Huang, J.-B., Singh, M., and Yang, M.-H. Diverse image-to-image translation via disentangled representations. In European Conference on Computer Vision, pp. 35–51, 2018.
- Lee (2010) Lee, J. Introduction to topological manifolds, volume 202. Springer Science & Business Media, 2010.
- Lee & Lee (2012) Lee, J. M. and Lee, J. M. Smooth manifolds. Springer, 2012.
- Liu et al. (2022) Liu, L., Ren, Y., Lin, Z., and Zhao, Z. Pseudo numerical methods for diffusion models on manifolds. arXiv preprint arXiv:2202.09778, 2022.
- Lu et al. (2022) Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., and Zhu, J. DPM-Solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022.
- Meng et al. (2022) Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.-Y., and Ermon, S. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, pp. 1–14, 2022.
- Nichol & Dhariwal (2021) Nichol, A. Q. and Dhariwal, P. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162–8171, 2021.
- Pang et al. (2021) Pang, Y., Lin, J., Qin, T., and Chen, Z. Image-to-image translation: Methods and applications. IEEE Transactions on Multimedia, 24:3859–3881, 2021.
- Park et al. (2020) Park, T., Efros, A. A., Zhang, R., and Zhu, J.-Y. Contrastive learning for unpaired image-to-image translation. In European Conference on Computer Vision, pp. 319–345, 2020.
- Särkkä & Solin (2019) Särkkä, S. and Solin, A. Applied stochastic differential equations, volume 10. Cambridge University Press, 2019.
- Sener & Koltun (2018) Sener, O. and Koltun, V. Multi-task learning as multi-objective optimization. In Advances in Neural Information Processing Systems, pp. 1–12, 2018.
- Shen et al. (2019) Shen, Z., Huang, M., Shi, J., Xue, X., and Huang, T. S. Towards instance-level image-to-image translation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3683–3692, 2019.
- Simonyan & Zisserman (2014) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Song & Ermon (2019) Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pp. 11918–11930, 2019.
- Song et al. (2021) Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
- Tu (2011) Tu, L. W. Manifolds. In An Introduction to Manifolds, pp. 47–83. Springer, 2011.
- Van Gansbeke et al. (2020) Van Gansbeke, W., Vandenhende, S., Georgoulis, S., Proesmans, M., and Van Gool, L. Scan: Learning to classify images without labels. In European Conference on Computer Vision, pp. 268–285, 2020.
- Xie et al. (2016) Xie, J., Lu, Y., Zhu, S.-C., and Wu, Y. A theory of generative convnet. In International Conference on Machine Learning, pp. 2635–2644, 2016.
- Xie et al. (2018) Xie, J., Lu, Y., Gao, R., Zhu, S.-C., and Wu, Y. N. Cooperative training of descriptor and generator networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(1):27–45, 2018.
- Xie et al. (2021) Xie, J., Zheng, Z., Fang, X., Zhu, S.-C., and Wu, Y. N. Learning cycle-consistent cooperative networks via alternating mcmc teaching for unsupervised cross-domain translation. In AAAI Conference on Artificial Intelligence, pp. 10430–10440, 2021.
- Yi et al. (2017) Yi, Z., Zhang, H., Tan, P., and Gong, M. DualGAN: Unsupervised dual learning for image-to-image translation. In IEEE International Conference on Computer Vision, pp. 2849–2857, 2017.
- Zhao et al. (2022) Zhao, M., Bao, F., Li, C., and Zhu, J. EGSDE: Unpaired image-to-image translation via energy-guided stochastic differential equations. arXiv preprint arXiv:2207.06635, 2022.
- Zhao et al. (2021) Zhao, Y., Xie, J., and Li, P. Learning energy-based generative models via coarse-to-fine expanding and sampling. In International Conference on Learning Representations, 2021.
- Zheng et al. (2021) Zheng, C., Cham, T.-J., and Cai, J. The spatially-correlative loss for various image translation tasks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16407–16417, 2021.
- Zhu et al. (2017) Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision, pp. 2223–2232, 2017.
Appendix A Basic Knowledge about Manifolds
Definition 5.
Topological space.
A topological space $M$ is locally Euclidean of dimension $n$ if every point $p$ in $M$ has a neighborhood $U$ such that there is a homeomorphism $\phi$ from $U$ onto an open subset of $\mathbb{R}^n$. We call the pair $(U, \phi)$ a chart, $U$ a coordinate neighborhood or an open coordinate set, and $\phi$ a coordinate map or a coordinate system on $U$. We say that a chart $(U, \phi)$ is centered at $p$ if $\phi(p) = 0$. A chart about $p$ simply means that $(U, \phi)$ is a chart and $p \in U$.
Definition 6.
Locally Euclidean property.
The locally Euclidean property means that for each $p \in M$, we can find the following:
- an open set $U \subset M$ containing $p$;
- an open set $V \subset \mathbb{R}^n$; and
- a homeomorphism $\phi : U \to V$ (i.e., a continuous bijective map with continuous inverse).
Definition 7.
Topological manifold.
Suppose $M$ is a topological space. We say $M$ is a topological manifold of dimension $n$, or a topological $n$-manifold, if it has the following properties:
- $M$ is a Hausdorff space: for every pair of points $p, q \in M$, there are disjoint open subsets $U, V \subset M$ such that $p \in U$ and $q \in V$.
- $M$ is second countable: there exists a countable basis for the topology of $M$.
- $M$ is locally Euclidean of dimension $n$: every point of $M$ has a neighborhood that is homeomorphic to an open subset of $\mathbb{R}^n$.
Definition 8.
Tangent vector.
A tangent vector at a point $p$ in a manifold $M$ is a derivation at $p$.
Definition 9.
Tangent space.
For a point $p$ in a manifold $M$, the tangent vectors at $p$ form a vector space $T_p M$, called the tangent space of $M$ at $p$. We also write $T_p(M)$ instead of $T_p M$.
Definition 10.
Normal space.
We define the normal space to $M$ at $p \in M$ to be the subspace $N_p M \subset \mathbb{R}^d$ consisting of all vectors that are orthogonal to $T_p M$ with respect to the Euclidean dot product. The normal bundle of $M$ is the subset $NM \subset \mathbb{R}^d \times \mathbb{R}^d$ defined by

$$NM = \big\{(p, v) : p \in M, \ v \in N_p M\big\}. \qquad (18)$$
Appendix B Proofs
B.1 Proof of Lemma 1
Proof. Consider the local coordinate system at $\mathbf{x} \in \mathcal{M}_t$. Suppose $\{\mathbf{e}_1, \dots, \mathbf{e}_d\}$ is an orthonormal basis of $\mathbb{R}^d$ such that $\mathbf{e}_1, \dots, \mathbf{e}_k$ lie in the tangent space $T_{\mathbf{x}}\mathcal{M}_t$ and the rest lie in the normal space $N_{\mathbf{x}}\mathcal{M}_t$. Then:

$$\nabla_{\mathbf{x}}\log p_t(\mathbf{x}) = \sum_{i=1}^{d}\big\langle \nabla_{\mathbf{x}}\log p_t(\mathbf{x}), \mathbf{e}_i\big\rangle\,\mathbf{e}_i. \qquad (19)$$

Therefore, we have:

$$\nabla_{\mathbf{x}}\log p_t(\mathbf{x}) = \underbrace{\sum_{i=1}^{k}\big\langle \nabla_{\mathbf{x}}\log p_t(\mathbf{x}), \mathbf{e}_i\big\rangle\,\mathbf{e}_i}_{\text{tangent part}} + \underbrace{\sum_{i=k+1}^{d}\big\langle \nabla_{\mathbf{x}}\log p_t(\mathbf{x}), \mathbf{e}_i\big\rangle\,\mathbf{e}_i}_{\text{normal part}}. \qquad (20)$$
In the following sections, consider the distribution of the perturbed reference image $\mathbf{y}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{y} + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, and the reference image $\mathbf{y}$ is fixed.
B.2 Proof of Proposition 1 and Lemma 3
Lemma 4.
$\mathbf{y}_t$ is clustered on the manifold restricted with the first-order moment constraint

$$\frac{1}{d}\sum_{i=1}^{d} x_i = \sqrt{\bar{\alpha}_t}\,\mu_{\mathbf{y}} \qquad (21)$$

under the Euclidean distance in the original Cartesian coordinate system of $\mathbb{R}^d$; strictly speaking, the per-dimension distance from $\mathbf{y}_t$ to this manifold concentrates around zero as $d$ grows.

Proof. The manifold provided with the restriction of Eqn. (21) is a hyperplane in $\mathbb{R}^d$ whose normal vector is $\mathbf{n} = \tfrac{1}{\sqrt{d}}(1, \dots, 1)^{\top}$. The distance from $\mathbf{y}_t$ to this hyperplane is

$$\big|\langle \mathbf{y}_t, \mathbf{n}\rangle - \sqrt{d\,\bar{\alpha}_t}\,\mu_{\mathbf{y}}\big| = \sqrt{1-\bar{\alpha}_t}\,\Big|\tfrac{1}{\sqrt{d}}\textstyle\sum_{i=1}^{d}\epsilon_i\Big|, \qquad (22)$$

where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, i.e., the absolute value of a Gaussian with mean $0$ and variance $1-\bar{\alpha}_t$. Thus, as $d \to \infty$, the variance of the per-dimension distance tends to $0$.
Then strictly speaking, we have:
Lemma 5.
Suppose the pixel values of $\mathbf{y}$ share the same bound $C$, which means $|y_i - \mu_{\mathbf{y}}| \le C$ for all $i$. Then $\mathbf{y}_t$ is clustered on the manifold restricted with the second-order moment constraint

$$\frac{1}{d}\sum_{i=1}^{d}\Big(x_i - \frac{1}{d}\sum_{j=1}^{d}x_j\Big)^2 = \bar{\alpha}_t\,\sigma_{\mathbf{y}}^2 + 1 - \bar{\alpha}_t \qquad (23)$$

under the Euclidean distance; strictly speaking, the per-dimension distance from $\mathbf{y}_t$ to this manifold concentrates around zero as $d \to \infty$.

Proof. Within the hyperplane of Eqn. (21), the manifold provided with the restriction of Eqn. (23) is a hypersphere whose center is $\sqrt{\bar{\alpha}_t}\,\mu_{\mathbf{y}}\,\mathbf{1}$ and whose radius is $\sqrt{d(\bar{\alpha}_t\sigma_{\mathbf{y}}^2 + 1 - \bar{\alpha}_t)}$. The squared distance from $\mathbf{y}_t$ to this center expands as

$$\big\|\mathbf{y}_t - \sqrt{\bar{\alpha}_t}\,\mu_{\mathbf{y}}\,\mathbf{1}\big\|^2 = \bar{\alpha}_t\sum_{i=1}^{d}(y_i - \mu_{\mathbf{y}})^2 + 2\sqrt{\bar{\alpha}_t(1-\bar{\alpha}_t)}\sum_{i=1}^{d}(y_i - \mu_{\mathbf{y}})\,\epsilon_i + (1-\bar{\alpha}_t)\sum_{i=1}^{d}\epsilon_i^2,$$

where the first term equals $d\,\bar{\alpha}_t\sigma_{\mathbf{y}}^2$, the second term is a zero-mean Gaussian whose standard deviation is at most $2C\sqrt{\bar{\alpha}_t(1-\bar{\alpha}_t)\,d}$ by the boundedness assumption, and the third term is $(1-\bar{\alpha}_t)$ times a chi-square variable $\chi^2_d$ with $d$ degrees of freedom. We apply the standard Laurent-Massart bound (Laurent & Massart, 2000) to the chi-square term,

$$\Pr\big(\chi^2_d - d \ge 2\sqrt{dx} + 2x\big) \le e^{-x}, \qquad \Pr\big(d - \chi^2_d \ge 2\sqrt{dx}\big) \le e^{-x},$$

which holds for any $x > 0$. Choosing $x = \sqrt{d}$, with probability at least $1 - 2e^{-\sqrt{d}}$ the deviation of the squared distance from $d(\bar{\alpha}_t\sigma_{\mathbf{y}}^2 + 1 - \bar{\alpha}_t)$ is $O(d^{3/4})$. Dividing by the radius, which grows like $\sqrt{d}$, the distance from $\mathbf{y}_t$ to the hypersphere is $O(d^{1/4})$ with high probability, so the per-dimension distance vanishes as $d \to \infty$. Combining this with Lemma 4 finishes the claim.
Then we will prove the Proposition 1.
Proof. Consider the manifold $\mathcal{M}_t$ restricted with the sample moments of the perturbed reference image:

$$\frac{1}{d}\sum_{i=1}^{d} x_i = \frac{1}{d}\sum_{i=1}^{d} y_{t,i}, \qquad \frac{1}{d}\sum_{i=1}^{d}\Big(x_i - \frac{1}{d}\sum_{j=1}^{d}x_j\Big)^2 = \frac{1}{d}\sum_{i=1}^{d}\Big(y_{t,i} - \frac{1}{d}\sum_{j=1}^{d}y_{t,j}\Big)^2. \qquad (32)$$

We can substitute the sample moments of $\mathbf{y}_t$ with its distribution moments in the calculation of the variance and get the following equivalent restrictions:

$$\frac{1}{d}\sum_{i=1}^{d} x_i = \sqrt{\bar{\alpha}_t}\,\mu_{\mathbf{y}}, \qquad \frac{1}{d}\sum_{i=1}^{d}\Big(x_i - \frac{1}{d}\sum_{j=1}^{d}x_j\Big)^2 = \bar{\alpha}_t\,\sigma_{\mathbf{y}}^2 + 1 - \bar{\alpha}_t. \qquad (33)$$

These two constraints correspond to Lemma 4 and Lemma 5 respectively. We denote the manifold restricted with the first constraint by $\mathcal{M}^{(1)}$ and the one restricted with the second by $\mathcal{M}^{(2)}$, so that $\mathcal{M}_t = \mathcal{M}^{(1)} \cap \mathcal{M}^{(2)}$. Suppose the angle between $\mathcal{M}^{(1)}$ and $\mathcal{M}^{(2)}$ at the intersection is $\theta$. Locally the hypersphere can be treated as a hyperplane, with an error of second order. When the distances $\mathrm{dist}(\mathbf{y}_t, \mathcal{M}^{(1)})$ and $\mathrm{dist}(\mathbf{y}_t, \mathcal{M}^{(2)})$ are small, elementary geometry gives

$$\mathrm{dist}\big(\mathbf{y}_t, \mathcal{M}^{(1)} \cap \mathcal{M}^{(2)}\big) \le \frac{\mathrm{dist}(\mathbf{y}_t, \mathcal{M}^{(1)}) + \mathrm{dist}(\mathbf{y}_t, \mathcal{M}^{(2)})}{\sin\theta}$$

up to second-order terms. Combining this with Lemma 4 and Lemma 5 shows that $\mathbf{y}_t$ is concentrated on the $(d-2)$-dimensional manifold $\mathcal{M}_t$, which proves Proposition 1. Restricting the same argument to a single block of $n$ pixels yields Lemma 3.
B.3 Proof of Lemma 2
Proof. We use the following hyperplane:

$$\Big\{\mathbf{x} \in \mathbb{R}^d : \frac{1}{d}\sum_{i=1}^{d} x_i = \frac{1}{2}\big(\sqrt{\bar{\alpha}_t} + \sqrt{\bar{\alpha}_{t-1}}\big)\,\mu_{\mathbf{y}}\Big\}. \qquad (37)$$

Because $\bar{\alpha}_t$ is monotonically decreasing with $t$ in the VP-SDE, we have $\sqrt{\bar{\alpha}_{t-1}} > \sqrt{\bar{\alpha}_t}$. Therefore the hyperplanes of the first-order constraints of $\mathcal{M}_{t-1}$ and $\mathcal{M}_t$ are on different sides of the given hyperplane. As a consequence, $\mathcal{M}_{t-1}$ and $\mathcal{M}_t$ are on different sides of the given hyperplane.
B.4 Proof of Proposition 2
Existence and uniqueness of the adjacent point.

Proof. We consider the 2-dimensional normal space $N_{\mathbf{x}_t}\mathcal{M}_t$, spanned by two orthogonal basis vectors: the all-ones direction (the normal of the first-order constraint) and the centered direction $\mathbf{x}_t - \bar{x}_t\mathbf{1}$ (the normal of the second-order constraint). The intersection of $\mathcal{M}_{t-1}$ with this 2-dimensional affine plane contains exactly two points, because only two points of the plane satisfy the following conditions:
- the distance between the point and the center of the hypersphere of $\mathcal{M}_{t-1}$ equals the radius of $\mathcal{M}_{t-1}$;
- the point lies on the hyperplane of $\mathcal{M}_{t-1}$.

And only one of them is near $\mathbf{x}_t$. Thus there exists a unique point of $\mathcal{M}_{t-1}$ in the normal space near $\mathbf{x}_t$.

The map moves $\mathbf{x}_t$ only along $N_{\mathbf{x}_t}\mathcal{M}_t$.

Proof. The displacement from $\mathbf{x}_t$ to the chosen point can be written in terms of the two orthogonal basis vectors of $N_{\mathbf{x}_t}\mathcal{M}_t$. Therefore it lies in $N_{\mathbf{x}_t}\mathcal{M}_t$.

The map is a transition map from $\mathcal{M}_t$ to $\mathcal{M}_{t-1}$.

Proof. Because the hyperplanes of the first-order constraints of $\mathcal{M}_t$ and $\mathcal{M}_{t-1}$ are parallel, the map sends every point of $\mathcal{M}_t$ to a point of $\mathcal{M}_{t-1}$. Therefore it is a transition map from $\mathcal{M}_t$ to $\mathcal{M}_{t-1}$.

The map is determined by the coefficients of the forward SDE and the reference image.

Proof. As proved in Särkkä & Solin (2019), the means and covariances of linear SDEs can be obtained from the corresponding ODEs. Therefore the moments that define $\mathcal{M}_{t-1}$ are determined by the coefficients of the forward SDE and the reference image. In fact, it is easy to show that the point obtained by shifting the mean of $\mathbf{x}_t$ to the mean prescribed by $\mathcal{M}_{t-1}$ and rescaling its centered part to the prescribed radius (a BAdaIN-style renormalization, cf. Eqn. (16)) lies both in the normal space $N_{\mathbf{x}_t}\mathcal{M}_t$ and in $\mathcal{M}_{t-1}$, and it is the one near $\mathbf{x}_t$; hence the adjacent map is fully determined.
B.5 Proof of Remark 3
Proof. Equivalently, we prove that $\mathbf{f}(\mathbf{x}_t, t) \in N_{\mathbf{x}_t}\mathcal{M}_t$. Consider $\mathcal{M}_t$ restricted with the equations in Eqn. (33). We decompose $\mathbf{x}_t$ as

$$\mathbf{x}_t = \bar{x}_t\,\mathbf{1} + (\mathbf{x}_t - \bar{x}_t\,\mathbf{1}), \qquad \bar{x}_t = \frac{1}{d}\sum_{i=1}^{d} x_{t,i}. \qquad (40)$$

It is easy to see that $\mathbf{1}$ is the normal direction of the first constraint and $\mathbf{x}_t - \bar{x}_t\mathbf{1}$ is the normal direction of the second constraint at $\mathbf{x}_t$. Because $\mathbf{f}(\mathbf{x}_t, t)$ is linear in $\mathbf{x}_t$ (in the VP-SDE it is a scalar multiple of $\mathbf{x}_t$), both components of the decomposition are mapped into $N_{\mathbf{x}_t}\mathcal{M}_t$, and thus $\mathbf{f}(\mathbf{x}_t, t) \in N_{\mathbf{x}_t}\mathcal{M}_t$.
B.6 Proof of Remark 7
Proof.
(41)
Appendix C Details about SDDM
Assumption 1.
Suppose $s_\theta(\mathbf{x}, t)$ is the score-based model, $\mathbf{f}(\mathbf{x}, t)$ is the drift coefficient, $g(t)$ is the diffusion coefficient, and $\mathcal{E}(\mathbf{x}, \mathbf{y}, t)$ is the energy function. $\mathbf{y}$ is the given source image.
C.1 Details about Pre-Trained Diffusion Models
We use two pre-trained diffusion models and a VGG model.
In the Cat → Dog and Wild → Dog tasks, we use the public pre-trained model provided in the official code https://github.com/jychoi118/ilvr_adm of ILVR (Choi et al., 2021).
In the Male → Female task, we use the public pre-trained model provided in the official code https://github.com/ML-GSAI/EGSDE of EGSDE (Zhao et al., 2022).
Our energy function uses the pre-trained VGG net provided in the unofficial open source code https://github.com/naoto0804/pytorch-AdaIN of AdaIN (Huang & Belongie, 2017).
C.2 Details about Our Default Model Settings
Our default SDDM settings:
- Using BAdaIN to construct manifolds.
- Using multi-objective optimization on manifolds.
- Using Policy 3.
- Blocks are .
- 100 diffusion steps.
SDDM† sets .
C.3 Implementation Details about Solving Problem (12)
To simplify the process, we denote the guidance vectors by $\mathbf{g}_1, \dots, \mathbf{g}_m$ and the coefficients by $\alpha_1, \dots, \alpha_m$, and rewrite Problem (12) as

$$\min_{\alpha_1, \dots, \alpha_m} \Big\|\sum_{i=1}^{m}\alpha_i\,\mathbf{g}_i\Big\|_2^2 \quad \text{s.t.} \quad \sum_{i=1}^{m}\alpha_i = 1, \ \ \alpha_i \ge 0. \qquad (42)$$

When there are only two vectors (our situation) and the non-negativity restriction is dropped, we get the following analytical solution:

$$\alpha = \frac{(\mathbf{g}_2 - \mathbf{g}_1)^{\top}\mathbf{g}_2}{\|\mathbf{g}_1 - \mathbf{g}_2\|_2^2}, \qquad (43)$$

where the combined vector is $\alpha\,\mathbf{g}_1 + (1-\alpha)\,\mathbf{g}_2$. Therefore, it is easy to prove that when there are only two vectors, the analytical solution of Problem (42) is:

$$\alpha = \min\!\left(\max\!\left(\frac{(\mathbf{g}_2 - \mathbf{g}_1)^{\top}\mathbf{g}_2}{\|\mathbf{g}_1 - \mathbf{g}_2\|_2^2},\ 0\right),\ 1\right). \qquad (44)$$
For general situations, we can apply the Frank-Wolfe algorithm to this problem as in Sener & Koltun (2018).
Appendix D Details about FID calculation
The FID is calculated between 500 generated images and the target validation set containing 500 images in the Cat → Dog and Wild → Dog tasks. The number is 1000 in the Male → Female task. All experiments are repeated 5 times to eliminate randomness.
Appendix E FID on the Male → Female task
While it is true that EGSDE with sufficient diffusion steps outperforms our SDDM on the Male → Female task, it is important to note that the energy function used in EGSDE is strongly pretrained on related datasets and contains significant domain-specific information. In contrast, to demonstrate the effectiveness and versatility of our framework, we intentionally chose to use a weak energy function consisting of only one layer of convolution without any further pretraining. After incorporating the strong guidance function from EGSDE, our method outperforms EGSDE in the FID score, as shown in the following table.
Model | FID |
---|---|
EGSDE | 41.93 ± 0.11 |
SDDM(Ours) | 40.08 ± 0.13 |
Appendix F Samples
[Qualitative samples: visual comparisons between SDDM and baseline methods on the translation tasks.]