
SDDM: Score-Decomposed Diffusion Models on Manifolds
for Unpaired Image-to-Image Translation

Shikun Sun    Longhui Wei    Junliang Xing    Jia Jia    Qi Tian
Abstract

Recent score-based diffusion models (SBDMs) show promising results in unpaired image-to-image translation (I2I). However, existing methods, either energy-based or statistically based, provide no explicit form of the interfered intermediate generative distributions. This work presents a new score-decomposed diffusion model (SDDM) on manifolds to explicitly optimize the tangled distributions during image generation. SDDM derives manifolds to make the distributions of adjacent time steps separable and decomposes the score function or energy guidance into an image “denoising” part and a content “refinement” part. To refine the image at the same noise level, we equalize the refinement parts of the score function and energy guidance, which permits multi-objective optimization on the manifold. We also leverage a block adaptive instance normalization module to construct lower-dimensional manifolds on which the perturbed reference image remains concentrated. SDDM outperforms existing SBDM-based methods with far fewer diffusion steps on several I2I benchmarks.


1 Introduction

Score-based diffusion models (SBDMs) (Song & Ermon, 2019; Song et al., 2021; Ho et al., 2020; Nichol & Dhariwal, 2021; Bao et al., 2022a; Lu et al., 2022) have recently made significant progress in a series of conditional image generation tasks. In particular, in the unpaired image-to-image translation (I2I) task (Pang et al., 2021), recent studies have shown that a pre-trained SBDM on the target image domain with energy guidance (Zhao et al., 2022) or statistical guidance (Choi et al., 2021) outperforms generative adversarial network (GAN) (Goodfellow et al., 2014) based methods (Fu et al., 2019; Zhu et al., 2017; Yi et al., 2017; Park et al., 2020; Benaim & Wolf, 2017; Zheng et al., 2021; Shen et al., 2019; Huang et al., 2018; Jiang et al., 2020; Lee et al., 2018) and achieves state-of-the-art performance.

SBDMs provide a diffusion model that guides how image-shaped data sampled from a Gaussian distribution is iterated step by step into an image of the target domain. In each step, the SBDM gives score guidance which, from an engineering perspective, can be mixed with energy and statistical guidance to control the generation process. However, two issues arise. First, due to the stochastic differential equation of the reverse diffusion process, the coefficient of the score guidance cannot be changed. Second, how energy guidance affects the intermediate distributions is still unclear. As a result, the I2I result is often unsatisfactory, especially when the number of iterations is small. Moreover, there has yet to be a method that ensures the intermediate distributions are not negatively interfered with during the above guidance process.

To overcome these limitations, we propose to decompose the score function from a new manifold optimization perspective, thus better exerting the energy and statistical guidance. To this end, we present SDDM, a new score-decomposed diffusion model on manifolds to explicitly optimize the tangled distributions during the conditional image generation process. When generating an image from score guidance, an SBDM actually performs two distinct tasks, one is image “denoising”, and the other is content “refinement” to bring the image-shaped data closer to the target domain distribution with the same noise level. Based on this new perspective, SDDM decomposes the score function into two different parts, one for image denoising and the other for content refinement. To realize this decomposition, we take statistical guidance as the manifold restriction to get an explicit division between the data distributions in neighboring time steps. We find that the tangent space of the manifold naturally separates the denoising part and the refinement part of the score function. In addition, the tangent space can also split out the denoising part of the energy guidance, thus achieving a more explanatory conditional generation.

Within the decomposed score functions, the content refinement part of the score function and the energy functions are on an equal footing. Therefore, we can treat the optimization on the manifold as a multi-objective optimization, thus avoiding negative interference of other guidance with the score guidance. To realize the score-decomposed diffusion model, we leverage a block adaptive instance normalization (BAdaIN) module as the restriction function onto the manifold, which is a stronger constraint than the widely used low-pass filter (Choi et al., 2021). With our carefully designed BAdaIN, the tangent space of the manifold provides a better division of the score and energy guidance. We also prove that the perturbed reference image is concentrated on our manifolds to the same degree as on those in (Choi et al., 2021).

To summarize, this work makes the following three main contributions:

  • We present a new score-decomposed diffusion model on manifolds to explicitly optimize the tangled distributions during the conditional image generation process.

  • We introduce a multi-objective optimization algorithm into the conditional generation of SBDMs, which permits not only many powerful gradient combination algorithms but also adjustment of the score factor.

  • We design a BAdaIN module to construct a lower dimensional manifold compared with the low-pass filter and thus provide a concrete model implementation.

With the above contributions, we have obtained a high-performance conditional image generation model. Extensive experimental evaluations and analyses on two I2I benchmarks demonstrate the superior performance of the proposed model. Compared to other SBDM-based methods, SDDM generates better results with much fewer diffusion steps.

2 Background

2.1 Score-Based Diffusion Models (SBDMs)

SBDMs (Song et al., 2021; Ho et al., 2020; Dhariwal & Nichol, 2021; Zhao et al., 2022) first progressively perturb the training data via a forward diffusion process and then learn to reverse this process to form a generative model of the unknown data distribution. Denoting by q(\mathbf{x}_{0}) the distribution of the training set with i.i.d. samples on \mathbb{R}^{d} and by q(\mathbf{x}_{t}) the intermediate distribution at time t, the forward diffusion process \{\mathbf{x}_{t}\}_{t\in[0,T]} follows the stochastic differential equation (SDE):

\mathrm{d}\mathbf{x}=\mathbf{f}(\mathbf{x},t)\,\mathrm{d}t+g(t)\,\mathrm{d}\mathbf{w}, \qquad (1)

where \mathbf{f}(\cdot,t):\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} is the drift coefficient, \mathrm{d}t denotes an infinitesimal positive timestep, g(t)\in\mathbb{R} is the diffusion coefficient, and \mathbf{w} is a standard Wiener process with \mathbf{w}_{t}\sim\mathcal{N}(\mathbf{0},t\mathbf{I}_{d}). Denote by q_{t|0}(\mathbf{x}_{t}|\mathbf{x}_{0}) the transition kernel from time 0 to t, which is determined by \mathbf{f}(\mathbf{x},t) and g(t). In practice, \mathbf{f}(\mathbf{x},t) is usually an affine transformation w.r.t. \mathbf{x}, so that q_{t|0}(\mathbf{x}_{t}|\mathbf{x}_{0}) is a linear Gaussian distribution and \mathbf{x}_{t} can be sampled in one step (Zhao et al., 2022). The following VP-SDE is used most often:

\mathrm{d}\mathbf{x}=-\frac{1}{2}\beta(t)\mathbf{x}\,\mathrm{d}t+\sqrt{\beta(t)}\,\mathrm{d}\mathbf{w}, \qquad (2)

and DDPM (Ho et al., 2020; Dhariwal & Nichol, 2021) uses the following discrete form of the above SDE:

\mathbf{x}_{i}=\sqrt{1-\beta_{i}}\,\mathbf{x}_{i-1}+\sqrt{\beta_{i}}\,\mathbf{z}_{i-1},\quad i=1,\cdots,N. \qquad (3)
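As a concrete illustration of Eqn. (3), the perturbed sample at any step can be drawn in one shot via the closed form \mathbf{x}_{t}=\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\mathbf{z}. The following is a minimal sketch; the linear beta schedule and the variable names are illustrative assumptions, not taken from the paper.

```python
import torch

def perturb(x0: torch.Tensor, t: int, betas: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) in one step, using the closed form implied by Eqn. (3):
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * z, alpha_bar_t = prod_i (1 - beta_i)."""
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]
    noise = torch.randn_like(x0)
    return alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * noise

# Example with an assumed linear schedule of N = 1000 steps and an image in [0, 1].
betas = torch.linspace(1e-4, 2e-2, 1000)
x0 = torch.rand(1, 3, 256, 256)
xt = perturb(x0, t=500, betas=betas)
```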

Normally an SDE is not time-reversible because the forward process loses information about the initial data distribution and converges to a terminal distribution q_{T}(\mathbf{x}_{T}). However, Song et al. (2021) find that the reverse process satisfies the following reverse-time SDE:

\mathrm{d}\mathbf{x}=\left[\mathbf{f}(\mathbf{x},t)-g(t)^{2}\nabla_{\mathbf{x}}\log q_{t}(\mathbf{x})\right]\mathrm{d}t+g(t)\,\mathrm{d}\overline{\mathbf{w}}, \qquad (4)

where \mathrm{d}t is an infinitesimal negative timestep and \overline{\mathbf{w}} is a reverse-time standard Wiener process. Song et al. (2021) adopt a score-based model \mathbf{s}(\mathbf{x},t) to approximate \nabla_{\mathbf{x}}\log q_{t}(\mathbf{x}), i.e., \mathbf{s}(\mathbf{x},t)\doteq\nabla_{\mathbf{x}}\log q_{t}(\mathbf{x}), obtaining the following reverse-time SDE:

\mathrm{d}\mathbf{x}=\left[\mathbf{f}(\mathbf{x},t)-g(t)^{2}\mathbf{s}(\mathbf{x},t)\right]\mathrm{d}t+g(t)\,\mathrm{d}\overline{\mathbf{w}}. \qquad (5)

Under the VP-SDE, the terminal distribution q_{T}(\mathbf{x}_{T}) is approximately a standard Gaussian distribution.

For controllable generation, it is convenient to add a guidance function \varepsilon(\mathbf{x},t) to the score function, which yields a new reverse-time SDE:

\mathrm{d}\mathbf{x}=\left[\mathbf{f}(\mathbf{x},t)-g(t)^{2}\left(\mathbf{s}(\mathbf{x},t)+\nabla_{\mathbf{x}}\varepsilon(\mathbf{x},t)\right)\right]\mathrm{d}t+g(t)\,\mathrm{d}\overline{\mathbf{w}}. \qquad (6)
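For intuition, the guided reverse-time SDE in Eqn. (6) is typically simulated with an Euler-Maruyama discretization. Below is a minimal sketch specialized to the VP-SDE of Eqn. (2); score_fn and energy_grad_fn are assumed user-provided callables, not part of the paper.

```python
import torch

def guided_reverse_step(x, t, dt, beta_t, score_fn, energy_grad_fn):
    """One Euler-Maruyama step of the guided reverse-time SDE in Eqn. (6),
    specialized to the VP-SDE: f(x, t) = -0.5 * beta(t) * x and g(t) = sqrt(beta(t)).
    score_fn(x, t) approximates grad_x log q_t(x); energy_grad_fn(x, t) returns grad_x eps(x, t).
    dt is a negative increment because time runs backward."""
    f = -0.5 * beta_t * x
    g2 = beta_t
    drift = f - g2 * (score_fn(x, t) + energy_grad_fn(x, t))
    noise = torch.randn_like(x)
    return x + drift * dt + (g2 * abs(dt)) ** 0.5 * noise
```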

2.2 SBDMs in Unpaired Image to Image Translation

Unpaired I2I aims to transfer an image from a source domain \mathcal{Y}\subset\mathbb{R}^{d} to a different target domain \mathcal{X}\subset\mathbb{R}^{d}, whose images serve as the training data. This translation can be achieved by designing a distribution p(\mathbf{x}_{0}|\mathbf{y}_{0}) on the target domain \mathcal{X} conditioned on an image \mathbf{y}_{0}\in\mathcal{Y} to be transferred.

In ILVR (Choi et al., 2021), given a reference image \mathbf{y}_{0}, \mathbf{x}_{t} is refined after each denoising step with a low-pass filter \mathbf{\Phi} to stay faithful to the reference image:

\mathbf{x}_{t}^{\prime}=\mathbf{x}_{t}-\mathbf{\Phi}(\mathbf{x}_{t})+\mathbf{\Phi}(\mathbf{y}_{t}),\quad \mathbf{y}_{t}\sim q_{t\mid 0}(\mathbf{y}_{t}\mid\mathbf{y}_{0}). \qquad (7)
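A minimal sketch of the ILVR refinement step in Eqn. (7) is given below; following common practice, the low-pass filter \mathbf{\Phi} is assumed to be average-pooling downsampling followed by upsampling, which is one possible choice rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def low_pass(x: torch.Tensor, factor: int = 8) -> torch.Tensor:
    """A simple low-pass filter Phi: average-pool down by `factor`, then upsample back."""
    down = F.avg_pool2d(x, kernel_size=factor)
    return F.interpolate(down, scale_factor=factor, mode="nearest")

def ilvr_refine(x_t: torch.Tensor, y_t: torch.Tensor, factor: int = 8) -> torch.Tensor:
    """Eqn. (7): keep the high frequencies of x_t and replace its low frequencies with y_t's."""
    return x_t - low_pass(x_t, factor) + low_pass(y_t, factor)
```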

In EGSDE (Zhao et al., 2022), two energy-based guidance functions are carefully designed, following the conditional generation method of Song et al. (2021):

\mathrm{d}\mathbf{x}=\left[\mathbf{f}(\mathbf{x},t)-g(t)^{2}\left(\mathbf{s}(\mathbf{x},t)-\nabla_{\mathbf{x}}\varepsilon(\mathbf{x},\mathbf{y}_{0},t)\right)\right]\mathrm{d}t+g(t)\,\mathrm{d}\overline{\mathbf{w}}. \qquad (8)
Figure 1: Overview of SDDM. At each time step t, instead of directly adding energy guidance to the score function, we first use the moments of the distribution of \mathbf{y}_{t} as constraints to obtain the manifold \mathcal{M}_{t}. We then restrict the potential energy of the score function \mathbf{s}(\mathbf{x},t) and of the energy functions \varepsilon_{i} onto \mathcal{M}_{t} at \mathbf{x}_{t} to obtain the components of the corresponding gradients, \mathbf{s}_{r}(\mathbf{x},t) and \nabla_{\mathbf{x}_{t}}\varepsilon_{ir}(\mathbf{x}_{t},\mathbf{y}_{0},t), on the tangent space T_{\mathbf{x}_{t}}\mathcal{M}_{t}; these are the “refinement” parts. Next, we take a multi-objective optimization viewpoint to obtain \mathrm{MOO}, the optimal combined direction on the tangent space near \mathbf{x}_{t}. Finally, we restrict \mathbf{f}(\mathbf{x},t), \mathbf{s}(\mathbf{x},t), and the noise onto the normal space N_{\mathbf{x}_{t}}\mathcal{M}_{t} to obtain the components pointing to the next manifold \mathcal{M}_{t-1}. For clarity, g(t)\mathrm{d}\overline{\mathbf{w}} is not shown and the restriction onto \mathcal{M}_{t-1} is indicated as \boldsymbol{\delta}.

Notably, these methods do not prevent the intermediate distributions from being overly or negatively disturbed, and neither of them fully makes use of the statistics of the reference image; thus the generation results may be suboptimal.

3 Score-Decomposed Diffusion Model

This section elaborates the proposed model starting from Eqn. (8). For the guidance function \varepsilon(\mathbf{x},\mathbf{y}_{0},t) in Eqn. (8), we adopt the following widely used form (Zhao et al., 2022; Bao et al., 2022b):

\varepsilon(\mathbf{x},\mathbf{y}_{0},t)=\lambda_{1}\varepsilon_{1}(\mathbf{x},\mathbf{y}_{0},t)+\lambda_{2}\varepsilon_{2}(\mathbf{x},\mathbf{y}_{0},t), \qquad (9)

where \varepsilon_{1}(\cdot,\cdot,\cdot) and \varepsilon_{2}(\cdot,\cdot,\cdot) denote two different energy guidance functions, and \lambda_{1} and \lambda_{2} are two weighting coefficients.

3.1 Model Overview

Figure 1 overviews the main process of the proposed SDDM model. The second equation at the bottom of the figure is the equivalent SDE formulation obtained from Eqns. (8) and (9). Starting from this equation, the first SDE in Figure 1 indicates the corresponding generation process. The illustration explains the two-stage optimization at time step t.

To explicitly optimize the tangled distributions during image generation, we use moments of the perturbed reference image \mathbf{y}_{0} as constraints to construct separable manifolds, thus disentangling the distributions of adjacent time steps. As shown in Figure 1, the manifolds of adjacent time steps t and t-1, \mathcal{M}_{t} and \mathcal{M}_{t-1}, are separable, which indicates that the conditional distributions of the adjacent time steps \mathbf{x}_{t} and \mathbf{x}_{t-1} are also separable. Furthermore, at time step t, the manifold \mathcal{M}_{t} decomposes the score function \mathbf{s}(\mathbf{x},t) into the content refinement part \mathbf{s}_{r}(\mathbf{x},t) and the image denoising part \mathbf{s}_{d}(\mathbf{x},t), and also separates out the content refinement parts \nabla_{\mathbf{x}}\varepsilon_{1r}(\mathbf{x},\mathbf{y}_{0},t) and \nabla_{\mathbf{x}}\varepsilon_{2r}(\mathbf{x},\mathbf{y}_{0},t) of \nabla_{\mathbf{x}}\varepsilon_{1}(\mathbf{x},\mathbf{y}_{0},t) and \nabla_{\mathbf{x}}\varepsilon_{2}(\mathbf{x},\mathbf{y}_{0},t) on the tangent space T_{\mathbf{x}_{t}}\mathcal{M}_{t}. Therefore, the entire optimization process at each time step is divided into two stages: one is to optimize on the manifold \mathcal{M}_{t}, and the other is to map to the next manifold \mathcal{M}_{t-1} properly.

In the first stage, we optimize on the manifold \mathcal{M}_{t}. We apply a multi-objective optimization algorithm to obtain the red vector \mathrm{MOO}, the optimal direction considering the score function and energy guidance on the tangent space T_{\mathbf{x}_{t}}\mathcal{M}_{t}. In the second stage, we use the rest of the first equation in Figure 1, namely \left[\mathbf{f}(\mathbf{x},t)-g(t)^{2}\mathbf{s}_{d}(\mathbf{x},t)+\boldsymbol{\delta}\right]\mathrm{d}t+g(t)\mathrm{d}\overline{\mathbf{w}}, to map \mathbf{x}_{t}+\mathrm{MOO} to the next manifold \mathcal{M}_{t-1} properly. Note that we use \boldsymbol{\delta} to indicate the restriction onto \mathcal{M}_{t-1} for consistency of form.

3.2 Decomposition of the Score and Energy Guidance

Given a score function \mathbf{s}(\mathbf{x})=\nabla_{\mathbf{x}}\log p(\mathbf{x}) on \mathbb{R}^{d}, suppose \mathcal{M} is a smooth, compact submanifold of \mathbb{R}^{d}. Let p_{\mathcal{M}}(\mathbf{x}) be the corresponding probability distribution restricted to \mathcal{M}. Then we have the following definitions:

Definition 1.

The tangent score function 𝐬r(𝐱)\mathbf{s}_{r}(\mathbf{x}).

𝐬r(𝐱):=𝐱logp(𝐱)\mathbf{s}_{r}(\mathbf{x}):=\nabla_{\mathbf{x}}\log p_{\mathcal{M}}(\mathbf{x}),

which is the score function on the manifold. If there is a series of manifolds {t}\{\mathcal{M}_{t}\}, and the original score function is denoted 𝐬(𝐱,t)\mathbf{s}(\mathbf{x},t), we denote 𝐬r(𝐱,t)\mathbf{s}_{r}(\mathbf{x},t) the tangent score function on t\mathcal{M}_{t}.

Definition 2.

The normal score function 𝐬d(𝐱)\mathbf{s}_{d}(\mathbf{x}).

𝐬d(𝐱):=𝐬(𝐱)|N𝐱\mathbf{s}_{d}(\mathbf{x}):=\mathbf{s}(\mathbf{x})|_{N_{\mathbf{x}}\mathcal{M}},

which is the score function on the normal space of the manifold. We also denote 𝐬d(𝐱,t)\mathbf{s}_{d}(\mathbf{x},t) the normal score function on the manifold t\mathcal{M}_{t}.

Then we have the following score function decomposition:

Lemma 1.

𝐬(𝐱)=𝐬r(𝐱)+𝐬d(𝐱)\mathbf{s}(\mathbf{x})=\mathbf{s}_{r}(\mathbf{x})+\mathbf{s}_{d}(\mathbf{x}),

which can be derived when knowing 𝐬r(𝐱)=𝐬(𝐱)|T𝐱\mathbf{s}_{r}(\mathbf{x})=\mathbf{s}(\mathbf{x})|_{T_{\mathbf{x}}\mathcal{M}}.

Normally this decomposition is meaningless because the manifolds of adjacent time steps are coupled with each other. Previous works usually treat the whole \cup_{t}\mathbf{x}_{t} as a single manifold (Liu et al., 2022) or rely on strong assumptions (Chung et al., 2022). However, in some conditional generation tasks, such as image-to-image translation, a given reference image \mathbf{y}_{0} can provide compact manifolds at different time steps, and manifolds of adjacent time steps can be well separated. In this situation, the tangent score function can be treated as a refinement part on the manifold, and the normal score function is part of the mapping between manifolds of adjacent time steps.

We have Proposition 1 to describe the manifolds.

Proposition 1.

At time step t, for any single reference image \mathbf{y}_{0}, the perturbed distribution q_{t\mid 0}(\mathbf{y}_{t}\mid\mathbf{y}_{0}) is concentrated on a compact manifold \mathcal{M}_{t}\subset\mathbb{R}^{d}, and the dimension of \mathcal{M}_{t} is at most d-2 when d is large enough. Suppose the perturbed reference image is \mathbf{y}_{t}=\hat{\alpha}_{t}\mathbf{y}_{0}+\hat{\beta}_{t}\mathbf{z}_{t}, where \mathbf{z}_{t}\sim\mathcal{N}(\mathbf{0},\mathbf{I}). The following statistical constraints define such a (d-2)-dimensional \mathcal{M}_{t}:

\mu[\mathbf{x}_{t}] = \hat{\alpha}_{t}\,\mu[\mathbf{y}_{0}], \qquad \operatorname{Var}[\mathbf{x}_{t}] = \hat{\alpha}^{2}_{t}\operatorname{Var}[\mathbf{y}_{0}]+\hat{\beta}^{2}_{t}. \qquad (10)

Proposition 1 shows that we can use statistical constraints to define concentrated manifolds with lower dimensions than \mathbb{R}^{d}. We can further lower the manifold dimensions with the chunking trick, which will be introduced in Section 4. Such manifolds therefore encode the preservation of the reference statistics, which indicates that the tangent space T_{\mathbf{x}_{t}}\mathcal{M}_{t} separates the “refinement” part well.
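For illustration, the constraints in Eqn. (10) can be enforced by renormalizing the mean and variance of \mathbf{x}_{t} toward those of the perturbed reference, which is essentially an AdaIN-style projection onto \mathcal{M}_{t}. A minimal sketch, with assumed variable names, is:

```python
import torch

def project_to_manifold(x_t, y0, alpha_hat_t, beta_hat_t, eps=1e-8):
    """Map x_t onto the manifold M_t of Eqn. (10) by matching its mean and variance to
    those of the perturbed reference y_t = alpha_hat_t * y0 + beta_hat_t * z."""
    target_mean = alpha_hat_t * y0.mean()
    target_var = alpha_hat_t ** 2 * y0.var(unbiased=False) + beta_hat_t ** 2
    x_centered = (x_t - x_t.mean()) / (x_t.std(unbiased=False) + eps)
    return x_centered * target_var ** 0.5 + target_mean
```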

We also have Lemma 2 to show that the perturbed distributions of adjacent time steps, \mathbf{y}_{t} and \mathbf{y}_{t-1}, can be well separated.

Lemma 2.

With \mathcal{M}_{t} defined in Proposition 1, assume t\neq t^{\prime}. Then \mathcal{M}_{t} and \mathcal{M}_{t^{\prime}} can be well separated. Rigorously, \forall\varepsilon>0, there exists a manifold \mathcal{M}_{d} that divides \mathbb{R}^{d} into two disjoint subsets \mathcal{A} and \mathcal{B} with \mathcal{M}_{t}\subset\mathcal{A} and \mathcal{M}_{t^{\prime}}\subset\mathcal{B}.

Therefore, we can use \mathcal{M}_{t} to approximately decompose \mathbf{s} into \mathbf{s}_{r} and \mathbf{s}_{d}. More generally, we can decouple the optimization space with the tangent space T_{\mathbf{x}_{t}}\mathcal{M}_{t}, which allows us to manipulate the score function of the SBDM and the energy guidance more precisely. We can also split out the “refinement” part, thus preventing the “denoising” part of the score function from being overly disturbed.

3.3 Stage 1: Optimization on Manifold

First, we give the main definitions of manifold optimization and multi-objective optimization used in our task. We use the restriction \mathbf{R}_{\mathbf{x}_{t}} to denote the function that maps points on T_{\mathbf{x}_{t}}\mathcal{M}_{t} near \mathbf{x}_{t} to the manifold \mathcal{M}_{t}, which is normally an orthogonal projection onto \mathcal{M}_{t}.

Definition 3.

Manifold optimization.

Manifold optimization (Hu et al., 2020) is a task to optimize a real-valued function f(𝐱)f(\mathbf{x}) on a given Riemannian manifold \mathcal{M}. The optimized target is:

\min_{\mathbf{x}\in\mathcal{M}}f(\mathbf{x}). \qquad (11)

Given t, the score function \mathbf{s}(\mathbf{x},t) is an estimate of \nabla_{\mathbf{x}}\log q_{t}(\mathbf{x}), so we can use \log q_{t}(\mathbf{x}) as the potential energy of \mathbf{s}(\mathbf{x},t); the same holds for the energy guidance functions. Stage 1 is therefore a manifold optimization.

Definition 4.

Pareto optimality on the manifold.

Consider 𝐱t,𝐱^tt\mathbf{x}_{t},\widehat{\mathbf{x}}_{t}\in\mathcal{M}_{t},

  • 𝐱t\mathbf{x}_{t} dominates 𝐱^t\widehat{\mathbf{x}}_{t} if 𝐬r(𝐱t,t)𝐬r(𝐱^t,t)\mathbf{s}_{r}(\mathbf{x}_{t},t)\geq\mathbf{s}_{r}(\widehat{\mathbf{x}}_{t},t), εir(𝐱t,𝐲0,t)εir(𝐱^t,𝐲0,t)\mathbf{\varepsilon}_{ir}(\mathbf{x}_{t},\mathbf{y}_{0},t)\leq\mathbf{\varepsilon}_{ir}(\widehat{\mathbf{x}}_{t},\mathbf{y}_{0},t) for all ii, and not all equal signs hold at the same time.

  • A solution 𝐱t\mathbf{x}_{t} is called Pareto optimal if there exists no solution 𝐱^t\widehat{\mathbf{x}}_{t} that dominates 𝐱t\mathbf{x}_{t}.

The goal of multi-objective optimization is then to find a Pareto optimal solution. Local Pareto optimality can also be reached via gradient descent, as in single-objective optimization. We follow the multiple gradient descent algorithm (MGDA) (Désidéri, 2012), which leverages the Karush-Kuhn-Tucker (KKT) conditions for multi-objective optimization; in our task, these are the following:

Theorem 1.

K.K.T. conditions on a smooth manifold.

At time step t, on the tangent space T_{\mathbf{x}_{t}}\mathcal{M}_{t}, there exist \alpha,\beta^{1},\beta^{2},\ldots,\beta^{m}\geq 0 such that \alpha+\sum_{i=1}^{m}\beta^{i}=1 and \alpha\mathbf{s}_{r}(\mathbf{x}_{t},t)=\sum_{i=1}^{m}\beta^{i}\nabla_{\mathbf{x}_{t}}\varepsilon_{ir}(\mathbf{x}_{t},\mathbf{y}_{0},t), where \mathbf{s}_{r}(\mathbf{x}_{t},t) is the component of \mathbf{s}(\mathbf{x}_{t},t) on the tangent space and \varepsilon_{ir}(\mathbf{x}_{t},\mathbf{y}_{0},t) are the energy functions restricted to the manifold \mathcal{M}_{t}.

All points that satisfy the above conditions are called Pareto stationary points. Every Pareto optimal point is a Pareto stationary point, while the reverse is not true. Désidéri (2012) showed that the solution to the optimization problem

\min_{\substack{\alpha,\beta^{1},\ldots,\beta^{m}\geq 0\\ \alpha+\beta^{1}+\ldots+\beta^{m}=1}}\left\{\left\|\alpha\mathbf{s}_{r}(\mathbf{x}_{t},t)-\sum_{i=1}^{m}\beta^{i}\nabla_{\mathbf{x}_{t}}\varepsilon_{ir}(\mathbf{x}_{t},\mathbf{y}_{0},t)\right\|_{2}^{2}\right\} \qquad (12)

gives a descent direction that improves all tasks or yields a Pareto stationary point. For a balanced result, we first normalize all gradients.
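For the common case of a single energy function (m=1), the problem in Eqn. (12) reduces to a min-norm combination of two vectors and has a closed-form solution (Désidéri, 2012; Sener & Koltun, 2018). A minimal sketch, with assumed names, is:

```python
import torch

def min_norm_two(u: torch.Tensor, w: torch.Tensor):
    """Solve min_{a in [0, 1]} || a * u + (1 - a) * w ||^2 in closed form.
    With u = s_r(x_t, t) and w = -grad eps_r(x_t, y0, t), this is Eqn. (12) for m = 1."""
    u_flat, w_flat = u.flatten(), w.flatten()
    denom = (u_flat - w_flat).dot(u_flat - w_flat).clamp(min=1e-12)
    a = ((w_flat - u_flat).dot(w_flat) / denom).clamp(0.0, 1.0)
    return a, a * u + (1.0 - a) * w
```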

In our task, however, we can search for Pareto stationary points within \mathbf{B}_{\epsilon}(\mathbf{x}_{t})\cap\mathcal{M}_{t} for a small \epsilon, because we have many time steps with different manifolds. Here \mathbf{B}_{\epsilon}(\mathbf{x}_{t}) is the open ball centered at \mathbf{x}_{t} with radius \epsilon.

We have the following algorithm:

Algorithm 1 Multi-Objective Optimization on Manifold
1:  Input: stepsize \lambda, current \mathbf{x}_{t}, refinement score \mathbf{s}_{r}, energy functions \varepsilon_{ir} on \mathcal{M}_{t}, i=1,\ldots,m, \epsilon
2:  Output: 𝐱t\mathbf{x}_{t}^{*}
3:  Initialize 𝐱t=𝐱t\mathbf{x}_{t}^{*}=\mathbf{x}_{t}
4:  repeat
5:     𝐱tεir=λi𝐬r(𝐱t,t)𝐱tεir(𝐱t,𝐲0,t)𝐱tεir(𝐱t,𝐲0,t)\nabla_{\mathbf{x}_{t}^{*}}\varepsilon_{ir}^{\prime}=\lambda_{i}\frac{\|\mathbf{s}_{r}(\mathbf{x}_{t}^{*},t)\|}{\|\nabla_{\mathbf{x}_{t}^{*}}\varepsilon_{ir}(\mathbf{x}_{t}^{*},\mathbf{y}_{0},t)\|}\nabla_{\mathbf{x}_{t}^{*}}\varepsilon_{ir}(\mathbf{x}_{t}^{*},\mathbf{y}_{0},t)
6:     Get the min value vv of eq. (12) and corresponding α,β1,,βm\alpha,\beta^{1},\ldots,\beta^{m}
7:     if v==0v==0 then
8:        return 𝐱t\mathbf{x}_{t}^{*}
9:     end if
10:     δ=α𝐬r(𝐱t,t)i=1mβi𝐱tεir\delta=\alpha\mathbf{s}_{r}(\mathbf{x}_{t}^{*},t)-\sum_{i=1}^{m}\beta^{i}\nabla_{\mathbf{x}_{t}^{*}}\varepsilon_{ir}^{\prime}
11:     𝐱t=𝐱t+λδ\mathbf{x}^{\prime}_{t}=\mathbf{x}_{t}^{*}+\lambda\delta
12:     𝐱t=𝐑𝐱𝐭(𝐱t)\mathbf{x}_{t}^{*}=\mathbf{R_{\mathbf{x}_{t}^{*}}}(\mathbf{x}^{\prime}_{t})
13:  until 𝐱t𝐁ϵ(𝐱t)\mathbf{x}_{t}^{*}\notin\mathbf{B}_{\epsilon}(\mathbf{x}_{t})
14:  return 𝐱t\mathbf{x}_{t}^{*}
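The following is a minimal sketch of Algorithm 1 for a single energy function (m=1), reusing min_norm_two from the sketch above; all callables and coefficient values are illustrative assumptions rather than the exact implementation.

```python
import torch

def moo_on_manifold(x_t, s_r_fn, e_grad_fn, restrict_fn, lam=2.0, lam1=25.0,
                    eps_radius=1.0, max_iters=5):
    """A sketch of Algorithm 1 for one energy function (m = 1).
    s_r_fn(x)      -> tangent score s_r(x, t)
    e_grad_fn(x)   -> tangent energy gradient grad eps_r(x, y0, t)
    restrict_fn(x) -> restriction R onto the manifold M_t (e.g., BAdaIN)
    lam, lam1, eps_radius mirror the coefficients discussed in Section 5.2."""
    x_star = x_t.clone()
    for _ in range(max_iters):
        s_r = s_r_fn(x_star)
        g = e_grad_fn(x_star)
        # Step 5: rescale the energy gradient relative to the score norm.
        g = lam1 * (s_r.norm() / g.norm().clamp(min=1e-12)) * g
        a, direction = min_norm_two(s_r, -g)          # Eqn. (12) with m = 1
        if direction.norm() == 0:                     # Pareto stationary point
            return x_star
        x_star = restrict_fn(x_star + lam * direction)
        if (x_star - x_t).norm() > eps_radius:        # left the ball B_eps(x_t)
            break
    return x_star
```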

Remark 1.

We can use 𝐟(𝐱t,t)\mathbf{f}(\mathbf{x}_{t},t) and 𝐬d(𝐱t,t)\mathbf{s}_{d}(\mathbf{x}_{t},t) to approximate 𝐟(𝐱t,t)\mathbf{f}(\mathbf{x}_{t}^{*},t) and 𝐬d(𝐱t,t)\mathbf{s}_{d}(\mathbf{x}_{t}^{*},t) when ϵ\epsilon is small.

Remark 2.

Notably, EGSDE (Zhao et al., 2022) applies coefficients directly on the guidance vectors, and DVCE (Augustin et al., 2022) uses coefficients after the normalization on the guidance vectors. We can also provide coefficients λi\lambda_{i}s for normalized energy vectors to change the impact of the vectors. A smaller norm means greater impact, as mentioned in (Désidéri, 2012).

3.4 Stage 2: Transformation between Adjacent Manifolds

After the optimization on the manifold \mathcal{M}_{t}, we obtain \mathbf{x}_{t}^{*} that dominates \mathbf{x}_{t}. We then use \mathbf{f}(\mathbf{x}_{t}^{*},t), the “denoising” part of the score function \mathbf{s}_{d}(\mathbf{x}_{t}^{*},t), the reverse-time noise, and the restriction function onto \mathcal{M}_{t-1} to map \mathbf{x}_{t}^{*} to the adjacent manifold \mathcal{M}_{t-1}.

Firstly, we have the following proposition to describe the properties of the adjacent map.

Proposition 2.

Suppose \mathbf{f}(\cdot,\cdot) is affine. Then the adjacent map has the following properties:

  • !𝐯𝐱𝐭N𝐱tt\exists!\mathbf{v_{\mathbf{x}_{t}}}\in N_{\mathbf{x}_{t}}\mathcal{M}_{t} that 𝐱t+𝐯𝐱𝐭t1\mathbf{x}_{t}+\mathbf{v_{\mathbf{x}_{t}}}\in\mathcal{M}_{t-1}.

  • N𝐱tt=N𝐱t+𝐯𝐱𝐭t1N_{\mathbf{x}_{t}}\mathcal{M}_{t}=N_{\mathbf{x}_{t}+\mathbf{v_{\mathbf{x}_{t}}}}\mathcal{M}_{t-1} .

  • 𝐯𝐱𝐭\mathbf{v_{\mathbf{x}_{t}}} is a transition map from T𝐱ttT_{\mathbf{x}_{t}}\mathcal{M}_{t} to T𝐱t+𝐯t1T_{\mathbf{x}_{t}+\mathbf{v}}\mathcal{M}_{t-1} .

  • 𝐯𝐱𝐭\mathbf{v_{\mathbf{x}_{t}}} is determined with 𝐟(,),𝐱𝐭,𝐲0\mathbf{f}(\cdot,\cdot),\mathbf{x_{t}},\mathbf{y}_{0} and g()g(\cdot).

However, if we use \mathbf{v}_{\mathbf{x}_{t}} as the adjacent map, we lose the impact of \mathbf{s}_{d} and the reverse-time noise. Therefore, we follow the reverse SDE: we take its remaining part on the normal space N_{\mathbf{x}_{t}}\mathcal{M}_{t}, together with a restriction function onto \mathcal{M}_{t-1}, as the adjacent map, and denote this part by \mathbf{v}^{*}_{\mathbf{x}_{t}}.

Finally, we have Algorithm 2 in the following to generate images with our proposed SDDM.

Algorithm 2 Generation with SDDM
1:  Input: time steps T, milestone time step T_{0}, stepsize \lambda, score function \mathbf{s}, energy functions \varepsilon_{i}, i=1,\ldots,m, small \epsilon, reference image \mathbf{y}_{0}
2:  Output: generated image 𝐱0\mathbf{x}_{0}
3:  Initialize 𝐱TT\mathbf{x}_{T}\in\mathcal{M}_{T}
4:  for t=Tt=T to T0T_{0}  do
5:     Divide the d\mathbb{R}^{d} into two orthogonal spaces T𝐱ttT_{\mathbf{x}_{t}}\mathcal{M}_{t} and N𝐱ttN_{\mathbf{x}_{t}}\mathcal{M}_{t}.
6:     Calculate 𝐬r(,)\mathbf{s}_{r}(\cdot,\cdot) and εir(,,)\varepsilon_{ir}(\cdot,\cdot,\cdot)
7:     Optimize on manifold t\mathcal{M}_{t} with algorithm 1, and get the output 𝐱t\mathbf{x}_{t}^{*}
8:     Apply the time-reverse SDE on the N𝐱ttN_{\mathbf{x}_{t}^{*}}\mathcal{M}_{t} and then use the restriction function 𝐑\mathbf{R} on manifold t1\mathcal{M}_{t-1} to map the 𝐱t\mathbf{x}_{t}^{*} to 𝐱t1t1\mathbf{x}_{t-1}\in\mathcal{M}_{t-1}
9:  end for
10:  for t=T01t=T_{0}-1 to 11  do
11:     Apply unconditional time-reverse SDE from 𝐱𝐭\mathbf{x_{t}} to 𝐱t1\mathbf{x}_{t-1}
12:  end for
13:  return 𝐱0\mathbf{x}_{0}
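A structural sketch of Algorithm 2 is shown below; every callable is an assumed placeholder standing in for the corresponding component described in Sections 3.2-3.4.

```python
import torch

def sddm_generate(y0, T, T0, score_fn, tangent_proj, normal_proj,
                  moo_step, reverse_step, restrict_fn):
    """Structural sketch of Algorithm 2.
    tangent_proj / normal_proj split a vector into its components on T_{x_t}M_t and
    N_{x_t}M_t (Section 3.2); moo_step runs Algorithm 1 (Stage 1); reverse_step applies
    one reverse-time SDE step; restrict_fn maps its input onto the given manifold."""
    x = restrict_fn(torch.randn_like(y0), T)              # initialize x_T on M_T
    for t in range(T, T0 - 1, -1):                        # conditional steps T, ..., T0
        s = score_fn(x, t)
        s_r, s_d = tangent_proj(s, x, t), normal_proj(s, x, t)
        x = moo_step(x, s_r, t)                           # Stage 1: optimize on M_t
        x = restrict_fn(reverse_step(x, s_d, t), t - 1)   # Stage 2: map onto M_{t-1}
    for t in range(T0 - 1, 0, -1):                        # unconditional steps T0-1, ..., 1
        x = reverse_step(x, score_fn(x, t), t)
    return x
```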

Remark 3.

If \mathbf{f}(\mathbf{x}_{t},t) is linear in \mathbf{x}_{t}, then \mathbf{f}(\mathbf{x}_{t},t)\in N_{\mathbf{x}_{t}}\mathcal{M}_{t}.

Remark 4.

When \epsilon is small, we can use N_{\mathbf{x}_{t}}\mathcal{M}_{t} to approximate N_{\mathbf{x}_{t}^{*}}\mathcal{M}_{t}.

Remark 5.

Suppose \|\mathbf{s}_{r}\| is \mathbf{o}\left(\|\mathbf{v}^{*}_{\mathbf{x}_{t}}\|\right) and \epsilon is \mathbf{O}(\|\mathbf{s}_{r}\|). Then we can ignore the restriction step in Algorithm 1.

Remark 6.

At time step T_{0}, we can set a larger \epsilon for better results.

4 Implementations

Chunking Trick. The chunking trick is a simple but powerful way to reduce the dimensions of the manifolds in high-dimensional problems such as image generation. We divide the image of shape C\times H\times W into N\times N blocks, so the shape becomes CN^{2}\times\frac{H}{N}\times\frac{W}{N}, and the manifold becomes the direct product of CN^{2} manifolds, one for each \frac{H}{N}\times\frac{W}{N}-sized block; we index the blocks with (c,i,j). This trick has the following advantages (a minimal reshaping sketch follows the list below):

  • We can easily obtain T\mathcal{M} and N\mathcal{M}, which are also the direct products of each block's T\mathcal{M} and N\mathcal{M}.

  • We can control the impact of the reference image on the generation process.

  • We can optimize at the block level and lower the impact of other distant blocks.
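A minimal sketch of the block reshape used by the chunking trick (C\times H\times W \rightarrow CN^{2}\times\frac{H}{N}\times\frac{W}{N}), with assumed function names:

```python
import torch

def to_blocks(x: torch.Tensor, n: int) -> torch.Tensor:
    """Reshape a (C, H, W) image into (C * n * n, H // n, W // n), so that each of the
    C*n*n slices is one (H/n) x (W/n) block indexed by (c, i, j)."""
    c, h, w = x.shape
    x = x.reshape(c, n, h // n, n, w // n)   # split H and W into n blocks each
    x = x.permute(0, 1, 3, 2, 4)             # -> (C, n, n, H/n, W/n)
    return x.reshape(c * n * n, h // n, w // n)

def from_blocks(xb: torch.Tensor, c: int, n: int) -> torch.Tensor:
    """Inverse of to_blocks."""
    _, hb, wb = xb.shape
    x = xb.reshape(c, n, n, hb, wb).permute(0, 1, 3, 2, 4)
    return x.reshape(c, n * hb, n * wb)
```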

Manifold Details. For each chunked \frac{H}{N}\times\frac{W}{N}-sized block of the image, we use the first-order and second-order moments to restrict the statistics of the block's pixels, obtaining a (\frac{H}{N}\times\frac{W}{N}-2)-dimensional manifold. In particular, we denote \frac{H}{N}\times\frac{W}{N} by d. Suppose \mathbf{y}_{t}=\sqrt{\bar{\alpha}_{t}}\mathbf{y}_{0}+\sqrt{1-\bar{\alpha}_{t}}\mathbf{z}_{t}. Then \mathbf{y}_{t}\sim\mathcal{N}\left(\sqrt{\bar{\alpha}_{t}}\mathbf{y}_{0},\left(1-\bar{\alpha}_{t}\right)\mathbf{I}\right) and \mathbf{y}^{c,i,j}_{t}\sim\mathcal{N}\left(\sqrt{\bar{\alpha}_{t}}\mathbf{y}_{0}^{c,i,j},\left(1-\bar{\alpha}_{t}\right)\mathbf{I}\right). The manifold \mathcal{M}_{t}^{c,i,j} of block (c,i,j) is then restricted with:

\mu[\mathbf{x}_{t}^{c,i,j}] = \sqrt{\bar{\alpha}_{t}}\,\mu[\mathbf{y}_{0}^{c,i,j}], \qquad \operatorname{Var}[\mathbf{x}_{t}^{c,i,j}] = \bar{\alpha}_{t}\operatorname{Var}[\mathbf{y}_{0}^{c,i,j}]+(1-\bar{\alpha}_{t}). \qquad (13)

Under the restrictions of Eqn. (13), \mathcal{M}_{t}^{c,i,j} is a (d-2)-dimensional hypersphere. We can then formulate \mathcal{M}_{t} as:

\mathcal{M}_{t}=\otimes_{c,i,j}\mathcal{M}_{t}^{c,i,j}. \qquad (14)

Here \otimes denotes the direct product. Huang & Belongie (2017) use the AdaIN module to transfer neural features as

\operatorname{AdaIN}(\mathbf{x},\mathbf{y})=\sigma(\mathbf{y})\left(\frac{\mathbf{x}-\mu(\mathbf{x})}{\sigma(\mathbf{x})}\right)+\mu(\mathbf{y}). \qquad (15)

Based on that, we leverage a BAdaIN module as the restriction function on any T_{\mathbf{x}_{t}}\mathcal{M}_{t}:

\operatorname{BAdaIN}(\mathbf{x}_{t},\mathbf{y}_{t}) = \otimes_{c,i,j}\,\sigma(\mathbf{y}_{t}^{c,i,j})\left(\frac{\mathbf{x}_{t}^{c,i,j}-\mu(\mathbf{x}_{t}^{c,i,j})}{\sigma(\mathbf{x}_{t}^{c,i,j})}\right) + \otimes_{c,i,j}\,\mu(\mathbf{y}_{t}^{c,i,j}). \qquad (16)

In practice, we use the distribution moments of the perturbed reference image to simplify the calculation and eliminate randomness, since the relationship between the perturbed and original reference images is known, as in Eqn. (10).
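A minimal sketch of the BAdaIN restriction of Eqn. (16), reusing to_blocks and from_blocks from the sketch above; for simplicity it matches the statistics of a single perturbed reference sample \mathbf{y}_{t}, whereas in practice the distribution moments of Eqn. (10) are used.

```python
import torch

def badain(x_t: torch.Tensor, y_t: torch.Tensor, n: int, eps: float = 1e-8) -> torch.Tensor:
    """Per-block AdaIN between x_t and the perturbed reference y_t, as in Eqn. (16)."""
    xb, yb = to_blocks(x_t, n), to_blocks(y_t, n)            # (C*n*n, H/n, W/n)
    mu_x = xb.mean(dim=(1, 2), keepdim=True)
    std_x = xb.std(dim=(1, 2), keepdim=True, unbiased=False) + eps
    mu_y = yb.mean(dim=(1, 2), keepdim=True)
    std_y = yb.std(dim=(1, 2), keepdim=True, unbiased=False)
    out = std_y * (xb - mu_x) / std_x + mu_y                 # block-wise Eqn. (16)
    return from_blocks(out, c=x_t.shape[0], n=n)
```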

We have Lemma 3 to describe Proposition 1 in detail:

Lemma 3.

\forall\epsilon,\xi>0,\ \exists D>0,\ \forall d>D we have:

\mathcal{P}\left(d_{2}\left(\mathbf{y}_{t}^{c,i,j},\mathcal{M}_{t}^{c,i,j}\right)<\epsilon\sqrt{d}\right)>1-\xi,

where d is the dimension of the Euclidean space in which \mathbf{y}_{t}^{c,i,j} lies.

Remark 7.

The ILVR (Choi et al., 2021) method employs a low-pass filter to transfer the reference image information, as in Eqn. (7), and the low-pass filter computes block means. We have the following relationship between our mean restriction and the low-pass filter:

\mathbb{E}[\mathbf{\Phi}(\mathbf{y}_{t})]=\otimes_{c,i,j}\sqrt{\bar{\alpha}_{t}}\,\mu[\mathbf{y}_{0}^{c,i,j}]. \qquad (17)

Energy Function. We can also use the BAdaIN module to construct weak energy functions. First, we use the first several layers of the VGG19 network (Simonyan & Zisserman, 2014) to extract neural features of \mathbf{x}_{t} and \mathbf{y}_{t}, obtaining \widehat{\mathbf{x}}_{t} and \widehat{\mathbf{y}}_{t}. Then we use the L_{2} distance between \operatorname{BAdaIN}(\widehat{\mathbf{x}}_{t},\widehat{\mathbf{y}}_{t}) and \widehat{\mathbf{x}}_{t} as the energy function for faithfulness. To verify SDDM's advantage, we only use this weak energy function.
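A minimal sketch of this weak faithfulness energy is given below; the choice of VGG19 layers, the block size, and the omission of ImageNet input normalization are illustrative assumptions (it also reuses badain from the sketch above and requires torchvision >= 0.13).

```python
import torch
import torchvision

# First few convolutional layers of VGG19 as the feature extractor; which layers
# "the first several layers" refers to is an assumption here.
vgg_features = torchvision.models.vgg19(weights="DEFAULT").features[:8].eval()

def faithfulness_energy(x_t: torch.Tensor, y_t: torch.Tensor, n: int = 16) -> torch.Tensor:
    """L2 distance between the VGG features of x_t and their BAdaIN-normalized version
    toward the features of y_t (ImageNet input normalization omitted for brevity)."""
    with torch.no_grad():
        feat_y = vgg_features(y_t.unsqueeze(0)).squeeze(0)
    feat_x = vgg_features(x_t.unsqueeze(0)).squeeze(0)
    # Assumes the feature map's spatial size is divisible by n.
    target = badain(feat_x, feat_y, n)
    return ((feat_x - target) ** 2).sum()
```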

5 Experiments

Datasets. We evaluate our SDDM on the following datasets. All images are resized to 256\times 256 pixels.

  • AFHQ (Choi et al., 2020) contains high-resolution animal faces of three domains: cat, dog, and wild. Each domain has 500 testing images. We conduct Cat \rightarrow Dog and Wild \rightarrow Dog on this dataset, following the experiments of CUT (Park et al., 2020) and EGSDE (Zhao et al., 2022).

  • CelebA-HQ (Karras et al., 2017) consists of high-quality human face images of two categories, male and female. Each category contains 1000 testing images. We conduct Male \rightarrow Female on this dataset, following the experiments of EGSDE (Zhao et al., 2022).

Evaluation Metrics. We evaluate the translated images from two aspects. One is the distance to the source images, for which we report the SSIM between the translated and source images. The other is the distance between the generated images and the target domain, for which we compute the widely used Fréchet Inception Distance (FID) (Heusel et al., 2017) between the generated images and the target domain images. Details about the FID settings are in Appendix D.

5.1 Comparison with the State-of-the-arts

We compare our method with GAN-based and SBDM-based methods, as shown in Table 1.

Table 1: Quantitative comparisons. The results marked with * come from (Zhao et al., 2022). Our method and ILVR use 100 diffusion steps; SDEdit and EGSDE* use 1000 diffusion steps. For a fair comparison with our SDDM, we also report EGSDE** with 200 diffusion steps. All SBDM-based methods are repeated 5 times to reduce randomness. Details about the two SDDM variants are shown in Appendix C.2.
Model FID\downarrow SSIM\uparrow
Cat \rightarrow Dog
CycleGAN* 85.9 -
MUNIT* 104.4 -
DRIT* 123.4 -
Distance* 155.3 -
SelfDistance* 144.4 -
GCGAN* 96.6 -
LSeSim* 72.8 -
ITTR (CUT)* 68.6 -
StarGAN v2* 54.88 ± 1.01 0.27 ± 0.003
CUT* 76.21 0.601
SDEdit* 74.17 ± 1.01 0.423 ± 0.001
ILVR* 74.37 ± 1.55 0.363 ± 0.001
EGSDE* 65.82 ± 0.77 0.415 ± 0.001
EGSDE** 70.16 ± 1.03 0.411 ± 0.001
SDDM(Ours) 62.29 ± 0.63 0.422 ± 0.001
SDDM (Ours) 49.43 ± 0.23 0.361 ± 0.001
Wild \rightarrow Dog
SDEdit* 68.51 ± 0.65 0.343 ± 0.001
ILVR* 75.33 ± 1.22 0.287 ± 0.001
EGSDE* 59.75 ± 0.62 0.343 ± 0.001
SDDM(Ours) 57.38 ± 0.53 0.328 ± 0.001
Male \rightarrow Female
SDEdit* 49.43 ± 0.47 0.572 ± 0.000
ILVR* 46.12 ± 0.33 0.510 ± 0.001
EGSDE* 41.93 ± 0.11 0.574 ± 0.000
EGSDE** 45.12 ± 0.24 0.512 ± 0.001
SDDM(Ours) 44.37 ± 0.23 0.526 ± 0.001

Compared with other SBDM-based methods, our SDDM improves both FID and SSIM, which indicates the effectiveness of the two-stage generation process of SDDM via the decomposition of the score function and energy guidance with manifolds. In particular, compared with EGSDE*, which uses strong pre-trained energy functions, our SDDM improves the FID score by 3.53 and the SSIM score by 0.007 in the Cat \rightarrow Dog task with far fewer diffusion steps (1000\rightarrow 100). Compared with EGSDE** using 200 diffusion steps, SDDM improves the FID score by 7.87 and the SSIM score by 0.011 in the Cat \rightarrow Dog task, and improves the FID score by 0.75 and the SSIM score by 0.014 in the Male \rightarrow Female task, which confirms the advantage of SDDM with fewer diffusion steps. The visual comparison is in Appendix F.

5.2 Ablation Studies

Observations on Score Components. While performing the Cat \rightarrow Dog experiment, we report the L_{2} norms of the deterministic guidance values on T_{\mathbf{x}_{t}}\mathcal{M} and N_{\mathbf{x}_{t}}\mathcal{M}. As shown in Figure 2, the component on the normal space occupies only 1 of the 128 dimensions but accounts for most of the magnitude of the deterministic guidance of the diffusion model, while the component on the tangent space, \mathbf{s}_{r}(\cdot,\cdot), occupies 127 of the 128 dimensions but has a minimal magnitude. This indicates that we have a relatively large optimization space on the manifold that does not excessively interfere with the intermediate distributions.

Figure 2: The mean and standard deviation of the L_{2} norms of \mathbf{s}_{r} in T_{\mathbf{x}_{t}}\mathcal{M} and of the remaining component in N_{\mathbf{x}_{t}}\mathcal{M}. We repeat SDDM 100 times with different reference images.

Comparison of Different Manifolds. We compare SDDM with different manifold constructions and report the results in Table 2. Compared with the manifold restricted by a low-pass filter, the manifold restricted by our BAdaIN performs better on both FID and SSIM, because our manifold separates the content refinement part and the image denoising part better.

Table 2: Comparisons of different manifolds.
Model FID\downarrow SSIM\uparrow
SDDM(low-pass filter) 67.56 0.411
SDDM(BAdaIN) 62.29 0.422

Comparison of Different Coefficients. We have two coefficients at each iteration step: the step size \lambda of the optimal multi-objective direction and the coefficient \lambda_{1} of the energy guidance. As shown in Table 3, a larger \lambda gives a better FID because each optimization on the manifold reaches a position with higher probability p_{\mathcal{M}}(\mathbf{x}); however, when \lambda is too large, the FID score becomes worse again. \lambda_{1} is negatively related to the impact of the energy guidance, so a smaller \lambda_{1} makes the energy guidance stronger and thus gives a better SSIM score.

Table 3: Comparisons of different coefficients.
      Coefficients       FID\downarrow       SSIM\uparrow
      λ=2,λ1=10\lambda=2,\lambda_{1}=10       65.09       0.429
      λ=2,λ1=40\lambda=2,\lambda_{1}=40       62.02       0.420
      λ=1,λ1=25\lambda=1,\lambda_{1}=25       66.04       0.428
      λ=3,λ1=25\lambda=3,\lambda_{1}=25       62.32       0.415
      λ=2,λ1=25\lambda=2,\lambda_{1}=25       62.29       0.422

Comparison w./w.o. Multi-Objective Optimization. We compare SDDM with SDDM without the MOO method and report the FID, SSIM, and the probability of negative impact (PNI), i.e., the fraction of steps in which the total guidance, including score and energy, decreases p(\mathbf{x}), in Table 4. The proposed SDDM avoids such situations and reaches better performance.

Table 4: Comparisons of different manifold optimization policies.
Model FID\downarrow SSIM\uparrow PNI\downarrow
SDDM(w/o MOO) 64.93 0.421 0.024
SDDM 62.29 0.422 0

ϵ\epsilon Policy in The Optimization on Manifolds. We mainly compare three different ϵ\epsilon policies:

  • Policy 1: \epsilon is very small, so Algorithm 1 iterates only once. By Remark 1, this policy introduces no additional score evaluations.

  • Policy 2: \epsilon_{t}=\|\mathbf{s}_{r}(\mathbf{x},t)\|, which normally allows two iterations in Algorithm 1. In practice, we iterate twice.

  • Policy 3: At time step T_{0} in Algorithm 2, we set \epsilon_{t} larger and iterate another 4 times; all other time steps are the same as in Policy 1.

We report the FID and SSIM of the different policies in Table 5. Policy 3 performs best overall, which reveals that iterating a little more at the T_{0} time step balances the different metrics better without introducing much extra cost.

Table 5: Comparisons of the ϵ\epsilon Policies.
       Policy        FID\downarrow        SSIM\uparrow
       Policy 1        61.33        0.413
       Policy 2        64.05        0.418
       Policy 3        62.29        0.422

The Choice of Middle Time T_{0} and Block Number. As shown in Figure 3, when we chunk more blocks or set T_{0} smaller, the generated image is more faithful to the reference image. However, too many blocks also introduce some bad details, like the mouth in the bottom-left image.

Figure 3: Comparison of different numbers of blocks and different middle times T_{0}.
Table 6: Comparisons of different block numbers.
      Block number       FID\downarrow       SSIM\uparrow
      8×88\times 8       54.56       0.359
      16×1616\times 16       62.29       0.422
      32×3232\times 32       68.03       0.426

6 Related Work

GAN-based Unpaired Image-to-Image Translation. There are mainly two categories of GAN-based methods for the unpaired I2I task: two-sided mapping and one-sided mapping. In the first category, the key idea is that the translated image can be translated back with another inverse mapping; CycleGAN (Zhu et al., 2017), DualGAN (Yi et al., 2017), DiscoGAN (Kim et al., 2017), SCAN (Van Gansbeke et al., 2020), and U-GAT-IT (Kim et al., 2019) belong to this class, but such translations usually lose information. Several newer studies instead map the two domains to the same metric space and use distances in this space as supervision; DistanceGAN (Benaim & Wolf, 2017), GCGAN (Fu et al., 2019), CUT (Park et al., 2020), and LSeSim (Zheng et al., 2021) belong to this category.

It is also noteworthy that other techniques have been proposed to tackle the problem of unpaired image-to-image translation. For instance, some studies (Xie et al., 2021, 2018) leverage cooperative learning, whereas others (Zhao et al., 2021) adopt an energy-based framework or a short-run MCMC like Langevin dynamics (Xie et al., 2016).

SBDM-based Conditional Methods. There are mainly two classes of conditional generation with SBDMs. The first empowers SBDMs with conditional generation ability during training via the classifier-free guidance trick (Ho & Salimans, 2022), which learns the unconditional and conditional score functions with a single neural network. The second trains a separate classifier to guide the learned score functions for conditional generation; EGSDE (Zhao et al., 2022) generalizes such classifiers to arbitrary energy-based functions. These methods cannot describe the intermediate distributions clearly, which is a hard problem because the distributions of adjacent time steps are deeply coupled. However, when the conditions provide constraints that separate the adjacent distributions well, we can obtain better results, and this observation inspires our model.

7 Conclusions

In this work, we have presented a new score-decomposed diffusion model, SDDM, which leverages manifold analyses to decompose the score function and explicitly optimize the tangled distributions during image generation. SDDM derives manifolds to separate the distributions of adjacent time steps and decompose the score function or energy guidance into an image “denoising” part and a content “refinement” part. With the new multi-objective optimization algorithm and block adaptive instance normalization module, our realized SDDM method demonstrates promising results in unpaired image-to-image translation on two benchmarks. In future work, we plan to improve and apply the proposed SDDM model in more image translation tasks.

One limitation of our approach involves additional computations, although these computations are negligible compared to the inferences of neural networks. Additionally, we should prevent any misuse of generative algorithms for malicious purposes.

Acknowledgements

This work is supported by the National Key R&D Program of China under Grant No. 2021QY1500, and the State Key Program of the National Natural Science Foundation of China (NSFC) (No.61831022). It is also partly supported by the NSFC under Grant No. 62076238 and 62222606.

References

  • Augustin et al. (2022) Augustin, M., Boreiko, V., Croce, F., and Hein, M. Diffusion visual counterfactual explanations. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022.
  • Bao et al. (2022a) Bao, F., Li, C., Zhu, J., and Zhang, B. Analytic-DPM: An analytic estimate of the optimal reverse variance in diffusion probabilistic models. arXiv preprint arXiv:2201.06503, 2022a.
  • Bao et al. (2022b) Bao, F., Zhao, M., Hao, Z., Li, P., Li, C., and Zhu, J. Equivariant energy-guided SDE for inverse molecular design. arXiv preprint arXiv:2209.15408, 2022b.
  • Benaim & Wolf (2017) Benaim, S. and Wolf, L. One-sided unsupervised domain mapping. In Advances in Neural Information Processing Systems, pp. 752–762, 2017.
  • Choi et al. (2021) Choi, J., Kim, S., Jeong, Y., Gwon, Y., and Yoon, S. ILVR: Conditioning method for denoising diffusion probabilistic models. arXiv preprint arXiv:2108.02938, 2021.
  • Choi et al. (2020) Choi, Y., Uh, Y., Yoo, J., and Ha, J.-W. StarGAN v2: Diverse image synthesis for multiple domains. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  8185–8194, 2020.
  • Chung et al. (2022) Chung, H., Sim, B., Ryu, D., and Ye, J. C. Improving diffusion models for inverse problems using manifold constraints. arXiv preprint arXiv:2206.00941, 2022.
  • Désidéri (2012) Désidéri, J.-A. Multiple-gradient descent algorithm (MGDA) for multiobjective optimization. Comptes Rendus Mathematique, 350(5-6):313–318, 2012.
  • Dhariwal & Nichol (2021) Dhariwal, P. and Nichol, A. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems, pp. 8780–8794, 2021.
  • Fu et al. (2019) Fu, H., Gong, M., Wang, C., Batmanghelich, K., Zhang, K., and Tao, D. Geometry-consistent generative adversarial networks for one-sided unsupervised domain mapping. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  2427–2436, 2019.
  • Goodfellow et al. (2014) Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. C., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
  • Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6629–6640, 2017.
  • Ho & Salimans (2022) Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  • Ho et al. (2020) Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, pp. 6840–6851, 2020.
  • Hu et al. (2020) Hu, J., Liu, X., Wen, Z.-W., and Yuan, Y.-X. A brief introduction to manifold optimization. Journal of the Operations Research Society of China, 8(2):199–248, 2020.
  • Huang & Belongie (2017) Huang, X. and Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In IEEE International Conference on Computer Vision, pp. 1501–1510, 2017.
  • Huang et al. (2018) Huang, X., Liu, M.-Y., Belongie, S., and Kautz, J. Multimodal unsupervised image-to-image translation. In European Conference on Computer Vision, pp.  172–189, 2018.
  • Jiang et al. (2020) Jiang, L., Zhang, C., Huang, M., Liu, C., Shi, J., and Loy, C. C. Tsit: A simple and versatile framework for image-to-image translation. In European Conference on Computer Vision, pp.  206–222, 2020.
  • Karras et al. (2017) Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
  • Kim et al. (2019) Kim, J., Kim, M., Kang, H., and Lee, K. U-gat-it: Unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation. arXiv preprint arXiv:1907.10830, 2019.
  • Kim et al. (2017) Kim, T., Cha, M., Kim, H., Lee, J. K., and Kim, J. Learning to discover cross-domain relations with generative adversarial networks. In International Conference on Machine Learning, pp. 1857–1865, 2017.
  • Laurent & Massart (2000) Laurent, B. and Massart, P. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pp.  1–18, 2000.
  • Lee et al. (2018) Lee, H.-Y., Tseng, H.-Y., Huang, J.-B., Singh, M., and Yang, M.-H. Diverse image-to-image translation via disentangled representations. In European Conference on Computer Vision, pp.  35–51, 2018.
  • Lee (2010) Lee, J. Introduction to topological manifolds, volume 202. Springer Science & Business Media, 2010.
  • Lee & Lee (2012) Lee, J. M. and Lee, J. M. Smooth manifolds. Springer, 2012.
  • Liu et al. (2022) Liu, L., Ren, Y., Lin, Z., and Zhao, Z. Pseudo numerical methods for diffusion models on manifolds. arXiv preprint arXiv:2202.09778, 2022.
  • Lu et al. (2022) Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., and Zhu, J. DPM-Solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022.
  • Meng et al. (2022) Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.-Y., and Ermon, S. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, pp. 1–14, 2022.
  • Nichol & Dhariwal (2021) Nichol, A. Q. and Dhariwal, P. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162–8171, 2021.
  • Pang et al. (2021) Pang, Y., Lin, J., Qin, T., and Chen, Z. Image-to-image translation: Methods and applications. IEEE Transactions on Multimedia, 24:3859–3881, 2021.
  • Park et al. (2020) Park, T., Efros, A. A., Zhang, R., and Zhu, J.-Y. Contrastive learning for unpaired image-to-image translation. In European Conference on Computer Vision, pp.  319–345, 2020.
  • Särkkä & Solin (2019) Särkkä, S. and Solin, A. Applied stochastic differential equations, volume 10. Cambridge University Press, 2019.
  • Sener & Koltun (2018) Sener, O. and Koltun, V. Multi-task learning as multi-objective optimization. In Advances in Neural Information Processing Systems, pp. 1–12, 2018.
  • Shen et al. (2019) Shen, Z., Huang, M., Shi, J., Xue, X., and Huang, T. S. Towards instance-level image-to-image translation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  3683–3692, 2019.
  • Simonyan & Zisserman (2014) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Song & Ermon (2019) Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pp. 11918–11930, 2019.
  • Song et al. (2021) Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
  • Tu (2011) Tu, L. W. Manifolds. In An Introduction to Manifolds, pp.  47–83. Springer, 2011.
  • Van Gansbeke et al. (2020) Van Gansbeke, W., Vandenhende, S., Georgoulis, S., Proesmans, M., and Van Gool, L. Scan: Learning to classify images without labels. In European Conference on Computer Vision, pp.  268–285, 2020.
  • Xie et al. (2016) Xie, J., Lu, Y., Zhu, S.-C., and Wu, Y. A theory of generative convnet. In International Conference on Machine Learning, pp. 2635–2644, 2016.
  • Xie et al. (2018) Xie, J., Lu, Y., Gao, R., Zhu, S.-C., and Wu, Y. N. Cooperative training of descriptor and generator networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(1):27–45, 2018.
  • Xie et al. (2021) Xie, J., Zheng, Z., Fang, X., Zhu, S.-C., and Wu, Y. N. Learning cycle-consistent cooperative networks via alternating mcmc teaching for unsupervised cross-domain translation. In AAAI Conference on Artificial Intelligence, pp. 10430–10440, 2021.
  • Yi et al. (2017) Yi, Z., Zhang, H., Tan, P., and Gong, M. DualGAN: Unsupervised dual learning for image-to-image translation. In IEEE International Conference on Computer Vision, pp. 2849–2857, 2017.
  • Zhao et al. (2022) Zhao, M., Bao, F., Li, C., and Zhu, J. EGSDE: Unpaired image-to-image translation via energy-guided stochastic differential equations. arXiv preprint arXiv:2207.06635, 2022.
  • Zhao et al. (2021) Zhao, Y., Xie, J., and Li, P. Learning energy-based generative models via coarse-to-fine expanding and sampling. In International Conference on Learning Representations, 2021.
  • Zheng et al. (2021) Zheng, C., Cham, T.-J., and Cai, J. The spatially-correlative loss for various image translation tasks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  16407–16417, 2021.
  • Zhu et al. (2017) Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision, pp. 2223–2232, 2017.

Appendix A Basic Knowledge about Manifolds

These definitions come from (Lee, 2010; Tu, 2011; Lee & Lee, 2012).

Definition 5.

Topological space.

A topological space \mathcal{M} is locally Euclidean of dimension nn if every point pp in \mathcal{M} has a neighborhood UU such that there is a homeomorphism ϕ\phi from UU onto an open subset of n\mathbb{R}^{n}. We call the pair (U,ϕ:Un)(U,\phi:U\rightarrow\mathbb{R}^{n}) a chart, UU a coordinate neighborhood or an open coordinate set, and ϕ\phi a coordinate map or a coordinate system on UU. We say that a chart (U,ϕ)(U,\phi) is centered at pUp\in U if ϕ(p)=0\phi(p)=0. A chart (U,ϕ)(U,\phi) about pp simply means that (U,ϕ)(U,\phi) is a chart and pUp\in U .

Definition 6.

Locally Euclidean property.

The locally Euclidean property means that for each pp\in\mathcal{M}, we can find the following:

  • an open set UU\subset\mathcal{M} containing pp;

  • an open set U~n\widetilde{U}\subset\mathbb{R}^{n};and

  • a homeomorphism ϕ:UU~\phi:U\rightarrow\widetilde{U} (i.e., a continuous bijective map with continuous inverse).

Definition 7.

Topological manifold.

Suppose \mathcal{M} is a topological space. We say \mathcal{M} is a topological manifold of dimension nn or a topological nn-manifold if it has the following properties:

  • \mathcal{M} is a Hausdorff space: For every pair of points p,q\in\mathcal{M}, there are disjoint open subsets U,V\subset\mathcal{M} such that p\in U and q\in V.

  • \mathcal{M} is second countable: There exists a countable basis for the topology of \mathcal{M}.

  • \mathcal{M} is locally Euclidean of dimension nn: Every point has a neighborhood that is homeomorphic to an open subset of n\mathbb{R}^{n}.

Definition 8.

Tangent vector.

A tangent vector at a point pp in a manifold \mathcal{M} is a derivation at pp.

Definition 9.

Tangent space.

As for n\mathbb{R}^{n}, the tangent vectors at pp form a vector space Tp()T_{p}(\mathcal{M}), called the tangent space of \mathcal{M} at pp. We also write TpT_{p}\mathcal{M} instead of Tp()T_{p}(\mathcal{M}).

Definition 10.

Normal space.

We define the normal space to \mathcal{M} at p to be the subspace N_{p}\mathcal{M}\subset\mathbb{R}^{m} consisting of all vectors that are orthogonal to T_{p}\mathcal{M} with respect to the Euclidean dot product. The normal bundle of \mathcal{M} is the subset N\mathcal{M}\subset\mathbb{R}^{m}\times\mathbb{R}^{m} defined by

N\mathcal{M}=\coprod_{p\in\mathcal{M}}N_{p}\mathcal{M}=\left\{(p,v)\in\mathbb{R}^{m}\times\mathbb{R}^{m}:p\in\mathcal{M}\text{ and }v\in N_{p}\mathcal{M}\right\} \qquad (18)

Appendix B Proofs

B.1 Proof of Lemma 1

Proof. Consider the local coordinate system at $\mathbf{x}$. Suppose $\{x_{i}\}_{i=1,2,...,d}$ is an orthonormal basis of $\mathbb{R}^{d}$ such that $\{x_{i}\}_{i=1,2,...,m}$ ($m<d$) lie in the tangent space $T_{\mathbf{x}}\mathcal{M}$ and the rest lie in the normal space $N_{\mathbf{x}}\mathcal{M}$. Then:

\begin{aligned}
\mathbf{s}_{r}(\mathbf{x}) &= \nabla_{\mathbf{x}}\log p_{\mathcal{M}}(\mathbf{x})=\sum_{i=1}^{d}\frac{\partial}{\partial x_{i}}\log p_{\mathcal{M}}(\mathbf{x})\\
&= \sum_{i=1}^{m}\frac{\partial}{\partial x_{i}}\log p_{\mathcal{M}}(\mathbf{x})=\sum_{i=1}^{m}\frac{\partial}{\partial x_{i}}\log\big(Cp(\mathbf{x})\big)\\
&= \sum_{i=1}^{m}\frac{\partial}{\partial x_{i}}\log p(\mathbf{x})=\mathbf{s}(\mathbf{x})|_{T_{\mathbf{x}}\mathcal{M}}.
\end{aligned} \qquad (19)

Therefore, we have:

\begin{aligned}
\mathbf{s}(\mathbf{x}) &= \mathbf{s}(\mathbf{x})|_{T_{\mathbf{x}}\mathcal{M}}+\mathbf{s}(\mathbf{x})|_{N_{\mathbf{x}}\mathcal{M}}\\
&= \mathbf{s}_{r}(\mathbf{x})+\mathbf{s}_{d}(\mathbf{x})
\end{aligned} \qquad (20)

\Box
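
The decomposition in Eqn. (20) can be illustrated numerically. The following minimal sketch (an illustration only, not the released implementation) projects a score vector onto the two-dimensional normal space of $\mathcal{M}_{t}$ spanned by $(1,1,...,1)$ and the centered $\mathbf{x}$ (see the proofs of Proposition 2 and Remark 3); the normal component plays the role of the "denoising" part $\mathbf{s}_{d}$, and the remainder is the "refinement" part $\mathbf{s}_{r}$.

```python
import numpy as np

def decompose_score(score, x):
    """Split a score vector into a tangent ("refinement") part and a
    normal ("denoising") part, assuming the normal space of M_t at x is
    spanned by the all-ones direction and the centered x."""
    d = x.size
    ones = np.ones(d) / np.sqrt(d)               # first normal direction
    centered = x - x.mean()                      # second normal direction
    centered /= np.linalg.norm(centered)
    s_d = (score @ ones) * ones + (score @ centered) * centered   # normal part
    s_r = score - s_d                            # tangent part
    return s_r, s_d

# toy usage: s_r is orthogonal to both normal directions
rng = np.random.default_rng(0)
x = rng.normal(size=1024)
score = rng.normal(size=1024)
s_r, s_d = decompose_score(score, x)
assert abs(s_r @ np.ones(x.size)) < 1e-6 and abs(s_r @ (x - x.mean())) < 1e-6
```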

In the following sections, consider the distribution of the perturbed reference image $\mathbf{y}_{t}=\hat{\alpha}_{t}\mathbf{y}_{0}+\hat{\beta}_{t}\mathbf{z}_{t}$, where $\mathbf{z}_{t}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ and the reference image $\mathbf{y}_{0}$ is fixed.

B.2 Proof of Proposition 1 and Lemma 3

Before proving Proposition 1 and Lemma 3, we first prove two auxiliary lemmas.

Lemma 4.

$\mathbf{y}_{t}$ is clustered on the $(d-1)$-dimensional manifolds $\{\mathcal{M}_{t}\}$ restricted by the first-order moment constraints,

\mu[\mathbf{x}_{t}]=\hat{\alpha}_{t}\mu[\mathbf{y}_{0}] \qquad (21)

under the $d_{2}$ distance on $\mathbb{R}^{d}$.

Strictly speaking, in the original Cartesian coordinate system of $\mathbb{R}^{d}$: $\forall\epsilon,\xi>0,\exists D>0,\forall d>D$ we have

\mathcal{P}\left(d_{2}\left(\mathbf{y}_{t},\mathcal{M}_{t}\right)<\epsilon\sqrt{d}\right)>1-\xi

Proof. The manifold defined by the restriction in Eqn. (21) is a hyperplane in $\mathbb{R}^{d}$ with normal vector $\mathbf{n}=\frac{1}{\sqrt{d}}(1,1,...,1)$. The $L_{2}$ distance from $\mathbf{y}_{t}$ to the manifold $\mathcal{M}_{t}$ is $|\hat{\beta}_{t}\mathbf{z}_{t}\cdot\mathbf{n}|$, where $\hat{\beta}_{t}\mathbf{z}_{t}\cdot\mathbf{n}\sim\mathcal{N}(0,\hat{\beta}_{t}^{2})$. Therefore,

d_{2}\left(\mathbf{y}_{t},\mathcal{M}_{t}\right)=\sqrt{d}\left|\frac{1}{\sqrt{d}}\hat{\beta}_{t}\mathbf{z}_{t}\cdot\mathbf{n}\right| \qquad (22)

Thus, as $d\rightarrow+\infty$, the variance of $\frac{1}{\sqrt{d}}\hat{\beta}_{t}\mathbf{z}_{t}\cdot\mathbf{n}$ tends to $0$.

Then, strictly speaking, $\forall\epsilon,\xi>0,\exists D>0,\forall d>D$ we have

\mathcal{P}\left(d_{2}\left(\mathbf{y}_{t},\mathcal{M}_{t}\right)<\epsilon\sqrt{d}\right)>1-\xi

\Box
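
Lemma 4 can also be checked numerically: the normalized distance $d_{2}(\mathbf{y}_{t},\mathcal{M}_{t})/\sqrt{d}=|\hat{\beta}_{t}\mathbf{z}_{t}\cdot\mathbf{n}|/\sqrt{d}$ shrinks as the dimension $d$ grows. A minimal sketch with illustrative values of $\hat{\alpha}_{t}$ and $\hat{\beta}_{t}$ (not the settings used in the experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_t, beta_t = 0.8, 0.6                      # illustrative coefficients
for d in [64, 1024, 16384, 262144]:
    y0 = rng.uniform(-1, 1, size=d)             # a fixed "reference image"
    y_t = alpha_t * y0 + beta_t * rng.normal(size=d)
    # distance of y_t to the hyperplane {x : mean(x) = alpha_t * mean(y0)}
    n = np.ones(d) / np.sqrt(d)
    dist = abs((y_t - alpha_t * y0.mean()) @ n)
    print(d, dist / np.sqrt(d))                 # decreases roughly like 1/sqrt(d)
```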

Lemma 5.

Suppose $\mathbf{y}_{t}$ is bounded by the same constant $A$, i.e., $\|\mathbf{y}_{t}\|_{\infty}<A$. Then $\mathbf{y}_{t}$ is clustered on the $(d-1)$-dimensional manifolds $\{\mathcal{M}_{t}\}$ restricted by the second-order moment constraints,

\mu\left[\mathbf{x}_{t}-\hat{\alpha}_{t}\mu[\mathbf{y}_{0}]\right]^{2}=\hat{\alpha}^{2}_{t}\operatorname{Var}[\mathbf{y}_{0}]+\hat{\beta}^{2}_{t} \qquad (23)

under the $d_{2}$ distance.

Strictly speaking, $\forall\epsilon,\xi>0,\exists D>0,\forall d>D$ we have

\mathcal{P}\left(d_{2}\left(\mathbf{y}_{t},\mathcal{M}_{t}\right)<\epsilon\sqrt{d}\right)>1-\xi

Proof. The manifold defined by the restriction in Eqn. (23) is a hypersphere in $\mathbb{R}^{d}$. The center of the hypersphere is $\hat{\alpha}_{t}\mu[\mathbf{y}_{0}](1,1,...,1)$ and the radius is $\sqrt{d}\sqrt{\hat{\alpha}^{2}_{t}\operatorname{Var}[\mathbf{y}_{0}]+\hat{\beta}^{2}_{t}}$. The squared $L_{2}$ distance from $\mathbf{y}_{t}$ to the center is:

\begin{aligned}
\left[\hat{\alpha}_{t}\mathbf{y}_{0}+\hat{\beta}_{t}\mathbf{z}_{t}-\hat{\alpha}_{t}\mu[\mathbf{y}_{0}]\right]^{2} &= \sum_{i=1}^{d}\left[(\hat{\alpha}_{t}\mathbf{y}_{0}^{i}-\hat{\alpha}_{t}\mu[\mathbf{y}_{0}])+\hat{\beta}_{t}\mathbf{z}_{t}^{i}\right]^{2}\\
&= \sum_{i=1}^{d}\left[(\hat{\alpha}_{t}\mathbf{y}_{0}^{i}-\hat{\alpha}_{t}\mu[\mathbf{y}_{0}])^{2}+\hat{\beta}_{t}^{2}(\mathbf{z}_{t}^{i})^{2}+2(\hat{\alpha}_{t}\mathbf{y}_{0}^{i}-\hat{\alpha}_{t}\mu[\mathbf{y}_{0}])\hat{\beta}_{t}\mathbf{z}_{t}^{i}\right]\\
&= d\hat{\alpha}^{2}_{t}\operatorname{Var}[\mathbf{y}_{0}]+\hat{\beta}_{t}^{2}\mathbf{z}_{t}^{2}+2\hat{\alpha}_{t}\hat{\beta}_{t}(\mathbf{y}_{0}-\mu[\mathbf{y}_{0}])\cdot\mathbf{z}_{t},
\end{aligned} \qquad (24)

Therefore,

\begin{aligned}
d_{2}\left(\mathbf{y}_{t},\mathcal{M}_{t}\right) &= \left|\sqrt{\left[\hat{\alpha}_{t}\mathbf{y}_{0}+\hat{\beta}_{t}\mathbf{z}_{t}-\hat{\alpha}_{t}\mu[\mathbf{y}_{0}]\right]^{2}}-\sqrt{d}\sqrt{\hat{\alpha}^{2}_{t}\operatorname{Var}[\mathbf{y}_{0}]+\hat{\beta}^{2}_{t}}\right|\\
&= \frac{\left|\left[\hat{\alpha}_{t}\mathbf{y}_{0}+\hat{\beta}_{t}\mathbf{z}_{t}-\hat{\alpha}_{t}\mu[\mathbf{y}_{0}]\right]^{2}-d\hat{\alpha}^{2}_{t}\operatorname{Var}[\mathbf{y}_{0}]-d\hat{\beta}^{2}_{t}\right|}{\sqrt{\left[\hat{\alpha}_{t}\mathbf{y}_{0}+\hat{\beta}_{t}\mathbf{z}_{t}-\hat{\alpha}_{t}\mu[\mathbf{y}_{0}]\right]^{2}}+\sqrt{d}\sqrt{\hat{\alpha}^{2}_{t}\operatorname{Var}[\mathbf{y}_{0}]+\hat{\beta}^{2}_{t}}}\\
&\leq \frac{1}{\sqrt{d}\hat{\beta}_{t}}\left|\left[\hat{\alpha}_{t}\mathbf{y}_{0}+\hat{\beta}_{t}\mathbf{z}_{t}-\hat{\alpha}_{t}\mu[\mathbf{y}_{0}]\right]^{2}-d\hat{\alpha}^{2}_{t}\operatorname{Var}[\mathbf{y}_{0}]-d\hat{\beta}^{2}_{t}\right|\\
&= \frac{1}{\sqrt{d}\hat{\beta}_{t}}\left|d\hat{\alpha}^{2}_{t}\operatorname{Var}[\mathbf{y}_{0}]+\hat{\beta}_{t}^{2}\mathbf{z}_{t}^{2}+2\hat{\alpha}_{t}\hat{\beta}_{t}(\mathbf{y}_{0}-\mu[\mathbf{y}_{0}])\cdot\mathbf{z}_{t}-d\hat{\alpha}^{2}_{t}\operatorname{Var}[\mathbf{y}_{0}]-d\hat{\beta}^{2}_{t}\right|\\
&\leq \frac{1}{\sqrt{d}\hat{\beta}_{t}}\left(\hat{\beta}_{t}^{2}\left|\mathbf{z}_{t}^{2}-d\right|+\left|2\hat{\alpha}_{t}\hat{\beta}_{t}(\mathbf{y}_{0}-\mu[\mathbf{y}_{0}])\cdot\mathbf{z}_{t}\right|\right)\\
&= \sqrt{d}\left(\hat{\beta}_{t}\frac{\left|\mathbf{z}_{t}^{2}-d\right|}{d}+2\hat{\alpha}_{t}\frac{\left|(\mathbf{y}_{0}-\mu[\mathbf{y}_{0}])\cdot\mathbf{z}_{t}\right|}{d}\right),
\end{aligned} \qquad (25)

where $\mathbf{z}_{t}^{2}$ follows a standard chi-square distribution with $d$ degrees of freedom. We apply the Laurent-Massart bound (Laurent & Massart, 2000) to it and get

\begin{aligned}
\mathcal{P}[\mathbf{z}_{t}^{2}-d\geq 2\sqrt{dt}+2t] &\leq e^{-t}\\
\mathcal{P}[\mathbf{z}_{t}^{2}-d\leq-2\sqrt{dt}] &\leq e^{-t},
\end{aligned} \qquad (26)

which holds for any $t>0$. We let $t=d\epsilon^{\prime}$, where $\epsilon^{\prime}+\sqrt{\epsilon^{\prime}}=\frac{\epsilon}{4\hat{\beta}_{t}}$ for any given $\epsilon$. Then we have

\mathcal{P}\left[-2d\sqrt{\epsilon^{\prime}}\leq\mathbf{z}_{t}^{2}-d\leq 2d(\epsilon^{\prime}+\sqrt{\epsilon^{\prime}})\right]\geq 1-2e^{-d\epsilon^{\prime}}. \qquad (27)

Therefore,

\mathcal{P}\left[\hat{\beta}_{t}\frac{\left|\mathbf{z}_{t}^{2}-d\right|}{d}<\frac{\epsilon}{2}\right]>1-2e^{-d\epsilon^{\prime}} \qquad (28)

and $\exists D_{1},\forall d>D_{1}$, $4e^{-d\epsilon^{\prime}}\leq\xi$; thus

\mathcal{P}\left[\hat{\beta}_{t}\frac{\left|\mathbf{z}_{t}^{2}-d\right|}{d}<\frac{\epsilon}{2}\right]>1-\frac{\xi}{2} \qquad (29)

Similar to Lemma 4, $2\hat{\alpha}_{t}\frac{(\mathbf{y}_{0}-\mu[\mathbf{y}_{0}])\cdot\mathbf{z}_{t}}{d}$ follows a Gaussian distribution with mean $0$ and variance bounded by $\frac{(4\hat{\alpha}_{t}A)^{2}}{d}$. As $d\rightarrow+\infty$, this variance tends to $0$; thus $\exists D_{2},\forall d>D_{2}$, we have

\mathcal{P}\left[2\hat{\alpha}_{t}\frac{\left|(\mathbf{y}_{0}-\mu[\mathbf{y}_{0}])\cdot\mathbf{z}_{t}\right|}{d}<\frac{\epsilon}{2}\right]>1-\frac{\xi}{2} \qquad (30)

Finally, $\forall\epsilon,\xi>0,\exists D=\max\{D_{1},D_{2}\},\forall d>D$,

\begin{aligned}
\mathcal{P}\left(d_{2}\left(\mathbf{y}_{t},\mathcal{M}_{t}\right)<\epsilon\sqrt{d}\right) &\geq \mathcal{P}\left[\hat{\beta}_{t}\frac{\left|\mathbf{z}_{t}^{2}-d\right|}{d}<\frac{\epsilon}{2},\ 2\hat{\alpha}_{t}\frac{\left|(\mathbf{y}_{0}-\mu[\mathbf{y}_{0}])\cdot\mathbf{z}_{t}\right|}{d}<\frac{\epsilon}{2}\right]\\
&= \mathcal{P}\left[\hat{\beta}_{t}\frac{\left|\mathbf{z}_{t}^{2}-d\right|}{d}<\frac{\epsilon}{2}\right]+\mathcal{P}\left[2\hat{\alpha}_{t}\frac{\left|(\mathbf{y}_{0}-\mu[\mathbf{y}_{0}])\cdot\mathbf{z}_{t}\right|}{d}<\frac{\epsilon}{2}\right]\\
&\quad-\mathcal{P}\left[\hat{\beta}_{t}\frac{\left|\mathbf{z}_{t}^{2}-d\right|}{d}<\frac{\epsilon}{2}\text{ or }2\hat{\alpha}_{t}\frac{\left|(\mathbf{y}_{0}-\mu[\mathbf{y}_{0}])\cdot\mathbf{z}_{t}\right|}{d}<\frac{\epsilon}{2}\right]\\
&\geq 1-\frac{\xi}{2}+1-\frac{\xi}{2}-1\\
&= 1-\xi
\end{aligned} \qquad (31)

\Box
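
The concentration in Lemma 5 admits an analogous numerical check: the normalized distance from $\mathbf{y}_{t}$ to the hypersphere of Eqn. (23) also vanishes as $d$ grows (again with illustrative coefficients only):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_t, beta_t = 0.8, 0.6                      # illustrative coefficients
for d in [64, 1024, 16384, 262144]:
    y0 = rng.uniform(-1, 1, size=d)
    y_t = alpha_t * y0 + beta_t * rng.normal(size=d)
    center = alpha_t * y0.mean()                # hypersphere center (per coordinate)
    radius = np.sqrt(d * (alpha_t**2 * y0.var() + beta_t**2))
    dist = abs(np.linalg.norm(y_t - center) - radius)
    print(d, dist / np.sqrt(d))                 # tends to 0 as d grows
```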

We now prove Proposition 1.

Proof. Consider $\mathcal{M}_{t}$, which is restricted by:

\begin{aligned}
\mu[\mathbf{x}_{t}] &= \hat{\alpha}_{t}\mu[\mathbf{y}_{0}],\\
\operatorname{Var}[\mathbf{x}_{t}] &= \hat{\alpha}^{2}_{t}\operatorname{Var}[\mathbf{y}_{0}]+\hat{\beta}^{2}_{t}.
\end{aligned} \qquad (32)

We can substitute $\mu[\mathbf{x}_{t}]$ with $\hat{\alpha}_{t}\mu[\mathbf{y}_{0}]$ in the calculation of the variance and obtain the following equivalent restrictions:

\begin{aligned}
\mu[\mathbf{x}_{t}] &= \hat{\alpha}_{t}\mu[\mathbf{y}_{0}],\\
\mu\left[\mathbf{x}_{t}-\hat{\alpha}_{t}\mu[\mathbf{y}_{0}]\right]^{2} &= \hat{\alpha}^{2}_{t}\operatorname{Var}[\mathbf{y}_{0}]+\hat{\beta}^{2}_{t}.
\end{aligned} \qquad (33)

These two constraints correspond to Lemma 4 and Lemma 5, respectively. We denote the manifolds restricted by each of these constraints as $\mathcal{M}_{tA}$ and $\mathcal{M}_{tB}$, so that $\mathcal{M}_{tA}\cap\mathcal{M}_{tB}=\mathcal{M}_{t}$. Suppose the angle between $\mathcal{M}_{tA}$ and $\mathcal{M}_{tB}$ at the intersection is $\theta$. Locally, the hypersphere can be treated as a hyperplane with only a second-order error. We then have the following relationship when $\epsilon$ is small:

B_{(\frac{1}{\sin\frac{\theta}{2}}+1)\epsilon\sqrt{d}}(\mathcal{M}_{t})\supset B_{\epsilon\sqrt{d}}(\mathcal{M}_{tA})\cap B_{\epsilon\sqrt{d}}(\mathcal{M}_{tB}) \qquad (34)

Then, according to Lemma 4 and Lemma 5:

$\forall\epsilon,\xi>0,\exists D,\forall d\geq D$, where $\epsilon$ is small,

\begin{aligned}
\mathcal{P}\left(d_{2}\left(\mathbf{x}_{t},\mathcal{M}_{tA}\right)<\epsilon\sqrt{d}\right) &> 1-\frac{\xi}{2}\\
\mathcal{P}\left(d_{2}\left(\mathbf{x}_{t},\mathcal{M}_{tB}\right)<\epsilon\sqrt{d}\right) &> 1-\frac{\xi}{2}.
\end{aligned} \qquad (35)

Then,

\begin{aligned}
\mathcal{P}\left(d_{2}\left(\mathbf{x}_{t},\mathcal{M}_{t}\right)<\left(\tfrac{1}{\sin\frac{\theta}{2}}+1\right)\epsilon\sqrt{d}\right) &\geq \mathcal{P}\left[d_{2}\left(\mathbf{x}_{t},\mathcal{M}_{tA}\right)<\epsilon\sqrt{d},\ d_{2}\left(\mathbf{x}_{t},\mathcal{M}_{tB}\right)<\epsilon\sqrt{d}\right]\\
&= \mathcal{P}\left(d_{2}\left(\mathbf{x}_{t},\mathcal{M}_{tA}\right)<\epsilon\sqrt{d}\right)+\mathcal{P}\left(d_{2}\left(\mathbf{x}_{t},\mathcal{M}_{tB}\right)<\epsilon\sqrt{d}\right)\\
&\quad-\mathcal{P}\left[d_{2}\left(\mathbf{x}_{t},\mathcal{M}_{tA}\right)<\epsilon\sqrt{d}\text{ or }d_{2}\left(\mathbf{x}_{t},\mathcal{M}_{tB}\right)<\epsilon\sqrt{d}\right]\\
&> 1-\frac{\xi}{2}+1-\frac{\xi}{2}-1\\
&= 1-\xi
\end{aligned} \qquad (36)

\Box

Let $\hat{\alpha}^{2}_{t}=\bar{\alpha}_{t}$ and $\hat{\beta}^{2}_{t}=1-\bar{\alpha}_{t}$, and only consider the block $(c,i,j)$; we then obtain Lemma 3.

B.3 Proof of Lemma 2

Proof. We just use the following hyperplane:

\mu[\mathbf{x}]=\hat{\alpha}_{\frac{t+t^{\prime}}{2}}\mu[\mathbf{y}_{0}] \qquad (37)

Since $\hat{\alpha}_{t}$ is monotonically decreasing with $t$ in the VP-SDE, the hyperplanes $\mu[\mathbf{x}]=\hat{\alpha}_{t}\mu[\mathbf{y}_{0}]$ and $\mu[\mathbf{x}]=\hat{\alpha}_{t^{\prime}}\mu[\mathbf{y}_{0}]$ are on different sides of the given hyperplane. As a consequence, $\mathcal{M}_{t}$ and $\mathcal{M}_{t^{\prime}}$ are on different sides of the given hyperplane. $\Box$

B.4 Proof of Proposition 2

$\exists!\mathbf{v}_{\mathbf{x}_{t}}\in N_{\mathbf{x}_{t}}\mathcal{M}_{t}$ such that $\mathbf{x}_{t}+\mathbf{v}_{\mathbf{x}_{t}}\in\mathcal{M}_{t-1}$.

Proof. We consider the two-dimensional normal space $N_{\mathbf{x}_{t}}\mathcal{M}_{t}$. It is easy to show that it has the two orthogonal basis vectors $\left[\mathbf{x}_{t}-\mu[\mathbf{x}_{t}](1,1,...,1)\right]$ and $\mu[\mathbf{x}_{t}](1,1,...,1)$. There are only two points in this normal space that lie in $\mathcal{M}_{t-1}$, because only two points in $N_{\mathbf{x}_{t}}\mathcal{M}_{t}$ meet the following conditions:

  • The distance from this point to $\mu[\mathbf{x}_{t-1}](1,1,...,1)$ is $C_{0}$.

  • The line connecting this point to $\mu[\mathbf{x}_{t-1}](1,1,...,1)$ is perpendicular to $(1,1,...,1)$.

Only one of these two points is near $\mathbf{x}_{t}$.

Thus, $\exists!\mathbf{v}_{\mathbf{x}_{t}}\in N_{\mathbf{x}_{t}}\mathcal{M}_{t}$ such that $\mathbf{x}_{t}+\mathbf{v}_{\mathbf{x}_{t}}\in\mathcal{M}_{t-1}$.

\Box

$N_{\mathbf{x}_{t}}\mathcal{M}_{t}=N_{\mathbf{x}_{t}+\mathbf{v}_{\mathbf{x}_{t}}}\mathcal{M}_{t-1}$.

Proof. Because $\mathbf{x}_{t}+\mathbf{v}_{\mathbf{x}_{t}}\in N_{\mathbf{x}_{t}}\mathcal{M}_{t}$ and $\mu[\mathbf{x}_{t}+\mathbf{v}_{\mathbf{x}_{t}}](1,1,...,1)\in N_{\mathbf{x}_{t}}\mathcal{M}_{t}$, the two orthogonal basis vectors of $N_{\mathbf{x}_{t}+\mathbf{v}_{\mathbf{x}_{t}}}\mathcal{M}_{t-1}$, namely $\left[\mathbf{x}_{t}+\mathbf{v}_{\mathbf{x}_{t}}-\mu[\mathbf{x}_{t}+\mathbf{v}_{\mathbf{x}_{t}}](1,1,...,1)\right]$ and $\mu[\mathbf{x}_{t}+\mathbf{v}_{\mathbf{x}_{t}}](1,1,...,1)$, also lie in $N_{\mathbf{x}_{t}}\mathcal{M}_{t}$.

Therefore, $N_{\mathbf{x}_{t}}\mathcal{M}_{t}=N_{\mathbf{x}_{t}+\mathbf{v}_{\mathbf{x}_{t}}}\mathcal{M}_{t-1}$.

\Box

$\mathbf{v}_{\mathbf{x}_{t}}$ is a transition map from $T_{\mathbf{x}_{t}}\mathcal{M}_{t}$ to $T_{\mathbf{x}_{t}+\mathbf{v}_{\mathbf{x}_{t}}}\mathcal{M}_{t-1}$.

Proof. Because $N_{\mathbf{x}_{t}}\mathcal{M}_{t}=N_{\mathbf{x}_{t}+\mathbf{v}_{\mathbf{x}_{t}}}\mathcal{M}_{t-1}$, the tangent spaces $T_{\mathbf{x}_{t}}\mathcal{M}_{t}$ and $T_{\mathbf{x}_{t}+\mathbf{v}_{\mathbf{x}_{t}}}\mathcal{M}_{t-1}$ are parallel, and $\mathbf{v}_{\mathbf{x}_{t}}$ maps $\mathbf{x}_{t}$ to $\mathbf{x}_{t}+\mathbf{v}_{\mathbf{x}_{t}}$.

Therefore, $\mathbf{v}_{\mathbf{x}_{t}}$ is a transition map from $T_{\mathbf{x}_{t}}\mathcal{M}_{t}$ to $T_{\mathbf{x}_{t}+\mathbf{v}_{\mathbf{x}_{t}}}\mathcal{M}_{t-1}$.

\Box

$\mathbf{v}_{\mathbf{x}_{t}}$ is determined by $\mathbf{f}(\cdot,\cdot)$, $\mathbf{x}_{t}$, $\mathbf{y}_{0}$ and $g(\cdot)$.

Proof. As shown in (Särkkä & Solin, 2019), the means and covariances of linear SDEs can be obtained from corresponding ODEs. Therefore, suppose $\mathbf{y}_{t}=\hat{\alpha}_{t}\mathbf{y}_{0}+\hat{\beta}_{t}\mathbf{z}_{t}$ and $\mathbf{y}_{t-1}=\hat{\alpha}_{t-1}\mathbf{y}_{0}+\hat{\beta}_{t-1}\mathbf{z}_{t-1}$; all the coefficients are determined by $\mathbf{f}(\cdot,\cdot)$ and $g(\cdot)$. For clarity, we represent $\mathbf{v}_{\mathbf{x}_{t}}$ in terms of $\mathbf{x}_{t}$, $\mathbf{y}_{0}$, and the coefficients above.

In fact, it is easy to show that

\sqrt{\hat{\alpha}_{t-1}^{2}\operatorname{Var}[\mathbf{y}_{0}]+\hat{\beta}_{t-1}^{2}}\left(\frac{\mathbf{x}_{t}-\hat{\alpha}_{t}\mu[\mathbf{y}_{0}]}{\sqrt{\hat{\alpha}_{t}^{2}\operatorname{Var}[\mathbf{y}_{0}]+\hat{\beta}_{t}^{2}}}\right)+\hat{\alpha}_{t-1}\mu[\mathbf{y}_{0}] \qquad (38)

lies in both $N_{\mathbf{x}_{t}}\mathcal{M}_{t}$ and $\mathcal{M}_{t-1}$, and it is the point in $N_{\mathbf{x}_{t}}\mathcal{M}_{t}\cap\mathcal{M}_{t-1}$ near $\mathbf{x}_{t}$. Therefore,

\mathbf{v}_{\mathbf{x}_{t}}=\sqrt{\hat{\alpha}_{t-1}^{2}\operatorname{Var}[\mathbf{y}_{0}]+\hat{\beta}_{t-1}^{2}}\left(\frac{\mathbf{x}_{t}-\hat{\alpha}_{t}\mu[\mathbf{y}_{0}]}{\sqrt{\hat{\alpha}_{t}^{2}\operatorname{Var}[\mathbf{y}_{0}]+\hat{\beta}_{t}^{2}}}\right)+\hat{\alpha}_{t-1}\mu[\mathbf{y}_{0}]-\mathbf{x}_{t} \qquad (39)

\Box
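
Eqn. (39) translates directly into a few lines of code. The sketch below is illustrative (not the released implementation) and uses hypothetical coefficient values; it computes $\mathbf{v}_{\mathbf{x}_{t}}$ from $\mathbf{x}_{t}$, $\mathbf{y}_{0}$, $\hat{\alpha}_{t}$, $\hat{\beta}_{t}$, $\hat{\alpha}_{t-1}$ and $\hat{\beta}_{t-1}$, and checks that $\mathbf{x}_{t}+\mathbf{v}_{\mathbf{x}_{t}}$ approximately satisfies the moment constraints of $\mathcal{M}_{t-1}$.

```python
import numpy as np

def manifold_shift(x_t, y0, a_t, b_t, a_prev, b_prev):
    """v_{x_t} as in Eqn. (39): the normal-space shift moving x_t onto M_{t-1}."""
    mu0, var0 = y0.mean(), y0.var()
    scale = np.sqrt(a_prev**2 * var0 + b_prev**2) / np.sqrt(a_t**2 * var0 + b_t**2)
    target = scale * (x_t - a_t * mu0) + a_prev * mu0
    return target - x_t

rng = np.random.default_rng(0)
y0 = rng.uniform(-1, 1, size=4096)
a_t, b_t, a_prev, b_prev = 0.70, 0.71, 0.75, 0.66     # hypothetical coefficients
x_t = a_t * y0 + b_t * rng.normal(size=y0.size)
x_prev = x_t + manifold_shift(x_t, y0, a_t, b_t, a_prev, b_prev)
print(x_prev.mean(), a_prev * y0.mean())              # mean constraint (approx.)
print(x_prev.var(), a_prev**2 * y0.var() + b_prev**2) # variance constraint (approx.)
```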

B.5 Proof of Remark 3

Proof. Equivalently, we prove that $\mathbf{x}_{t}\in N_{\mathbf{x}_{t}}\mathcal{M}_{t}$. Consider $\mathcal{M}_{tA}$ and $\mathcal{M}_{tB}$, each restricted by one of the equations in Eqn. (33). We decompose $\mathbf{x}_{t}$ as follows:

\mathbf{x}_{t}=\left[\mathbf{x}_{t}-\mu[\mathbf{x}_{t}](1,1,...,1)\right]+\mu[\mathbf{x}_{t}](1,1,...,1) \qquad (40)

It is easy to see that $\left[\mathbf{x}_{t}-\mu[\mathbf{x}_{t}](1,1,...,1)\right]\in N_{\mathbf{x}_{t}}\mathcal{M}_{tB}$ and $\mu[\mathbf{x}_{t}](1,1,...,1)\in N_{\mathbf{x}_{t}}\mathcal{M}_{tA}$. Because $\mathcal{M}_{t}=\mathcal{M}_{tA}\cap\mathcal{M}_{tB}$, both components lie in $N_{\mathbf{x}_{t}}\mathcal{M}_{t}$, and therefore $\mathbf{x}_{t}\in N_{\mathbf{x}_{t}}\mathcal{M}_{t}$. $\Box$

B.6 Proof of Remark 7

Proof.

\begin{aligned}
\mathbb{E}[\mathbf{\Phi}(\mathbf{y}_{t})] &= \mathbb{E}\left[\otimes_{c,i,j}\mu[\mathbf{y}_{t}^{c,i,j}]\right]\\
&= \otimes_{c,i,j}\mathbb{E}\left[\mu[\mathbf{y}_{t}^{c,i,j}]\right]\\
&= \otimes_{c,i,j}\sqrt{\bar{\alpha}_{t}}\mu[\mathbf{y}_{0}^{c,i,j}]
\end{aligned} \qquad (41)

\Box
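
For concreteness, the block statistic $\mathbf{\Phi}(\cdot)=\otimes_{c,i,j}\mu[\cdot^{c,i,j}]$ can be computed as an average over non-overlapping blocks (here $16\times 16$, matching the default setting). The sketch below is illustrative rather than the released code; the printed value reflects the per-block noise, while in expectation $\mathbf{\Phi}(\mathbf{y}_{t})=\sqrt{\bar{\alpha}_{t}}\,\mathbf{\Phi}(\mathbf{y}_{0})$ as stated in Remark 7.

```python
import numpy as np

def block_means(x, block=16):
    """Phi(x): per-channel means over non-overlapping block x block patches."""
    c, h, w = x.shape
    assert h % block == 0 and w % block == 0
    x = x.reshape(c, h // block, block, w // block, block)
    return x.mean(axis=(2, 4))                  # shape: (c, h // block, w // block)

rng = np.random.default_rng(0)
y0 = rng.uniform(-1, 1, size=(3, 256, 256))     # stand-in for a reference image
alpha_bar_t = 0.5
y_t = np.sqrt(alpha_bar_t) * y0 + np.sqrt(1 - alpha_bar_t) * rng.normal(size=y0.shape)
# average absolute deviation from sqrt(alpha_bar_t) * Phi(y0); of order
# sqrt(1 - alpha_bar_t) / block per block, and zero in expectation
print(np.abs(block_means(y_t) - np.sqrt(alpha_bar_t) * block_means(y0)).mean())
```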

Appendix C Details about SDDM

Assumption 1.

Suppose $\mathbf{s}(\cdot,\cdot):\mathbb{R}^{D}\times\mathbb{R}\rightarrow\mathbb{R}^{D}$ is the score-based model, $\mathbf{f}(\cdot,\cdot):\mathbb{R}^{D}\times\mathbb{R}\rightarrow\mathbb{R}^{D}$ is the drift coefficient, $g(\cdot):\mathbb{R}\rightarrow\mathbb{R}$ is the diffusion coefficient, and $\mathcal{E}(\cdot,\cdot,\cdot):\mathbb{R}^{D}\times\mathbb{R}^{D}\times\mathbb{R}\rightarrow\mathbb{R}$ is the energy function. $\mathbf{y}_{0}$ is the given source image.

Like previous works (Zhao et al., 2022; Choi et al., 2021; Meng et al., 2022), we define a valid conditional distribution $p\left(\mathbf{x}_{0}\mid\mathbf{y}_{0}\right)$ under the following assumptions:

  • $\exists C>0,\forall t\in\mathbb{R},\forall\mathbf{x},\mathbf{y}_{0}\in\mathbb{R}^{D}:\|\mathbf{f}(\mathbf{x},t)-\mathbf{f}(\mathbf{y}_{0},t)\|_{2}\leq C\|\mathbf{x}-\mathbf{y}_{0}\|_{2}$.

  • $\exists C>0,\forall t,s\in\mathbb{R},\forall\mathbf{x}\in\mathbb{R}^{D}:\|\mathbf{f}(\mathbf{x},t)-\mathbf{f}(\mathbf{x},s)\|_{2}\leq C|t-s|$.

  • $\exists C>0,\forall t\in\mathbb{R},\forall\mathbf{x},\mathbf{y}_{0}\in\mathbb{R}^{D}:\|\mathbf{s}(\mathbf{x},t)-\mathbf{s}(\mathbf{y}_{0},t)\|_{2}\leq C\|\mathbf{x}-\mathbf{y}_{0}\|_{2}$.

  • $\exists C>0,\forall t,s\in\mathbb{R},\forall\mathbf{x}\in\mathbb{R}^{D}:\|\mathbf{s}(\mathbf{x},t)-\mathbf{s}(\mathbf{x},s)\|_{2}\leq C|t-s|$.

  • $\exists C>0,\forall t\in\mathbb{R},\forall\mathbf{x},\mathbf{y}_{0}\in\mathbb{R}^{D}:\left\|\nabla_{\mathbf{x}}\mathcal{E}\left(\mathbf{x},\mathbf{y}_{0},t\right)-\nabla_{\mathbf{y}}\mathcal{E}\left(\mathbf{y}_{0},\mathbf{y}_{0},t\right)\right\|_{2}\leq C\|\mathbf{x}-\mathbf{y}_{0}\|_{2}$.

  • $\exists C>0,\forall t,s\in\mathbb{R},\forall\mathbf{x}\in\mathbb{R}^{D}:\left\|\nabla_{\mathbf{x}}\mathcal{E}\left(\mathbf{x},\mathbf{y}_{0},t\right)-\nabla_{\mathbf{x}}\mathcal{E}\left(\mathbf{x},\mathbf{y}_{0},s\right)\right\|_{2}\leq C|t-s|$.

  • $\exists C>0,\forall t,s\in\mathbb{R}:|g(t)-g(s)|\leq C|t-s|$.

C.1 Details about Pre-Trained Diffusion Models

We use two pre-trained diffusion models and a VGG model.

In the Cat $\rightarrow$ Dog and Wild $\rightarrow$ Dog tasks, we use the public pre-trained model provided in the official code of ILVR (Choi et al., 2021): https://github.com/jychoi118/ilvr_adm.

In the Male $\rightarrow$ Female task, we use the public pre-trained model provided in the official code of EGSDE (Zhao et al., 2022): https://github.com/ML-GSAI/EGSDE.

Our energy function uses the pre-trained VGG net provided in the unofficial open-source code of AdaIN (Huang & Belongie, 2017): https://github.com/naoto0804/pytorch-AdaIN.

C.2 Details about Our Default Model Settings

Our default SDDM settings:

  • Using BAdaIN to construct manifolds.

  • Using multi-optimization on manifolds.

  • $\lambda=2$, $\lambda_{1}=25$.

  • Using $\epsilon$ Policy 3.

  • Blocks are $16\times 16$.

  • $T_{0}=0.5T$.

  • 100 diffusion steps.

SDDM sets $T_{0}=0.6T$.

C.3 Implementation Details about Solving Problem (12)

To simplify the process, we denote all the vectors as $\{\mathbf{v}_{i}\}$ and the coefficients as $\{\lambda_{i}\}$, and rewrite Problem (12) as

\min_{\substack{\lambda_{1},\lambda_{2},\ldots,\lambda_{n}\geq 0\\ \lambda_{1}+\lambda_{2}+\ldots+\lambda_{n}=1}}\left\{\left\|\sum_{i=1}^{n}\lambda_{i}\mathbf{v}_{i}\right\|_{2}^{2}\right\}. \qquad (42)

When there are only two vectors (as in our situation) and the restriction $\lambda_{1},\lambda_{2},\ldots,\lambda_{n}\geq 0$ is dropped, we obtain the following analytical solution:

\hat{\lambda}_{1}^{*}=\frac{(\mathbf{v}_{2}-\mathbf{v}_{1})^{T}\mathbf{v}_{2}}{\|\mathbf{v}_{2}-\mathbf{v}_{1}\|_{2}^{2}}. \qquad (43)

Therefore, it is easy to prove that, when there are only two vectors, the analytical solution under the constraint is:

\lambda_{1}^{*}=\min\left(1,\max\left(\frac{(\mathbf{v}_{2}-\mathbf{v}_{1})^{T}\mathbf{v}_{2}}{\|\mathbf{v}_{2}-\mathbf{v}_{1}\|_{2}^{2}},0\right)\right). \qquad (44)

For the general case, we can apply the Frank-Wolfe algorithm to this problem, as in (Sener & Koltun, 2018).
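
A minimal sketch of the two-vector case (Eqns. (43)-(44)) is given below; it is an illustration of the closed-form solution rather than the released implementation, and for more than two vectors one would fall back to the Frank-Wolfe procedure mentioned above.

```python
import numpy as np

def min_norm_coeff(v1, v2):
    """lambda_1 minimizing ||lambda_1 * v1 + (1 - lambda_1) * v2||_2^2
    subject to 0 <= lambda_1 <= 1 (Eqns. (43)-(44))."""
    diff = v2 - v1
    denom = diff @ diff
    if denom == 0.0:                      # v1 == v2: any convex combination works
        return 1.0
    lam = (diff @ v2) / denom             # unconstrained minimizer, Eqn. (43)
    return float(np.clip(lam, 0.0, 1.0))  # clipped solution, Eqn. (44)

# toy usage: combine two guidance directions with the minimum-norm weighting
rng = np.random.default_rng(0)
v1, v2 = rng.normal(size=8), rng.normal(size=8)
lam = min_norm_coeff(v1, v2)
combined = lam * v1 + (1.0 - lam) * v2
print(lam, np.linalg.norm(combined))
```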

Appendix D Details about FID calculation

The FID is calculated between 500 generated images and the target validation dataset containing 500 images in the Cat $\rightarrow$ Dog and Wild $\rightarrow$ Dog tasks. The number is 1000 in the Male $\rightarrow$ Female task. All experiments are repeated 5 times to account for randomness.
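
For reference, the FID is the Fréchet distance between Gaussians fitted to the Inception features of the two image sets. The sketch below shows only that final computation (the paper does not name its FID tooling, and the Inception feature extraction is omitted here):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    """FID between two feature matrices of shape (N, 2048):
    ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^{1/2})."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):          # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```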

Appendix E FID on the Male\rightarrowFemale task

While it is true that EGSDE with sufficient diffusion steps outperforms our SDDM on the Male$\rightarrow$Female task, it is important to note that the energy function used in EGSDE is strongly pretrained on related datasets and contains significant domain-specific information. In contrast, to demonstrate the effectiveness and versatility of our framework, we intentionally use a weak energy function consisting of only one convolution layer without any further pretraining. After incorporating the strong guidance function from EGSDE, our method outperforms EGSDE in the FID score, as shown in the following table.

Table 7: The FID comparison between EGSDE and our SDDM with the same energy guidance function on the Male\rightarrowFemale task.
Model FID\downarrow
EGSDE 41.93 ± 0.11
SDDM (Ours) 40.08 ± 0.13

Appendix F Samples

Figure 4: The visual comparison of different methods.
Figure 5: More random samples of our methods.