
Discriminator Contrastive Divergence: Semi-Amortized Generative Modeling by Exploring Energy of the Discriminator

Yuxuan Song*1 Qiwei Ye*2 Minkai Xu*1 Tie-Yan Liu2
1Shanghai Jiao Tong University 2Microsoft Research
{songyuxuan,mkxu}@apex.sjtu.edu.cn, {qiwye,tie-yan.liu}@microsoft.com
Abstract

Generative Adversarial Networks (GANs) have shown great promise in modeling high-dimensional data. The learning objective of GANs usually minimizes some measure of discrepancy, e.g., an $f$-divergence ($f$-GANs [28]) or an Integral Probability Metric (Wasserstein GANs [2]). With an $f$-divergence as the objective function, the discriminator essentially estimates a density ratio [37], and the estimated ratio proves useful in further improving the sample quality of the generator [3, 36]. However, how to leverage the information contained in the discriminator of Wasserstein GANs (WGANs) [2] is less explored. In this paper, we introduce Discriminator Contrastive Divergence, which is motivated by the properties of the WGAN discriminator and the relationship between WGANs and energy-based models. In contrast to standard GANs, where the generator is used directly to obtain new samples, our method is a semi-amortized generation procedure in which samples are produced with the generator's output as an initial state, after which several steps of Langevin dynamics are conducted using the gradient of the discriminator. We demonstrate significantly improved generation on both synthetic data and several real-world image generation benchmarks. Code is available at https://github.com/MinkaiXu/Discriminator-Contrastive-Divergence.

* Equal contribution, with the order determined by flipping coins.

1 Introduction

Generative Adversarial Networks (GANs) [10] provide a widely popular way to learn likelihood-free generative models, which have shown promising results on various challenging tasks. Specifically, GANs are trained by finding the equilibrium of a min-max game between a generator and a discriminator (or a critic, in the context of WGANs). Assuming the optimal discriminator can be obtained, the generator essentially minimizes some discrepancy between the generated distribution and the target distribution.

Improving GAN training by exploring discrepancy measures with desirable properties has stimulated fruitful lines of research and is still an active area. Two well-known discrepancy measures for training GANs are the $f$-divergence and the Integral Probability Metric (IPM) [26]. The $f$-divergence is hard to minimize directly due to an intractable integral, so $f$-GANs instead minimize a variational approximation of the $f$-divergence between the generated distribution $p_{G_\theta}$ and the target distribution $p_{\text{data}}$. The discriminator in $f$-GANs serves as a density ratio estimator [37]. The other family of GANs is based on the minimization of an IPM. According to the definition of an IPM, the critic must be constrained to a specific function class. When the critic is restricted to 1-Lipschitz functions, the corresponding IPM becomes the Wasserstein-1 distance, which inspires the family of Wasserstein GANs (WGANs) [25, 2, 13].

No matter which discrepancy is evaluated and minimized, the discriminator is usually discarded at the end of training, and only the generator is kept to generate samples. A natural question is whether, and how, we can leverage the information remaining in the discriminator to construct a distribution superior to that obtained by simply sampling from the generator.

Recent work [3, 36] has shown that a density ratio can be obtained from the output of the discriminator, and a superior distribution can be acquired by conducting rejection sampling or Metropolis-Hastings sampling with the density ratio estimated from the original GAN [10].

However, the critical limitation of these methods is that they cannot be adapted to WGANs, which enjoy superior empirical performance over other variants. How to leverage the information in a WGAN's critic to improve image generation remains an open problem. In this paper, we address this as follows:

  • We provide a generalized view to unify different families of GANs by investigating the informativeness of the discriminators.

  • We propose a semi-amortized generative modeling procedure called discriminator contrastive divergence (DCD), which is intermediate between implicit and explicit generation and hence allows a trade-off between generation quality and speed.

Extensive experiments demonstrate the efficacy of our proposed method in both synthetic settings and real-world generation scenarios, where it achieves state-of-the-art performance on several standard image generation benchmarks.

2 Related Works

Both empirical [2] and theoretical [15] evidence has demonstrated that learning a discriminative model with neural networks is relatively easy, whereas the neural generative model (sampler) is prone to reach its bottleneck during optimization. Hence, there is strong motivation to further improve the generated distribution by exploiting the remaining information. Two recent advances are discriminator rejection sampling (DRS) [3] and MH-GAN [36]. DRS conducts rejection sampling on the output of the generator; its vital limitation is that an upper bound of $D_\phi$ must be estimated to compute the rejection probability. MH-GAN sidesteps this problem by introducing a Metropolis-Hastings sampling procedure with the generator acting as an independent proposal; the state transition is estimated with a well-calibrated discriminator. However, the theoretical justification of both methods relies on the output of the discriminator being an estimate of the density ratio $\frac{p_{\text{data}}}{p_{G_\theta}}$. As pointed out by previous work [41], the output of a WGAN discriminator [2] suffers from a free offset and cannot provide the density ratio, which prevents the application of these methods to WGANs.

Our work is inspired by recent theoretical studies of the discriminator in WGANs [13, 41]. [33] proposes discriminator optimal transport (DOT) to leverage the optimal transport plan implied by the WGAN discriminator, which is orthogonal to our method. Moreover, turning the WGAN discriminator into an energy function is closely related to amortized generation methods in the context of energy-based models (EBMs) [20, 40, 23], where a separate network learns to sample from the partition function [9]. Recent progress [32, 8] on EBMs has shown the feasibility of generating high-dimensional data with Langevin dynamics. From the EBM perspective, our proposed method can be seen as an intermediary between an amortized generation model and an implicit generation model, i.e., a semi-amortized generation method, which allows a trade-off between speed and flexibility of generation. In a similar spirit, [11] illustrates the potential connection between neural classifiers and energy-based models in supervised and semi-supervised scenarios.

3 Preliminaries

3.1 Generative Adversarial Networks

Generative Adversarial Networks (GANs) [10] are implicit generative models that aim to fit an empirical data distribution $p_{\text{data}}$ over a sample space $\mathcal{X}$. The generative distribution $p_{G_\theta}$ is implied by a generator function $G_\theta$ that maps a latent variable $Z$ to a sample $X$, i.e., $G_\theta:\mathcal{Z}\rightarrow\mathcal{X}$. Typically, the latent variable $Z$ follows a fixed prior distribution $p(z)$. With i.i.d. samples available from $p_{G_\theta}$ and $p_{\text{data}}$, a GAN learns the generative model through a min-max game between a discriminator $D_\phi$ and a generator $G_\theta$:

\min_{\theta}\max_{\phi}\mathbb{E}_{\bm{x}\sim p_{\text{data}}}\left[r(D_{\phi}(\bm{x}))\right]-\mathbb{E}_{\bm{x}\sim p_{G_{\theta}}}\left[m(D_{\phi}(\bm{x}))\right]. \quad (1)

With $r$ and $m$ chosen as the identity, $r(x)=m(x)=x$, and $D_\phi(x)$ constrained to be a 1-Lipschitz function, Eq. 1 yields the WGAN objective, which essentially minimizes the Wasserstein distance between $p_{\text{data}}$ and $p_{G_\theta}$. With $r(x)=x$ and $m(x)$ chosen as the Fenchel conjugate [16] of a convex, lower-semicontinuous function, the objective in Eq. 1 approximately minimizes a variational estimate of the $f$-divergence [28] between $p_{\text{data}}$ and $p_{G_\theta}$.

3.2 Energy Based Model and MCMC basics

An energy-based model learns an unnormalized probability model, implied by an energy function $E_\theta(x)$, to describe the ground-truth data distribution $p_{\text{data}}$. The corresponding normalized density function is:

q_{\theta}(x)=\frac{e^{-E_{\theta}(x)}}{Z_{\theta}}, \qquad Z_{\theta}=\int e^{-E_{\theta}(x)}\mathrm{d}x, \quad (2)

where $Z_\theta$ is the so-called normalization constant. The objective of training an energy-based model with maximum likelihood estimation (MLE) is:

\mathcal{L}_{\mathrm{MLE}}(\theta;p):=-\mathbb{E}_{x\sim p_{\text{data}}(x)}\left[\log q_{\theta}(x)\right]. \quad (3)

The gradient of the MLE objective can be estimated as follows:

\nabla_{\theta}\mathcal{L}_{\mathrm{MLE}}(\theta;p)
=\nabla_{\theta}\mathbb{E}_{x\sim p_{\text{data}}(x)}[E_{\theta}(x)]-\frac{\int e^{-E_{\theta}(x)}\nabla_{\theta}E_{\theta}(x)\mathrm{d}x}{Z_{\theta}}
=\mathbb{E}_{x\sim p_{\text{data}}(x)}[\nabla_{\theta}E_{\theta}(x)]-\mathbb{E}_{x\sim q_{\theta}(x)}[\nabla_{\theta}E_{\theta}(x)]. \quad (4)

The above method of gradient estimation in Eq. 4 is called contrastive divergence (CD). Furthermore, we define the score of a distribution with density function $p(x)$ as $\nabla_x\log p(x)$. We can immediately conclude that $\nabla_x\log q_\theta(x)=-\nabla_x E_\theta(x)$, which does not depend on the intractable $Z_\theta$.
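As a quick numerical check of this independence from $Z_\theta$ (a hypothetical 1-D example, not from the paper): for the quadratic energy $E(x)=x^2/2$ the normalized density is a standard Gaussian, and the score $-\nabla_x E(x)=-x$ matches a finite-difference derivative of $\log q(x)$ even though the partition constant never enters the score computation.

```python
import math

def energy(x):
    # Quadratic energy: the induced density e^{-E}/Z is a standard Gaussian.
    return 0.5 * x * x

def log_q(x):
    # Normalized log-density, including the partition constant Z = sqrt(2*pi).
    return -energy(x) - 0.5 * math.log(2 * math.pi)

def score_from_energy(x):
    # The score -dE/dx requires no knowledge of Z.
    return -x

# Finite-difference derivative of log q agrees with -dE/dx: Z cancels out.
h = 1e-6
for x in (-1.3, 0.0, 0.7, 2.1):
    fd = (log_q(x + h) - log_q(x - h)) / (2 * h)
    assert abs(fd - score_from_energy(x)) < 1e-4
```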

Markov chain Monte Carlo (MCMC) is a powerful framework for drawing samples from a given distribution. An MCMC method is specified by a transition kernel $\mathcal{K}(x'|x)$ that corresponds to a unique stationary distribution $p$, i.e.,

q=p \quad\Leftrightarrow\quad q(x)=\int q(x')\,\mathcal{K}(x|x')\,dx', \quad \forall x.

More specifically, MCMC can be viewed as drawing $x_0$ from an initial distribution $q_0$ and iteratively obtaining the sample $x_t$ at the $t$-th iteration by applying the transition kernel to the previous state, i.e., $x_t|x_{t-1}\sim\mathcal{K}(x_t|x_{t-1})$. Following [24], we formalize the distribution $q_t$ of $x_t$ as obtained by a fixed-point update of the form $q_t(x)\leftarrow\mathcal{K}q_{t-1}(x)$, with $\mathcal{K}q_{t-1}(x)$ defined as:

\mathcal{K}q_{t-1}(x):=\int q_{t-1}(x')\,\mathcal{K}(x|x')\,dx'.

As indicated by standard MCMC theory, the following monotonic property is satisfied:

D_{\mathrm{KL}}(q_{t}\,\|\,p)\leq D_{\mathrm{KL}}(q_{t-1}\,\|\,p). \quad (5)

Moreover, $q_t$ converges to the stationary distribution $p$ as $t\rightarrow\infty$.
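The monotone decrease in Eq. 5 can be verified numerically on a small discrete chain (a hypothetical 3-state example, not from the paper): repeatedly applying the kernel drives the KL divergence to the stationary distribution toward zero without ever increasing it.

```python
import numpy as np

# A 3-state row-stochastic transition kernel K (rows: current state).
K = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])

# Stationary distribution p: left eigenvector of K with eigenvalue 1.
vals, vecs = np.linalg.eig(K.T)
p = np.real(vecs[:, np.argmax(np.real(vals))])
p = p / p.sum()

def kl(q, p):
    return float(np.sum(q * np.log(q / p)))

q = np.array([0.9, 0.05, 0.05])      # initial distribution q_0
kls = []
for _ in range(20):
    kls.append(kl(q, p))
    q = q @ K                        # fixed-point update q_t = K q_{t-1}

# KL(q_t || p) never increases (Eq. 5) and the chain converges to p.
assert all(a >= b - 1e-12 for a, b in zip(kls, kls[1:]))
assert kls[-1] < 1e-3
```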

4 Methodology

4.1 Informativeness of Discriminator

In this section, we seek to investigate the following questions:

  • What kind of information is contained in the discriminator of different kinds of GANs?

  • Why and how can this information be utilized to further improve the quality of the generated distribution?

We discuss the discriminators of $f$-GANs and WGANs, respectively, in the following.

4.1.1 $f$-GAN Discriminator

$f$-GAN [27] is based on a variational estimate of the $f$-divergence [1] using only samples from the two distributions:

Theorem 1.

[27] With Fenchel duality, the variational estimation of the $f$-divergence can be written as follows:

D_{f}(P\|Q)
=\int_{\mathcal{X}}q(x)\sup_{t\in\mathrm{dom}_{f^{*}}}\left\{t\frac{p(x)}{q(x)}-f^{*}(t)\right\}\mathrm{d}x
\geq\sup_{T\in\mathcal{T}}\left(\int_{\mathcal{X}}p(x)T(x)\mathrm{d}x-\int_{\mathcal{X}}q(x)f^{*}(T(x))\mathrm{d}x\right)
=\sup_{T\in\mathcal{T}}\left(\mathbb{E}_{x\sim P}[T(x)]-\mathbb{E}_{x\sim Q}\left[f^{*}(T(x))\right]\right), \quad (6)

where $\mathcal{T}$ is an arbitrary class of functions and $f^*$ denotes the Fenchel conjugate of $f$. The supremum is achieved only when $T^*(x)=f'\left(\frac{p(x)}{q(x)}\right)$, i.e., $\frac{p(x)}{q(x)}=\frac{\partial f^*}{\partial T}(T^*(x))$.
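As a concrete sanity check of Theorem 1 (a hypothetical discrete example, not from the paper): for $f(u)=u\log u$, which yields $D_f=\mathrm{KL}$, the conjugate is $f^*(t)=e^{t-1}$, and evaluating the variational objective at the optimal critic $T^*=f'(p/q)=\log(p/q)+1$ recovers the divergence exactly, while any other critic gives a strictly smaller value.

```python
import math

# Discrete toy distributions (hypothetical numbers, for illustration only).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

# f-divergence with f(u) = u*log(u)  ->  D_f(P||Q) = KL(P||Q).
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Fenchel conjugate f*(t) = exp(t - 1); optimal critic T*(x) = log(p/q) + 1.
T = [math.log(pi / qi) + 1 for pi, qi in zip(p, q)]
f_star = lambda t: math.exp(t - 1)

def variational_bound(T_vals):
    # The objective of Theorem 1: E_P[T] - E_Q[f*(T)].
    return (sum(pi * ti for pi, ti in zip(p, T_vals))
            - sum(qi * f_star(ti) for qi, ti in zip(q, T_vals)))

bound = variational_bound(T)
assert abs(bound - kl) < 1e-12          # the supremum attains D_f exactly

T_bad = [t - 0.1 for t in T]            # a perturbed, sub-optimal critic
assert variational_bound(T_bad) < bound  # strictly below the supremum
```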

In $f$-GAN [28], the discriminator $D_\phi$ is the function $T$ parameterized with neural networks. Theorem 1 gives the density ratio estimation view of the $f$-GAN discriminator, as illustrated in [37]. More specifically, the discriminator $D_\phi$ in an $f$-GAN is optimized to estimate a statistic of the density ratio between $p_{\text{data}}$ and $p_{G_\theta}$, i.e., $\frac{p_{\text{data}}}{p_{G_\theta}}$, and this ratio can be recovered easily from $D_\phi$. For example, in the original GAN [10], the corresponding $f$ in the $f$-GAN literature is $f(x)=x\log x-(x+1)\log(x+1)+2\log 2$. Assuming the discriminator is trained to optimality, its output is $D_\phi(x)=\frac{p_{\text{data}}}{p_{\text{data}}+p_{G_\theta}}$, and we obtain the density ratio $\frac{p_{\text{data}}}{p_{G_\theta}}=\frac{D_\phi(x)}{1-D_\phi(x)}$. It should be noted, however, that the discriminator rarely reaches optimality in practice. Without loss of generality, the density ratio implied by a sub-optimal discriminator can be seen as the density ratio between an implicitly defined distribution $p^*$ and the generated distribution $p_{G_\theta}$. It has been shown both theoretically and empirically in the context of GANs [2, 15, 17] that, with the same inductive bias, learning a discriminative model is easier than learning a generative model. Based on this fact, sampling-based methods use the estimated density ratio, e.g., $\frac{D_\phi(x)}{1-D_\phi(x)}$ in the original GAN, to conduct rejection sampling [3] or Metropolis-Hastings sampling [36] on top of the generated distribution $p_{G_\theta}$. These methods effectively modify the generated distribution from $p_{G_\theta}$ to $p^*$; the empirical improvements reported in [3, 36] demonstrate that we can construct a superior distribution $p^*$ to describe the empirical distribution $p_{\text{data}}$ by involving the remaining information in the discriminator.
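To make the ratio recovery concrete, here is a small sketch (hypothetical 1-D Gaussian densities standing in for $p_{\text{data}}$ and $p_{G_\theta}$, not from the paper) in which the optimal vanilla-GAN discriminator is formed analytically, $D/(1-D)$ recovers $p_{\text{data}}/p_{G_\theta}$, and that ratio drives an MH-GAN-style acceptance probability:

```python
import math

def gauss(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

p_data = lambda x: gauss(x, 0.0, 1.0)   # stand-in for the data density
p_g    = lambda x: gauss(x, 0.5, 1.2)   # stand-in for the generator density

# Optimal vanilla-GAN discriminator: D*(x) = p_data / (p_data + p_G).
D = lambda x: p_data(x) / (p_data(x) + p_g(x))

# The density ratio is recovered as D/(1-D).
ratio = lambda x: D(x) / (1.0 - D(x))
for x in (-1.0, 0.3, 2.0):
    assert abs(ratio(x) - p_data(x) / p_g(x)) < 1e-12

# MH-GAN-style acceptance for a move from x_cur to an independently
# proposed x_prop (drawn from p_G) uses the ratio of ratios.
x_cur, x_prop = 1.5, 0.1
accept = min(1.0, ratio(x_prop) / ratio(x_cur))
assert 0.0 < accept <= 1.0
```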

Figure 1: Discriminator Contrastive Divergence. After WGAN training, the critic is fine-tuned over several MCMC steps that leverage the gradient of the discriminator via Langevin dynamics; after fine-tuning, the discriminator can be viewed as defining a distribution superior to $p_{G_\theta}$, and sampling from it is implemented with the same Langevin dynamics, as described in Algorithm 1.

4.1.2 WGAN Discriminator

Different from $f$-GANs, the objective of WGANs is derived from an Integral Probability Metric, and the discriminator cannot naturally be interpreted as an estimated density ratio. Before leveraging the remaining information in the discriminator, we first need to investigate its properties. We introduce the primal problem implied by the WGAN objective as follows:

Let $\pi$ denote the joint probability for transportation between $P$ and $Q$, which satisfies the marginality conditions

\int d\bm{y}\,\pi(\bm{x},\bm{y})=p(\bm{x}),\qquad \int d\bm{x}\,\pi(\bm{x},\bm{y})=q(\bm{y}). \quad (7)

The primal form of the first-order Wasserstein distance $W_1$ is defined as:

W_{1}(\mathcal{P},\mathcal{Q})=\inf_{\pi\in\Pi(\mathcal{P},\mathcal{Q})}\mathbb{E}_{(x,y)\sim\pi}[\|x-y\|_{2}].

The objective function of the discriminator in Wasserstein GANs is the Kantorovich-Rubinstein dual of Eq. 7, and the optimal discriminator has the following property [13]:

Theorem 2.

Let $\pi^*$ be the optimal transport plan in Eq. 7 and $x_t=tx+(1-t)y$ with $0\leq t\leq 1$. If the optimal discriminator $D^*_\phi$ is differentiable and $\pi^*(x,x)=0$ for all $x$, then it holds that:

\mathrm{P}_{(x,y)\sim\pi^{*}}\left[\nabla_{x_{t}}D_{\phi}^{*}\left(x_{t}\right)=\frac{y-x}{\|y-x\|}\right]=1.

Theorem 2 states that for each sample $x$ from the generated distribution $p_{G_\theta}$, the gradient at $x$ points directly toward a sample $y$ in $p_{\text{data}}$, where the $(x,y)$ pairs are consistent with the optimal transport plan $\pi^*$. All linear interpolations $x_t$ between $x$ and $y$ satisfy $\nabla_{x_t}D^*_\phi(x_t)=\frac{y-x}{\|y-x\|}$. It should also be noted that similar results hold for several WGAN variants whose loss functions differ slightly from the standard WGAN [41]. For example, SNGAN uses the hinge loss when optimizing the discriminator, i.e., $r(\cdot)$ and $m(\cdot)$ in Eq. 1 are selected as hinge functions such as $\max(0,-1-u)$, to stabilize the training procedure. We provide a detailed discussion of several surrogate objectives in Appendix E.

The above property of the WGAN discriminator can be interpreted as follows: given a sample $x$ from the generated distribution $p_{G_\theta}$, we can obtain a corresponding $y$ in the data distribution $p_{\text{data}}$ by directly conducting gradient descent guided by the optimal discriminator $D^*_\phi$:

y=x+w_{x}\,\nabla_{x}D_{\phi}^{*}(x),\qquad w_{x}\geq 0. \quad (8)

This seems to be a simple and appealing way to improve $p_{G_\theta}$ with the guidance of the discriminator $D_\phi$. However, the following issues arise:

1) There is no theoretical indication of how to set $w_x$ for each sample $x$ from the generated distribution. We note that a concurrent work [33] introduces a search process called Discriminator Optimal Transport (DOT), which finds the corresponding $y^*$ through the following:

y_{x}=\underset{\bm{y}}{\arg\min}\left\{\|\bm{y}-\bm{x}\|_{2}-D_{\phi}^{*}(\bm{y})\right\}. \quad (9)

However, it should be noticed that Eq. 9 does not have a unique solution. As indicated by Theorem 2, all points on the line segment between $x$ and $y$ are valid solutions. We extend this fact into the following theorem:

Theorem 3.

Let $\pi^*$ and $D^*_\phi$ be the optimal solutions of the primal problem in Eq. 7 and its Kantorovich-Rubinstein dual, respectively. The distribution $p_{ot}$ implied by the generated distribution $p_{G_\theta}$ and the discriminator $D^*_\phi$ is defined as ($y_x$ is defined in Eq. 9):

p_{ot}(\bm{y})=\int d\bm{x}\,\delta(\bm{y}-y_{\bm{x}})\,p_{G_{\theta}}(\bm{x}).

When $p_{\text{data}}\neq p_{G_\theta}$, there exist infinitely many such distributions $p_{ot}$, with $p_{\text{data}}$ as a special case.

Theorem 3 provides a theoretical justification for the poor empirical performance of conducting DOT in the sample space, as reported in their paper.
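The degeneracy of Eq. 9 can be seen in a 1-D illustration (a hypothetical construction, not from the paper): take $x=0$ with transport target $y=1$ and the optimal 1-Lipschitz critic $D^*(u)=u$, whose gradient equals $(y-x)/\|y-x\|=1$ on the segment between them, as Theorem 2 requires. The DOT objective is then constant along the whole segment, so its argmin is not unique.

```python
# Hypothetical 1-D pair matched by the optimal transport plan.
x, y = 0.0, 1.0
D_star = lambda u: u            # optimal 1-Lipschitz critic on [x, y]

def dot_objective(u):
    # ||u - x||_2 - D*(u), the quantity minimized in Eq. 9.
    return abs(u - x) - D_star(u)

# Every interpolation point between x and y attains the same objective value,
# so the argmin of Eq. 9 is not unique (Theorem 3).
values = [dot_objective(x + t * (y - x)) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
assert max(values) - min(values) < 1e-12
```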

2) Another problem is that samples outside the support of the generated distribution $p_{G_\theta}$ are never explored during training, which introduces considerable adversarial noise into the gradient-based search, especially when the sample space is high-dimensional, as with real-world images.

To fix the above issues in leveraging the information of the discriminator in Wasserstein GANs, we propose viewing the discriminator as an energy function. With the discriminator as an energy function, the stationary distribution is unique, and Langevin dynamics can approximately sample from it. Due to the monotonic property of MCMC, there is no issue analogous to setting $w_x$ in Eq. 8. The second issue can also be resolved by fine-tuning the energy landscape with contrastive divergence. In addition, if the discriminator is an energy function, samples from the corresponding energy-based model can be obtained through Langevin dynamics using the gradients of the discriminator, which takes advantage of the property stated in Theorem 2. Given all these facts, there is strong motivation to further explore and bridge the gap between the WGAN discriminator and the energy-based model.

4.2 Semi-Amortized Generation with Langevin Dynamics

We first introduce the Fenchel dual of the intractable log-partition function of Eq. 2:

Theorem 4.

[39] With $H(q)=-\int q(x)\log q(x)\,dx$, the Fenchel dual of the log-partition function is as follows:

A(E_{\theta})=\max_{q\in\mathcal{P}}\,\langle q(x),E_{\theta}(x)\rangle+H(q), \quad (10)

where $\mathcal{P}$ denotes the space of distributions and $\langle q(x),E_\theta(x)\rangle=\int E_\theta(x)q(x)\,dx$.

Substituting the Fenchel dual $A(E_\theta)$ back into the MLE objective in Eq. 3, we obtain the following min-max formulation for training an energy-based model by MLE:

\min_{q\in\mathcal{P}}\max_{E_{\theta}\in\mathcal{E}}\underbrace{\mathbb{E}_{\bm{x}\sim p_{\text{data}}}\left[E_{\theta}(\bm{x})\right]-\mathbb{E}_{\bm{x}\sim q}\left[E_{\theta}(\bm{x})\right]}_{\text{WGAN's objective for critic}}-\underbrace{H(q)}_{\text{entropy regularization}}. \quad (11)
Algorithm 1 Discriminator Contrastive Divergence
1: Input: pretrained generator $G_\theta$, discriminator $D_\phi$.
2: Set the step size $\epsilon$, the number of MCMC steps $K$, and the total number of iterations $T$.
3: for iteration $i=1,\cdots,T$ do
4:   Sample a batch of data samples $\{x_t\}_{t=1}^m$ from the empirical data distribution $p_{\text{data}}$ and $\{z_t\}_{t=1}^m$ from the prior distribution $p(z)$.
5:   for step $l=1,\cdots,K$ do
6:     Pixel space: $G_\theta(z_t)^l=G_\theta(z_t)^{l-1}-\frac{\epsilon}{2}\nabla_x D_\phi\left(G_\theta(z_t)^{l-1}\right)+\sqrt{\epsilon}\,\omega$, $\omega\sim\mathcal{N}(0,\mathcal{I})$, or
7:     Latent space: $z_t^l=z_t^{l-1}-\frac{\epsilon}{2}\nabla_z D_\phi\left(G_\theta(z_t^{l-1})\right)+\sqrt{\epsilon}\,\omega$, $\omega\sim\mathcal{N}(0,\mathcal{I})$
8:   end for
9:   Optimize the following objective w.r.t. $\phi$:
10:  Pixel space: $L=\frac{1}{m}\sum_t\left(D_\phi(x_t)-D_\phi(G_\theta(z_t)^K)\right)$, or
11:  Latent space: $L=\frac{1}{m}\sum_t\left(D_\phi(x_t)-D_\phi(G_\theta(z_t^K))\right)$
12: end for
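The inner Langevin loop of Algorithm 1 (pixel-space variant) can be sketched on a toy problem; all functions below are hypothetical stand-ins (a quadratic "critic" `D` and a wide Gaussian cloud standing in for the generator's output), not the paper's networks. Following the update signs in the listing, the chain drifts samples toward the low-energy region of the critic while the injected noise keeps them spread according to the implied energy-based model.

```python
import numpy as np

rng = np.random.default_rng(0)

def D(x):
    # Toy quadratic critic-as-energy: low values near the origin.
    return 0.5 * np.sum(x * x, axis=-1)

def grad_D(x):
    return x

def dcd_langevin(x0, eps=0.1, K=200):
    # Inner loop of Algorithm 1 (pixel-space variant):
    # x^l = x^{l-1} - (eps/2) * grad_D(x^{l-1}) + sqrt(eps) * N(0, I)
    x = x0.copy()
    for _ in range(K):
        x = x - 0.5 * eps * grad_D(x) + np.sqrt(eps) * rng.standard_normal(x.shape)
    return x

x0 = 4.0 * rng.standard_normal((2000, 2))   # stand-in "generated" samples
x = dcd_langevin(x0)

# The chain moves the batch toward the critic's low-energy region, so the
# mean critic value drops substantially from its initial value.
assert D(x).mean() < D(x0).mean()
```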

The Fenchel dual view of MLE training of energy-based models explicitly illustrates the gap and the connection between WGANs and energy-based models: if we regard the dual distribution $q$ as the generated distribution $p_{G_\theta}$ and $D_\phi$ as the energy function $E_\theta$, the dual form of energy-based model training is essentially the WGAN objective with the entropy of the generator regularized.

Hence, to turn the WGAN discriminator into an energy function, we may conduct several fine-tuning steps, as illustrated in Eq. 11. Note that maximizing the entropy of $p_{G_\theta}$ is a challenging task, which requires either a generator with tractable density, e.g., normalizing flows [7], or maximizing the mutual information between the latent variable $Z$ and the corresponding $G_\theta(Z)$ when $G_\theta$ is a deterministic mapping. Instead of maximizing the entropy of the generated distribution $p_{G_\theta}$ directly, we derive our method from the following fact:

Proposition 1.

[20] Updating the generated distribution $p_{G_\theta}$ according to the gradient estimated through Eq. 11 essentially minimizes the Kullback-Leibler (KL) divergence between $p_{G_\theta}$ and the distribution $p_{D_\phi}$, i.e., $D_{\mathrm{KL}}(p_{G_\theta}\|p_{D_\phi})$, where $p_{D_\phi}$ refers to the distribution implied by using $D_\phi$ as the energy function, as illustrated in Eq. 2.

To avoid computing $H(p_{G_\theta})$, and motivated by the monotonic property of MCMC illustrated in Eq. 5, we propose Discriminator Contrastive Divergence (DCD), which replaces the gradient-based optimization of $q$ ($p_{G_\theta}$) in Eq. 11 with several MCMC steps that fine-tune the WGAN critic into an energy function. More specifically, we use Langevin dynamics [34], which leverages the gradient of the discriminator to conduct sampling:

x_{k}=x_{k-1}-\frac{\epsilon}{2}\nabla_{x}D_{\phi}\left(x_{k-1}\right)+\sqrt{\epsilon}\,\omega,\quad\omega\sim\mathcal{N}(0,\mathcal{I}), \quad (12)

where $\epsilon$ refers to the step size. The whole fine-tuning procedure is illustrated in Algorithm 1. GAN-based approaches are implicitly constrained by the dimension of the latent noise, reflecting the widely applied assumption that high-dimensional data, e.g., images, actually lie on a relatively low-dimensional manifold. Apart from searching for reasonable points in the data space, we can also find the lower-energy part of the latent manifold by conducting Langevin dynamics in the latent space, which is more stable in practice, i.e.:

z_{t}^{l}=z_{t}^{l-1}-\frac{\epsilon}{2}\nabla_{z}D_{\phi}\left(G_{\theta}(z_{t}^{l-1})\right)+\sqrt{\epsilon}\,\omega,\quad\omega\sim\mathcal{N}(0,\mathcal{I}). \quad (13)

Ideally, the proposal should be accepted or rejected according to the Metropolis–Hastings algorithm:

\alpha:=\min\left\{1,\frac{D_{\phi}\left(x_{k}\right)q\left(x_{k-1}|x_{k}\right)}{D_{\phi}\left(x_{k-1}\right)q\left(x_{k}|x_{k-1}\right)}\right\}, \quad (14)

where $q$ refers to the proposal distribution, defined as:

q\left(x^{\prime}|x\right)\propto\exp\left(-\frac{1}{4\tau}\left\|x^{\prime}-x-\tau\nabla\log\pi(x)\right\|_{2}^{2}\right). \quad (15)

In practice, we find that the rejection step described in Eq. 14 does not boost performance. For simplicity, following [32, 8], we apply Eq. 12 in our experiments as an approximation.
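For completeness, the Metropolis-adjusted variant of Eqs. 14-15 can be sketched on a 1-D toy problem (a hypothetical quadratic critic-as-energy, with the target density taken as proportional to $\exp(-D)$; this is an illustrative assumption, not the paper's image-space setup). The proposal mean follows the Langevin drift $-\tau\nabla D$, and the acceptance test compares target and proposal densities in log space:

```python
import math, random

random.seed(1)

D = lambda x: 0.5 * x * x      # toy energy; target density prop. to exp(-D)
grad_D = lambda x: x
tau = 0.05

def log_q(x_new, x_old):
    # Log of the Langevin proposal density (Eq. 15 with log pi = -D),
    # up to a constant that cancels in the acceptance ratio.
    mean = x_old - tau * grad_D(x_old)
    return -((x_new - mean) ** 2) / (4.0 * tau)

def mala_step(x):
    x_prop = x - tau * grad_D(x) + math.sqrt(2.0 * tau) * random.gauss(0.0, 1.0)
    # log of Eq. 14's ratio: [pi(x')q(x|x')] / [pi(x)q(x'|x)].
    log_alpha = (-D(x_prop) + log_q(x, x_prop)) - (-D(x) + log_q(x_prop, x))
    if random.random() < math.exp(min(0.0, log_alpha)):
        return x_prop              # accept
    return x                       # reject, keep the current state

samples, x = [], 3.0
for i in range(3000):
    x = mala_step(x)
    if i >= 1000:                  # discard burn-in
        samples.append(x)

mean = sum(samples) / len(samples)
assert abs(mean) < 0.75            # chain centers on the bulk of exp(-D)
```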

After fine-tuning, the discriminator can be approximately seen as an unnormalized probability function, which implies a unique distribution $p_{D_\phi}$. Similar to the $p^*$ implied by the rejection-sampling-based methods, it is reasonable to assume that $p_{D_\phi}$ is superior to $p_{G_\theta}$. Sampling from $p_{D_\phi}$ is implemented through Langevin dynamics, as illustrated in Eq. 12, with $p_{G_\theta}$ serving as the initial distribution.

5 Experiments

Figure 2: Density modeling on synthetic distributions. Top: 8 Gaussians. Bottom: 25 Gaussians. Left (a): distribution of real data. Middle (b): distribution defined by the generator of SNGAN; the surface is the level set of the critic, with yellow corresponding to higher values and purple to lower. Right (c): distribution defined by SNGAN-DCD; the surface is the level set of the proposed energy function.

In this section, we conduct extensive experiments on both synthetic data and real-world images to demonstrate the effectiveness of the proposed method. The results show that taking the fine-tuned discriminator as the energy function and sampling from the corresponding $p_{D_\phi}$ yields stable improvements over WGAN implementations.

5.1 Synthetic Density Modeling

Displaying level sets is a meaningful way to study the learned critic. Following [3, 13], we investigate the impact of our method on two challenging low-dimensional synthetic settings: twenty-five isotropic Gaussians arranged in a grid and eight Gaussians arranged in a ring (Fig. 2(a)). In all settings, both the generator and the discriminator of the WGAN model are implemented as neural networks with four fully connected layers and ReLU activations. The Lipschitz constraint is enforced through spectral normalization [25], and the prior is a two-dimensional multivariate Gaussian with mean 0 and standard deviation 1.

To investigate whether the proposed Discriminator Contrastive Divergence is capable of tuning the distribution induced by the discriminator into the desired energy function, i.e., $p_{D_\phi}$, we visualize both the value surface of the critic and the samples obtained from $p_{D_\phi}$ with Langevin dynamics. The results are shown in Fig. 2. As can be observed, the original WGAN (Fig. 2(b)) is strong enough to cover most modes, but there are still spurious links between different modes. The enhanced distribution $p_{D_\phi}$ (Fig. 2(c)), however, reduces the spurious links and recovers the modes with underestimated density. More precisely, after the MCMC fine-tuning procedure (Fig. 2(c)), the gradients of the value surface become more meaningful, so that all regions with high density under the data distribution $p_{\text{data}}$ are assigned high $D_\phi$ values, i.e., lower energy ($\exp(-D_\phi)$). By contrast, for the original discriminator (Fig. 2(b)), the low-energy regions of $p_{D_\phi}$ are not necessarily consistent with the high-density regions of $p_{\text{data}}$.

5.2 Real-World Image Generation

To quantitatively study the proposed DCD approach, in this section we conduct experiments on unsupervised real-world image generation with DCD and its related counterparts. On several commonly used image datasets, experiments demonstrate that the proposed DCD algorithm consistently achieves better performance on different benchmarks by a significant margin.

5.2.1 Experimental setup

Baselines. We evaluate the following models as baselines: PixelCNN [38], PixelIQN [29], and MoLM [30] as representatives of other types of generative models; EBM [8] and NCSN [32] for energy-based models; and WGAN-GP [13], Spectral Normalization GAN (SNGAN) [25], and ProgressiveGAN [19] for GAN models. We also consider the aforementioned DRS [3], DOT [33], and MH-GAN [36]. The choices of EBMs and GANs are due to their close relation to our proposed method, as analyzed in Section 4. We omit other previous GAN methods since, as representatives of state-of-the-art GAN models, SNGAN and ProgressiveGAN have been shown to rival or outperform several earlier methods such as the original GAN [10], the energy-based generative adversarial network [40], and the original WGAN with weight clipping [2].

Evaluation Metrics. For evaluation, we concentrate on comparing the quality of generated images, since it is well known that GAN models cannot perform reliable likelihood estimation [35]. We compare the Inception Score (IS) [31] and the Fréchet Inception Distance (FID) [15] reached during training, both computed from 50K samples. Higher image quality corresponds to a higher IS and a lower FID. Intuitively, IS captures the idea that high-quality images should lead to high classification confidence, while FID measures the computer-vision-specific similarity of generated images to real ones via the Fréchet distance.
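As a compact illustration of the first metric, the sketch below computes the Inception Score from a matrix of class probabilities, $\mathrm{IS}=\exp(\mathbb{E}_x[\mathrm{KL}(p(y|x)\,\|\,p(y))])$. The helper `inception_score` is illustrative only; an actual evaluation uses the pretrained Inception network's predictions on 50K generated samples.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS = exp( E_x[ KL(p(y|x) || p(y)) ] ) for an (N, C) matrix of
    class probabilities from a pretrained classifier."""
    p_y = probs.mean(axis=0, keepdims=True)                 # marginal class dist.
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

# Confident and diverse predictions score high; uninformative ones score 1.
confident = np.eye(10)[np.arange(100) % 10]   # one-hot, all 10 classes covered
uniform = np.full((100, 10), 0.1)
print(round(inception_score(confident)), round(inception_score(uniform)))  # 10 1
```

The two extremes show why IS rewards both per-sample confidence and overall class diversity: the score is bounded above by the number of classes covered.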

Data. We use CIFAR-10 [22] and STL-10 [5], both standard datasets widely used in the generative modeling literature. STL-10 consists of unlabeled real-world color images, while CIFAR-10 is provided with class labels, which enables us to conduct conditional generation tasks. For STL-10, we shrink the images to $32\times 32$ as in previous works. The pixel values of all images are rescaled into $[-1,1]$.
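A minimal sketch of this preprocessing is given below. Average pooling is our assumption for the downsampling step (the original work may use a different resize filter), and `preprocess` is an illustrative helper, not code from the paper.

```python
import numpy as np

def preprocess(images_uint8, factor=1):
    """Average-pool an (N, H, W, C) uint8 batch by `factor` (factor=3
    shrinks STL-10's 96x96 to 32x32), then rescale [0, 255] -> [-1, 1]."""
    n, h, w, c = images_uint8.shape
    x = images_uint8.astype(np.float32)
    x = x.reshape(n, h // factor, factor, w // factor, factor, c).mean(axis=(2, 4))
    return x / 127.5 - 1.0

batch = np.full((2, 96, 96, 3), 255, dtype=np.uint8)   # dummy white images
out = preprocess(batch, factor=3)
print(out.shape, float(out.max()))   # (2, 32, 32, 3) 1.0
```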

Network Architecture. For all experimental settings, we follow Spectral Normalization GAN (SNGAN) [25] and adopt the same Residual Network (ResNet) [14] architectures and hyperparameters, which is presently the state-of-the-art implementation of WGAN. Details can be found in Appendix D. We take the open-source code and pre-trained model of SNGAN as the base model for the experiments on CIFAR-10. For STL-10, since no pre-trained model is available to reproduce the results, we train SNGAN from scratch and take it as the base model.

Model                              Inception     FID

CIFAR-10 Unconditional
PixelCNN [38]                      4.60          65.93
PixelIQN [29]                      5.29          49.46
EBM [8]                            6.02          40.58
WGAN-GP [13]                       7.86 ± .07    18.12
MoLM [30]                          7.90 ± .10    18.9
SNGAN [25]                         8.22 ± .05    21.7
ProgressiveGAN [19]                8.80 ± .05    -
NCSN [32]                          8.87 ± .12    25.32
DCGAN w/ DRS (cal) [3]             3.073         -
DCGAN w/ MH-GAN (cal) [36]         3.379         -
ResNet-SAGAN w/ DOT [33]           8.50 ± .12    19.71
SNGAN-DCD (Pixel)                  8.54 ± .11    21.67
SNGAN-DCD (Latent)                 9.11 ± .04    16.24

CIFAR-10 Conditional
EBM [8]                            8.30          37.9
SNGAN [25]                         8.43 ± .09    15.43
SNGAN-DCD (Pixel)                  8.73 ± .13    22.84
SNGAN-DCD (Latent)                 8.81 ± .11    15.05
BigGAN [4]                         9.22          14.73

Table 1: Inception and FID scores for CIFAR-10.
Refer to caption
Figure 3: Unconditional CIFAR-10 Langevin dynamics visualization.

5.2.2 Results

Model                 Inception     FID
SNGAN [25]            8.90 ± .12    18.73
SNGAN-DCD (Pixel)     9.25 ± .09    22.25
SNGAN-DCD (Latent)    9.33 ± .04    17.68

Table 2: Inception and FID scores for STL-10

For quantitative evaluation, we report the inception score [31] and FID [15] on CIFAR-10 in Tab. 1 and on STL-10 in Tab. 2. As shown in Tab. 1, in pixel space, introducing the proposed DCD algorithm yields a significant improvement in inception score over SNGAN. The reported inception score is even higher than most values achieved by class-conditional generative models. Our FID of 21.67 on CIFAR-10 is competitive with other top generative models. When DCD is conducted in the latent space, we further achieve a 9.11 inception score and a 16.24 FID, a new state-of-the-art inception score. When combined with label information to perform conditional generation, we further improve the FID to 15.05, which is comparable with current state-of-the-art large-scale trained models [4]. Visualizations of generated examples can be found in Fig. 3, which shows that the Markov chain generates increasingly realistic samples, suggesting that the MCMC process is meaningful and effective. Tab. 2 shows the performance on STL-10, which demonstrates that, as a general method, DCD is not overfitted to the specific CIFAR-10 dataset. More experimental details and generated samples for STL-10 can be found in Appendix F.

6 Discussion and Future Work

From the density ratio estimation perspective, the discriminator in $f$-GANs can be adapted to a wide range of application scenarios, such as mutual information estimation [18] and bias correction of generative models [12]. However, for WGANs, another important branch of GAN research, the information available in the discriminator is less explored. In this paper, we narrow the scope of discussion and focus on how to leverage the discriminator of WGANs to further improve sample quality in image generation. We conduct a comprehensive theoretical study of the informativeness of the discriminator in different kinds of GANs. Motivated by theoretical progress in the WGAN literature, we investigate the possibility of turning the discriminator of WGANs into an energy function and propose a fine-tuning procedure for WGANs named "discriminator contrastive divergence". The final image generation process is semi-amortized: the generator provides an initial state, and then several steps of Langevin dynamics are conducted. We demonstrate the effectiveness of the proposed method on several tasks, including both synthetic and real-world image generation benchmarks.

It should be noted that the semi-amortized generation procedure trades sampling speed for generation quality: it is slower than direct generation with the generator alone. Hence, the proposed method is best suited to application scenarios where generation quality is of vital importance. Another interesting observation from the experiments is that discriminator contrastive divergence surprisingly reduces the occurrence of adversarial samples during training, so investigating the relationship between our method and Bayesian adversarial learning is a promising future direction.

We hope our work sheds some light on a generalized view connecting different GANs and energy-based models, and stimulates further exploration of the potential of current deep generative models.

References

  • Ali and Silvey [1966] Syed Mumtaz Ali and Samuel D Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society: Series B (Methodological), 28(1):131–142, 1966.
  • Arjovsky et al. [2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
  • Azadi et al. [2018] Samaneh Azadi, Catherine Olsson, Trevor Darrell, Ian Goodfellow, and Augustus Odena. Discriminator rejection sampling. arXiv preprint arXiv:1810.06758, 2018.
  • Brock et al. [2018] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
  • Coates et al. [2011] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223, 2011.
  • Cover and Thomas [2012] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
  • Dinh et al. [2016] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.
  • Du and Mordatch [2019] Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2019.
  • Finn et al. [2016] Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852, 2016.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • Grathwohl et al. [2019] Will Grathwohl, Kuan-Chieh Wang, Jörn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should treat it like one. arXiv preprint arXiv:1912.03263, 2019.
  • Grover et al. [2019] Aditya Grover, Jiaming Song, Alekh Agarwal, Kenneth Tran, Ashish Kapoor, Eric Horvitz, and Stefano Ermon. Bias correction of learned generative models using likelihood-free importance weighting. arXiv preprint arXiv:1906.09531, 2019.
  • Gulrajani et al. [2017] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in neural information processing systems, pages 5767–5777, 2017.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
  • Hiriart-Urruty and Lemaréchal [2012] Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Fundamentals of convex analysis. Springer Science & Business Media, 2012.
  • Hjelm et al. [2017] R Devon Hjelm, Athul Paul Jacob, Tong Che, Adam Trischler, Kyunghyun Cho, and Yoshua Bengio. Boundary-seeking generative adversarial networks. arXiv preprint arXiv:1702.08431, 2017.
  • Hjelm et al. [2018] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
  • Karras et al. [2017] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
  • Kim and Bengio [2016] Taesup Kim and Yoshua Bengio. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439, 2016.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • Kumar et al. [2019] Rithesh Kumar, Anirudh Goyal, Aaron Courville, and Yoshua Bengio. Maximum entropy generators for energy-based models. arXiv preprint arXiv:1901.08508, 2019.
  • Li et al. [2017] Yingzhen Li, Richard E Turner, and Qiang Liu. Approximate inference with amortised mcmc. arXiv preprint arXiv:1702.08343, 2017.
  • Miyato et al. [2018] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
  • Müller [1997] Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
  • Nguyen et al. [2010] XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
  • Nowozin et al. [2016] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in neural information processing systems, pages 271–279, 2016.
  • Ostrovski et al. [2018] Georg Ostrovski, Will Dabney, and Rémi Munos. Autoregressive quantile networks for generative modeling. arXiv preprint arXiv:1806.05575, 2018.
  • Ravuri et al. [2018] Suman Ravuri, Shakir Mohamed, Mihaela Rosca, and Oriol Vinyals. Learning implicit generative models with the method of learned moments. arXiv preprint arXiv:1806.11006, 2018.
  • Salimans et al. [2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in neural information processing systems, pages 2234–2242, 2016.
  • Song and Ermon [2019] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pages 11895–11907, 2019.
  • Tanaka [2019] Akinori Tanaka. Discriminator optimal transport. arXiv preprint arXiv:1910.06832, 2019.
  • Teh et al. [2003] Yee Whye Teh, Max Welling, Simon Osindero, and Geoffrey E Hinton. Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4(Dec):1235–1260, 2003.
  • Theis et al. [2015] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.
  • Turner et al. [2018] Ryan Turner, Jane Hung, Yunus Saatci, and Jason Yosinski. Metropolis-hastings generative adversarial networks. arXiv preprint arXiv:1811.11357, 2018.
  • Uehara et al. [2016] Masatoshi Uehara, Issei Sato, Masahiro Suzuki, Kotaro Nakayama, and Yutaka Matsuo. Generative adversarial nets from a density ratio estimation perspective. arXiv preprint arXiv:1610.02920, 2016.
  • Van den Oord et al. [2016] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in neural information processing systems, pages 4790–4798, 2016.
  • Wainwright et al. [2008] Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.
  • Zhao et al. [2016] Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.
  • Zhou et al. [2019] Zhiming Zhou, Jiadong Liang, Yuxuan Song, Lantao Yu, Hongwei Wang, Weinan Zhang, Yong Yu, and Zhihua Zhang. Lipschitz generative adversarial nets. arXiv preprint arXiv:1902.05687, 2019.

Appendix A Proof of Theorem 2

Note that Theorem 2 can be generalized: Lipschitz continuity with respect to the $l_2$-norm (Euclidean distance) guarantees that the gradient points directly toward some sample [41]. We introduce the following lemmas, of which Theorem 2 is a special case.

Let $(x,y)$ be such that $y\neq x$, and define $x_t=x+t\cdot(y-x)$ with $t\in[0,1]$.

Lemma 1.

If $f$ is $k$-Lipschitz with respect to $\|\cdot\|_p$ and $f(y)-f(x)=k\|y-x\|_p$, then $f(x_t)=f(x)+t\cdot k\|y-x\|_p$.

Proof.

Since $f$ is $k$-Lipschitz, using the homogeneity of the norm we have

\begin{align}
f(y)-f(x) &= f(y)-f(x_t)+f(x_t)-f(x) \nonumber\\
&\leq f(y)-f(x_t)+k\|x_t-x\|_p = f(y)-f(x_t)+t\cdot k\|y-x\|_p \nonumber\\
&\leq k\|y-x_t\|_p+t\cdot k\|y-x\|_p = k(1-t)\|y-x\|_p+t\cdot k\|y-x\|_p \nonumber\\
&= k\|y-x\|_p. \tag{16}
\end{align}

$f(y)-f(x)=k\|y-x\|_p$ implies that all the inequalities above hold with equality. Therefore, $f(x_t)=f(x)+t\cdot k\|y-x\|_p$. ∎

Lemma 2.

Let $v$ be the unit vector $\frac{y-x}{\|y-x\|_2}$. If $f(x_t)=f(x)+t\cdot k\|y-x\|_2$, then $\frac{\partial f(x_t)}{\partial v}=k$.

Proof.
\begin{align*}
\frac{\partial f(x_t)}{\partial v} &= \lim_{h\to 0}\frac{f(x_t+hv)-f(x_t)}{h} = \lim_{h\to 0}\frac{f\big(x_t+h\frac{y-x}{\|y-x\|_2}\big)-f(x_t)}{h} \\
&= \lim_{h\to 0}\frac{f(x_{t+\frac{h}{\|y-x\|_2}})-f(x_t)}{h} = \lim_{h\to 0}\frac{\frac{h}{\|y-x\|_2}\cdot k\|y-x\|_2}{h} = k. \qquad\qed
\end{align*}

Then we derive the formal proof of Theorem 2.

Proof.

Assume $p=2$. If $f$ is $k$-Lipschitz with respect to $\|\cdot\|_2$ and $f$ is differentiable at $x_t$, then $\|\nabla f(x_t)\|_2\leq k$. Let $v$ be the unit vector $\frac{y-x}{\|y-x\|_2}$. We have

\begin{align}
k^2 = k\frac{\partial f(x_t)}{\partial v} = k\left\langle v,\nabla f(x_t)\right\rangle = \left\langle kv,\nabla f(x_t)\right\rangle \leq \|kv\|_2\|\nabla f(x_t)\|_2 \leq k^2. \tag{17}
\end{align}

Because equality throughout holds only when $\nabla f(x_t)=kv=k\frac{y-x}{\|y-x\|_2}$, we conclude that $\nabla f(x_t)=k\frac{y-x}{\|y-x\|_2}$. ∎
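This result admits a quick numeric sanity check with finite differences. The sketch below uses the assumed test function $f(z)=-k\|z-y\|_2$, which is $k$-Lipschitz and satisfies $f(y)-f(x)=k\|y-x\|_2$, and verifies that the gradient at interpolated points equals $k\frac{y-x}{\|y-x\|_2}$:

```python
import numpy as np

def num_grad(f, x, h=1e-6):
    """Central finite-difference gradient."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

k = 2.0
x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
f = lambda z: -k * np.linalg.norm(z - y)   # k-Lipschitz, f(y) - f(x) = k * 5
v = (y - x) / np.linalg.norm(y - x)        # unit vector from x toward y

checks = [np.allclose(num_grad(f, x + t * (y - x)), k * v, atol=1e-4)
          for t in (0.2, 0.5, 0.8)]
print(checks)   # [True, True, True]
```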

Appendix B Proof of Theorem 3

Theorem 3 states that, under the procedure introduced in [33], the stationary distribution is not unique. The complete procedure is, for each $y\sim P_{G_\theta}$, to find

\begin{equation}
y^* = \operatorname*{arg\,min}_x\{\|x-y\|_2 - D(x)\}. \tag{18}
\end{equation}

To find the corresponding $y^*$, the following gradient-based update is conducted:

\begin{equation}
x \leftarrow x - \epsilon\nabla_x\{\|x-y\|_2 - D(x)\}. \tag{19}
\end{equation}

For all points $x_t$ on the linear interpolation between $x$ and the target $y^*$, as defined in the proof of Theorem 2, we have

\begin{equation}
\nabla_{x_t}\{\|x_t-y\|_2 - D(x_t)\} = \frac{y-x}{\|y-x\|_2} - \frac{y-x}{\|y-x\|_2} = 0, \tag{20}
\end{equation}

which indicates that every point on the linear interpolation satisfies the stationarity condition.
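The non-uniqueness of stationary points is easy to verify numerically. The sketch below assumes a hypothetical 1-Lipschitz critic $D(z)=-\|z-y^*\|_2$ that attains the Lipschitz equality toward $y^*$, and checks that the gradient of the objective in Eq. (18) vanishes at interior points of the segment:

```python
import numpy as np

def num_grad(f, x, h=1e-6):
    """Central finite-difference gradient."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

y = np.array([0.0, 0.0])                    # sample from the generator
y_star = np.array([3.0, 4.0])               # its transport target
D = lambda z: -np.linalg.norm(z - y_star)   # 1-Lipschitz critic attaining equality
g = lambda z: np.linalg.norm(z - y) - D(z)  # objective of Eq. (18)

norms = [float(np.linalg.norm(num_grad(g, y + t * (y_star - y))))
         for t in (0.25, 0.5, 0.75)]
print([n < 1e-4 for n in norms])   # every interior point is stationary
```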

Appendix C Proof of Proposition 1

Proposition 1 is a direct consequence of the following Lemma 3. Following [24], we provide the complete proof as follows.

Lemma 3.

[6] Let $q$ and $r$ be two distributions of $\bm{z}_0$. Let $q_t$ and $r_t$ be the corresponding distributions of the state $\bm{z}_t$ at time $t$, induced by the transition kernel $\mathcal{K}$. Then $\mathrm{D}_{\text{KL}}[q_t\,\|\,r_t]\geq \mathrm{D}_{\text{KL}}[q_{t+1}\,\|\,r_{t+1}]$ for all $t\geq 0$.

Proof.
\begin{align*}
\mathrm{D}_{\text{KL}}[q_t\,\|\,r_t] &= \mathbb{E}_{q_t}\left[\log\frac{q_t(\bm{z}_t)}{r_t(\bm{z}_t)}\right] \\
&= \mathbb{E}_{q_t(\bm{z}_t)\mathcal{K}(\bm{z}_{t+1}|\bm{z}_t)}\left[\log\frac{q_t(\bm{z}_t)\mathcal{K}(\bm{z}_{t+1}|\bm{z}_t)}{r_t(\bm{z}_t)\mathcal{K}(\bm{z}_{t+1}|\bm{z}_t)}\right] \\
&= \mathbb{E}_{q_{t+1}(\bm{z}_{t+1})q_{t+1}(\bm{z}_t|\bm{z}_{t+1})}\left[\log\frac{q_{t+1}(\bm{z}_{t+1})q_{t+1}(\bm{z}_t|\bm{z}_{t+1})}{r_{t+1}(\bm{z}_{t+1})r_{t+1}(\bm{z}_t|\bm{z}_{t+1})}\right] \\
&= \mathrm{D}_{\text{KL}}[q_{t+1}\,\|\,r_{t+1}] + \mathbb{E}_{q_{t+1}}\mathrm{D}_{\text{KL}}[q_{t+1}(\bm{z}_t|\bm{z}_{t+1})\,\|\,r_{t+1}(\bm{z}_t|\bm{z}_{t+1})] \\
&\geq \mathrm{D}_{\text{KL}}[q_{t+1}\,\|\,r_{t+1}],
\end{align*}

where the last step holds because the KL divergence between the conditionals is non-negative. ∎
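Lemma 3 is the data-processing inequality for KL divergence under a shared Markov kernel, and it can be checked numerically on a small discrete chain. The kernel and initial distributions below are arbitrary illustrative choices:

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

K = np.array([[0.9, 0.1],
              [0.2, 0.8]])        # row-stochastic transition kernel
q = np.array([0.5, 0.5])
r = np.array([0.9, 0.1])

kls = []
for _ in range(6):
    kls.append(kl(q, r))
    q, r = q @ K, r @ K           # push both distributions through the kernel
print(all(a >= b for a, b in zip(kls, kls[1:])))   # True: KL never increases
```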

Appendix D Network architectures

ResNet architectures for the CIFAR-10 and STL-10 datasets. We use architectures similar to those used in [13].

$z\in\mathbb{R}^{128}\sim\mathcal{N}(0,I)$
dense, $4\times 4\times 256$
ResBlock up 256
ResBlock up 256
ResBlock up 256
BN, ReLU, $3\times 3$ conv (3 channels), Tanh
Table 3: Generator
RGB image $x\in\mathbb{R}^{32\times 32\times 3}$
ResBlock down 128
ResBlock down 128
ResBlock 128
ResBlock 128
ReLU
Global sum pooling
dense $\rightarrow$ 1
Table 4: Discriminator

Appendix E Discussions on Objective Functions

Optimization of the standard objective of WGAN, i.e., with $r(x)=m(x)=x$ in Eq. 1, is known to be unstable due to numerical issues and a free offset [41, 25]. Instead, several surrogate losses are used in practice. For example, the logistic loss ($r(x)=m(x)=-\log(1+e^{-x})$) and the hinge loss ($r(x)=m(x)=\min(0,x)$) are two widely applied objectives. Such surrogate losses are valid because they are lower bounds of the Wasserstein distance between the two distributions of interest, which follows directly from the facts that $-\log(1+e^{-x})\leq x$ and $\min(0,x)\leq x$. A more detailed discussion can be found in [33].
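The two elementary inequalities underpinning the lower-bound argument can be confirmed numerically over a grid:

```python
import numpy as np

xs = np.linspace(-10.0, 10.0, 2001)
logistic = -np.log1p(np.exp(-xs))     # r(x) = -log(1 + e^{-x})
hinge = np.minimum(0.0, xs)           # r(x) = min(0, x)
print(bool(np.all(logistic <= xs)), bool(np.all(hinge <= xs)))  # True True
```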

Note that $\min(0,-1+x)$ and $-\log(1+e^{-x})$ belong to the function family proposed in [41], and Theorem 4 in [41] guarantees the gradient property of the discriminator.

Appendix F More Experiment Details

F.1 CIFAR-10

For the meta-parameters in the DCD Algorithm 1, when the MCMC process is conducted in the pixel space, we choose 6–8 as the number of MCMC steps $K$, set the step size $\epsilon$ to 10, and set the standard deviation of the Gaussian noise to 0.01; for the latent space, we set $K$ to 50, $\epsilon$ to 0.2, and the deviation to 0.1. The Adam optimizer [21] is used with a learning rate of $2\times 10^{-4}$ and $\beta_1=0$, $\beta_2=0.9$. We use 5 critic updates per generator update and a batch size of 64.
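A sketch of the resulting pixel-space sampling loop with these meta-parameters is shown below. The exact update form (step size applied directly to the critic gradient, clipping to the rescaled pixel range) is our assumption, and `dcd_sample` with its toy critic gradient is illustrative rather than the released implementation:

```python
import numpy as np

def dcd_sample(grad_d, x_gen, n_steps=8, eps=10.0, noise_std=0.01, seed=0):
    """Semi-amortized sampling: start from the generator output and take K
    Langevin steps with the critic's gradient (pixel-space meta-parameters
    from the text: K in 6-8, step size 10, noise std 0.01)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x_gen, dtype=np.float32)
    for _ in range(n_steps):
        x = x + eps * grad_d(x) + noise_std * rng.normal(size=x.shape)
        x = np.clip(x, -1.0, 1.0)   # keep pixels in the rescaled range [-1, 1]
    return x

# Toy critic gradient pulling images toward 0.5 (illustrative only).
grad_d = lambda x: 0.05 * (0.5 - x)
x0 = np.zeros((4, 32, 32, 3), dtype=np.float32)   # stand-in generator batch
out = dcd_sample(grad_d, x0)
print(out.shape)   # (4, 32, 32, 3)
```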

F.2 STL-10

We show generated samples of DCD during Langevin dynamics in Fig. 4. We run 150 MCMC steps and plot generated samples every 10 iterations. The step size is set to 0.05 and the noise is drawn from $\mathcal{N}(0,0.1)$.

Refer to caption
Figure 4: STL-10 Langevin dynamics visualization.