
A Mathematical Introduction to Generative Adversarial Nets (GAN)

Yang Wang Department of Mathematics
Hong Kong University of Science and Technology
Clear Water Bay, Kowloon, Hong Kong
[email protected]
Abstract.

Generative Adversarial Nets (GAN) have received considerable attention since the 2014 groundbreaking work by Goodfellow et al [4]. Such attention has led to an explosion in new ideas, techniques and applications of GANs. To better understand GANs we need to understand the mathematical foundation behind them. This paper attempts to provide an overview of GANs from a mathematical point of view. Many students in mathematics may find the papers on GANs difficult to fully understand because most of them are written from a computer science and engineering point of view. The aim of this paper is to give more mathematically oriented students an introduction to GANs in a language that is more familiar to them.

Key words and phrases:
Deep Learning, GAN, Neural Network
2010 Mathematics Subject Classification:
Primary 42C15
The author is supported in part by the Hong Kong Research Grant Council grants 16308518 and 16317416, as well as HK Innovation Technology Fund ITS/044/18FX

1. Introduction

1.1. Background

Generative Adversarial Nets (GAN) have received considerable attention since the 2014 groundbreaking work by Goodfellow et al [4]. Such attention has led to an explosion in new ideas, techniques and applications of GANs. Yann LeCun has called “this (GAN) and the variations that are now being proposed is the most interesting idea in the last 10 years in ML, in my opinion.” In this note I will attempt to provide a beginner’s introduction to GAN from a more mathematical point of view, intended for students in mathematics. Of course there is much more to GANs than just the mathematical principle. To fully understand GANs one must also look into their algorithms and applications. Nevertheless I believe that understanding the mathematical principle is a crucial first step towards understanding GANs, and with it the other aspects of GANs will be considerably easier to master.

The original GAN, which we shall refer to as the vanilla GAN in this paper, was introduced in [4] as a new framework for generating objects from a training data set. Its goal was to address the following question: Suppose we are given a data set of objects with a certain degree of consistency, for example a collection of images of cats, or handwritten Chinese characters, or Van Gogh paintings; can we artificially generate similar objects?

This question is quite vague, so we need to make it mathematically more specific. Before we can move on, we need to clarify what we mean by “objects with a certain degree of consistency” and “similar objects”.

First we shall assume that our objects are points in {\mathbb{R}}^{n}. For example, a grayscale digital image of 1 megapixel can be viewed as a point in {\mathbb{R}}^{n} with n=10^{6}. Our data set (training data set) is simply a collection of points in {\mathbb{R}}^{n}, which we denote by {\mathcal{X}}\subset{\mathbb{R}}^{n}. When we say that the objects in the data set {\mathcal{X}} have a certain degree of consistency, we mean that they are samples generated from a common probability distribution \mu on {\mathbb{R}}^{n}, which is often assumed to have a density function p(x). Of course, by assuming \mu to have a density function we are mathematically assuming that \mu is absolutely continuous. Some mathematicians may question the wisdom of this assumption by pointing out that it is possible (in fact even likely) that the objects of interest lie on a lower dimensional manifold, making \mu a singular probability distribution. For example, consider the MNIST data set of handwritten digits. While they are 28\times 28 images (so n=784), the data points may actually lie on a manifold of much smaller dimension (say only 20 or so). This is a valid criticism. Indeed, when the intrinsic dimension of the distribution is far smaller than the ambient dimension, various problems can arise, such as failure to converge or the so-called mode collapse, leading to poor results in some cases. Still, in most applications this assumption does seem to work well. Furthermore, we shall show that the requirement of absolute continuity is not critical to the GAN framework and can in fact be relaxed.

Quantifying “similar objects” is a bit trickier and holds the key to GANs. There are many ways in mathematics to measure similarity. For example, we may define a distance function and call two points {\mathbf{x}},{\mathbf{y}} “similar” if the distance between them is small. But this idea is not useful here. Our objective is not to generate objects that have small distances to some existing objects in {\mathcal{X}}. Rather, we want to generate new objects that may not be close, in whatever distance measure we use, to any existing object in the training data set {\mathcal{X}}, but that we feel belong to the same class. A good analogy: suppose we have a data set of Van Gogh paintings. We do not care to generate a painting that is a perturbation of Van Gogh’s Starry Night. Instead we would like to generate a painting that a Van Gogh expert will see as a new Van Gogh painting she has never seen before.

A better angle, at least from the perspective of GANs, is to define similarity in the sense of probability distribution. Two data sets are considered similar if they are samples from the same (or approximately same) probability distribution. Thus more specifically we have our training data set 𝒳n{\mathcal{X}}\subset{\mathbb{R}}^{n} consisting of samples from a probability distribution μ\mu (with density p(𝐱)p({\mathbf{x}})), and we would like to find a probability distribution ν\nu (with density q(𝐱)q({\mathbf{x}})) such that ν\nu is a good approximation of μ\mu. By taking samples from the distribution ν\nu we obtain generated objects that are “similar” to the objects in 𝒳{\mathcal{X}}.

One may wonder why we don’t simply set \nu=\mu and take samples from \mu. Wouldn’t that give us a perfect solution? Indeed it would, if we knew what \mu is. Unfortunately that is exactly our main problem: we don’t. All we know is a finite set of samples {\mathcal{X}} drawn from the distribution \mu. Hence our real challenge is to learn the distribution \mu from only a finite set of samples drawn from it. We should view finding \nu as the process of approximating \mu. GANs do seem to provide a novel and highly effective way of achieving this goal. In general the success of a GAN will depend on the complexity of the distribution \mu and the size of the training data set {\mathcal{X}}. In some cases the cardinality |{\mathcal{X}}|=N can be quite large, e.g. for the ImageNet data set N is well over 10^{7}. But in other cases, such as Van Gogh paintings, the size N is rather small, on the order of 100.

1.2. The Basic Approach of GAN

To approximate μ\mu, the vanilla GAN and subsequently other GANs start with an initial probability distribution γ\gamma defined on d{\mathbb{R}}^{d}, where dd may or may not be the same as nn. For the time being we shall set γ\gamma to be the standard normal distribution N(0,Id)N(0,I_{d}), although we certainly can choose γ\gamma to be other distributions. The technique GANs employ is to find a mapping (function) G:dnG:~{}{\mathbb{R}}^{d}{\longrightarrow}{\mathbb{R}}^{n} such that if a random variable 𝐳d{\mathbf{z}}\in{\mathbb{R}}^{d} has distribution γ\gamma then G(𝐳)G({\mathbf{z}}) has distribution μ\mu. Note that the distribution of G(𝐳)G({\mathbf{z}}) is γG1\gamma\circ G^{-1}, where G1G^{-1} maps subsets of n{\mathbb{R}}^{n} to subsets of d{\mathbb{R}}^{d}. Thus we are looking for a G(𝐳)G({\mathbf{z}}) such that γG1=μ\gamma\circ G^{-1}=\mu, or at least is a good approximation of μ\mu. Sounds simple, right?

Actually several key issues remain to be addressed. One issue is that we only have samples from μ\mu, and if we know GG we can have samples G(𝐳)G({\mathbf{z}}) where 𝐳{\mathbf{z}} is drawn from the distribution γ\gamma. How do we know from these samples that our distribution γG1\gamma\circ G^{-1} is the same or a good approximation of μ\mu? Assuming we have ways to do so, we still have the issue of finding G(𝐳)G({\mathbf{z}}).

The approach taken by the vanilla GAN is to form an adversarial system from which GG continues to receive updates to improve its performance. More precisely it introduces a “discriminator function” D(x)D(x), which tries to dismiss the samples generated by GG as fakes. The discriminator D(x)D(x) is simply a classifier that tries to distinguish samples in the training set 𝒳{\mathcal{X}} (real samples) from the generated samples G(𝐳)G({\mathbf{z}}) (fake samples). It assigns to each sample 𝐱{\mathbf{x}} a probability D(𝐱)[0,1]D({\mathbf{x}})\in[0,1] for its likelihood to be from the same distribution as the training samples. When samples G(𝐳j)G({\mathbf{z}}_{j}) are generated by GG, the discriminator DD tries to reject them as fakes. In the beginning this shouldn’t be hard because the generator GG is not very good. But each time GG fails to generate samples to fool DD, it will learn and adjust with an improvement update. The improved GG will perform better, and now it is the discriminator DD’s turn to update itself for improvement. Through this adversarial iterative process an equilibrium is eventually reached, so that even with the best discriminator DD it can do no better than random guess. At such point, the generated samples should be very similar in distribution to the training samples 𝒳{\mathcal{X}}.

So one may ask: what do neural networks and deep learning have to do with all this? The answer is that we basically have the fundamental faith that deep neural networks can approximate just about any function, through proper tuning of the network parameters using the training data. In particular, neural networks excel at classification problems. Not surprisingly, for a GAN we shall model both the discriminator function D and the generator function G as neural networks with parameters \omega and \theta, respectively. Thus we shall more precisely write D(x) as D_{\omega}(x) and G(z) as G_{\theta}(z), and denote \nu_{\theta}:=\gamma\circ G^{-1}_{\theta}. Our objective is to find the desired G_{\theta}({\mathbf{z}}) by properly tuning \theta.
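To make the setup concrete, here is a minimal sketch (not from [4]) of how G_{\theta} and D_{\omega} might be parametrized as small fully connected networks in PyTorch. The dimensions n, d and all layer sizes are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

n, d = 784, 100  # data dimension and latent dimension (illustrative values)

# Generator G_theta : R^d -> R^n
G = nn.Sequential(
    nn.Linear(d, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, n), nn.Tanh(),      # outputs scaled to [-1, 1]
)

# Discriminator D_omega : R^n -> [0, 1]
D = nn.Sequential(
    nn.Linear(n, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),   # probability that the input is a "real" sample
)

z = torch.randn(64, d)   # a minibatch of z ~ gamma = N(0, I_d)
x_fake = G(z)            # generated samples G_theta(z) in R^n
scores = D(x_fake)       # D_omega(G_theta(z)) in [0, 1]
```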

2. Mathematical Formulation of the Vanilla GAN

The adversarial game described in the previous section can be formulated mathematically by minimax of a target function between the discriminator function D(x):n[0,1]D(x):{\mathbb{R}}^{n}{\longrightarrow}[0,1] and the generator function G:dnG:{\mathbb{R}}^{d}{\longrightarrow}{\mathbb{R}}^{n}. The generator GG turns random samples 𝐳d{\mathbf{z}}\in{\mathbb{R}}^{d} from distribution γ\gamma into generated samples G(𝐳)G({\mathbf{z}}). The discriminator DD tries to tell them apart from the training samples coming from the distribution μ\mu, while GG tries to make the generated samples as similar in distribution to the training samples. In [4] a target loss function is proposed to be

(2.1) V(D,G):={\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\log D({\mathbf{x}})]+{\mathbb{E}}_{{\mathbf{z}}\sim\gamma}[\log(1-D(G({\mathbf{z}})))],

where 𝔼{\mathbb{E}} denotes the expectation with respect to a distribution specified in the subscript. When there is no confusion we may drop the subscript. The vanilla GAN solves the minimax problem

(2.2) \min_{G}\max_{D}\,V(D,G):=\min_{G}\max_{D}\,\Bigl({\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\log D({\mathbf{x}})]+{\mathbb{E}}_{{\mathbf{z}}\sim\gamma}[\log(1-D(G({\mathbf{z}})))]\Bigr).

Intuitively, for a given generator GG, maxDV(D,G)\max_{D}\,V(D,G) optimizes the discriminator DD to reject generated samples G(𝐳)G({\mathbf{z}}) by attempting to assign high values to samples from the distribution μ\mu and low values to generated samples G(𝐳)G({\mathbf{z}}). Conversely, for a given discriminator DD, minGV(D,G)\min_{G}\,V(D,G) optimizes GG so that the generated samples G(𝐳)G({\mathbf{z}}) will attempt to “fool” the discriminator DD into assigning high values.

Now set 𝐲=G(𝐳)n{\mathbf{y}}=G({\mathbf{z}})\in{\mathbb{R}}^{n}, which has distribution ν:=γG1\nu:=\gamma\circ G^{-1} as 𝐳d{\mathbf{z}}\in{\mathbb{R}}^{d} has distribution γ\gamma. We can now rewrite V(D,G)V(D,G) in terms of DD and ν\nu as

\tilde{V}(D,\nu):=V(D,G)={\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\log D({\mathbf{x}})]+{\mathbb{E}}_{{\mathbf{z}}\sim\gamma}[\log(1-D(G({\mathbf{z}})))]
={\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\log D({\mathbf{x}})]+{\mathbb{E}}_{{\mathbf{y}}\sim\nu}[\log(1-D({\mathbf{y}}))]
(2.3) =\int_{{\mathbb{R}}^{n}}\log D(x)\,d\mu(x)+\int_{{\mathbb{R}}^{n}}\log(1-D(y))\,d\nu(y).

The minimax problem (2.2) becomes

(2.4) minGmaxDV(D,G)=minGmaxD(nlogD(x)𝑑μ(x)+nlog(1D(y))𝑑ν(y)).\min_{G}\max_{D}\,V(D,G)=\min_{G}\max_{D}\,\Bigl{(}\int_{{\mathbb{R}}^{n}}\log D(x)\,d\mu(x)+\int_{{\mathbb{R}}^{n}}\log(1-D(y))\,d\nu(y)\Bigr{)}.

Assume that \mu has density p(x) and \nu has density function q(x) (which of course can only happen if d\geq n). Then

(2.5) V(D,\nu)=\int_{{\mathbb{R}}^{n}}\Bigl(\log D(x)p(x)+\log(1-D(x))q(x)\Bigr)\,dx.

The minimax problem (2.2) can now be written as

(2.6) \min_{G}\max_{D}\,V(D,G)=\min_{G}\max_{D}\,\int_{{\mathbb{R}}^{n}}\Bigl(\log D(x)p(x)+\log(1-D(x))q(x)\Bigr)\,dx.

Observe that the above is equivalent to minνmaxDV~(D,ν)\min_{\nu}\max_{D}\tilde{V}(D,\nu) under the constraint that ν=γG1\nu=\gamma\circ G^{-1} for some GG. But to better understand the minimax problem it helps to examine minνmaxDV~(D,ν)\min_{\nu}\max_{D}\tilde{V}(D,\nu) without this constraint. For the case where μ,ν\mu,\nu have densities [4] has established the following results:

Proposition 2.1 ([4]).

Given probability distributions μ\mu and ν\nu on n{\mathbb{R}}^{n} with densities p(x)p(x) and q(x)q(x) respectively,

maxDV(D,ν)=maxDn(logD(x)p(x)+log(1D(x))q(x))𝑑x\max_{D}\,V(D,\nu)=\max_{D}\,\int_{{\mathbb{R}}^{n}}\Bigl{(}\log D(x)p(x)+\log(1-D(x))q(x)\Bigr{)}\,dx

is attained by Dp,q(x)=p(x)p(x)+q(x)D_{p,q}(x)=\frac{p(x)}{p(x)+q(x)} for xsupp(μ)supp(ν)x\in{\rm supp}(\mu)\cup{\rm supp}(\nu).

The above proposition leads to

Theorem 2.2 ([4]).

Let p(x)p(x) be a probability density function on n{\mathbb{R}}^{n}. For probability distribution ν\nu with density function q(x)q(x) and D:n[0,1]D:{\mathbb{R}}^{n}{\longrightarrow}[0,1] consider the minimax problem

(2.7) minνmaxDV~(D,ν)=minνmaxDn(logD(x)p(x)+log(1D(x))q(x))𝑑x.\min_{\nu}\max_{D}\,\tilde{V}(D,\nu)=\min_{\nu}\max_{D}\,\int_{{\mathbb{R}}^{n}}\Bigl{(}\log D(x)p(x)+\log(1-D(x))q(x)\Bigr{)}\,dx.

Then the solution is attained with q(x)=p(x)q(x)=p(x) and D(x)=1/2D(x)=1/2 for all xsupp(p)x\in{\rm supp}(p).

Theorem 2.2 says the solution to the minimax problem (2.7) is exactly what we are looking for, under the assumption that the distributions have densities. As discussed earlier, this assumption ignores the possibility that the distribution of interest lies on a lower dimensional manifold and thus has no density function. Fortunately, the theorem actually holds in the general setting for arbitrary distributions. We have:

Theorem 2.3.

Let μ\mu be a given probability distribution on n{\mathbb{R}}^{n}. For probability distribution ν\nu and function D:n[0,1]D:{\mathbb{R}}^{n}{\longrightarrow}[0,1] consider the minimax problem

(2.8) minνmaxDV~(D,ν)=minνmaxDn(logD(x)dμ(x)+log(1D(x))dν(x)).\min_{\nu}\max_{D}\,\tilde{V}(D,\nu)=\min_{\nu}\max_{D}\,\int_{{\mathbb{R}}^{n}}\Bigl{(}\log D(x)\,d\mu(x)+\log(1-D(x))\,d\nu(x)\Bigr{)}.

Then the solution is attained with ν=μ\nu=\mu and D(x)=12D(x)=\frac{1}{2} μ\mu-almost everywhere.

Proof.  The proof follows directly from Theorem 3.7 and the discussion in Subsection 3.5, Example 2.  

Like many minimax problems, one may use the alternating optimization algorithm to solve (2.7), which alternates the updating of DD and qq (hence GG). An updating cycle consists of first updating DD for a given qq, and then updating qq with the new DD. This cycle is repeated until we reach an equilibrium. The following is given in [4]:

Proposition 2.4 ([4]).

If in each cycle the discriminator D is allowed to reach its optimum given q(x), and q(x) is then updated so as to improve the minimization criterion

\min_{q}\,\int_{{\mathbb{R}}^{n}}\Bigl(\log D(x)p(x)+\log(1-D(x))q(x)\Bigr)\,dx,

then q converges to p.

Here I have changed the wording a little from the original statement, but have kept its essence intact. From a purely mathematical angle this proposition is not rigorous. However, it provides a practical framework for solving the vanilla GAN minimax problem: in each cycle we may first optimize the discriminator D(x) all the way for the current q(x), and then update q(x) a little bit given the new D(x). Repeating this cycle will lead us to the desired solution. In practice, however, we rarely optimize D all the way for a given G; instead we usually update D a little bit before switching to updating G.

Note that the unconstrained minimax problem (2.7) and (2.8) are not the same as the original minimax problem (2.2) or the equivalent formulation (2.3), where ν\nu is constrained to be of the form ν=γG1\nu=\gamma\circ G^{-1}. Nevertheless it is reasonable in practice to assume that (2.2) and (2.3) will exhibit similar properties as those being shown in Theorem 2.3 and Proposition 2.4. In fact, we shall assume the same even after we further restrict the discriminator and generator functions to be neural networks D=DωD=D_{\omega} and G=GθG=G_{\theta} as intended. Set νθ=γGθ1\nu_{\theta}=\gamma\circ G_{\theta}^{-1}. Under this model our minimax problem has become minθmaxωV(Dω,Gθ)\min_{\theta}\max_{\omega}\,V(D_{\omega},G_{\theta}) where

(2.9) V(D_{\omega},G_{\theta})={\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\log D_{\omega}({\mathbf{x}})]+{\mathbb{E}}_{{\mathbf{z}}\sim\gamma}[\log(1-D_{\omega}(G_{\theta}({\mathbf{z}})))]
(2.10) =\int_{{\mathbb{R}}^{n}}\Bigl(\log D_{\omega}(x)\,d\mu(x)+\log(1-D_{\omega}(x))\,d\nu_{\theta}(x)\Bigr).

Equation (2.9) is the key to carrying out the actual optimization: since we do not have the explicit expression for the target distribution μ\mu, we shall approximate the expectations through sample averages. Thus (2.9) allows us to approximate V(Dω,Gθ)V(D_{\omega},G_{\theta}) using samples. More specifically, let 𝒜{\mathcal{A}} be a subset of samples from the training data set 𝒳{\mathcal{X}} (a minibatch) and {\mathcal{B}} be a minibatch of samples in d{\mathbb{R}}^{d} drawn from the distribution γ\gamma. Then we do the approximation

(2.11) {\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\log D_{\omega}({\mathbf{x}})]\approx\frac{1}{|{\mathcal{A}}|}\sum_{{\mathbf{x}}\in{\mathcal{A}}}\log D_{\omega}({\mathbf{x}})
(2.12) {\mathbb{E}}_{{\mathbf{z}}\sim\gamma}[\log(1-D_{\omega}(G_{\theta}({\mathbf{z}})))]\approx\frac{1}{|{\mathcal{B}}|}\sum_{{\mathbf{z}}\in{\mathcal{B}}}\log(1-D_{\omega}(G_{\theta}({\mathbf{z}}))).

The following algorithm for the vanilla GAN was presented in [4]:

  Vanilla GAN Algorithm Minibatch stochastic gradient descent training of generative adversarial nets. The number of steps to apply to the discriminator, kk, is a hyperparameter. k=1k=1, the least expensive option, was used in the experiments in [4].
 

for number of training iterations do
for k steps do

  • Sample minibatch of m samples \{{\mathbf{z}}_{1},\dots,{\mathbf{z}}_{m}\} in {\mathbb{R}}^{d} from the distribution \gamma.

  • Sample minibatch of m samples \{{\mathbf{x}}_{1},\dots,{\mathbf{x}}_{m}\}\subset{\mathcal{X}} from the training set {\mathcal{X}}.

  • Update the discriminator D_{\omega} by ascending its stochastic gradient with respect to \omega:

    \nabla_{\omega}\frac{1}{m}\sum_{i=1}^{m}\Bigl[\log D_{\omega}({\mathbf{x}}_{i})+\log(1-D_{\omega}(G_{\theta}({\mathbf{z}}_{i})))\Bigr].

end for

  • Sample minibatch of m samples \{{\mathbf{z}}_{1},\dots,{\mathbf{z}}_{m}\} in {\mathbb{R}}^{d} from the distribution \gamma.

  • Update the generator G_{\theta} by descending its stochastic gradient with respect to \theta:

    \nabla_{\theta}\frac{1}{m}\sum_{i=1}^{m}\log(1-D_{\omega}(G_{\theta}({\mathbf{z}}_{i}))).

end for

The gradient-based updates can use any standard gradient-based learning rule. The paper used momentum in their experiments.
 

Proposition 2.4 serves as a heuristic justification for the convergence of the algorithm. One problem often encountered with the Vanilla GAN Algorithm is that updating G_{\theta} by minimizing {\mathbb{E}}_{{\mathbf{z}}\sim\gamma}[\log(1-D_{\omega}(G_{\theta}({\mathbf{z}})))] may saturate early. So instead the authors substituted it with minimizing -{\mathbb{E}}_{{\mathbf{z}}\sim\gamma}[\log D_{\omega}(G_{\theta}({\mathbf{z}}))]. This is the well-known “\log D trick”, and it seems to offer superior performance. We shall examine this more closely later on.
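The following is a minimal sketch of one training cycle of the algorithm above (with k=1) in PyTorch, reusing the hypothetical networks G and D, the latent dimension d, and a data loader dataloader as placeholders; the optimizer and learning rate are illustrative, not those of [4]. The commented line shows the non-saturating “\log D” generator loss as an alternative.

```python
import torch

opt_D = torch.optim.SGD(D.parameters(), lr=1e-3, momentum=0.9)
opt_G = torch.optim.SGD(G.parameters(), lr=1e-3, momentum=0.9)
eps = 1e-8  # numerical safeguard inside the logarithms

for x_real in dataloader:                      # minibatch x_1,...,x_m from the training set X
    m = x_real.size(0)

    # Discriminator step: ascend V(D, G) in omega (implemented as descending -V).
    z = torch.randn(m, d)
    x_fake = G(z).detach()                     # treat G as fixed while updating D
    loss_D = -(torch.log(D(x_real) + eps).mean()
               + torch.log(1 - D(x_fake) + eps).mean())
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator step: descend E_z[log(1 - D(G(z)))] in theta.
    z = torch.randn(m, d)
    loss_G = torch.log(1 - D(G(z)) + eps).mean()
    # Non-saturating "log D" alternative:
    # loss_G = -torch.log(D(G(z)) + eps).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```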

3. ff-Divergence and ff-GAN

Recall that the motivating problem for GAN is that we have a probability distribution μ\mu known only in the form of a finite set of samples (training samples). We would like to learn this target distribution through iterative improvement. Starting with a probability distribution ν\nu we iteratively update ν\nu so it gets closer and closer to the target distribution μ\mu. Of course to do so we will first need a way to measure the discrepancy between two probability distributions. The vanilla GAN has employed a discriminator for this purpose. But there are other ways.

3.1. ff-Divergence

One way to measure the discrepancy between two probability distributions μ\mu and ν\nu is through the Kullback-Leibler divergence, or KL divergence. Let p(x)p(x) and q(x)q(x) be two probability density functions defined on n{\mathbb{R}}^{n}. The KL-divergence of pp and qq is defined as

D_{KL}(p\|q):=\int_{{\mathbb{R}}^{n}}\log\Bigl(\frac{p(x)}{q(x)}\Bigr)p(x)\,dx.

Note that DKL(pq)D_{KL}(p\|q) is finite only if q(x)0q(x)\neq 0 on supp(p){\rm supp}(p) almost everywhere. While KL-divergence is widely used, there are other divergences such as the Jensen-Shannon divergence

D_{JS}(p\|q):=\frac{1}{2}D_{KL}(p\|M)+\frac{1}{2}D_{KL}(q\|M),

where M:=\frac{p(x)+q(x)}{2}. One advantage of the Jensen-Shannon divergence is that it is well defined for any probability density functions p(x) and q(x), and it is symmetric: D_{JS}(p\|q)=D_{JS}(q\|p). In fact, it follows from Proposition 2.1 that the minimization part of the minimax problem in the vanilla GAN is precisely the minimization over q of D_{JS}(p\|q) for a given density function p. As it turns out, both D_{KL} and D_{JS} are special cases of the more general f-divergence, introduced by Ali and Silvey [1].
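As a small numerical illustration (purely illustrative, not part of [1]), the sketch below evaluates both divergences for discrete distributions, where the integrals become sums; it shows in particular that D_{KL} can be infinite while D_{JS} stays finite.

```python
import numpy as np

def kl(p, q):
    # D_KL(p || q) = sum_i p_i * log(p_i / q_i), with the convention 0 * log 0 = 0
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.4, 0.4, 0.2])

print(kl(p, q))   # finite, since q > 0 wherever p > 0
print(kl(q, p))   # inf (with a numpy warning): q has mass where p has none
print(js(p, q))   # always finite, and symmetric in p and q
```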

Let f(x)f(x) be a strictly convex function with domain II\subseteq{\mathbb{R}} such that f(1)=0f(1)=0. Throughout this paper we shall adopt the convention that f(𝐱)=+f({\mathbf{x}})=+\infty for all 𝐱I{\mathbf{x}}\not\in I.

Definition 3.1.

Let p(x)p(x) and q(x)q(x) be two probability density functions on n{\mathbb{R}}^{n}. Then the ff-divergence of pp and qq is defined as

(3.1) D_{f}(p\|q):={\mathbb{E}}_{{\mathbf{x}}\sim q}\Bigl[f\Bigl(\frac{p({\mathbf{x}})}{q({\mathbf{x}})}\Bigr)\Bigr]=\int_{{\mathbb{R}}^{n}}f\Bigl(\frac{p(x)}{q(x)}\Bigr)q(x)\,dx,

where we adopt the convention that f\bigl(\frac{p(x)}{q(x)}\bigr)q(x)=0 if q(x)=0.

Remark.  Because the ff-divergence is not symmetric in the sense that Df(pq)Df(qp)D_{f}(p\|q)\neq D_{f}(q\|p) in general, there might be some confusion as to which divides which in the fraction. If we follow the original Ali and Silvey paper [1] then the definition of Df(pq)D_{f}(p\|q) would be our Df(qp)D_{f}(q\|p). Here we adopt the same definition as in the paper [9], which first introduced the concept of ff-GAN.

Proposition 3.1.

Let f(x)f(x) be a strictly convex function on domain II\subseteq{\mathbb{R}} such that f(1)=0f(1)=0. Assume either supp(p)supp(q){\rm supp}(p)\subseteq{\rm supp}(q) (equivalent to pqp\ll q) or f(t)>0f(t)>0 for t[0,1)t\in[0,1). Then Df(pq)0D_{f}(p\|q)\geq 0, and Df(pq)=0D_{f}(p\|q)=0 if and only if p(x)=q(x)p(x)=q(x).

Proof.  By the convexity of ff and Jensen’s Inequality

D_{f}(p\|q)={\mathbb{E}}_{{\mathbf{x}}\sim q}\Bigl[f\Bigl(\frac{p({\mathbf{x}})}{q({\mathbf{x}})}\Bigr)\Bigr]\geq f\Bigl({\mathbb{E}}_{{\mathbf{x}}\sim q}\Bigl[\frac{p({\mathbf{x}})}{q({\mathbf{x}})}\Bigr]\Bigr)=f\Bigl(\int_{{\rm supp}(q)}p(x)\,dx\Bigr)=:f(r),

where the equality holds if and only if p(x)/q(x) is a constant or f is linear on the range of p(x)/q(x). Since f is strictly convex, it can only be the former. Thus for the equality to hold we must have p(x)=rq(x) on {\rm supp}(q).

Now clearly r\leq 1. If {\rm supp}(p)\subseteq{\rm supp}(q) then r=1, and we have D_{f}(p\|q)\geq f(1)=0, with equality if and only if p=q. If f(t)>0 for all t\in[0,1) then we also have D_{f}(p\|q)\geq f(r)\geq 0, and for r<1 we have D_{f}(p\|q)\geq f(r)>0. Thus if D_{f}(p\|q)=0 we must have r=1 and p=q.  

It should be noted that ff-divergence can be defined for two arbitrary probability measures μ\mu and ν\nu on a probability space Ω\Omega. Let τ\tau be another probability measure such that μ,ντ\mu,\nu\ll\tau, namely both μ,ν\mu,\nu are absolutely continuous with respect to τ\tau. For example, we may take τ=12(μ+ν)\tau=\frac{1}{2}(\mu+\nu). Let p=dμ/dτp=d\mu/d\tau and q=dν/dτq=d\nu/d\tau be their Radon-Nikodym derivatives. We define the ff-divergence of μ\mu and ν\nu as

(3.2) D_{f}(\mu\|\nu):=\int_{\Omega}f\Bigl(\frac{p(x)}{q(x)}\Bigr)q(x)\,d\tau={\mathbb{E}}_{{\mathbf{x}}\sim\nu}\Bigl[f\Bigl(\frac{p({\mathbf{x}})}{q({\mathbf{x}})}\Bigr)\Bigr].

Again we adopt the convention that f(p(x)q(x))q(x)=0f(\frac{p(x)}{q(x)})q(x)=0 if q(x)=0q(x)=0. It is not hard to show that this definition is independent of the choice for the probability measure τ\tau, and Proposition 3.1 holds for the more general Df(μν)D_{f}(\mu\|\nu) as well.

With ff-divergence measuring the discrepancy between two measures, we can now consider applying it to GANs. The biggest challenge here is that we don’t have an explicit expression for the target distribution μ\mu. As with the vanilla GAN, to compute Df(pq)D_{f}(p\|q) we must express it in terms of sample averages. Fortunately earlier work by Nguyen, Wainwright and Jordan [8] has already tackled this problem using the convex conjugate of a convex function.

3.2. Convex Conjugate of a Convex Function

The convex conjugate of a convex function f(x)f(x) is also known as the Fenchel transform or Fenchel-Legendre transform of ff, which is a generalization of the well known Legendre transform. Let f(x)f(x) be a convex function defined on an interval II\subseteq{\mathbb{R}}. Then its convex conjugate f:{±}f^{*}:{\mathbb{R}}{\longrightarrow}{\mathbb{R}}\cup\{\pm\infty\} is defined to be

(3.3) f^{*}(y)=\sup_{t\in I}\,\{ty-f(t)\}.

As mentioned earlier, we adopt the convention that f(t)=+\infty for t\not\in I, so the supremum may equivalently be taken over all of {\mathbb{R}}. Below is a more explicit expression for f^{*}(y).

Lemma 3.2.

Assume that f(x)f(x) is strictly convex and continuously differentiable on its domain II\subseteq{\mathbb{R}}, where Io=(a,b)I^{o}=(a,b) with a,b[,+]a,b\in[-\infty,+\infty]. Then

(3.4) f^{*}(y)=\begin{cases}yf^{\prime-1}(y)-f(f^{\prime-1}(y)),&y\in f^{\prime}(I^{o})\\ \lim_{t\rightarrow b^{-}}(ty-f(t)),&y\geq\lim_{t\rightarrow b^{-}}f^{\prime}(t)\\ \lim_{t\rightarrow a^{+}}(ty-f(t)),&y\leq\lim_{t\rightarrow a^{+}}f^{\prime}(t).\end{cases}

Proof.  Let g(t)=ty-f(t). Then g^{\prime}(t)=y-f^{\prime}(t) on I, which is strictly decreasing by the strict convexity of f(t). Hence g(t) is strictly concave on I. If y=f^{\prime}(t^{*}) for some t^{*}\in I^{o} then t^{*} is a critical point of g, so it must be its global maximum. Thus g(t) attains its maximum at t=t^{*}=f^{\prime-1}(y). Now assume y is not in the range of f^{\prime}; then g^{\prime}(t)>0 or g^{\prime}(t)<0 on I^{o}. Consider the case g^{\prime}(t)>0 for all t\in I^{o}. Clearly the supremum of g(t) is achieved as t{\rightarrow}b^{-} since g(t) is monotonically increasing. The case g^{\prime}(t)<0 for all t\in I^{o} is handled similarly.  

Remark.  Note that ++\infty is a possible value for ff^{*}. The domain Dom(f){\rm Dom}\,(f^{*}) for ff^{*} is defined as the set on which ff^{*} is finite.

A consequence of Lemma 3.2 is that under the assumption that ff is continuously differentiable, suptI{tyf(t)}\sup_{t\in I}\,\{ty-f(t)\} is attained for some tIt\in I if and only if yy is in the range of f(t)f^{\prime}(t). This is clear if yf(Io)y\in f^{\prime}(I^{o}), but it can also be argued rather easily for finite boundary points of II. More generally without the assumption of differentiability, suptI{tyf(t)}\sup_{t\in I}\,\{ty-f(t)\} is attained if and only if yf(t)y\in\partial f(t) for some tIt\in I, where f(t)\partial f(t) is the set of sub-derivatives. The following proposition summarizes some important properties of convex conjugate:

Proposition 3.3.

Let f(x) be a convex function on {\mathbb{R}} with range in {\mathbb{R}}\cup\{\pm\infty\}. Then f^{*} is convex and lower semi-continuous. Furthermore, if f is lower semi-continuous then it satisfies the Fenchel duality f=(f^{*})^{*}.

Proof.  This is a well known result. We omit the proof here.  

The table below lists the convex dual of some common convex functions:

f(x) | f^{*}(y)
f(x)=-\ln(x),~x>0 | f^{*}(y)=-1-\ln(-y),~y<0
f(x)=e^{x} | f^{*}(y)=y\ln(y)-y,~y>0
f(x)=x^{2} | f^{*}(y)=\frac{1}{4}y^{2}
f(x)=\sqrt{1+x^{2}} | f^{*}(y)=-\sqrt{1-y^{2}},~y\in[-1,1]
f(x)=0,~x\in[0,1] | f^{*}(y)=\mathrm{ReLU}(y)
f(x)=g(ax-b),~a\neq 0 | f^{*}(y)=\frac{b}{a}y+g^{*}(\frac{y}{a})
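The entries above are easy to verify from (3.3); as an illustration, the following small sketch checks the first row numerically by brute-force maximization of ty-f(t) over a grid (all numbers are illustrative).

```python
import numpy as np

def conjugate(f, y, ts):
    # f*(y) = sup_t { t*y - f(t) }, approximated by a maximum over a grid of t values
    return np.max(ts * y - f(ts))

ts = np.linspace(1e-4, 50.0, 200000)   # grid covering the domain t > 0
f = lambda t: -np.log(t)

for y in [-0.5, -1.0, -2.0]:           # f*(y) is finite only for y < 0
    print(y, conjugate(f, y, ts), -1.0 - np.log(-y))   # numerical vs closed form
```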

3.3. Estimating ff-Divergence Using Convex Dual

To estimate the f-divergence from samples, Nguyen, Wainwright and Jordan [8] have proposed the use of the convex dual of f. Let \mu,\nu be probability measures such that \mu,\nu\ll\tau for some probability measure \tau, with p=d\mu/d\tau and q=d\nu/d\tau. In the nice case of \mu\ll\nu, by f(x)=(f^{*})^{*}(x) we have

D_{f}(\mu\|\nu):=\int_{\Omega}f\Bigl(\frac{p(x)}{q(x)}\Bigr)q(x)\,d\tau
(3.5) =\int_{\Omega}\sup_{t}\Bigl\{t\,\frac{p(x)}{q(x)}-f^{*}(t)\Bigr\}q(x)\,d\tau(x)
(3.6) =\int_{\Omega}\sup_{t}\,\Bigl\{tp(x)-f^{*}(t)q(x)\Bigr\}\,d\tau(x)
\geq\int_{\Omega}\Bigl(T(x)p(x)-f^{*}(T(x))q(x)\Bigr)\,d\tau(x)
={\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[f^{*}(T({\mathbf{x}}))]

where T(x)T(x) is any Borel function. Thus taking TT over all Borel functions we have

(3.7) D_{f}(\mu\|\nu)\geq\sup_{T}\Bigl({\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[f^{*}(T({\mathbf{x}}))]\Bigr).

On the other hand, note that for each xx, supt{tp(x)q(x)f(t)}\sup_{t}\bigl{\{}t\frac{p(x)}{q(x)}-f^{*}(t)\bigr{\}} is attained for some t=T(x)t=T^{*}(x) as long as p(x)q(x)\frac{p(x)}{q(x)} is in the range of sub-derivatives of ff^{*}. Thus if this holds for all xx we have

D_{f}(\mu\|\nu)={\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T^{*}({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[f^{*}(T^{*}({\mathbf{x}}))].

In fact, equality holds in general under mild conditions.

Theorem 3.4.

Let f(t)f(t) be strictly convex and continuously differentiable on II\subseteq{\mathbb{R}}. Let μ,ν\mu,\nu be Borel probability measures on n{\mathbb{R}}^{n} such that μν\mu\ll\nu. Then

(3.8) D_{f}(\mu\|\nu)=\sup_{T}\Bigl({\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[f^{*}(T({\mathbf{x}}))]\Bigr),

where \sup_{T} is taken over all Borel functions T:~{\mathbb{R}}^{n}{\longrightarrow}\,{\rm Dom}\,(f^{*}). Furthermore, write p(x)=d\mu/d\nu(x) and assume that p(x)\in I for all x. Then T^{*}(x):=f^{\prime}(p(x)) is an optimizer of (3.8).

Proof.  We have already established the upper bound part (3.7). Now we establish the lower bound part. Let p(x)=d\mu/d\nu(x). We examine (3.5) with q(x)=1 and \sup_{t}\,\{tp(x)-f^{*}(t)\} for each x. Denote g_{x}(t)=tp(x)-f^{*}(t). Let S={\rm Dom}\,(f^{*}) and assume S^{o}=(a,b) where a,b\in{\mathbb{R}}\cup\{\pm\infty\}. We now construct a sequence T_{k}(x) as follows. If p(x) is in the range of {f^{*}}^{\prime}, say p(x)={f^{*}}^{\prime}(t_{x}), we set T_{k}(x)=t_{x}\in S. If p(x)-{f^{*}}^{\prime}(t)>0 for all t then g_{x}(t) is strictly increasing; its supremum is attained at the boundary point b, and we set T_{k}(x)=b_{k}\in S where b_{k}{\rightarrow}b^{-}. If p(x)-{f^{*}}^{\prime}(t)<0 for all t then g_{x}(t) is strictly decreasing; its supremum is attained at the boundary point a, and we set T_{k}(x)=a_{k}\in S where a_{k}{\rightarrow}a^{+}. By Lemma 3.2 and its proof we know that

limk(Tk(x)p(x)f(Tk(x)))=supt{tp(x)f(t)}.\lim_{k{\rightarrow}\infty}\Bigl{(}T_{k}(x)p(x)-f^{*}(T_{k}(x))\Bigr{)}=\sup_{t}\,\Bigl{\{}tp(x)-f^{*}(t)\Bigr{\}}.

Thus

\lim_{k{\rightarrow}\infty}\Bigl({\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T_{k}({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[f^{*}(T_{k}({\mathbf{x}}))]\Bigr)=D_{f}(\mu\|\nu).

To establish the last part of the theorem, assume that p(x)Ip(x)\in I. By Lemma 3.2, set s(t)=f1(t)s(t)=f^{\prime-1}(t) for tt in the range of ff^{\prime} so we have

f(t)=(ts(t)f(s(t)))=s(t)+ts(t)f(s(t))s(t)=s(t).{f^{*}}^{\prime}(t)=\Bigl{(}ts(t)-f(s(t))\Bigr{)}^{\prime}=s(t)+ts^{\prime}(t)-f^{\prime}(s(t))s^{\prime}(t)=s(t).

Thus gx(t)=p(x)f(t)=p(x)f1(t)g_{x}^{\prime}(t)=p(x)-{f^{*}}^{\prime}(t)=p(x)-f^{\prime-1}(t). It follows that gx(t)g_{x}(t) attains its maximum at t=f(p(x))t=f^{\prime}(p(x)). This proves that T(x)=f(p(x))T^{*}(x)=f^{\prime}(p(x)) is an optimizer for (3.8).  
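As a concrete illustration of Theorem 3.4 (my example, not from [8]), take f(t)=t\ln t, for which D_{f}(\mu\|\nu)=\int\log(d\mu/d\nu)\,d\mu is the KL divergence under Definition 3.1, and let \mu=N(0,1), \nu=N(1,1). Then f^{*}(y)=e^{y-1}, the optimizer is T^{*}(x)=f^{\prime}(d\mu/d\nu(x))=\ln(d\mu/d\nu)(x)+1=\frac{3}{2}-x, and the closed-form divergence is \frac{1}{2}. The sketch below checks the variational value (3.8) by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200000

x_mu = rng.normal(0.0, 1.0, N)      # samples from mu = N(0, 1)
x_nu = rng.normal(1.0, 1.0, N)      # samples from nu = N(1, 1)

T_star = lambda x: 1.5 - x           # T*(x) = f'(d mu / d nu (x)) for f(t) = t*log(t)
f_conj = lambda y: np.exp(y - 1.0)   # f*(y) = exp(y - 1)

estimate = T_star(x_mu).mean() - f_conj(T_star(x_nu)).mean()
print(estimate)   # should be close to D_f(mu || nu) = 0.5
```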

The above theorem requires that μν\mu\ll\nu. What if this does not hold? We have

Theorem 3.5.

Let f(t)f(t) be convex such that the domain of ff^{*} contains (a,)(a,\infty) for some aa\in{\mathbb{R}}. Let μ,ν\mu,\nu be Borel probability measures on n{\mathbb{R}}^{n} such that μ≪̸ν\mu\not\ll\nu. Then

(3.9) supT(𝔼𝐱μ[T(𝐱)]𝔼𝐱ν[f(T(𝐱))])=+,\sup_{T}\Bigl{(}{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[f^{*}(T({\mathbf{x}}))]\Bigr{)}=+\infty,

where supT\sup_{T} is taken over all Borel functions T:nDom(f)T:~{}{\mathbb{R}}^{n}{\longrightarrow}\,{\rm Dom}\,(f^{*}).

Proof.  Take τ=12(μ+ν)\tau=\frac{1}{2}(\mu+\nu). Then μ,ντ\mu,\nu\ll\tau. Let p=dμ/dτp=d\mu/d\tau and q=dν/dτq=d\nu/d\tau be their Radon-Nikodym derivatives. Since μ≪̸ν\mu\not\ll\nu there exists a set S0S_{0} with μ(S0)>0\mu(S_{0})>0 on which q(x)=0q(x)=0. Fix a t0t_{0} in the domain of ff^{*}. Let Tk(x)=kT_{k}(x)=k for xS0x\in S_{0} and Tk(x)=t0T_{k}(x)=t_{0} otherwise. Then

{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T_{k}({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[f^{*}(T_{k}({\mathbf{x}}))]=k\mu(S_{0})+t_{0}(1-\mu(S_{0}))-f^{*}(t_{0})(1-\nu(S_{0})){\longrightarrow}+\infty.

This proves the theorem.  

As one can see, we clearly have a problem in the above case. If the domain of ff^{*} is not bounded from above, (3.8) does not hold unless μν\mu\ll\nu. In many practical applications the target distribution μ\mu might be singular, as the training data we are given may lie on a lower dimensional manifold. Fortunately there is still hope as given by the next theorem:

Theorem 3.6.

Let f(t)f(t) be a lower semi-continuous convex function such that the domain II^{*} of ff^{*} has supI=b<+\sup I^{*}=b^{*}<+\infty. Let μ,ν\mu,\nu be Borel probability measures on n{\mathbb{R}}^{n} such that μ=μs+μab\mu=\mu_{s}+\mu_{ab}, where μsν\mu_{s}\perp\nu and μabν\mu_{ab}\ll\nu. Then

(3.10) \sup_{T}\Bigl({\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[f^{*}(T({\mathbf{x}}))]\Bigr)=D_{f}(\mu\|\nu)+b^{*}\mu_{s}({\mathbb{R}}^{n}),

where supT\sup_{T} is taken over all Borel functions T:nDom(f)T:~{}{\mathbb{R}}^{n}{\longrightarrow}\,{\rm Dom}\,(f^{*}).

Proof.  Again, take \tau=\frac{1}{2}(\mu+\nu). Then \mu,\nu\ll\tau. The decomposition \mu=\mu_{ab}+\mu_{s}, where \mu_{ab}\ll\nu and \mu_{s}\perp\nu, is unique and guaranteed by the Lebesgue Decomposition Theorem. Let p_{ab}=d\mu_{ab}/d\tau, p_{s}=d\mu_{s}/d\tau and q=d\nu/d\tau be their Radon-Nikodym derivatives. Since \mu_{s}\perp\nu, we may divide {\mathbb{R}}^{n} into {\mathbb{R}}^{n}=\Omega\cup\Omega^{c} where \Omega={\rm supp}(q). Clearly we have q(x)=p_{ab}(x)=0 for x\in\Omega^{c}. Thus

\sup_{T}\Bigl({\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[f^{*}(T({\mathbf{x}}))]\Bigr)
=\sup_{T}\int_{\Omega}\Bigl(T(x)p_{ab}(x)-f^{*}(T(x))q(x)\Bigr)\,d\tau+\sup_{T}\int_{\Omega^{c}}T(x)p_{s}(x)\,d\tau
=\sup_{T}\int_{\Omega}\Bigl(T(x)\frac{p_{ab}(x)}{q(x)}-f^{*}(T(x))\Bigr)q(x)\,d\tau+b^{*}\mu_{s}(\Omega^{c})
=\int_{\Omega}f\Bigl(\frac{p_{ab}(x)}{q(x)}\Bigr)q(x)\,d\tau+b^{*}\mu_{s}({\mathbb{R}}^{n})
=\int_{\Omega}f\Bigl(\frac{p(x)}{q(x)}\Bigr)q(x)\,d\tau+b^{*}\mu_{s}({\mathbb{R}}^{n})
=D_{f}(\mu\|\nu)+b^{*}\mu_{s}({\mathbb{R}}^{n}).

This proves the theorem.  

3.4. ff-GAN: Variational Divergence Minimization (VDM)

We can formulate a generalization of the vanilla GAN using ff-divergence. For a given probability distribution μ\mu, the ff-GAN objective is to minimize the ff-divergence Df(μν)D_{f}(\mu\|\nu) with respect to the probability distribution ν\nu. Carried out in the sample space, ff-GAN solves the following minimax problem

(3.11) \min_{\nu}\sup_{T}\,\Bigl({\mathbb{E}}_{{\mathbf{x}}\sim\nu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[f^{*}(T({\mathbf{x}}))]\Bigr).

The f-GAN framework was first introduced in [9], where the optimization problem (3.11) is referred to as Variational Divergence Minimization (VDM). Note that VDM looks similar to the minimax problem in the vanilla GAN. The Borel function T here is called a critic function, or just a critic. Under the assumption of \mu\ll\nu, by Theorem 3.4 this is equivalent to \min_{\nu}\,D_{f}(\mu\|\nu). One potential problem of f-GAN is that, by Theorem 3.5, if \mu\not\ll\nu then (3.11) is in general not equivalent to \min_{\nu}\,D_{f}(\mu\|\nu). Fortunately, for specially chosen f this is not a problem.

Theorem 3.7.

Let f(t)f(t) be a lower semi-continuous strictly convex function such that the domain II^{*} of ff^{*} has supI=b[0,)\sup I^{*}=b^{*}\in[0,\infty). Assume further that ff is continuously differentiable on its domain and f(t)>0f(t)>0 for t(0,1)t\in(0,1). Let μ\mu be Borel probability measures on n{\mathbb{R}}^{n}. Then ν=μ\nu=\mu is the unique optimizer of

minνsupT(𝔼𝐱ν[T(𝐱)]𝔼𝐱μ[f(T(𝐱))]),\min_{\nu}\sup_{T}\,\Bigl{(}{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[f^{*}(T({\mathbf{x}}))]\Bigr{)},

where supT\sup_{T} is taken over all Borel functions T:nDom(f)T:~{}{\mathbb{R}}^{n}{\longrightarrow}\,{\rm Dom}\,(f^{*}) and infν\inf_{\nu} is taken over all Borel probability measures.

Proof.  By Theorem 3.6, for any Borel probability measure \nu we have

\sup_{T}\Bigl({\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[f^{*}(T({\mathbf{x}}))]\Bigr)=D_{f}(\mu\|\nu)+b^{*}\mu_{s}({\mathbb{R}}^{n})\geq D_{f}(\mu\|\nu).

Now by Proposition 3.1, D_{f}(\mu\|\nu)\geq 0, and equality holds if and only if \nu=\mu. Thus \nu=\mu is the unique optimizer.  

3.5. Examples

We shall now look at some examples of ff-GAN for different choices of the convex function ff.

Example 1: f(t)=𝐥𝐧(t)f(t)=-\ln(t).

This is the KL-divergence. We have f(u)=1ln(u)f^{*}(u)=-1-\ln(-u) with domain I=(,0)I^{*}=(-\infty,0). ff satisfies all conditions of Theorem 3.7. The corresponding ff-GAN objective is

(3.12) minνsupT(𝔼𝐱ν[T(𝐱)]+𝔼𝐱μ[ln(T(𝐱))])+1,\min_{\nu}\sup_{T}\,\Bigl{(}{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[T({\mathbf{x}})]+{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\ln(-T({\mathbf{x}}))]\Bigr{)}+1,

where T(x)<0T(x)<0. If we ignore the constant +1+1 term and set D(x)=T(x)D(x)=-T(x) then we obtain the equivalent minimax problem

minνsupD>0(𝔼𝐱ν[D(𝐱)]+𝔼𝐱μ[ln(D(𝐱))]).\min_{\nu}\sup_{D>0}\,\Bigl{(}{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[-D({\mathbf{x}})]+{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\ln(D({\mathbf{x}}))]\Bigr{)}.

Example 2: f(t)=t\ln t-(t+1)\ln(t+1)+(t+1)\ln 2.

This is the Jensen-Shannon divergence. We have f(u)=ln(2eu)f^{*}(u)=-\ln(2-e^{u}) with domain I=(,ln2)I^{*}=(-\infty,\ln 2). Again ff satisfies all conditions of Theorem 3.7. The corresponding ff-GAN objective is

(3.13) minνsupT(𝔼𝐱ν[T(𝐱)]+𝔼𝐱μ[ln(2eT(𝐱))]),\min_{\nu}\sup_{T}\,\Bigl{(}{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[T({\mathbf{x}})]+{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\ln(2-e^{T({\mathbf{x}})})]\Bigr{)},

where T(x)<ln2T(x)<\ln 2. Set D(x)=112eT(𝐱)D(x)=1-\frac{1}{2}e^{T({\mathbf{x}})}, and so T(x)=ln(1D(x))+ln2T(x)=\ln(1-D(x))+\ln 2. Substituting in (3.13) yields

minνmaxD>0(𝔼𝐱μ[ln(D(𝐱))]+𝔼𝐱ν[ln(1D(𝐱))])+ln4.\min_{\nu}\max_{D>0}\,\Bigl{(}{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\ln(D({\mathbf{x}}))]+{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[\ln(1-D({\mathbf{x}}))]\Bigr{)}+\ln 4.

Ignoring the constant ln4\ln 4, the vanilla GAN is a special case of ff-GAN with ff being the Jensen-Shannon divergence.

Example 3: f(t)=\alpha|t-1| where \alpha>0.

Here we have f(u)=uf^{*}(u)=u with domain I=[α,α]I^{*}=[-\alpha,\alpha]. While ff is not strictly convex and continuously differentiable, it does satisfy the two important conditions of Theorem 3.7 of sup(I)0\sup(I^{*})\geq 0 and f(t)>0f(t)>0 for t[0,1)t\in[0,1). The corresponding ff-GAN objective is

(3.14) minνsup|T|α(𝔼𝐱ν[T(𝐱)]𝔼𝐱μ[T(𝐱)]).\min_{\nu}\sup_{|T|\leq\alpha}\,\Bigl{(}{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]\Bigr{)}.

For α=1\alpha=1, if we require TT to be continuous then the supremum part of (3.14) is precisely the total variation (also known as the Radon metric) between μ\mu and ν\nu, which is closely related to the Wasserstein distance between μ\mu and ν\nu.

Example 4: f(t)=(t𝟏)𝐥𝐧tt+𝟏,t>𝟎f(t)=(t-1)\ln\frac{t}{t+1},~{}t>0 (the “log D Trick”).

Here f′′(t)=3t+1t2(t+1)2>0f^{\prime\prime}(t)=\frac{3t+1}{t^{2}(t+1)^{2}}>0 so ff is strictly convex. It satisfies all conditions of Theorem 3.7. The explicit expression for the convex dual ff^{*} is complicated to write down, however we do know the domain II^{*} for ff^{*} is the range of f(t)f^{\prime}(t), which is (,0)(-\infty,0). The ff-GAN objective is

\min_{\nu}\sup_{T<0}\,\Bigl({\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[f^{*}(T({\mathbf{x}}))]\Bigr).

By Theorem 3.6 and the fact b=0b^{*}=0,

(3.15) supT<0(𝔼𝐱μ[T(𝐱)]𝔼𝐱ν[f(T(𝐱))])=Df(μν).\sup_{T<0}~{}\Bigl{(}{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[f^{*}(T({\mathbf{x}}))]\Bigr{)}=D_{f}(\mu\|\nu).

Take τ=12(μ+ν)\tau=\frac{1}{2}(\mu+\nu) so μ,ντ\mu,\nu\ll\tau. Let p=dμ/dτp=d\mu/d\tau and q=dν/dτq=d\nu/d\tau. Observe that

Df(μν)\displaystyle D_{f}(\mu\|\nu) =𝔼𝐱ν[f(p(𝐱)q(𝐱))]=𝔼𝐱ν[(p(𝐱)q(𝐱)1)lnp(𝐱)p(𝐱)+q(𝐱)]\displaystyle={\mathbb{E}}_{{\mathbf{x}}\sim\nu}\Bigl{[}f\Bigl{(}\frac{p({\mathbf{x}})}{q({\mathbf{x}})}\Bigr{)}\Bigr{]}={\mathbb{E}}_{{\mathbf{x}}\sim\nu}\Bigl{[}\Bigl{(}\frac{p({\mathbf{x}})}{q({\mathbf{x}})}-1\Bigr{)}\ln{\frac{p({\mathbf{x}})}{p({\mathbf{x}})+q({\mathbf{x}})}}\Bigr{]}
=𝔼𝐱μ[lnp(𝐱)p(𝐱)+q(𝐱)]𝔼𝐱ν[lnp(𝐱)p(𝐱)+q(𝐱)].\displaystyle={\mathbb{E}}_{{\mathbf{x}}\sim\mu}\Bigl{[}\ln\frac{p({\mathbf{x}})}{p({\mathbf{x}})+q({\mathbf{x}})}\Bigr{]}-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}\Bigl{[}\ln\frac{p({\mathbf{x}})}{p({\mathbf{x}})+q({\mathbf{x}})}\Bigr{]}.

Denote D(x)=p(𝐱)p(𝐱)+q(𝐱)D(x)=\frac{p({\mathbf{x}})}{p({\mathbf{x}})+q({\mathbf{x}})}. Then the outer minimization of the minimax problem is

(3.16) minν(𝔼𝐱μ[ln(D(𝐱))]𝔼𝐱ν[ln(D(𝐱))]).\min_{\nu}\,\Bigl{(}{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\ln(D({\mathbf{x}}))]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[\ln(D({\mathbf{x}}))]\Bigr{)}.

This is precisely the “ logD\log D” trick used in the original GAN paper [4] for the vanilla GAN to address the saturation problem. Now we can see it is equivalent to the ff-GAN with the above ff. It is interesting to note that directly optimizing (3.15) is hard because it is hard to find the explicit formula for ff^{*} in this case. Thus the vanilla GAN with the “ logD\log D trick” is an indirect way to realize this ff-GAN.

To implement an ff-GAN VDM we resort to the same approach as the vanilla GAN, using neural networks to approximate both T(x)T(x) and ν\nu to solve the minimax problem (3.11)

minνsupT(𝔼𝐱ν[T(𝐱)]𝔼𝐱μ[f(T(𝐱))]).\min_{\nu}\sup_{T}\,\Bigl{(}{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[f^{*}(T({\mathbf{x}}))]\Bigr{)}.

We assume the critic function T(x) comes from a neural network. [9] proposes T(x)=T_{\omega}(x)=g_{f}(S_{\omega}(x)), where S_{\omega} is a neural network with parameters \omega taking input from {\mathbb{R}}^{n}, and g_{f}:~{\mathbb{R}}{\longrightarrow}I^{*} is an output activation function that forces the output of S_{\omega}(x) onto the domain I^{*} of f^{*}. For \nu we again consider its approximation by probability distributions of the form \nu_{\theta}=\gamma\circ G_{\theta}^{-1}, where \gamma is an initially chosen probability distribution on {\mathbb{R}}^{d} (usually Gaussian, where d may or may not equal n), and G_{\theta} is a neural network with parameters \theta, with input from {\mathbb{R}}^{d} and output in {\mathbb{R}}^{n}. Under this model the f-GAN VDM minimax problem (3.11) becomes

(3.17) \min_{\theta}\sup_{\omega}\,\Bigl({\mathbb{E}}_{{\mathbf{z}}\sim\gamma}[g_{f}(S_{\omega}(G_{\theta}({\mathbf{z}})))]-{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[f^{*}(g_{f}(S_{\omega}({\mathbf{x}})))]\Bigr).

As with the vanilla GAN, since we do not have an explicit expression for the target distribution \mu, we shall approximate the expectations through sample averages. More specifically, let {\mathcal{A}} be a minibatch of samples from the training data set {\mathcal{X}} and {\mathcal{B}} be a minibatch of samples in {\mathbb{R}}^{d} drawn from the distribution \gamma. Then we employ the approximations

(3.18) {\mathbb{E}}_{{\mathbf{z}}\sim\gamma}[g_{f}(S_{\omega}(G_{\theta}({\mathbf{z}})))]\approx\frac{1}{|{\mathcal{B}}|}\sum_{{\mathbf{z}}\in{\mathcal{B}}}g_{f}(S_{\omega}(G_{\theta}({\mathbf{z}}))),
(3.19) {\mathbb{E}}_{{\mathbf{x}}\sim\mu}[f^{*}(g_{f}(S_{\omega}({\mathbf{x}})))]\approx\frac{1}{|{\mathcal{A}}|}\sum_{{\mathbf{x}}\in{\mathcal{A}}}f^{*}(g_{f}(S_{\omega}({\mathbf{x}}))).
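As an illustration of how the pieces fit together, the sketch below evaluates the minibatch objective (3.17)-(3.19) for the Jensen-Shannon case of Example 2, where f^{*}(u)=-\ln(2-e^{u}) and I^{*}=(-\infty,\ln 2). The output activation g_{f}(v)=\ln 2-\log(1+e^{-v}) is one convenient choice mapping {\mathbb{R}} onto I^{*} (not necessarily the one used in [9]); with it, f^{*}(g_{f}(v)) simplifies to \log(1+e^{v})-\ln 2. The networks S and G are placeholders.

```python
import math
import torch
import torch.nn.functional as F

LOG2 = math.log(2.0)

def g_f(v):
    # output activation mapping R onto I* = (-inf, ln 2)
    return LOG2 - F.softplus(-v)

def f_conj_g(v):
    # f*(g_f(v)) for the Jensen-Shannon f; the algebra gives softplus(v) - ln 2
    return F.softplus(v) - LOG2

def vdm_objective(S, G, x_real, z):
    # minibatch version of (3.17): generated-sample term minus real-data term
    return g_f(S(G(z))).mean() - f_conj_g(S(x_real)).mean()
```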

The following algorithm for ff-GAN VDM is almost a verbatim repeat of the Vanilla GAN Algorithm [4] stated earlier:

  VDM Algorithm Minibatch stochastic gradient descent training of generative adversarial nets. Here k1k\geq 1 and mm are hyperparameters.
 

for number of training iterations do
for k steps do

  • Sample minibatch of m samples \{{\mathbf{z}}_{1},\dots,{\mathbf{z}}_{m}\} in {\mathbb{R}}^{d} from the distribution \gamma.

  • Sample minibatch of m samples \{{\mathbf{x}}_{1},\dots,{\mathbf{x}}_{m}\}\subset{\mathcal{X}} from the training set {\mathcal{X}}.

  • Update S_{\omega} by ascending its stochastic gradient with respect to \omega:

    \nabla_{\omega}\,\frac{1}{m}\sum_{i=1}^{m}\Bigl[g_{f}(S_{\omega}(G_{\theta}({\mathbf{z}}_{i})))-f^{*}(g_{f}(S_{\omega}({\mathbf{x}}_{i})))\Bigr].

end for

  • Sample minibatch of m samples \{{\mathbf{z}}_{1},\dots,{\mathbf{z}}_{m}\} in {\mathbb{R}}^{d} from the distribution \gamma.

  • Update the generator G_{\theta} by descending its stochastic gradient with respect to \theta:

    \nabla_{\theta}\,\frac{1}{m}\sum_{i=1}^{m}g_{f}(S_{\omega}(G_{\theta}({\mathbf{z}}_{i}))).

end for

The gradient-based updates can use any standard gradient-based learning rule.
 

4. Examples of Well-Known GANs

Since their inception, GANs have become one of the hottest research topics in machine learning. Many specialized GANs tailor-made for particular applications have been developed. Modifications and improvements have been proposed to address some of the shortcomings of the vanilla GAN. Here we review some of the best known efforts in these directions.

4.1. Wasserstein GAN (WGAN)

Training a GAN can be difficult and frequently runs into one of several failure modes. This has been the subject of much discussion. Some of the best known failure modes are:

\cdot Vanishing Gradients:

This occurs quite often, especially when the discriminator is too good, which can stymie the improvement of the generator. With an optimal discriminator, the generator's gradients can vanish, so training provides too little information for the generator to improve.

\cdot Mode Collapse:

This refers to the phenomenon where the generator starts to produce the same output (or a small set of outputs) over and over again. If the discriminator gets stuck in a local minimum, then it is too easy for the next generator iteration to find the output that is most plausible to the current discriminator. Being stuck, the discriminator never manages to learn its way out of the trap. As a result the generator rotates through a small set of output types.

\cdot Failure to Converge:

GANs frequently fail to converge, due to a number of factors (known and unknown).

The WGAN [2] makes a simple modification: it replaces the Jensen-Shannon divergence loss function in the vanilla GAN with the Wasserstein distance, also known as the Earth Mover (EM) distance. Do not overlook the significance of this modification. It is one of the most important developments in the topic since the inception of GANs, as the use of the EM distance effectively addresses some glaring shortcomings of divergence-based GANs and mitigates the common failure modes in GAN training listed above.

Let \mu,\nu be two probability distributions on {\mathbb{R}}^{n} (or more generally on any metric space). Denote by \Pi(\mu,\nu) the set of all probability distributions \pi(x,y) on {\mathbb{R}}^{n}\times{\mathbb{R}}^{n} whose marginals are \mu(x) and \nu(y) respectively. Then the EM distance (Wasserstein-1 distance) between \mu and \nu is

W^{1}(\mu,\nu):=\min_{\pi\in\Pi(\mu,\nu)}\int_{{\mathbb{R}}^{n}\times{\mathbb{R}}^{n}}\|x-y\|\,d\pi(x,y)=\min_{\pi\in\Pi(\mu,\nu)}{\mathbb{E}}_{({\mathbf{x}},{\mathbf{y}})\sim\pi}[\|{\mathbf{x}}-{\mathbf{y}}\|].

Intuitively W1(μ,ν)W^{1}(\mu,\nu) is called the earth mover distance because it denotes the least amount of work one needs to do to move mass μ\mu to mass ν\nu.

In WGAN the objective is to minimize the loss function W^{1}(\mu,\nu), as opposed to the loss function D_{f}(\mu\|\nu) in f-GAN. The advantage of W^{1}(\mu,\nu) is illustrated in [2] through the following example. Let Z be the uniform distribution on (0,1) in {\mathbb{R}}. Let \mu be the distribution of (0,Z) and \nu_{\theta} be the distribution of (\theta,Z) on {\mathbb{R}}^{2}. Then \mu,\nu_{\theta} are singular distributions with disjoint supports if \theta\neq 0. It is easy to check that

D_{JS}(\mu\|\nu_{\theta})=\ln 2,\quad D_{KL}(\mu\|\nu_{\theta})=\infty,\quad\mbox{and}\quad W^{1}(\mu,\nu_{\theta})=|\theta|.

However, for small \theta>0, even though \mu and \nu_{\theta} have disjoint supports, visually they look very close. In fact, if a GAN can approximate \mu by \nu_{\theta} for a very small \theta we would be very happy with the result. But no matter how close \theta>0 is to 0 we will have D_{JS}(\mu\|\nu_{\theta})=\ln 2. If we train the vanilla GAN with the initial source distribution \nu=\nu_{\theta}, we would be stuck with a flat gradient, so it will not converge. More generally, let \mu,\nu be probability measures such that \mu\perp\nu. Then we always have D_{JS}(\mu\|\nu)=\ln 2. By Theorem 3.6 and (3.10),

\sup_{T}\Bigl({\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[f^{*}(T({\mathbf{x}}))]\Bigr)=\ln 2.

Thus gradient descent will fail to update \nu. In a more practical setting, if our target distribution \mu is a Gaussian mixture with well separated means, then starting with the standard Gaussian as the initial source distribution will likely miss those Gaussians in the mixture whose means are far away from 0, resulting in mode collapse and possibly failure to converge. Note that by the same Theorem 3.6 and (3.10), things wouldn’t improve by changing the convex function f(x) in the f-GAN. As we have seen from the examples, this pitfall can be avoided in WGAN.

The next question is how to evaluate W^{1}(\mu,\nu) using only samples from the distributions, since we do not have an explicit expression for the target distribution \mu. This is where the Kantorovich-Rubinstein duality [11] comes in, which states that

(4.1) W^{1}(\mu,\nu)=\sup_{T\in{\rm Lip}_{1}({\mathbb{R}}^{n})}\Bigl({\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[T({\mathbf{x}})]\Bigr),

where Lip1(n){\rm Lip}_{1}({\mathbb{R}}^{n}) denotes the set of all Lipschitz functions on n{\mathbb{R}}^{n} with Lipschitz constant 1. Here the critic T(x)T(x) serves the role of a discriminator. With the duality WGAN solves the minimax problem

(4.2) \min_{\nu}W^{1}(\mu,\nu)=\min_{\nu}\sup_{T\in{\rm Lip}_{1}({\mathbb{R}}^{n})}\Bigl({\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[T({\mathbf{x}})]\Bigr).

The rest follows the same script as vanilla GAN and ff-GAN. We write ν=γG1\nu=\gamma\circ G^{-1} where γ\gamma is a prior source distribution (usually the standard normal) in d{\mathbb{R}}^{d} and GG maps d{\mathbb{R}}^{d} to n{\mathbb{R}}^{n}. It follows that

𝔼𝐱ν[T(𝐱)]=𝔼𝐳γ[T(G(𝐳))].{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[T({\mathbf{x}})]={\mathbb{E}}_{{\mathbf{z}}\sim\gamma}[T(G({\mathbf{z}}))].

Finally we approximate T(x)T(x) and G(z)G(z) by neural networks with parameters ω\omega and θ\theta respectively, T(x)=Tω(x)T(x)=T_{\omega}(x) and G(z)=Gθ(z)G(z)=G_{\theta}(z). Stochastic gradient descent is used to train WGAN just like all other GANs. Indeed, replacing the JS-divergence with the EM distance W1W^{1}, the algorithm for the vanilla GAN can be copied verbatim to become the algorithm for WGAN.

Actually, almost verbatim. There is one last issue to be worked out, namely how does one enforce the condition Tω(x)Lip1T_{\omega}(x)\in{\rm Lip}_{1}? This is difficult. The authors of [2] proposed a technique called weight clipping, where the parameters (weights) ω\omega are artificially restricted to the region Ω:={ω0.01}\Omega:=\{\|\omega\|_{\infty}\leq 0.01\}. In other words, every entry of ω\omega is clipped so that it falls into the interval [0.01,0.01][-0.01,0.01]. Obviously this is not the same as restricting the Lipschitz constant to 1. However, since Ω\Omega is compact, so is the family {Tω:ωΩ}\{T_{\omega}:\omega\in\Omega\}, and hence the Lipschitz constants of these functions are uniformly bounded by some K>0K>0. The hope is that

supωΩ(𝔼𝐱μ[Tω(𝐱)]𝔼𝐱ν[Tω(𝐱)])supTLipK(n)(𝔼𝐱μ[T(𝐱)]𝔼𝐱ν[T(𝐱)]),\sup_{\omega\in\Omega}\Bigl{(}{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T_{\omega}({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[T_{\omega}({\mathbf{x}})]\Bigr{)}\approx\sup_{T\in{\rm Lip}_{K}({\mathbb{R}}^{n})}\Bigl{(}{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[T({\mathbf{x}})]\Bigr{)},

where the latter is just KW1(μ,ν)K\cdot W^{1}(\mu,\nu).

Weight clipping may not be the best way to approximate the Lipschitz condition. Alternatives such as the gradient penalty of [5] can be more effective, and better approaches may yet be found.
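To make the training procedure concrete, here is a minimal sketch of one WGAN training iteration with weight clipping, written with PyTorch. The toy fully connected networks, the RMSprop learning rate, and the number of critic updates per generator update are illustrative assumptions, not the reference implementation of [2]; only the clipping threshold 0.01 comes from the discussion above.

    import torch
    import torch.nn as nn

    d, n = 10, 784                      # latent and data dimensions (assumed for this sketch)
    G = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, n))   # generator G_theta
    T = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, 1))   # critic T_omega
    opt_G = torch.optim.RMSprop(G.parameters(), lr=5e-5)
    opt_T = torch.optim.RMSprop(T.parameters(), lr=5e-5)
    clip = 0.01                         # weight-clipping threshold from the text
    n_critic = 5                        # critic updates per generator update (assumed)

    def wgan_step(real_batch):
        """One WGAN iteration: several critic updates followed by one generator update."""
        for _ in range(n_critic):
            z = torch.randn(real_batch.size(0), d)          # z ~ gamma (standard normal prior)
            # The critic maximizes E_mu[T(x)] - E_gamma[T(G(z))]; we minimize its negative.
            loss_T = -(T(real_batch).mean() - T(G(z).detach()).mean())
            opt_T.zero_grad()
            loss_T.backward()
            opt_T.step()
            # Weight clipping: force every critic parameter into [-clip, clip].
            with torch.no_grad():
                for p in T.parameters():
                    p.clamp_(-clip, clip)
        # The generator minimizes -E_gamma[T(G(z))].
        z = torch.randn(real_batch.size(0), d)
        loss_G = -T(G(z)).mean()
        opt_G.zero_grad()
        loss_G.backward()
        opt_G.step()

    wgan_step(torch.rand(64, n))        # random batch standing in for real training data

The critic is updated several times per generator update, a common practice since the dual formulation (4.2) requires the inner supremum to be approximated reasonably well before each generator step.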

4.2. Deep Convolutional GAN (DCGAN)

DCGAN refers to a set of architectural guidelines for GANs, developed in [10]. Empirically these guidelines help GANs attain more stable training and better performance. According to the paper, the guidelines are:

  • Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator).

  • Use batch normalization in both the generator and the discriminator.

  • Remove fully connected hidden layers for deeper architectures.

  • Use ReLU activation in generator for all layers except for the output, which uses Tanh.

  • Use LeakyReLU activation in the discriminator for all layers.

Here strided convolution refers to shifting the convolution window by more than 1 unit, which amounts to downsampling. Fractional-strided convolution refers to shifting the convolution window by a fractional unit, say 1/21/2 of a unit, which is often used for upsampling. This obviously cannot be done in the literal sense. To realize it we pad the input with zeros and then take the appropriate strided convolution. For example, suppose our input data is X=[x1,x2,,xN]X=[x_{1},x_{2},\dots,x_{N}] and the convolution window is w=[w1,w2,w3]w=[w_{1},w_{2},w_{3}]. For a 1/21/2 strided convolution we would first pad XX to become X~=[x1,0,x2,0,,0,xN]\tilde{X}=[x_{1},0,x_{2},0,\dots,0,x_{N}] and then execute X~w\tilde{X}*w. Nowadays, strided and fractional-strided convolutions are just two of several ways to perform down- and up-sampling.
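As a small illustration of the zero-insertion view of a 1/21/2-strided convolution, the following numpy sketch (with arbitrary array values) pads the input with zeros between samples and then applies an ordinary convolution; deep learning frameworks implement the same idea under the name “transposed convolution”.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])      # input X = [x1, ..., xN]
    w = np.array([0.25, 0.5, 0.25])         # convolution window w = [w1, w2, w3]

    # Zero insertion: X_tilde = [x1, 0, x2, 0, ..., 0, xN]
    x_tilde = np.zeros(2 * len(x) - 1)
    x_tilde[::2] = x

    # A 1/2-strided ("fractional") convolution is then an ordinary convolution of the
    # zero-inserted signal; the output is roughly twice as long as the input (upsampling).
    y = np.convolve(x_tilde, w, mode="same")
    print(y)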

4.3. Progressive Growing of GANs (PGGAN)

Generating high resolution images from GANs is a very challenging problem. Progressive Growing of GANs developed in [6] is a technique that addresses this challenge.

PGGAN actually refers to a training methodology for GANs. The key idea is to grow both the generator and the discriminator progressively: starting from low resolution images, one adds new layers that model increasingly fine details as training progresses. Since low resolution images are much easier to model, the training is very stable in the beginning. Once training at a lower resolution is done, the model gradually transitions to training at a higher resolution. This process continues until the desired resolution is reached. In the paper [6], very high quality facial images were generated by starting the training at 4×44\times 4 resolution and gradually increasing the resolution to 8×88\times 8, 16×1616\times 16, etc., until it reached 1024×10241024\times 1024.

It would be interesting to provide a mathematical foundation for PGGAN. From the earlier analysis we know there are pitfalls with GANs when the target distribution is singular, especially if it and the initial source distribution have disjoint supports. PGGAN may provide an effective way to mitigate this problem.

4.4. Cycle-Consistent Adversarial Networks (Cycle-GAN)

Cycle-GAN is an image-to-image translation technique developed in [12]. Before this work the goal of image-to-image translation was to learn the mapping between an input image and an output image using a training set of aligned image pairs. However, we often want to do a similar task without the benefit of a training set of aligned image pairs. For example, while we have many paintings by Claude Monet, we don’t have photographs of the scenes he painted, and may wonder what those scenes would look like in a photo. Conversely, we may wonder what a Monet painting of Mt. Everest would look like even though Claude Monet never went to Mt. Everest. One way to achieve such tasks is through neural style transfer [3], but this technique only transfers the style of a single painting to a target image. Cycle-GAN offers a different approach that allows for style transfer (translation) more broadly.

In a Cycle-GAN, we start with two training sets 𝒳{\mathcal{X}} and 𝒴{\mathcal{Y}}. For example, 𝒳{\mathcal{X}} could be a corpus of Monet scenery paintings and 𝒴{\mathcal{Y}} could be a set of landscape photographs. The training objective of a Cycle-GAN is to transfer styles from 𝒳{\mathcal{X}} to 𝒴{\mathcal{Y}} and vice versa in a “cycle-consistent” way, as described below.

Precisely speaking, a Cycle-GAN consists of three components, each given by a tailored loss function. The first component is a GAN (vanilla GAN, but can also be any other GAN) that tries to generate the distribution of 𝒳{\mathcal{X}}, with one notable deviation: instead of sampling initially from a random source distribution γ\gamma such as a standard normal distribution, Cycle-GAN samples from the training data set 𝒴{\mathcal{Y}}. Assume that 𝒳{\mathcal{X}} are samples drawn from the distribution μ\mu and 𝒴{\mathcal{Y}} are samples drawn from the distribution ν0\nu_{0}. The loss function for this component is

(4.3) 𝐋gan1(G1,Dμ):=𝔼𝐱μ[log(Dμ(𝐱))]+𝔼𝐲ν0[log(1Dμ(G1(𝐲)))]{\mathbf{L}}_{\rm gan1}(G_{1},D_{\mu}):={\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\log(D_{\mu}({\mathbf{x}}))]+{\mathbb{E}}_{{\mathbf{y}}\sim\nu_{0}}[\log(1-D_{\mu}(G_{1}({\mathbf{y}})))]

where DμD_{\mu} is the discriminator network and G1G_{1} is the generator network. Clearly this is the same loss used by the vanilla GAN, except the source distribution γ\gamma is replaced by the distribution ν0\nu_{0} of 𝒴{\mathcal{Y}}. The second component is the mirror of the first component, namely it is a GAN to learn the distribution ν0\nu_{0} of 𝒴{\mathcal{Y}} with the initial source distribution set as the distribution μ\mu of 𝒳{\mathcal{X}}. The corresponding loss function is thus

(4.4) 𝐋gan2(G2,Dν0):=𝔼𝐲ν0[log(Dν0(𝐲))]+𝔼𝐱μ[log(1Dν0(G2(𝐱)))]{\mathbf{L}}_{\rm gan2}(G_{2},D_{\nu_{0}}):={\mathbb{E}}_{{\mathbf{y}}\sim\nu_{0}}[\log(D_{\nu_{0}}({\mathbf{y}}))]+{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\log(1-D_{\nu_{0}}(G_{2}({\mathbf{x}})))]

where Dν0D_{\nu_{0}} is the discriminator network and G2G_{2} is the generator network. The third component of the Cycle-GAN is the “cycle-consistent” loss function given by

(4.5) 𝐋cycle(G1,G2):=𝔼𝐲ν0[G2(G1(𝐲))𝐲1]+𝔼𝐱μ[G1(G2(𝐱))𝐱1].{\mathbf{L}}_{\rm cycle}(G_{1},G_{2}):={\mathbb{E}}_{{\mathbf{y}}\sim\nu_{0}}[\|G_{2}(G_{1}({\mathbf{y}}))-{\mathbf{y}}\|_{1}]+{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\|G_{1}(G_{2}({\mathbf{x}}))-{\mathbf{x}}\|_{1}].

The overall loss function is

(4.6) 𝐋(G1,G2,Dμ,Dν0):=𝐋gan1(G1,Dμ)+𝐋gan2(G2,Dν0)+λ𝐋cycle(G1,G2),{\mathbf{L}}^{*}(G_{1},G_{2},D_{\mu},D_{\nu_{0}}):={\mathbf{L}}_{\rm gan1}(G_{1},D_{\mu})+{\mathbf{L}}_{\rm gan2}(G_{2},D_{\nu_{0}})+\lambda{\mathbf{L}}_{\rm cycle}(G_{1},G_{2}),

where λ>0\lambda>0 is a parameter. Intuitively, G1G_{1} translates a sample 𝐲{\mathbf{y}} from the distribution ν0\nu_{0} into a sample from μ\mu, while G2G_{2} translates a sample 𝐱{\mathbf{x}} from the distribution μ\mu into a sample from ν0\nu_{0}. The loss function 𝐋cycle{\mathbf{L}}_{\rm cycle} encourages “consistency” in the sense that G2(G1(𝐲))G_{2}(G_{1}({\mathbf{y}})) is not too far from 𝐲{\mathbf{y}} and G1(G2(𝐱))G_{1}(G_{2}({\mathbf{x}})) is not too far from 𝐱{\mathbf{x}}. Finally, Cycle-GAN is trained by solving the minimax problem

(4.7) minG1,G2maxDμ,Dν0𝐋(G1,G2,Dμ,Dν0).\min_{G_{1},G_{2}}\,\max_{D_{\mu},D_{\nu_{0}}}\,{\mathbf{L}}^{*}(G_{1},G_{2},D_{\mu},D_{\nu_{0}}).
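As a rough sketch of how the three loss components fit together, the following PyTorch snippet assembles the total loss (4.6) for a batch from each domain. The toy fully connected networks, the flattened image dimension, and the weight lam are illustrative placeholders, not the architecture of [12].

    import torch
    import torch.nn as nn

    n = 784  # flattened image dimension (illustrative)

    # Generators G1: Y -> X, G2: X -> Y, and one discriminator per domain (toy architectures).
    G1 = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, n))
    G2 = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, n))
    D_mu = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
    D_nu = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())

    def cyclegan_loss(x, y, lam=10.0):
        """Total Cycle-GAN objective (4.6): two GAN losses plus the cycle-consistency term."""
        eps = 1e-8  # numerical safeguard inside the logs
        # L_gan1 (4.3): G1 maps y into the X domain, D_mu tries to tell real from translated.
        l_gan1 = torch.log(D_mu(x) + eps).mean() + torch.log(1 - D_mu(G1(y)) + eps).mean()
        # L_gan2 (4.4): G2 maps x into the Y domain, D_nu tries to tell real from translated.
        l_gan2 = torch.log(D_nu(y) + eps).mean() + torch.log(1 - D_nu(G2(x)) + eps).mean()
        # L_cycle (4.5): translating forth and back should roughly return the input (L1 norm).
        l_cycle = (G2(G1(y)) - y).abs().sum(dim=1).mean() + (G1(G2(x)) - x).abs().sum(dim=1).mean()
        return l_gan1 + l_gan2 + lam * l_cycle

    x = torch.rand(8, n)  # batch from the X domain (e.g. Monet paintings), random stand-in
    y = torch.rand(8, n)  # batch from the Y domain (e.g. photographs), random stand-in
    print(cyclegan_loss(x, y).item())

In training, the discriminators take gradient ascent steps on this quantity while the generators take descent steps, mirroring the minimax problem (4.7).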

5. Alternative to GAN: Variational Autoencoder (VAE)

The Variational Autoencoder (VAE) is an alternative generative model to the GAN. To understand VAEs one needs to first understand what an autoencoder is. In a nutshell, an autoencoder consists of an encoder neural network FF that maps a dataset 𝒳n{\mathcal{X}}\subset{\mathbb{R}}^{n} to d{\mathbb{R}}^{d}, where dd is typically much smaller than nn, together with a decoder neural network HH that “decodes” elements of d{\mathbb{R}}^{d} back to n{\mathbb{R}}^{n}. In other words, it encodes the nn-dimensional features of the dataset into dd-dimensional latents, along with a way to convert the latents back to the features. Autoencoders can be viewed as data compressors that compress a higher dimensional dataset to a much lower dimensional one without losing too much information. For example, the MNIST dataset consists of images of size 28×2828\times 28, which lie in 784{\mathbb{R}}^{784}. An autoencoder can easily compress it to a data set in 10{\mathbb{R}}^{10} using only 10 latents without losing much information. A typical autoencoder has a “bottleneck” architecture, which is shown in Figure 1.

Figure 1. An autoencoder (courtesy of Jason Anderson and CompThree Inc.)

The loss function for training an autoencoder is typically the mean square error (MSE)

𝐋AE:=1N𝐱𝒳H(F(𝐱))𝐱2{\mathbf{L}}_{\rm AE}:=\frac{1}{N}\sum_{{\mathbf{x}}\in{\mathcal{X}}}\|H(F({\mathbf{x}}))-{\mathbf{x}}\|^{2}

where NN is the size of 𝒳{\mathcal{X}}. For binary data we may use the binary cross entropy (BCE) loss function instead. Furthermore, we may also add a regularization term to encourage desirable properties, e.g. a LASSO style L1L^{1} penalty to promote sparsity.
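As a concrete illustration, a bottleneck autoencoder for MNIST-sized inputs and its MSE loss might look as follows in PyTorch; the layer sizes are arbitrary choices made for this sketch.

    import torch
    import torch.nn as nn

    n, d = 784, 10   # feature dimension (28x28 images) and latent dimension

    # Encoder F: R^n -> R^d and decoder H: R^d -> R^n, with a bottleneck in the middle.
    F = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, d))
    H = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, n))

    def ae_loss(batch):
        """Mean square reconstruction error (1/N) sum ||H(F(x)) - x||^2 over the batch."""
        recon = H(F(batch))
        return ((recon - batch) ** 2).sum(dim=1).mean()

    x = torch.rand(32, n)   # random batch standing in for training data
    loss = ae_loss(x)
    loss.backward()         # gradients flow to both encoder and decoder parameters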

First developed in [7], VAEs are also neural networks with architectures similar to those of autoencoders, but with stochasticity added into the networks. Autoencoders are deterministic networks in the sense that the output is completely determined by the input. To make generative models out of autoencoders we will need to add randomness to the latents. In an autoencoder, input data 𝐱{\mathbf{x}} are encoded to the latents 𝐳=F(𝐱)d{\mathbf{z}}=F({\mathbf{x}})\in{\mathbb{R}}^{d}, which are then decoded to 𝐱^=H(𝐳)=H(F(𝐱))\hat{\mathbf{x}}=H({\mathbf{z}})=H(F({\mathbf{x}})). A VAE deviates from an autoencoder in the following sense: the input 𝐱{\mathbf{x}} is encoded into a diagonal Gaussian random variable 𝝃=𝝃(𝐱){\boldsymbol{\xi}}={\boldsymbol{\xi}}({\mathbf{x}}) in d{\mathbb{R}}^{d} with mean 𝝁(𝐱){\boldsymbol{\mu}}({\mathbf{x}}) and variance 𝝈2(𝐱){\boldsymbol{\sigma}}^{2}({\mathbf{x}}). Here 𝝈2(𝐱)=[σ12(𝐱),,σd2(𝐱)]Td{\boldsymbol{\sigma}}^{2}({\mathbf{x}})=[\sigma_{1}^{2}({\mathbf{x}}),\dots,\sigma_{d}^{2}({\mathbf{x}})]^{T}\in{\mathbb{R}}^{d} and the covariance matrix is actually diag(𝝈2){\rm diag}({\boldsymbol{\sigma}}^{2}). Another way to look at this setup is that instead of having just one encoder FF as in an autoencoder, a VAE has two encoders 𝝁{\boldsymbol{\mu}} and 𝝈2{\boldsymbol{\sigma}}^{2}, for the mean and the variance of the latent variable 𝝃{\boldsymbol{\xi}} respectively. As in an autoencoder, it also has a decoder HH. With this randomness in place we now have a generative model.

Of course we will need some constraints on 𝝁(𝐱){\boldsymbol{\mu}}({\mathbf{x}}) and 𝝈2(𝐱){\boldsymbol{\sigma}}^{2}({\mathbf{x}}). Here VAEs employ the following heuristics for training:

  • The decoder HH decodes the latent random variables 𝝃(𝐱){\boldsymbol{\xi}}({\mathbf{x}}) to 𝐱^\hat{\mathbf{x}} that are close to 𝐱{\mathbf{x}}.

  • The random variable X=𝝃(𝐱)X={\boldsymbol{\xi}}({\mathbf{x}}) with 𝐱{\mathbf{x}} sampled uniformly from 𝒳{\mathcal{X}} is close to the standard normal distribution N(0,Id)N(0,I_{d}) in d{\mathbb{R}}^{d}.

The heuristics will be realized through a loss function consisting of two components. The first component is simply the mean square error between 𝐱^\hat{\mathbf{x}} and 𝐱{\mathbf{x}} given by

(5.1) 𝐋1(𝝁,𝝈,H)\displaystyle{\mathbf{L}}_{1}({\boldsymbol{\mu}},{\boldsymbol{\sigma}},H) =1N𝐱𝒳𝔼𝐳N(𝝁(𝐱),diag(𝝈2(𝐱)))[H(𝐳)𝐱2]\displaystyle=\frac{1}{N}\,\sum_{{\mathbf{x}}\in{\mathcal{X}}}{\mathbb{E}}_{{\mathbf{z}}\sim N({\boldsymbol{\mu}}({\mathbf{x}}),{\rm diag}({\boldsymbol{\sigma}}^{2}({\mathbf{x}})))}[\|H({\mathbf{z}})-{\mathbf{x}}\|^{2}]
(5.2) =1N𝐱𝒳𝔼𝐳N(0,Id)[H(𝝁(𝐱)+𝝈(𝐱)𝐳)𝐱2]\displaystyle=\frac{1}{N}\,\sum_{{\mathbf{x}}\in{\mathcal{X}}}{\mathbb{E}}_{{\mathbf{z}}\sim N(0,I_{d})}[\|H({\boldsymbol{\mu}}({\mathbf{x}})+{\boldsymbol{\sigma}}({\mathbf{x}})\odot{\mathbf{z}})-{\mathbf{x}}\|^{2}]

where \odot denotes the entry-wise product. Going from (5.1) to (5.2) uses a very useful technique called re-parametrization. The second component of the loss function pushes the latent random variables towards N(0,Id)N(0,I_{d}); in practice one penalizes the KL-divergence (or another ff-divergence) between each 𝝃(𝐱){\boldsymbol{\xi}}({\mathbf{x}}) and N(0,Id)N(0,I_{d}), averaged over the training set. For two Gaussian distributions the KL-divergence has an explicit expression, which gives

(5.3) 𝐋2(𝝁,𝝈)=1N𝐱𝒳DKL(𝝃(𝐱)N(0,Id))=12N𝐱𝒳i=1d(μi2(𝐱)+σi2(𝐱)1ln(σi2(𝐱))).{\mathbf{L}}_{2}({\boldsymbol{\mu}},{\boldsymbol{\sigma}})=\frac{1}{N}\,\sum_{{\mathbf{x}}\in{\mathcal{X}}}D_{KL}({\boldsymbol{\xi}}({\mathbf{x}})\|N(0,I_{d}))=\frac{1}{2N}\,\sum_{{\mathbf{x}}\in{\mathcal{X}}}\sum_{i=1}^{d}\Bigl{(}\mu_{i}^{2}({\mathbf{x}})+\sigma_{i}^{2}({\mathbf{x}})-1-\ln(\sigma_{i}^{2}({\mathbf{x}}))\Bigr{)}.

The loss function for a VAE is thus

(5.4) 𝐋VAE=𝐋1(𝝁,𝝈,H)+λ𝐋2(𝝁,𝝈){\mathbf{L}}_{\rm VAE}={\mathbf{L}}_{1}({\boldsymbol{\mu}},{\boldsymbol{\sigma}},H)+\lambda\,{\mathbf{L}}_{2}({\boldsymbol{\mu}},{\boldsymbol{\sigma}})

where λ>0\lambda>0 is a parameter. To generate new data from a trained VAE, one simply feeds random samples 𝐳N(0,Id){\mathbf{z}}\sim N(0,I_{d}) into the decoder network HH. A typical VAE architecture is shown in Figure 2.

Figure 2. A variational autoencoder (courtesy of Jason Anderson and CompThree Inc.)
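To connect the formulas above to an implementation, here is a minimal PyTorch sketch of the VAE loss (5.4) using the re-parametrization trick (5.2). The encoder and decoder architectures are arbitrary choices for this sketch, and the encoder outputs the log-variance rather than 𝝈2{\boldsymbol{\sigma}}^{2} directly, a common numerical convenience that is not part of the formulation above.

    import torch
    import torch.nn as nn

    n, d = 784, 10
    enc_mu = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, d))       # mu(x)
    enc_logvar = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, d))   # log sigma^2(x)
    H = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, n))            # decoder

    def vae_loss(x, lam=1.0):
        mu, logvar = enc_mu(x), enc_logvar(x)
        # Re-parametrization (5.2): z = mu + sigma ⊙ eps with eps ~ N(0, I_d).
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        # Reconstruction term L1 (5.1): mean square error between H(z) and x.
        recon = ((H(z) - x) ** 2).sum(dim=1).mean()
        # KL term L2 (5.3): closed form for diagonal Gaussians against N(0, I_d).
        kl = 0.5 * (mu ** 2 + logvar.exp() - 1.0 - logvar).sum(dim=1).mean()
        return recon + lam * kl

    x = torch.rand(32, n)        # random batch standing in for training data
    vae_loss(x).backward()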

This short introduction has only touched on the mathematical foundation of VAEs. There has been a wealth of research on improving the performance of VAEs and on their applications. We encourage interested readers to study the subject further through the recent literature.

References

  • [1] S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society: Series B (Methodological), 28(1):131–142, 1966.
  • [2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
  • [3] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2414–2423, 2016.
  • [4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
  • [5] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In Advances in neural information processing systems, pages 5767–5777, 2017.
  • [6] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
  • [7] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [8] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
  • [9] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in neural information processing systems, pages 271–279, 2016.
  • [10] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • [11] C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
  • [12] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017.