
A Mathematical Introduction to Generative Adversarial Nets (GAN)

Yang Wang Department of Mathematics
Hong Kong University of Science and Technology
Clear Water Bay, Kowloon, Hong Kong
[email protected]
Abstract.

Generative Adversarial Nets (GAN) have received considerable attention since the 2014 groundbreaking work by Goodfellow et al [4]. Such attention has led to an explosion in new ideas, techniques and applications of GANs. To better understand GANs we need to understand the mathematical foundation behind them. This paper attempts to provide an overview of GANs from a mathematical point of view. Many students in mathematics may find the papers on GANs difficult to fully understand because most of them are written from a computer science and engineering point of view. The aim of this paper is to give more mathematically oriented students an introduction to GANs in a language that is more familiar to them.

Key words and phrases:
Deep Learning, GAN, Neural Network
2010 Mathematics Subject Classification:
Primary 42C15
The author is supported in part by the Hong Kong Research Grant Council grants 16308518 and 16317416, as well as HK Innovation Technology Fund ITS/044/18FX

1. Introduction

1.1. Background

Generative Adversarial Nets (GAN) have received considerable attention since the 2014 groundbreaking work by Goodfellow et al [4]. Such attention has led to an explosion in new ideas, techniques and applications of GANs. Yann LeCun has called “this (GAN) and the variations that are now being proposed is the most interesting idea in the last 10 years in ML, in my opinion.” In this note I will attempt to provide a beginner’s introduction to GAN from a more mathematical point of view, intended for students in mathematics. Of course there is much more to GANs than just the mathematical principle. To fully understand GANs one must also look into their algorithms and applications. Nevertheless I believe that understanding the mathematical principle is a crucial first step towards understanding GANs, and with it the other aspects of GANs will be considerably easier to master.

The original GAN, which we shall refer to as the vanilla GAN in this paper, was introduced in [4] as a new framework for generating objects from a training data set. Its goal was to address the following question: Suppose we are given a data set of objects with a certain degree of consistency, for example a collection of images of cats, or handwritten Chinese characters, or Van Gogh paintings; can we artificially generate similar objects?

This question is quite vague, so we need to make it mathematically more specific. Before we can move on, we need to clarify what we mean by “objects with a certain degree of consistency” and “similar objects”.

First we shall assume that our objects are points in {\mathbb{R}}^{n}. For example, a grayscale digital image of 1 megapixel can be viewed as a point in {\mathbb{R}}^{n} with n=10^{6}. Our data set (training data set) is simply a collection of points in {\mathbb{R}}^{n}, which we denote by {\mathcal{X}}\subset{\mathbb{R}}^{n}. When we say that the objects in the data set {\mathcal{X}} have a certain degree of consistency, we mean that they are samples generated from a common probability distribution \mu on {\mathbb{R}}^{n}, which is often assumed to have a density function p(x). Of course, by assuming \mu to have a density function we are mathematically assuming that \mu is absolutely continuous. Some mathematicians may question the wisdom of this assumption by pointing out that it is possible (in fact even likely) that the objects of interest lie on a lower dimensional manifold, making \mu a singular probability distribution. For example, consider the MNIST data set of handwritten digits. While they are 28\times 28 images (so n=784), the data points may actually lie on a manifold of much smaller dimension (say only 20 or so). This is a valid criticism. Indeed, when the intrinsic dimension of the distribution is far smaller than the ambient dimension, various problems can arise, such as failure to converge or the so-called mode collapse, leading to poor results in some cases. Still, in most applications this assumption does seem to work well. Furthermore, we shall show that the requirement of absolute continuity is not critical to the GAN framework and can in fact be relaxed.

Quantifying “similar objects” is a bit trickier and holds the key to GANs. There are many ways in mathematics to measure similarity. For example, we may define a distance function and call two points {\mathbf{x}},{\mathbf{y}} “similar” if the distance between them is small. But this idea is not useful here. Our objective is not to generate objects that have small distances to some existing objects in {\mathcal{X}}. Rather, we want to generate new objects that may not be close, in whatever distance measure we use, to any existing object in the training data set {\mathcal{X}}, but that we feel belong to the same class. A good analogy: suppose we have a data set of Van Gogh paintings. We do not care to generate a painting that is a perturbation of Van Gogh’s Starry Night. Instead we would like to generate a painting that a Van Gogh expert will see as a new Van Gogh painting she has never seen before.

A better angle, at least from the perspective of GANs, is to define similarity in the sense of probability distribution. Two data sets are considered similar if they are samples from the same (or approximately same) probability distribution. Thus more specifically we have our training data set 𝒳n{\mathcal{X}}\subset{\mathbb{R}}^{n} consisting of samples from a probability distribution μ\mu (with density p(𝐱)p({\mathbf{x}})), and we would like to find a probability distribution ν\nu (with density q(𝐱)q({\mathbf{x}})) such that ν\nu is a good approximation of μ\mu. By taking samples from the distribution ν\nu we obtain generated objects that are “similar” to the objects in 𝒳{\mathcal{X}}.

One may wonder why we don’t simply set \nu=\mu and take samples from \mu. Wouldn’t that give us a perfect solution? Indeed it would, if we knew what \mu is. Unfortunately that is exactly our main problem: we don’t. All we know is a finite set of samples {\mathcal{X}} drawn from the distribution \mu. Hence our real challenge is to learn the distribution \mu from only a finite set of samples drawn from it. We should view finding \nu as the process of approximating \mu. GANs do seem to provide a novel and highly effective way of achieving this goal. In general the success of a GAN will depend on the complexity of the distribution \mu and the size of the training data set {\mathcal{X}}. In some cases the cardinality |{\mathcal{X}}|=N can be quite large, e.g. for the ImageNet data set N is well over 10^{7}. But in other cases, such as Van Gogh paintings, the size N is rather small, on the order of 100.

1.2. The Basic Approach of GAN

To approximate μ\mu, the vanilla GAN and subsequently other GANs start with an initial probability distribution γ\gamma defined on d{\mathbb{R}}^{d}, where dd may or may not be the same as nn. For the time being we shall set γ\gamma to be the standard normal distribution N(0,Id)N(0,I_{d}), although we certainly can choose γ\gamma to be other distributions. The technique GANs employ is to find a mapping (function) G:dnG:~{}{\mathbb{R}}^{d}{\longrightarrow}{\mathbb{R}}^{n} such that if a random variable 𝐳d{\mathbf{z}}\in{\mathbb{R}}^{d} has distribution γ\gamma then G(𝐳)G({\mathbf{z}}) has distribution μ\mu. Note that the distribution of G(𝐳)G({\mathbf{z}}) is γG1\gamma\circ G^{-1}, where G1G^{-1} maps subsets of n{\mathbb{R}}^{n} to subsets of d{\mathbb{R}}^{d}. Thus we are looking for a G(𝐳)G({\mathbf{z}}) such that γG1=μ\gamma\circ G^{-1}=\mu, or at least is a good approximation of μ\mu. Sounds simple, right?

Actually several key issues remain to be addressed. One issue is that we only have samples from μ\mu, and if we know GG we can have samples G(𝐳)G({\mathbf{z}}) where 𝐳{\mathbf{z}} is drawn from the distribution γ\gamma. How do we know from these samples that our distribution γG1\gamma\circ G^{-1} is the same or a good approximation of μ\mu? Assuming we have ways to do so, we still have the issue of finding G(𝐳)G({\mathbf{z}}).

The approach taken by the vanilla GAN is to form an adversarial system from which GG continues to receive updates to improve its performance. More precisely it introduces a “discriminator function” D(x)D(x), which tries to dismiss the samples generated by GG as fakes. The discriminator D(x)D(x) is simply a classifier that tries to distinguish samples in the training set 𝒳{\mathcal{X}} (real samples) from the generated samples G(𝐳)G({\mathbf{z}}) (fake samples). It assigns to each sample 𝐱{\mathbf{x}} a probability D(𝐱)[0,1]D({\mathbf{x}})\in[0,1] for its likelihood to be from the same distribution as the training samples. When samples G(𝐳j)G({\mathbf{z}}_{j}) are generated by GG, the discriminator DD tries to reject them as fakes. In the beginning this shouldn’t be hard because the generator GG is not very good. But each time GG fails to generate samples to fool DD, it will learn and adjust with an improvement update. The improved GG will perform better, and now it is the discriminator DD’s turn to update itself for improvement. Through this adversarial iterative process an equilibrium is eventually reached, so that even with the best discriminator DD it can do no better than random guess. At such point, the generated samples should be very similar in distribution to the training samples 𝒳{\mathcal{X}}.

So one may ask: what do neural networks and deep learning have to do with all this? The answer is that we basically have the fundamental faith that deep neural networks can approximate just about any function, through proper tuning of the network parameters using the training data. In particular, neural networks excel at classification problems. Not surprisingly, for a GAN we shall model both the discriminator function D and the generator function G as neural networks with parameters \omega and \theta, respectively. Thus we shall more precisely write D(x) as D_{\omega}(x) and G(z) as G_{\theta}(z), and denote \nu_{\theta}:=\gamma\circ G^{-1}_{\theta}. Our objective is to find the desired G_{\theta}({\mathbf{z}}) by properly tuning \theta.
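To make the setup concrete, here is a minimal sketch (not from [4]) of how G_{\theta} and D_{\omega} might be parametrized as small fully connected networks in PyTorch. The dimensions n, d and all layer sizes are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

n, d = 784, 100  # data dimension and latent dimension (illustrative values)

# Generator G_theta : R^d -> R^n
G = nn.Sequential(
    nn.Linear(d, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, n), nn.Tanh(),      # outputs scaled to [-1, 1]
)

# Discriminator D_omega : R^n -> [0, 1]
D = nn.Sequential(
    nn.Linear(n, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),   # probability that the input is a "real" sample
)

z = torch.randn(64, d)   # a minibatch of z ~ gamma = N(0, I_d)
x_fake = G(z)            # generated samples G_theta(z) in R^n
scores = D(x_fake)       # D_omega(G_theta(z)) in [0, 1]
```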

2. Mathematical Formulation of the Vanilla GAN

The adversarial game described in the previous section can be formulated mathematically by minimax of a target function between the discriminator function D(x):n[0,1]D(x):{\mathbb{R}}^{n}{\longrightarrow}[0,1] and the generator function G:dnG:{\mathbb{R}}^{d}{\longrightarrow}{\mathbb{R}}^{n}. The generator GG turns random samples 𝐳d{\mathbf{z}}\in{\mathbb{R}}^{d} from distribution γ\gamma into generated samples G(𝐳)G({\mathbf{z}}). The discriminator DD tries to tell them apart from the training samples coming from the distribution μ\mu, while GG tries to make the generated samples as similar in distribution to the training samples. In [4] a target loss function is proposed to be

(2.1) V(D,G):={\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\log D({\mathbf{x}})]+{\mathbb{E}}_{{\mathbf{z}}\sim\gamma}[\log(1-D(G({\mathbf{z}})))],

where 𝔼{\mathbb{E}} denotes the expectation with respect to a distribution specified in the subscript. When there is no confusion we may drop the subscript. The vanilla GAN solves the minimax problem

(2.2) \min_{G}\max_{D}\,V(D,G):=\min_{G}\max_{D}\,\Bigl({\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\log D({\mathbf{x}})]+{\mathbb{E}}_{{\mathbf{z}}\sim\gamma}[\log(1-D(G({\mathbf{z}})))]\Bigr).

Intuitively, for a given generator GG, maxDV(D,G)\max_{D}\,V(D,G) optimizes the discriminator DD to reject generated samples G(𝐳)G({\mathbf{z}}) by attempting to assign high values to samples from the distribution μ\mu and low values to generated samples G(𝐳)G({\mathbf{z}}). Conversely, for a given discriminator DD, minGV(D,G)\min_{G}\,V(D,G) optimizes GG so that the generated samples G(𝐳)G({\mathbf{z}}) will attempt to “fool” the discriminator DD into assigning high values.

Now set 𝐲=G(𝐳)n{\mathbf{y}}=G({\mathbf{z}})\in{\mathbb{R}}^{n}, which has distribution ν:=γG1\nu:=\gamma\circ G^{-1} as 𝐳d{\mathbf{z}}\in{\mathbb{R}}^{d} has distribution γ\gamma. We can now rewrite V(D,G)V(D,G) in terms of DD and ν\nu as

\tilde{V}(D,\nu):=V(D,G)={\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\log D({\mathbf{x}})]+{\mathbb{E}}_{{\mathbf{z}}\sim\gamma}[\log(1-D(G({\mathbf{z}})))]
={\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\log D({\mathbf{x}})]+{\mathbb{E}}_{{\mathbf{y}}\sim\nu}[\log(1-D({\mathbf{y}}))]
(2.3) =\int_{{\mathbb{R}}^{n}}\log D(x)\,d\mu(x)+\int_{{\mathbb{R}}^{n}}\log(1-D(y))\,d\nu(y).

The minimax problem (2.2) becomes

(2.4) minGmaxDV(D,G)=minGmaxD(nlogD(x)𝑑μ(x)+nlog(1D(y))𝑑ν(y)).\min_{G}\max_{D}\,V(D,G)=\min_{G}\max_{D}\,\Bigl{(}\int_{{\mathbb{R}}^{n}}\log D(x)\,d\mu(x)+\int_{{\mathbb{R}}^{n}}\log(1-D(y))\,d\nu(y)\Bigr{)}.

Assume that \mu has density p(x) and \nu has density function q(x) (which of course can only happen if d\geq n). Then

(2.5) V(D,\nu)=\int_{{\mathbb{R}}^{n}}\Bigl(\log D(x)p(x)+\log(1-D(x))q(x)\Bigr)\,dx.

The minimax problem (2.2) can now be written as

(2.6) \min_{G}\max_{D}\,V(D,G)=\min_{G}\max_{D}\,\int_{{\mathbb{R}}^{n}}\Bigl(\log D(x)p(x)+\log(1-D(x))q(x)\Bigr)\,dx.

Observe that the above is equivalent to minνmaxDV~(D,ν)\min_{\nu}\max_{D}\tilde{V}(D,\nu) under the constraint that ν=γG1\nu=\gamma\circ G^{-1} for some GG. But to better understand the minimax problem it helps to examine minνmaxDV~(D,ν)\min_{\nu}\max_{D}\tilde{V}(D,\nu) without this constraint. For the case where μ,ν\mu,\nu have densities [4] has established the following results:

Proposition 2.1 ([4]).

Given probability distributions μ\mu and ν\nu on n{\mathbb{R}}^{n} with densities p(x)p(x) and q(x)q(x) respectively,

maxDV(D,ν)=maxDn(logD(x)p(x)+log(1D(x))q(x))𝑑x\max_{D}\,V(D,\nu)=\max_{D}\,\int_{{\mathbb{R}}^{n}}\Bigl{(}\log D(x)p(x)+\log(1-D(x))q(x)\Bigr{)}\,dx

is attained by Dp,q(x)=p(x)p(x)+q(x)D_{p,q}(x)=\frac{p(x)}{p(x)+q(x)} for xsupp(μ)supp(ν)x\in{\rm supp}(\mu)\cup{\rm supp}(\nu).

The above proposition leads to

Theorem 2.2 ([4]).

Let p(x)p(x) be a probability density function on n{\mathbb{R}}^{n}. For probability distribution ν\nu with density function q(x)q(x) and D:n[0,1]D:{\mathbb{R}}^{n}{\longrightarrow}[0,1] consider the minimax problem

(2.7) minνmaxDV~(D,ν)=minνmaxDn(logD(x)p(x)+log(1D(x))q(x))𝑑x.\min_{\nu}\max_{D}\,\tilde{V}(D,\nu)=\min_{\nu}\max_{D}\,\int_{{\mathbb{R}}^{n}}\Bigl{(}\log D(x)p(x)+\log(1-D(x))q(x)\Bigr{)}\,dx.

Then the solution is attained with q(x)=p(x)q(x)=p(x) and D(x)=1/2D(x)=1/2 for all xsupp(p)x\in{\rm supp}(p).

Theorem 2.2 says the solution to the minimax problem (2.7) is exactly what we are looking for, under the assumption that the distributions have densities. As discussed earlier, this assumption ignores the possibility that the distribution of interest lies on a lower dimensional manifold and thus has no density function. Fortunately, the theorem actually holds in the general setting for arbitrary distributions. We have:

Theorem 2.3.

Let μ\mu be a given probability distribution on n{\mathbb{R}}^{n}. For probability distribution ν\nu and function D:n[0,1]D:{\mathbb{R}}^{n}{\longrightarrow}[0,1] consider the minimax problem

(2.8) minνmaxDV~(D,ν)=minνmaxDn(logD(x)dμ(x)+log(1D(x))dν(x)).\min_{\nu}\max_{D}\,\tilde{V}(D,\nu)=\min_{\nu}\max_{D}\,\int_{{\mathbb{R}}^{n}}\Bigl{(}\log D(x)\,d\mu(x)+\log(1-D(x))\,d\nu(x)\Bigr{)}.

Then the solution is attained with ν=μ\nu=\mu and D(x)=12D(x)=\frac{1}{2} μ\mu-almost everywhere.

Proof.  The proof follows directly from Theorem 3.7 and the discussion in Subsection 3.5, Example 2.  

Like many minimax problems, one may use the alternating optimization algorithm to solve (2.7), which alternates the updating of DD and qq (hence GG). An updating cycle consists of first updating DD for a given qq, and then updating qq with the new DD. This cycle is repeated until we reach an equilibrium. The following is given in [4]:

Proposition 2.4 ([4]).

If in each cycle the discriminator D is allowed to reach its optimum given q(x), and q(x) is then updated so as to improve the minimization criterion

\min_{q}\,\int_{{\mathbb{R}}^{n}}\Bigl(\log D(x)p(x)+\log(1-D(x))q(x)\Bigr)\,dx,

then q converges to p.

Here I have changed the wording a little from the original statement, but have kept its essence intact. From a purely mathematical angle this proposition is not rigorous. However, it provides a practical framework for solving the vanilla GAN minimax problem: in each cycle we may first optimize the discriminator D(x) all the way for the current q(x), and then update q(x) a little bit given the new D(x). Repeating this cycle will lead us to the desired solution. In practice, however, we rarely optimize D all the way for a given G; instead we usually update D a little bit before switching to updating G.

Note that the unconstrained minimax problem (2.7) and (2.8) are not the same as the original minimax problem (2.2) or the equivalent formulation (2.3), where ν\nu is constrained to be of the form ν=γG1\nu=\gamma\circ G^{-1}. Nevertheless it is reasonable in practice to assume that (2.2) and (2.3) will exhibit similar properties as those being shown in Theorem 2.3 and Proposition 2.4. In fact, we shall assume the same even after we further restrict the discriminator and generator functions to be neural networks D=DωD=D_{\omega} and G=GθG=G_{\theta} as intended. Set νθ=γGθ1\nu_{\theta}=\gamma\circ G_{\theta}^{-1}. Under this model our minimax problem has become minθmaxωV(Dω,Gθ)\min_{\theta}\max_{\omega}\,V(D_{\omega},G_{\theta}) where

(2.9) V(D_{\omega},G_{\theta})={\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\log D_{\omega}({\mathbf{x}})]+{\mathbb{E}}_{{\mathbf{z}}\sim\gamma}[\log(1-D_{\omega}(G_{\theta}({\mathbf{z}})))]
(2.10) =\int_{{\mathbb{R}}^{n}}\Bigl(\log D_{\omega}(x)\,d\mu(x)+\log(1-D_{\omega}(x))\,d\nu_{\theta}(x)\Bigr).

Equation (2.9) is the key to carrying out the actual optimization: since we do not have the explicit expression for the target distribution μ\mu, we shall approximate the expectations through sample averages. Thus (2.9) allows us to approximate V(Dω,Gθ)V(D_{\omega},G_{\theta}) using samples. More specifically, let 𝒜{\mathcal{A}} be a subset of samples from the training data set 𝒳{\mathcal{X}} (a minibatch) and {\mathcal{B}} be a minibatch of samples in d{\mathbb{R}}^{d} drawn from the distribution γ\gamma. Then we do the approximation

(2.11) {\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\log D_{\omega}({\mathbf{x}})]\approx\frac{1}{|{\mathcal{A}}|}\sum_{{\mathbf{x}}\in{\mathcal{A}}}\log D_{\omega}({\mathbf{x}})
(2.12) {\mathbb{E}}_{{\mathbf{z}}\sim\gamma}[\log(1-D_{\omega}(G_{\theta}({\mathbf{z}})))]\approx\frac{1}{|{\mathcal{B}}|}\sum_{{\mathbf{z}}\in{\mathcal{B}}}\log(1-D_{\omega}(G_{\theta}({\mathbf{z}}))).

The following algorithm for the vanilla GAN was presented in [4]:

  Vanilla GAN Algorithm Minibatch stochastic gradient descent training of generative adversarial nets. The number of steps to apply to the discriminator, kk, is a hyperparameter. k=1k=1, the least expensive option, was used in the experiments in [4].
 

for number of training iterations do
for k steps do

  • Sample minibatch of m samples \{{\mathbf{z}}_{1},\dots,{\mathbf{z}}_{m}\} in {\mathbb{R}}^{d} from the distribution \gamma.

  • Sample minibatch of m samples \{{\mathbf{x}}_{1},\dots,{\mathbf{x}}_{m}\}\subset{\mathcal{X}} from the training set {\mathcal{X}}.

  • Update the discriminator D_{\omega} by ascending its stochastic gradient with respect to \omega:

    \nabla_{\omega}\frac{1}{m}\sum_{i=1}^{m}\Bigl[\log D_{\omega}({\mathbf{x}}_{i})+\log(1-D_{\omega}(G_{\theta}({\mathbf{z}}_{i})))\Bigr].

end for

  • Sample minibatch of m samples \{{\mathbf{z}}_{1},\dots,{\mathbf{z}}_{m}\} in {\mathbb{R}}^{d} from the distribution \gamma.

  • Update the generator G_{\theta} by descending its stochastic gradient with respect to \theta:

    \nabla_{\theta}\frac{1}{m}\sum_{i=1}^{m}\log(1-D_{\omega}(G_{\theta}({\mathbf{z}}_{i}))).

end for

The gradient-based updates can use any standard gradient-based learning rule. The paper used momentum in their experiments.
 

Proposition 2.4 serves as a heuristic justification for the convergence of the algorithm. One problem often encountered with the Vanilla GAN Algorithm is that updating G_{\theta} by minimizing {\mathbb{E}}_{{\mathbf{z}}\sim\gamma}[\log(1-D_{\omega}(G_{\theta}({\mathbf{z}})))] may saturate early. So instead the authors substituted it with minimizing -{\mathbb{E}}_{{\mathbf{z}}\sim\gamma}[\log D_{\omega}(G_{\theta}({\mathbf{z}}))]. This is the well-known “\log D trick”, and it seems to offer superior performance. We shall examine this more closely later on.
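The following is a minimal sketch of one training cycle of the algorithm above (with k=1) in PyTorch, reusing the hypothetical networks G and D, the latent dimension d, and a data loader dataloader as placeholders; the optimizer and learning rate are illustrative, not those of [4]. The commented line shows the non-saturating “\log D” generator loss as an alternative.

```python
import torch

opt_D = torch.optim.SGD(D.parameters(), lr=1e-3, momentum=0.9)
opt_G = torch.optim.SGD(G.parameters(), lr=1e-3, momentum=0.9)
eps = 1e-8  # numerical safeguard inside the logarithms

for x_real in dataloader:                      # minibatch x_1,...,x_m from the training set X
    m = x_real.size(0)

    # Discriminator step: ascend V(D, G) in omega (implemented as descending -V).
    z = torch.randn(m, d)
    x_fake = G(z).detach()                     # treat G as fixed while updating D
    loss_D = -(torch.log(D(x_real) + eps).mean()
               + torch.log(1 - D(x_fake) + eps).mean())
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator step: descend E_z[log(1 - D(G(z)))] in theta.
    z = torch.randn(m, d)
    loss_G = torch.log(1 - D(G(z)) + eps).mean()
    # Non-saturating "log D" alternative:
    # loss_G = -torch.log(D(G(z)) + eps).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```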

3. ff-Divergence and ff-GAN

Recall that the motivating problem for GAN is that we have a probability distribution μ\mu known only in the form of a finite set of samples (training samples). We would like to learn this target distribution through iterative improvement. Starting with a probability distribution ν\nu we iteratively update ν\nu so it gets closer and closer to the target distribution μ\mu. Of course to do so we will first need a way to measure the discrepancy between two probability distributions. The vanilla GAN has employed a discriminator for this purpose. But there are other ways.

3.1. ff-Divergence

One way to measure the discrepancy between two probability distributions μ\mu and ν\nu is through the Kullback-Leibler divergence, or KL divergence. Let p(x)p(x) and q(x)q(x) be two probability density functions defined on n{\mathbb{R}}^{n}. The KL-divergence of pp and qq is defined as

D_{KL}(p\|q):=\int_{{\mathbb{R}}^{n}}\log\Bigl(\frac{p(x)}{q(x)}\Bigr)p(x)\,dx.

Note that DKL(pq)D_{KL}(p\|q) is finite only if q(x)0q(x)\neq 0 on supp(p){\rm supp}(p) almost everywhere. While KL-divergence is widely used, there are other divergences such as the Jensen-Shannon divergence

D_{JS}(p\|q):=\frac{1}{2}D_{KL}(p\|M)+\frac{1}{2}D_{KL}(q\|M),

where M:=\frac{p(x)+q(x)}{2}. One advantage of the Jensen-Shannon divergence is that it is well defined for any probability density functions p(x) and q(x), and it is symmetric: D_{JS}(p\|q)=D_{JS}(q\|p). In fact, it follows from Proposition 2.1 that the minimization part of the minimax problem in the vanilla GAN is precisely the minimization over q of D_{JS}(p\|q) for a given density function p. As it turns out, both D_{KL} and D_{JS} are special cases of the more general f-divergence, introduced by Ali and Silvey [1].
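As a small numerical illustration (purely illustrative, not part of [1]), the sketch below evaluates both divergences for discrete distributions, where the integrals become sums; it shows in particular that D_{KL} can be infinite while D_{JS} stays finite.

```python
import numpy as np

def kl(p, q):
    # D_KL(p || q) = sum_i p_i * log(p_i / q_i), with the convention 0 * log 0 = 0
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.4, 0.4, 0.2])

print(kl(p, q))   # finite, since q > 0 wherever p > 0
print(kl(q, p))   # inf (with a numpy warning): q has mass where p has none
print(js(p, q))   # always finite, and symmetric in p and q
```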

Let f(x)f(x) be a strictly convex function with domain II\subseteq{\mathbb{R}} such that f(1)=0f(1)=0. Throughout this paper we shall adopt the convention that f(𝐱)=+f({\mathbf{x}})=+\infty for all 𝐱I{\mathbf{x}}\not\in I.

Definition 3.1.

Let p(x)p(x) and q(x)q(x) be two probability density functions on n{\mathbb{R}}^{n}. Then the ff-divergence of pp and qq is defined as

(3.1) D_{f}(p\|q):={\mathbb{E}}_{{\mathbf{x}}\sim q}\Bigl[f\Bigl(\frac{p({\mathbf{x}})}{q({\mathbf{x}})}\Bigr)\Bigr]=\int_{{\mathbb{R}}^{n}}f\Bigl(\frac{p(x)}{q(x)}\Bigr)q(x)\,dx,

where we adopt the convention that f\bigl(\frac{p(x)}{q(x)}\bigr)q(x)=0 if q(x)=0.

Remark.  Because the ff-divergence is not symmetric in the sense that Df(pq)Df(qp)D_{f}(p\|q)\neq D_{f}(q\|p) in general, there might be some confusion as to which divides which in the fraction. If we follow the original Ali and Silvey paper [1] then the definition of Df(pq)D_{f}(p\|q) would be our Df(qp)D_{f}(q\|p). Here we adopt the same definition as in the paper [9], which first introduced the concept of ff-GAN.

Proposition 3.1.

Let f(x)f(x) be a strictly convex function on domain II\subseteq{\mathbb{R}} such that f(1)=0f(1)=0. Assume either supp(p)supp(q){\rm supp}(p)\subseteq{\rm supp}(q) (equivalent to pqp\ll q) or f(t)>0f(t)>0 for t[0,1)t\in[0,1). Then Df(pq)0D_{f}(p\|q)\geq 0, and Df(pq)=0D_{f}(p\|q)=0 if and only if p(x)=q(x)p(x)=q(x).

Proof.  By the convexity of ff and Jensen’s Inequality

D_{f}(p\|q)={\mathbb{E}}_{{\mathbf{x}}\sim q}\Bigl[f\Bigl(\frac{p({\mathbf{x}})}{q({\mathbf{x}})}\Bigr)\Bigr]\geq f\Bigl({\mathbb{E}}_{{\mathbf{x}}\sim q}\Bigl[\frac{p({\mathbf{x}})}{q({\mathbf{x}})}\Bigr]\Bigr)=f\Bigl(\int_{{\rm supp}(q)}p(x)\,dx\Bigr)=:f(r),

where the equality holds if and only if p(x)/q(x) is a constant or f is linear on the range of p(x)/q(x). Since f is strictly convex, it can only be the former. Thus for the equality to hold we must have p(x)=rq(x) on {\rm supp}(q).

Now clearly r\leq 1. If {\rm supp}(p)\subseteq{\rm supp}(q) then r=1, and we have D_{f}(p\|q)\geq f(1)=0, with equality if and only if p=q. If f(t)>0 for all t\in[0,1) then we also have D_{f}(p\|q)\geq f(r)\geq 0, and for r<1 we have D_{f}(p\|q)\geq f(r)>0. Thus if D_{f}(p\|q)=0 we must have r=1 and p=q.  

It should be noted that ff-divergence can be defined for two arbitrary probability measures μ\mu and ν\nu on a probability space Ω\Omega. Let τ\tau be another probability measure such that μ,ντ\mu,\nu\ll\tau, namely both μ,ν\mu,\nu are absolutely continuous with respect to τ\tau. For example, we may take τ=12(μ+ν)\tau=\frac{1}{2}(\mu+\nu). Let p=dμ/dτp=d\mu/d\tau and q=dν/dτq=d\nu/d\tau be their Radon-Nikodym derivatives. We define the ff-divergence of μ\mu and ν\nu as

(3.2) D_{f}(\mu\|\nu):=\int_{\Omega}f\Bigl(\frac{p(x)}{q(x)}\Bigr)q(x)\,d\tau={\mathbb{E}}_{{\mathbf{x}}\sim\nu}\Bigl[f\Bigl(\frac{p({\mathbf{x}})}{q({\mathbf{x}})}\Bigr)\Bigr].

Again we adopt the convention that f(p(x)q(x))q(x)=0f(\frac{p(x)}{q(x)})q(x)=0 if q(x)=0q(x)=0. It is not hard to show that this definition is independent of the choice for the probability measure τ\tau, and Proposition 3.1 holds for the more general Df(μν)D_{f}(\mu\|\nu) as well.

With ff-divergence measuring the discrepancy between two measures, we can now consider applying it to GANs. The biggest challenge here is that we don’t have an explicit expression for the target distribution μ\mu. As with the vanilla GAN, to compute Df(pq)D_{f}(p\|q) we must express it in terms of sample averages. Fortunately earlier work by Nguyen, Wainwright and Jordan [8] has already tackled this problem using the convex conjugate of a convex function.

3.2. Convex Conjugate of a Convex Function

The convex conjugate of a convex function f(x)f(x) is also known as the Fenchel transform or Fenchel-Legendre transform of ff, which is a generalization of the well known Legendre transform. Let f(x)f(x) be a convex function defined on an interval II\subseteq{\mathbb{R}}. Then its convex conjugate f:{±}f^{*}:{\mathbb{R}}{\longrightarrow}{\mathbb{R}}\cup\{\pm\infty\} is defined to be

(3.3) f^{*}(y)=\sup_{t\in I}\,\{ty-f(t)\}.

As mentioned earlier, we adopt the convention that f(t)=+\infty for t\not\in I, so the supremum may equivalently be taken over all of {\mathbb{R}}. Below is a more explicit expression for f^{*}(y).

Lemma 3.2.

Assume that f(x)f(x) is strictly convex and continuously differentiable on its domain II\subseteq{\mathbb{R}}, where Io=(a,b)I^{o}=(a,b) with a,b[,+]a,b\in[-\infty,+\infty]. Then

(3.4) f^{*}(y)=\begin{cases}yf^{\prime-1}(y)-f(f^{\prime-1}(y)),&y\in f^{\prime}(I^{o})\\ \lim_{t\rightarrow b^{-}}(ty-f(t)),&y\geq\lim_{t\rightarrow b^{-}}f^{\prime}(t)\\ \lim_{t\rightarrow a^{+}}(ty-f(t)),&y\leq\lim_{t\rightarrow a^{+}}f^{\prime}(t).\end{cases}

Proof.  Let g(t)=ty-f(t). Then g^{\prime}(t)=y-f^{\prime}(t) on I, which is strictly decreasing by the strict convexity of f(t). Hence g(t) is strictly concave on I. If y=f^{\prime}(t^{*}) for some t^{*}\in I^{o} then t^{*} is a critical point of g, so it must be its global maximum. Thus g(t) attains its maximum at t=t^{*}=f^{\prime-1}(y). Now assume y is not in the range of f^{\prime}; then g^{\prime}(t)>0 or g^{\prime}(t)<0 on I^{o}. Consider the case g^{\prime}(t)>0 for all t\in I^{o}. Clearly the supremum of g(t) is achieved as t{\rightarrow}b^{-} since g(t) is monotonically increasing. The case g^{\prime}(t)<0 for all t\in I^{o} is handled similarly.  

Remark.  Note that ++\infty is a possible value for ff^{*}. The domain Dom(f){\rm Dom}\,(f^{*}) for ff^{*} is defined as the set on which ff^{*} is finite.

A consequence of Lemma 3.2 is that under the assumption that ff is continuously differentiable, suptI{tyf(t)}\sup_{t\in I}\,\{ty-f(t)\} is attained for some tIt\in I if and only if yy is in the range of f(t)f^{\prime}(t). This is clear if yf(Io)y\in f^{\prime}(I^{o}), but it can also be argued rather easily for finite boundary points of II. More generally without the assumption of differentiability, suptI{tyf(t)}\sup_{t\in I}\,\{ty-f(t)\} is attained if and only if yf(t)y\in\partial f(t) for some tIt\in I, where f(t)\partial f(t) is the set of sub-derivatives. The following proposition summarizes some important properties of convex conjugate:

Proposition 3.3.

Let f(x) be a convex function on {\mathbb{R}} with range in {\mathbb{R}}\cup\{\pm\infty\}. Then f^{*} is convex and lower semi-continuous. Furthermore, if f is lower semi-continuous then it satisfies the Fenchel duality f=(f^{*})^{*}.

Proof.  This is a well known result. We omit the proof here.  

The table below lists the convex dual of some common convex functions:

f(x) | f^{*}(y)
f(x)=-\ln(x),~x>0 | f^{*}(y)=-1-\ln(-y),~y<0
f(x)=e^{x} | f^{*}(y)=y\ln(y)-y,~y>0
f(x)=x^{2} | f^{*}(y)=\frac{1}{4}y^{2}
f(x)=\sqrt{1+x^{2}} | f^{*}(y)=-\sqrt{1-y^{2}},~y\in[-1,1]
f(x)=0,~x\in[0,1] | f^{*}(y)=\mathrm{ReLU}(y)
f(x)=g(ax-b),~a\neq 0 | f^{*}(y)=\frac{b}{a}y+g^{*}(\frac{y}{a})
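The entries above are easy to verify from (3.3); as an illustration, the following small sketch checks the first row numerically by brute-force maximization of ty-f(t) over a grid (all numbers are illustrative).

```python
import numpy as np

def conjugate(f, y, ts):
    # f*(y) = sup_t { t*y - f(t) }, approximated by a maximum over a grid of t values
    return np.max(ts * y - f(ts))

ts = np.linspace(1e-4, 50.0, 200000)   # grid covering the domain t > 0
f = lambda t: -np.log(t)

for y in [-0.5, -1.0, -2.0]:           # f*(y) is finite only for y < 0
    print(y, conjugate(f, y, ts), -1.0 - np.log(-y))   # numerical vs closed form
```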

3.3. Estimating ff-Divergence Using Convex Dual

To estimate the f-divergence from samples, Nguyen, Wainwright and Jordan [8] have proposed the use of the convex dual of f. Let \mu,\nu be probability measures such that \mu,\nu\ll\tau for some probability measure \tau, with p=d\mu/d\tau and q=d\nu/d\tau. In the nice case of \mu\ll\nu, by f(x)=(f^{*})^{*}(x) we have

D_{f}(\mu\|\nu):=\int_{\Omega}f\Bigl(\frac{p(x)}{q(x)}\Bigr)q(x)\,d\tau
(3.5) =\int_{\Omega}\sup_{t}\Bigl\{t\,\frac{p(x)}{q(x)}-f^{*}(t)\Bigr\}q(x)\,d\tau(x)
(3.6) =\int_{\Omega}\sup_{t}\,\Bigl\{tp(x)-f^{*}(t)q(x)\Bigr\}\,d\tau(x)
\geq\int_{\Omega}\Bigl(T(x)p(x)-f^{*}(T(x))q(x)\Bigr)\,d\tau(x)
={\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[f^{*}(T({\mathbf{x}}))]

where T(x)T(x) is any Borel function. Thus taking TT over all Borel functions we have

(3.7) D_{f}(\mu\|\nu)\geq\sup_{T}\Bigl({\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[f^{*}(T({\mathbf{x}}))]\Bigr).

On the other hand, note that for each xx, supt{tp(x)q(x)f(t)}\sup_{t}\bigl{\{}t\frac{p(x)}{q(x)}-f^{*}(t)\bigr{\}} is attained for some t=T(x)t=T^{*}(x) as long as p(x)q(x)\frac{p(x)}{q(x)} is in the range of sub-derivatives of ff^{*}. Thus if this holds for all xx we have

D_{f}(\mu\|\nu)={\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T^{*}({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[f^{*}(T^{*}({\mathbf{x}}))].

In fact, equality holds in general under mild conditions.

Theorem 3.4.

Let f(t)f(t) be strictly convex and continuously differentiable on II\subseteq{\mathbb{R}}. Let μ,ν\mu,\nu be Borel probability measures on n{\mathbb{R}}^{n} such that μν\mu\ll\nu. Then

(3.8) D_{f}(\mu\|\nu)=\sup_{T}\Bigl({\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[f^{*}(T({\mathbf{x}}))]\Bigr),

where \sup_{T} is taken over all Borel functions T:~{\mathbb{R}}^{n}{\longrightarrow}\,{\rm Dom}\,(f^{*}). Furthermore, write p(x)=d\mu/d\nu(x) and assume that p(x)\in I for all x. Then T^{*}(x):=f^{\prime}(p(x)) is an optimizer of (3.8).

Proof.  We have already established the upper bound part (3.7). Now we establish the lower bound part. Let p(x)=d\mu/d\nu(x). We examine (3.5) with q(x)=1 and \sup_{t}\,\{tp(x)-f^{*}(t)\} for each x. Denote g_{x}(t)=tp(x)-f^{*}(t). Let S={\rm Dom}\,(f^{*}) and assume S^{o}=(a,b) where a,b\in{\mathbb{R}}\cup\{\pm\infty\}. We now construct a sequence T_{k}(x) as follows. If p(x) is in the range of {f^{*}}^{\prime}, say p(x)={f^{*}}^{\prime}(t_{x}), we set T_{k}(x)=t_{x}\in S. If p(x)-{f^{*}}^{\prime}(t)>0 for all t then g_{x}(t) is strictly increasing; its supremum is attained at the boundary point b, and we set T_{k}(x)=b_{k}\in S where b_{k}{\rightarrow}b^{-}. If p(x)-{f^{*}}^{\prime}(t)<0 for all t then g_{x}(t) is strictly decreasing; its supremum is attained at the boundary point a, and we set T_{k}(x)=a_{k}\in S where a_{k}{\rightarrow}a^{+}. By Lemma 3.2 and its proof we know that

limk(Tk(x)p(x)f(Tk(x)))=supt{tp(x)f(t)}.\lim_{k{\rightarrow}\infty}\Bigl{(}T_{k}(x)p(x)-f^{*}(T_{k}(x))\Bigr{)}=\sup_{t}\,\Bigl{\{}tp(x)-f^{*}(t)\Bigr{\}}.

Thus

\lim_{k{\rightarrow}\infty}\Bigl({\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T_{k}({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[f^{*}(T_{k}({\mathbf{x}}))]\Bigr)=D_{f}(\mu\|\nu).

To establish the last part of the theorem, assume that p(x)Ip(x)\in I. By Lemma 3.2, set s(t)=f1(t)s(t)=f^{\prime-1}(t) for tt in the range of ff^{\prime} so we have

f(t)=(ts(t)f(s(t)))=s(t)+ts(t)f(s(t))s(t)=s(t).{f^{*}}^{\prime}(t)=\Bigl{(}ts(t)-f(s(t))\Bigr{)}^{\prime}=s(t)+ts^{\prime}(t)-f^{\prime}(s(t))s^{\prime}(t)=s(t).

Thus gx(t)=p(x)f(t)=p(x)f1(t)g_{x}^{\prime}(t)=p(x)-{f^{*}}^{\prime}(t)=p(x)-f^{\prime-1}(t). It follows that gx(t)g_{x}(t) attains its maximum at t=f(p(x))t=f^{\prime}(p(x)). This proves that T(x)=f(p(x))T^{*}(x)=f^{\prime}(p(x)) is an optimizer for (3.8).  
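As a concrete illustration of Theorem 3.4 (my example, not from [8]), take f(t)=t\ln t, for which D_{f}(\mu\|\nu)=\int\log(d\mu/d\nu)\,d\mu is the KL divergence under Definition 3.1, and let \mu=N(0,1), \nu=N(1,1). Then f^{*}(y)=e^{y-1}, the optimizer is T^{*}(x)=f^{\prime}(d\mu/d\nu(x))=\ln(d\mu/d\nu)(x)+1=\frac{3}{2}-x, and the closed-form divergence is \frac{1}{2}. The sketch below checks the variational value (3.8) by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200000

x_mu = rng.normal(0.0, 1.0, N)      # samples from mu = N(0, 1)
x_nu = rng.normal(1.0, 1.0, N)      # samples from nu = N(1, 1)

T_star = lambda x: 1.5 - x           # T*(x) = f'(d mu / d nu (x)) for f(t) = t*log(t)
f_conj = lambda y: np.exp(y - 1.0)   # f*(y) = exp(y - 1)

estimate = T_star(x_mu).mean() - f_conj(T_star(x_nu)).mean()
print(estimate)   # should be close to D_f(mu || nu) = 0.5
```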

The above theorem requires that μν\mu\ll\nu. What if this does not hold? We have

Theorem 3.5.

Let f(t)f(t) be convex such that the domain of ff^{*} contains (a,)(a,\infty) for some aa\in{\mathbb{R}}. Let μ,ν\mu,\nu be Borel probability measures on n{\mathbb{R}}^{n} such that μ≪̸ν\mu\not\ll\nu. Then

(3.9) supT(𝔼𝐱μ[T(𝐱)]𝔼𝐱ν[f(T(𝐱))])=+,\sup_{T}\Bigl{(}{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[f^{*}(T({\mathbf{x}}))]\Bigr{)}=+\infty,

where supT\sup_{T} is taken over all Borel functions T:nDom(f)T:~{}{\mathbb{R}}^{n}{\longrightarrow}\,{\rm Dom}\,(f^{*}).

Proof.  Take τ=12(μ+ν)\tau=\frac{1}{2}(\mu+\nu). Then μ,ντ\mu,\nu\ll\tau. Let p=dμ/dτp=d\mu/d\tau and q=dν/dτq=d\nu/d\tau be their Radon-Nikodym derivatives. Since μ≪̸ν\mu\not\ll\nu there exists a set S0S_{0} with μ(S0)>0\mu(S_{0})>0 on which q(x)=0q(x)=0. Fix a t0t_{0} in the domain of ff^{*}. Let Tk(x)=kT_{k}(x)=k for xS0x\in S_{0} and Tk(x)=t0T_{k}(x)=t_{0} otherwise. Then

{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T_{k}({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[f^{*}(T_{k}({\mathbf{x}}))]=k\mu(S_{0})+t_{0}(1-\mu(S_{0}))-f^{*}(t_{0})(1-\nu(S_{0})){\longrightarrow}+\infty.

This proves the theorem.  

As one can see, we clearly have a problem in the above case. If the domain of ff^{*} is not bounded from above, (3.8) does not hold unless μν\mu\ll\nu. In many practical applications the target distribution μ\mu might be singular, as the training data we are given may lie on a lower dimensional manifold. Fortunately there is still hope as given by the next theorem:

Theorem 3.6.

Let f(t)f(t) be a lower semi-continuous convex function such that the domain II^{*} of ff^{*} has supI=b<+\sup I^{*}=b^{*}<+\infty. Let μ,ν\mu,\nu be Borel probability measures on n{\mathbb{R}}^{n} such that μ=μs+μab\mu=\mu_{s}+\mu_{ab}, where μsν\mu_{s}\perp\nu and μabν\mu_{ab}\ll\nu. Then

(3.10) \sup_{T}\Bigl({\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[f^{*}(T({\mathbf{x}}))]\Bigr)=D_{f}(\mu\|\nu)+b^{*}\mu_{s}({\mathbb{R}}^{n}),

where supT\sup_{T} is taken over all Borel functions T:nDom(f)T:~{}{\mathbb{R}}^{n}{\longrightarrow}\,{\rm Dom}\,(f^{*}).

Proof.  Again, take \tau=\frac{1}{2}(\mu+\nu). Then \mu,\nu\ll\tau. The decomposition \mu=\mu_{ab}+\mu_{s}, where \mu_{ab}\ll\nu and \mu_{s}\perp\nu, is unique and guaranteed by the Lebesgue Decomposition Theorem. Let p_{ab}=d\mu_{ab}/d\tau, p_{s}=d\mu_{s}/d\tau and q=d\nu/d\tau be their Radon-Nikodym derivatives. Since \mu_{s}\perp\nu, we may divide {\mathbb{R}}^{n} into {\mathbb{R}}^{n}=\Omega\cup\Omega^{c} where \Omega={\rm supp}(q). Clearly we have q(x)=p_{ab}(x)=0 for x\in\Omega^{c}. Thus

\sup_{T}\Bigl({\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[f^{*}(T({\mathbf{x}}))]\Bigr)
=\sup_{T}\int_{\Omega}\Bigl(T(x)p_{ab}(x)-f^{*}(T(x))q(x)\Bigr)\,d\tau+\sup_{T}\int_{\Omega^{c}}T(x)p_{s}(x)\,d\tau
=\sup_{T}\int_{\Omega}\Bigl(T(x)\frac{p_{ab}(x)}{q(x)}-f^{*}(T(x))\Bigr)q(x)\,d\tau+b^{*}\mu_{s}(\Omega^{c})
=\int_{\Omega}f\Bigl(\frac{p_{ab}(x)}{q(x)}\Bigr)q(x)\,d\tau+b^{*}\mu_{s}({\mathbb{R}}^{n})
=\int_{\Omega}f\Bigl(\frac{p(x)}{q(x)}\Bigr)q(x)\,d\tau+b^{*}\mu_{s}({\mathbb{R}}^{n})
=D_{f}(\mu\|\nu)+b^{*}\mu_{s}({\mathbb{R}}^{n}).

This proves the theorem.  

3.4. ff-GAN: Variational Divergence Minimization (VDM)

We can formulate a generalization of the vanilla GAN using ff-divergence. For a given probability distribution μ\mu, the ff-GAN objective is to minimize the ff-divergence Df(μν)D_{f}(\mu\|\nu) with respect to the probability distribution ν\nu. Carried out in the sample space, ff-GAN solves the following minimax problem

(3.11) \min_{\nu}\sup_{T}\,\Bigl({\mathbb{E}}_{{\mathbf{x}}\sim\nu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[f^{*}(T({\mathbf{x}}))]\Bigr).

The f-GAN framework was first introduced in [9], where the optimization problem (3.11) is referred to as Variational Divergence Minimization (VDM). Note that VDM looks similar to the minimax problem in the vanilla GAN. The Borel function T here is called a critic function, or just a critic. Under the assumption of \mu\ll\nu, by Theorem 3.4 this is equivalent to \min_{\nu}\,D_{f}(\mu\|\nu). One potential problem of f-GAN is that, by Theorem 3.5, if \mu\not\ll\nu then (3.11) is in general not equivalent to \min_{\nu}\,D_{f}(\mu\|\nu). Fortunately, for specially chosen f this is not a problem.

Theorem 3.7.

Let f(t)f(t) be a lower semi-continuous strictly convex function such that the domain II^{*} of ff^{*} has supI=b[0,)\sup I^{*}=b^{*}\in[0,\infty). Assume further that ff is continuously differentiable on its domain and f(t)>0f(t)>0 for t(0,1)t\in(0,1). Let μ\mu be Borel probability measures on n{\mathbb{R}}^{n}. Then ν=μ\nu=\mu is the unique optimizer of

minνsupT(𝔼𝐱ν[T(𝐱)]𝔼𝐱μ[f(T(𝐱))]),\min_{\nu}\sup_{T}\,\Bigl{(}{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[f^{*}(T({\mathbf{x}}))]\Bigr{)},

where supT\sup_{T} is taken over all Borel functions T:nDom(f)T:~{}{\mathbb{R}}^{n}{\longrightarrow}\,{\rm Dom}\,(f^{*}) and infν\inf_{\nu} is taken over all Borel probability measures.

Proof.  By Theorem 3.6, for any Borel probability measure \nu we have

\sup_{T}\Bigl({\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[f^{*}(T({\mathbf{x}}))]\Bigr)=D_{f}(\mu\|\nu)+b^{*}\mu_{s}({\mathbb{R}}^{n})\geq D_{f}(\mu\|\nu).

Now by Proposition 3.1, D_{f}(\mu\|\nu)\geq 0, and equality holds if and only if \nu=\mu. Thus \nu=\mu is the unique optimizer.  

3.5. Examples

We shall now look at some examples of ff-GAN for different choices of the convex function ff.

Example 1: f(t)=𝐥𝐧(t)f(t)=-\ln(t).

This is the KL-divergence. We have f(u)=1ln(u)f^{*}(u)=-1-\ln(-u) with domain I=(,0)I^{*}=(-\infty,0). ff satisfies all conditions of Theorem 3.7. The corresponding ff-GAN objective is

(3.12) minνsupT(𝔼𝐱ν[T(𝐱)]+𝔼𝐱μ[ln(T(𝐱))])+1,\min_{\nu}\sup_{T}\,\Bigl{(}{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[T({\mathbf{x}})]+{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\ln(-T({\mathbf{x}}))]\Bigr{)}+1,

where T(x)<0T(x)<0. If we ignore the constant +1+1 term and set D(x)=T(x)D(x)=-T(x) then we obtain the equivalent minimax problem

minνsupD>0(𝔼𝐱ν[D(𝐱)]+𝔼𝐱μ[ln(D(𝐱))]).\min_{\nu}\sup_{D>0}\,\Bigl{(}{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[-D({\mathbf{x}})]+{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\ln(D({\mathbf{x}}))]\Bigr{)}.

Example 2: f(t)=t\ln t-(t+1)\ln(t+1)+(t+1)\ln 2.

This is the Jensen-Shannon divergence. We have f(u)=ln(2eu)f^{*}(u)=-\ln(2-e^{u}) with domain I=(,ln2)I^{*}=(-\infty,\ln 2). Again ff satisfies all conditions of Theorem 3.7. The corresponding ff-GAN objective is

(3.13) minνsupT(𝔼𝐱ν[T(𝐱)]+𝔼𝐱μ[ln(2eT(𝐱))]),\min_{\nu}\sup_{T}\,\Bigl{(}{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[T({\mathbf{x}})]+{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\ln(2-e^{T({\mathbf{x}})})]\Bigr{)},

where T(x)<ln2T(x)<\ln 2. Set D(x)=112eT(𝐱)D(x)=1-\frac{1}{2}e^{T({\mathbf{x}})}, and so T(x)=ln(1D(x))+ln2T(x)=\ln(1-D(x))+\ln 2. Substituting in (3.13) yields

minνmaxD>0(𝔼𝐱μ[ln(D(𝐱))]+𝔼𝐱ν[ln(1D(𝐱))])+ln4.\min_{\nu}\max_{D>0}\,\Bigl{(}{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\ln(D({\mathbf{x}}))]+{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[\ln(1-D({\mathbf{x}}))]\Bigr{)}+\ln 4.

Ignoring the constant ln4\ln 4, the vanilla GAN is a special case of ff-GAN with ff being the Jensen-Shannon divergence.

Example 3: f(t)=\alpha|t-1| where \alpha>0.

Here we have f(u)=uf^{*}(u)=u with domain I=[α,α]I^{*}=[-\alpha,\alpha]. While ff is not strictly convex and continuously differentiable, it does satisfy the two important conditions of Theorem 3.7 of sup(I)0\sup(I^{*})\geq 0 and f(t)>0f(t)>0 for t[0,1)t\in[0,1). The corresponding ff-GAN objective is

(3.14) minνsup|T|α(𝔼𝐱ν[T(𝐱)]𝔼𝐱μ[T(𝐱)]).\min_{\nu}\sup_{|T|\leq\alpha}\,\Bigl{(}{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]\Bigr{)}.

For α=1\alpha=1, if we require TT to be continuous then the supremum part of (3.14) is precisely the total variation (also known as the Radon metric) between μ\mu and ν\nu, which is closely related to the Wasserstein distance between μ\mu and ν\nu.

Example 4: f(t)=(t𝟏)𝐥𝐧tt+𝟏,t>𝟎f(t)=(t-1)\ln\frac{t}{t+1},~{}t>0 (the “log D Trick”).

Here f′′(t)=3t+1t2(t+1)2>0f^{\prime\prime}(t)=\frac{3t+1}{t^{2}(t+1)^{2}}>0 so ff is strictly convex. It satisfies all conditions of Theorem 3.7. The explicit expression for the convex dual ff^{*} is complicated to write down, however we do know the domain II^{*} for ff^{*} is the range of f(t)f^{\prime}(t), which is (,0)(-\infty,0). The ff-GAN objective is

\min_{\nu}\sup_{T<0}\,\Bigl({\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[f^{*}(T({\mathbf{x}}))]\Bigr).

By Theorem 3.6 and the fact b=0b^{*}=0,

(3.15) supT<0(𝔼𝐱μ[T(𝐱)]𝔼𝐱ν[f(T(𝐱))])=Df(μν).\sup_{T<0}~{}\Bigl{(}{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[f^{*}(T({\mathbf{x}}))]\Bigr{)}=D_{f}(\mu\|\nu).

Take τ=12(μ+ν)\tau=\frac{1}{2}(\mu+\nu) so μ,ντ\mu,\nu\ll\tau. Let p=dμ/dτp=d\mu/d\tau and q=dν/dτq=d\nu/d\tau. Observe that

Df(μν)\displaystyle D_{f}(\mu\|\nu) =𝔼𝐱ν[f(p(𝐱)q(𝐱))]=𝔼𝐱ν[(p(𝐱)q(𝐱)1)lnp(𝐱)p(𝐱)+q(𝐱)]\displaystyle={\mathbb{E}}_{{\mathbf{x}}\sim\nu}\Bigl{[}f\Bigl{(}\frac{p({\mathbf{x}})}{q({\mathbf{x}})}\Bigr{)}\Bigr{]}={\mathbb{E}}_{{\mathbf{x}}\sim\nu}\Bigl{[}\Bigl{(}\frac{p({\mathbf{x}})}{q({\mathbf{x}})}-1\Bigr{)}\ln{\frac{p({\mathbf{x}})}{p({\mathbf{x}})+q({\mathbf{x}})}}\Bigr{]}
=𝔼𝐱μ[lnp(𝐱)p(𝐱)+q(𝐱)]𝔼𝐱ν[lnp(𝐱)p(𝐱)+q(𝐱)].\displaystyle={\mathbb{E}}_{{\mathbf{x}}\sim\mu}\Bigl{[}\ln\frac{p({\mathbf{x}})}{p({\mathbf{x}})+q({\mathbf{x}})}\Bigr{]}-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}\Bigl{[}\ln\frac{p({\mathbf{x}})}{p({\mathbf{x}})+q({\mathbf{x}})}\Bigr{]}.

Denote D(x)=p(𝐱)p(𝐱)+q(𝐱)D(x)=\frac{p({\mathbf{x}})}{p({\mathbf{x}})+q({\mathbf{x}})}. Then the outer minimization of the minimax problem is

(3.16) minν(𝔼𝐱μ[ln(D(𝐱))]𝔼𝐱ν[ln(D(𝐱))]).\min_{\nu}\,\Bigl{(}{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\ln(D({\mathbf{x}}))]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[\ln(D({\mathbf{x}}))]\Bigr{)}.

This is precisely the “ logD\log D” trick used in the original GAN paper [4] for the vanilla GAN to address the saturation problem. Now we can see it is equivalent to the ff-GAN with the above ff. It is interesting to note that directly optimizing (3.15) is hard because it is hard to find the explicit formula for ff^{*} in this case. Thus the vanilla GAN with the “ logD\log D trick” is an indirect way to realize this ff-GAN.

To implement an ff-GAN VDM we resort to the same approach as the vanilla GAN, using neural networks to approximate both T(x)T(x) and ν\nu to solve the minimax problem (3.11)

minνsupT(𝔼𝐱ν[T(𝐱)]𝔼𝐱μ[f(T(𝐱))]).\min_{\nu}\sup_{T}\,\Bigl{(}{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[f^{*}(T({\mathbf{x}}))]\Bigr{)}.

We assume the critic function T(x) comes from a neural network. [9] proposes T(x)=T_{\omega}(x)=g_{f}(S_{\omega}(x)), where S_{\omega} is a neural network with parameters \omega taking input from {\mathbb{R}}^{n}, and g_{f}:~{\mathbb{R}}{\longrightarrow}I^{*} is an output activation function that forces the output of S_{\omega}(x) onto the domain I^{*} of f^{*}. For \nu we again consider its approximation by probability distributions of the form \nu_{\theta}=\gamma\circ G_{\theta}^{-1}, where \gamma is an initially chosen probability distribution on {\mathbb{R}}^{d} (usually Gaussian, where d may or may not equal n), and G_{\theta} is a neural network with parameters \theta, with input from {\mathbb{R}}^{d} and output in {\mathbb{R}}^{n}. Under this model the f-GAN VDM minimax problem (3.11) becomes

(3.17) \min_{\theta}\sup_{\omega}\,\Bigl({\mathbb{E}}_{{\mathbf{z}}\sim\gamma}[g_{f}(S_{\omega}(G_{\theta}({\mathbf{z}})))]-{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[f^{*}(g_{f}(S_{\omega}({\mathbf{x}})))]\Bigr).

As with the vanilla GAN, since we do not have an explicit expression for the target distribution \mu, we shall approximate the expectations through sample averages. More specifically, let {\mathcal{A}} be a minibatch of samples from the training data set {\mathcal{X}} and {\mathcal{B}} be a minibatch of samples in {\mathbb{R}}^{d} drawn from the distribution \gamma. Then we employ the approximations

(3.18) {\mathbb{E}}_{{\mathbf{z}}\sim\gamma}[g_{f}(S_{\omega}(G_{\theta}({\mathbf{z}})))]\approx\frac{1}{|{\mathcal{B}}|}\sum_{{\mathbf{z}}\in{\mathcal{B}}}g_{f}(S_{\omega}(G_{\theta}({\mathbf{z}}))),
(3.19) {\mathbb{E}}_{{\mathbf{x}}\sim\mu}[f^{*}(g_{f}(S_{\omega}({\mathbf{x}})))]\approx\frac{1}{|{\mathcal{A}}|}\sum_{{\mathbf{x}}\in{\mathcal{A}}}f^{*}(g_{f}(S_{\omega}({\mathbf{x}}))).
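As an illustration of how the pieces fit together, the sketch below evaluates the minibatch objective (3.17)-(3.19) for the Jensen-Shannon case of Example 2, where f^{*}(u)=-\ln(2-e^{u}) and I^{*}=(-\infty,\ln 2). The output activation g_{f}(v)=\ln 2-\log(1+e^{-v}) is one convenient choice mapping {\mathbb{R}} onto I^{*} (not necessarily the one used in [9]); with it, f^{*}(g_{f}(v)) simplifies to \log(1+e^{v})-\ln 2. The networks S and G are placeholders.

```python
import math
import torch
import torch.nn.functional as F

LOG2 = math.log(2.0)

def g_f(v):
    # output activation mapping R onto I* = (-inf, ln 2)
    return LOG2 - F.softplus(-v)

def f_conj_g(v):
    # f*(g_f(v)) for the Jensen-Shannon f; the algebra gives softplus(v) - ln 2
    return F.softplus(v) - LOG2

def vdm_objective(S, G, x_real, z):
    # minibatch version of (3.17): generated-sample term minus real-data term
    return g_f(S(G(z))).mean() - f_conj_g(S(x_real)).mean()
```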

The following algorithm for ff-GAN VDM is almost a verbatim repeat of the Vanilla GAN Algorithm [4] stated earlier:

  VDM Algorithm Minibatch stochastic gradient descent training of generative adversarial nets. Here k1k\geq 1 and mm are hyperparameters.
 

for number of training iterations do
for k steps do

  • Sample minibatch of m samples \{{\mathbf{z}}_{1},\dots,{\mathbf{z}}_{m}\} in {\mathbb{R}}^{d} from the distribution \gamma.

  • Sample minibatch of m samples \{{\mathbf{x}}_{1},\dots,{\mathbf{x}}_{m}\}\subset{\mathcal{X}} from the training set {\mathcal{X}}.

  • Update S_{\omega} by ascending its stochastic gradient with respect to \omega:

    \nabla_{\omega}\,\frac{1}{m}\sum_{i=1}^{m}\Bigl[g_{f}(S_{\omega}(G_{\theta}({\mathbf{z}}_{i})))-f^{*}(g_{f}(S_{\omega}({\mathbf{x}}_{i})))\Bigr].

end for

  • Sample minibatch of m samples \{{\mathbf{z}}_{1},\dots,{\mathbf{z}}_{m}\} in {\mathbb{R}}^{d} from the distribution \gamma.

  • Update the generator G_{\theta} by descending its stochastic gradient with respect to \theta:

    \nabla_{\theta}\,\frac{1}{m}\sum_{i=1}^{m}g_{f}(S_{\omega}(G_{\theta}({\mathbf{z}}_{i}))).

end for

The gradient-based updates can use any standard gradient-based learning rule.
 

4. Examples of Well-Known GANs

Since their inception, GANs have become one of the hottest research topics in machine learning. Many specialized GANs tailor-made for particular applications have been developed. Modifications and improvements have been proposed to address some of the shortcomings of the vanilla GAN. Here we review some of the best known efforts in these directions.

4.1. Wasserstein GAN (WGAN)

Training a GAN can be difficult and frequently runs into one of several failure modes. This has been the subject of much discussion. Some of the best known failure modes are:

\cdot Vanishing Gradients:

This occurs quite often, especially when the discriminator is too good, which can stymie the improvement of the generator. With an optimal discriminator, the generator's gradients can vanish, so training provides too little information for the generator to improve.

\cdot Mode Collapse:

This refers to the phenomenon where the generator starts to produce the same output (or a small set of outputs) over and over again. If the discriminator gets stuck in a local minimum, then it is too easy for the next generator iteration to find the output that is most plausible to the current discriminator. Being stuck, the discriminator never manages to learn its way out of the trap. As a result the generator rotates through a small set of output types.

\cdot Failure to Converge:

GANs frequently fail to converge, due to a number of factors (known and unknown).

The WGAN [2] makes a simple modification: it replaces the Jensen-Shannon divergence loss function in the vanilla GAN with the Wasserstein distance, also known as the Earth Mover (EM) distance. Do not overlook the significance of this modification. It is one of the most important developments in the topic since the inception of GANs, as the use of the EM distance effectively addresses some glaring shortcomings of divergence-based GANs and mitigates the common failure modes in GAN training listed above.

Let \mu,\nu be two probability distributions on {\mathbb{R}}^{n} (or more generally on any metric space). Denote by \Pi(\mu,\nu) the set of all probability distributions \pi(x,y) on {\mathbb{R}}^{n}\times{\mathbb{R}}^{n} whose marginals are \mu(x) and \nu(y) respectively. Then the EM distance (Wasserstein-1 distance) between \mu and \nu is

W^{1}(\mu,\nu):=\min_{\pi\in\Pi(\mu,\nu)}\int_{{\mathbb{R}}^{n}\times{\mathbb{R}}^{n}}\|x-y\|\,d\pi(x,y)=\min_{\pi\in\Pi(\mu,\nu)}{\mathbb{E}}_{({\mathbf{x}},{\mathbf{y}})\sim\pi}[\|{\mathbf{x}}-{\mathbf{y}}\|].

Intuitively W1(μ,ν)W^{1}(\mu,\nu) is called the earth mover distance because it denotes the least amount of work one needs to do to move mass μ\mu to mass ν\nu.

In WGAN the objective is to minimize the loss function W^{1}(\mu,\nu), as opposed to the loss function D_{f}(\mu\|\nu) in f-GAN. The advantage of W^{1}(\mu,\nu) is illustrated in [2] through the following example. Let Z be the uniform distribution on (0,1) in {\mathbb{R}}. Let \mu be the distribution of (0,Z) and \nu_{\theta} be the distribution of (\theta,Z) on {\mathbb{R}}^{2}. Then \mu,\nu_{\theta} are singular distributions with disjoint supports if \theta\neq 0. It is easy to check that

D_{JS}(\mu\|\nu_{\theta})=\ln 2,\quad D_{KL}(\mu\|\nu_{\theta})=\infty,\quad\mbox{and}\quad W^{1}(\mu,\nu_{\theta})=|\theta|.

However, for small \theta>0, even though \mu and \nu_{\theta} have disjoint supports, visually they look very close. In fact, if a GAN can approximate \mu by \nu_{\theta} for a very small \theta we would be very happy with the result. But no matter how close \theta>0 is to 0 we will have D_{JS}(\mu\|\nu_{\theta})=\ln 2. If we train the vanilla GAN with the initial source distribution \nu=\nu_{\theta}, we would be stuck with a flat gradient, so it will not converge. More generally, let \mu,\nu be probability measures such that \mu\perp\nu. Then we always have D_{JS}(\mu\|\nu)=\ln 2. By Theorem 3.6 and (3.10),

\sup_{T}\Bigl({\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[f^{*}(T({\mathbf{x}}))]\Bigr)=\ln 2.

Thus gradient descent will fail to update \nu. In a more practical setting, if our target distribution \mu is a Gaussian mixture with well separated means, then starting with the standard Gaussian as the initial source distribution will likely miss those Gaussians in the mixture whose means are far away from 0, resulting in mode collapse and possibly failure to converge. Note that by the same Theorem 3.6 and (3.10), things wouldn’t improve by changing the convex function f(x) in the f-GAN. As we have seen from the examples, this pitfall can be avoided in WGAN.

The next question is how to evaluate W^{1}(\mu,\nu) using only samples from the distributions, since we do not have an explicit expression for the target distribution \mu. This is where the Kantorovich-Rubinstein duality [11] comes in, which states that

(4.1) W^{1}(\mu,\nu)=\sup_{T\in{\rm Lip}_{1}({\mathbb{R}}^{n})}\Bigl({\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[T({\mathbf{x}})]\Bigr),

where Lip1(n){\rm Lip}_{1}({\mathbb{R}}^{n}) denotes the set of all Lipschitz functions on n{\mathbb{R}}^{n} with Lipschitz constant 1. Here the critic T(x)T(x) serves the role of a discriminator. With the duality WGAN solves the minimax problem

(4.2) \min_{\nu}W^{1}(\mu,\nu)=\min_{\nu}\sup_{T\in{\rm Lip}_{1}({\mathbb{R}}^{n})}\Bigl({\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[T({\mathbf{x}})]\Bigr).

The rest follows the same script as vanilla GAN and ff-GAN. We write ν=γG1\nu=\gamma\circ G^{-1} where γ\gamma is a prior source distribution (usually the standard normal) in d{\mathbb{R}}^{d} and GG maps d{\mathbb{R}}^{d} to n{\mathbb{R}}^{n}. It follows that

𝔼𝐱ν[T(𝐱)]=𝔼𝐳γ[T(G(𝐳))].{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[T({\mathbf{x}})]={\mathbb{E}}_{{\mathbf{z}}\sim\gamma}[T(G({\mathbf{z}}))].

Finally we approximate T(x)T(x) and G(z)G(z) by neural networks with parameters ω\omega and θ\theta respectively, T(x)=Tω(x)T(x)=T_{\omega}(x) and G(z)=Gθ(z)G(z)=G_{\theta}(z). Stochastic gradient descent is used to train WGAN just like all other GANs. Indeed, replacing the JS-divergence with the EM distance W1W^{1}, the algorithm for the vanilla GAN can be copied verbatim to become the algorithm for WGAN.

Actually, almost verbatim. There is one last issue to be worked out, namely how does one enforce the condition Tω(x)Lip1T_{\omega}(x)\in{\rm Lip}_{1}? This is difficult. The authors of [2] proposed a technique called weight clipping, where the parameters (weights) ω\omega are artificially restricted to the region Ω:={ω0.01}\Omega:=\{\|\omega\|_{\infty}\leq 0.01\}. In other words, every entry of ω\omega is clipped so that it falls into the interval [0.01,0.01][-0.01,0.01]. Obviously this is not the same as restricting the Lipschitz constant to 1. However, since Ω\Omega is compact, so is the family {Tω:ωΩ}\{T_{\omega}:\omega\in\Omega\}, and hence the Lipschitz constants of these functions are uniformly bounded by some K>0K>0. The hope is that

supωΩ(𝔼𝐱μ[Tω(𝐱)]𝔼𝐱ν[Tω(𝐱)])supTLipK(n)(𝔼𝐱μ[T(𝐱)]𝔼𝐱ν[T(𝐱)]),\sup_{\omega\in\Omega}\Bigl{(}{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T_{\omega}({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[T_{\omega}({\mathbf{x}})]\Bigr{)}\approx\sup_{T\in{\rm Lip}_{K}({\mathbb{R}}^{n})}\Bigl{(}{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[T({\mathbf{x}})]-{\mathbb{E}}_{{\mathbf{x}}\sim\nu}[T({\mathbf{x}})]\Bigr{)},

where the latter is just KW1(μ,ν)K\cdot W^{1}(\mu,\nu).

Weight clipping may not be the best way to approximate the Lipschitz condition. Alternatives such as the gradient penalty of [5] can be more effective, and better approaches may yet be found.
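To make the training procedure concrete, here is a minimal sketch of one WGAN training iteration with weight clipping, written with PyTorch. The toy fully connected networks, the RMSprop learning rate, and the number of critic updates per generator update are illustrative assumptions, not the reference implementation of [2]; only the clipping threshold 0.01 comes from the discussion above.

    import torch
    import torch.nn as nn

    d, n = 10, 784                      # latent and data dimensions (assumed for this sketch)
    G = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, n))   # generator G_theta
    T = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, 1))   # critic T_omega
    opt_G = torch.optim.RMSprop(G.parameters(), lr=5e-5)
    opt_T = torch.optim.RMSprop(T.parameters(), lr=5e-5)
    clip = 0.01                         # weight-clipping threshold from the text
    n_critic = 5                        # critic updates per generator update (assumed)

    def wgan_step(real_batch):
        """One WGAN iteration: several critic updates followed by one generator update."""
        for _ in range(n_critic):
            z = torch.randn(real_batch.size(0), d)          # z ~ gamma (standard normal prior)
            # The critic maximizes E_mu[T(x)] - E_gamma[T(G(z))]; we minimize its negative.
            loss_T = -(T(real_batch).mean() - T(G(z).detach()).mean())
            opt_T.zero_grad()
            loss_T.backward()
            opt_T.step()
            # Weight clipping: force every critic parameter into [-clip, clip].
            with torch.no_grad():
                for p in T.parameters():
                    p.clamp_(-clip, clip)
        # The generator minimizes -E_gamma[T(G(z))].
        z = torch.randn(real_batch.size(0), d)
        loss_G = -T(G(z)).mean()
        opt_G.zero_grad()
        loss_G.backward()
        opt_G.step()

    wgan_step(torch.rand(64, n))        # random batch standing in for real training data

The critic is updated several times per generator update, a common practice since the dual formulation (4.2) requires the inner supremum to be approximated reasonably well before each generator step.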

4.2. Deep Convolutional GAN (DCGAN)

DCGAN refers to a set of architectural guidelines for GANs, developed in [10]. Empirically these guidelines help GANs attain more stable training and better performance. According to the paper, the guidelines are:

  • Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator).

  • Use batch normalization in both the generator and the discriminator.

  • Remove fully connected hidden layers for deeper architectures.

  • Use ReLU activation in generator for all layers except for the output, which uses Tanh.

  • Use LeakyReLU activation in the discriminator for all layers.

Here strided convolution refers to shifting the convolution window by more than 1 unit, which amounts to downsampling. Fractional-strided convolution refers to shifting the convolution window by a fractional unit, say 1/21/2 of a unit, which is often used for upsampling. This obviously cannot be done in the literal sense. To realize it we pad the input with zeros and then take the appropriate strided convolution. For example, suppose our input data is X=[x1,x2,,xN]X=[x_{1},x_{2},\dots,x_{N}] and the convolution window is w=[w1,w2,w3]w=[w_{1},w_{2},w_{3}]. For a 1/21/2 strided convolution we would first pad XX to become X~=[x1,0,x2,0,,0,xN]\tilde{X}=[x_{1},0,x_{2},0,\dots,0,x_{N}] and then execute X~w\tilde{X}*w. Nowadays, strided and fractional-strided convolutions are just two of several ways to perform down- and up-sampling.
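As a small illustration of the zero-insertion view of a 1/21/2-strided convolution, the following numpy sketch (with arbitrary array values) pads the input with zeros between samples and then applies an ordinary convolution; deep learning frameworks implement the same idea under the name “transposed convolution”.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])      # input X = [x1, ..., xN]
    w = np.array([0.25, 0.5, 0.25])         # convolution window w = [w1, w2, w3]

    # Zero insertion: X_tilde = [x1, 0, x2, 0, ..., 0, xN]
    x_tilde = np.zeros(2 * len(x) - 1)
    x_tilde[::2] = x

    # A 1/2-strided ("fractional") convolution is then an ordinary convolution of the
    # zero-inserted signal; the output is roughly twice as long as the input (upsampling).
    y = np.convolve(x_tilde, w, mode="same")
    print(y)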

4.3. Progressive Growing of GANs (PGGAN)

Generating high resolution images from GANs is a very challenging problem. Progressive Growing of GANs developed in [6] is a technique that addresses this challenge.

PGGAN actually refers to a training methodology for GANs. The key idea is to grow both the generator and the discriminator progressively: starting from low resolution images, one adds new layers that model increasingly fine details as training progresses. Since low resolution images are much easier to model, the training is very stable in the beginning. Once training at a lower resolution is done, the model gradually transitions to training at a higher resolution. This process continues until the desired resolution is reached. In the paper [6], very high quality facial images were generated by starting the training at 4×44\times 4 resolution and gradually increasing the resolution to 8×88\times 8, 16×1616\times 16, etc., until it reached 1024×10241024\times 1024.

It would be interesting to provide a mathematical foundation for PGGAN. From the earlier analysis we know there are pitfalls with GANs when the target distribution is singular, especially if it and the initial source distribution have disjoint supports. PGGAN may provide an effective way to mitigate this problem.

4.4. Cycle-Consistent Adversarial Networks (Cycle-GAN)

Cycle-GAN is an image-to-image translation technique developed in [12]. Before this work the goal of image-to-image translation was to learn the mapping between an input image and an output image using a training set of aligned image pairs. However, we often want to do a similar task without the benefit of a training set of aligned image pairs. For example, while we have many paintings by Claude Monet, we don’t have photographs of the scenes he painted, and may wonder what those scenes would look like in a photo. Conversely, we may wonder what a Monet painting of Mt. Everest would look like even though Claude Monet never went to Mt. Everest. One way to achieve such tasks is through neural style transfer [3], but this technique only transfers the style of a single painting to a target image. Cycle-GAN offers a different approach that allows for style transfer (translation) more broadly.

In a Cycle-GAN, we start with two training sets 𝒳{\mathcal{X}} and 𝒴{\mathcal{Y}}. For example, 𝒳{\mathcal{X}} could be a corpus of Monet scenery paintings and 𝒴{\mathcal{Y}} could be a set of landscape photographs. The training objective of a Cycle-GAN is to transfer styles from 𝒳{\mathcal{X}} to 𝒴{\mathcal{Y}} and vice versa in a “cycle-consistent” way, as described below.

Precisely speaking, a Cycle-GAN consists of three components, each given by a tailored loss function. The first component is a GAN (vanilla GAN, but can also be any other GAN) that tries to generate the distribution of 𝒳{\mathcal{X}}, with one notable deviation: instead of sampling initially from a random source distribution γ\gamma such as a standard normal distribution, Cycle-GAN samples from the training data set 𝒴{\mathcal{Y}}. Assume that 𝒳{\mathcal{X}} are samples drawn from the distribution μ\mu and 𝒴{\mathcal{Y}} are samples drawn from the distribution ν0\nu_{0}. The loss function for this component is

(4.3) 𝐋gan1(G1,Dμ):=𝔼𝐱μ[log(Dμ(𝐱))]+𝔼𝐲ν0[log(1Dμ(G1(𝐲)))]{\mathbf{L}}_{\rm gan1}(G_{1},D_{\mu}):={\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\log(D_{\mu}({\mathbf{x}}))]+{\mathbb{E}}_{{\mathbf{y}}\sim\nu_{0}}[\log(1-D_{\mu}(G_{1}({\mathbf{y}})))]

where DμD_{\mu} is the discriminator network and G1G_{1} is the generator network. Clearly this is the same loss used by the vanilla GAN, except the source distribution γ\gamma is replaced by the distribution ν0\nu_{0} of 𝒴{\mathcal{Y}}. The second component is the mirror of the first component, namely it is a GAN to learn the distribution ν0\nu_{0} of 𝒴{\mathcal{Y}} with the initial source distribution set as the distribution μ\mu of 𝒳{\mathcal{X}}. The corresponding loss function is thus

(4.4) 𝐋gan2(G2,Dν0):=𝔼𝐲ν0[log(Dν0(𝐲))]+𝔼𝐱μ[log(1Dν0(G2(𝐱)))]{\mathbf{L}}_{\rm gan2}(G_{2},D_{\nu_{0}}):={\mathbb{E}}_{{\mathbf{y}}\sim\nu_{0}}[\log(D_{\nu_{0}}({\mathbf{y}}))]+{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\log(1-D_{\nu_{0}}(G_{2}({\mathbf{x}})))]

where Dν0D_{\nu_{0}} is the discriminator network and G2G_{2} is the generator network. The third component of the Cycle-GAN is the “cycle-consistent” loss function given by

(4.5) 𝐋cycle(G1,G2):=𝔼𝐲ν0[G2(G1(𝐲))𝐲1]+𝔼𝐱μ[G1(G2(𝐱))𝐱1].{\mathbf{L}}_{\rm cycle}(G_{1},G_{2}):={\mathbb{E}}_{{\mathbf{y}}\sim\nu_{0}}[\|G_{2}(G_{1}({\mathbf{y}}))-{\mathbf{y}}\|_{1}]+{\mathbb{E}}_{{\mathbf{x}}\sim\mu}[\|G_{1}(G_{2}({\mathbf{x}}))-{\mathbf{x}}\|_{1}].

The overall loss function is

(4.6) 𝐋(G1,G2,Dμ,Dν0):=𝐋gan1(G1,Dμ)+𝐋gan2(G2,Dν0)+λ𝐋cycle(G1,G2),{\mathbf{L}}^{*}(G_{1},G_{2},D_{\mu},D_{\nu_{0}}):={\mathbf{L}}_{\rm gan1}(G_{1},D_{\mu})+{\mathbf{L}}_{\rm gan2}(G_{2},D_{\nu_{0}})+\lambda{\mathbf{L}}_{\rm cycle}(G_{1},G_{2}),

where λ>0\lambda>0 is a parameter. Intuitively, G1G_{1} translates a sample 𝐲{\mathbf{y}} from the distribution ν0\nu_{0} into a sample from μ\mu, while G2G_{2} translates a sample 𝐱{\mathbf{x}} from the distribution μ\mu into a sample from ν0\nu_{0}. The loss function 𝐋cycle{\mathbf{L}}_{\rm cycle} encourages “consistency” in the sense that G2(G1(𝐲))G_{2}(G_{1}({\mathbf{y}})) is not too far from 𝐲{\mathbf{y}} and G1(G2(𝐱))G_{1}(G_{2}({\mathbf{x}})) is not too far from 𝐱{\mathbf{x}}. Finally, Cycle-GAN is trained by solving the minimax problem

(4.7) minG1,G2maxDμ,Dν0𝐋(G1,G2,Dμ,Dν0).\min_{G_{1},G_{2}}\,\max_{D_{\mu},D_{\nu_{0}}}\,{\mathbf{L}}^{*}(G_{1},G_{2},D_{\mu},D_{\nu_{0}}).
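As a rough sketch of how the three loss components fit together, the following PyTorch snippet assembles the total loss (4.6) for a batch from each domain. The toy fully connected networks, the flattened image dimension, and the weight lam are illustrative placeholders, not the architecture of [12].

    import torch
    import torch.nn as nn

    n = 784  # flattened image dimension (illustrative)

    # Generators G1: Y -> X, G2: X -> Y, and one discriminator per domain (toy architectures).
    G1 = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, n))
    G2 = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, n))
    D_mu = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
    D_nu = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())

    def cyclegan_loss(x, y, lam=10.0):
        """Total Cycle-GAN objective (4.6): two GAN losses plus the cycle-consistency term."""
        eps = 1e-8  # numerical safeguard inside the logs
        # L_gan1 (4.3): G1 maps y into the X domain, D_mu tries to tell real from translated.
        l_gan1 = torch.log(D_mu(x) + eps).mean() + torch.log(1 - D_mu(G1(y)) + eps).mean()
        # L_gan2 (4.4): G2 maps x into the Y domain, D_nu tries to tell real from translated.
        l_gan2 = torch.log(D_nu(y) + eps).mean() + torch.log(1 - D_nu(G2(x)) + eps).mean()
        # L_cycle (4.5): translating forth and back should roughly return the input (L1 norm).
        l_cycle = (G2(G1(y)) - y).abs().sum(dim=1).mean() + (G1(G2(x)) - x).abs().sum(dim=1).mean()
        return l_gan1 + l_gan2 + lam * l_cycle

    x = torch.rand(8, n)  # batch from the X domain (e.g. Monet paintings), random stand-in
    y = torch.rand(8, n)  # batch from the Y domain (e.g. photographs), random stand-in
    print(cyclegan_loss(x, y).item())

In training, the discriminators take gradient ascent steps on this quantity while the generators take descent steps, mirroring the minimax problem (4.7).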

5. Alternative to GAN: Variational Autoencoder (VAE)

The Variational Autoencoder (VAE) is an alternative generative model to the GAN. To understand VAEs one needs to first understand what an autoencoder is. In a nutshell, an autoencoder consists of an encoder neural network FF that maps a dataset 𝒳n{\mathcal{X}}\subset{\mathbb{R}}^{n} to d{\mathbb{R}}^{d}, where dd is typically much smaller than nn, together with a decoder neural network HH that “decodes” elements of d{\mathbb{R}}^{d} back to n{\mathbb{R}}^{n}. In other words, it encodes the nn-dimensional features of the dataset into dd-dimensional latents, along with a way to convert the latents back to the features. Autoencoders can be viewed as data compressors that compress a higher dimensional dataset to a much lower dimensional one without losing too much information. For example, the MNIST dataset consists of images of size 28×2828\times 28, which lie in 784{\mathbb{R}}^{784}. An autoencoder can easily compress it to a data set in 10{\mathbb{R}}^{10} using only 10 latents without losing much information. A typical autoencoder has a “bottleneck” architecture, which is shown in Figure 1.

Figure 1. An autoencoder (courtesy of Jason Anderson and CompThree Inc.)

The loss function for training an autoencoder is typically the mean square error (MSE)

𝐋AE:=1N𝐱𝒳H(F(𝐱))𝐱2{\mathbf{L}}_{\rm AE}:=\frac{1}{N}\sum_{{\mathbf{x}}\in{\mathcal{X}}}\|H(F({\mathbf{x}}))-{\mathbf{x}}\|^{2}

where NN is the size of 𝒳{\mathcal{X}}. For binary data we may use the binary cross entropy (BCE) loss function instead. Furthermore, we may also add a regularization term to encourage desirable properties, e.g. a LASSO style L1L^{1} penalty to promote sparsity.
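As a concrete illustration, a bottleneck autoencoder for MNIST-sized inputs and its MSE loss might look as follows in PyTorch; the layer sizes are arbitrary choices made for this sketch.

    import torch
    import torch.nn as nn

    n, d = 784, 10   # feature dimension (28x28 images) and latent dimension

    # Encoder F: R^n -> R^d and decoder H: R^d -> R^n, with a bottleneck in the middle.
    F = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, d))
    H = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, n))

    def ae_loss(batch):
        """Mean square reconstruction error (1/N) sum ||H(F(x)) - x||^2 over the batch."""
        recon = H(F(batch))
        return ((recon - batch) ** 2).sum(dim=1).mean()

    x = torch.rand(32, n)   # random batch standing in for training data
    loss = ae_loss(x)
    loss.backward()         # gradients flow to both encoder and decoder parameters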

First developed in [7], VAEs are also neural networks with architectures similar to those of autoencoders, but with stochasticity added into the networks. Autoencoders are deterministic networks in the sense that the output is completely determined by the input. To make generative models out of autoencoders we will need to add randomness to the latents. In an autoencoder, input data 𝐱{\mathbf{x}} are encoded to the latents 𝐳=F(𝐱)d{\mathbf{z}}=F({\mathbf{x}})\in{\mathbb{R}}^{d}, which are then decoded to 𝐱^=H(𝐳)=H(F(𝐱))\hat{\mathbf{x}}=H({\mathbf{z}})=H(F({\mathbf{x}})). A VAE deviates from an autoencoder in the following sense: the input 𝐱{\mathbf{x}} is encoded into a diagonal Gaussian random variable 𝝃=𝝃(𝐱){\boldsymbol{\xi}}={\boldsymbol{\xi}}({\mathbf{x}}) in d{\mathbb{R}}^{d} with mean 𝝁(𝐱){\boldsymbol{\mu}}({\mathbf{x}}) and variance 𝝈2(𝐱){\boldsymbol{\sigma}}^{2}({\mathbf{x}}). Here 𝝈2(𝐱)=[σ12(𝐱),,σd2(𝐱)]Td{\boldsymbol{\sigma}}^{2}({\mathbf{x}})=[\sigma_{1}^{2}({\mathbf{x}}),\dots,\sigma_{d}^{2}({\mathbf{x}})]^{T}\in{\mathbb{R}}^{d} and the covariance matrix is actually diag(𝝈2){\rm diag}({\boldsymbol{\sigma}}^{2}). Another way to look at this setup is that instead of having just one encoder FF as in an autoencoder, a VAE has two encoders 𝝁{\boldsymbol{\mu}} and 𝝈2{\boldsymbol{\sigma}}^{2}, for the mean and the variance of the latent variable 𝝃{\boldsymbol{\xi}} respectively. As in an autoencoder, it also has a decoder HH. With this randomness in place we now have a generative model.

Of course we will need some constraints on 𝝁(𝐱){\boldsymbol{\mu}}({\mathbf{x}}) and 𝝈2(𝐱){\boldsymbol{\sigma}}^{2}({\mathbf{x}}). Here VAEs employ the following heuristics for training:

  • The decoder HH decodes the latent random variables 𝝃(𝐱){\boldsymbol{\xi}}({\mathbf{x}}) to 𝐱^\hat{\mathbf{x}} that are close to 𝐱{\mathbf{x}}.

  • The random variable X=𝝃(𝐱)X={\boldsymbol{\xi}}({\mathbf{x}}) with 𝐱{\mathbf{x}} sampled uniformly from 𝒳{\mathcal{X}} is close to the standard normal distribution N(0,Id)N(0,I_{d}) in d{\mathbb{R}}^{d}.

The heuristics will be realized through a loss function consisting of two components. The first component is simply the mean square error between 𝐱^\hat{\mathbf{x}} and 𝐱{\mathbf{x}} given by

(5.1) 𝐋1(𝝁,𝝈,H)\displaystyle{\mathbf{L}}_{1}({\boldsymbol{\mu}},{\boldsymbol{\sigma}},H) =1N𝐱𝒳𝔼𝐳N(𝝁(𝐱),diag(𝝈2(𝐱)))[H(𝐳)𝐱2]\displaystyle=\frac{1}{N}\,\sum_{{\mathbf{x}}\in{\mathcal{X}}}{\mathbb{E}}_{{\mathbf{z}}\sim N({\boldsymbol{\mu}}({\mathbf{x}}),{\rm diag}({\boldsymbol{\sigma}}^{2}({\mathbf{x}})))}[\|H({\mathbf{z}})-{\mathbf{x}}\|^{2}]
(5.2) =1N𝐱𝒳𝔼𝐳N(0,Id)[H(𝝁(𝐱)+𝝈(𝐱)𝐳)𝐱2]\displaystyle=\frac{1}{N}\,\sum_{{\mathbf{x}}\in{\mathcal{X}}}{\mathbb{E}}_{{\mathbf{z}}\sim N(0,I_{d})}[\|H({\boldsymbol{\mu}}({\mathbf{x}})+{\boldsymbol{\sigma}}({\mathbf{x}})\odot{\mathbf{z}})-{\mathbf{x}}\|^{2}]

where \odot denotes the entry-wise product. Going from (5.1) to (5.2) uses a very useful technique called re-parametrization. The second component of the loss function pushes the latent random variables towards N(0,Id)N(0,I_{d}); in practice one penalizes the KL-divergence (or another ff-divergence) between each 𝝃(𝐱){\boldsymbol{\xi}}({\mathbf{x}}) and N(0,Id)N(0,I_{d}), averaged over the training set. For two Gaussian distributions the KL-divergence has an explicit expression, which gives

(5.3) 𝐋2(𝝁,𝝈)=1N𝐱𝒳DKL(𝝃(𝐱)N(0,Id))=12N𝐱𝒳i=1d(μi2(𝐱)+σi2(𝐱)1ln(σi2(𝐱))).{\mathbf{L}}_{2}({\boldsymbol{\mu}},{\boldsymbol{\sigma}})=\frac{1}{N}\,\sum_{{\mathbf{x}}\in{\mathcal{X}}}D_{KL}({\boldsymbol{\xi}}({\mathbf{x}})\|N(0,I_{d}))=\frac{1}{2N}\,\sum_{{\mathbf{x}}\in{\mathcal{X}}}\sum_{i=1}^{d}\Bigl{(}\mu_{i}^{2}({\mathbf{x}})+\sigma_{i}^{2}({\mathbf{x}})-1-\ln(\sigma_{i}^{2}({\mathbf{x}}))\Bigr{)}.

The loss function for a VAE is thus

(5.4) 𝐋VAE=𝐋1(𝝁,𝝈,H)+λ𝐋2(𝝁,𝝈){\mathbf{L}}_{\rm VAE}={\mathbf{L}}_{1}({\boldsymbol{\mu}},{\boldsymbol{\sigma}},H)+\lambda\,{\mathbf{L}}_{2}({\boldsymbol{\mu}},{\boldsymbol{\sigma}})

where λ>0\lambda>0 is a parameter. To generate new data from a trained VAE, one simply feeds random samples 𝐳N(0,Id){\mathbf{z}}\sim N(0,I_{d}) into the decoder network HH. A typical VAE architecture is shown in Figure 2.

Figure 2. A variational autoencoder (courtesy of Jason Anderson and CompThree Inc.)
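To connect the formulas above to an implementation, here is a minimal PyTorch sketch of the VAE loss (5.4) using the re-parametrization trick (5.2). The encoder and decoder architectures are arbitrary choices for this sketch, and the encoder outputs the log-variance rather than 𝝈2{\boldsymbol{\sigma}}^{2} directly, a common numerical convenience that is not part of the formulation above.

    import torch
    import torch.nn as nn

    n, d = 784, 10
    enc_mu = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, d))       # mu(x)
    enc_logvar = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, d))   # log sigma^2(x)
    H = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, n))            # decoder

    def vae_loss(x, lam=1.0):
        mu, logvar = enc_mu(x), enc_logvar(x)
        # Re-parametrization (5.2): z = mu + sigma ⊙ eps with eps ~ N(0, I_d).
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        # Reconstruction term L1 (5.1): mean square error between H(z) and x.
        recon = ((H(z) - x) ** 2).sum(dim=1).mean()
        # KL term L2 (5.3): closed form for diagonal Gaussians against N(0, I_d).
        kl = 0.5 * (mu ** 2 + logvar.exp() - 1.0 - logvar).sum(dim=1).mean()
        return recon + lam * kl

    x = torch.rand(32, n)        # random batch standing in for training data
    vae_loss(x).backward()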

This short introduction has only touched on the mathematical foundation of VAEs. There has been a wealth of research on improving the performance of VAEs and on their applications. We encourage interested readers to study the subject further through the recent literature.

References

  • [1] S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society: Series B (Methodological), 28(1):131–142, 1966.
  • [2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
  • [3] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2414–2423, 2016.
  • [4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
  • [5] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In Advances in neural information processing systems, pages 5767–5777, 2017.
  • [6] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
  • [7] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [8] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
  • [9] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in neural information processing systems, pages 271–279, 2016.
  • [10] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • [11] C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
  • [12] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017.