Prb-GAN: A Probabilistic Framework for GAN Modelling

Blessen George, Indian Institute of Technology Kanpur, Kanpur, India
Vinod K. Kurmi, Indian Institute of Technology Kanpur, Kanpur, India
Vinay P. Namboodiri, Indian Institute of Technology Kanpur, Kanpur, India

Abstract

Generative adversarial networks (GANs) are widely used to generate realistic images, but they often suffer from training instability and mode loss. Solving the mode loss problem is critical to attaining greater diversity in GAN-synthesized data. Our work explores probabilistic approaches to GAN modelling that could allow us to tackle these issues. We present Prb-GANs, a new variation that uses dropout to create a distribution over the network parameters, with the posterior learnt using variational inference. We describe theoretically, and validate experimentally on simple and complex datasets, the benefits of such an approach. We look into further improvements using the concept of uncertainty measures. Through a set of further modifications to the loss functions of each network of the GAN, we obtain results that show improved GAN performance. Our methods are extremely simple and require very little modification to existing GAN architectures.

I Introduction

Generative modelling is a branch of machine learning research that aims at learning the underlying probability distribution of a data source given a few samples from it. Generative Adversarial Networks (GANs) [1] are one such class of generative models that, especially in the context of computer vision, are capable of producing highly realistic data. The model is based on a game theoretic training process that uses two neural networks termed the discriminator and the generator. This scheme, however, is often susceptible to training instability along with other pathological problems like vanishing gradients, mode dropping and hyperparameter initialization sensitivity.

Since GANs are a popular class of generative model, solving mode loss and training instability is critical and forms an active area of GAN research. Past works have tried several approaches to these problems. Methods based on multiple networks, such as [2], [3], [4], are effective at tackling these issues. The underlying idea behind all these methods is that, with multiple networks, the generator can capture the data distribution more efficiently. These models are trained with a deterministic loss function obtained from the discriminator. A probabilistic loss-based approach could also be one of the solutions to the problem of mode collapse. Probabilistic models incorporate variance estimation and can thus provide better gradients to update the generator model. The main issue with probabilistic methods is that they are intractable. However, Gal et al. [5] showed that applying dropout yields a Bayesian formulation of the network. DropoutGAN [4] is one such variation of the GAN, but it mainly involves using multiple discriminators and a dropout branch that chooses a subset of the discriminators to train the generator against in each iteration. Our work, on the other hand, applies dropout to the parameters of the discriminator network itself in order to simulate a probabilistic discriminator, so its spirit is very different from that of DropoutGAN [4]. In all of the multi-network works, the question remains as to exactly how many networks are needed, and this must generally be estimated by trial and error. In contrast, we use just a single discriminator and a single generator network; thus, our memory footprint is much smaller.

Another important aspect of the Bayesian formulation of GANs is that it allows uncertainty estimation. In the proposed framework, we also use uncertainty estimates in the GAN loss. We introduce two uncertainty-based measures, the Weighted Discriminator Score and the Discriminator Score Set Variance, and use them to obtain new GAN losses. In this paper, we work with the predictive uncertainty [6] of the deep learning model, obtained using dropout.

We have drawn our inspiration from BayesianGAN [7] and concur that thinking in a probabilistic manner would yield fruits in improving GAN performance. While the direction of our work runs similar to [7], our implementation is extremely simple and requires minimal extra machinery.

Our contributions are as follows.

• We propose a variation, which we call Prb-GAN (Probabilistic GAN), that uses the popular technique of Dropout [8] to convert standard GANs into probabilistic GANs. Inference of the parameter distribution is done using variational inference.
• We also present the theoretical backing for our work and show empirically why such a formulation can help solve the mode loss issue.
• Additionally, we present new loss functions, based on uncertainty measure estimates, that further improve GAN performance metrics.

II Related Work

In the deep learning literature, many GAN architectures and models have been proposed. One line of work changes the objective itself, as proposed by Least squares GANs (LSGANs) [9] and Wasserstein GANs [10, 11, 12, 13]. The assertion is that the standard objective function minimizes the JS-divergence, which is unstable, particularly when the supports of the two distributions do not intersect. By minimizing the smoother Wasserstein distance, one is able to significantly curb the instability. LSGANs are similar in approach in that they minimize the Pearson divergence. MADGAN [14] is another recent approach, in which the authors advocate the use of multiple generators, each tasked with capturing a separate mode of the data. Bayesian GANs [7] are the only GAN variation that approaches the problem in a Bayesian manner. The key idea is to maintain a distribution over the parameters of the networks and to infer the right posterior distribution given the observed samples from the dataset. Another probabilistic version of GAN is P-GAN [15], where instead of drawing a boundary between real and fake samples, the discriminator fits a Gaussian on the embeddings of the projected real images. f-GAN [16] shows that GANs can be trained using an f-divergence. LS-GAN [17] trains the generator to produce realistic samples by minimizing the designated margins between real and generated samples. MRGAN [18] proposes a metric regularization to penalize missing modes. SN-GAN [19] proposes spectral normalization, a weight normalization technique, to stabilize the training of the discriminator.

Apart from the different loss variants, many architecture-based GAN models have been proposed. BGAN [20], PROGAN [21], BigGAN [22], StarGAN [23] and many others are popular for generating realistic images. Many other variants and applications of GANs in different areas [24, 25, 26, 27, 28, 29, 30, 31, 32] have recently been presented in the literature. A recent probabilistic version of GANs is ProbGAN [33], where distributions over the generator and the discriminator are learned. IDDA [34] uses a class-based discriminator for learning the source distribution. In PCGAN [35], the authors use a conditional generative framework along with pairwise comparisons using uncertainty estimation.

In addition to the GAN literature, probabilistic and Bayesian deep learning models are also popular. Gal et al. [5] propose that using dropout [8] in neural networks yields a Bayesian neural network. By placing a prior distribution over the weights of the neural network and training with dropout in its layers, one can infer the distribution of the weights. The technique offers a clean way of creating probabilistic neural networks. We use this idea, together with the assertion that GANs could benefit from maintaining a distribution over the network parameters, to try to solve the mode collapse and stability issues prevalent in GANs. Aleatoric and epistemic uncertainty estimation for deep learning models is presented in [6, 36]. Other uncertainty-based methods are discussed in [37, 38, 39].

III Probabilistic GANs

III-A Standard GANs

Recall that the original GAN formulation features a generator and a discriminator with a single parameter setting $\theta_{g}\text{ and }\theta_{d}$ respectively rather than a parameter distribution. The optimal parameter point estimate is found by solving

	$\displaystyle\theta_{g}^{*}=\operatorname*{arg\,max}_{\theta_{g}}\enskip\log D(G(z))$		(1)
	$\displaystyle\theta_{d}^{*}=\operatorname*{arg\,max}_{\theta_{d}}\enskip[\log D(x)+\log[1-D(G(z))]]$		(2)

Here $z$ is the latent noise vector. We show in the next sections the simple changes required to convert the standard GAN into a probabilistic GAN.

III-B Probabilistic Generators and Discriminators

Probabilistic Generators: Assume a Gaussian prior distribution over the generator weights $\theta_{g}$, where $n_{\theta_{g}}$ is the number of generator parameters,

	$\displaystyle p(\theta_{g})\sim\mathcal{N}(0,I_{n_{\theta_{g}}})$		(3)

Let the probability that the data point $G(z;\theta_{g})$ produced by the generator is marked real by the discriminator be given by

	$\displaystyle p(y_{g}=1|\theta_{g},z)=D(G(z;\theta_{g});\theta_{d})$		(4)

Here $\theta_{d}$ are the parameters of discriminator network. Then we seek to find the posterior distribution of the weights $p(\theta_{g}|y_{g}=1,z)$ assuming that the generated point was marked real. Direct computation is infeasible and we infer using variational inference.

We set up our variational distribution for the generator weights using Bernoulli dropout as

	$\displaystyle q(\theta_{g};W_{g})=\{\theta_{g}^{i}\}=\{W_{g}^{i}\cdot\text{diag}([b_{i,j}]_{j=1}^{K_{i}})\}$		(5)
	$\displaystyle b_{i,j}\sim\text{Bernoulli}(p_{i});\enskip i\in[L],\ j\in[K_{i}]$		(6)

Here $W_{g}$ serves as the variational parameter, $K_{i}$ is the number of neurons in the $i^{th}$ layer, $L$ is the number of layers and $p$ is the dropout probability. We seek out the optimal variational parameters for the generator by maximizing the evidence lower bound (ELBO) function associated with the above. Thus

	$\displaystyle\begin{split}W_{g}^{*}=&\operatorname*{arg\,max}_{W_{g}}\enskip E_{q(\theta_{g};W_{g})}\enskip[\log p(y_{g}=1|\theta_{g},z)]\\ &\quad-KL(q(\theta_{g};W_{g})||p(\theta_{g}))\end{split}$		(7)

The first term is an expectation and can be approximated using Monte Carlo (MC) integration [40] with $M_{g}$ samples, and the second term corresponds to an $L_{2}$-norm regularization term that forces the parameters to stay close to 0. Thus the loss function we would like to minimize becomes

	$\displaystyle W_{g}^{*}=\operatorname*{arg\,min}_{W_{g}}\left\lVert W_{g}\right\rVert^{2}_{2}-\dfrac{1}{M_{g}}\sum_{m_{g}=1}^{M_{g}}\log D(G(z;\theta_{g}^{(m_{g})}))$		(8)
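The dropout sampling of Eqs. 5-6 and the Monte Carlo loss of Eq. 8 can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: `toy_generator` and `toy_discriminator` are hypothetical one-layer stand-ins for the real networks, and a mask value of 1 is read as "keep the unit" with probability `keep_prob` (whether $p$ in Eq. 6 denotes the keep or the drop probability is not fixed here).

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def sample_dropout_weights(W, keep_prob, rng):
    """Eq. 5-6 sketch: theta = W . diag(b) with b_j ~ Bernoulli(keep_prob)."""
    b = [1.0 if rng.random() < keep_prob else 0.0 for _ in W[0]]
    # Right-multiplying by diag(b) scales column j of W by b_j.
    return [[w * b_j for w, b_j in zip(row, b)] for row in W]


def toy_generator(z, theta):
    # Hypothetical one-layer linear generator: x = theta @ z.
    return [sum(w * z_j for w, z_j in zip(row, z)) for row in theta]


def toy_discriminator(x, v):
    # Hypothetical fixed logistic discriminator standing in for D(.; theta_d).
    return sigmoid(sum(v_j * x_j for v_j, x_j in zip(v, x)))


def mc_generator_loss(W_g, v_d, z, keep_prob, M_g, rng):
    """Eq. 8 sketch: ||W_g||_2^2 - (1/M_g) * sum_m log D(G(z; theta_g^(m)))."""
    l2 = sum(w * w for row in W_g for w in row)
    log_d = 0.0
    for _ in range(M_g):
        theta = sample_dropout_weights(W_g, keep_prob, rng)  # one MC sample
        log_d += math.log(toy_discriminator(toy_generator(z, theta), v_d))
    return l2 - log_d / M_g


rng = random.Random(0)
W_g = [[0.5, -0.5], [1.0, 0.0]]
# keep_prob=0.6 corresponds to a drop probability of 0.4 (an assumed reading).
print(mc_generator_loss(W_g, v_d=[1.0, 1.0], z=[1.0, 2.0],
                        keep_prob=0.6, M_g=20, rng=rng))
```

With `keep_prob=1.0` every sampled generator equals $W_{g}$, so the estimate collapses to the deterministic loss; lower keep probabilities spread the $M_{g}$ samples over different sub-networks.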

Probabilistic Discriminators: The setup for probabilistic discriminators is nearly the same. We place a Gaussian prior over the discriminator parameters as well. The likelihood term, however, is slightly different. The probability that the discriminator correctly marks the real points as real and the generated points as fake is

	$\displaystyle p(y_{d}=[1,0]|\theta_{d},x,z)=D(x;\theta_{d})\cdot(1-D(G(z);\theta_{d}))$		(9)

Then, the posterior distribution of the discriminator weights is sought using a similar Bernoulli dropout based variational distribution $q(\theta_{d};W_{d})$. The loss function that follows is given by

	$\displaystyle\begin{split}W_{d}^{*}=\operatorname*{arg\,min}_{W_{d}}\left\lVert W_{d}\right\rVert^{2}_{2}-&\dfrac{1}{M_{d}}\sum_{m_{d}=1}^{M_{d}}\enskip[\log D(x;\theta_{d}^{(m_{d})})\\ &+\log(1-D(G(z);\theta_{d}^{(m_{d})}))]\end{split}$		(10)

Here $M_{d}$ is the number of sampled discriminators.
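The discriminator loss of Eq. 10 can be sketched in the same way. The logistic `toy_discriminator` below is a hypothetical stand-in, the fake sample is passed in directly instead of being produced by a generator, and the mask is applied elementwise to a weight vector; all of these are simplifying assumptions for illustration.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def sample_dropout_vector(w, keep_prob, rng):
    # Elementwise Bernoulli mask on a weight vector (mask 1 means "keep").
    return [w_j if rng.random() < keep_prob else 0.0 for w_j in w]


def toy_discriminator(x, theta):
    # Hypothetical logistic discriminator D(x; theta_d).
    return sigmoid(sum(t_j * x_j for t_j, x_j in zip(theta, x)))


def mc_discriminator_loss(W_d, x_real, x_fake, keep_prob, M_d, rng):
    """Eq. 10 sketch: ||W_d||_2^2 - (1/M_d) sum_m [log D(x) + log(1 - D(G(z)))]."""
    l2 = sum(w * w for w in W_d)
    log_lik = 0.0
    for _ in range(M_d):
        theta = sample_dropout_vector(W_d, keep_prob, rng)  # one MC sample
        log_lik += math.log(toy_discriminator(x_real, theta))
        log_lik += math.log(1.0 - toy_discriminator(x_fake, theta))
    return l2 - log_lik / M_d


rng = random.Random(0)
print(mc_discriminator_loss([1.0, -1.0], x_real=[2.0, 0.0], x_fake=[0.0, 2.0],
                            keep_prob=0.6, M_d=20, rng=rng))
```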

Figure 1: During a training step of the probabilistic discriminator, $N$ discriminator instances and a single random generator are sampled. The losses are averaged and back-propagated to update the discriminator variational parameter.

Training scheme: In practice, the updated loss function for the generator is equivalent to sampling the probabilistic generator $M_{g}$ times and back-propagating the averaged loss in a single generator training step, and similarly sampling the discriminator $M_{d}$ times and averaging the loss during a single discriminator training step. For each discriminator training step, a single random generator is sampled, and vice versa for a generator training step. Generally, we set the number of MC samples for the generator and discriminator to be equal, $M_{d}=M_{g}=N$. Figures 1 and 2 describe how the training steps proceed in each case. For a generator training step, while the variational parameters $\theta_{g}^{(l)}$ are the same for all generators, each unique sampling of the dropout layers ${\beta^{\prime}}_{n}^{(l)}$ creates a unique generator. The same holds for the discriminator training step. The procedure is presented in Algorithm 1. We term our formulation of GAN Prb-GAN.

Algorithm 1 Probabilistic GAN

Initialize variational parameters $W_{d},W_{g}$ randomly
Let $B$: batch size, $N$: MC sample iterations
while not converged do
    $\{z^{b}\}\sim p(z);\ b\in[B]$  $\triangleright$ Sample noise variables
    $\{x^{b}\}\sim X;\ b\in[B]$  $\triangleright$ Sample real data samples
    ${\theta_{g}}_{*}\sim q(W_{g};p)$  $\triangleright$ Choose random generator network parameter
    for $i=1$ to $N$ do
        ${\theta_{d}}^{(i)}\sim q(W_{d};p)$  $\triangleright$ Sample a discriminator
        $f_{d}\leftarrow\sum_{b=1}^{B}[\log D(x^{b};{\theta_{d}}^{(i)})+\log(1-D(G(z^{b};{\theta_{g}}_{*});{\theta_{d}}^{(i)}))]$  $\triangleright$ Loss
        $\alpha_{i}\leftarrow\nabla_{{\theta_{d}}^{(i)}}f_{d}$  $\triangleright$ Gradient
    end for
    $W_{d}\leftarrow W_{d}+\dfrac{\lambda}{BN}\sum_{i=1}^{N}\alpha_{i}$  $\triangleright$ Update using average gradient
    ${\theta_{d}}_{*}\sim q(W_{d};p)$  $\triangleright$ Choose random discriminator network parameter
    for $i=1$ to $N$ do
        ${\theta_{g}}^{(i)}\sim q(W_{g};p)$  $\triangleright$ Sample a generator
        $f_{g}\leftarrow\sum_{b=1}^{B}[\log D(G(z^{b};{\theta_{g}}^{(i)});{\theta_{d}}_{*})]$  $\triangleright$ Loss
        $\beta_{i}\leftarrow\nabla_{{\theta_{g}}^{(i)}}f_{g}$  $\triangleright$ Gradient
    end for
    $W_{g}\leftarrow W_{g}+\dfrac{\lambda}{BN}\sum_{i=1}^{N}\beta_{i}$  $\triangleright$ Update using average gradient
end while
return $W_{d},W_{g}$
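To make the control flow of Algorithm 1 concrete, the following is a minimal runnable sketch on a deliberately tiny problem. Everything about the models is an assumption made for illustration: the generator is an affine map $G(z)=sz+t$, the discriminator is $D(x)=\sigma(ax+c)$, gradients are written out by hand, the dropout mask is applied elementwise, and the gradient of a dropped parameter is masked to zero as standard dropout back-propagation would do. As in Algorithm 1, the $L_{2}$ term is left out of the update.

```python
import math
import random


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def sample_mask(n, keep_prob, rng):
    return [1.0 if rng.random() < keep_prob else 0.0 for _ in range(n)]


def train_prb_gan(data, steps=200, B=16, N=5, keep_prob=0.8, lr=0.05, seed=0):
    rng = random.Random(seed)
    W_d = [rng.gauss(0, 0.1), rng.gauss(0, 0.1)]  # discriminator: (a, c)
    W_g = [rng.gauss(0, 0.1), rng.gauss(0, 0.1)]  # generator: (s, t)
    for _ in range(steps):
        z = [rng.gauss(0, 1) for _ in range(B)]   # noise batch
        x = [rng.choice(data) for _ in range(B)]  # real batch

        # Discriminator step: one random generator, N sampled discriminators.
        mg = sample_mask(2, keep_prob, rng)
        s, t = W_g[0] * mg[0], W_g[1] * mg[1]
        fake = [s * zb + t for zb in z]
        grad = [0.0, 0.0]
        for _ in range(N):
            md = sample_mask(2, keep_prob, rng)
            a, c = W_d[0] * md[0], W_d[1] * md[1]
            for xb, fb in zip(x, fake):
                pr, pf = sigmoid(a * xb + c), sigmoid(a * fb + c)
                # Gradient of log D(x) + log(1 - D(G(z))) w.r.t. (a, c).
                grad[0] += md[0] * ((1 - pr) * xb - pf * fb)
                grad[1] += md[1] * ((1 - pr) - pf)
        W_d = [w + lr / (B * N) * g for w, g in zip(W_d, grad)]

        # Generator step: one random discriminator, N sampled generators.
        md = sample_mask(2, keep_prob, rng)
        a, c = W_d[0] * md[0], W_d[1] * md[1]
        grad = [0.0, 0.0]
        for _ in range(N):
            mg = sample_mask(2, keep_prob, rng)
            s, t = W_g[0] * mg[0], W_g[1] * mg[1]
            for zb in z:
                p = sigmoid(a * (s * zb + t) + c)
                # Gradient of log D(G(z)) w.r.t. (s, t).
                grad[0] += mg[0] * (1 - p) * a * zb
                grad[1] += mg[1] * (1 - p) * a
        W_g = [w + lr / (B * N) * g for w, g in zip(W_g, grad)]
    return W_d, W_g


toy_data = [3.0 + 0.1 * k for k in range(-10, 11)]  # hypothetical 1D data
print(train_prb_gan(toy_data))
```

The sketch only mirrors the sampling and averaging structure of the algorithm; it makes no claim about convergence or sample quality on real data.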

IV Uncertainty Measures for GANs

In this section, we further improve our approach through the use of uncertainty measures for GANs. Kendall et al. [41, 6] discussed how neural networks could be equipped with prediction uncertainty estimates in addition to their actual predictions on regression and classification tasks. We work with two types of uncertainty measures: aleatoric uncertainty, which arises from the data itself, and epistemic uncertainty, which arises when the model has not been trained well enough on the data. Sources of aleatoric uncertainty include occlusion in the data, multiclass instances in a single image, noise due to the data capturing equipment itself, or any other problem that causes input data not to conform to the expected standard. In such cases, even if our model is trained to the best of its ability, it may still not be certain about its predictions. Essentially, this is the uncertainty that cannot be 'trained' away with more data. Epistemic uncertainty measures the uncertainty about a prediction based on how well the model was trained. For example, suppose the data is perfectly free from noise, and consider a prediction model that has been trained only on a very small portion of the observed data. Due to the lack of training, the model cannot be certain about its prediction, and thus the predictive posterior may have a wide spread. As more and more data is seen, the variance of the predictive posterior decreases and the posterior concentrates around its mean. Thus we say that as the model is trained, it is increasingly free of epistemic uncertainty.

Inspired by these ideas, we asked whether GANs could benefit from a similar line of thought. Essentially, we ask: could we equip discriminators with the ability to be certain or uncertain about their predictions? And could we use the variance of the MC-sampled discriminator scores to improve the generator's gradient?

IV-A Weighted discriminator scores

Inspired by the idea of determining aleatoric uncertainty for neural nets, we allow each dropout-sampled discriminator to decide how certain or uncertain it wants to be about its prediction on its input data-point. Every randomly sampled discriminator corresponds to instantiating a decision boundary for separating real and fake data points. When an input data-point is closer to the captured mode of one of these sampled discriminators and that discriminator makes a prediction with high certainty, its score is weighted higher. This encourages the discriminators to capture diverse modes of the data by allowing each to be sure of what it knows and unsure of what it does not know. Figure 3 illustrates this: since the input data-point is closer to the modes captured by discriminator 2, we would like discriminator 2 to be more sure about its prediction than discriminator 1.

Following this, we create a new loss function that weighs the prediction scores of the discriminators based on their uncertainty scores. Uncertainty scores are estimated by adding another branch from the penultimate layer of the discriminator. The greater the uncertainty measure, the closer the predicted logit score of that discriminator is to 0 (correspondingly, the actual prediction is close to 0.5). Consider the logits predicted by a discriminator $D(x;\theta_{d})$. Suppose we allow the discriminator to specify how certain it is about its prediction, with the uncertainty represented as $u(x)$; then the modification is

	$\displaystyle D(x;\ \theta_{d})\rightarrow D^{\prime}(x;\ \theta_{d}):=\dfrac{D(x;\ \theta_{d})}{u(x;\ \theta_{d})+b_{1}}$		(11)

Here $b_{1}$ is a hyperparameter that stands for a bias term, which we set to a small number in our experiments. This prevents division by zero in case the predicted uncertainty is 0. The effect of rescaling the predicted logits as above is to increase or decrease the prediction magnitude, making the logit prediction closer to 0 when $u(x;\theta_{d})+b_{1}>1$ and further away from 0 when $u(x;\theta_{d})+b_{1}<1$.
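A small numerical sketch of Eq. 11 shows how the rescaling moves the prediction. The value `b1=1e-3` and the assumption that the uncertainty is non-negative are illustrative choices, not settings taken from the experiments.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def weighted_logit(logit, uncertainty, b1=1e-3):
    """Eq. 11 sketch: D'(x) = D(x) / (u(x) + b1), with u(x) >= 0 assumed."""
    return logit / (uncertainty + b1)


# Same raw logit, three hypothetical uncertainty values.
for u in (0.25, 1.0, 4.0):
    d_prime = weighted_logit(2.0, u)
    print(f"u={u}: logit {d_prime:.3f}, probability {sigmoid(d_prime):.3f}")
```

A high uncertainty pulls the probability towards 0.5, while a low uncertainty pushes it towards 0 or 1, depending on the sign of the raw logit.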

To prevent the discriminator from assigning high uncertainty to all points, we add the uncertainty estimate itself as an additional penalty term. Thus the discriminator loss, factoring in this input-data uncertainty, changes from Eq. 12 to Eq. 13.

	$\displaystyle\begin{split}\mathcal{L}_{\text{old dis loss}}=\dfrac{1}{N}\text{BCE}(\sigma(D(x;\ \theta_{d})))\end{split}$		(12)
	$\displaystyle\begin{split}\mathcal{L}_{\text{new dis loss}}=\dfrac{1}{N}[\text{BCE}&\left(\sigma\left(D^{\prime}(x;\ \theta_{d})\right)\right)\\ &\ +\ u(x;\ \theta_{d})]\end{split}$		(13)

BCE is the binary cross-entropy loss.
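The loss of Eq. 13 can be sketched as follows, reading the $1/N$ factor as an average over the $N$ sampled discriminators (an assumption about the notation). The function names, the explicit `target` label argument, and `b1=1e-3` are hypothetical choices made for the sketch.

```python
import math


def bce_with_logit(logit, target):
    # Numerically stable binary cross-entropy applied to sigmoid(logit).
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def uncertainty_weighted_dis_loss(logits, uncertainties, target, b1=1e-3):
    """Eq. 13 sketch: average of BCE(sigma(D'(x))) + u(x) over sampled discriminators."""
    total = 0.0
    for d, u in zip(logits, uncertainties):
        d_prime = d / (u + b1)                      # Eq. 11 rescaling
        total += bce_with_logit(d_prime, target) + u  # uncertainty penalty
    return total / len(logits)


# Three sampled discriminators scoring one real point (target label 1).
print(uncertainty_weighted_dis_loss([2.0, -0.5, 1.0], [0.5, 2.0, 1.0], target=1.0))
```

The penalty term makes it costly to declare high uncertainty everywhere, while the rescaled logit softens the cross-entropy of a discriminator that is wrong but admits to being unsure.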

IV-B Discriminator score set variance

Our second modification is based on the idea that we could measure the variance of the set of discriminator scores and use this information to improve gradients for the generator. If the discriminator scores exhibit high variance, then the sampled discriminators disagree with each other: some must have assigned high logit scores while others scored the point with low logit scores. This would presumably happen when the data point is close to modes captured by some of the discriminators while lying outside the high-probability regions of the remaining discriminators, which would indicate that the discriminators have captured diverse modes. When the generator generates such data points, we would like to reward it based on the magnitude of the score set variance.

Consider a concrete example with two separate cases, in which the discriminators assign logit scores to the same generated point as $[7,-1,7,-1]$ and $[3,3,3,3]$ respectively. The mean score in both cases is equal to 3; however, the first case exhibits high variance among the scores while the second does not. We argue that the first case might be more desirable, because one half of the discriminators recognize the point as being close to their captured modes while the other half mark it as unsure or outside their captured modes, thus again incentivizing diverse mode capture.
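The arithmetic for the two score sets can be written out explicitly. The population variance (dividing by the number of scores) is used here; which variance estimator is intended is an assumption.

```latex
\mu_{1} = \tfrac{1}{4}\,(7 - 1 + 7 - 1) = 3, \qquad
\sigma_{1}^{2} = \tfrac{1}{4}\left[(7-3)^{2} + (-1-3)^{2} + (7-3)^{2} + (-1-3)^{2}\right] = 16

\mu_{2} = \tfrac{1}{4}\,(3 + 3 + 3 + 3) = 3, \qquad
\sigma_{2}^{2} = \tfrac{1}{4}\left[0 + 0 + 0 + 0\right] = 0
```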

To implement this idea, we introduce an additional term in the loss function of the generators that rewards generated points that have high variance in the discriminator logit score set. The updated loss function for the generator is given as

	$\displaystyle\begin{split}\mathcal{L}_{\text{new gen loss}}=\dfrac{1}{N}\sum_{n=1}^{N}\text{BCE}(D_{n}^{\prime}(G(z)))\\ -\lambda\dfrac{\text{var}\{D_{n}^{\prime}(G(z))\}_{n=1}^{N}}{\mu^{2}\{D_{n}^{\prime}(G(z))\}_{n=1}^{N}+b_{2}}\end{split}$		(14)

In the above, $D_{n}$ represents the $n^{th}$ sampled discriminator. The term features a hyperparameter $\lambda$, a fixed bias term $b_{2}$ that prevents division by 0, and the mean and variance of the set of discriminator scores. If the mean of the predictions is closer to zero, the term carries more weight; this emphasizes that the term is relevant only when the mean of the discriminator scores is near 0. The greater the variance of the discriminator scores, the greater the reward.
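The generator loss of Eq. 14 can be sketched as below. The values of `lam` and `b2`, the use of the population variance, and the target label 1 for generated points (the generator wants them marked real, as in Eq. 8) are assumptions of the sketch.

```python
import math
import statistics


def bce_with_logit(logit, target):
    # Numerically stable binary cross-entropy applied to sigmoid(logit).
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def variance_reward(logits, b2=1e-3):
    """Second term of Eq. 14 without lambda: var / (mean^2 + b2)."""
    mu = statistics.mean(logits)
    var = statistics.pvariance(logits)  # population variance (assumed)
    return var / (mu * mu + b2)


def new_generator_loss(logits, lam=0.1, b2=1e-3):
    """Eq. 14 sketch over the logits D'_n(G(z)) of N sampled discriminators."""
    bce = sum(bce_with_logit(d, 1.0) for d in logits) / len(logits)
    return bce - lam * variance_reward(logits, b2)


for scores in ([7.0, -1.0, 7.0, -1.0], [3.0, 3.0, 3.0, 3.0], [4.0, -4.0, 4.0, -4.0]):
    print(scores, variance_reward(scores), new_generator_loss(scores))
```

The third score set has the same variance as the first but a mean of 0, so its reward term is far larger; this is the weighting by the mean described above.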

V Results

We conduct two different sets of studies to measure the performance of Prb-GAN against other standard GANs. The first is a set of initial experiments we conducted to validate our ideas. The second is a larger study that compares models using the current best GAN evaluation metric, the Frechet Inception Distance (FID) score, on more standard datasets, namely CIFAR-10 and CelebA. Finally, we show the results for uncertainty measure equipped probabilistic GANs.

V-A Small scale experiments

Initially, we evaluated our model on two simple datasets: a mixture of 1-dimensional Gaussians and the MNIST dataset. We show that on both datasets, whereas the standard GAN exhibits mode loss behavior, the probabilistic formulation described above significantly reduces mode loss. We use a simple, fully connected 4-layer neural network with an input size of $z_{d}=100$, where $z_{d}$ is the dimensionality of the random latent code space, and a hidden layer size of 600 neurons with the leaky ReLU activation function. Xavier initialization of the weights is used. We set the dropout probability $p$ = 0.4 and $N$ = 20.

The 1D Gaussian dataset comprises data-points in $\mathbb{R}$ sampled from a mixture of 5 Gaussians. The means of the Gaussians are located at $[10,20,60,80,110]$ with standard deviations $[3,3,2,2,1]$. This is the same dataset used in [14]. The dataset consists of 600,000 data-points sampled from the mixture distribution. In the histogram plots, the real dataset is shown in red, while the generated data is shown in green. An ideal generative model would recreate the dataset perfectly, and the associated histogram would resemble the true distribution. The vanilla GAN and Prb-GAN are trained for 5 epochs over the entire dataset.
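A dataset of this form can be generated with a few lines of Python. The means, standard deviations and dataset size are those stated above; the equal mixing weights and the fixed seed are assumptions, since the mixture proportions are not given.

```python
import random

MEANS = [10, 20, 60, 80, 110]
STDS = [3, 3, 2, 2, 1]


def sample_mixture(n, rng):
    """Draw n points from the 5-component 1D Gaussian mixture."""
    samples = []
    for _ in range(n):
        k = rng.randrange(len(MEANS))  # uniform mixing weights (assumed)
        samples.append(rng.gauss(MEANS[k], STDS[k]))
    return samples


data = sample_mixture(600_000, random.Random(0))
print(len(data), min(data), max(data))
```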

Figures 4(a) and 4(b) show the histogram plots of the data generated by the two models. While the vanilla GAN seems to capture only one mode well, Prb-GAN is able to capture almost all the modes. In fact, by iteration 50, we observed that Prb-GAN had captured four of the most important modes, while the vanilla GAN had captured only one. We also noted that the values used for the dropout probability and $N$ (the number of MC sample iterations) can change the results. In general, the greater the dropout probability, the greater the value of $N$ needed to produce good samples.

Following this experiment, we tested our model on real images, specifically the MNIST dataset. We use the same GANs but change the input layer size to match the latent code size, which we set to 200. We run the models over the entire dataset for 100 epochs and plot the images generated by each model at the end of training. Figures 5(a) and 5(b) show that while the vanilla GAN has collapsed the entire latent code mapping to the digit 1, Prb-GAN is able to produce diverse images. We note that the visual quality of the Prb-GAN samples may appear slightly less appealing, but what we gain in return is greater diversity. Using Progressive GANs [42], one could conceive of a Prb-GAN back end that allows the network to produce diverse images, combined with a standard Progressive GAN front end that makes the produced images crisper.

Further, in Figure 6, we see that using the exact same random latent code, Prb-GAN produces very different images in separate Bernoulli dropout sampling instances. Thus separate instances of the generator actually capture different sets of modes. This observation shows that with a single GAN model and the dropout technique, we are able to achieve performance similar to peers like MADGAN, while having the advantage of not needing to know in advance the number of modes that could be present in the data.

V-B Standard GAN metric evaluation study

To test our GAN performance more rigorously, we turn to the work of [43], which compares all the standard GANs on various datasets over a variety of hyperparameters and measures them using the FID score. As opposed to the earlier Inception score [44], the FID score [45] is the current best GAN evaluation metric, since it is more robust to noise and more sensitive to mode loss. The surprising conclusion reached is that all the tested GANs may perform nearly equally. We demonstrate that while this might be the case, the use of our probabilistic approach in GANs almost always improves performance.

Our architectural setup is the one used in CompareGAN [46]. We compare some of the best variations of GANs, including Non-Saturating GANs (NSGANs, the original GAN) [1] and Least-squares GANs (LSGANs) [47]. We modify the networks by introducing dropout and training with the MC-integration-based averaged loss. The comparison is made on the MNIST [48], CIFAR-10 [49] and CelebA [50] datasets. We trained the models using a batch size of 64, with 18,750 iteration steps for the MNIST dataset, 200,000 for the CIFAR-10 dataset, and 80,000 for the CelebA dataset.

GAN model	MNIST	CIFAR-10	CelebA
NSGAN	17.66	73.24	63.95
Prb-NSGAN	16.65	64.03	56.75
LSGAN	11.37	72.60	112.05
Prb-LSGAN	10.37	54.06	84.20

TABLE I: FID scores for standard and probabilistic GANs on standard datasets. Probabilistic versions of NSGAN and LSGAN are better with improved scores.

The FID score comparison for the various GAN models is given in Table I. The lower the FID score, the better the performance of the GAN. We noted that our probabilistic approach improves the FID score for the NSGAN and the LSGAN, but not for the WGAN. We used $N$ = 20 sampled discriminator instances and dropout probability $p$ = 0.4. Dropout was applied to both the generator and the discriminator, and the averaged loss was back-propagated. Qualitative results are presented in Figure 7.

Using the two modifications described in Section IV, we now implement our uncertainty measure equipped probabilistic GAN. In this section, we tabulate the FID scores and the Inception scores; however, note that the FID score is superior to the Inception score, and hence we emphasize the former. We note that in almost all cases, the use of weighted discriminator scores improves the FID score. Prb-GAN denotes the probabilistic GAN based on the setup described in Section III. Prb-GAN (v1) denotes the probabilistic GAN setup that uses uncertainty estimation and weighted discriminator scoring. Prb-GAN (v2) is the probabilistic GAN that additionally uses the variance of the set of discriminator logit scores. Table II shows the FID scores and Table III the Inception scores. Among all our experiments, Prb-GANs equipped with uncertainty measures perform the best. We used $N$ = 10 and $p$ = 0.4, where $N$ is the number of discriminator instances sampled and $p$ is the dropout probability. Note that here we apply dropout only to the discriminator and not to the generator.

GAN model	MNIST	CIFAR-10	CelebA
NSGAN	17.829	71.611	53.299
Prb-GAN	14.749	58.273	46.355
Prb-GAN (v1)	13.818	53.967	40.533
Prb-GAN (v2)	13.723	51.040	40.561

TABLE II: FID scores for uncertainty measure equipped GANs on standard datasets.

GAN model	MNIST	CIFAR-10	CelebA
NSGAN	2.283	5.290	1.935
Prb-GAN	2.235	5.806	1.979
Prb-GAN (v1)	2.258	6.064	1.962
Prb-GAN (v2)	2.244	6.236	1.990

TABLE III: Inception scores for uncertainty measure equipped GANs on standard datasets

V-C Probabilistic Discriminators vs Probabilistic Generators

In this section, we compare the performance of GANs that use only probabilistic discriminators against GANs that use both probabilistic generators and probabilistic discriminators. We find empirically that the use of probabilistic discriminators alone is sufficient to improve performance. In Table IV, Prb-GAN (v1) denotes the probabilistic GAN version that uses probabilistic generators and probabilistic discriminators, and Prb-GAN (v2) the version that uses probabilistic discriminators and deterministic generators.

GAN model	MNIST	CIFAR-10	CelebA
Vanilla GAN	17.66	73.24	63.95
Prb-GAN (v1)	16.65	64.03	56.75
Prb-GAN (v2)	14.75	58.27	46.35

TABLE IV: FID scores comparing GANs with probabilistic discriminators and deterministic generators against GANs with probabilistic generators and discriminators on standard datasets. Probabilistic discriminators alone provide superior performance compared to probabilistic generators and discriminators together.

VI Conclusion

In this paper, we have described the issues associated with the GAN model and described a probabilistic approach to solving these problems. In the first part, we discussed how the standard technique of Dropout could be used to create a probabilistic version of GANs and how the simple inference mechanism requiring minimal changes to current training schemes could significantly alleviate the mode dropping issue along with training instability issues. In later sections, we described how ideas similar to the use of uncertainty measures currently used in Bayesian Neural Networks, could be leveraged in order to improve GAN performance metrics on standard datasets further.

References

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
[2] I. Durugkar, I. Gemp, and S. Mahadevan, “Generative multi-adversarial networks.”
[3] A. Ghosh, V. Kulharia, V. P. Namboodiri, P. H. Torr, and P. K. Dokania, “Multi-agent diverse generative adversarial networks,” in CVPR, 2018.
[4] G. Mordido, H. Yang, and C. Meinel, “Dropout-gan: Learning from a dynamic ensemble of discriminators,” arXiv preprint arXiv:1807.11346, 2018.
[5] Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” in international conference on machine learning, 2016, pp. 1050–1059.
[6] A. Kendall, Y. Gal, and R. Cipolla, “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7482–7491.
[7] Y. Saatci and A. G. Wilson, “Bayesian gan,” in Advances in neural information processing systems, 2017, pp. 3622–3631.
[8] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.
[9] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley, “Least squares generative adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2794–2802.
[10] I. Deshpande, Y.-T. Hu, R. Sun, A. Pyrros, N. Siddiqui, S. Koyejo, Z. Zhao, D. Forsyth, and A. G. Schwing, “Max-sliced wasserstein distance and its use for gans,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2019, pp. 10648–10656.
[11] I. Deshpande, Z. Zhang, and A. G. Schwing, “Generative modeling using the sliced wasserstein distance,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3483–3491.
[12] J. Wu, Z. Huang, D. Acharya, W. Li, J. Thoma, D. P. Paudel, and L. V. Gool, “Sliced wasserstein generative models,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2019, pp. 3713–3722.
[13] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” in Advances in neural information processing systems, 2017, pp. 5767–5777.
[14] A. Ghosh, V. Kulharia, V. P. Namboodiri, P. H. Torr, and P. K. Dokania, “Multi-agent diverse generative adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8513–8521.
[15] H. Eghbal-zadeh and G. Widmer, “Probabilistic generative adversarial networks,” arXiv preprint arXiv:1708.01886, 2017.
[16] S. Nowozin, B. Cseke, and R. Tomioka, “f-gan: Training generative neural samplers using variational divergence minimization,” in Advances in neural information processing systems, 2016, pp. 271–279.
[17] G.-J. Qi, “Loss-sensitive generative adversarial networks on lipschitz densities,” International Journal of Computer Vision, pp. 1–23, 2019.
[18] T. Che, Y. Li, A. Jacob, Y. Bengio, and W. Li, “Mode regularized generative adversarial networks,” 2017.
[19] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” 2018.
[20] R. D. Hjelm, A. P. Jacob, T. Che, A. Trischler, K. Cho, and Y. Bengio, “Boundary-seeking generative adversarial networks,” in 6th International Conference on Learning Representations, ICLR 2018, 2018.
[21] H. Gao, J. Pei, and H. Huang, “Progan: Network embedding via proximity generative adversarial network,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 1308–1316.
[22] J. Donahue and K. Simonyan, “Large scale adversarial representation learning,” in Advances in Neural Information Processing Systems, 2019, pp. 10541–10551.
[23] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “Stargan: Unified generative adversarial networks for multi-domain image-to-image translation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8789–8797.
[24] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. V. Gool, “Generative adversarial networks for extreme learned image compression,” in The IEEE International Conference on Computer Vision (ICCV), October 2019.
[25] H. Yang, H. Ouyang, V. Koltun, and Q. Chen, “Hiding video in audio via reversible generative models,” in The IEEE International Conference on Computer Vision (ICCV), October 2019.
[26] X. Gong, S. Chang, Y. Jiang, and Z. Wang, “Autogan: Neural architecture search for generative adversarial networks,” in The IEEE International Conference on Computer Vision (ICCV), October 2019.
[27] T. R. Shaham, T. Dekel, and T. Michaeli, “Singan: Learning a generative model from a single natural image,” in The IEEE International Conference on Computer Vision (ICCV), October 2019.
[28] J. N. Kundu, M. Gor, D. Agrawal, and R. V. Babu, “Gan-tree: An incrementally learned hierarchical generative framework for multi-modal data distributions,” in The IEEE International Conference on Computer Vision (ICCV), October 2019.
[29] H. Huang, C. Wang, P. S. Yu, and C.-D. Wang, “Generative dual adversarial network for generalized zero-shot learning,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[30] Q. Mao, H.-Y. Lee, H.-Y. Tseng, S. Ma, and M.-H. Yang, “Mode seeking generative adversarial networks for diverse image synthesis,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[31] H. Eghbal-zadeh, W. Zellinger, and G. Widmer, “Mixture density generative adversarial networks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[32] S. Jenni and P. Favaro, “On stabilizing generative adversarial training with noise,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[33] H. He, H. Wang, G. Lee, and Y. Tian, “Probgan: Towards probabilistic GAN with theoretical guarantees,” in 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. [Online]. Available: https://openreview.net/forum?id=H1l7bnR5Ym
[34] V. K. Kurmi and V. P. Namboodiri, “Looking back at labels: A class based domain adaptation technique,” in 2019 International Joint Conference on Neural Networks (IJCNN). IEEE, 2019, pp. 1–8.
[35] L. Han, R. Gao, M. Kim, X. Tao, B. Liu, and D. N. Metaxas, “Robust conditional gan from uncertainty-aware pairwise comparisons,” in AAAI, 2020, pp. 10909–10916.
[36] V. K. Kurmi, S. Kumar, and V. P. Namboodiri, “Attending to discriminative certainty for domain adaptation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 491–500.
[37] B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable predictive uncertainty estimation using deep ensembles,” in Advances in Neural Information Processing Systems 30, 2017, pp. 6402–6413.
[38] D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan, “Augmix: A simple data processing method to improve robustness and uncertainty,” ICLR, 2020.
[39] V. K. Kurmi, V. Bajaj, V. K. Subramanian, and V. P. Namboodiri, “Curriculum based dropout discriminator for domain adaptation,” arXiv preprint arXiv:1907.10628, 2019.
[40] W. K. Hastings, “Monte carlo sampling methods using markov chains and their applications,” 1970.
[41] A. Kendall and Y. Gal, “What uncertainties do we need in bayesian deep learning for computer vision?” in Advances in neural information processing systems, 2017, pp. 5574–5584.
[42] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of gans for improved quality, stability, and variation,” ICLR, 2018.
[43] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet, “Are gans created equal? a large-scale study,” in Advances in neural information processing systems, 2018, pp. 700–709.
[44] S. Barratt and R. Sharma, “A note on the inception score,” arXiv preprint arXiv:1801.01973, 2018.
[45] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” in Advances in Neural Information Processing Systems, 2017, pp. 6626–6637.
[46] Google. (2019) Compare GAN. [Online]. Available: https://github.com/google/compare_gan
[47] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley, “Least squares generative adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2794–2802.
[48] Y. LeCun and C. Cortes, “MNIST handwritten digit database,” 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist/
[49] A. Krizhevsky et al., “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
[50] Z. Liu, P. Luo, X. Wang, and X. Tang, “Large-scale celebfaces attributes (celeba) dataset,” retrieved August 15, 2018.