
Toward a Generalization Metric for Deep Generative Models

Hoang Thanh-Tung
Deakin University
[email protected]
Truyen Tran
Deakin University
[email protected]
Abstract

Measuring the generalization capacity of Deep Generative Models (DGMs) is difficult because of the curse of dimensionality. Evaluation metrics for DGMs such as Inception Score, Fréchet Inception Distance, Precision-Recall, and Neural Net Divergence try to estimate the distance between the generated distribution and the target distribution using a polynomial number of samples. These metrics are what researchers target when designing new models. Despite that, it is still unclear how well they measure the generalization capacity of a generative model. In this paper, we investigate how well these metrics measure generalization. We introduce a framework for comparing the robustness of evaluation metrics. We show that better scores in these metrics do not imply better generalization: they can be fooled easily by a generator that memorizes a small subset of the training set. We propose a fix to the NND metric to make it more robust to noise in the generated data. Toward building a robust metric for generalization, we propose to apply the Minimum Description Length (MDL) principle to the problem of evaluating DGMs. We develop an efficient method for estimating the complexity of Generative Latent Variable Models (GLVMs). Experimental results show that our metric can effectively detect training set memorization and distinguish GLVMs of different generalization capacities.

1 Introduction

The quality of a generative model is defined by the distance between the model distribution and the target distribution. Traditional distances/divergences such as the KL divergence or the Wasserstein distance are intractable because of their exponential sample complexities. In the last few years, a number of evaluation metrics with polynomial sample complexities have been introduced. Some of the most common metrics are the Inception Score (IS) [24], the Fréchet Inception Distance (FID) [12], and Precision-Recall metrics [23, 17, 25]. These metrics are intuitive and fast to compute, but some researchers have raised concerns regarding their reliability [11]. Gulrajani et al. [11] proposed to use the Neural Net Divergence (NND) [2] to measure the generalization capacity of generative models. In this paper, we evaluate the ability of the above evaluation metrics to measure generalization. We find that none of them can reliably measure generalization: they can all be fooled easily by a generative model that memorizes a small subset of the training set.

We propose a novel generalization metric for a broad class of generative models called the Generative Latent Variable Models. Our metric is based on the Minimum Description Length principle [16, 20, 21, 22, 5]. The metric combines the complexity of the generative model and the divergence between the model distribution and the target distribution into a single value. The metric inherits the robustness of the MDL framework and can be computed in polynomial time using a polynomial number of samples.

Notations

$\bm{z}\in\mathbb{R}^{d_{z}}$: a latent variable.
$\bm{x}\in\mathbb{R}^{d}$: a data point.
$\mathcal{G}(\cdot;\bm{\theta})$: a generative model with parameter $\bm{\theta}\in\mathbb{R}^{d_{\theta}}$.
$\mathscr{D}_{r}=\left\{\bm{x}_{1},...,\bm{x}_{k}\ |\ \bm{x}_{i}\sim p_{data}\right\}$: a set of samples from $p_{data}$.
$\mathscr{D}_{g}=\left\{\bm{x}_{1},...,\bm{x}_{l}\ |\ \bm{x}_{i}\sim p_{model}\right\}$: a set of samples from $p_{model}$.
$\mathscr{D}_{train}=\left\{\bm{x}_{1},...,\bm{x}_{n}\ |\ \bm{x}_{i}\sim p_{data}\right\}$: the training set, $|\mathscr{D}_{train}|=\textnormal{poly}(d)$.
$\mathscr{D}_{test}=\left\{\bm{x}_{1},...,\bm{x}_{m}\ |\ \bm{x}_{i}\sim p_{data},\ \bm{x}_{i}\notin\mathscr{D}_{train}\right\}$: the test set, $|\mathscr{D}_{test}|=\textnormal{poly}(d)$.
$\hat{p}_{\mathscr{D}}$: the uniform distribution over $\mathscr{D}$.

2 Background and Related Work

2.1 Divergence based evaluation metrics

A generative model $\mathcal{G}$ is trained on $\mathscr{D}_{train}$ to produce $p_{model}$, an estimate of $p_{data}$. The quality of $\mathcal{G}$ is defined by the divergence/distance between $p_{data}$ and $p_{model}$. Let $f$ be a divergence/distance function. Because $f(p_{data},p_{model})$ is intractable for most distributions of interest, we approximate it with $f(\mathscr{D}_{r},\mathscr{D}_{g})$, where $\mathscr{D}_{r},\mathscr{D}_{g}$ are polynomial sized datasets.

Inception Score (IS) [24] is defined as:

\textnormal{IS}\left(p_{model}\right)=\exp\left(\text{KL}\left(p(Y|\bm{X})\,||\,p(Y)\right)\right) \qquad (1)

where $\bm{x}\sim p_{model}$ is the input, $y$ is the label, and $p(y)=\mathbb{E}_{\bm{x}\sim p_{model}}\left[p(y|\bm{x})\right]$. $p(y|\bm{x})$ is computed by feeding the generated data point through a pretrained classifier, e.g. the Inception network [26]. IS is maximized when $\textnormal{H}\left(Y|\bm{X}\right)=0$ and $\textnormal{H}(Y)=\ln C$, where $C$ is the number of classes. Barratt and Sharma [4] showed that $\textnormal{IS}\left(p_{model}\right)\leq C$. To make our writing consistent, we convert all of the metrics in this paper to (pseudo) divergences. A lower divergence should imply that $p_{model}$ is closer to $p_{data}$. We define the following pseudo divergence:

f_{is}\left(p_{model}\right)=C-\exp\left(\text{KL}\left(p(Y|\bm{X})\,||\,p(Y)\right)\right) \qquad (2)

This is a pseudo divergence because (1) it involves only one distribution, $p_{model}$; the other distribution is implicitly the dataset on which the classifier was trained, and (2) $f_{is}\left(p_{model}\right)=0$ does not imply that the two distributions are the same (more on this in Section 3). A fundamental flaw of IS is that a smaller $f_{is}\left(p_{model}\right)$ implies that $p(y)$ is closer to the uniform distribution, not that $p(y)$ is closer to the true distribution over the labels. Fortunately (or unfortunately), the distribution over the labels in the Imagenet dataset [6] is uniform. This makes IS a popular and somewhat abused tool for evaluating image generation models.
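To make the computation of Eqn. 2 concrete, below is a minimal sketch (an illustration, not a reference implementation) of how $f_{is}$ could be estimated from a matrix of class probabilities produced by a pretrained classifier applied to generated samples; the array name `probs` and the small constant added for numerical stability are our assumptions.

import numpy as np

def inception_pseudo_divergence(probs, num_classes):
    """f_is = C - exp(E_x[KL(p(y|x) || p(y))]), Eqn. 2.
    probs: array of shape (num_samples, num_classes) holding p(y|x) for generated samples."""
    p_y = probs.mean(axis=0)  # marginal label distribution p(y)
    kl = (probs * (np.log(probs + 1e-12) - np.log(p_y + 1e-12))).sum(axis=1)
    return num_classes - np.exp(kl.mean())

A lower value only indicates that $p(y)$ is closer to uniform, which, as discussed above, is not the same as being closer to the true label distribution.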

Fréchet Inception Distance (FID): Heusel et al. [12] approximate $p_{data}$ and $p_{model}$ with two Gaussian distributions $\mathcal{N}\left(\bm{\mu}_{r},\bm{\Sigma}_{r}\right)$ and $\mathcal{N}\left(\bm{\mu}_{g},\bm{\Sigma}_{g}\right)$, respectively, and then compute the Fréchet distance between the two Gaussians:

f_{fid}\left(p_{data},p_{model}\right)=\left\lVert\bm{\mu}_{r}-\bm{\mu}_{g}\right\rVert_{2}^{2}+\textnormal{Tr}\left(\bm{\Sigma}_{r}+\bm{\Sigma}_{g}-2\left(\bm{\Sigma}_{r}\bm{\Sigma}_{g}\right)^{\frac{1}{2}}\right) \qquad (3)

Because FID compares only the first two moments of $p_{data}$ and $p_{model}$, $f_{fid}\left(p_{data},p_{model}\right)=0$ does not guarantee that the two distributions are the same.
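A compact sketch of Eqn. 3, assuming the feature matrices (e.g. Inception pool features of real and generated samples) have already been extracted; the function name and the use of scipy's matrix square root are illustrative choices.

import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    """Eqn. 3 computed from feature matrices of shape (n, feature_dim)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    covmean = covmean.real  # discard small imaginary parts caused by numerical error
    return np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2 * covmean)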

Neural Net Divergence (NND) [11, 2] uses neural nets to compute the divergence between $p_{data}$ and $p_{model}$:

f_{nnd}\left(p_{data},p_{model}\right)=\underset{f_{\bm{\phi}}\in\mathcal{F}}{\sup}\left(\mathbb{E}_{\bm{x}\sim p_{data}}\left[f_{\bm{\phi}}(\bm{x})\right]-\mathbb{E}_{\bm{x}\sim p_{model}}\left[f_{\bm{\phi}}(\bm{x})\right]\right) \qquad (4)

where $f_{\bm{\phi}}:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is usually a neural network with finite capacity. $f_{\bm{\phi}}$ is trained to maximize the difference in Eqn. 4. Arora et al. [2] showed that NND can detect the lack of diversity in generated data. Gulrajani et al. [11] leveraged that capability to estimate the generalizability of GMs. We show in Section 3 that the diversity in generated data can be faked easily by adding white noise to the data. This fake diversity is present in real-world models such as GANs and VAEs. We propose a fix to NND that removes the effect of fake diversity while retaining the discriminative power of NND.
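The following is a hedged sketch of how Eqn. 4 could be estimated in practice: a small, finite-capacity MLP critic is trained to maximize the difference of means, and the final gap is reported. The architecture, optimizer settings, and batch size here are illustrative assumptions; the experiments in this paper additionally constrain the critic with a WGAN1GP-style loss (see Appx. B), which this sketch omits for brevity.

import torch
import torch.nn as nn

def neural_net_divergence(real, fake, steps=2000, lr=1e-4, batch=64):
    """Estimate Eqn. 4: train a critic f_phi to maximize E_data[f_phi] - E_model[f_phi].
    real, fake: float tensors of shape (n, d)."""
    d = real.shape[1]
    critic = nn.Sequential(nn.Linear(d, 512), nn.ReLU(),
                           nn.Linear(512, 512), nn.ReLU(),
                           nn.Linear(512, 1))
    opt = torch.optim.Adam(critic.parameters(), lr=lr)
    for _ in range(steps):
        r = real[torch.randint(len(real), (batch,))]
        f = fake[torch.randint(len(fake), (batch,))]
        loss = -(critic(r).mean() - critic(f).mean())  # gradient ascent on the objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return (critic(real).mean() - critic(fake).mean()).item()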

2.2 Precision-Recall based evaluation metrics

The above metrics summarize both the quality and the diversity of generated samples in a single scalar. The following metrics are designed to separate quality from diversity.

$k$-means based Precision-Recall ($k$-means PR) [23] applies the $k$-means algorithm to $\mathscr{D}_{r}\cup\mathscr{D}_{g}$, builds 2 histograms to approximate $p_{data}$ and $p_{model}$, and computes the difference between these histograms. The results are 2 F-scores, $F_{\beta}$ and $F_{1/\beta}$. For $\beta>1$, $F_{\beta}$ can be interpreted as the recall (diversity) and $F_{1/\beta}$ as the precision (quality) of the generated data (Algo. 1). We define the following pseudo divergences: $f_{p\textnormal{-}kmeans}=1-F_{1/\beta}$, $f_{r\textnormal{-}kmeans}=1-F_{\beta}$.

$k$-nn based Precision-Recall ($k$-nn PR) [17] computes the precision and recall by comparing the reconstructed real manifold and the reconstructed fake manifold (Algo. 2 and Fig. 1). Kynkäänniemi et al. [17] claimed that $k$-nn PR has better discriminative power than $k$-means PR and demonstrated that on several datasets. However, we show in Section 3 that $k$-nn PR has the worst worst-case performance, and in practice, it cannot distinguish a good model from a very bad model. We define the following pseudo divergences: $f_{p\text{-}knn}=1-P$, $f_{r\text{-}knn}=1-R$.

Similar to IS and FID, $k$-means PR and $k$-nn PR assign (near) perfect scores to generative models that produce the entire training set.

Algorithm 1 $k$-means PR
1: Inputs: $\mathscr{D}_{r},\ \mathscr{D}_{g}$, hyperparameter $\beta$
2: Outputs: $F_{\beta},\ F_{1/\beta}$
3: $\mathscr{C}\leftarrow kmeans(\mathscr{D}_{r}\cup\mathscr{D}_{g})$
4: $hist_{g}[i]\leftarrow$ (# fake data points in cluster $\mathscr{C}_{i}$) $/\ |\mathscr{D}_{g}|$
5: $hist_{r}[i]\leftarrow$ (# real data points in cluster $\mathscr{C}_{i}$) $/\ |\mathscr{D}_{r}|$
6: $curve\leftarrow pr\_curve(hist_{r},hist_{g})$
7: $F_{\beta}\leftarrow f\_score(curve,\beta)$
8: $F_{1/\beta}\leftarrow f\_score(curve,1/\beta)$
9: return $F_{\beta},\ F_{1/\beta}$
Algorithm 2 $k$-nn PR
1: Inputs: $\mathscr{D}_{r},\ \mathscr{D}_{g}$
2: Outputs: Precision $P$, Recall $R$
3: $\mathcal{M}_{g}\leftarrow\emptyset$  // build the fake manifold
4: for each data point $\bm{x}_{i}\in\mathscr{D}_{g}$ do
5:     $\bm{x}_{j}\leftarrow$ the $k$-th nearest neighbor of $\bm{x}_{i}$
6:     $S_{i}\leftarrow sphere(c=\bm{x}_{i},r=\left\lVert\bm{x}_{i}-\bm{x}_{j}\right\rVert)$
7:     $\mathcal{M}_{g}\leftarrow\mathcal{M}_{g}\cup S_{i}$
8: end for
9: build the real manifold $\mathcal{M}_{r}$
10: $P\leftarrow$ (# fake data points in $\mathcal{M}_{r}$) $/\ |\mathscr{D}_{g}|$
11: $R\leftarrow$ (# real data points in $\mathcal{M}_{g}$) $/\ |\mathscr{D}_{r}|$
12: return $P,\ R$
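For reference, a small numpy sketch of Algorithm 2; the vectorized distance computations are an implementation convenience (a memory-efficient version would process the distance matrices in blocks), and the function name is ours.

import numpy as np
from scipy.spatial.distance import cdist

def knn_precision_recall(real, fake, k=3):
    """Algorithm 2: a point is covered by a manifold if it falls inside any sphere
    centered at a sample of the other set, with radius equal to the distance from
    that sample to its k-th nearest neighbor within its own set."""
    def knn_radii(data):
        d = cdist(data, data)
        return np.sort(d, axis=1)[:, k]  # column 0 is the distance of a point to itself
    r_radii, f_radii = knn_radii(real), knn_radii(fake)
    d_fake_real = cdist(fake, real)
    precision = np.mean((d_fake_real <= r_radii[None, :]).any(axis=1))  # fake points on M_r
    recall = np.mean((d_fake_real.T <= f_radii[None, :]).any(axis=1))   # real points on M_g
    return precision, recall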

Figure 1: $k$-nn PR on a 2-dimensional dataset. (a) real data (blue crosses) and fake data (red dots). (b), (c) real and fake manifolds approximated with 3-nearest neighbors. Although the real and the fake data are very different, the precision and the recall of the fake data are both 1.

2.3 Other evaluation metrics

Metrics for class-conditional models: Esteban et al. [7] proposed the “Train on Synthetic Test on Real” (TSTR) approach to evaluate class-conditional GANs. The metric is expensive to compute as it requires training a classifier on a labeled synthetic dataset. Shmelkov et al. [25] extended the idea to a Precision-Recall-like metric for class-conditional GANs. Because these metrics are applicable only to class-conditional GMs, we do not consider them in this paper.

Topological/Geometrical approaches: Khrulkov and Oseledets [15] and Horak et al. [13] applied geometrical and topological tools to measure the difference between the fake manifold and the real manifold. These metrics are not designed for measuring generalizability and are not considered in this paper. Karras et al. [14] used the pairwise perceptual distance - the pairwise distance between data points on the feature manifold - to measure the smoothness of the feature manifold. The intuition is that if the perceptual distance is small, then the manifold is smoother, and the GM might learn better representations and have better generalizability. We show in Section 4 that the pairwise distance does not reflect the smoothness of the manifold well, and that it is minimized when total mode collapse occurs. Therefore, the pairwise perceptual distance is not a good metric for GMs.

Non-parametric approaches: Arora et al. [3] used the birthday paradox and human evaluation to estimate the number of distinct modes that a GM can generate. The method cannot be automated and has high variance and bias because humans are involved in the loop. Meehan et al. [18] proposed a three sample test for data copying in GMs. The metric can detect data copying but cannot measure the generalizability of a model or the quality of the generated samples.

2.4 A generative model with poor generalization capacity

When mode collapse or training set memorization [8] happens, the generative model generates slightly different versions of the memorized data points. We simulate this kind of behavior with the following generative procedure:

Given a dataset $\mathscr{D}\subseteq\mathscr{D}_{train}$ and a constant $\epsilon\in\mathbb{R}$, a new data point is generated as follows:

  1. Draw a data point $\bm{x}\in\mathscr{D}$ at random.
  2. Draw a noise vector $\bm{u}\sim p_{u}$, where $p_{u}$ is the noise distribution.
  3. Output the noisy data point $\tilde{\bm{x}}=\bm{x}+\epsilon\bm{u}$.

In our experiments, $p_{u}=\mathcal{U}(-\bm{1},\bm{1})$ (see Fig. 7 for samples from the MNIST dataset). We tried replacing $\mathcal{U}(-\bm{1},\bm{1})$ with $\mathcal{N}(\bm{0},\textbf{I})$ and found no difference in performance. This generator, denoted as $\mathcal{G}_{\mathscr{D},\epsilon}$, has no real generalization capacity, and the number of essentially different modes in $p_{\mathcal{G}_{\mathscr{D},\epsilon}}$ is $\left\lvert\mathscr{D}\right\rvert$. For $\epsilon>0$, this generator can be approximated with a neural net with finite capacity.
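A minimal sketch of the generator $\mathcal{G}_{\mathscr{D},\epsilon}$ described above; the function and argument names are ours, and the memorized data points are assumed to be stored as rows of a numpy array.

import numpy as np

def memorizing_generator(subset, eps, num_samples, rng=None):
    """G_{D,eps}: pick memorized training points at random and add uniform noise.
    subset: array of shape (|D|, d) containing the memorized data points."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.integers(len(subset), size=num_samples)
    noise = rng.uniform(-1.0, 1.0, size=(num_samples, subset.shape[1]))  # p_u = U(-1, 1)
    return subset[idx] + eps * noise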

3 Evaluating evaluation metrics

Given a training set $\mathscr{D}_{train}\sim p_{data}$, we train a generative model $\mathcal{G}$ on $\mathscr{D}_{train}$ and obtain a distribution $p_{model}$. We would like to estimate the divergence/distance between $p_{data}$ and $p_{model}$ using a divergence/distance function $f$. Because the true divergence/distance $f(p_{data},p_{model})$ is intractable, we approximate it with the empirical divergence/distance $f(\mathscr{D}_{test},\mathscr{D}_{g})$, where $\mathscr{D}_{test}\sim p_{data}$ and $\mathscr{D}_{g}\sim p_{model}$. If $\mathscr{D}_{test}=\mathscr{D}_{train}$, then $\mathcal{G}$ can achieve a perfect score by memorizing the entire $\mathscr{D}_{train}$. To reduce the effect of training set memorization, we require that $\mathscr{D}_{test}\cap\mathscr{D}_{train}=\emptyset$. We want $f(\mathscr{D}_{test},\mathscr{D}_{g})$ to track $f(p_{data},p_{model})$, i.e. $f(\mathscr{D}_{test},\mathscr{D}_{g})$ is small only when $f(p_{data},p_{model})$ is small. Specifically, we want $f$ to satisfy:

\mathbb{E}_{\mathscr{D}_{test},\,\mathscr{D}_{1}\sim p_{data},\ \left\lvert\mathscr{D}_{test}\right\rvert=m,\ \left\lvert\mathscr{D}_{1}\right\rvert=n_{1}}\left[f(\mathscr{D}_{test},\mathscr{D}_{1})\right]>\mathbb{E}_{\mathscr{D}_{test},\,\mathscr{D}_{2}\sim p_{data},\ \left\lvert\mathscr{D}_{test}\right\rvert=m,\ \left\lvert\mathscr{D}_{2}\right\rvert=n_{2}}\left[f(\mathscr{D}_{test},\mathscr{D}_{2})\right]\quad\text{for all }n_{1}<n_{2} \qquad (5)

Intuitively, the divergence/distance should decrease as the generative model can generate more (significantly different) data points from the target distribution. We say that a generative model generalizes if it can generate more data points from the target distribution than a generative model that memorizes the training dataset. To test for generalization, we generate a dataset $\mathscr{D}_{g}$ whose size is larger than that of $\mathscr{D}_{train}$ and compare $f(\mathscr{D}_{test},\mathscr{D}_{g})$ and $f(\mathscr{D}_{test},\mathscr{D}_{train})$.
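This test can be phrased as a short helper; a hedged sketch, assuming `f` is one of the divergences above and `sample_fn(n)` draws n samples from the trained model (both names are ours):

def passes_generalization_test(f, sample_fn, d_train, d_test, factor=2):
    """Generate a fake set larger than the training set and check whether the metric
    scores it better than plain training-set memorization would be scored."""
    d_gen = sample_fn(factor * len(d_train))  # |D_g| > |D_train|
    return f(d_test, d_gen) < f(d_test, d_train)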

We show in the next subsections that Inception Score, Precision-Recall, and Neural Net Divergence do not satisfy Eqn. 5 and are not robust against training set memorization.

3.1 Worst-case analyses of evaluation metrics

Let $\mathscr{D}^{*}$ be the smallest subset of $\mathscr{D}_{train}$ that satisfies $\mathbb{E}_{\mathscr{D}_{test}\sim p_{data},\ \left\lvert\mathscr{D}_{test}\right\rvert=m}\left[f(\mathscr{D}_{test},\mathscr{D}^{*})\right]\leq\mathbb{E}_{\mathscr{D}_{test}\sim p_{data},\ \left\lvert\mathscr{D}_{test}\right\rvert=m}\left[f(\mathscr{D}_{test},\mathscr{D}_{train})\right]$. The divergence $f$ cannot distinguish a model that produces only $\mathscr{D}^{*}$ from a model that produces $\mathscr{D}_{train}$. By memorizing only $\mathscr{D}^{*}$, a generator can fool $f$ into reporting that it can produce a larger dataset. The smaller $\left\lvert\mathscr{D}^{*}\right\rvert$ is, the more vulnerable $f$ is to training set memorization. We construct $\mathscr{D}^{*}$ for the metrics mentioned above and discuss the relationship between $\mathscr{D}^{*}$ and the performance of a metric in practice. Because NND is the only metric designed to be robust against training set memorization, it is the focus of this section. We defer the details for the other metrics to the appendix.

Inception Score

\mathscr{D}^{*}_{is}=\left\{\bm{x}_{1},...,\bm{x}_{C}\ |\ \bm{x}_{i}\in\mathscr{C}_{i}\right\} \qquad (6)

where $\mathscr{C}_{i}\subset\mathscr{D}_{train}$ is the $i$-th class of the training set. Details in Appx. A.

$k$-means Precision-Recall

\mathscr{D}^{*}_{kmeans}=\left\{\bm{x}_{1},...,\bm{x}_{k}\ |\ \bm{x}_{i}\in\mathscr{C}_{i}\right\} \qquad (7)

where $\mathscr{C}_{i}$ is the $i$-th cluster found by the $k$-means clustering algorithm. Details in Appx. C.

$k$-nn Precision-Recall

\mathscr{D}^{*}_{knn}=\left\{\bm{x}_{1},\ \bm{x}_{2}\ |\ \bm{x}_{1},\bm{x}_{2}\in\mathscr{D}_{train}\text{ and }\left\lVert\bm{x}_{1}-\bm{x}_{2}\right\rVert\geq\left\lVert\bm{x}_{i}-\bm{x}_{j}\right\rVert\ \forall\,\bm{x}_{i},\bm{x}_{j}\in\mathscr{D}_{train}\right\} \qquad (8)

Details in Appx. D.

Neural Net Divergence is computed by training a finite capacity neural net $f_{\bm{\phi}}$ to maximize the objective in Eqn. 4. Because the value of NND depends on the network and the training procedure, there is no fixed $\mathscr{D}^{*}_{nnd}$. However, given a network $f_{\bm{\phi}}$, we can still estimate the size of $\mathscr{D}^{*}_{nnd}$ through experiments.

First, we investigate the effect of noise on NND. In our experiment, the dataset is MNIST and $f_{\bm{\phi}}$ is an MLP with 3 hidden layers and 512 neurons in each layer. Details about the architecture and hyperparameters are in Appx. B. We use the generator described in Section 2.4 to generate noisy versions of data points in $\mathscr{D}$, a random subset of $\mathscr{D}_{train}$ (see Fig. 7 for noisy images generated by the generator). The number of essentially different data points that the generator can generate is $\left\lvert\mathscr{D}\right\rvert$. We vary $\left\lvert\mathscr{D}\right\rvert$ from 1000 to 60000 and $\epsilon$ from 0 to 1. The noisy samples are generated continuously throughout the training process, as recommended by the authors of the NND metric [11]. As we can see in Fig. 7, for $\epsilon=0.1$ and $\epsilon=0.5$ the noisy images are very clear and easily recognizable. For $\epsilon=1$, the images are much noisier but still recognizable. At first glance, one might think that adding noise to clean data will increase the NND because the quality of the generated data is worsened. The result in Fig. 2 shows the opposite. As we can see from the figure, when $\epsilon$ and $\left\lvert\mathscr{D}\right\rvert$ increase, $f_{nnd}$ decreases. Increasing $\epsilon$ results in a bigger decrease in $f_{nnd}$ than increasing $\left\lvert\mathscr{D}\right\rvert$. For $\epsilon=1$, a noisy random subset of size 1000 outperforms a noiseless set of size 60000 by a large margin. The network also cannot distinguish noisy sets of size 10000 and 60000 and gives them the same score (the red solid line in Fig. 2). We conclude that $\left\lvert\mathscr{D}^{*}_{nnd}\right\rvert\leq 1000$ for this specific network. (We note that our experiments with $p_{u}=\mathcal{N}(\bm{0},\textbf{I})$ produce the same results.)

On the positive side, we observe that the NND decreases as $\left\lvert\mathscr{D}\right\rvert$ increases. However, the NND plateaus as $\left\lvert\mathscr{D}\right\rvert$ reaches 60000. If we continue to increase $\left\lvert\mathscr{D}\right\rvert$ to much larger numbers, the network will assign about the same NND to these datasets. This behavior is expected, as the capacity of the network is polynomial in $d$. A neural network of polynomial size can only distinguish datasets of polynomial sizes (for a more detailed discussion, please refer to [2]). We can still use NND to test for generalization if the network can distinguish datasets of sizes larger than $\left\lvert\mathscr{D}_{train}\right\rvert$. The larger the network is, the more discriminative the NND becomes.

If we can remove the fake diversity caused by noise, we can use NND to estimate the diversity of the generated data. We can reduce the effect of noise by generating a noisy dataset $\mathscr{D}_{g}$ of a fixed, polynomial size and training $f_{\bm{\phi}}$ on $\mathscr{D}_{g}$ and $\mathscr{D}_{test}$. For the experiment in Fig. 2, we use $\mathcal{G}_{\mathscr{D},\epsilon}$ to generate a dataset $\mathscr{D}_{g}$ of 60000 noisy data points. In contrast to the previous experiment, the NND increases with the level of noise. This behavior is expected, as a larger $\epsilon$ results in worse images, making the task of separating $\mathscr{D}_{g}$ from $\mathscr{D}_{test}$ easier. By fixing the size of $\mathscr{D}_{g}$, we effectively remove the fake diversity.
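In code, the difference between the two protocols amounts to what the critic sees during training; a hedged sketch reusing the illustrative helpers from the earlier sketches (`memorizing_generator`, `neural_net_divergence`), with `d_test` and `subset` assumed to be already-loaded numpy arrays:

import torch

# Original protocol: fresh noisy samples are drawn from the generator throughout
# critic training, so the apparent diversity of the fake data is unbounded.
# Fixed protocol (our fix): draw a single fake set D_g of fixed, polynomial size
# once and train the critic only on that set.
d_g_fixed = memorizing_generator(subset, eps=1.0, num_samples=60000)
nnd_fixed = neural_net_divergence(torch.as_tensor(d_test, dtype=torch.float32),
                                  torch.as_tensor(d_g_fixed, dtype=torch.float32))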

The outputs of GMs usually contain noise. For Autoregressive Models and Energy Based Models, the noise comes from the randomness in the sampling process. For generative latent variable models like GANs, VAEs, and flows, the noise comes from the prior distribution. A generative model can memorize a subset of the training data, add noise to the memorized data points, and fool $f_{nnd}$ into reporting that it has learned to faithfully reproduce the target distribution. Our next experiment investigates this phenomenon in real-world models, using GANs as our example. We select a random subset $\mathscr{D}$ from $\mathscr{D}_{train}$, train a GAN on $\mathscr{D}$, and use the resulting model to generate fake data. For each model, we compute 2 NND scores, one following the original procedure and one following our new procedure described in the previous paragraph. Fig. 2 shows the changes in NND of the models over 500 epochs. As expected, the NND of the ‘Fix’ datasets is higher than that of the ‘Inf’ datasets. However, the difference between the ‘Fix’ and ‘Inf’ datasets is small, suggesting that the level of noise in the outputs of a GAN is much lower than that in our experiment in Fig. 2.

Figure 2: The effect of noise on NND. (a) noisy samples were generated continuously throughout training. (b) a fixed number of noisy samples was used in training. (c) continuously generated fake data (labeled ‘Inf + training set size’) vs. fixed-size fake data (labeled ‘Fix + training set size’).

4 An MDL inspired Generalization Metric

In this section, we present a novel metric for measuring the generalization capacity of generative latent variable models. The metric aims to fix the problems of the previously mentioned metrics. Our metric is based on the manifold hypothesis and the Minimum Description Length principle. We assume that the data lie on a high-dimensional manifold and that there is a path between every pair of points on the manifold.

4.1 Motivation

Figure 3: The target distribution is the uniform distribution over the line segment $(\bm{x}_{0},\bm{x}_{1})$. (a) a GM that memorizes the training set. (b) a GM that produces a constant speed interpolation.

4.1.1 Preliminaries

Path length and speed: A path from $\bm{z}_{0}$ to $\bm{z}_{1}$ is a continuous function $\bm{z}(t):[0,1]\rightarrow\mathbb{R}^{d_{z}}$ that satisfies $\bm{z}(0)=\bm{z}_{0},\ \bm{z}(1)=\bm{z}_{1}$. The Frobenius norm of the Jacobian $\frac{\partial\bm{z}}{\partial t}$ is the speed of the path at time $t$. The path has constant speed if $\left\lVert\frac{\partial\bm{z}}{\partial t}\right\rVert_{F}=\mathrm{const}$. Let $\mathcal{G}$ be a continuous function from $\mathbb{R}^{d_{z}}$ to $\mathbb{R}^{d}$. $\mathcal{G}$ maps a path from $\bm{z}_{0}$ to $\bm{z}_{1}$ to a path from $\bm{x}_{0}=\mathcal{G}(\bm{z}_{0})$ to $\bm{x}_{1}=\mathcal{G}(\bm{z}_{1})$. The length of the path from $\bm{x}_{0}$ to $\bm{x}_{1}$ is

\ell(\bm{x}_{0},\bm{x}_{1})=\int_{0}^{1}\left\lVert\frac{\partial\bm{x}}{\partial t}\right\rVert_{F}dt=\int_{0}^{1}\left\lVert\frac{\partial\bm{x}}{\partial\bm{z}}\frac{\partial\bm{z}}{\partial t}\right\rVert_{F}dt \qquad (9)
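In practice, the speed and the length can be approximated by finite differences along a discretized latent path; a sketch under the assumption that `G` maps a latent vector to a flattened data vector (both names are ours):

import numpy as np

def path_speeds(G, z0, z1, steps=100):
    """Speeds ||dx/dt|| along the image of the constant-speed latent path
    z(t) = (1 - t) z0 + t z1 under the generator G."""
    ts = np.linspace(0.0, 1.0, steps + 1)
    xs = np.stack([G((1 - t) * z0 + t * z1) for t in ts])
    return np.linalg.norm(np.diff(xs, axis=0), axis=1) * steps  # divide by dt = 1/steps

def path_length(G, z0, z1, steps=100):
    """Eqn. 9 approximated by summing speed * dt over the discretized path."""
    return path_speeds(G, z0, z1, steps).sum() / steps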

Minimum Description Length principle [16, 20, 21, 22, 5] (see [9] for a comprehensive tutorial) is a formalization of Occam's razor. Refined MDL [5] defines the stochastic complexity of a dataset $\mathscr{D}$ given a parametric model (a set of parametric hypotheses) $\mathcal{H}$ as

\text{SCOMP}(\mathscr{D}|\mathcal{H})=\text{LEN}(\mathscr{D}|\hat{H})+\text{COMP}(\mathcal{H}) \qquad (10)

where $\text{LEN}(\mathscr{D}|H)=-\log p(\mathscr{D}|H)$ is the description length of $\mathscr{D}$ given a point hypothesis $H\in\mathcal{H}$, $\hat{H}\in\mathcal{H}$ is the point hypothesis that maximizes the probability of $\mathscr{D}$, and $\text{COMP}(\mathcal{H})$ is the parametric complexity of $\mathcal{H}$. The MDL principle states that the model with the smallest $\text{SCOMP}(\mathscr{D}|\mathcal{H})$ is the model with the best generalization capacity. Let $\mathcal{H}_{L}$ be the set of $L$-Lipschitz functions; then the bigger $L$ is, the more essentially different functions $\mathcal{H}_{L}$ contains (c.f. Section 2.6.2 in [9]). $L$ and $\text{COMP}(\mathcal{H}_{L})$ are positively correlated. In the following, we use $L$ in place of the parametric complexity of $\mathcal{H}_{L}$.

4.1.2 Detecting memorization and estimating complexity

A metric for generalization must be able to detect memorization, so we study the behavior of GLVMs when they memorize training data. Fig. 3(a) illustrates the situation where a GLVM $\mathcal{G}_{1}$ tries to memorize a training set $\mathscr{D}$ (shown by two red dots). For any latent code $\bm{z}$ (blue stars), $\mathcal{G}_{1}$ tries to make $\bm{x}=\mathcal{G}(\bm{z})$ (blue dots) as close as possible to a training data point. Because the generated data points are concentrated in 2 clusters, when we interpolate from $\bm{z}_{0}$ to $\bm{z}_{1}$ at constant speed $\left\lVert\frac{\partial\bm{z}}{\partial t}\right\rVert_{F}$, the speed $\left\lVert\frac{\partial\bm{x}}{\partial t}\right\rVert_{F}$ blows up in the middle of the path (top pane in Fig. 3(a)). Fig. 3(b) shows another GLVM $\mathcal{G}_{2}$ that produces constant speed paths from $\bm{x}_{0}$ to $\bm{x}_{1}$. Although the maximum speed in Fig. 3(a) is much higher than that in Fig. 3(b), the paths in the two figures have the same length. Compared to the path length, the maximum speed is a better feature for detecting memorization (see Fig. 6).

From Eqn. 9, we see that if $\left\lVert\frac{\partial\bm{z}}{\partial t}\right\rVert_{F}=\mathrm{const}$, then the maximum speed $s_{max}(\bm{x}_{0},\bm{x}_{1})=\max_{t\in[0,1]}\left(\left\lVert\frac{\partial\bm{x}}{\partial t}\right\rVert_{F}\right)$ is proportional to the Lipschitz constant of $\mathcal{G}$ on the path. The parametric complexity of $\mathcal{G}$ can be defined as the expectation of $s_{max}$ over the latent space:

\text{COMP}(\mathcal{G})=\mathbb{E}_{\bm{z}_{i},\bm{z}_{j}\sim p_{z}}\left[s_{max}(\mathcal{G}(\bm{z}_{i}),\mathcal{G}(\bm{z}_{j}))\right] \qquad (11)

We use the average of $s_{max}$ instead of the absolute maximum speed because (1) the absolute maximum speed has very high variance, making it an uninformative quantity, and (2) the average tracks the cost of describing the generated manifold more closely than the absolute maximum speed does.
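A Monte-Carlo sketch of Eqn. 11, reusing the `path_speeds` helper sketched above; `sample_z` is assumed to draw a latent vector from the prior $p_{z}$, and the numbers of pairs and interpolation steps are illustrative choices.

import numpy as np

def estimate_complexity(G, sample_z, num_pairs=1000, steps=100):
    """COMP(G): expected maximum speed over constant-speed latent interpolations
    between random pairs of latent codes."""
    max_speeds = []
    for _ in range(num_pairs):
        z0, z1 = sample_z(), sample_z()
        max_speeds.append(path_speeds(G, z0, z1, steps).max())
    return float(np.mean(max_speeds))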

4.2 Definition

Because $-\log p(\mathscr{D}|H)$, the description length of the training data given the GLVM, is not easily computable for some GLVMs such as GANs, we replace it with the divergence between $\hat{p}_{data}$ and $\hat{p}_{model}$. The divergence can be computed using one of the methods in Section 2.1.

Definition 1

Let $\mathscr{D}_{train}$ be a set of i.i.d. samples from a distribution $p_{data}$. The generalization metric for a GLVM $\mathcal{G}$ trained on $\mathscr{D}_{train}$ is defined as:

f_{gen}(\mathcal{G})=\alpha f(\mathscr{D}_{train},\mathscr{D}_{g})+\text{COMP}(\mathcal{G}) \qquad (12)

where $f(\mathscr{D}_{train},\mathscr{D}_{g})$ is the divergence between the training data and the generated data, $\text{COMP}(\mathcal{G})$ is the parametric complexity in Eqn. 11, and $\alpha>0$ is a scalar that controls the relative importance of the divergence to the complexity.

If $\mathcal{G}$ memorizes $\mathscr{D}_{train}$, then $f(\mathscr{D}_{train},\mathscr{D}_{g})$ goes to 0 but $\text{COMP}(\mathcal{G})$ blows up. If $\text{COMP}(\mathcal{G})$ goes to 0, then the generated samples are concentrated in a small region, making $f(\mathscr{D}_{train},\mathscr{D}_{g})$ go up (assuming that $f$ is a divergence that is good at detecting the lack of diversity, e.g., our fixed NND). Therefore, our metric cannot be fooled by training set memorization. In general, $f_{gen}$ cannot be 0 because the two terms cannot be 0 at the same time.

A metric with a large $\alpha$ will prefer models that reproduce the training set more faithfully. A small $\alpha$ will result in a metric that prefers simpler models. $\alpha$ can be removed if we are comparing models with the same divergence.
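Putting the pieces together, a hedged sketch of Eqn. 12 assembled from the illustrative helpers sketched earlier (`neural_net_divergence` for the divergence term, with a fixed-size generated set as in Section 3, and `estimate_complexity` for COMP); this is an illustration of the definition, not the exact implementation used in our experiments.

def generalization_metric(d_train, d_gen, G, sample_z, alpha=1.0):
    """f_gen(G) = alpha * f(D_train, D_g) + COMP(G), Eqn. 12."""
    divergence = neural_net_divergence(d_train, d_gen)  # D_g has a fixed, polynomial size
    complexity = estimate_complexity(G, sample_z)       # Eqn. 11
    return alpha * divergence + complexity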

4.3 A constant speed regularizer for GANs

A GLVM that produces constant speed paths in the space of generated data is desirable because (1) it is less likely to memorize the training set, and (2) it does not make sudden jumps from one memorized data point to another and thus produces smoother interpolations. We propose the following constant speed regularizer for the generator $\mathcal{G}$ in GANs:

\mathcal{L}^{\mathrm{const}}_{\mathcal{G}}=\mathcal{L}_{\mathcal{G}}+\lambda\,\mathbb{E}_{\bm{z}_{i},\bm{z}_{j}\sim p_{z}}\left[\mathbb{E}_{t\in[0,1]}\left[\left(\left\lVert\frac{\partial\mathcal{G}(\bm{z}(t))}{\partial t}\right\rVert-\bar{s}\right)^{2}\right]\right] \qquad (13)

where $\mathcal{L}_{\mathcal{G}}$ is the original loss function of $\mathcal{G}$, $\bar{s}$ is the average speed on the path from $\mathcal{G}(\bm{z}_{i})$ to $\mathcal{G}(\bm{z}_{j})$, and the interpolation method is a constant speed interpolation, i.e. $\left\lVert\frac{\partial\bm{z}}{\partial t}\right\rVert=\mathrm{const}$. The regularizer forces the variance of the speed toward 0, making the speed constant. As noted by other authors [e.g. 14], for complicated datasets like Imagenet, it might be better to apply the regularizer to the feature space. In the experiments below, we apply the regularizer directly to the data space because the dataset is simple.
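A PyTorch sketch of the inner expectation of Eqn. 13 for a single pair of latent codes; the finite-difference approximation of the speed and the number of interpolation steps are our choices, and in practice the penalty would be averaged over many sampled pairs $(\bm{z}_{i},\bm{z}_{j})$.

import torch

def constant_speed_penalty(G, z0, z1, steps=8):
    """Penalize the squared deviation of the speed ||dG(z(t))/dt|| from its mean
    along a constant-speed latent interpolation (Eqn. 13, inner expectation)."""
    ts = torch.linspace(0.0, 1.0, steps + 1, device=z0.device).view(-1, 1)
    zs = (1 - ts) * z0 + ts * z1             # (steps+1, d_z), constant latent speed
    xs = G(zs).flatten(start_dim=1)          # (steps+1, d)
    speeds = (xs[1:] - xs[:-1]).norm(dim=1) * steps  # finite-difference ||dx/dt||
    return ((speeds - speeds.mean()) ** 2).mean()    # drives the speed variance to 0

# generator_loss = original_gan_loss + lam * constant_speed_penalty(G, z0, z1)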

5 Experiments

5.1 Comparing evaluation metrics

We measure the performance of the evaluation metrics on GANs trained on MNIST to see how their worst-case performances correlate with their real-world performances. The results are shown in Fig. 4. (IS and FID are not included because computing them requires the Inception model. Gulrajani et al. [11] empirically showed that IS and FID are not robust against training set memorization; we refer readers to their experimental results.) We trained 5 variants of GAN on the MNIST dataset: GAN with the 0GP regularizer [28] (GAN0GP), GAN with the R1 regularizer [19] (GANR1), WGAN with gradient penalty [1, 10] (WGAN1GP), WGAN with 0GP [28] (WGAN0GP), and WGAN1GP with our new constant speed regularizer (WGAN1GP-const). Thanh-Tung et al. [28] showed that 0GP improves the generalization capacity of the discriminator and that, in turn, improves the generator's generalization capacity. They also claimed that 1GP [10] does not help improve generalization. We verify their claims with our fixed NND metric.

The result for NND is shown in Fig. 4. WGAN0GP consistently has lower NND than WGAN1GP, which confirms the claims by Thanh-Tung et al. [28]. We also find that GAN0GP is more stable than GANR1, as GANR1 can suddenly collapse during training. Fig. 4 shows that GANR1 suffers from overfitting after epoch 400 and its NND starts to increase, while GAN0GP's NND decreases as the training progresses. Our constant speed regularizer does a very good job at improving the generalization capacity of WGAN1GP: WGAN1GP-const has almost the same NND as WGAN0GP while producing smoother interpolations (see Fig. 5).

$k$-means PR does not do a good job at distinguishing models with different generalization capacities. Fig. 4 shows that all models achieve almost perfect precision (i.e. $F_{1/\beta}$) after 400 epochs. In contrast to NND, $k$-means PR assigns the highest precision and recall to GANR1. This result is expected, as Thanh-Tung and Tran [27] and Thanh-Tung et al. [28] suggested that the R1 regularizer may encourage the generator to memorize the training set. The other GANs, which generalize better, produce more data points that are further away from the training and test sets. These far data points result in a lower recall, as we see in Fig. 4.

$k$-nn PR does a much better job distinguishing different models (Fig. 4). The result seems to justify the claim of its authors [17]. However, Fig. 4 shows that $k$-nn PR assigns very high recall to a very bad GAN. Fig. 1 illustrates the same phenomenon on a lower-dimensional dataset. $k$-nn PR works fine for good models that distribute the fake data evenly across the real manifold, but it is vulnerable to bad models that generate many outliers.

Discussion: This experiment shows that the robustness of evaluation metrics on real-world datasets is directly correlated with the size of their minimal dataset $\mathscr{D}^{*}$. $k$-nn PR has the smallest minimal set and is the most vulnerable to outliers in the data. $k$-means PR has $\left\lvert\mathscr{D}^{*}_{kmeans}\right\rvert=k$, and different data points in $\mathscr{D}^{*}_{kmeans}$ must lie in different clusters. To fool $k$-means PR, a generator must produce fake data points belonging to different clusters, which is nontrivial for real-world generative models. NND (both the original and our updated version) has the biggest $\mathscr{D}^{*}_{nnd}$ and is the most robust. Our updated NND can be used to estimate the generalization capacity of GMs.

Figure 4: We trained 2 sets of GANs with 2 different learning rates. We deliberately chose a large learning rate for the second group to make it collapse. (a)-(c): results of $k$-means PR, $k$-nn PR, and NND for models trained with $lr_{1}=2\times 10^{-4}$. Because $f_{nnd}\left(\hat{p}_{train},\hat{p}_{fake}\right)$ and $f_{nnd}\left(\hat{p}_{test},\hat{p}_{fake}\right)$ always show the same trend, we show only $f_{nnd}\left(\hat{p}_{test},\hat{p}_{fake}\right)$; we show $f_{nnd}\left(\hat{p}_{train},\hat{p}_{fake}\right)$ for WGAN0GP as a representative. (d) top: $k$-nn PR of GAN0GP trained with $lr_{2}=10^{-3}$; bottom: generated samples from a GAN with recall equal to 1.

5.2 Evaluating our constant speed regularizer and MDL based metric

Fig. 5 shows the interpolation results for WGAN1GP and WGAN1GP-const. Interpolations produced by WGAN1GP contain many sudden jumps (red boxes in Fig. 5). At the beginning and the end of an interpolation path, the image at a step is almost identical to the image at the previous step (yellow boxes in Fig. 5). The phenomenon suggests that, similar to the generator in Fig. 3, WGAN1GP memorizes the training data. In contrast, WGAN1GP-const produces smooth interpolations with no sudden jumps. Fig. 4 shows that WGAN1GP-const has a very good generalization capacity, i.e., it can produce many more images that are not in the training set. Our constant speed regularizer helps the generator avoid training set memorization and generalize.

Figure 5: (a), (b): interpolation results of WGAN1GP-const and WGAN1GP, respectively, at epoch 1000. Despite the lower NND, WGAN0GP produces interpolation results similar to WGAN1GP.

We test the discriminative power of the complexity measure in Eqn. 11 by comparing it with the pairwise path length [14]. The result is shown in Fig. 6. The pairwise path length is not good at distinguishing different models: WGAN1GP and WGAN0GP have roughly the same pairwise path length, and the same is true for GANR1 and GAN0GP. In contrast, the complexity measure can clearly separate different models. We also see that the complexity of each model increases as the training progresses. GANs with 0GP have lower complexity than GANs with the 1GP and R1 penalties. The result confirms that 0GP can help to reduce the complexity and improve generalization. Our constant speed regularizer does an outstanding job of minimizing the complexity: the complexity of WGAN1GP-const is much lower than that of the other models and does not increase as the training progresses. The result again confirms the effectiveness of our constant speed regularizer.

Figure 6: Experimental results for the complexity measure (Eqn. 11).

6 Limitations and Future Work

In Section 3, we introduce a method for comparing evaluation metrics. To compute the robustness of an evaluation metric $f$, we have to construct its minimal dataset $\mathscr{D}^{*}_{f}$. However, the minimal dataset is not always easy to construct and might depend on the evaluation metric's parameters (e.g., for the NND). We need a general method for estimating the minimal datasets of evaluation metrics.

Our generalization metric uses $\alpha$ to control the relative importance of the divergence to the complexity. Although we discussed the effect of $\alpha$ on the metric in Section 4.2, we have not yet tested its effect in practical settings. An algorithm for determining the optimal value of $\alpha$ is also needed. These are open problems that we aim to solve in future work.

7 Conclusion

In this paper, we investigate the ability of different evaluation metrics to measure generalization. We show that all of the existing metrics can be fooled by training set memorization. We study the worst-case performance of common metrics and show that their worst-case performance is related to their robustness in real-world scenarios. We identify the weakness of NND and propose a fix to the problem. Our experiments show that the fixed NND metric is robust to noise in the input and can be used to estimate the generalization capacity of generative models. We introduce a constant speed regularizer for improving the smoothness of the generated data manifold and the generalization capacity of the generator in generative latent variable models. We develop a novel metric for generalization based on the MDL principle. Initial results show that our metric has better discriminative power than existing metrics.

References

  • Arjovsky et al. [2017] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214–223, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • Arora et al. [2017] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 224–232, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • Arora et al. [2018] S. Arora, A. Risteski, and Y. Zhang. Do GANs learn the distribution? some theory and empirics. In International Conference on Learning Representations, 2018.
  • Barratt and Sharma [2018] S. Barratt and R. Sharma. A note on the inception score, 2018.
  • Barron et al. [1998] A. Barron, J. Rissanen, and Bin Yu. The minimum description length principle in coding and modeling. IEEE Transactions on Information Theory, 44(6):2743–2760, 1998.
  • Deng et al. [2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
  • Esteban et al. [2017] C. Esteban, S. L. Hyland, and G. Rätsch. Real-valued (medical) time series generation with recurrent conditional gans. arXiv preprint arXiv:1706.02633, 2017.
  • Goodfellow et al. [2014] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
  • Grunwald [2004] P. Grunwald. A tutorial introduction to the minimum description length principle. arXiv preprint math/0406077, 2004.
  • Gulrajani et al. [2017] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5767–5777. Curran Associates, Inc., 2017.
  • Gulrajani et al. [2019] I. Gulrajani, C. Raffel, and L. Metz. Towards GAN benchmarks which require generalization. In International Conference on Learning Representations, 2019.
  • Heusel et al. [2017] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6626–6637. Curran Associates, Inc., 2017.
  • Horak et al. [2020] D. Horak, S. Yu, and G. Salimi-Khorshidi. Topology distance: A topology-based approach for evaluating generative adversarial networks. arXiv preprint arXiv:2002.12054, 2020.
  • Karras et al. [2019] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
  • Khrulkov and Oseledets [2018] V. Khrulkov and I. Oseledets. Geometry score: A method for comparing generative adversarial networks. arXiv preprint arXiv:1802.02664, 2018.
  • Kolmogorov [1968] A. N. Kolmogorov. Three approaches to the quantitative definition of information. International journal of computer mathematics, 2(1-4):157–168, 1968.
  • Kynkäänniemi et al. [2019] T. Kynkäänniemi, T. Karras, S. Laine, J. Lehtinen, and T. Aila. Improved precision and recall metric for assessing generative models. In Advances in Neural Information Processing Systems, pages 3929–3938, 2019.
  • Meehan et al. [2020] C. Meehan, K. Chaudhuri, and S. Dasgupta. A Non-Parametric Test to Detect Data-Copying in Generative Models. arXiv e-prints, art. arXiv:2004.05675, Apr. 2020.
  • Mescheder et al. [2018] L. Mescheder, A. Geiger, and S. Nowozin. Which training methods for GANs do actually converge? In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3478–3487, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
  • Rissanen [1978] J. Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, Sept. 1978. ISSN 0005-1098. doi: 10.1016/0005-1098(78)90005-5.
  • Rissanen [1986] J. Rissanen. Stochastic complexity and modeling. The annals of statistics, pages 1080–1100, 1986.
  • Rissanen [1987] J. Rissanen. Stochastic complexity. Journal of the Royal Statistical Society: Series B (Methodological), 49(3):223–239, 1987.
  • Sajjadi et al. [2018] M. S. Sajjadi, O. Bachem, M. Lucic, O. Bousquet, and S. Gelly. Assessing generative models via precision and recall. In Advances in Neural Information Processing Systems, pages 5228–5237, 2018.
  • Salimans et al. [2016] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, and X. Chen. Improved techniques for training gans. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2234–2242. Curran Associates, Inc., 2016.
  • Shmelkov et al. [2018] K. Shmelkov, C. Schmid, and K. Alahari. How good is my gan? In Proceedings of the European Conference on Computer Vision (ECCV), pages 213–229, 2018.
  • Szegedy et al. [2015] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015. URL http://arxiv.org/abs/1512.00567.
  • Thanh-Tung and Tran [2020] H. Thanh-Tung and T. Tran. Catastrophic Forgetting and Mode Collapse in Generative Adversarial Networks. In Proceedings of the International Joint Conference on Neural Networks, Jul 2020.
  • Thanh-Tung et al. [2019] H. Thanh-Tung, T. Tran, and S. Venkatesh. Improving generalization and stability of generative adversarial networks. In International Conference on Learning Representations, 2019.

Appendix A Inception Score

We assume that the Inception model is a perfect classifier, i.e. given a real data point $\bm{x}$, the conditional distribution $p(y|\bm{x})$ is

p(y=i|\bm{x})=\begin{cases}1&\text{ if }i\text{ is the correct label of }\bm{x}\\0&\text{ otherwise}\end{cases} \qquad (14)

We have $\text{H}(y|\bm{x})=0$. We select $C$ real data points from the $C$ classes in the training set, so $\text{H}(y)=\ln C$. The Inception Score for the dataset in Eqn. 6 is therefore $C$.

Appendix B Neural Net Divergence

The configuration for the experiment in Fig. 2 is given in Table 1.

Figure 7: Noisy images generated by our procedure with $p_{u}=\mathcal{U}(-1,1)$, for $\epsilon\in\{0,\ 0.1,\ 0.5,\ 1,\ 2,\ 5\}$.
Table 1: Configuration for the experiment in Fig. 2
Network architecture: MLP with 3 hidden layers, 512 hidden neurons per layer
Loss function: WGAN1GP loss function [10]
Number of training iterations: 20000
Learning rate: 1e-4
Optimizer: Adam with $\beta_{1}=0.9,\ \beta_{2}=0.999$
$\left\lvert\mathscr{D}_{test}\right\rvert$: 10000

The GANs in Fig. 2 and Fig. 4 also use MLPs with 3 hidden layers and the same number of hidden neurons.

Appendix C kk-means Precision-Recall

Algorithm 3 $k$-means PR
1: Inputs: $\mathcal{D}_{r},\ \mathcal{D}_{g}$, hyperparameter $\beta$
2: Outputs: $F_{\beta},\ F_{1/\beta}$
3: $\mathcal{C}\leftarrow kmeans(\mathcal{D}_{r}\cup\mathcal{D}_{g})$
4: $hist_{g}[i]\leftarrow$ (# fake datapoints in cluster $\mathcal{C}_{i}$) $/\ |\mathcal{D}_{g}|$
5: $hist_{r}[i]\leftarrow$ (# real datapoints in cluster $\mathcal{C}_{i}$) $/\ |\mathcal{D}_{r}|$
6: $curve\leftarrow pr\_curve(hist_{r},hist_{g})$
7: $F_{\beta}\leftarrow f\_score(curve,\beta)$
8: $F_{1/\beta}\leftarrow f\_score(curve,1/\beta)$
9: return $F_{\beta},\ F_{1/\beta}$

We describe the procedure for constructing $\mathscr{D}^{*}_{kmeans}$. Without loss of generality, we assume that $\left\lvert\mathscr{D}_{test}\right\rvert=\left\lvert\mathscr{D}_{g}\right\rvert$ and that the $k$-means algorithm always finds the optimal clustering. $k$-means PR applies the $k$-means clustering algorithm on $\mathscr{D}_{test}\cup\mathscr{D}_{g}$ to get $k$ clusters. Let $\mathcal{C}_{i}$ be the $i$-th cluster and $\mathcal{C}_{i,test}$ be the set of test datapoints in this cluster. From the algorithm above, we see that perfect precision and recall are achieved when $hist_{g}=hist_{r}$. To make $hist_{g}[i]=hist_{r}[i]$, the generator $G$ needs to

  1. memorize a training datapoint $\bm{x}$ s.t. $\bm{x}$ is closest to the center of cluster $\mathcal{C}_{i}$, and
  2. duplicate that datapoint $\left\lvert\mathcal{C}_{i,test}\right\rvert$ times.

The number of distinct datapoints that $G$ has to memorize is $k$, the number of clusters.

Appendix D kk-nn Precision-Recall

Algorithm 4 $k$-nn PR
1: Inputs: $\mathcal{D}_{r},\ \mathcal{D}_{g}$
2: Outputs: Precision $P$, Recall $R$
3: $\mathcal{M}_{g}\leftarrow\emptyset$  // build the fake manifold
4: for each datapoint $\bm{x}_{i}\in\mathcal{D}_{g}$ do
5:     $\bm{x}_{j}\leftarrow$ the $k$-th nearest neighbor of $\bm{x}_{i}$
6:     $S_{i}\leftarrow sphere(c=\bm{x}_{i},r=\left\lVert\bm{x}_{i}-\bm{x}_{j}\right\rVert)$
7:     $\mathcal{M}_{g}\leftarrow\mathcal{M}_{g}\cup S_{i}$
8: end for
9: build the real manifold $\mathcal{M}_{r}$
10: $P\leftarrow$ (# fake datapoints in $\mathcal{M}_{r}$) $/\ |\mathcal{D}_{g}|$
11: $R\leftarrow$ (# real datapoints in $\mathcal{M}_{g}$) $/\ |\mathcal{D}_{r}|$
12: return $P,\ R$

The idea for constructing $\mathscr{D}^{*}_{knn}$ is illustrated in Fig. 1. We select $\bm{x}_{1},\bm{x}_{2}\in\mathscr{D}_{train}$ s.t. $\left\lVert\bm{x}_{1}-\bm{x}_{2}\right\rVert\geq\left\lVert\bm{x}_{i}-\bm{x}_{j}\right\rVert\ \forall\ \bm{x}_{i},\bm{x}_{j}\in\mathscr{D}_{train}$. Because $\bm{x}_{1},\bm{x}_{2}$ are in $\mathscr{D}_{train}$, they lie on the real manifold, and thus the precision is 1. Because $\left\lVert\bm{x}_{1}-\bm{x}_{2}\right\rVert\geq\left\lVert\bm{x}_{i}-\bm{x}_{j}\right\rVert\ \forall\ \bm{x}_{i},\bm{x}_{j}\in\mathscr{D}_{train}$, all of the training datapoints are contained in the two spheres centered at $\bm{x}_{1}$ and $\bm{x}_{2}$. If $\mathscr{D}_{train}$ is large enough, then we can assume that all of the test datapoints lie on the manifold reconstructed using $\mathscr{D}_{train}$. Therefore, all of the test datapoints are highly likely to lie in the two spheres centered at $\bm{x}_{1},\bm{x}_{2}$, and the recall is 1.