Toward a Generalization Metric for Deep Generative Models
Abstract
Measuring the generalization capacity of Deep Generative Models (DGMs) is difficult because of the curse of dimensionality. Evaluation metrics for DGMs such as Inception Score, Fréchet Inception Distance, Precision-Recall, and Neural Net Divergence try to estimate the distance between the generated distribution and the target distribution using a polynomial number of samples. These metrics are the target of researchers when designing new models. Despite the claims, it is still unclear how well can they measure the generalization capacity of a generative model. In this paper, we investigate the capacity of these metrics in measuring the generalization capacity. We introduce a framework for comparing the robustness of evaluation metrics. We show that better scores in these metrics do not imply better generalization. They can be fooled easily by a generator that memorizes a small subset of the training set. We propose a fix to the NND metric to make it more robust to noise in the generated data. Toward building a robust metric for generalization, we propose to apply the Minimum Description Length principle to the problem of evaluating DGMs. We develop an efficient method for estimating the complexity of Generative Latent Variable Models (GLVMs). Experimental results show that our metric can effectively detect training set memorization and distinguish GLVMs of different generalization capacities.
1 Introduction
The quality of a generative model is defined by the distance between the model distribution and the target distribution. Traditional distances/divergences such as the KL divergence or the Wasserstein distance are intractable because of their exponential sample complexities. In the last few years, a number of evaluation metrics with polynomial sample complexities have been introduced. Some of the most common metrics are the Inception Score (IS) [24], Fréchet Inception Distance (FID) [12], Precision-Recall metrics [23, 17, 25]. These metrics are intuitive and fast to compute, but some researchers have raised concerns regarding their reliability [11]. [11] proposed to use the Neural Net Divergence (NND) [2] to measure the generalization capacity of generative models. In this paper, we evaluate the ability to measure generalization of the above evaluation metrics. We find that all of them cannot reliably measure generalization. They can be fooled easily by a generative model that memorizes a small subset of the training set.
We propose a novel generalization metric for a broad class of generative models called the Generative Latent Variable Models. Our metric is based on the Minimum Description Length principle [16, 20, 21, 22, 5]. The metric combines the complexity of the generative model and the divergence between the model distribution and the target distribution into a single value. The metric inherits the robustness of the MDL framework and can be computed in polynomial time using a polynomial number of samples.
Notations
: a latent variable. |
: a data point. |
: a generative model with parameter . |
: a set of samples from . |
: a set of samples from . |
: the training set, . |
: the test set, . |
: the uniform distribution over . |
2 Background and Related Work
2.1 Divergence based evaluation metrics
A generative model is trained on to produce , an estimate of . The quality of is defined by the divergence/distance between and . Let be a divergence/distance function. Because is intractable for most distributions of interest, we approximate it with where are polynomial sized datasets.
Inception Score (IS) [24] is defined as:
(1) |
where is the input, is the label, and . is computed by feeding the generated data point through a pretrained classifier, e.g. the Inception network [26]. IS is maximized when and where is the number of classes. Barratt and Sharma [4] showed that . To make our writing consistent, we convert all of the metrics in this paper to (pseudo) divergences. A lower divergence should imply that is closer to . We define the following pseudo divergence:
(2) |
This is a pseudo divergence because (1) it only contains 1 distribution, , the other distribution is implicitly the dataset on which the classifier was trained, and (2) does not imply that the two distributions are the same (more on this in Section 3). A fundamental flaw of IS is that a smaller implies that is closer to the uniform distribution, not that is closer to the true distribution over the labels. Fortunately (or unfortunately), the distribution over the labels in the Imagenet dataset [6] is uniform. This makes IS a popular and somewhat abused tool for evaluating image generation models.
Fréchet Inception Distance (FID): Heusel et al. [12] approximate , with two Gaussian distributions and , respectively, and then compute the Fréchet distance between the two Gaussians:
(3) |
Because FID compares only the first 2 moments of and , does not guarantee that the 2 distributions are the same.
Neural Net Divergence (NND) [11, 2] uses neural nets to compute the divergence between and
(4) |
where is usually a neural network with finite capacity. is trained to maximize the difference in Eqn. 4. Arora et al. [2] showed that NND can detect the lack of diversity in generated data. Gulrajani et al. [11] leveraged that capability to estimate the generalizability of GMs. We show in Section 3 that the diversity in generated data can be faked easily by adding white noise to the data. This fake diversity is present in real-world models such as GANs and VAEs. We propose a fix to NND that removes the effect of fake diversity while retaining the discriminative power of NND.
2.2 Precision-Recall based evaluation metrics
The above metrics use scalars to measure both the quality and diversity of generated samples. The following metrics are designed to separate quality from diversity.
-means based Precision-Recall (-means PR) [23] applies -means algorithm to , builds 2 histograms to approximate and , and computes the difference between these histograms. The results are 2 F-scores, and . For , can be interpreted as the recall (diversity) and can be interpreted as the precision (quality) of the generated data (Algo. 1). We define the following pseudo divergences: , .
-nn based Precision-Recall (-nn PR) [17] computes the precision and recall by comparing the reconstructed real manifold and the reconstructed fake manifold (Algo. 2 and Fig. 1). Kynkäänniemi et al. [17] claimed that -nn PR has better discriminative power than -means PR and demonstrated that on several datasets. However, we show in Section 3 that -nn PR has the worst worst-case performance, and in practice, it cannot distinguish a good model from a very bad model. We define the following pseudo divergences: , .
Similar to IS and FID, -means PR and -nn PR assign (near) perfect scores to generative models that produce the entire training set.
2.3 Other evaluation metrics
Metrics for class-conditional models: Esteban et al. [7] proposed the “Train on Synthetic Test on Real” (TSTR) approach to evaluate class-conditional GANs. The metric is expensive to compute as it requires training a classifier on a labeled synthetic dataset. Shmelkov et al. [25] extended the idea to a Precision-Recall-like metric for class-conditional GANs. Because these metrics are applicable only to class-conditional GMs, we do not consider them in this paper.
Topological/Geometrical approaches: Khrulkov and Oseledets [15], Horak et al. [13] applied geometrical and topological tools to measure the difference between the fake manifold and the real manifold. These metrics are not designed for measuring generalizability and are not considered in this paper. Karras et al. [14] used the pairwise perceptual distance - the pairwise distance between data points on the feature manifold - to measure the smoothness of the feature manifold. The intuition is that if the perceptual distance is small, then the manifold is smoother, and the GM might learn better representations and have better generalizability. We show in Section 4 that the pairwise distance does not well reflect the smoothness of the manifold, and it is minimized when total mode collapse occurs. Therefore, pairwise perceptual distance is not a good metric for GMs.
Non-parametric approaches: Arora et al. [3] used the birthday paradox and human evaluation to estimate the number of distinct modes that a GM can generate. The method cannot be automated and has high variance and bias because humans are involved in the loop. Meehan et al. [18] proposed a three sample test for data copying in GMs. The metric can detect data copying but cannot measure the generalizability of a model or the quality of the generated samples.
2.4 A generative model with poor generalization capacity
When mode collapse or training set memorization [8] happen, the generative model generates slightly different versions of the memorized data points. We simulate this kind of behavior with the following generative procedure:
Given a dataset , a constant , a new data point is generated by
-
1.
Draw a data point at random
-
2.
Draw a noise vector , where is the noise distribution.
-
3.
Output the noisy data point:
In our experiments, (see Fig. 7 for samples from MNIST dataset). We tried replacing with and found no difference in performance. This generator, denoted as , has no real generalization capacity and the number of essentially different modes in is . For , this generator can be approximated with a neural net with finite capacity.
3 Evaluating evaluation metrics
Given a training set , we train a generative model on and get a distribution . We would like to estimate the divergence/distance between and using a divergence/distance function . Because the true divergence/distance is intractable, we approximate it with the empirical divergence/distance where and . If then can achieve the perfect score by memorizing the entire . To reduce the effect of training set memorization, we require that . We want to track , i.e. is small only when is small. Specifically, we want to satisfy:
(5) |
Intuitively, the divergence/distance should decrease as the generative model can generate more (significantly different) data points from the target distribution. We say that a generative model generalizes if it can generate more data points from the target distribution than a generative model that memorizes the training dataset. To test for generalization, we generate a dataset whose size is larger than that of and compare and .
We show in the next subsections that Inception Score, Precision-Recall, and Neural Net Divergence do not satisfy Eqn. 5 and are not robust against training set memorization.
3.1 Worst-case analyses of evaluation metrics
Let be the smallest subset of that satisfies . The divergence cannot distinguish a model that produces only and a model that produces . By memorizing only , a generator can fool that it can produce a larger dataset. The smaller is, the more vulnerable is to training set memorization. We construct for the metrics mentioned above and discuss the relationship between and the performance of a metric in practice. Because NND is the only metric designed to be robust against training set memorization, it is the focus of this section. We defer the details for other metrics to the appendix.
-means Precision-Recall
(7) |
where is the -th cluster of the -means clustering algorithm. Details in Appx. C.
Neural Net Divergence is computed by training a finite capacity neural net to maximize the objective in Eqn. 4. Because the value of NND depends on the network and the training procedure, there is no fixed . However, given a network , we can still estimate the size of through experiments.
First, we investigate the effect of noise on NND. In our experiment, the dataset is the MNIST dataset, is an MLP with 3 hidden layers and 512 neurons in each layer. Details about the architecture and hyper parameters are in Appx. B. We use the generator described in Section 2.4 to generate noisy versions of data points in , a random subset of (see Fig. 7 for noisy images generated by the generator). The number of essentially different data points that the generator can generate is . We vary from to and from to . The noisy samples are generated continuously throughout the training process as recommended by the authors NND metric, [11]. As we can see in Fig. 7, for and the noisy images are very clear and easily recognizable. For , the images are much more noisy but still recognizable. At first glance, one might think that adding noise to clean data will increase the NND because the quality of generated data is worsened. The result in Fig. 2 shows the opposite. As we can see from the figure, when and increase, decreases. Increasing results in a bigger decrease in than increasing . For , a noisy random subset of size 1000 outperforms a noiseless set of size 60000 by a large margin. The network also cannot distinguish noisy sets of size 10000 and 60000 and gives them the same score (the red solid line in Fig 2). We conclude that for this specific network.111We note that our experiments with produces the same results.
On the positive side, we observe that the NND decreases as increases. However, the NND is plateauing as reaches . If we continue to increase to much larger numbers, the network will assign about the same NND to these datasets. This behavior is expected as the capacity of the network is polynomial in . A neural network of a polynomial size can only distinguish datasets of polynomial sizes (for a more detailed discussion, please refer to [2]). We can still use NND to test for generalization if the network can distinguish datasets of sizes larger than . The larger the network is, the more discriminative the NND becomes.
If we can remove the fake diversity caused by noise, we can use NND to estimate the diversity of generated data. We can reduce the effect of noise by generating a noisy data set of a fixed and polynomial size and train on and . For the experiment in Fig. 2, we use to generate a dataset of noisy data points. In contrast to the previous experiment, the NND increases with the level of noise. This behavior is expected as a larger results in worse images, making the task of separating from easier. By fixing the size of , we can effectively remove the fake diversity.
The outputs of GMs usually contain noise. For Autoregressive Models and Energy Based Models, the noise comes from the randomness in the sampling process. For generative latent variable models like GANs, VAEs, and flows, the noise comes from the prior distribution. A generative model can memorize a subset of the training data, add noise to the memorized data points, and fool that it has learned to faithfully reproduce the target distribution. Our next experiment investigates this phenomenon in real-world models. We use GANs as our example. We select a random subset from , train a GAN on , and use the resulting model to generate fake data. For each model, we compute 2 NND scores following the original procedure and our new procedure described in the previous paragraph. Fig. 2 shows the changes in NND of models over 500 epochs. As expected, the NND of ‘Fix’ datasets is higher than that of ‘Inf’ datasets. However, the difference between ‘Fix’ and ‘Inf’ datasets is small, suggesting that the level of noise in the outputs of GAN is much lower than that in our experiment in Fig. 2.



4 An MDL inspired Generalization Metric
In this section, we present a novel metric for measuring the generalization capacity of generative latent variable models. The metric aims to fix the problems of previously mentioned metrics. Our metric is based on the manifold hypothesis and the Minimum Description Length principle. We assume that the data lie on a high-dimensional manifold and there is a path between every pair of points on the manifold.
4.1 Motivation
4.1.1 Preliminaries
Path length and speed: A path from to is a continuous function that satisfies . The Frobenius norm of the Jacobian is the speed of the path at time . The path has a constant speed if . Let be a continuous function from to . maps a path from to to a path from to . The length of the path from to is
(9) |
Minimum Description Length principle [16, 20, 21, 22, 5] (see [9] for a comprehensive tutorial) is a formalization of the Occam’s razor principle. Refined MDL [5] defines the stochastic complexity of a dataset given a parametric model (a set of parametric hypotheses) as
(10) |
where is the description length of given a point hypothesis , is the point hypothesis that maximizes the probability of , and is the parametric complexity of . The MDL principle states that the model with the smallest is the model with the best generalization capacity. Let be the set of -Lipschitz functions, then the bigger is, the more essentially different functions contains (c.f. Section 2.6.2 in [9]). and are positively correlated. In the following, we use in place of the parametric complexity of .
4.1.2 Detecting memorization and estimating complexity
A metric for generalization must be able to detect memorization. We study the behavior of GLVMs when they memorize training data. Fig. 3 illustrates the situation where a GLVM tries to memorize a training set (shown by two red dots). For any latent code (blue stars), tries to make (blue dots) as close as possible to a training data point. Because generated data points are concentrated in 2 clusters, when we interpolate from to at constant speed , the speed blows up in the middle of the path (top pane in Fig. 3). Fig. 3 shows another GLVM that produces constant speed path from to . Although the maximum speed in Fig. 3 is much higher than that in Fig. 3, the paths in the two figures have the same length. Compared to the path length, the maximum speed is a better feature for detecting memorization (see Fig. 6).
From Eqn. 9, we see that if , then the maximum speed is proportional to the Lipschitz constant of on the path. The parametric complexity of can be defined as the expectation of over the latent space
(11) |
We use the average of instead of the absolute maximum speed because (1) the absolute max speed has too high variance, making it an uninformative quantity, and (2) the average tracks the cost of describing the generated manifold more closely than the absolute maximum speed.
4.2 Definition
Because , the description length of the training data given the GLVM is not easily computable in some GLVMs like GANs, we replace it with the divergence between and . The divergence can be computed using one of the methods in Section 2.1.
Definition 1
Let be a set of i.i.d. samples from a distribution . The generalization metric for a GLVM trained on is defined as:
(12) |
where is the divergence between the training data and generated data, is the parametric complexity in Eqn 11, and is a scalar that controls the relative importance of the divergence to the complexity.
If memorizes , then goes to 0 but blows up. If goes to 0, then generated samples are concentrated in a small region, making goes up (assuming that is a divergence that is good at detecting the lack of diversity, e.g., our fixed NND). Therefore, our metric cannot be fooled by training set memorization. In general, cannot be 0 because the two terms cannot be 0 at the same time.
A metric with large will prefer models that can reproduce the training set more faithfully. A small will result in a metric that prefers simpler models. can be removed if we are comparing models with the same divergence.
4.3 A constant speed regularizer for GANs
A GLVM that produces constant speed paths in the space of generated data is desirable because (1) it is less likely to memorize the training set, (2) it does not make sudden jumps from one memorized data point to another, thus produces smoother interpolation. We propose the following constant speed regularizer for the generator in GANs:
(13) |
where is the original loss function of , is the average speed on the path from to and the interpolation method is a constant speed interpolation method, i.e. . The regularizer forces the variance of the speed to 0, making the speed constant. As noted by other authors [e.g. 14], for complicated datasets like the Imagenet, it might be better to apply the regularizer to the feature space. In the experiments below, we apply the regularizer directly to the data space because the dataset is simple.
5 Experiments
5.1 Comparing evaluation metrics
We measure the performance of evaluation metrics on GANs trained on MNIST to see how their worst-case performances correlate to their real-world performances. The result is shown in Fig. 4 - 4.222IS and FID are not included because computing them requires the Inception model. Gulrajani et al. [11] empirically showed that IS and FID are not robust against training set memorization. We refer readers to their experimental results. We trained 5 variants of GAN on the MNIST dataset. The variants are GAN with 0GP regularizer [28] (GAN0GP), GAN with R1 regularizer [19] (GANR1), WGAN with gradient penalty [1, 10] (WGAN-1GP), WGAN with 0GP [28], and WGAN1GP with our new constant speed regularizer (WGAN1GP-const). Thanh-Tung et al. [28] showed in their paper that 0GP improves the generalization capacity of the discriminator and that, in turn, improves the generator’s generalization capacity. They also claimed that 1GP [10] does not help improving generalization. We verify their claims with our fixed NND metric.
The result for NND is shown in Fig. 4. WGAN0GP consistently has lower NND than WGAN1GP. That confirms the claims by Thanh-Tung et al. [28]. We also find that GAN0GP also more stable than GANR1 as GANR1 can suddenly collapse during training. Fig. 4 shows that GANR1 suffers from overfitting after epoch 400, and its NND starts to increase. GAN0GP’s NND decreases as the training progresses. Our constant speed does a very good job at improving the generalization capacity of WGAN1GP. WGAN1GP-const has almost the same NND as WGAN0GP while producing smoother interpolation (see Fig. 5, 5).
-means PR does not do a good job at distinguishing models with different generalization capacities. Fig. 4 shows that all models achieve almost perfect precision (i.e. ) after 400 epochs. In contrast to NND, -means PR assigns the highest precision and recall to GANR1. This result is expected as Thanh-Tung and Tran [27] and Thanh-Tung et al. [28] suggested that the R1 regularizer may encourage the generator to memorize the training set. The other GANs with better generalization capacity will produce more data points that are further away from the training and test sets. The far data points result in a lower recall, as we see in Fig. 4.
-nn PR does a much better job distinguishing different models (Fig. 4). The result seems to justify the claim of its authors [17]. However, Fig. 4 shows that -nn PR assigns very high recall to a very bad GAN. Fig. 1 illustrates the same phenomenon on a lower-dimensional dataset. -nn works fine for good models that distribute the fake data evenly across the real manifold, but it is vulnerable to bad models that generate many outliers.
Discussion: This experiment shows that the robustness of evaluation metrics on real-world datasets is directly correlated to the size of their minimal dataset . -nn PR has the smallest minimal set and is the most vulnerable to outliers in the data. -means PR has and different data points in must lie in different clusters. To fool -means PR, a generator must produce fake data points belonging to different clusters. That is nontrivial for real-world generative models. NND (both the original and our updated version) has the biggest and is the most robust. Our updated NND can be used to estimate the generalization capacity of GMs.
5.2 Evaluating our constant speed regularizer and MDL based metric
Fig. 5 shows the interpolation results for WGAN1GP and WGAN1GP-const. Interpolations produced by WGAN1GP contain many sudden jumps (red boxes in Fig. 5). At the beginning and the end of an interpolation path, the image at a step is almost identical to the image at the previous step (yellow boxes in Fig. 5). The phenomenon suggests that similar to the generator in Fig. 3, WGAN1GP memorizes the training data. In contrast, WGAN1GP-const produces smooth interpolations with no sudden jumps. Fig. 4 shows that WGAN1GP-const has a very good generalization capacity, i.e., it can produce many more images that are not in the training set. Our constant speed regularizer helps the generator avoids training set memorization and generalizes.


We test the discriminative power of the computational complexity. We compare the computational complexity with the pairwise path length [14]. The result is shown in Fig. 6. The pairwise path length is not good for distinguishing different models. WGAN1GP and WGAN0GP have roughly the same pairwise path length. The same is true for GANR1 and GAN0GP. In contrast, the computational complexity can clearly separate different models. We also see that the complexity of each model increases as the training progresses. GANs with 0GP have lower complexity than GANs with 1GP and R1 penalty. The result confirms that 0GP can help to reduce the complexity and improve generalization. Our constant speed regularizer does an outstanding job in minimizing the complexity. The complexity of WGAN1GP-const is much lower than other models and does not increase as the training progresses. The result again confirms the effectiveness of our constant speed regularizer.
6 Limitations and Future Work
In Section 3, we introduce a method for comparing evaluation metrics. To compute the capacity of an evaluation metric , we have to construct its minimal dataset . However, the minimal dataset is not always easy to construct and might depend on the evaluation metric’s parameters (e.g., the NND). We need a general method for estimating the minimal datasets of evaluation metrics.
Our generalization metric uses to control the relative importance of the divergence to the complexity. Although we discussed the effect of on the metric in Section 4.2, we have not yet tested its effect in practical settings. An algorithm for determining the optimal value for is also needed. They are open problems that we aim to solve in our next work.
7 Conclusion
In this paper, we investigate the ability to measure generalization of different evaluation metrics. We show that all of the existing metrics can be fooled by training set memorization. We study the worst-case performance of common metrics and show that their worst-case performance is related to their robustness in real-world scenarios. We identify the weakness of NND and propose a fix to the problem. Our experiments show that the fixed NND metric is robust to noise in the input and can be used to estimate the generalization capacities of generative models. We introduce a constant speed regularizer for improving the smoothness of the generated data manifold and the generalization capacity of the generator in generative latent variable models. We develop a novel metric for generalization based on the MDL principle. Initial results show that our metric has better discriminative power than existing metrics.
References
- Arjovsky et al. [2017] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214–223, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
- Arora et al. [2017] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 224–232, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
- Arora et al. [2018] S. Arora, A. Risteski, and Y. Zhang. Do GANs learn the distribution? some theory and empirics. In International Conference on Learning Representations, 2018.
- Barratt and Sharma [2018] S. Barratt and R. Sharma. A note on the inception score, 2018.
- Barron et al. [1998] A. Barron, J. Rissanen, and Bin Yu. The minimum description length principle in coding and modeling. IEEE Transactions on Information Theory, 44(6):2743–2760, 1998.
- Deng et al. [2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
- Esteban et al. [2017] C. Esteban, S. L. Hyland, and G. Rätsch. Real-valued (medical) time series generation with recurrent conditional gans. arXiv preprint arXiv:1706.02633, 2017.
- Goodfellow et al. [2014] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
- Grunwald [2004] P. Grunwald. A tutorial introduction to the minimum description length principle. arXiv preprint math/0406077, 2004.
- Gulrajani et al. [2017] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5767–5777. Curran Associates, Inc., 2017.
- Gulrajani et al. [2019] I. Gulrajani, C. Raffel, and L. Metz. Towards GAN benchmarks which require generalization. In International Conference on Learning Representations, 2019.
- Heusel et al. [2017] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6626–6637. Curran Associates, Inc., 2017.
- Horak et al. [2020] D. Horak, S. Yu, and G. Salimi-Khorshidi. Topology distance: A topology-based approach for evaluating generative adversarial networks. arXiv preprint arXiv:2002.12054, 2020.
- Karras et al. [2019] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
- Khrulkov and Oseledets [2018] V. Khrulkov and I. Oseledets. Geometry score: A method for comparing generative adversarial networks. arXiv preprint arXiv:1802.02664, 2018.
- Kolmogorov [1968] A. N. Kolmogorov. Three approaches to the quantitative definition of information. International journal of computer mathematics, 2(1-4):157–168, 1968.
- Kynkäänniemi et al. [2019] T. Kynkäänniemi, T. Karras, S. Laine, J. Lehtinen, and T. Aila. Improved precision and recall metric for assessing generative models. In Advances in Neural Information Processing Systems, pages 3929–3938, 2019.
- Meehan et al. [2020] C. Meehan, K. Chaudhuri, and S. Dasgupta. A Non-Parametric Test to Detect Data-Copying in Generative Models. arXiv e-prints, art. arXiv:2004.05675, Apr. 2020.
- Mescheder et al. [2018] L. Mescheder, A. Geiger, and S. Nowozin. Which training methods for GANs do actually converge? In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3478–3487, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
- Rissanen [1978] J. Rissanen. Paper: Modeling by shortest data description. Automatica, 14(5):465–471, Sept. 1978. ISSN 0005-1098. doi: 10.1016/0005-1098(78)90005-5.
- Rissanen [1986] J. Rissanen. Stochastic complexity and modeling. The annals of statistics, pages 1080–1100, 1986.
- Rissanen [1987] J. Rissanen. Stochastic complexity. Journal of the Royal Statistical Society: Series B (Methodological), 49(3):223–239, 1987.
- Sajjadi et al. [2018] M. S. Sajjadi, O. Bachem, M. Lucic, O. Bousquet, and S. Gelly. Assessing generative models via precision and recall. In Advances in Neural Information Processing Systems, pages 5228–5237, 2018.
- Salimans et al. [2016] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, and X. Chen. Improved techniques for training gans. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2234–2242. Curran Associates, Inc., 2016.
- Shmelkov et al. [2018] K. Shmelkov, C. Schmid, and K. Alahari. How good is my gan? In Proceedings of the European Conference on Computer Vision (ECCV), pages 213–229, 2018.
- Szegedy et al. [2015] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015. URL http://arxiv.org/abs/1512.00567.
- Thanh-Tung and Tran [2020] H. Thanh-Tung and T. Tran. Catastrophic Forgetting and Mode Collapse in Generative Adversarial Networks. In Proceedings of the International Joint Conference on Neural Networks, Jul 2020.
- Thanh-Tung et al. [2019] H. Thanh-Tung, T. Tran, and S. Venkatesh. Improving generalization and stability of generative adversarial networks. In International Conference on Learning Representations, 2019.
Appendix A Inception Score
We assume that the Inception model is a perfect classifier, i.e. given a real datapoint the conditional distribution is
(14) |
We have . We select real datapoints from classes in the training set so . The Inception score for the dataset in Eqn. 6 is .
Appendix B Neural Net Divergence






Appendix C -means Precision-Recall
We describe the procedure for constructing . Without loss of generality, we assume that , and the -means algorithm always finds the optimal clustering. -means PR applies -means clustering algorithm on to get clusters. Let be the -th cluster, be the set of test datapoints in this cluster. From the algorithm above, we see the the perfect precision and recall are achieved when . To make , the generator needs to
-
1.
memorize a training datapoint s.t. is closest to the center of cluster .
-
2.
duplicate the datapoint times
The number of distinct datapoints that has to memorize is , the number of clusters.
Appendix D -nn Precision-Recall
The idea for constructing is illustrated in Fig. 1. We select s.t. . Because are in , they lie on the real manifold. Thus, the precision is 1. Because , all of the training datapoints are in the two spheres centered at and . If is large enough then we can assume that all of the test datapoints lies in the manifold reconstructed using . Therefore, all of the test datapoints are highly likely to lie in the two spheres centered at and the recall is 1.