Gi and Pal Scores: Deep Neural Network
Generalization Statistics
Abstract
The field of Deep Learning is rich with empirical evidence of human-like performance on a variety of regression, classification, and control tasks. However, despite these successes, the field lacks strong theoretical error bounds and consistent measures of network generalization and learned invariances. In this work, we introduce two new measures, the Gi-score and Pal-score, that capture a deep neural network’s generalization capabilities. Inspired by the Gini coefficient and Palma ratio, measures of income inequality, our statistics are robust measures of a network’s invariance to perturbations that accurately predict generalization gaps, i.e., the difference between accuracy on training and test sets.
1 Introduction
Neural networks have produced state-of-the-art and human-like performance across a variety of tasks, from image classification to autonomous driving (Sengupta et al., 2020). With this rapid progress has come more widespread adoption and deployment. Given their prevalence and increasing range of applications, it is important to better understand why neural nets are able to achieve such high performance that often generalizes to unseen data, and to estimate how well a trained net will generalize.
Various attempts at bounding and predicting neural network generalization are well summarized and analyzed in the recent survey Jiang et al. (2020b). While both theoretical and empirical progress has been made, there remains a gap in the literature for an efficient and intuitive measure that can predict generalization given a trained network and its corresponding data post hoc. Hoping to fill this gap, the recent Predicting Generalization in Deep Learning (PGDL) competition described in Jiang et al. (2020a) encouraged participants to provide complexity measures that would take into account network weights and training data to predict generalization gaps, i.e., the difference between performance on training and test sets. In this work, we propose two new measures called the Gi-score and Pal-score that present progress towards this goal. Our new statistics are calculated by measuring a network’s performance on training data that has been perturbed with varying magnitudes and comparing this performance to an idealized network that is unaffected by all magnitudes of perturbation.
2 Related work
The PGDL competition resulted in several proposals of complexity measures that aim to bound and predict neural network generalization gaps. While several submissions build off the work of Jiang et al. (2019) and rely on margin-based measures, we will focus on those submissions that measure perturbation response, specifically mixup, since this is most relevant to our work. Mixup, first introduced in Zhang et al. (2017), is a novel training paradigm in which training occurs not just on the given training data, but also on linearly interpolated points. Manifold Mixup training extends this idea to interpolation of intermediate network representations (Verma et al., 2019).
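As a concrete illustration of the mixup training paradigm, a mixed-up batch can be formed as in the following minimal numpy sketch (the function name is ours, and we pass a fixed interpolation weight `lam`, whereas Zhang et al. (2017) draw it from a Beta distribution per batch):

```python
import numpy as np

def mixup_batch(x, y_onehot, lam, rng):
    """Form a mixup batch (Zhang et al., 2017): convex combinations of
    randomly paired inputs and of their one-hot labels."""
    perm = rng.permutation(len(x))           # random pairing of samples
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix
```

Training then minimizes the usual loss on `(x_mix, y_mix)` instead of the raw batch; Manifold Mixup applies the same combination to intermediate layer representations rather than inputs.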
Perhaps most closely related to our work is the winning submission, Natekar & Sharma (2020). While Natekar & Sharma (2020) present several proposed complexity measures, they explore accuracy on Mixup and Manifold Mixup versions of the training set as potential predictors of the generalization gap, and performance on mixed-up data is one of the inputs to their winning submission. Natekar & Sharma (2020) mix up data points or intermediate representations within a class, not between classes. While this closely resembles our work, in that the authors use performance on a perturbed (namely, mixed-up) dataset, the key difference is that Natekar & Sharma (2020) only investigate a network’s response to a single interpolation magnitude, 0.5. Additionally, we investigate between-class interpolation as well. Our proposed Gi-score and Pal-score therefore provide a much more robust sense of how invariant a network is to this mixup perturbation, and they can easily be applied to other perturbations as well.
In the same vein of exploring various transformations and perturbations, the second-place submission, Kashyap et al. (2021), applies various augmentations, such as color saturation, Sobel filters, cropping and resizing, and others, and creates a composite penalty score based on how a network performs on these perturbed data points. Our work, in addition to achieving better generalization gap prediction scores, can be thought of as an extension of this submission: rather than looking at a single magnitude of various perturbations, the Gi-score and Pal-score summarize how a model reacts to a spectrum of parameterized transformations.
3 Methodology
3.1 Notation
We begin by defining a network for a classification task as $f: \mathbb{R}^{d_0} \to \Delta^{C-1}$; that is, a mapping of real input signals of dimension $d_0$ to discrete distributions over $C$ classes, with $\Delta^{k}$ being the space of all $k$-simplices. We also define the intermediate layer mappings of a network as $f_\ell: \mathbb{R}^{d_{\ell-1}} \to \mathbb{R}^{d_\ell}$, where $\ell$ refers to a layer’s depth with dimension $d_\ell$. The output of each layer is defined as $x_\ell = f_\ell(x_{\ell-1})$, with inputs defined as $x_0 = x$. Additionally, let $f_{\ell \to}$ be the function that maps intermediate representations $x_\ell$ to the final output of probability distributions over classes, so that $f(x) = f_{\ell \to}(x_\ell)$. For a dataset $S = \{(x_i, y_i)\}_{i=1}^{n}$, consisting of pairs of inputs $x_i$ and labels $y_i$, a network’s accuracy is defined as
$$\mathrm{acc}(f, S) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\!\left[\arg\max_{c} f(x_i)_c = y_i\right],$$
i.e., the fraction of samples where the predicted class matches the ground-truth label, where $\mathbb{1}$ is an indicator function and $f(x_i)_c$ refers to the probability weight of the $c^{\text{th}}$ class.
We define perturbations of the network’s representations as $\pi_\alpha: \mathbb{R}^{d_\ell} \to \mathbb{R}^{d_\ell}$, where $\alpha$ controls the magnitude of the perturbation. For example, adding Gaussian noise with zero mean and standard deviation $\alpha$ to inputs can be represented as $\pi_\alpha(x) = x + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, \alpha^2 I)$. To measure a network’s response to a perturbation applied at the $\ell^{\text{th}}$ layer output, we calculate the accuracy of the network given the perturbation:
$$\mathrm{acc}_\ell(f, S, \pi_\alpha) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\!\left[\arg\max_{c} f_{\ell \to}(\pi_\alpha(x_{i,\ell}))_c = y_i\right].$$
The greater the gap $\mathrm{acc}(f, S) - \mathrm{acc}_\ell(f, S, \pi_\alpha)$, the less the network is resilient or invariant to the perturbation $\pi_\alpha$ when applied to the $\ell^{\text{th}}$ layer. Perturbations at deeper network layers can be viewed as perturbations in an implicit feature space learned by the network.
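This measurement can be sketched in a few lines of numpy (hypothetical helper names: `head` stands in for the mapping from layer representations to class probabilities, and the Gaussian perturbation mirrors the example in the text):

```python
import numpy as np

def accuracy(probs, labels):
    """Fraction of samples whose arg-max class matches the label."""
    return float(np.mean(np.argmax(probs, axis=1) == labels))

def perturbed_accuracy(head, reps, labels, perturb, alpha, rng):
    """Accuracy of the network head on layer representations perturbed
    with magnitude alpha."""
    return accuracy(head(perturb(reps, alpha, rng)), labels)

def gaussian_perturb(reps, alpha, rng):
    """Additive zero-mean Gaussian noise with standard deviation alpha."""
    return reps + rng.normal(0.0, alpha, size=reps.shape)
```

At `alpha = 0` the perturbed accuracy coincides with the clean accuracy, so the gap described above is zero by construction.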
3.2 Calculating the Gi-score and Pal-score
In our work, we let $\pi_\alpha$ be defined as an interpolation between two points of either different or the same class: $\pi_\alpha(x_\ell) = (1 - \alpha)\, x_\ell + \alpha\, \tilde{x}_\ell$. For inter-class interpolation, i.e., where $\tilde{x}$ is a (random) input from a different class than $x$, we range $\alpha \in [0, 0.5)$. For the intra-class setup, i.e., where $x$ and $\tilde{x}$ are drawn from the same class, we include the upper bound of the magnitude: $\alpha \in [0, 0.5]$. While we explored other varieties of perturbation, such as adding Gaussian noise, we found that this mixup perturbation was most predictive of generalization gaps for the networks and datasets we tested. Both mixup perturbations that we tested (intra- and inter-class) are justifiable for predicting the generalization gap. We hypothesize that invariance to interpolation within a class should indicate that a network produces similar representations, and ultimately the same maximum class prediction, for inputs and latent representations within the same class regions (captured by intra-class interpolation). Invariance to interpolation between classes up to 50% should indicate that the network has well-separated clusters for representations of different classes and is robust to perturbations moving points away from heavily represented class regions in the data / representations.
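The two interpolation setups can be sketched as follows (helper names are ours; the partner-pool construction is a simplification, e.g. in the intra-class case a sample may occasionally be paired with itself):

```python
import numpy as np

def mixup_perturb(x, x_other, alpha):
    """Interpolate toward another sample: (1 - alpha) * x + alpha * x_other."""
    return (1.0 - alpha) * x + alpha * x_other

def partner_indices(labels, same_class, rng):
    """For each sample, pick a random partner from the same class
    (intra-class) or from a different class (inter-class)."""
    labels = np.asarray(labels)
    idx = np.empty(len(labels), dtype=int)
    for i, y in enumerate(labels):
        pool = np.flatnonzero((labels == y) if same_class else (labels != y))
        idx[i] = rng.choice(pool)
    return idx
```

Inter-class partners are then mixed in with $\alpha \in [0, 0.5)$ and intra-class partners with $\alpha \in [0, 0.5]$, as described above.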
To measure a network’s robustness to a perturbation, one could simply choose a fixed $\alpha$ and measure the network’s response. However, a more complete picture is provided by sampling the network’s response to various magnitudes $\alpha$. The blue plots in Figure 1 show this in practice. For inter-class mixup with $\alpha$ ranging from 0 to 0.5, we measure the network’s accuracy on a subset of the training set. Only the inputs / layer outputs are mixed up; the label is held constant to test invariance to mixup. We apply this perturbation at various depths, and in Figure 1 we display perturbations at the input level ($\ell = 0$) and the shallowest layer’s representation ($\ell = 1$, the layer right after the input). The result is the blue curves seen in Figure 1, which we refer to as perturbation-response (PR) curves.
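Sampling a PR curve then amounts to sweeping $\alpha$ and recording accuracy against the fixed labels, as in this sketch (hypothetical names; `head` maps the perturbed representations to class probabilities and `perturb` applies a magnitude-$\alpha$ perturbation):

```python
import numpy as np

def pr_curve(head, reps, labels, perturb, alphas):
    """Perturbation-response (PR) curve: accuracy at each magnitude alpha,
    with labels held fixed to test invariance to the perturbation."""
    accs = []
    for a in alphas:
        probs = head(perturb(reps, a))
        accs.append(float(np.mean(np.argmax(probs, axis=1) == labels)))
    return np.asarray(accs)
```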
[Figure 1: perturbation-response (PR) curves (blue) and the corresponding perturbation-cumulative-density (PCD) curves (red), for perturbations applied at $\ell = 0$ and $\ell = 1$.]
To extract a single statistic from the PR curves in Figure 1, we draw inspiration from the Gini coefficient and Palma ratio, two distinct measures of income inequality that compare the wealth distribution of a given economy with that of an idealized economy (Cobham & Sumner, 2013). Namely, we compare a network’s response to varying magnitudes of perturbations with an idealized network: one whose accuracy is unaffected by the perturbations. The idealized network therefore has a PR curve that starts and remains at accuracy 1.0 regardless of the perturbation magnitude $\alpha$.
This comparison is achieved by creating a new graph that plots the cumulative integral under the PR curve against the magnitudes $\alpha$:
$$\mathrm{PCD}(\alpha) = \int_{0}^{\alpha} \mathrm{acc}_\ell(f, S, \pi_t)\, dt.$$
This produces what we call perturbation-cumulative-density (PCD) curves, seen in red in Figure 1. For the idealized network, whose PR curve is identically equal to 1 for all $\alpha$, the PCD curve is just the line $\mathrm{PCD}(\alpha) = \alpha$ passing through the origin. Finally, the Gi-score (named for the Gini coefficient that inspires it) is calculated by taking the area between the idealized network’s PCD curve and that of the actual network. The Pal-score (named for the Palma ratio that inspires it) is calculated by dividing the area for the top 60% of perturbation magnitudes by the area for the bottom 10%. This allows us to focus on variation at the upper and lower ends of the perturbation magnitude spectrum, ignoring the middle perturbations that might not vary as widely across networks.
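From a sampled PR curve, both scores can be computed numerically. The following is a minimal sketch under our reading of the construction; the trapezoid integration, the assumption that the $\alpha$ grid starts at 0, and the index-based 60% / 10% cut points are our simplifications:

```python
import numpy as np

def _trap(y, x):
    """Trapezoid-rule integral of samples y over grid x."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def pcd_curve(alphas, pr):
    """Cumulative integral of the PR curve up to each alpha (PCD curve)."""
    seg = 0.5 * (pr[1:] + pr[:-1]) * np.diff(alphas)
    return np.concatenate([[0.0], np.cumsum(seg)])

def gi_score(alphas, pr):
    """Area between the idealized PCD curve (PCD(a) = a, since the ideal
    PR curve is identically 1) and the network's PCD curve."""
    gap = alphas - pcd_curve(alphas, pr)
    return _trap(gap, alphas)

def pal_score(alphas, pr, top=0.6, bottom=0.1):
    """Ratio of the area gap over the top fraction of magnitudes to the
    gap over the bottom fraction (cut points applied per grid index)."""
    gap = alphas - pcd_curve(alphas, pr)
    n = len(alphas)
    lo = max(2, int(np.ceil(n * bottom)))      # need >= 2 points to integrate
    hi = int(np.floor(n * (1.0 - top)))
    return _trap(gap[hi:], alphas[hi:]) / _trap(gap[:lo], alphas[:lo])
```

A perfectly invariant network has zero area gap everywhere, giving a Gi-score of 0; larger Gi-scores indicate less invariance to the perturbation.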
4 Results
We calculate our Gi-score and Pal-score on a corpus of trained networks and their corresponding datasets provided in Jiang et al. (2020a). The networks from this competition span several architectures and datasets. Namely, there are VGG, Network-in-Network (NiN), and Fully Convolutional (Full Conv) architectures. The datasets are CIFAR-10 (Krizhevsky et al., 2009), SVHN (Netzer et al., 2011), CINIC-10 (Darlow et al., 2018), Oxford Flowers (Nilsback & Zisserman, 2008), Oxford Pets (Parkhi et al., 2012), and Fashion MNIST (Xiao et al., 2017). The list of dataset-model combinations, or tasks, available in the trained model corpus can be seen in the first two rows of Table 1. Across the 8 tasks, there are a total of 550 networks. Each network is trained so that it attains nearly perfect accuracy on the training dataset. For each network, we also know the generalization gap.
As proposed in Jiang et al. (2020a), the goal is to find a complexity measure of networks that is causally informative (predictive) of generalization gaps. To measure this predictive quality, Jiang et al. (2020a) propose a Conditional Mutual Information (CMI) score. For full implementation details of this score, please refer to Jiang et al. (2020a); roughly, higher values of CMI represent a greater capability of a complexity score to predict generalization gaps. In Table 1, we present the average CMI scores for all models within a task for our Gi- and Pal-scores compared to those of the winning team (Natekar & Sharma, 2020) from the PGDL competition. We also compare our statistics to comparable ones presented in Natekar & Sharma (2020) that rely on Mixup and Manifold Mixup accuracy.¹ The winning submission described in Natekar & Sharma (2020) uses a combination of a score based on the accuracy of mixed-up input data and a clustering quality index of class representations, known as the Davies-Bouldin Index (DBI) (Davies & Bouldin, 1979). Using the notation introduced in Section 3, the measures from Natekar & Sharma (2020) present in Table 1 can be described as follows: Mixup accuracy, $\mathrm{acc}_0(f, S, \pi_{0.5})$; Manifold Mixup accuracy, $\mathrm{acc}_\ell(f, S, \pi_{0.5})$ for an intermediate layer $\ell$; and DBI * Mixup, the DBI of class representations combined with the Mixup accuracy.

¹ Scores come from Natekar & Sharma (2020). For the scores not reported there, we use the code provided by the authors: https://github.com/parthnatekar/pgdl
Table 1: Average CMI scores per task (higher is better) for our Gi- and Pal-scores and for the Mixup-based measures of Natekar & Sharma (2020).

| | CIFAR-10 | CIFAR-10 | SVHN | CINIC-10 | CINIC-10 | Oxford Flowers | Oxford Pets | Fashion MNIST |
| | VGG | NiN | NiN | Conv w/ bn | Conv w/o bn | NiN | NiN | VGG |
|---|---|---|---|---|---|---|---|---|
| Gi Inter ($\ell = 0$) | 3.12 | 34.78∗ | 26.86 | 20.92 | 6.68 | 33.35 | 17.80∗ | 4.49 |
| Gi Inter ($\ell = 1$) | 7.69∗ | 24.02 | 12.25 | 12.62 | 8.42 | 7.39 | 4.57 | 16.12∗ |
| Pal Inter ($\ell = 0$) | 3.17 | 27.79 | 22.91 | 20.94 | 6.21 | 29.75 | 15.96 | 4.16 |
| Pal Inter ($\ell = 1$) | 7.10 | 13.33 | 9.65 | 12.11 | 7.69 | 6.27 | 3.49 | 14.43 |
| Gi Intra ($\ell = 0$) | 0.82 | 31.73 | 40.99∗ | 22.80 | 11.49 | 40.56 | 16.80 | 5.22 |
| Gi Intra ($\ell = 1$) | 0.23 | 16.82 | 10.98 | 9.40 | 12.38 | 6.85 | 3.49 | 5.74 |
| Pal Intra ($\ell = 0$) | 0.66 | 24.64 | 29.77 | 24.38 | 10.93 | 38.04 | 15.25 | 4.93 |
| Pal Intra ($\ell = 1$) | 0.45 | 10.25 | 14.08 | 8.80 | 10.65 | 5.96 | 3.02 | 6.25 |
| Mixup | 0.03 | 14.18 | 22.75 | 30.30 | 19.51∗ | 35.30 | 9.99 | 7.75 |
| Manifold Mixup | 2.24 | 2.88 | 12.11 | 4.23 | 4.84 | 0.03 | 0.13 | 0.19 |
| DBI * Mixup¹ | 0.00 | 25.86 | 32.05 | 31.79 | 15.92 | 43.99 | 12.59 | 9.24 |
These results highlight that the Gi-score and Pal-score perform competitively in predicting the generalization gap. Note that some versions of our scores outperform the mixup approaches used in the PGDL winning approach on the majority of tasks, and even significantly outperform the DBI * Mixup approach on 4 tasks. This suggests the possibility of even better results from combining our scores with DBI (future work). In addition, we believe that our scores provide a more robust measure of how well a model is able to learn invariances to certain transformations. For example, the Mixup complexity score presented in Natekar & Sharma (2020) simply takes a 0.5 interpolation of data points and calculates the accuracy of a network on this mixed-up portion of the training set. In contrast, our scores capture network performance on a spectrum of interpolations, thereby providing a more robust statistic for how invariant a network is to linear data interpolation. Our approach can also be extended to any parametric transformation.
5 Conclusion
In this work, we introduced two novel statistics, inspired by income inequality metrics, that effectively predict neural network generalization. The Gi-score and Pal-score are both computationally efficient, requiring only several forward passes through a subset of the training data, and intuitive. Calculating these statistics on the corpus of trained networks made available in Jiang et al. (2020a) showed that our scores, applied to linear interpolation between data points, have strong performance in predicting a network’s generalization gap.
In addition to this predictive capability of generalization, we believe that the Gi-score and Pal-score provide a useful criterion for evaluating networks and selecting which architecture or hyperparameter configuration is most invariant to a desired transformation. Because they rely on comparison with an idealized network that is unaffected by all magnitudes of a perturbation, future work will explore how these scores aid in discerning the extent to which a network has learned to be invariant to a given parametric transformation.
Acknowledgments
The authors would like to thank the organizers of the Predicting Generalization in Deep Learning competition and workshop hosted at NeurIPS 2020 for providing a repository of pre-trained networks and their corresponding datasets and starter code for working with this corpus.
References
- Cobham & Sumner (2013) Alex Cobham and Andy Sumner. Is it all about the tails? the palma measure of income inequality. Center for Global Development working paper, (343), 2013.
- Darlow et al. (2018) Luke N Darlow, Elliot J Crowley, Antreas Antoniou, and Amos J Storkey. CINIC-10 is not ImageNet or CIFAR-10. arXiv preprint arXiv:1810.03505, 2018.
- Davies & Bouldin (1979) David L Davies and Donald W Bouldin. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, (2):224–227, 1979.
- Jiang et al. (2019) Yiding Jiang, Dilip Krishnan, Hossein Mobahi, and Samy Bengio. Predicting the generalization gap in deep networks with margin distributions. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HJlQfnCqKX.
- Jiang et al. (2020a) Yiding Jiang, Pierre Foret, Scott Yak, Daniel M. Roy, Hossein Mobahi, Gintare Karolina Dziugaite, Samy Bengio, Suriya Gunasekar, Isabelle Guyon, and Behnam Neyshabur. NeurIPS 2020 competition: Predicting generalization in deep learning, 2020a.
- Jiang et al. (2020b) Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. In International Conference on Learning Representations, 2020b. URL https://openreview.net/forum?id=SJgIPJBFvH.
- Kashyap et al. (2021) Dhruva Kashyap, Natarajan Subramanyam, et al. Robustness to augmentations as a generalization metric. arXiv preprint arXiv:2101.06459, 2021.
- Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
- Natekar & Sharma (2020) Parth Natekar and Manik Sharma. Representation based complexity measures for predicting generalization in deep learning, 2020.
- Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
- Nilsback & Zisserman (2008) Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. IEEE, 2008.
- Parkhi et al. (2012) Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition, pp. 3498–3505. IEEE, 2012.
- Sengupta et al. (2020) Saptarshi Sengupta, Sanchita Basak, Pallabi Saikia, Sayak Paul, Vasilios Tsalavoutis, Frederick Atiah, Vadlamani Ravi, and Alan Peters. A review of deep learning with special emphasis on architectures, applications and recent trends. Knowledge-Based Systems, 194:105596, 2020.
- Verma et al. (2019) Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, David Lopez-Paz, and Yoshua Bengio. Manifold mixup: Better representations by interpolating hidden states. In International Conference on Machine Learning, pp. 6438–6447. PMLR, 2019.
- Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
- Zhang et al. (2017) Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.