Practical Assessment of Generalization Performance Robustness for Deep Networks via Contrastive Examples
Abstract.
Training images with data transformations have been suggested as contrastive examples to complement the testing set for generalization performance evaluation of deep neural networks (DNNs) (Jiang et al., 2020a). In this work, we propose a practical framework ContRE (in French, the word "contre" means "against" or "versus") that uses Contrastive examples for DNN geneRalization performance Estimation. Specifically, ContRE follows the assumption in (Chen et al., 2020b; He et al., 2020) that robust DNN models with good generalization performance are capable of extracting a consistent set of features and making consistent predictions from the same image under varying data transformations. Incorporating a set of randomized strategies for well-designed data transformations over the training set, ContRE adopts classification errors and Fisher ratios on the generated contrastive examples to assess and analyze the generalization performance of DNN models in complement to a testing set. To show the effectiveness and efficiency of ContRE, extensive experiments have been done using various DNN models on three open source benchmark datasets with thorough ablation studies and applicability analyses. Our experimental results confirm that (1) the behaviors of deep models on contrastive examples are strongly correlated to those on the testing set, and (2) ContRE is a robust measure of generalization performance that complements the testing set in various settings.
1. Introduction

Deep neural networks (DNNs) have achieved remarkable performance in various domains such as computer vision, natural language processing, and acoustic processing, but it is still challenging to assess the generalization performance of models in a robust way. Such robustness characterizes the ability to measure the performance of models on unseen data. Given a DNN model trained with a sample set, a proxy for evaluating the generalization performance is to evaluate the model on samples in a disjoint testing set drawn from the same underlying data distribution. We can then approximate the generalization error of the model by the testing error, instead of coping with the intractable expectation of classification error over the data distribution (covering both seen and unseen samples).
In addition to the testing set, many recent works have put substantial effort into discovering intrinsic properties of deep models, such as weight norms, gradient norms, and the sharpness of local minima (Arora et al., 2018; Neyshabur et al., 2017; Jiang et al., 2019a; Jiang et al., 2020b), that are correlated with generalization performance. Using these intrinsic properties as metrics makes it possible to evaluate the generalization performance of a model while avoiding "overfitting" to the validation/testing set during performance tuning (Recht et al., 2019). However, most of these metrics are vacuous on current deep learning tasks (Jiang et al., 2020b) and lead to poor predictability of generalization performance in some practical settings.
Recently, a practical approach to evaluating generalization performance that exploits the training samples with image transformations, e.g., crops, shifts, rotations and color distortions, has been discussed in (Jiang et al., 2020a) and suggested as a potential way to assess generalization performance using such contrastive examples (named after contrastive learning (Chen et al., 2020b, a) for the transformed examples). Though this approach can be loosely backed by some theoretical results (Bousquet and Elisseeff, 2002; Deng et al., 2020), the empirical use of contrastive examples for performance evaluation becomes less effective when similar data transformations have already been used as training data augmentation to produce the model. This failure can be well explained by the high capacity of DNNs to memorize and overfit every seen sample (Zhang et al., 2016). Thus, a way is needed to adapt the data transformations to arbitrary models for generalization performance measurement, even when the training data augmentations vary.
In this work, we propose ContRE, which uses Contrastive examples for geneRalization performance Evaluation. Specifically, ContRE records features and prediction results on the original training set and on the contrastive examples (see Figure 1 for an illustration). For generalization performance evaluation, we follow the assumptions in (Chen et al., 2020b, a) that a generalizable DNN model should make consistent predictions for different views of the same image transformed by varying strategies, as the model would extract an invariant set of features from the image. In this way, a model with higher/lower consistency in prediction results and extracted features under varying data transformations is expected to have better/worse generalization ability. Hereby, ContRE proposes to evaluate the generalization performance of DNN models using the classification accuracy (and the Fisher ratio of feature vectors) on a set of contrastive examples. To generate effective contrastive examples, ContRE adopts RandAugment (Cubuk et al., 2020) to ensemble tens of commonly used image transformations in a stochastic manner, where every original training sample is transformed by a sequence of randomly picked operations.
In summary, this work makes contributions as follows:
• We propose a simple yet effective framework ContRE that generates contrastive examples using randomized data transformations on the training data and measures classification errors (and Fisher ratios) of DNN models on contrastive samples for the generalization performance evaluation purpose.
• We conduct extensive experiments using various DNN models on three benchmark datasets, i.e., CIFAR-10, CIFAR-100 and ImageNet, empirically demonstrating the strong correlations between the accuracy on contrastive examples and the one on the testing set.
• Thorough analyses are provided: (1) By breaking down the data transformation strategies used in ContRE, we systematically analyze the effect of every single transformation technique and the combined effects of all possible pairs; (2) We provide applicability analyses to confirm the versatility of ContRE working with arbitrary models trained with varying data augmentations; (3) We further investigate when and why ContRE is an effective way to estimate the generalization performance of DNN models.
To the best of our knowledge, this work is the first to practically assess the relevance of using contrastive examples for estimating the generalization performance of DNN models.
2. Related Work
In this section, we introduce the related work from the methodology and application perspectives.
Contrastive Learning and Contrastive Examples
The idea of contrastive learning is to pull closer the representations of different views of the same image (positive pair) and to repel representations of different images (negative pair). Networks are then updated iteratively with positive and negative pairs through various contrastive learning objectives (He et al., 2020; Chen et al., 2020b; Chen et al., 2020c; Chen et al., 2020a; Caron et al., 2020). In a contrastive learning framework, a stochastic data transformation module that generates different correlated views of the same image has proved effective for representation learning. Beyond its use in contrastive representation learning, data augmentation has also been shown to be essential for improving the generalization of DNN models. Trivial transformations, such as random crop and random flip, are widely used for training large DNN models. More recently, automated data augmentation policies such as AutoAugment (Cubuk et al., 2018) have been proposed to select data augmentation methods for boosting model performance. Rather than selecting data augmentation methods and compositions from a search space, RandAugment (Cubuk et al., 2020) randomly chooses a number of augmentation distortions from a transformation candidate pool with a preset distortion magnitude.
In this work, we propose to measure the robustness of DNN models based on the (dis)similarity of features and final predictions between original images and contrastive examples, where we follow the RandAugment module to generate contrastive examples.
Generalization Performance and Robustness
Generalization performance measures the behaviors of models on unseen data. With the hold-out testing set as a proxy for measuring the generalization performance, the generalization gap is defined as the difference between the accuracies on the training set and the testing set. While previous works developed bounds for DNN models’ generalization performance (Arora et al., 2018; Neyshabur et al., 2017), some recent works predict generalization gap from the statistics of training data or trained model weights. Jiang et al. (Jiang et al., 2019a) approximate the margin in neural networks and train a linear model on margin statistics to predict the generalization gap of deep models. Yak et al. (Yak et al., 2019) build upon the work by learning DNNs and RNNs in place of linear models. Instead of exploiting the margins, Unterthiner et al. (Unterthiner et al., 2021) build predictors from model weights. Corneanu et al. (Corneanu et al., 2020) define DNNs on the topological space and estimate the gaps from the topology summaries. Note that ContRE could be loosely guaranteed through data-dependent theoretical bounds (Bousquet and Elisseeff, 2002) and potentially related to the local elasticity (He and Su, 2019; Deng et al., 2020) while the theoretical adaptation to DNN models should be more cautiously considered. We leave this as future work.
Another way of analyzing generalization performance is through complexity measures, since a lower complexity often implies a smaller generalization gap (McAllester, 1999; Keskar et al., 2016; Liang et al., 2017; Nagarajan and Kolter, 2019; Neyshabur et al., 2015; Bartlett et al., 2017; Jiang et al., 2020b). The complexity is roughly defined as a measurement based on the DNN model's properties, such as the norm of parameters and the "sharpness" of the local minima. We refer interested readers to Jiang et al. (Jiang et al., 2020b), where the authors conducted extensive experiments testing the correlation between 40 complexity measures and deep learning models' generalization gaps, suggesting that PAC-Bayesian bounds (McAllester, 1999) and the sharpness measure (Keskar et al., 2016) perform the best overall.
Different from theoretical investigations and complexity measures of DNN models, ContRE measures the consistency of features and prediction results between contrastive examples and original images as a proxy for generalization performance, which is data-dependent and works well in practice.
3. Contrastive Sampling Framework for DNN Robustness Evaluation
For leveraging large-scale unlabeled images, visual contrastive learning approaches (Oord et al., 2018; Tian et al., 2019; Henaff et al., 2020; He et al., 2020; Chen et al., 2020b) propose to train DNN models by solving the self-supervised contrastive prediction task of classifying whether each pair of images is similar or dissimilar (Hadsell et al., 2006). A simple yet effective paradigm, SimCLR (Chen et al., 2020b), constructs similar pairs by transforming original images through various random data augmentation approaches (Cubuk et al., 2020), then considers pairs of augmented samples derived from the same image as similar and all others as dissimilar. We call these generated samples contrastive examples and the process of generating them contrastive sampling. Contrastive sampling is crucial in the self-supervised training of a DNN model that yields general and discriminative representations.
In this paper, instead of aiming at representation learning, we propose to investigate the robustness and consistency behaviors of (supervised) trained DNN models through our proposed approach ContRE, the Contrastive sampling framework for DNN Robustness and generalizability Evaluation. Figure 1 shows the process of ContRE. Specifically, ContRE first generates contrastive examples using RandAugment (Cubuk et al., 2020) and forwards generated samples into DNN models. Given the intermediate features and predictions of DNN models with contrastive examples as input, ContRE compares the results with original samples as input.
In the rest of this section, we formally introduce the proposed framework ContRE and the evaluation metrics for the investigations and analyses, as well as some discussions about ContRE.
3.1. Notations
For convenience, we consider a DNN model $f_\theta$ as the composition of a classifier $g$ and a feature extractor $h^{(l)}$, where $l$ indicates the layer index and is omitted for the last layer before the classifier. From an underlying data-generating distribution $\mathcal{D}$, an empirical dataset $S$ for training and a disjoint testing set $S'$ are drawn. Given a DNN model $f_\theta$ (with parameters $\theta$) trained on $S$, we can denote all the forwarding results of this DNN model, including the intermediate features $h^{(l)}(S)$, the estimates of the DNN model $f_\theta(S)$, and the loss/accuracy on the training set $\mathrm{Acc}(f_\theta, S)$ under an objective function $\ell$, as well as the generalization performance $\mathrm{Acc}(f_\theta, \mathcal{D})$. In practice, the generalization performance is measured on the disjoint set $S'$ as $\mathrm{Acc}(f_\theta, S')$. Moreover, a set of DNN models with different parameters is considered: $\{f_{\theta_1}, f_{\theta_2}, \dots, f_{\theta_K}\}$.
3.2. Contrastive Examples
Given the notations, we introduce the proposed framework ContRE, which considers a number of widely-used data transformation approaches to generate a different view of the original dataset as the contrastive sampling process. We denote the contrastive sampling strategy as $T$ and the obtained contrastive examples as $\hat{S} = T(S)$. In principle, $T$ can be any data transformation method; besides the trivial ones, ContRE follows the strategy of RandAugment (Cubuk et al., 2020) for generating contrastive examples.
According to our preliminary experiments, any relevant image transformation method is effective for generalization estimation as long as the transformed images have not been memorized by the model during training. Conversely, unseen transformed images can always be added as data augmentation techniques for improving performance. When access to the training process is unavailable, we can neither easily identify the transformation methods that were used nor regulate the data augmentation. To this end, ContRE adopts the strategy of RandAugment to increase the difficulty for DNN models of fitting all possible transformed images.
Given the contrastive examples, the primary objective of ContRE is to exploit the contrastive sampling strategy to analyze and evaluate the robustness of DNN models. To achieve this goal, with contrastive examples or original images as input, ContRE thoroughly compares the behaviors of models in terms of prediction accuracy and intermediate features. Formally, given $\hat{S} = T(S)$, ContRE computes the intermediate features $h^{(l)}(S)$ and $h^{(l)}(\hat{S})$, and the prediction accuracy or loss $\mathrm{Acc}(f_\theta, S)$ and $\mathrm{Acc}(f_\theta, \hat{S})$, across different architectures and parameters $\theta$. Given these features and performances of various DNN models, ContRE measures their robustness and analyzes their consistency with the metrics described below.
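This comparison step can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration, assuming a torchvision-style ResNet whose penultimate module is named `avgpool`; the loader names and the `evaluate` helper are hypothetical and not part of the original framework description.

```python
import torch

@torch.no_grad()
def evaluate(model, loader, device="cuda"):
    """One pass over a (possibly transformed) copy of the training set,
    recording top-1 accuracy and penultimate-layer features."""
    model.eval().to(device)
    feats = []
    # capture features right before the classifier via a forward hook
    # ('avgpool' is specific to torchvision ResNets; adapt for other nets)
    handle = model.avgpool.register_forward_hook(
        lambda _m, _in, out: feats.append(out.flatten(1).cpu()))
    correct = total = 0
    for images, labels in loader:
        logits = model(images.to(device))
        correct += (logits.argmax(1).cpu() == labels).sum().item()
        total += labels.numel()
    handle.remove()
    return correct / total, torch.cat(feats)

# ContRE contrasts the two views of the training set:
# acc_S,    feat_S    = evaluate(model, original_train_loader)
# acc_Shat, feat_Shat = evaluate(model, contrastive_train_loader)
```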
3.3. Correlations and Fisher Ratio
Despite the strong non-linearity, DNN models exhibit relatively similar/consistent behaviors with similar samples as input (except for devised adversarial attacks (Szegedy et al., 2013; Goodfellow et al., 2014; Ilyas et al., 2019; Eghbal-zadeh et al., 2020)). For example, a large crop or a vertical flip of an image rarely changes the DNN estimates, while a rotation or a severe color distortion slightly disturbs its predictions. Nevertheless, the level of robustness differs across models, and this difference may be exploited to investigate the relation between robustness and generalization performance. We introduce two correlation metrics and one feature clustering quality metric that are used for experimental evaluations.
3.3.1. Spearman’s Rank Correlation Coefficient
The proposed framework ContRE provides a good practical estimator of DNN models' generalization performance by measuring the loss/accuracy with contrastive examples as input. Specifically, given $K$ DNN models, we measure the Spearman's rank correlation coefficient, noted as $\rho$, between two rank variables:

(1)   $\rho = \dfrac{\mathrm{cov}(r_{\hat{S}}, r_{S'})}{\sigma_{r_{\hat{S}}}\, \sigma_{r_{S'}}}$

where $\mathrm{cov}(\cdot, \cdot)$ is the covariance, and $\sigma_{r_{\hat{S}}}$ and $\sigma_{r_{S'}}$ are the standard deviations of the rank variables. Specifically, the rank variables $r_{\hat{S}}$ and $r_{S'}$ are converted from $\{\mathrm{Acc}(f_{\theta_k}, \hat{S})\}_{k=1}^{K}$ and $\{\mathrm{Acc}(f_{\theta_k}, S')\}_{k=1}^{K}$.
We show in the experimental section that this strong correlation exists in various scenarios, across datasets, hyper-parameter settings, image transformation techniques, etc.
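Computing this coefficient over a collection of models is straightforward; the sketch below uses `scipy.stats.spearmanr`, and the per-model accuracy lists are illustrative placeholders rather than reported results.

```python
from scipy.stats import spearmanr

# one entry per DNN model: accuracy on contrastive examples and on the test set
acc_contrastive = [0.62, 0.71, 0.68, 0.75, 0.80]  # illustrative numbers only
acc_test        = [0.88, 0.91, 0.90, 0.93, 0.94]

rho, p_value = spearmanr(acc_contrastive, acc_test)
print(f"Spearman rank correlation: {rho:.3f} (p = {p_value:.3g})")
```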
3.3.2. Partial Rank Correlation
Only using correlation as the measure of association between two variables can be misleading, because their dependence may come from the associations of each with a third confounding variable. The problem can be solved by controlling the confounding variable via partial correlation. Let $X$, $Y$ and $Z$ be random variables taking real values, and let $\rho_{XY}$ be the Spearman correlation between $X$ and $Y$; then the partial rank correlation between $X$ and $Y$ given the control variable $Z$ is:

(2)   $\rho_{XY \cdot Z} = \dfrac{\rho_{XY} - \rho_{XZ}\, \rho_{YZ}}{\sqrt{1 - \rho_{XZ}^2}\, \sqrt{1 - \rho_{YZ}^2}}$
We use the partial correlation to eliminate the effect of training accuracy while investigating ContRE’s results. The details can be found in the experiment section.
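A small helper implementing Equation (2) on top of SciPy is sketched below; the function name and example variables are illustrative assumptions.

```python
from scipy.stats import spearmanr

def partial_spearman(x, y, z):
    """Partial rank correlation of x and y controlling for z (Equation 2)."""
    r_xy, _ = spearmanr(x, y)
    r_xz, _ = spearmanr(x, z)
    r_yz, _ = spearmanr(y, z)
    return (r_xy - r_xz * r_yz) / (((1 - r_xz**2) * (1 - r_yz**2)) ** 0.5)

# e.g., contrastive accuracy vs. test accuracy, controlling for training accuracy:
# partial_spearman(acc_contrastive, acc_test, acc_train)
```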
3.3.3. Fisher Ratio
To verify the existence of significant correlations at the feature level, ContRE also proposes to investigate the clustering quality of intermediate features. The degree of separation of the features of one class from those of other classes can be measured elegantly by the Fisher ratio. Given $C$ different classes, there are $n_c$ data points $x_i$ in class $c$, with class mean $\mu_c$ and overall mean $\mu$. The between-class scatter matrix $S_B$ measures mean separation and the within-class scatter matrix $S_W$ measures class concentration by pooling the estimates of the covariance matrices of each class; they are defined as:

(3)   $S_B = \sum_{c=1}^{C} n_c (\mu_c - \mu)(\mu_c - \mu)^\top, \qquad S_W = \sum_{c=1}^{C} \sum_{x_i \in c} (x_i - \mu_c)(x_i - \mu_c)^\top$

Then the Fisher ratio is:

(4)   $J = \mathrm{tr}\left(S_W^{-1} S_B\right)$
The Fisher ratio is a powerful tool for analyzing a model's generalization capability right before the classifier. We compare features extracted from original samples and contrastive examples using the Fisher ratio, i.e., $J(h(S))$ and $J(h(\hat{S}))$, as part of the explanation of why ContRE works well. The detailed findings are presented in the experiment section.
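A NumPy sketch of Equations (3) and (4) is given below; `features` is an $(n, d)$ array of penultimate-layer features and `labels` the class indices, both assumed inputs for illustration.

```python
import numpy as np

def fisher_ratio(features, labels):
    """Fisher ratio tr(S_W^{-1} S_B) of feature vectors grouped by class."""
    mu = features.mean(axis=0)
    d = features.shape[1]
    s_b = np.zeros((d, d))
    s_w = np.zeros((d, d))
    for c in np.unique(labels):
        x_c = features[labels == c]
        mu_c = x_c.mean(axis=0)
        diff = (mu_c - mu)[:, None]
        s_b += len(x_c) * diff @ diff.T           # between-class scatter
        s_w += (x_c - mu_c).T @ (x_c - mu_c)      # within-class scatter
    # a pseudo-inverse guards against a near-singular within-class matrix
    return float(np.trace(np.linalg.pinv(s_w) @ s_b))
```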
4. Experiments and Analyses



In this section, we validate, through extensive experiments, the main claim that $\mathrm{Acc}(f_\theta, \hat{S})$ is strongly correlated with $\mathrm{Acc}(f_\theta, S')$. We conduct experiments with various networks on three open source benchmark datasets, and all results with Spearman correlations and partial correlations support our findings. Furthermore, we demonstrate the applicability of ContRE across several practical settings, e.g., model selection across hyper-parameter choices, across RandAugment-trained models, with small-size datasets, or within a practical competition setting. Finally, we provide analyses investigating the reasons that ContRE works well as an estimator of DNN models' robustness and generalization performance.
4.1. Experiment Setup
We describe the experiment setups of datasets, networks and the methods for generating contrastive examples.
4.1.1. Datasets
Three of the most popular benchmark datasets are considered: CIFAR-10, CIFAR-100 and ImageNet. The CIFAR datasets contain 60,000 tiny images of resolution 32×32, while ImageNet includes over 1.2M natural images. The conventional split of training and test sets is followed for the CIFAR datasets, while the validation set of ImageNet is used as the testing set here.
4.1.2. Networks
For evaluating our proposed framework ContRE, the most popular or the most powerful DNN architectures are considered: VGGNet (Simonyan and Zisserman, 2015), ResNet (He et al., 2016), Wide ResNet (Zagoruyko and Komodakis, 2016), DenseNet (Huang et al., 2017), ResNeXt (Xie et al., 2017), ShuffleNet (Zhang et al., 2018; Ma et al., 2018), MNasNet (Tan et al., 2019), MobileNet (Howard et al., 2017; Sandler et al., 2018) and their variants. We train these DNN models with standard data augmentation methods, i.e., random crop and random flip, and standard training hyper-parameters. We also investigate different data augmentation methods and different hyper-parameter settings for training DNN models, and the strong correlation between $\mathrm{Acc}(f_\theta, \hat{S})$ and $\mathrm{Acc}(f_\theta, S')$ still holds. Generally, models for the CIFAR datasets are smaller than those for ImageNet, but the architectures are shared among models for these datasets. Moreover, many pretrained models on ImageNet are publicly available (https://pytorch.org/vision/stable/models.html), which greatly accelerates our setup process.
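As an example of this setup, the snippet below loads a few of the publicly available ImageNet-pretrained torchvision models; the particular model list is illustrative, and the `pretrained=True` argument follows the older torchvision API (newer releases use a `weights` argument instead).

```python
import torchvision.models as models

# a few of the evaluated architectures; weights are downloaded on first use
pretrained = {
    "resnet50":      models.resnet50(pretrained=True),
    "densenet121":   models.densenet121(pretrained=True),
    "mobilenet_v2":  models.mobilenet_v2(pretrained=True),
    "shufflenet_v2": models.shufflenet_v2_x1_0(pretrained=True),
}
for name, model in pretrained.items():
    model.eval()  # evaluation mode for the ContRE forward passes
```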
4.1.3. Contrastive Examples
As introduced previously, RandAugment (Cubuk et al., 2020) is used in the ContRE framework to generate contrastive examples, where we take all the available transformation operations from the open-sourced implementation (https://github.com/ildoonet/pytorch-randaugment). We write RA_N_M, where N and M are the two hyper-parameters to configure in RandAugment: the number of sequential distortion operations and the magnitude of each distortion. For example, RA_N2_M9 means that 2 operations are chosen randomly from all available operations and a distortion of magnitude 9 is applied (the maximum magnitude is 30; note that even the maximum magnitude does not transform the image into pure noise).
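For reference, RA_N2_M9 contrastive examples can also be generated with torchvision's built-in `RandAugment` (available since torchvision 0.11), as sketched below; the magnitude scale and the normalization constants may differ from the implementation referenced above and should be matched to the actual training setup.

```python
from torchvision import transforms

N, M = 2, 9  # RA_N2_M9: two random operations, distortion magnitude 9

contrastive_transform = transforms.Compose([
    transforms.RandAugment(num_ops=N, magnitude=M),
    transforms.ToTensor(),
    # standard ImageNet normalization (an assumption; match the training setup)
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# applying contrastive_transform to each original training image yields the
# contrastive examples whose accuracy ContRE compares with the test accuracy
```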
4.2. Experiment Results
Transformation | CIFAR-10 | CIFAR-100 | ImageNet |
---|---|---|---|
Center Crop | 0.702 | 0.677 | 0.953 |
Random Crop | 0.722 | 0.789 | 0.979 |
Random Flip | 0.702 | 0.790 | 0.955 |
Random Rotation | 0.938 | 0.653 | 0.910 |
ContRE (ours) | 0.965 | 0.974 | 0.970 |
Strong correlations between the generalization performance $\mathrm{Acc}(f_\theta, S')$ and the accuracy on contrastive examples $\mathrm{Acc}(f_\theta, \hat{S})$ are found across the three evaluated datasets, CIFAR-10, CIFAR-100 and ImageNet. Here we present the detailed experimental results for these three datasets in comparison with trivial image transformations, such as center crop, random crop, random flip and random rotation.
As shown in Figure 2 (top and middle), on the CIFAR datasets, most of the models have achieved close to 100% accuracy on training samples transformed by crop or flip, which have been used during the training process. For these transformations, the correlations are around 0.7. Random rotation is not used during the training process, yet samples transformed by random rotation do not achieve stable correlations: a high correlation coefficient is obtained for CIFAR-10 but a relatively low value for CIFAR-100. Meanwhile, our proposed approach ContRE consistently outperforms the others by reaching the highest correlation coefficients on both CIFAR datasets.
The results are different for models on ImageNet. Due to the tremendous number of samples and classes, even the ImageNet models with the best validation accuracies fail to perfectly fit the training data. As shown in Figure 2 (bottom), training accuracies based on the listed operations all correlate well with test accuracies. Results given by random crop attain the first place with a correlation of 0.979, followed closely by ContRE's 0.970.
The experimental results show that ContRE obtains high correlation coefficients (Equation 1) of 0.965 and 0.974 on CIFAR-10 and CIFAR-100 and 0.970 on ImageNet, indicating that the grades the models receive under the evaluation of ContRE are significantly correlated with their generalization performances. Note that this observation is hard to obtain from trivial approaches, including random crop, rotation, flip etc., because of randomness and instability. Considering these trivial approaches, "Random Rotation" works well on CIFAR-10 but yields a relatively low correlation coefficient on ImageNet. Similarly, "Random Crop" works well on ImageNet but not on CIFAR-10 or CIFAR-100. Across all three tasks and datasets, however, our approach ContRE grades the models consistently well, probably thanks to the stability gained from the expectation over the randomness. We thus reasonably conclude that ContRE has the potential to estimate the generalization performance of models with the use of training data and contrastive techniques.
4.3. Ablation Studies
To further show the stability and effectiveness of ContRE, we carry out ablation studies on the choices of the two hyper-parameters in RandAugment when generating contrastive examples, and on comparisons between contrastive examples generated by RandAugment and those generated by single data transformations or compositions of two transformations.
4.3.1. Choices of and
In our framework ContRE, two hyper-parameters in RandAugment can be tuned: N denotes the number of transformations applied to the original examples, and M denotes the magnitude of distortions. Though both represent the amount of distortion, their effects on ContRE differ. As shown in Table 2, we have tested all compositions of N and M. A large magnitude of each distortion (M = 15) leads to a higher correlation, and a further increase from 15 to 20 does not change much. The results match our previous observations, implying that a weak distortion of the samples cannot help to tell good classifiers from bad ones. On the other hand, applying more transformations at the same time seems to hurt the performance, especially when M is small. Therefore, our approach adopts a large value of M and a smaller value of N, which proves to be quite robust across various tasks and datasets.
M \ N | N=2 | N=3 | N=4 |
---|---|---|---|
M=4 | 0.837 | 0.763 | 0.503 |
M=9 | 0.855 | 0.833 | 0.714 |
M=15 | 0.974 | 0.974 | 0.960 |
M=20 | 0.965 | 0.978 | 0.938 |



4.3.2. Single Transformations
ContRE generalizes the idea of contrastive learning to the evaluation of DNN models' robustness and generalization performance. However, is it really necessary to use a stochastic module for generating the contrastive examples in ContRE? Is there any single transformation that could achieve the desired results? We answer these questions by comparing the effects of ContRE and of each single contrastive technique.
Following the same experimental setup, contrastive examples are generated under each single transformation with the same magnitude as in ContRE, and the experiments are performed using all DNN models considered previously. The obtained Spearman correlation coefficients range from -0.297 to 0.988. While some single transformations yield correlations higher than 0.9, especially for ImageNet-trained models, no single operation performs well across all three datasets. Therefore, choosing the best transformation depends on the dataset, and it is difficult to find a single transformation that works well across different datasets. In contrast, ContRE introduces robustness and consistently achieves high performance through the expectation over the stochastic module and by combining several competent candidates.
4.3.3. Compositions of Two Transformations
To further demonstrate the effectiveness and efficiency of ContRE, we conduct similar experiments on the CIFAR datasets using contrastive examples generated by all possible compositions of two transformations, obtain the accuracy on these contrastive examples, and show the Spearman correlations with the testing accuracy in Figure 4. Some observations are shared with the results from the single-transformation comparison: any composition with "Posterize" does not work well for the CIFAR datasets, and any composition with "Rotate" works well for CIFAR-10 but not for CIFAR-100. A small portion of compositions obtain higher correlations than ContRE. We argue that searching for the optimal composition demands substantial computation and that the optimal choice varies across datasets, while ContRE benefits from the expectation over the stochastic module to consistently produce precise estimation results. We will show in the following that ContRE is consistently applicable, efficient and effective in many practical, complex situations.
4.4. Evaluations for Applicability Analyses
We have validated our proposed method ContRE on three popular datasets with standard training processes. In this subsection, we carry out experiments with practical settings and demonstrate the applicability of ContRE in real-world situations.

Transformation | Hyperparameter | Subset | RandAugment |
---|---|---|---|
Center Crop | 0.268 | 0.465 | 0.723 |
Random Crop | 0.221 | 0.535 | 0.738 |
Random Flip | 0.268 | 0.583 | 0.978 |
Random Rotation | 0.682 | 0.055 | 0.960 |
ContRE (ours) | 0.861 | 0.754 | 0.916 |
4.4.1. Across Hyper-parameter Choices
A direct application of ContRE is to find the best model among models with the same architecture but trained with different hyper-parameters. We have trained 18 models, all with the Wide-ResNet-26-10 architecture, on CIFAR-10 with various choices of batch size (32, 64, 128), learning rate (0.01, 0.05, 0.1) and weight decay (0.0001, 0.001). On the resulting 18 models, ContRE outperforms other approaches by achieving a correlation coefficient of 0.859; see the results in Figure 5 (left) and the comparison with other approaches in Table 3. The results illustrate that, besides helping to select the best architecture, our approach can be helpful in the hyper-parameter-tuning stage.
4.4.2. With Subset of Training Data
In some practical settings, training data are very scarce, whereas the experiments shown so far use large datasets with over 100 samples per class. We therefore test the robustness of ContRE by reducing the training data to a small fraction (20%) and retraining the DNN models. When provided with 20% of the training data, models have limited information about the true data distribution and their generalization performances drop by 3-10 percentage points. Under this setting, ContRE outperforms trivial approaches by achieving a correlation coefficient of 0.754. The results indicate the potential of ContRE for predicting model performance when models are not trained with abundant data.
4.4.3. Across RandAugment-Trained Models
The previous experiments showed high correlations between the generalization performance and the accuracy on contrastive examples for DNN models trained with standard data augmentation methods, while the "Random Crop" and "Random Flip" transformations used in the standard training process do not provide the desired results in our framework. For over-parametric DNN models, including more data augmentation methods usually improves the generalization performance. We thus ask whether the correlation obtained by ContRE mainly benefits from image transformation methods that are not used during training, and whether ContRE still performs well when the generated contrastive examples have already been seen during training.
To answer these questions, we have trained the same set of DNN models on CIFAR-10, replacing the standard data augmentation with the RandAugment module while leaving other training settings unchanged. We then repeat the ContRE procedure and find that ContRE still achieves a correlation higher than 0.9, with the results in Figure 5 (right) and comparisons in Table 3. We note that "Random Flip" and "Random Rotation" obtain higher correlations than ContRE. We argue that single transformations can reach high correlations if they are not used during training, but their correlation coefficients decrease sharply if they have been used during training, as shown in the previous results. In contrast, ContRE combines a number of single transformations to generate contrastive examples and gains robustness from the expectation over the stochastic module. Even for models that have seen the contrastive examples during the training stage, ContRE can still successfully estimate the generalization performance.
4.4.4. Generalization Gap Prediction (Jiang et al., 2020a)
ContRE directly estimates the generalization performance rather than the generalization gap (the difference between the training accuracy and the generalization performance), and the previous experiments mainly discussed estimates of the generalization performance. Meanwhile, under the common condition that all DNN models fit the training set equally well (Jiang et al., 2020b), ContRE is also capable of providing a good estimator of the gap.
To further demonstrate the feasibility of ContRE, we use the experiment settings of the "Predicting Generalization in Deep Learning Competition" at NeurIPS 2020 (Jiang et al., 2020a) to evaluate ContRE on predicting the generalization gap of DNN models using the contrastive examples of the training set. The competition offers a large number of deep models with various hyper-parameters and DNN architectures trained on CIFAR-10 or SVHN. The evaluator of the competition first estimates the generalization performance of every model using the proposed measure, then computes the mutual information score (the higher the better) between the proposed measures and the (observable) ground truth of the generalization gaps.
In the experiments, we propose to use the difference between the accuracy on the original training samples and the accuracy on the contrastive examples generated from them with ContRE. For comparison, we include several baseline generalization gap predictors, including the VC dimension (Vapnik, 2013), the Jacobian norm w.r.t. intermediate layers (Jiang et al., [n.d.]), the distance from the convergence point to initialization (Nagarajan and Kolter, 2019), and the sharpness of the convergence point (Jiang et al., 2019b); all of these baselines provide a theoretical bound for the estimated gap. Table 4 presents the comparisons between the proposed measure and the baselines. It shows that the proposed ContRE framework, estimating the gap as the difference between the accuracy on original training samples and the accuracy on contrastive examples, significantly outperforms the baseline methods. Note that our proposed approach ContRE is an efficient and effective estimator of the generalization performance in practice, and that ContRE is a good estimator of the generalization gap on the condition that the training accuracy is controlled at the same level.


Methods | Prediction Score |
---|---|
VC Dimension (Vapnik, 2013) | 0.020 |
Jacobian norm w.r.t intermediate layers (Jiang et al., [n.d.]) | 2.061 |
Distance to Initialization (Nagarajan and Kolter, 2019) | 4.921 |
Sharpness of the convergence point (Jiang et al., 2019b) | 10.667 |
ContRE (ours) | 13.531 |
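In code, the gap predictor used in this setting reduces to a difference of two accuracies; a minimal sketch, reusing the hypothetical `evaluate` helper and loader names introduced in Section 3:

```python
# estimate the generalization gap by how much accuracy drops when the
# training images are replaced by their contrastive (RandAugment) views
acc_train, _       = evaluate(model, original_train_loader)
acc_contrastive, _ = evaluate(model, contrastive_train_loader)
gap_estimate = acc_train - acc_contrastive
```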
4.5. Analyses
We investigate partial correlations controlling for the confounding variable, as well as feature-level correlations, to dissect the reasons that ContRE works for estimating DNN models' robustness and generalization performance.
4.5.1. Partial Correlations
Transformation | CIFAR-10 | CIFAR-100 | ImageNet |
---|---|---|---|
Random Crop | 0.241 | 0.551 | 0.766 |
Random Flip | N/A* | 0.619 | 0.228 |
Random Rotation | 0.887 | 0.190 | 0.566 |
ContRE (ours) | 0.938 | 0.962 | 0.774 |

*The correlation between the accuracy on training examples after random flip and the original training accuracy is 1, leading to a division by 0 in Equation (2); the same holds for "Center Crop" on the training set, which is used as the training accuracy.
We observe that the training accuracy is well correlated with the testing accuracy on ImageNet, where even the best models cannot fit the training set. This shows that the training accuracy might be a good estimator of the generalization performance in some cases. On the other hand, the training accuracy is also related to the accuracy on contrastive examples, because the contrastive examples are generated directly from the training set. A question naturally follows: is the training accuracy the key factor that makes ContRE a good estimator?
To answer this question, we first define the notion of consistency as the difference between the training accuracy and the accuracy on contrastive examples generated from the training set, $C(f_\theta) = \mathrm{Acc}(f_\theta, S) - \mathrm{Acc}(f_\theta, \hat{S})$. The testing accuracy and the accuracy on contrastive examples share a common variable, i.e., the training accuracy $\mathrm{Acc}(f_\theta, S)$, as shown in the decomposition below:

(5)   $\mathrm{Acc}(f_\theta, S') = \mathrm{Acc}(f_\theta, S) - \mathrm{gap}(f_\theta), \qquad \mathrm{Acc}(f_\theta, \hat{S}) = \mathrm{Acc}(f_\theta, S) - C(f_\theta)$
Controlling for the effect of this confounding variable, the training accuracy $\mathrm{Acc}(f_\theta, S)$, we compute the partial correlation between the accuracy on contrastive examples $\mathrm{Acc}(f_\theta, \hat{S})$ and the testing accuracy $\mathrm{Acc}(f_\theta, S')$. Directly estimating the gap with the consistency fails on ImageNet because models fit the ImageNet training set to varying degrees. This is not a practical issue for two reasons: (1) estimating the gap is less needed if the generalization performance is strongly correlated with the training accuracy; (2) few datasets are of the scale of ImageNet. Nevertheless, the results in Table 5 show that ContRE still obtains relatively strong (partial) correlations on all three datasets, while a moderate difference is observed between the CIFAR datasets and ImageNet. We note that in the case of ImageNet, where the best model cannot perfectly fit the training set, the training accuracy is directly correlated with the testing accuracy and plays an important role in generalization performance estimation. However, for the CIFAR datasets, where all (or some) models reach 100% training accuracy, the training accuracy loses its predictability, and the partial correlations controlling for the training accuracy do not differ much from the standard correlations. In summary, the training accuracy is not the key factor behind ContRE's good generalization performance estimation in most cases.
Transformation | CIFAR-10 | CIFAR-100 | ImageNet |
---|---|---|---|
Center Crop | 0.495 | 0.754 | 0.662 |
Random Crop | 0.455 | 0.780 | 0.686 |
Random Flip | 0.495 | 0.754 | 0.662 |
Random Rotation | 0.736 | 0.640 | 0.774 |
ContRE (ours) | 0.767 | 0.890 | 0.816 |
4.5.2. Feature-Level Correlations
For complex DNN models, we further conduct analysis experiments by looking into the intermediate features of DNN models computed on the original and contrastive examples. For these examples, we compute the Fisher ratios of the features before the classifier for all models and compute the Spearman correlations with the test accuracies. Note that for the ImageNet and CIFAR-100 datasets, the number of data instances in each class is smaller than the dimensionality of the feature vectors when the DNN model is wide (e.g., DenseNet), making the within-class matrix non-invertible in the computation of the Fisher ratio. We therefore reduce the dimensionality of the feature vectors using singular value decomposition (SVD), down to 64 dimensions for CIFAR-100 and 512 for ImageNet, which retains over 95% of the total variance for most models.
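The dimension reduction step can be sketched with scikit-learn's TruncatedSVD, as below; the target dimensionalities follow the setting described above, while the function name and the reuse of the earlier `fisher_ratio` helper are illustrative assumptions.

```python
from sklearn.decomposition import TruncatedSVD

def reduce_features(features, n_components):
    """Project features onto their top singular directions so that the
    within-class scatter matrix becomes invertible."""
    svd = TruncatedSVD(n_components=n_components)
    reduced = svd.fit_transform(features)
    print(f"retained variance: {svd.explained_variance_ratio_.sum():.1%}")
    return reduced

# e.g., CIFAR-100 features reduced to 64 dimensions before the Fisher ratio:
# fisher_ratio(reduce_features(feat_Shat.numpy(), 64), labels)
```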
The results in Figure 6 (a) show that the Fisher ratios from ContRE are highly correlated with the generalization performance of DNN models on each of the three datasets. We also compute the results for models with the same architecture but different hyper-parameters, shown in Figure 6 (b); the conclusion also holds for models trained with different hyper-parameters. The overall numerical comparison is shown in Table 6. It suggests that, among contrastive examples based on various operations, models that are able to produce better class-separated features from ContRE samples are more likely to have better generalization performance. Intuitively, the Fisher ratio, which can be seen as a measure of clustering quality, is highly correlated with the final classification accuracies. Therefore, the high correlation between ContRE accuracies and generalization performance inherently comes from the association between ContRE accuracies and the degree of separation of the corresponding features, and from that between the Fisher ratio of the features and model generalization performance.
5. Conclusion and Discussion
In this work, we build a framework ContRE to generate contrastive examples and measure the consistency of DNN models' behaviors with contrastive examples or training examples as input. This consistency can be exploited to estimate the generalization performance of DNN models, based on the assumption that robust DNN models with good generalization performance tend to give consistent features and predictions for the same image under varying data transformations. We adopt RandAugment to generate contrastive examples and have practically verified that the proposed ContRE is able to consistently estimate the generalization performance of various DNN models on three benchmark datasets. Systematic ablation studies and thorough analyses have also been provided to demonstrate the versatility of ContRE in complex real-world situations and to dissect the reasons that ContRE works well for this estimation.
We further discuss the potential limitations of our proposed framework in estimating generalization performance: (1) The data transformations may spoil the data and, in the worst case, reduce predictions to random guessing. We show that the transformations used by ContRE do not significantly affect the generalization estimation by additionally reporting the correlations between accuracies on transformed training and transformed testing data, i.e., $\mathrm{Acc}(f_\theta, T(S))$ and $\mathrm{Acc}(f_\theta, T(S'))$ (0.983, 0.903 and 0.989 for CIFAR-10, CIFAR-100 and ImageNet respectively); for further extensions, however, transformations should be chosen cautiously. (2) Theoretical analyses of stability and local elasticity (Bousquet and Elisseeff, 2002; Deng et al., 2020; He and Su, 2019) seem related to our approach, but theoretical links among contrastive examples, generalization performance and these notions are not yet available. (3) Lacking appropriate data transformations, it would be difficult to extend ContRE to other data formats, such as text, audio, and graphs. Future investigation of existing transformations and exploration of new ones would address this limitation.
References
- Arora et al. (2018) Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. 2018. Stronger Generalization Bounds for Deep Nets via a Compression Approach. In International Conference on Machine Learning (ICML). 254–263.
- Bartlett et al. (2017) Peter L. Bartlett, Dylan J. Foster, and Matus J. Telgarsky. 2017. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems (NeurIPS). 6241–6250.
- Bousquet and Elisseeff (2002) Olivier Bousquet and André Elisseeff. 2002. Stability and generalization. The Journal of Machine Learning Research 2 (2002), 499–526.
- Caron et al. (2020) Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. 2020. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 33. 9912–9924.
- Chen et al. (2020b) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020b. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (ICML).
- Chen et al. (2020c) Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. 2020c. Big Self-Supervised Models are Strong Semi-Supervised Learners. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 33. 22243–22255.
- Chen et al. (2020a) Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. 2020a. Improved Baselines with Momentum Contrastive Learning. arXiv preprint arXiv:2003.04297 (2020).
- Corneanu et al. (2020) Ciprian A. Corneanu, Sergio Escalera, and Aleix M. Martinez. 2020. Computing the Testing Error Without a Testing Set. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Cubuk et al. (2018) Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudeva, and Quoc V Le. 2018. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501 (2018).
- Cubuk et al. (2020) Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. 2020. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 702–703.
- Deng et al. (2020) Zhun Deng, Hangfeng He, and Weijie J Su. 2020. Toward Better Generalization Bounds with Locally Elastic Stability. arXiv preprint arXiv:2010.13988 (2020).
- Eghbal-zadeh et al. (2020) Hamid Eghbal-zadeh, Khaled Koutini, Paul Primus, Verena Haunschmid, Michal Lewandowski, Werner Zellinge, Bernhard A. Moser, and Gerhard Widmer. 2020. On Data Augmentation and Adversarial Risk: An Empirical Analysis. arXiv preprint arXiv:2007.02650 (2020).
- Goodfellow et al. (2014) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014).
- Hadsell et al. (2006) Raia Hadsell, Sumit Chopra, and Yann LeCun. 2006. Dimensionality reduction by learning an invariant mapping. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- He and Su (2019) Hangfeng He and Weijie J Su. 2019. The local elasticity of neural networks. arXiv preprint arXiv:1910.06943 (2019).
- He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 9729–9738.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Henaff et al. (2020) Olivier Henaff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, S. M. Ali Eslami, and Aaron van den Oord. 2020. Data-efficient image recognition with contrastive predictive coding. In International Conference on Machine Learning (ICML). 4182–4192.
- Howard et al. (2017) Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
- Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. 2017. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4700–4708.
- Ilyas et al. (2019) Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. 2019. Adversarial Examples are not Bugs, they are Features. In Advances in Neural Information Processing Systems (NeurIPS).
- Jiang et al. ([n.d.]) Yiding Jiang, Pierre Foret, Scott Yak, Behnam Neyshabur, Isabelle Guyon, Hossein Mobahi, Gintare Karolina, Daniel Roy, Suriya Gunasekar, and Samy Bengio. [n.d.]. Predicting Generalization in Deep Learning, Competition at NeurIPS 2020. https://sites.google.com/view/pgdl2020/home?authuser=0
- Jiang et al. (2020a) Yiding Jiang, Pierre Foret, Scott Yak, Daniel M Roy, Hossein Mobahi, Gintare Karolina Dziugaite, Samy Bengio, Suriya Gunasekar, Isabelle Guyon, and Behnam Neyshabur. 2020a. NeurIPS 2020 Competition: Predicting Generalization in Deep Learning. arXiv preprint arXiv:2012.07976 (2020).
- Jiang et al. (2019a) Yiding Jiang, Dilip Krishnan, Hossein Mobahi, and Samy Bengio. 2019a. Predicting the Generalization Gap in Deep Networks with Margin Distributions. In International Conference on Learning Representations (ICLR).
- Jiang et al. (2019b) Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. 2019b. Fantastic generalization measures and where to find them. arXiv preprint arXiv:1912.02178 (2019).
- Jiang et al. (2020b) Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. 2020b. Fantastic Generalization Measures and Where to Find Them. In International Conference on Learning Representations (ICLR).
- Keskar et al. (2016) Nitish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tang. 2016. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. arXiv preprint arXiv:1609.04836 (2016).
- Liang et al. (2017) Tengyuan Liang, Tomaso A. Poggio, Alexander Rakhlin, and James Stokes. 2017. Fisher-Rao Metric, Geometry, and Complexity of Neural Networks. arXiv preprint arXiv:1711.01530 (2017).
- Ma et al. (2018) Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. 2018. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV). 116–131.
- McAllester (1999) David A. McAllester. 1999. PAC-Bayesian Model Averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory (COLT ’99). Association for Computing Machinery, New York, NY, USA, 164–170.
- Nagarajan and Kolter (2019) Vaishnavh Nagarajan and J. Zico Kolter. 2019. Generalization in Deep Networks: The Role of Distance from Initialization. arXiv preprint arXiv:1901.01672 (2019).
- Neyshabur et al. (2017) Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. 2017. Exploring Generalization in Deep Learning. In Advances in Neural Information Processing Systems (NeurIPS). 5949–5958.
- Neyshabur et al. (2015) Behnam Neyshabur, Ruslan Salakhutdinov, and Nathan Srebro. 2015. Path-SGD: Path-Normalized Optimization in Deep Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS). 2422–2430.
- Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
- Recht et al. (2019) Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. 2019. Do imagenet classifiers generalize to imagenet?. In Proceedings of ICML. 5389–5400.
- Sandler et al. (2018) Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. Mobilenetv2: Inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Simonyan and Zisserman (2015) Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR).
- Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013).
- Tan et al. (2019) Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. 2019. MnasNet: Platform-Aware Neural Architecture Search for Mobile.. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2820–2828.
- Tian et al. (2019) Yonglong Tian, Dilip Krishnan, and Phillip Isola. 2019. Contrastive multiview coding. arXiv preprint arXiv:1906.05849 (2019).
- Unterthiner et al. (2021) Thomas Unterthiner, Daniel Keysers, Sylvain Gelly, Olivier Bousquet, and Ilya Tolstikhin. 2021. Predicting Neural Network Accuracy from Weights. arXiv preprint arXiv:2002.11448 (2021).
- Vapnik (2013) Vladimir Vapnik. 2013. The Nature of Statistical Learning Theory. Springer Science & Business Media.
- Xie et al. (2017) Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Yak et al. (2019) Scott Yak, J. Gonzalvo, and Hanna Mazzawi. 2019. Towards Task and Architecture-Independent Generalization Gap Predictors. arXiv preprint arXiv:1906.01550 (2019).
- Zagoruyko and Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. 2016. Wide Residual Networks. In Proceedings of the British Machine Vision Conference (BMVC). 87.1–87.12.
- Zhang et al. (2016) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2016. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530 (2016).
- Zhang et al. (2018) Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. 2018. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).