Corresponding author: Vamshi C. Madala (email: [email protected]).
CNNs Avoid the Curse of Dimensionality by Learning on Patches
Abstract
Despite the success of convolutional neural networks (CNNs) in numerous computer vision tasks and their remarkable generalization performance, attempts to predict the generalization error of CNNs have so far been limited to a posteriori analyses. A priori theories explaining the generalization performance of deep neural networks have mostly ignored convolutionality and do not specify why CNNs seemingly overcome the curse of dimensionality on computer vision tasks such as image classification, where the input dimensions are in the thousands. Our work explains the generalization performance of CNNs on image classification under the hypothesis that CNNs operate on the domain of image patches, derives an a priori bound on the generalization error, and presents both quantitative and qualitative evidence in support of our theory.
Index Terms: a priori analysis, convolutional neural networks, curse of dimensionality, generalization error.
1 INTRODUCTION
Convolutional neural networks (CNNs) are known to overcome the curse of dimensionality (CoD) in practice [25]. Many approaches have been proposed to explain this. A popular one hypothesizes that the data lies on a manifold whose dimension is much smaller than the input dimension [10]; however, these dimensions, which have been shown to be in the range of ~50-100 for CNNs on the CIFAR-10 data set, are still too large to overcome the CoD using just 50k training samples [27]. Another approach proves that compositionality of the data avoids the CoD [26], but knowing the right compositionality is not always possible, e.g., on image classification data sets. A priori theories that derive generalization error bounds for deep neural networks either deal only with special classes of functions, such as maximum, indicator, and piece-wise polynomial functions [33], or functions whose gradient has an integrable Fourier transform [1], or derive bounds specifically for deep neural networks with piece-wise linear activation functions [36], but they do not explicitly consider convolutional neural networks. Previous works have also tried to use the locality and shift invariance of convolutions to explain how CNNs overcome the CoD [9].
We theorize that, in the image classification setting, CNNs avoid the CoD by learning noisy labels on patches of images rather than on whole images, which have a much larger dimensionality than the patches. We show that a simple strategy such as averaging the predictions on these patches is enough to overcome the noise. Under these assumptions we derive an a priori upper bound on the generalization error of CNNs and show empirical evidence that the error follows the bound closely on popular image classification data sets. Ours is the first work we are aware of that provides an a priori numerical estimate of the generalization error of CNNs.
To test our theory, we explicitly decompose images into patches and use them as inputs to train CNNs. When target labels for individual patches are not available, as in classification data sets where the class labels describe whole images, we assign the label of the parent image to each patch in the image and train the network; e.g., all the patches of a cat image are also assigned the cat label. During inference, we average the patch-wise outputs (logits) over all the patches of a test image and take that as the model's prediction for the image, as illustrated in Figure 1. Our results show that with this simple approximation, CNNs trained only on patches achieve non-trivial accuracies on multiple image classification data sets: a ResNet18 trained on the CIFAR-10 data set reaches 66.7% accuracy using only tiny patches and 84.2% using slightly larger ones (see Section 4). We also visualize the patch-wise activations at the location of each patch in the parent image, giving a heat map for any given class, as shown in Figure 1. We find that the heat maps of CNNs trained in the standard way (on full images) are nearly identical to those of CNNs trained on patches, giving further evidence for our theory that CNNs operate on image patches instead of the full image domain, thus avoiding the CoD.
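To make the inference rule concrete, the following sketch (our own illustration, not released code) evaluates a PyTorch model on every patch of each image and averages the logits; it assumes a network that accepts patch-sized inputs (e.g., a CIFAR-style ResNet), and the function name is ours.

```python
import torch

@torch.no_grad()
def average_patch_logits(model, images, patch_size=8, stride=1):
    """Evaluate `model` on every patch of each image and average the
    patch-wise logits to get one prediction per image."""
    B, C, H, W = images.shape
    # (B, C, nH, nW, p, p): all patches of each image at the given stride
    patches = images.unfold(2, patch_size, stride).unfold(3, patch_size, stride)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(-1, C, patch_size, patch_size)
    logits = model(patches)                      # (B * nH * nW, num_classes)
    logits = logits.reshape(B, -1, logits.shape[-1])
    return logits.mean(dim=1)                    # (B, num_classes)

# Image-level prediction: the class with the highest averaged logit.
# preds = average_patch_logits(resnet18, batch).argmax(dim=1)
```

For large images the `patches` tensor can be chunked before the forward pass; the averaging rule itself is unchanged.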

We use our theory to derive an a priori upper bound on the generalization error for classification tasks that depends on various parameters of a data set, such as the number of training samples, image resolution, and number of classes. This differs from traditional a posteriori bounds, which estimate the generalization error using the trained weights of CNNs [15].
Motivated by our results on classification tasks, we also train CNNs on semantic segmentation tasks using our patch-based approach, where pixel-wise labels are available, and observe that they perform similarly to CNNs trained on whole images, providing further empirical evidence for our theory.
In summary:
- In Section 3 we present our theory of how CNNs, by operating on the smaller dimensions of patches and aggregating the predictions, can overcome the CoD, and we derive the a priori generalization error bound for such CNNs on image classification.
- In Section 4 we report the results of our experiments on training CNNs using only patches and compare the average patch-wise accuracies with the derived upper bound on various data sets for different patch sizes, giving evidence for our theory. We also compare the accuracies of these models with those of CNNs trained conventionally on whole images.
- In Section 5 we use heat maps to make the comparison qualitatively between CNN models trained using patches, models trained in the standard way, and pre-trained models.
- Finally, in Section 6 we show results from semantic segmentation for CNNs trained using patches.
2 Related work
There have been many a posteriori generalization error bounds proposed for CNNs: Long and Sedghi [21] prove generalization error bounds that depend on the training loss, weights, and other parameters of CNNs; Lin and Zhang [19] propose bounds that depend on the spectral norm of the weights; Zhou and Feng [39] prove bounds that depend on the architectural parameters of CNNs; and Jiang et al. [15] provide a comprehensive review of such bounds. As for a priori theories, Barron [1] derives a generalization error bound for a special class of functions whose gradient has an integrable Fourier transform. Telgarsky [33] derives error bounds and shows why deep neural networks are essential for approximating a class of functions including maximum, indicator, and piece-wise polynomial functions. A priori theories discussing the representational capacity of DNNs [32, 36] treat CNNs only as a special case of feed-forward neural networks [7, 23, 2], whereas those focusing on CNNs are conditional on special properties of the data set [26, 9, 5] which popular computer vision (CV) data sets have not yet been shown to satisfy. Sokolic et al. [29] discuss the generalization error of invariant classifiers like CNNs under explicit input transformations and obtain an error bound, but they do not extend the discussion to patches, or to how invariance, when applied to the patches of input images, affects the generalization performance of CNNs. Thus, so far there has been no a priori theory that satisfactorily explains the remarkable generalization performance of CNNs on practical CV tasks. The bound proposed using our theoretical model gives an a priori numerical estimate of the generalization error, and our experiments show that CNNs closely follow this bound on many popular CV data sets such as CIFAR-10, CIFAR-100, STL-10 and ImageNet-1k.
On the empirical side, Sun et al. [30] conducted large-scale experiments on huge data sets to study the effect of parameters such as data set size on the generalization error, and showed, on select vision data sets, that the error improves logarithmically with the size of the training data. Similar large-scale studies of language models also obtain empirical estimates of generalization error and find similar dependencies on data set size [14, 16]. These empirical estimates show encouraging similarities with our theoretical a priori bound.
Poggio et al. [26] is theoretically the closest work to this paper. They show that deep CNNs can avoid the curse of dimensionality for hierarchically compositional, spatially local functions. CNNs definitely fall into this category, but the convolutional nature of the data is not fully exploited in their analysis. In particular, knowledge of the right compositional model is needed for their results to be applicable.
Early work on unsupervised learning by Coates et al. [4] shows that "convolutionally extracted" features achieve good performance on unsupervised tasks because the similarity between the extracted patches/features makes them easier to cluster using k-means. This correlates well with our theory that the labels predicted from convolutionally extracted features on small patches give CNNs their advantage; however, no theoretical model was proposed to explain it.
Using random crops of images as inputs to CNNs, as we do in our experiments, is a common regularization strategy. Regularization strategies such as CutMix [37], which pastes random patches between images during training, and Cutout [6], which removes random patches from training images, are known to improve test accuracies, but it is not fully understood how CNNs can learn to distinguish between classes when the differences exist only at the patch level. The most common data augmentation strategy, random cropping, is also known to improve the generalization of CNNs, but in practice the crop size and location are usually chosen so that class objects are preserved after cropping. Touvron et al. [34] show that cropping training images improves test accuracy on ImageNet, but their strategy likewise adjusts the cropping so as to include objects of the class in the cropped image. Our theory offers a simple explanation for why such strategies are effective, and in our experiments we push the boundary by also training on patches so small that the assigned class labels are no longer semantically meaningful. Our heat map results show that, by learning patch-wise labels, CNNs can distinguish patches/pixels belonging to the object class from those of the background.
A very closely related paper is that of Brendel and Bethge [3], who propose the BagNet architecture. In BagNet, a modified ResNet backbone is applied to each patch of the image, and the output is a class heat map. The average of these heat maps is then fed into a linear classifier. The heat maps highlight the patches that contribute to the network's decision in a straightforward way because of the linear classifier, improving explainability but at the cost of some accuracy due to the modified architecture. The first main difference from our work is that, during training, they take the output of the linear classifier, i.e., the weighted aggregate of the individual decisions on patches, as the network's prediction for the image, so it is not directly evident what the network considers the labels of the individual patches to be. We, on the other hand, train standard CNNs to classify each patch as belonging to its parent class, making it very clear that our heat maps are the network's predictions on individual patches. We also show, using the evidence of heat maps, that CNNs trained with standard training procedures learn the labels on patches implicitly, thereby providing explainable decisions for traditional CNN architectures without any loss of accuracy. Finally, we provide a theoretical model that explains why operating on the domain of patches is critical for avoiding the CoD.
3 A priori generalization error bound
Given our experimental observations that CNNs "learn" the labels of patches rather than of full images, we note that this improves the generalization error in two ways: i) by reducing the dimensionality of the input domain, since the number of pixels in a patch is much smaller than in the full image, and ii) by increasing the input sample rate, since each image yields one patch per patch location (with the notation introduced below, an $H \times W$ image yields $(H - H_T + 1)(W - W_T + 1)$ patches of size $H_T \times W_T$). Both effects help mitigate the curse of dimensionality. Accordingly, in this section, based on some reasonable assumptions, we propose an upper bound for the generalization error that shows good experimental correlation with real image data sets and CNNs.
Let $f : X \to Y$ be the true $K$-class classification function mapping samples $x \in X \subset \mathbb{R}^D$ to their labels $y \in Y$. Let $\hat{f}$ be a computed approximation to $f$ obtained from a training data set $S = \{(x_i, y_i)\}_{i=1}^{N}$ that has $N$ training samples. The mean-value theorem and the triangle inequality imply that
$$\|f - \hat{f}\| \;\le\; \left(\|\nabla f\| + \|\nabla \hat{f}\|\right) \eta(S) + \epsilon_t(S) \tag{1}$$
where $\eta(S)$ is the mesh norm defined below and $\epsilon_t(S)$ denotes the training error of $\hat{f}$, measured in an appropriate norm. We assume that the training error primarily depends on the noise in the patch labels, since we simply reuse the label of the parent image; later in this section we provide a reasonable upper bound on this training error. We see from inequality (1) that the CoD is in full effect when the input dimension $D$ is large, and that the error is driven by the mesh norm (see Mhaskar and Poggio [22]):
$$\eta(S) = \sup_{x \in X} \min_{x_i \in S} \|x - x_i\| \tag{2}$$
We can obtain a general case upper bound on the mesh norm by assuming the worst case scenario that the input samples are distributed uniformly in the input space and are drawn i.i.d.:
$$\eta(S) \le c_1 N^{-1/D} \tag{3}$$
where the constant $c_1$ depends on the choice of the norm.
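The following small numerical experiment (our illustration, not part of the original analysis) shows the covering behavior behind (3): for a fixed budget of i.i.d. uniform samples, the worst-case distance from a query point to its nearest sample grows rapidly with the dimension $D$.

```python
import numpy as np

rng = np.random.default_rng(0)

def mesh_norm_estimate(N, D, n_queries=500):
    """Monte Carlo estimate of the mesh norm (2) for N i.i.d. uniform
    samples in [0, 1]^D: the worst observed distance from a random
    query point to its nearest training sample."""
    samples = rng.random((N, D))
    return max(
        np.linalg.norm(samples - rng.random(D), axis=1).min()
        for _ in range(n_queries)
    )

for D in (2, 16, 128):
    # With N fixed, the estimate grows quickly with D -- the curse of
    # dimensionality captured by the N^(-1/D) scaling in (3).
    print(f"D={D:4d}  mesh norm ~ {mesh_norm_estimate(N=10_000, D=D):.3f}")
```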
For images of height $H$, width $W$, and $C$ channels, so that $D = HWC$, the mesh norm becomes
$$\eta(S) \le c_1 N^{-1/(HWC)} \tag{4}$$
Now, for our experimental setting where we train a CNN model using only patches, the input and output subspaces are transformed as $X \to X_T$ and $Y \to Y_T$, respectively, where $X_T \subset \mathbb{R}^{D_T}$ with $D_T = H_T W_T C$, and $H_T$ and $W_T$ are the height and width of the patches. The training data set is correspondingly transformed to $S_T$ with $N_T = N (H - H_T + 1)(W - W_T + 1)$ samples, because each image of size $H \times W$ in the data set produces $(H - H_T + 1)(W - W_T + 1)$ patches of size $H_T \times W_T$, in effect increasing the number of samples in the transformed data set. This gives the mesh norm for these patches as:
$$\eta(S_T) \le c_1 N_T^{-1/(H_T W_T C)} \tag{5}$$
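A quick back-of-the-envelope check with CIFAR-10-sized numbers (our illustration, under the worst-case uniform-i.i.d. assumption of (3)) shows how much the patch decomposition tightens the mesh-norm bound:

```python
H = W = 32; C = 3; N = 50_000             # CIFAR-10-sized data set
Ht = Wt = 8                               # 8x8 patches, for illustration
n_patches = (H - Ht + 1) * (W - Wt + 1)   # 625 patches per image
N_T = N * n_patches                       # 31,250,000 training patches

print(N ** (-1 / (H * W * C)))       # eq. (4): ~0.997, CoD in full force
print(N_T ** (-1 / (Ht * Wt * C)))   # eq. (5): ~0.914, noticeably smaller
```

Both numbers are close to 1, but the patch decomposition moves the bound in the right direction by simultaneously shrinking the exponent's denominator and multiplying the sample count.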
We introduce two new parameters, $\alpha$ and $\beta$, without shrinking this upper bound. The parameter $\alpha$ controls the effective number of total patches in the data set and acts similarly to the stride parameter popular in convolutional kernels; it accounts for the fact that nearby patches have large overlapping regions and are less likely to contribute to an improvement in the generalization error. The total effective number of samples thus becomes $N_P N$, where $N_P = (H - H_T + 1)(W - W_T + 1)/\alpha^2$ denotes the effective number of distinct patches per image. We let $\beta \le C$ control the effective dimension of the patches, since the different channels of an image are also likely correlated; in our experiments we use $\beta = 1$. The mesh norm is thus re-parameterized as:
$$\eta_{\alpha,\beta}(S_T) \le c_1 \left(N_P N\right)^{-1/(H_T W_T \beta)} \tag{6}$$
We assume that the new approximation $\hat{f}_T$ trained on patches is rougher than $\hat{f}$, and consequently assume that the gradient norm in inequality (1) depends on the patch size and dimension via a power law, and that it is much larger than the gradient norm of the true function $f$, whose properties are unknown, i.e.
$$\|\nabla f\| + \|\nabla \hat{f}_T\| \approx \|\nabla \hat{f}_T\| \le c_2 \left(\frac{H W}{H_T W_T}\right)^{m_1^{\downarrow}(D_T)} \tag{7}$$
where $m_1^{\downarrow}$ is a monotonically decreasing function; in our results we use a simple power-law choice for it. So at full image size the upper bound on the derivative is a known constant $c_2$, but at smaller patch sizes this norm grows. In higher dimensions the bound grows more slowly, because there is more room to accommodate the roughness of the classification function.
We use the label of a training image as the label for its patches, no matter how small the patch is or where it is drawn from, so these labels are intrinsically noisy. We therefore model the training error on the patch data set $S_T$ as bounded by a quantity that is inversely proportional to the area of the patch. So for some constant $c_4$:
$$\epsilon_t(S_T) \le c_4\, m_2^{\uparrow}(K)\, m_3^{\downarrow}\!\left(\frac{H_T W_T}{H W}\right) \tag{8}$$
where $m_2^{\uparrow}$ is a monotonically increasing and $m_3^{\downarrow}$ a monotonically decreasing function. In our results we use simple choices for both, with $m_3^{\downarrow}(1) = 0$, arguing that for $H_T W_T = H W$, i.e., when full images are used, the labels are noiseless. It should be noted that our choices for $m_1^{\downarrow}$, $m_2^{\uparrow}$, and $m_3^{\downarrow}$ are very preliminary, but our experimental results show that these simple approximations satisfactorily model the generalization error across different image classification data sets and CNN architectures. We also note that our approximation of the training error does not account for the number of parameters in the CNN. In fact, as $N_T \to \infty$, if the number of parameters in the CNN also grows at a suitable rate, one would expect the training error to go to zero, which is not reflected by equation (8). In future work, we plan to further refine these approximations.
Finally, the new bound on the generalization error for the model $\hat{f}_T$ trained on the data set $S_T$ and measured on samples drawn randomly from the patched subspace $X_T$ can be written as:
$$\epsilon_{gen,T} \;\le\; c_1 c_2 \left(\frac{H W}{H_T W_T}\right)^{m_1^{\downarrow}(D_T)} \left(N_P N\right)^{-1/(H_T W_T \beta)} + c_4\, m_2^{\uparrow}(K)\, m_3^{\downarrow}\!\left(\frac{H_T W_T}{H W}\right) \tag{9}$$
Inequality (9) bounds the generalization error for a model trained on patches and tested on patches. However, what we are actually interested in is the generalization performance of the model on unseen full-sized images. We make the simple assumption that the prediction on a full image is the average of the patch-wise predictions:
$$f(x) = \frac{1}{P} \sum_{j=1}^{P} f_T(x_j), \qquad \hat{f}(x) = \frac{1}{P} \sum_{j=1}^{P} \hat{f}_T(x_j) \tag{10}$$
where the $x_j$ are the patches of the image $x$, and accordingly,
$$\|f(x) - \hat{f}(x)\| \;\le\; \frac{1}{P} \sum_{j=1}^{P} \|f_T(x_j) - \hat{f}_T(x_j)\| \;\le\; \epsilon_{gen,T} \tag{11}$$
We now assume that there are, on average, $P$ uncorrelated patches in an image. Then one can expect the average upper bound on the generalization error to be:
$$\bar{\epsilon}_{gen} \;\le\; \frac{1}{\sqrt{P}}\, \epsilon_{gen,T} \tag{12}$$
where $P = N_P$ is the effective number of patches in an image.
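The reduction assumed in (12) is the usual behavior of averaged uncorrelated errors, which the following toy simulation (ours, under a zero-mean Gaussian error model) confirms:

```python
import numpy as np

rng = np.random.default_rng(0)
P, trials, sigma = 625, 10_000, 1.0
# P uncorrelated zero-mean patch-wise errors of scale sigma: the error of
# their average concentrates at scale sigma / sqrt(P), as assumed in (12).
avg_err = rng.normal(0.0, sigma, size=(trials, P)).mean(axis=1)
print(avg_err.std(), sigma / np.sqrt(P))   # both ~0.04
```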
And now, from (1), (6) and (9), we finally get
$$\epsilon_{gen} \;\le\; \frac{1}{\sqrt{N_P}} \left[ c_1 c_2 \left(\frac{H W}{H_T W_T}\right)^{m_1^{\downarrow}(D_T)} \left(N_P N\right)^{-1/(H_T W_T \beta)} + c_4\, m_2^{\uparrow}(K)\, m_3^{\downarrow}\!\left(\frac{H_T W_T}{H W}\right) \right] \tag{13}$$
Since the CNNs we use in our experiments effectively see all patch sizes, from the smallest kernel size up to the maximum size $\min(H, W)$, we take the minimum of the right-hand side of inequality (13) over the considered patch sizes as the generalization error estimate for CNNs. In our experiments we use square patches, $H_T = W_T$, and the stride parameter $\alpha = 4$.
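The following sketch evaluates the right-hand side of (13) and its minimum over patch sizes. The constants and the concrete instantiations of $m_1^{\downarrow}$, $m_2^{\uparrow}$, and $m_3^{\downarrow}$ below are illustrative placeholders with the monotonicity properties stated in the text, not necessarily the exact choices used for the plots in this paper.

```python
import numpy as np

def patch_bound(p, H=224, W=224, C=3, K=1000, N=1.2e6,
                alpha=4, beta=1.0, c1=1.0, c2=1.0, c4=1.0):
    """Right-hand side of (13) for square H_T = W_T = p patches.
    m1, m2, m3 below are placeholder monotone choices, for illustration."""
    n_p = (H - p + 1) * (W - p + 1) / alpha**2      # effective patches/image
    D_T = p * p * C                                  # patch dimension
    mesh = c1 * (n_p * N) ** (-1.0 / (p * p * beta))            # eq. (6)
    grad = c2 * ((H * W) / (p * p)) ** (1.0 / np.sqrt(D_T))     # eq. (7)
    noise = c4 * np.log(K) * ((H * W) / (p * p) - 1.0)          # eq. (8)
    return (grad * mesh + noise) / np.sqrt(n_p)                 # eq. (13)

def cnn_bound(p_max, **kw):
    """Minimum of (13) over patch sizes up to p_max, as prescribed in the
    text for networks whose receptive fields span all these sizes."""
    return min(patch_bound(p, **kw) for p in range(3, p_max + 1))
```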

3.1 Analysis of error parameters
We plot the generalization error bound of (13) in Figure 2 by varying the patch size for a base model of the ImageNet-1k [28] data set containing 1.2M images from 1,000 classes, assuming a constant input resolution of 224x224. We assume the stride parameter $\alpha$ to be 4 when computing the effective number of patches, and we vary each of these parameters individually. In all these plots it is apparent that the curse of dimensionality starts affecting the generalization error at higher input dimensions. On the other hand, at very low patch sizes, even though the input dimensionality is controlled, noise in the target labels greatly amplifies the error. For intermediate patch sizes, depending on the data set's maximum input resolution, the curse of dimensionality is mitigated and the generalization error is bounded.
When the number of classes ($K$) is increased, the error model shows that the bound on the generalization error also rises increasingly fast, which is commonly observed in practice. The same effect is observed when increasing $\alpha$, which reduces the effective number of patches available for training. When the image resolution is increased, our model predicts only marginal improvement; it is likewise a common observation in practice that increasing the image resolution alone yields limited gains if there is no corresponding improvement in the quality of the labels. With the current set of hyper-parameters, the bound shows no effect of the number of training images on the generalization error, as seen in the bottom-right plot of Figure 2. However, large-scale empirical studies have shown that the generalization error has a power-law dependence on the training set size [30, 14, 16].
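Using the `patch_bound` sketch above (again, with our placeholder constants), these qualitative trends can be reproduced by sweeping one parameter at a time; for instance, increasing $K$ raises the bound and shifts its minimizer:

```python
for K in (10, 100, 1000):
    errs = {p: patch_bound(p, K=K) for p in range(3, 225)}
    p_best = min(errs, key=errs.get)
    print(f"K={K:5d}: min bound {errs[p_best]:.3g} at patch size {p_best}")
```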
3.2 Limitations
Our current error model treats patches of different sizes independently and does not consider combining predictions from different patch sizes, or how that might affect the bound in (13). Moreover, our initial guesses for the functions $m_1^{\downarrow}$, $m_2^{\uparrow}$, and $m_3^{\downarrow}$ are very simple, given the complex dynamics between the multiple parameters at play. The bound in (13) is also not normalized with respect to the maximum possible error and yields absurd numerical values for some hyper-parameter settings, again owing to the crude approximations of the noise functions. Our theoretical model currently does not take into account the capacity of different CNNs, and it says nothing about how the optimization procedure using stochastic gradient descent affects the function being approximated. A refined bound with a more detailed analysis will be presented elsewhere.
Despite these limitations, it is very surprising that this crude model does a good job of predicting the performance of CNN models on many standard image data sets. In the next section we compare this bound with the empirical generalization errors on multiple image classification data sets with varying numbers of training samples, classes, and image resolutions.
4 Experiments
We train ResNet [13] models, without any pre-trained weights, on CIFAR-10 [17], CIFAR-100 [17], STL-10 [4] and ImageNet-1k [28] using only patches, for different patch sizes. CIFAR-10 has 10 classes with 5,000 32x32-resolution images per class, whereas CIFAR-100 has the same total number of images spread over 100 classes, constraining the training set to only 500 32x32 images per class. STL-10 has 10 classes with 500 labelled training and 800 test images per class, at a higher resolution (96x96) than CIFAR. ImageNet-1k has 1.2 million images, at varying resolutions, from 1,000 classes. These data sets therefore cover a wide range of the parameters involved in our bound (13), allowing us to compare the theoretical estimate of the generalization error with the observed test errors in a robust manner.
4.1 CIFAR and STL
Table 4 shows the training and test set accuracies for a ResNet18 model trained on the CIFAR-10, CIFAR-100 and STL-10 data sets. The train and test accuracies are the average patch-wise accuracies: the model outputs (logits) are averaged over all patches of each image (stride 1), the argmax of the average is taken as the model's prediction for that image, and the predictions are compared with the original image labels on the training and test sets respectively. For the models trained on full image resolution (first row for each data set), the average patch-wise accuracy coincides with the traditional accuracy metric; we note that these numbers are lower than the state-of-the-art for ResNet18 because we do not employ regularization strategies like CutMix, Cutout or Mixup [38] during training. Consequently, models trained on patches slightly smaller than the full image resolution outperform those trained at full resolution, since patch-wise training implicitly regularizes training in a manner similar to the aforementioned strategies. For example, the model trained on such patches reaches 75.0% accuracy on CIFAR-100, compared to 66.8% for the model trained on full images, and on STL-10 the corresponding numbers are 83.0% versus 70.3%. This improvement in generalization when patches somewhat smaller than full resolution are used for training [34], and the regularization it introduces, can be attributed to the increase in the sample rate on a smaller-dimensional input without increasing the label noise too much (such crops can still easily be identified with their original class labels), as described by our theoretical model in Section 3.
More interesting is the observation that these models obtain surprisingly high accuracies from very small patches. On CIFAR-10, a model trained using only small patches classifies with 84.2% accuracy, and with even smaller patches it still obtains 66.7%! On STL-10, where the original images are 96x96, a ResNet18 looking only at small patches independently obtains 46.3% accuracy.

We plot the experimental test accuracies in Figure 3 and compare them with the generalization errors predicted by the theoretical bound; the comparison shows that our error model correlates reasonably well with the empirical values. In Figure 3, the predicted error at each patch size is computed by taking the minimum of (13) over all patch sizes less than or equal to that size. This is because CNN models in practice have access to patches of all smaller sizes: convolutional layers are constructed with the kernels of deeper layers stacked on top of those of shallower layers, with each layer's weights operating on patches whose size corresponds to that layer's receptive field. The empirical values deviate from the bound at full input sizes, but this is due to the absence of regularization strategies in our training, as described above.
4.2 ImageNet
We also trained a ResNet50 model with FFCV [18] on the ImageNet-1k data set using patches. This model achieves 72.4% top-1 accuracy, compared to 78.4% using full images. Due to resource constraints, a full sweep of patch sizes could not be done. It should be noted that, since we select only one random patch per image per mini-batch, the number of epochs needed to train a model on patches increases drastically. For example, FFCV [18] reports achieving 78.4% accuracy in 88 epochs, whereas training on patches required 100 epochs to achieve 70.8%, 320 epochs for 72.2%, and 1,280 epochs for 72.4%.
5 Visualizing the CNN learnings on patches via Heat maps
In this section, we visually analyze what CNNs learn when trained on patches. We also compare these models with pre-trained models that were trained on whole images. To generate the heat map of class $k$, the $k$'th component of the model's output vector (logits) for each patch of an image is positioned at that patch's location in the image and assigned a color based on its magnitude, as illustrated in Figure 1. Zero-padding is used so that there are as many patches as there are pixels in the image.
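A sketch of this construction (our own helper, assuming a model that accepts patch-sized inputs):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def class_heat_map(model, image, cls, patch_size=8, chunk=1024):
    """Place the `cls`-th logit of every patch at that patch's location.
    Zero-padding makes the heat map the same size as the image."""
    pad = patch_size // 2
    x = F.pad(image.unsqueeze(0),
              (pad, patch_size - 1 - pad, pad, patch_size - 1 - pad))
    _, C, Hp, Wp = x.shape
    patches = x.unfold(2, patch_size, 1).unfold(3, patch_size, 1)
    nH, nW = patches.shape[2], patches.shape[3]   # == image height, width
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(-1, C, patch_size, patch_size)
    # Run the patches through the model in chunks to limit memory use.
    logits = torch.cat([model(b) for b in patches.split(chunk)])
    return logits[:, cls].reshape(nH, nW)         # one value per pixel
```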
5.1 CNNs trained on patches vs whole images
We observe that models trained using only patches produce heat maps that are similar to those of models trained on whole images. Essentially, whether the target labels are attached to full-resolution images or only to image patches during training, the CNN learns to map a similar set of image patches to these target labels. This observation strengthens our hypothesis that CNNs operate on the patch domain rather than the image domain, thanks to the convolution-based architecture. Figure 4 shows examples from ImageNet-1k where a ResNet-50 model trained using only patches, which achieves 72.4% top-1 accuracy, produces heat maps similar to those of a pre-trained model downloaded from the PyTorch [24] hub, which has a test-set top-1 accuracy of 76.13%. Moreover, the patch size used to generate these heat maps is much smaller than the input resolution used to train both models. Extrapolating from the observation that CNNs operate on the domain of patches, the peculiar effectiveness of CNNs when applied to data sets they were not trained on via transfer learning [35] can be attributed to the differences between objects of the same class being smaller in the patch domain than in the image domain. The ability of CNNs to target specific patches while ignoring others further boosts our confidence in this hypothesis, though more experiments are needed to validate it carefully. A similar argument can be made in support of unsupervised learning with CNNs.

All the input images whose heat maps are shown in Figure 4 are correctly classified by both models. But even when the CNN models classify incorrectly, and even when their predictions differ from each other, the heat maps remain similar between models trained on patches and models trained on whole images. Examples are shown in Figure 5. In the top example, the target label given for the image in the ImageNet-1k data set is espresso. The patch-trained model and the pre-trained model both predict the class espresso maker, and the corresponding heat maps show why, highlighting a similar set of patches belonging to the coffee-maker object. There is also a coffee mug in the image, which is another ImageNet-1k class, and the heat maps for the coffee mug class likewise activate a similar set of patches in both models.

In the bottom example of Figure 5 a similar situation is observed, but here the patch-trained model's prediction (desk) also differs from that of the pre-trained model (desktop computer), and both predictions differ from the class label keyboard. Despite this, the corresponding heat maps for each of these classes are similar between the two models. The prediction of the patch-trained model is computed by averaging the model outputs over all patches of the image, whereas the prediction of the pre-trained model is computed by feeding in the whole image directly. The difference in their predicted classes, despite the similarity in patch activations, can be attributed to the difference in how the patch-wise predictions are aggregated by each model. In the patch-trained model we average the predictions from each patch without any weighting, whereas a more complicated weighted combination may be at work in the pre-trained model, which sees the whole image at once, because CNN architectures have final dense layers stacked on top of the convolutional layers; more experiments are needed to confirm this. Thus the heat map analysis also strongly supports our patch-based learning theory of CNNs. Further, these results shed some light on the effectiveness of adversarial attacks [11, 31] on CNNs. Any model that combines individual predictions linearly is susceptible to changing its decision under slight changes to the linear combination, so even if a CNN makes correct predictions on patches, it can be made to change its final prediction on an image simply by perturbing the weighting of the patch-wise predictions.
6 Patch-wise training of semantic segmentation

Patch-wise training can be readily applied to semantic segmentation tasks, since pixel-wise labels are available. We train SegNeXt [12] models on the Pascal VOC data set [8] by random-cropping the training images to small patches. We obtain an mIoU of 44.7 after 320k iterations, compared to an mIoU of 48.1 using full-size crops after 160k iterations. Figure 6 shows examples from the COCO data set [20] where a model trained on patches is used to produce the semantic maps. The predicted map is color-coded black for no class, blue for boat, and red for car, and is obtained by averaging the pixel-wise outputs of all patches with a stride of 100. A more detailed analysis of the patch-based approach to semantic segmentation will be presented elsewhere.
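A sketch of the strided patch-averaging used to produce the predicted maps (our own helper; it assumes a segmentation model that returns per-pixel class logits for a crop):

```python
import torch

@torch.no_grad()
def segment_by_patches(model, image, num_classes, patch_size=512, stride=100):
    """Average the pixel-wise logits of all overlapping crops, then take
    the argmax to obtain the predicted semantic map."""
    _, _, H, W = image.shape                     # image: (1, 3, H, W)
    logits = image.new_zeros(1, num_classes, H, W)
    counts = image.new_zeros(1, 1, H, W)
    for y in range(0, max(H - patch_size, 0) + 1, stride):
        for x in range(0, max(W - patch_size, 0) + 1, stride):
            crop = image[:, :, y:y + patch_size, x:x + patch_size]
            logits[:, :, y:y + patch_size, x:x + patch_size] += model(crop)
            counts[:, :, y:y + patch_size, x:x + patch_size] += 1
    # Note: border strips may stay uncovered if the stride does not tile
    # the image exactly; clamping avoids division by zero there.
    return (logits / counts.clamp(min=1)).argmax(dim=1)   # (1, H, W)
```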
7 Conclusion
We have shown experimental evidence for our claim that CNNs learn to classify by learning the labels of patches, provided a simple theoretical model that explains their generalization performance via mitigation of the curse of dimensionality, and obtained an upper bound on the generalization error. We showed that this estimated generalization error closely matches the observed generalization performance of CNNs on various standard image classification data sets. We also provided qualitative evidence for our theory in the form of heat maps, by visualizing what CNNs learn on patches and comparing CNNs trained conventionally with CNNs trained using only patches. As a natural application of our theory, these heat maps also serve as diagnostic tools that deep learning practitioners can use to analyze their models and data sets. Moreover, the generalization estimates of our a priori theory can help in designing better training sets, verifying models for deployment, and designing new architectures. Our theory also provides new insights into various behaviors of deep neural networks in settings such as object detection, transfer learning, adversarial robustness and unsupervised learning.
ACKNOWLEDGMENT
We would like to thank Andrew Brown, Aidan Chandrasekaran, Hirish Chandrasekaran, Jason Dunne, B.S.Manjunath, Hrushikesh Mhaskar, Abhejit Rajagopal and Marco Zuliani for useful discussions and valuable feedback.
References
- [1] Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
- [2] Peter L Bartlett and Wolfgang Maass. Vapnik-Chervonenkis dimension of neural nets. The Handbook of Brain Theory and Neural Networks, pages 1188–1192, 2003.
- [3] Wieland Brendel and Matthias Bethge. Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet. International Conference on Learning Representations, 2019. URL https://openreview.net/pdf?id=SkfMWhAqYQ.
- [4] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 215–223, Fort Lauderdale, FL, USA, 11–13 Apr 2011. PMLR. URL https://proceedings.mlr.press/v15/coates11a.html.
- [5] Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on learning theory, pages 698–728. PMLR, 2016.
- [6] Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout, 2017. URL https://arxiv.org/abs/1708.04552.
- [7] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks, 2015. URL https://arxiv.org/abs/1512.03965.
- [8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.
- [9] Alessandro Favero, Francesco Cagnetta, and Matthieu Wyart. Locality defeats the curse of dimensionality in convolutional teacher-student scenarios. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 9456–9467. Curran Associates, Inc., 2021.
- [10] Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983–1049, 2016.
- [11] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
- [12] Meng-Hao Guo, Cheng-Ze Lu, Qibin Hou, Zhengning Liu, Ming-Ming Cheng, and Shi-Min Hu. Segnext: Rethinking convolutional attention design for semantic segmentation. arXiv preprint arXiv:2209.08575, 2022.
- [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015. URL https://arxiv.org/abs/1512.03385.
- [14] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
- [15] Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them, 2019. URL https://arxiv.org/abs/1912.02178.
- [16] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- [17] Alex Krizhevsky. Learning multiple layers of features from tiny images. University of Toronto, 05 2012.
- [18] Guillaume Leclerc, Andrew Ilyas, Logan Engstrom, Sung Min Park, Hadi Salman, and Aleksander Madry. FFCV. https://github.com/libffcv/ffcv/, 2022. Commit e97289fdacb4b049de8dfefefb250cc35abb6550.
- [19] Shan Lin and Jingwei Zhang. Generalization bounds for convolutional neural networks, 2019. URL https://arxiv.org/abs/1910.01487.
- [20] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
- [21] Philip M. Long and Hanie Sedghi. Generalization bounds for deep convolutional neural networks, 2019. URL https://arxiv.org/abs/1905.12600.
- [22] H. N. Mhaskar and T. Poggio. Function approximation by deep networks. Communications on Pure and Applied Analysis, 19(8):4085–4095, 2020. URL https://arxiv.org/abs/1905.12882.
- [23] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Peter Grunwald, Elad Hazan, and Satyen Kale, editors, Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 1376–1401, Paris, France, 03–06 Jul 2015. PMLR. URL https://proceedings.mlr.press/v40/Neyshabur15.html.
- [24] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library, 2019.
- [25] Tomaso Poggio, Andrzej Banburski, and Qianli Liao. Theoretical issues in deep networks: Approximation, optimization and generalization, 2019. URL https://arxiv.org/abs/1908.09375.
- [26] Tomaso A. Poggio, Hrushikesh N. Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao. Why and when can deep - but not shallow - networks avoid the curse of dimensionality: a review. CoRR, abs/1611.00740, 2016. URL http://arxiv.org/abs/1611.00740.
- [27] Stefano Recanatesi, Matthew Farrell, Madhu Advani, Timothy Moore, Guillaume Lajoie, and Eric Shea-Brown. Dimensionality compression and expansion in deep neural networks. arXiv preprint arXiv:1906.00443, 2019.
- [28] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi:10.1007/s11263-015-0816-y.
- [29] Jure Sokolic, Raja Giryes, Guillermo Sapiro, and Miguel Rodrigues. Generalization Error of Invariant Classifiers. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 1094–1103. PMLR, 20–22 Apr 2017. URL https://proceedings.mlr.press/v54/sokolic17a.html.
- [30] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE international conference on computer vision, pages 843–852, 2017.
- [31] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
- [32] Matus Telgarsky. Representation benefits of deep feedforward networks, 2015. URL https://arxiv.org/abs/1509.08101.
- [33] Matus Telgarsky. Benefits of depth in neural networks. In Conference on learning theory, pages 1517–1539. PMLR, 2016.
- [34] Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Herve Jegou. Fixing the train-test resolution discrepancy. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alche-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
- [35] Karl Weiss, Taghi M Khoshgoftaar, and DingDing Wang. A survey of transfer learning. Journal of Big data, 3(1):1–40, 2016.
- [36] Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.
- [37] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features, 2019. URL https://arxiv.org/abs/1905.04899.
- [38] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization, 2017. URL https://arxiv.org/abs/1710.09412.
- [39] Pan Zhou and Jiashi Feng. Understanding generalization and optimization performance of deep cnns, 2018. URL https://arxiv.org/abs/1805.10767.