\receiveddate

XX Month, XXXX \reviseddateXX Month, XXXX \accepteddateXX Month, XXXX \publisheddateXX Month, XXXX \currentdateXX Month, XXXX \doiinfoXXXX.2022.1234567

\corresp

Corresponding author: Vamshi C. Madala (email: [email protected]).

CNNs Avoid Curse of Dimensionality by Learning on Patches

Vamshi C. Madala

{}^{\textbf{1}}

Shivkumar Chandrasekaran

{}^{\textbf{1}}

and Jason Bunk

{}^{\textbf{2}}

Department of Electrical and Computer Engineering, University of California Santa Barbara, Santa Barbara, CA 93106 USA Mayachitra Inc., Santa Barbara, CA 93111 USA

Abstract

Despite the success of convolutional neural networks (CNNs) in numerous computer vision tasks and their amazing generalization performances, several attempts to predict the generalization errors of CNNs have only been limited to a posteriori analyses thus far. A priori theories explaining the generalization performances of deep neural networks have mostly ignored the convolutionality aspect and do not specify why CNNs are able to seemingly overcome curse of dimensionality on computer vision tasks like image classification where the image dimensions are in thousands. Our work attempts to explain the generalization performance of CNNs on image classification under the hypothesis that CNNs operate on the domain of image patches, derives an a priori error bound for the generalization error and presents both quantitative and qualitative evidences in the support of our theory.

{IEEEkeywords}

a priori analysis, convolutional neural networks, curse of dimensionality, generalization error.

1 INTRODUCTION

\IEEEPARstart

Convolutional neural networks (CNNs) are known to overcome curse of dimensionality (CoD) in practice [25]. Many approaches have been proposed to explain this and a popular one hypothesizes that the data lies in a manifold whose dimension is much smaller than the input dimension [10], however, these dimensions which are shown to be in the range of ~50-100 for CNNs on CIFAR-10 data set are still too large to overcome the CoD using just 50k training samples[27]. Another approach proves that compositionality of data avoids CoD [26]. But knowing the right compositionality is not possible in all cases for e.g on image classification data sets. A priori theories which derived generalization error bounds of deep neural networks either only dealt with special class of functions like maximum, indicator, piece-wise polynomial[33], functions where Fourier transform of its gradient is integrable [1] or derived bounds specifically for deep neural networks with piece-wise linear activation functions[36] but do not explicitly consider the convolutional neural networks. Previous works also tried to use the locality and shift invariance of convolutions to explain how CNNs overcome the CoD [9].

We theorize that, in the image classification setting, CNNs avoid CoD by learning noisy labels on patches of images rather than whole images which have a larger dimensionality compared to the patches. We show that a simple strategy such as averaging the predictions on these patches is enough to overcome the noise. Under these assumptions we derive an a priori upper bound for the generalization error of CNNs and show empirical evidence that the error follows the bound closely on popular image classification data sets. Ours is the first work we are aware of that provides an a priori numerical estimate for the generalization error of CNNs.

To test our theory, we explicitly decompose images into patches and use them as inputs to train CNNs. When target labels for individual patches are not available, like in classification data sets where the class labels describe the whole images, we assign the label of parent image to be the label for each patch in the image and train the network. For e.g, all the patches of a cat image are also assigned the cat label. During inference, we manually average the patch-wise outputs (logits) of all the patches of a test image and consider that as the model’s prediction for that image, as illustrated in Figure 1. Our results show that using this simple approximation, CNNs trained only on patches are able to achieve non-trivial accuracies on multiple image classification data sets. For example, a ResNet18 trained on CIFAR-10 data set achieves 66.7% accuracy using only $4\times 4$ patches and 84.2% accuracy using only $8\times 8$ patches. We also visualize all the patch-wise activations corresponding to the location of each patch in the parent image, giving us a heat map for any given class, as shown in Figure 1. We find that the heat maps of CNNs trained using standard procedure (using full images) are nearly identical to those of CNNs trained using patches, giving more evidence for our theory that CNNs operate on image patches instead of operating on the full image domain, thus avoiding the CoD.

Refer to caption — Figure 1: Overview of CNN training and inference with only image patches. During training, a random $T\times T$ crop per each image in the mini-batch is applied without modifying their target labels. During inference, logits are computed on each $T\times T$ patch of a test image. Logits are averaged for all patches to obtain the CNN’s prediction by taking argmax or visualized relative to the position of the patch in the image to obtain heat map for any class.

We use our theory to derive an a priori upper bound on generalization error for classification tasks, that depends on various parameters of a data set like number of training samples, image resolution, number of classes, etc., which is different from the traditional a posteriori bounds which estimate generalization error using the trained weights of CNNs [15].

Motivated by our results on classification tasks, we also train CNNs on semantic segmentation tasks using our patch based approach, where pixel-wise labels are also available and we observe that they show similar performances to CNNs trained on whole images, providing more empirical evidence to our theory.

In summary,

$\bullet$

In Section 3 we present our theory on how CNNs operating on the smaller dimensions of patches and aggregating the predictions can overcome CoD and a derivation of the a priori generalization error bound for such CNNs on image classification.
$\bullet$

In Section 4 we report the results of our experiments on training CNNs using only patches and compare the average patch-wise accuracies with the upper bound derived on various data sets for different patch sizes giving evidence for our theory. We also compare the accuracies of these models with the accuracies of CNNs trained conventionally on whole images.
$\bullet$

In Section 5 we use heat maps to also make the comparison qualitatively between CNN models trained using patches, models trained in a standard way and pre-trained models.
$\bullet$

Finally in Section 6 we show results from semantic segmentation for CNNs trained using patches.

2 Related work

There have been many a posteriori generalization error bounds proposed for CNNs: Long and Sedghi[21] prove generalization error bounds that depend on training loss, weights and other parameters of CNNs; Lin and Zhang[19] propose bounds that depend on the spectral norm of weights; Zhou and Feng[39] prove bounds that depend on the architectural parameters of CNNs; and Jiang et al.[15] provide a comprehensive review of such bounds. As for the a priori theories, Barron[1] derives a generalization error bound on a special class of functions that have a Fourier representation such that they are integrable on the gradient of Fourier transform. Telgarsky[33] derives error bounds and shows why deep neural networks are essential in approximating a class of functions like maximum, indicator and piece-wise polynomial. A priori theories discussing the representational capacity of DNNs [32, 36] only consider CNNs as a special case of the feed-forward neural networks [7, 23, 2] whereas those focusing on CNNs have been conditional on special properties of the data set [26, 9, 5] which the popular computer vision (CV) data sets have not yet been shown to satisfy. Sokolic et al.[29] discuss the generalization error of invariant classifiers like CNNs under explicit input transformations and obtain an error bound, but they do not extend the discussion to patches and how the invariance when applied to the patches of input images affects the generalization performance of CNNs. Thus so far there has been no a priori theory that can satisfactorily explain the amazing generalization performances of CNNs on practical CV tasks. The bound proposed using our theoretical model gives an a priori numerical estimate of the generalization error and our experiments show that CNNs closely follow this bound for many popular CV data sets like CIFAR-10, CIFAR-100, STL-10 and ImageNet-1k.

Estimating generalization error from empirical studies like Sun et al.[30] have conducted large scale experiments on huge data sets to study the effect of parameters like data set size on the generalization error and shown empirically in select vision data sets that the error improves logarithmically on the size of training data. Similar large scale studies in the space of language models also obtain empirical estimates for generalization errors and find similar dependencies on the data set size[14, 16]. These empirical estimates provide encouraging similarities with our theoretical a priori bound.

Poggio et al.[26] seems to be the closest to this paper theoretically. They show that deep CNNs can avoid the curse of dimensionality for hierarchically compositional spatially local functions. CNNs definitely fall into this category, but the convolutional nature of the data was not fully exploited in their analysis. In particular knowledge of the right compositional model is needed for their results to be applicable.

Very early work on unsupervised learning by Coates et al.[4] shows that “convolutionally extracted” features achieve good performance on unsupervised tasks because of the similarity between the extracted patches/features making them easier to cluster using k-means. This correlates well with our theory that the predicted labels from convolutionally extracted features on small patches gives the CNNs their advantage, however no theoretical models were proposed to explain this.

Using random crops of images as inputs to CNNs, as we do in our experiments, is considered a common regularization strategy. Regularization strategies such as CutMix [37], which pastes random patches between images during training and Cutout [6] which removes random patches from training images are known to improve the test accuracies. But it is not fully understood how CNNs can learn to distinguish between classes when the differences are only at this patch level. A most common data augmentation strategy is random cropping which is also known to improve the generalization of CNNs but the crop size and location is usually chosen in practice such that class objects are preserved after the cropping. Touvron et al.[34] show that training images cropped to $128\times 128$ improve the test accuracy on ImageNet but their strategy also makes sure to adjust cropping so as to include object of the classes in the cropped image. Our theory offers simple explanation for why such strategies are effective and in our experiments we push the boundary by also training on patches of such small sizes that the assigned class labels are no longer semantically meaningful. Our heat map results show that, by learning patch-wise labels, CNNs can distinguish patches/pixels belonging to object class from those of background.

A very closely related paper by Brendel and Bethge[3] who propose the BagNet architecture. In BagNet a modified ResNet backbone is applied to each $q\times q$ patch of the image and the output is a class heat map. The average of these heat maps is then fed into a linear classifier. The heat maps highlight the patches which contribute to the network’s decision in a straight-forward way because of the linear classifier, improving explaininability but at the cost of some accuracy due of the modified architecture. The first main difference with our work is that, during training they consider the output of the linear classifier i.e. the weighted aggregate of the individual decisions on $q\times q$ patches to be the network’s prediction for the image and thus it is not directly evident what the network is considering as the labels for the individual patches. On the other hand, we train the standard CNNs to classify each patch as belonging to its parent class making it very clear that our heat maps are the network’s predictions on individual patches. We also show using the evidence of heat maps that the CNNs trained using standard training procedures also learn the labels on patches implicitly, thus providing explainable decisions to traditional CNN architectures without any loss of accuracy. Finally, we provide a theoretical model which explains why operating on the domain of patches is critical for avoiding the CoD.

3 A priori generalization error bound

Given our experimental observations that CNNs “learn” the labels of patches rather than full images, we note that this improves the generalization error in two ways – i) by reducing the dimensionality of the input domain i.e. number of pixels in patches is much smaller than that for the full images and, ii) by increasing the input sample rate i.e for each image of size $(H,W)$ , there are $(H-H_{T}+1)\times(W-W_{T}+1)$ patches of size $(H_{T},W_{T})$ . Both of these effects help mitigate the curse of dimensionality. Accordingly, in this section, based on some reasonable assumptions, we propose an upper bound for the generalization error that seems to have good experimental correlation with real image data sets and CNNs.

Let $f:\mathbb{R}^{D}\rightarrow\ \mathbb{R}^{K}$ be the true $K$ -class classification function mapping the samples $x\in\mathcal{X}\subseteq\mathbb{R}^{D}$ to their labels $y\in\mathcal{Y}\subseteq\mathbb{R}^{K}$ . Let $F:\mathbb{R}^{D}\rightarrow\ \mathbb{R}^{K}$ be a computed approximation to $f$ using a training data set that has $N$ training samples: $\mathbf{x}_{N}=\left\{(x_{i},y_{i})\mid(x_{i},y_{i})\in\mathcal{X}\times\mathcal{Y}\right\}_{i=1}^{N}$ . The mean-value theorem and triangle inequality imply that

	$\displaystyle\\|f(z)-F(z)\\|=\\|f(z)-f(x^{i})+f(x^{i})-F(x^{i})+$
	$\displaystyle F(x^{i})-F(z)\\|$
	$\displaystyle\leq\\|f(z)-f(x^{i})\\|+\\|F(x^{i})-F(z)\\|$
	$\displaystyle+\\|f(x^{i})-F(x^{i})\\|$
	$\displaystyle\leq\\|z-x^{i}\\|(\\|f^{\prime}\\|+\\|F^{\prime}\\|)+\varepsilon_{N},$		(1)

where $z\in\mathcal{X}$ and $x^{i}\in\mathbf{x_{N}}$ . Here $\varepsilon_{N}$ denotes the training error of $F$ measured using an appropriate norm. We assume that training error primarily depends on noise in the patch labels as we are just using the same label as that of the parent image. Later in this section we provide a reasonable upper bound on the training error. We see from inequality (1) that CoD is in full effect when the input dimension $D$ is large and is driven by the mesh norm (see Mhaskar and Poggio [22]):

\displaystyle\mu(z)

\displaystyle\equiv\min_{x^{i}\in\mathbf{x_{N}}}\|z-x^{i}\|.

(2)

We can obtain a general case upper bound on the mesh norm by assuming the worst case scenario that the input samples $\mathbf{x_{N}}$ are distributed uniformly in the input space $\mathcal{X}$ and are drawn i.i.d.:

\displaystyle\mu(z)

\displaystyle\leq c_{3}\left(\frac{1}{N}\right)^{\frac{1}{D}},

(3)

where the constant $c_{3}$ depends on the choice of the norm.

For images of size $(H,W)$ and $C$ number of channels, mesh norm becomes,

\displaystyle\mu(z)

\displaystyle\leq c_{3}\left(\frac{1}{N}\right)^{\frac{1}{HWC}}.

(4)

Now for our experimental setting where we train a CNN model $F_{T}$ using only patches, the new input and output subspaces are transformed as $\mathcal{X}_{T}\subseteq\mathbb{R}^{D_{T}}$ and $\mathcal{Y}_{T}\subseteq\mathbb{R}^{K}$ , respectively, where $D_{T}=H_{T}W_{T}C$ with $H_{T}$ and $W_{T}$ as the height and width dimensions of patches. The training dataset $\mathbf{x_{N}}$ is thus now transformed to $\mathbf{t}_{N_{T}}=\left\{(x_{t,i},y_{t,i})\mid(x_{t,i},y_{t,i})\in\mathcal{X}_{T}\times\mathcal{Y}_{T}\right\}_{i=1}^{N_{T}}$ , where $N_{T}=N(H-H_{T}+1)(W-W_{T}+1)$ because each image of size $(H,W,C)$ in the dataset now produces $(H-H_{T}+1)(W-W_{T}+1)$ number of patches of size $(H_{T},W_{T},C)$ , in effect increasing the number of samples in the transformed data set to $N(H-H_{T}+1)(W-W_{T}+1)$ . This gives the mesh norm for these patches as:

\displaystyle\mu(z_{T})

\displaystyle\leq c_{3}\Big{(}\frac{1}{N(H-H_{T}+1)(W-W_{T}+1)}\Big{)}^{\frac{1}{H_{T}W_{T}C}}.

(5)

We introduce two new parameters $S$ and $\alpha$ without shrinking this upper bound; $S=(S_{H},S_{W})$ controls the effective number of total patches in the data set and acts similar to the stride parameter that is popular in convolutional kernels and this parameter is there to account for the fact that the nearby patches have large overlapping regions and are less likely to contribute to an improvement in generalization error. So the total effective number of samples becomes $NT_{\text{eff}}=N\left(\frac{H-H_{T}}{S_{H}}+1\right)\left(\frac{W-W_{T}}{S_{W}}+1\right)$ , where we denote the effective number of distinct $H_{T}\times W_{T}$ patches in $z$ with $T_{\text{eff}}$ . We let $\alpha$ control the effective dimension of the patches as it is likely that the different channels in the image are also correlated, and in our experiments we use $\alpha=3$ . So the mesh norm is re-parameterized as:

\displaystyle\mu(z_{T})

\displaystyle\leq c_{3}\left(\frac{1}{NT_{\text{eff}}}\right)^{\frac{\alpha}{D_{T}}}.

(6)

We assume that the new approximation $F_{T}$ trained on patches is rougher than $F$ and consequently assume that the gradient norm in the inequality (1) depends on the patch size and dimensions via a power law and that it is much larger than the gradient norm of the true function $f_{T}$ , whose properties are unknown, i.e.

\displaystyle\|f^{\prime}_{T}\|+\|F^{\prime}_{T}\|

\displaystyle\approx\|F^{\prime}_{T}\|\lesssim m_{1}^{\downarrow}\left(\left(\frac{H_{T}W_{T}}{HW}\right)^{\frac{1}{D_{T}}}\right),

(7)

where $m_{1}^{\downarrow}$ is a monotonically decreasing function. In our results we use $m_{1}^{\downarrow}(\theta)=1/\theta$ . So at full image size the upper bound on the derivative is an known constant, but at smaller patch sizes this norm grows. In higher dimensions this bound grows slower because there is more room to accommodate the roughness in the classification function.

We are using the labels of a training image as the label for its patches no matter how small the patch is or where it is drawn from. So these labels are intrinsically noisy. We therefore model the training error on the patch data set $\varepsilon_{N_{T}}$ , to be bounded by a quantity that is inversely proportional to the area of the patch. So for some constant $c_{4}$ :

\displaystyle\varepsilon_{N_{T}}

\displaystyle\leq c_{4}\,m_{2}^{\uparrow}(K)\,m_{3}^{\downarrow}\left(\frac{H_{T}W_{T}}{HW}\right),

(8)

where $m_{2}^{\uparrow}$ is a monotonically increasing and $m_{3}^{\downarrow}$ a monotonically decreasing function. In our results we use $m_{2}^{\uparrow}(K)=\sqrt{K}$ and $m_{3}^{\downarrow}(\theta)=-\log(\theta)$ , arguing that for $H_{T}W_{T}=HW$ i.e. when full images are used, labels are noiseless. It should be noted here that our choices for $m_{1}^{\downarrow}$ , $m_{2}^{\uparrow}$ and $m_{3}^{\downarrow}$ are very preliminary but our experimental results show that these simple approximations are able to satisfactorily model the generalization error across different image classification data sets and CNN architectures. We also make a note that in our approximation of training error, we do not take into account the effect of number of parameters in the CNN. In fact, as $N\rightarrow\infty$ and if so does the number of parameters in CNN at a suitable rate, one would expect the training error to go to zero, which is not reflected by the equation 8. In future work, we plan to further refine these approximations.

Finally the new bound on the generalization error for the model $F_{T}$ trained on the dataset $\mathbf{t}_{N_{T}}$ and measured on samples $z_{T}$ drawn randomly from the patched subspace $\mathcal{X}_{T}\times\mathcal{Y}_{T}$ where $f_{T}(z_{T})\in\mathcal{Y}_{T}$ can be rewritten as:

$\displaystyle\\|f_{T}(z_{T})-F_{T}(z_{T})\\|$	$\displaystyle\leq c_{5}\mu(z_{T})\\|F^{\prime}_{T}\\|+\varepsilon_{N_{T}}$
	$\displaystyle\leq c_{5}\mu(z_{T})\,m_{1}^{\downarrow}\left(\left(\frac{H_{T}W_{T}}{HW}\right)^{\frac{1}{D_{T}}}\right)$
	+ c_4 m_2^↑(K) m_3^↓(HTWTH W)	(9)

Inequality (9) estimates the upper bound on the generalization error for a model trained on patches and tested on patches. However, what we are actually interested in is the generalization performance of the model on unseen full sized images. We make the simple assumption that:

\displaystyle f(z)

\displaystyle\approx\frac{1}{(\text{number of patches in }z)}\sum_{z_{T}\in z}f_{T}(z_{T})

(10)

and accordingly,

\displaystyle F(z)

\displaystyle\approx\frac{1}{(\text{number of patches in }z)}\sum_{z_{T}\in z}F_{T}(z_{T})

(11)

We now assume that there are $T_{\text{eff}}$ number of uncorrelated $H_{T}\times W_{T}$ patches in $z$ on average. Then one can expect, that on average the expected upper bound on the generalization error would be:

\displaystyle\|f(z)-F(z)\|

\displaystyle\lesssim\frac{1}{\sqrt{T_{\text{eff}}}}\max_{z_{T}\in z}\|f_{T}(z_{T})-F_{T}(z_{T}\|

(12)

where $T_{\text{eff}}$ is the effective number of patches in an image. And now from (1), (6) and (9) we finally get

	$\displaystyle\\|f(z)-F(z)\\|\lesssim\frac{1}{\sqrt{T_{\text{eff}}}}\Biggl{(}c_{6}\left(\frac{1}{NT_{\text{eff}}}\right)^{\frac{\alpha}{D_{T}}}$
	$\displaystyle m_{1}^{\downarrow}\left(\left(\frac{H_{T}W_{T}}{HW}\right)^{\frac{1}{D_{T}}}\right)$
	$\displaystyle+c_{4}m_{2}^{\uparrow}(K)\,m_{3}^{\downarrow}\left(\frac{H_{T}W_{T}}{HW}\right)\Biggr{)}$		(13)

Since the CNNs we use in our experiments typically use all patch sizes from $3\times 3$ on wards up to the maximum size $H_{T}\times W_{T}$ , we assume that the minimum of the right-hand side of inequality (13) over the considered patch sizes should be taken as the generalization error for CNNs. In our experiments we use $c_{6}=1$ and $c_{4}=0.5$ .

3.1 Analysis of error parameters

We plot the generalization error bound of (13) in Figure 2 by varying the patch size for a base model of ImageNet-1k[28] data set containing 1.2M images and assuming a constant input resolution of $256\times 256$ from 1000 classes. We assume the stride parameter to be 4 to compute the effective patch size and vary each of these parameters individually. In all these figures, it is apparent that the curse of dimensionality starts affecting the generalization error at higher input dimensions. On the other hand, at very low patch sizes, even though the input dimensionality is controlled, noise in the target labels extremely amplifies the error. While for the intermediate patch sizes, depending on the data set’s maximum input resolution, curse of dimensionality is mitigated and the generalization error is bounded.

When the number of classes ( $K$ ) is increased, the error model shows that the bound on the generalization error also raises increasingly fast which is commonly observed in practice. The same effect is also observed when increasing $S$ , which reduces the effective number of patches available for training. When the image resolutions are increased, our model only predicts marginal improvement which is also a common observation in practice that by improving just the image resolution, the improvement in performance is limited if there is no corresponding improvement in the quality of labels. With the current set of hyper-parameters, this bound does not show any effect of number of training images on the generalization error, as shown in the bottom right plot of Figure 2. However, large scale empirical studies showed that the training set size has power law dependencies on the generalization[30, 14, 16].

3.2 Limitations

Our current error model deals with patches of different sizes independently and does not consider combining the predictions from different patch sizes and how that might affect the bound in (13). Moreover, our initial guesses for the functions $m_{1},m_{2},m_{3}$ are very simple, given the nature of complex dynamics between multiple parameters that are at play. The bound in (13) is also not normalized with respect to the maximum error possible and leads to absurd numerical values for some values of hyper-parameters, owing again to the crude approximations of noise functions. Our theoretical model currently also does not take into account the capacity of different CNNs and also does not tell anything about how the optimization procedure using stochastic gradient descent affects the function that is being approximated. A refined bound with more detailed analysis will be presented elsewhere.

Despite these limitations, it is very surprising that this crude model does a good job of predicting the performance of CNN models on many standard image data sets. We compare this bound with the empirical generalization errors on multiple image classification data sets with varying parameters $(H,W)$ , $(H_{T},W_{T})$ , $N$ and $K$ in the next section.

4 Experiments

We train ResNet[13] models without any pre-trained weights on CIFAR-10[17], CIFAR-100[17], STL-10[4] and ImageNet-1k [28] using only patches for different patch sizes. CIFAR-10 has 10 classes ( $K=10)$ with 5,000 32x32 resolution images per class ( $N=50000$ , $H=W=32$ ), whereas CIFAR-100 has the same number of total images but among 100 classes, constraining the training set to only 500 32x32 images per class. STL-10 has 10 classes with 500 training and 800 test images per class which are labelled but these images are of higher resolution ( $96\times 96$ ) than CIFAR. Imagenet-1k has 1.2 million images with varying resolutions from 1,000 classes. Thus these data sets cover a wide range of variability in the parameters involved in our bound (13) allowing us to compare the theoretical estimate of generalization error with the observed test errors in a robust manner.

	$\displaystyle\\|f(z)-F(z)\\|=\\|f(z)-f(x^{i})+f(x^{i})-F(x^{i})+$
	$\displaystyle F(x^{i})-F(z)\\|$
	$\displaystyle\leq\\|f(z)-f(x^{i})\\|+\\|F(x^{i})-F(z)\\|$
	$\displaystyle+\\|f(x^{i})-F(x^{i})\\|$
	$\displaystyle\leq\\|z-x^{i}\\|(\\|f^{\prime}\\|+\\|F^{\prime}\\|)+\varepsilon_{N},$		(1)

CNNs Avoid Curse of Dimensionality by Learning on Patches

Abstract

1 INTRODUCTION

2 Related work

3 A priori generalization error bound

3.1 Analysis of error parameters

3.2 Limitations

4 Experiments

4.1 CIFAR and STL

4.2 ImageNet

5 Visualizing the CNN learnings on patches via Heat maps

5.1 CNNs trained on patches vs whole images

6 Patch-wise training of semantic segmentation

7 Conclusion

ACKNOWLEDGMENT

References