Disturbing Target Values for Neural Network Regularization
Abstract
Given the increasing computational capabilities of modern systems, ever bigger and more complex neural networks are being used. However, with this increase, models tend to perform well on training data but poorly on unseen test data, a problem known as overfitting. Recent research widely recognizes that models become overparameterized as networks grow more complex. Diverse regularization techniques, such as L2 regularization, Dropout, and DisturbLabel (DL), have been developed to prevent overfitting. DL, a newcomer on the scene, regularizes the loss layer by flipping a small share of the target labels at random and training the neural network on these distorted data so that it does not simply memorize the training data. It has been observed that labels predicted with high confidence during training cause overfitting, yet DL selects the labels to disturb at random, regardless of their confidence. To address this shortcoming of DL, we propose Directional DisturbLabel (DDL), a novel regularization technique that uses the class probabilities to identify confident labels and disturbs only those labels to regularize the model. This active regularization exploits the model's behavior during training to regularize it in a more directed manner. To address regression problems, we also propose DisturbValue (DV) and DisturbError (DE). DE disturbs only predefined confident target values, while DV injects noise into a portion of target values at random, similarly to DL. In this paper, 6 and 8 datasets are used to validate the robustness of our methods on classification and regression tasks, respectively. Finally, we demonstrate that our methods are either comparable to or outperform DisturbLabel, L2 regularization, and Dropout, and that combining them with either L2 regularization or Dropout achieves the best performance on more than half of the datasets.
1 Introduction
Training computers to learn and think like humans in order to produce reliable results has been a topical study in Machine Learning (ML). Although there are several types of ML, for the sake of simplicity we focus only on supervised learning, in which a model is trained on a known dataset and the trained model is then used to predict on new, unseen data. Supervised learning comprises two different tasks: classification and regression. They share the same concept of mapping inputs to outputs but differ in the form of the targets: for classification the output is discrete or categorical, while for regression it is continuous or real-valued. In the attempt to create machines that reason like humans, mathematical models mimicking the human brain were introduced and have become commonly known as Neural Networks (NN). The idea of the NN was first introduced in 1948 by Alan Turing [21] and has since been developed into a family of algorithms built on neuron-like networks. These networks are used to solve similar, but more complex, problems than other ML algorithms, such as credit card fraud detection in the financial sector [22] and automated driving systems in the automotive industry [23]. The Artificial Neural Network (ANN), Recurrent Neural Network (RNN), and Convolutional Neural Network (CNN) are some examples of these neuron-like algorithms. Deeper networks yield higher accuracy and a greater capacity for data representation. However, this comes with large numbers of parameters, which makes the models prone to overfitting, i.e., failing to generalize to unseen data. This phenomenon is likely to happen when the model is very complex while the training data are insufficient. Nowadays, much deeper networks are deployed, and they have far more parameters than LeNet [20] does. Overfitting conflicts with the objective of any ML algorithm: we need the trained model to perform well not only on the training data but also on unseen data. To avoid this, regularization is brought into play.
In recent years, several regularization techniques have been developed that apply to various parts of the network. Some techniques are applied to the weights, such as DropConnect [2], which drops the weights between connected nodes, leaving the connecting layer sparse while the nodes themselves can still be active. L2 regularization, or weight decay [16], adds a penalty term with a hyperparameter (λ) to the error function, causing the weights to decay toward zero. Data augmentation [3] is used at the input layer to perform image transformations such as flipping, zooming, shifting, and cropping. Dropout [4] can be applied, for instance, at the hidden layers to reduce the dependence between neurons by randomly dropping nodes from the network. Penalizing confident outputs [5] is done at the output layer by adding a negative-entropy term to the negative log-likelihood during training. The last example is a recent novel idea concerning the loss layer: DisturbLabel (DL) [1] attacks the loss layer by randomly flipping ground-truth labels with a shared hyperparameter, the noise rate (α). Among the many regularization methods, DL is claimed to be the first algorithm attacking the loss layer for the classification task. Its simplicity of implementation, combined with compelling results comparable to widely known techniques such as Dropout, convinced us to explore and develop new methods inherited from this concept. In this paper, we improve the regularization technique on the loss layer for the classification task and extend the idea to the regression task.
For classification, we propose Directional DisturbLabel (DDL), a method that systematically selects which labels to disturb by excluding non-confident labels from the candidates based on cosine similarity. We show that this improvement reduces the misclassification rate compared to the baseline and performs better with deeper networks. Additionally, the experimental results reaffirm that DDL cooperates with other regularization methods without difficulty. For regression, we propose DisturbValue (DV) and DisturbError (DE), two novel methods obtained by applying the disturbing procedure of DL to the regression task. The experimental results show the efficiency of our methods. Our code is available from the first author's GitHub (https://github.com/kimy-de/DisturbMethods).
Our main contributions include:
1. We propose model-agnostic noisy regularization methods.
2. We improve DisturbLabel by filtering non-confident labels out of the candidates for disturbance.
3. We demonstrate the robustness of our methods on 14 datasets.
2 Related Work
Regularization methods target a wide variety of areas of the neural network: regularization can be imposed on the weights, the hidden nodes, or the input nodes, but regularizing within the loss layer is relatively new. Xie et al. [1] introduced DisturbLabel (DL), the first work investigating this area for convolutional neural networks.
The closest method to DisturbLabel is label smoothing [18], which also perturbs the ground-truth labels, but by softening each label into a vector of probabilities of belonging to each class in the task, whereas DisturbLabel flips the label entirely. As the authors point out, the main advantage of DisturbLabel is that it is stochastic, while soft labelling is deterministic and therefore cannot provide regularization as strong as DisturbLabel's.
Joo et al. [32] also perturb ground-truth labels similarly to label smoothing, but take a Bayesian approach: instead of manipulating the labels directly, they treat the ground truth as a random variable following a categorical distribution over class labels rather than being given by the training label.
Among recent work on manipulating labels, the mixup method [30] is one of the most prominent. It trains a neural network on convex combinations of pairs of examples and their labels. Li et al. [31] combine label smoothing with the hypothesis that more confident predictions require stronger regularization (a hypothesis we also exploit in this work). They regularize via structural label smoothing, imposing different smoothing strengths on clusters of data lying in different parts of the feature space.
Noisy regularization for regression tasks is a well-known technique, but as a survey paper [29] points out, the noise is typically added either to the inputs [27] or to the weights [28]. To the best of our knowledge, adding noise to the targets as a form of regularization has never been attempted before. Imani et al. [26] did add Gaussian noise to targets, but their experiments differ from our proposed DV approach in that (a) it was done as a form of augmentation and (b) noise was added to all targets.
Adding Gaussian noise to target values was also explored in [37], but our works complement rather than repeat each other. First, this prior work concentrates on the convergence properties of a network with noise added to the desired signal and does not investigate regularization effects. Second, they add noise to the target values and use an annealing schedule on the step size to control the variance of the noise, since the noise affects weight updating. In contrast, we control the number of noisy values using a noise rate instead of adjusting the step size.
3 Methodology
3.1 Classification
DisturbLabel (DL) [1] regularizes the loss layer by replacing some ground-truth labels with incorrect labels in each iteration. DL selects substitutes at random to generate disturbed labels based on a noise rate $\alpha$ that determines the number of disturbed labels. When $\alpha = 0$, batch training proceeds without any change. When $\alpha > 0$, each selected true label $t$ is converted into a disturbed label $\tilde{y}$ drawn from a Multinoulli distribution with the following probabilities:

$$P(\tilde{y} = t) = 1 - \frac{C-1}{C}\,\alpha, \qquad P(\tilde{y} = i) = \frac{\alpha}{C} \quad (i \neq t), \tag{1}$$

where $1 - \frac{C-1}{C}\alpha$ and $\frac{\alpha}{C}$ replace the 1 and the 0s, respectively, in the one-hot vector, and $C$ is the number of classes. DL does not consider the confidence of the labels when generating disturbed labels.
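As a minimal sketch (our naming, not the authors' released code), the DL step in Eq. (1) can be realized by selecting each label with probability $\alpha$ and resampling it uniformly over the $C$ classes, which reproduces the probabilities above:

```python
import torch

def disturb_label(y, num_classes, alpha=0.1):
    """DisturbLabel sketch: each label is selected with probability alpha and
    resampled uniformly over all C classes, so the true label survives with
    probability 1 - alpha*(C-1)/C and any other class appears with prob.
    alpha/C, matching Eq. (1). y is a 1-D tensor of integer labels."""
    y = y.clone()
    mask = torch.rand(y.shape[0], device=y.device) < alpha   # labels to disturb
    y[mask] = torch.randint(0, num_classes, (int(mask.sum()),), device=y.device)
    return y

# Usage inside a training loop (hypothetical tensors):
# targets = disturb_label(targets, num_classes=10, alpha=0.1)
```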
[Figure 1: Model outputs for a three-class problem lie in three-dimensional space whose natural basis vectors are the one-hot label vectors $e_1$, $e_2$, and $e_3$; the angle $\theta$ between a prediction $\hat{y}$ and its one-hot label $y$ is measured by cosine similarity.]
Our hypothesis is that highly confident labels cause the overfitting problem; hence our method, Directional DisturbLabel (DDL), considers only confident labels as candidates to be disturbed. To be specific, we define a confident label as a label $y$ satisfying $\theta \le \theta^*$, where $\theta^*$ is a permissible angle. The angle $\theta$ between the prediction $\hat{y}$ and the one-hot label $y$ is obtained from the cosine similarity $\cos\theta = \hat{y} \cdot y / (\lVert\hat{y}\rVert\,\lVert y\rVert)$, and we say that $y$ is a confident label if $\cos\theta \ge \cos\theta^*$. To simplify the formula, we use the unit vectors $\hat{y}/\lVert\hat{y}\rVert$ and $y/\lVert y\rVert$, so that $\cos\theta = \hat{y} \cdot y$ with $\lVert\hat{y}\rVert = \lVert y\rVert = 1$ is applied to select candidates for disturbed labels in each batch. The range of $\cos\theta$ is $[0, 1]$: $\cos\theta = 1$ when $\hat{y}$ points in the same direction as $y$. Moreover, $\hat{y}$ and $y$ always have nonnegative elements, so the maximum angle between them is $\pi/2$. In our setting, non-confident labels are excluded from the candidates by the non-confident interval $(\theta^*, \pi/2]$. For instance, Figure 1 shows that all outputs of a classification model with three classes lie in three-dimensional space whose natural basis consists of the one-hot vectors of the three true labels, $e_1$, $e_2$, and $e_3$. Therefore, $\hat{y}$ and $y$ lie in the same space, and $\cos\theta$ can be calculated.
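A sketch of the DDL selection under the same assumptions (PyTorch, our naming; the threshold value for $\cos\theta^*$ is a placeholder): only labels whose softmax output is sufficiently aligned with the one-hot target are eligible for disturbance.

```python
import torch
import torch.nn.functional as F

def directional_disturb_label(logits, y, num_classes, alpha=0.1, cos_thr=0.9):
    """DDL sketch: compute cos(theta) between the softmax output and the
    one-hot label; labels with cos(theta) above the permissible threshold
    (i.e. theta <= theta*) are confident candidates, and only those are
    disturbed with probability alpha."""
    probs = F.softmax(logits, dim=1)
    one_hot = F.one_hot(y, num_classes).float()
    cos_sim = F.cosine_similarity(probs, one_hot, dim=1)  # in [0, 1] here
    confident = cos_sim > cos_thr                         # candidate labels
    disturb = confident & (torch.rand(y.shape[0], device=y.device) < alpha)
    y = y.clone()
    y[disturb] = torch.randint(0, num_classes, (int(disturb.sum()),),
                               device=y.device)
    return y
```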
3.2 Regression
Our hypothesis is that noise injection into target values regularizes the output layer through the oscillation of the targets, because the measurement of continuous target values may contain errors caused by machine tolerance, human ability, measurement conditions, and so on. For example, when we measure the current temperature, two readings that differ within the tolerance of the thermometer can be considered the same, so proper noise injection into the target values can yield a regularization effect without changing the nature of the targets. We therefore carry the concept of DisturbLabel (DL) over to regression tasks. However, the original DL applies only to classification problems, so we propose two disturb methods that extend the concept of DL to regression. Given a mini-batch set $\{(x_i, y_i)\}_{i=1}^{n}$, a prediction is $\hat{y}_i = f(x_i; \theta)$, where $f$ is a regression model with model parameters $\theta$.
To prevent overfitting during training, DisturbValue (DV) adds Gaussian noise to some of the target values at random. To be specific, a target value $y_i$ is replaced by $\tilde{y}_i = y_i + \epsilon_i$ with $\epsilon_i \sim N(0, \sigma^2)$, based on a noise rate $\alpha$. When $\alpha > 0$, the noise is added to a randomly chosen fraction $\alpha$ of the target values in each batch. As hyperparameters of DV, $\alpha$ and $\sigma$ have a huge impact on the regularization performance, but tuning both for every dataset is cumbersome. To reduce the number of hyperparameters, we transform the domain of the target values into $[0, 1]$ with the MinMax scaler and fix $\sigma$ as a default. Thus, the domain of the target values is $[0, 1]$ regardless of the dataset, so the same standard deviation can be used for any dataset. Finally, we consider only the single hyperparameter $\alpha$ during training.
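A minimal sketch of DV under $[0, 1]$ MinMax scaling; the default $\sigma = 0.05$ below is illustrative, not the paper's fixed value:

```python
import torch

def disturb_value(y, alpha=0.1, sigma=0.05):
    """DV sketch: add Gaussian noise N(0, sigma^2) to a random fraction
    alpha of the (MinMax-scaled) target values. y is a 1-D float tensor."""
    y = y.clone()
    mask = torch.rand(y.shape[0], device=y.device) < alpha   # values to disturb
    y[mask] += sigma * torch.randn(int(mask.sum()), device=y.device)
    return y
```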
DV adds noise to target values at random. In contrast, DisturbError (DE) adds Gaussian noise only to target values satisfying $|\hat{y}_i - y_i| < e$, where $e$ is a residual boundary. We say that $y_i$ is a high-confidence value if there exists a small constant $e$ such that $|\hat{y}_i - y_i| < e$. When there are many high-confidence values, it is highly likely that the model is overfitting the training data. To prevent this, DE disturbs the error of the high-confidence values by redefining the error as $\hat{y}_i - (y_i + \epsilon_i)$. In DE, we would originally have to control both $e$ and $\sigma$; however, using the same scaling and $\sigma$ as DV, we can consider only $e$.
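A corresponding sketch of DE; the residual boundary $e$ and $\sigma$ below are illustrative values under the same $[0, 1]$ scaling:

```python
import torch

def disturb_error(y_pred, y, e=0.01, sigma=0.05):
    """DE sketch: y_pred and y are 1-D float tensors. Noise is injected only
    where the residual |y_pred - y| is already below the boundary e, i.e.
    for the high-confidence values most at risk of being memorized."""
    y = y.clone()
    confident = (y_pred - y).abs() < e                       # high-confidence values
    y[confident] += sigma * torch.randn(int(confident.sum()), device=y.device)
    return y
```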
Therefore, we find the optimal hyperparameters of DV and DE for each dataset and compare them to other regularization techniques in Section 4. In this paper, the mean squared error is used, so that $L = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - \tilde{y}_i)^2$ is the objective function with our disturb methods. Considering the gradient of $L$ with respect to a prediction $\hat{y}_i$,

$$\frac{\partial L}{\partial \hat{y}_i} = \frac{2}{n}\,(\hat{y}_i - y_i - \epsilon_i) \tag{2}$$

is derived. It shows that the noise controls the gradient of the prediction values to prevent overfitting.
4 Experiments
4.1 Classification Task
Table 1: Characteristics of the classification datasets.

| Dataset | MNIST* | FMNIST | CIFAR10* | CIFAR100 | Intel | Art |
|---|---|---|---|---|---|---|
| # classes | 10 | 10 | 10 | 100 | 6 | 5 |
| Size of images | 28×28 | 28×28 | 32×32 | 32×32 | 150×150 | 227×227** |
| # channels | 1 | 1 | 3 | 3 | 3 | 3 |
| # instances | 70K | 70K | 60K | 60K | 17K | 9K |
| Train/Test (%) | 86/14 | 86/14 | 83/17 | 83/17 | 82/18 | 86/14 |
* Datasets also used by the baseline paper. CIFAR100 is implemented differently from the baseline: we use all 100 classes and treat the dataset as an additional one.
** Input size fed to the network (original sizes vary).
Evaluation metric. We report the test misclassification rate to compare the performance of the methods. Experiments for each dataset-method combination were run 5 times; the average value is reported alongside the standard deviation.
Baselines. In our experiments we compare the performance of the proposed DDL method against the same baselines as the reference paper (including combinations of the methods) and against DL itself: no regularization, Dropout, DisturbLabel (DL), DisturbLabel (DL) + Dropout, Directional DisturbLabel (DDL), and Directional DisturbLabel (DDL) + Dropout.
Datasets. Under the hypothesis that regularization effects should appear more vividly on 'easier' datasets, we conduct experiments on six image collections of varying complexity: MNIST [24], FMNIST [35], CIFAR-10 [25], CIFAR-100 [16], INTEL [34], and ART [33]. The characteristics of the datasets are summarized in Table 1.
Architecture. LeNet [20] is modified in accordance with the baseline paper (two convolution units for the MNIST and FMNIST datasets and three convolution units for CIFAR10, CIFAR100, ART, and INTEL, each followed by ReLU and max pooling). For our experiments on method combinations, ResNet18 [19] is modified with an additional dropout layer after the average-pooling step, before the input is fed to the last layer with the softmax loss function.
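For concreteness, a sketch of the three-convolution-unit LeNet variant follows; the channel widths and classifier size are our assumptions, since the exact configuration follows the baseline paper:

```python
import torch.nn as nn

class LeNetVariant(nn.Module):
    """Sketch of the three-convolution-unit LeNet variant: each unit is
    Conv -> ReLU -> MaxPool. Channel widths and classifier size assumed."""
    def __init__(self, num_classes=10, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(128 * 4 * 4, num_classes)  # for 32x32 inputs

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))
```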
Optimizer. Training was performed using the SGD optimizer with momentum and a decaying learning rate. We start with a learning rate of 0.001 and reduce it by a factor of 0.1 after 40, 60, and 80 epochs. Each experiment was run for 100 epochs in total.
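This schedule corresponds to the following setup (a sketch; `model` and `train_one_epoch` are placeholders, and the momentum value 0.9 is an assumption):

```python
import torch

# SGD with momentum; lr 0.001 decayed by 0.1 after epochs 40, 60, and 80.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[40, 60, 80], gamma=0.1)

for epoch in range(100):
    train_one_epoch(model, optimizer)  # hypothetical training step
    scheduler.step()
```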
Hyperparameters. The noise rate (the probability of a label being disturbed) was tuned for each dataset separately. The optimal hyperparameter values are summarized in Table 2. The dropout probability was set to 0.5 in all corresponding experiments.
Table 2: Optimal noise rates (%) for DL and DDL.

| Architecture | Method | MNIST | FMNIST | CIFAR10 | CIFAR100 | Intel | Art |
|---|---|---|---|---|---|---|---|
| LeNet | DL | 10 | 5 | 10 | 40 | 50 | 50 |
| | DDL | 10 | 5 | 50 | 30 | 50 | 20 |
| ResNet18 | DL | 20 | 5 | 10 | 10 | 5 | 10 |
| | DDL | 20 | 5 | 10 | 10 | 5 | 10 |
Table 3: Misclassification rate (%) on LeNet.

| Method | MNIST | FMNIST | CIFAR10 | CIFAR100 | Intel | Art |
|---|---|---|---|---|---|---|
| # classes | 10 | 10 | 10 | 100 | 6 | 5 |
| No reg.¹ | 0.86 ± 0.034 | 8.052 ± 0.112 | 24.82 ± 0.828 | 59.732 ± 1.590 | 15.614 ± 0.484 | 18.08 ± 0.597 |
| Dr.² | 0.658 ± 0.049 | 7.784 ± 0.167 | 22.58 ± 0.771 | 50.662 ± 0.828 | 13.48 ± 0.575 | 16.372 ± 0.651 |
| DL | 0.642 ± 0.052 | 7.748 ± 0.069 | 23.77 ± 0.310 | 58.758 ± 0.596 | 13.289 ± 0.270 | 16.28 ± 0.335 |
| DDL | 0.61 ± 0.041 | 7.789 ± 0.111 | 22.614 ± 0.553 | 56.444 ± 0.337 | 13.542 ± 0.193 | 16.687 ± 0.948 |
| Dr.+DL | 0.58 ± 0.044 | 7.816 ± 0.132 | 21.65 ± 0.320 | 50.87 ± 0.701 | 13.066 ± 0.339 | 15.578 ± 0.456 |
| Dr.+DDL | 0.658 ± 0.086 | 7.744 ± 0.210 | 21.766 ± 0.403 | 50.824 ± 1.487 | 12.366 ± 0.310 | 15.537 ± 0.498 |
¹ No reg. refers to no regularization.
² Dr. refers to Dropout.
Experimental Results. The results of the experiments on LeNet are summarized in Table 3. The DDL method (in combination with Dropout) demonstrated the best results on half of the datasets and is in the top two on all of them. The Intel and Art datasets can be considered simpler datasets, with a higher degree of overfitting than the other, more complex datasets. MNIST is also known to be easily classified by modern networks; therefore we can say that the results support the hypothesis that the DDL method has greater regularization capacity and is most useful on the datasets that need it most. The four newly tested datasets demonstrated similar behaviour, except for CIFAR100, which is the most challenging of them, having far more classes. We assume that memorization of the ground truth did not really occur for CIFAR100, which is why perturbing the labels further did not facilitate the training procedure. A more lightweight method, Dropout, nevertheless still improved the result compared to the complete absence of regularization.
The results of the experiments on ResNet18 are shown in Table 4, confirming the results obtained for LeNet. For 5 out of 6 tested datasets, DDL (either alone or in combination with Dropout) achieved the best result compared to the baselines, and is second best on the remaining one. FMNIST and CIFAR10 presumably do not require regularization as strong as the simpler MNIST, Intel, and Art datasets, so combining DDL with Dropout can be excessive in this case, and the DDL method obtains the best result on its own. CIFAR100 still does not suffer from overfitting, even with the deeper network, to the extent that the other datasets do, and does not profit from strong regularization.
Table 4: Misclassification rate (%) on ResNet18.

| Method | MNIST | FMNIST | CIFAR10 | CIFAR100 | Intel | Art |
|---|---|---|---|---|---|---|
| # classes | 10 | 10 | 10 | 100 | 6 | 5 |
| No reg.¹ | 0.668 ± 0.0618 | 8.354 ± 0.072 | 8.145 ± 0.430 | 30.557 ± 0.466 | 6.237 ± 0.269 | 2.927 ± 0.26 |
| Dr.² | 0.601 ± 0.045 | 8.608 ± 0.179 | 7.813 ± 0.170 | 28.363 ± 0.500 | 6.531 ± 0.333 | 2.837 ± 0.348 |
| DL | 0.612 ± 0.1567 | 8.244 ± 0.274 | 7.99 ± 0.444 | 28.048 ± 1.064 | 6.004 ± 0.123 | 2.909 ± 0.416 |
| DDL | 0.558 ± 0.0507 | 8.228 ± 0.107 | 7.667 ± 0.096 | 28.236 ± 1.209 | 6.271 ± 0.207 | 2.891 ± 0.090 |
| Dr.+DL | 0.543 ± 0.039 | 8.480 ± 0.378 | 7.841 ± 0.198 | 28.123 ± 0.540 | 6.231 ± 0.346 | 3.035 ± 0.0342 |
| Dr.+DDL | 0.532 ± 0.054 | 8.475 ± 0.191 | 8.111 ± 0.151 | 28.115 ± 1.186 | 5.917 ± 0.166 | 2.728 ± 0.206 |
¹ No reg. refers to no regularization.
² Dr. refers to Dropout.
4.2 Regression Task
Evaluation metric. We use the root-mean-square error (RMSE) [7]. For every experiment and every dataset we average the RMSE over 20 runs and measure the standard deviation.
Baselines. For all eight datasets we compare the results of our methods against baselines that, to the best of our knowledge, represent the state of the art in neural network regularization, as well as combinations of our methods with those baselines: L2 regularization, Dropout, DV with Gaussian noise, DV with Laplacian noise, DV with cosine annealing [17], DE, DV + Dropout, DV + L2, and DV + DE.
Table 5: Characteristics of the regression datasets.

| Dataset | BHP | BS | AQ | MS | HP | SC | CC | AEP |
|---|---|---|---|---|---|---|---|---|
| # Instances | 506 | 731 | 9,357 | 5,000 | 1,460 | 21,263 | 1,994 | 19,735 |
| # Features | 13 | 13 | 10 | 30 | 81 | 81 | 100 | 27 |
Table 6: Average RMSE (over 20 runs) on the regression datasets, ordered by number of features.

| Method | Air | Boston | Bike | Energy | Sklearn | House | Scond | Crime |
|---|---|---|---|---|---|---|---|---|
| Features | 10 | 13 | 13 | 27 | 30 | 81 | 81 | 100 |
| No reg. | 0.00880 | 0.09496 | 0.03090 | 0.00506 | 0.06493 | 0.01363 | 0.08137 | 0.14596 |
| L2 | 0.00620 | 0.09122 | 0.02260 | 0.00431 | 0.06088 | 0.01952 | 0.07922 | 0.14270 |
| Dropout | 0.00744 | 0.09121 | 0.01986 | 0.00455 | 0.06435 | 0.01159 | 0.08100 | 0.14514 |
| DV gaus | 0.00451 | 0.08958 | 0.01566 | 0.00327 | 0.06100 | 0.01133 | 0.07930 | 0.14350 |
| DV lapl | 0.00512 | 0.09093 | 0.02559 | - | 0.06106 | 0.01460 | - | - |
| DV anneal | 0.01207 | 0.09324 | 0.03013 | 0.00455 | 0.06235 | 0.00967 | 0.07867 | 0.14460 |
| DE | 0.00699 | 0.08981 | 0.02525 | 0.00462 | 0.06106 | 0.01464 | 0.07953 | 0.14448 |
| DV+Drop | 0.00453 | 0.090262 | 0.01960 | 0.00153 | 0.06335 | 0.01504 | 0.07972 | 0.14389 |
| DV+L2 | 0.00558 | 0.08728 | 0.02147 | 0.00313 | 0.06016 | 0.01142 | 0.07928 | 0.14215 |
| DV+DE | 0.00293 | 0.08789 | 0.02327 | 0.00335 | 0.06496 | 0.01270 | 0.07937 | 0.14349 |
Datasets. We use eight datasets of different sizes and complexity for our evaluation: Boston House Prices (BHP) [8], Bike Sharing (BS) [36], Air Quality (AQ) [10], Make-sklearn (MS) [8], Housing Price (HP) [11], Superconductivity (SC) [12], Communities and Crime (CC) [13], and Appliances Energy Prediction (AEP) [14]. The characteristics of the datasets are summarized in Table 5. MinMax scaling is used to normalize the features present in the data to a fixed range.
Architecture. We implement our methods using a neural network with two hidden layers and the ReLU activation function.
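A sketch of such a network follows; the hidden width is our assumption, since the paper does not state it here:

```python
import torch.nn as nn

class RegressionMLP(nn.Module):
    """Sketch of the regression network: two hidden layers with ReLU.
    The hidden width (64) is an assumed value."""
    def __init__(self, in_features, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # single continuous output
        )

    def forward(self, x):
        return self.net(x)
```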
Optimizer. We use the Adam optimizer [6] with a learning rate of 0.001 in all experiments.
Hyperparameters. For each dataset a grid search is used to find the best value for the hyperparameters:

1. Penalty (λ) for L2 regularization,
2. Drop rate (%) for Dropout,
3. Disturb rate (%) for DisturbValue,
4. Residual boundary (e) for DisturbError.
Experimental Results. To tune the hyperparameters, each dataset was split into a training set (50%) and a testing set (50%). For each hyperparameter value we fit our models on the training set and then evaluate the metric on the testing set. For every run, the dataset values were shuffled before splitting. Table 6 shows the results of the experiments, which demonstrate that the DV approach, or a combination of DV with DE, cosine annealing, Dropout, or L2, outperforms the baselines on all eight datasets. We were interested in whether the most suitable approach depends on data size and complexity. To check this, Table 6 is ordered from the dataset with the smallest complexity (fewest features) to the dataset with the largest. No dependencies were revealed by this analysis.
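The tuning protocol described above can be sketched as follows; `train_model` and `predict` are hypothetical helpers, and the candidate grid is illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def rmse_for(X, y, disturb_rate, seed):
    # Shuffle, split 50/50, train with one candidate value, score RMSE.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, shuffle=True, random_state=seed)
    model = train_model(X_tr, y_tr, disturb_rate=disturb_rate)   # hypothetical
    return np.sqrt(np.mean((predict(model, X_te) - y_te) ** 2))  # hypothetical

# Pick the disturb rate with the lowest RMSE averaged over 20 shuffled runs.
best_rate = min([0.05, 0.1, 0.2], key=lambda r: np.mean(
    [rmse_for(X, y, r, seed) for seed in range(20)]))
```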
We also analyzed the standard deviations for every method to verify that the proposed approaches (DV and DE) are robust. There is no significant difference between the standard deviation of the model without regularization and those of the models with regularization.
5 Conclusion
In this paper, we have extended one of the modern regularization methods, DisturbLabel (DL), by proposing an improved procedure of label disturbance for the classification task and by projecting the idea of useful ground-truth disturbance onto the regression domain, which was not covered by the baseline paper.
The problem of overfitting can be interpreted as a neural network memorizing the ground truth. Our extension for the classification task is based on the hypothesis that confident predictions often stem from this memorization of the ground-truth labels, which is why it makes more sense to penalize (in our case, randomly disturb) a share of them, and not to perturb the training procedure for predictions that are naturally unconfident. We proposed using the cosine similarity to measure the distance between the vectors of ground truth and predicted class probabilities in classification tasks. We tested this method on six datasets and two architectures and showed that DDL brings an improvement compared to DL (alone or in combination with Dropout).
We have also shown that extending the DL method to the regression domain can improve model performance and help avoid overfitting. We presented two methods: (1) DisturbValue (DV), which injects Gaussian noise into target values at random, and (2) DisturbError (DE), which injects Gaussian noise into a target value if the prediction is close to it, in other words, if the difference between the prediction and the target value is smaller than a small constant. Our proposed DV method outperformed well-known baselines, alone or in combination with other approaches (DE, L2, Dropout, cosine annealing). The experiments were done on eight datasets of different sizes and complexity.
References
- [1] Xie, Lingxi and Wang, Jingdong and Wei, Zhen and Wang, Meng and Tian, Qi, (2016). Disturblabel: Regularizing cnn on the loss layer, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016, 4753–4762.
- [2] Li Wan and Matthew Zeiler and Sixin Zhang and Yann Le Cun and Rob Fergus, (2013). Regularization of Neural Networks using DropConnect, Proceedings of the 30th International Conference on Machine Learning, 1058–1066, Vol 28.
- [3] Krizhevsky, Alex and Sutskever, Ilya and Hinton, Geoffrey E, (2012). Imagenet classification with deep convolutional neural networks, Advances in neural information processing systems, 1097–1105.
- [4] Srivastava, Nitish and Hinton, Geoffrey and Krizhevsky, Alex and Sutskever, Ilya and Salakhutdinov, Ruslan, (2014). Dropout: a simple way to prevent neural networks from overfitting, The journal of machine learning research 2014, Vol 15, 1929–1958.
- [5] Pereyra, Gabriel and Tucker, George and Chorowski, Jan and Kaiser, Lukasz and Hinton, Geoffrey, (2017). Regularizing neural networks by penalizing confident output distributions, arXiv preprint arXiv:1701.06548.
- [6] Diederik P. Kingma and Jimmy Ba, (2017). Adam: A Method for Stochastic Optimization, arXiv:1412.6980.
- [7] Rob J. Hyndman and Anne B. Koehler, (2006). Another look at measures of forecast accuracy, International Journal of Forecasting, Vol 22, 679–688.
- [8] Pedregosa, F. and Varoquaux, G., Gramfort, A., Michel, V.,Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E., (2011). Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, Vol 12, 2825–2830.
- [9] Fanaee-T, Hadi and Gama, Joao, (2013). Event labeling combining ensemble detectors and background knowledge, Progress in Artificial Intelligence, Springer Berlin Heidelberg, 1–15.
- [10] De Vito, S. et al., (2008). On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario, Sensors and Actuators B: Chemical, Vol 129, 750–757.
- [11] De Cock, Dean, (2011). Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project, Journal of Statistics Education, Vol 19.
- [12] Kam Hamidieh, (2018). A data-driven statistical model for predicting the critical temperature of a superconductor, Computational Materials Science, Vol 154, 346–354
- [13] Dua, Dheeru and Graff, Casey, (2017). UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, http://archive.ics.uci.edu/ml
- [14] Luis M. Candanedo and Véronique Feldheim and Dominique Deramaix, (2017). Data driven prediction models of energy use of appliances in a low-energy house, Energy and Buildings, Vol 140, 81–97.
- [15] Geoffrey E. Hinton and Nitish Srivastava and Alex Krizhevsky and Ilya Sutskever and Ruslan R. Salakhutdinov, (2012). Improving neural networks by preventing co-adaptation of feature detectors, arXiv:1207.0580.
- [16] A. Krizhevsky, (2009). Learning Multiple Layers of Features from Tiny Images, http://www.cs.toronto.edu/%7Ekriz/learning-features-2009-TR.pdf
- [17] Loshchilov, Ilya and Hutter, Frank, (2016). SGDR: Stochastic Gradient Descent with Warm Restarts, arXiv preprint arXiv:1608.03983.
- [18] Christian Szegedy and Vincent Vanhoucke and Sergey Ioffe and Jonathon Shlens and Zbigniew Wojna, (2015). Rethinking the Inception Architecture for Computer Vision, arXiv preprint arXiv:1512.00567.
- [19] Kaiming He and Xiangyu Zhang and Shaoqing Ren and Jian Sun, (2015). Deep Residual Learning for Image Recognition, cs.CV arXiv:1512.03385.
- [20] LeCun, Yann and Boser, Bernhard and Denker, John and Henderson, Donnie and Howard, R. and Hubbard, Wayne and Jackel, Lawrence, (1990), Handwritten Digit Recognition with a Back-Propagation Network, Advances in Neural Information Processing Systems, Vol 2.
- [21] Copeland, B. Jack., (2004). The Essential Turing: Seminal Writings in Computing, Logic, Philosophy, Artificial Intelligence, and Artificial Life plus The Secrets of Enigma, Oxford University Press.
- [22] Alae Chouiekh and EL Hassane Ibn EL Haj, (2018). ConvNets for Fraud Detection analysis, Proceedings of the first international conference on intelligent computing in data sciences ICDS2017, Vol 127, 133–138.
- [23] M. J. Shafiee, A. Jeddi, A. Nazemi, P. Fieguth, and A. Wong, (2021). Deep Neural Network Perception Models and Robust Autonomous Driving Systems: Practical Solutions for Mitigation and Improvement, IEEE Signal Processing Magazine, Vol 38, 22–30.
- [24] LeCun, Yann and Cortes, Corinna, (2010). MNIST handwritten digit database, http://yann.lecun.com/exdb/mnist/
- [25] Alex Krizhevsky and Vinod Nair and Geoffrey Hinton, (2010). CIFAR-10 (Canadian Institute for Advanced Research), http://www.cs.toronto.edu/~kriz/cifar.html
- [26] Imani, Ehsan and White, Martha, (2018). Improving regression performance with distributional losses, International Conference on Machine Learning, PMLR 2018, 2157–2166.
- [27] Poole, Ben and Sohl-Dickstein, Jascha and Ganguli, Surya, (2014). Analyzing noise in autoencoders and deep networks, arXiv preprint arXiv:1406.1831.
- [28] Hochreiter, Sepp and Schmidhuber, Jürgen and others, (1995). Simplifying neural nets by discovering flat minima, Advances in neural information processing systems, 529–536.
- [29] Moradi, Reza and Berangi, Reza and Minaei, Behrouz, (2020). A survey of regularization strategies for deep models, Artificial Intelligence Review, Vol 53, 3947–3986.
- [30] Zhang, Hongyi and Cisse, Moustapha and Dauphin, Yann N and Lopez-Paz, David, (2017). Mixup: Beyond empirical risk minimization, arXiv preprint arXiv:1710.09412.
- [31] Li, Weizhi and Dasarathy, Gautam and Berisha, Visar, (2020). Regularization via structural label smoothing, International Conference on Artificial Intelligence and Statistics, 1453–1463.
- [32] Joo, Taejong and Chung, Uijung and Seo, Min-Gwan, (2020). Being Bayesian about categorical probability, International Conference on Machine Learning, 4950–4961.
- [33] Danil, (2018). Art Images: Drawing/Painting/Sculptures/Engravings, https://www.kaggle.com/thedownhill/art-images-drawings-painting-sculpture-engraving
- [34] Puneet Bansal, (2019). Intel Image Classification, https://www.kaggle.com/puneet6060/intel-image-classification
- [35] Han Xiao and Kashif Rasul and Roland Vollgraf, (2017). Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms, cs.LG arXiv:1708.07747
- [36] Fanaee-T, Hadi and Gama, Joao, (2013). Event labeling combining ensemble detectors and background knowledge, Progress in Artificial Intelligence, 1–15.
- [37] Wang, Chuan and Principe, Jose C, (1999). Training neural networks with additive noise in the desired signal, IEEE Transactions on Neural Networks, Vol 10, 1511–1517.