
Implicit Regularization of Dropout

Zhongwang Zhang1, Zhi-Qin John Xu1,2
1 School of Mathematical Sciences, Institute of Natural Sciences, MOE-LSC, Shanghai Jiao Tong University
2 Qing Yuan Research Institute, Shanghai Jiao Tong University
Corresponding author: [email protected].
Abstract

It is important to understand how dropout, a popular regularization method, aids in achieving a good generalization solution during neural network training. In this work, we present a theoretical derivation of an implicit regularization of dropout, which is validated by a series of experiments. Additionally, we numerically study two implications of the implicit regularization, which intuitively rationalize why dropout helps generalization. Firstly, we find that input weights of hidden neurons tend to condense on isolated orientations when trained with dropout. Condensation is a feature of the non-linear learning process, which makes the network less complex. Secondly, we experimentally find that training with dropout leads to a neural network with a flatter minimum than standard gradient descent training, and that the implicit regularization is the key to finding flat solutions. Although our theory mainly focuses on dropout used in the last hidden layer, our experiments apply to general dropout in training neural networks. This work points out a distinct characteristic of dropout compared with stochastic gradient descent and serves as an important basis for fully understanding dropout.

1 Introduction

Dropout is used with gradient-descent-based algorithms for training neural networks (NNs) [1, 2] and can improve generalization in deep learning [3, 4]. For example, common neural network frameworks such as PyTorch use dropout by default during transformer training. Dropout works by multiplying the output of each neuron by a random variable that takes the value $1/p$ with probability $p$ and zero with probability $1-p$ during training. Note that this random variable is freshly sampled at every feedforward operation.

The effect of dropout is equivalent to adding a specific noise to the gradient descent training. Theoretically, based on the method of the modified gradient flow [5], we derive implicit regularization terms of dropout training for networks with dropout on the last hidden layer. The implicit regularization of dropout can lead to two important implications, condensed weights and flat solutions, verified by a series of experiments under general settings (code available at https://github.com/sjtuzzw/torch_code_frame).

Firstly, we study weight feature learning in dropout training. Previous works [6, 7, 8] find that, in the nonlinear training regime, input weights of hidden neurons (the input weight of a hidden neuron is a vector consisting of the weight from its input layer to the hidden layer and its bias term) are clustered into several groups under gradient flow training. The weights in each group have similar orientations, which is called condensation. By analyzing the implicit regularization terms, we theoretically find that dropout tends to find solutions with weight condensation. To verify the effect of dropout on condensation, we conduct experiments in the linear regime, such as neural tangent kernel initialization [9], where the weights are in proximity to the random initial values and condensation does not occur in common gradient descent training. We find that even in the linear regime, with dropout, weights show clear condensation in experiments, and for simplicity, we only show the output here (Fig. 1(a)). As condensation reduces the complexity of the NN, dropout may help the generalization by constraining the model’s complexity.

Secondly, we study the flatness of the solution in dropout training. We theoretically show that the implicit regularization terms of dropout lead to a flat minimum. We experimentally verify the effect of the implicit regularization terms on flatness (Fig. 1(b)). As suggested by many existing works [10, 11, 12], flatter minima have a higher probability of better generalization and stability.

(a) condensation; (b) flatness.
Figure 1: The experimental results of two-layer ReLU NNs trained with and without dropout. The width of the hidden layers is $1000$, and the learning rate for all experiments is $1\times 10^{-3}$. (a) The output of NNs with or without dropout. The black points represent the target points. (b) The loss value obtained by perturbing the network with or without dropout in a given random direction. $\alpha$ is the step size moving in the above direction.

This work provides a comprehensive investigation into the implicit regularization of dropout and its associated implications. Although our theoretical analysis mainly focuses on the dropout used in the last hidden layer, our experimental results extend to the general use of dropout in training NNs. Our results show that dropout has a distinct implicit regularization for facilitating weight condensation and finding flat minima, which may jointly improve the generalization performance of NNs.

2 Related Works

Dropout was proposed as a simple approach to prevent overfitting in the training of NNs, thus improving the generalization of the network [1, 2]. Many works aim to find an explicit form of dropout regularization. A previous work [13] presents PAC-Bayesian bounds, and others [14, 15] derive Rademacher generalization bounds. These results show that the reduction of complexity brought by dropout is $O(p)$, where $p$ is the probability of keeping an element in dropout. All of the above works require specific settings, such as norm assumptions and logistic loss, and they only give a rough estimate of the generalization error bound, which usually considers the worst case. [16, 17] study the implicit bias of dropout for linear models. However, it is not clear what characterizes the dropout training process and how to bridge the training with generalization in non-linear neural networks. In this work, we show the implicit regularization of dropout, which may be a key factor in enabling dropout to find solutions with better generalization.

The modified gradient flow is defined as the gradient flow that stays close to the discrete iterates of the original training path up to some high-order learning rate term [5]. [18] derive the modified gradient flow of discrete full-batch gradient descent training as $\hat{R}_{S,GD}(\bm{\theta})=R_{S}(\bm{\theta})+(\varepsilon/4)\lVert\nabla R_{S}(\bm{\theta})\rVert^{2}+O(\varepsilon^{2})$, where $R_{S}(\bm{\theta})$ is the training loss on dataset $S$, $\varepsilon$ is the learning rate and $\lVert\cdot\rVert$ denotes the $l_{2}$-norm. In a similar vein, [19] derive the modified gradient flow of stochastic gradient descent training as $\hat{R}_{S,SGD}(\bm{\theta})=R_{S}(\bm{\theta})+(\varepsilon/4)\lVert\nabla R_{S}(\bm{\theta})\rVert^{2}+(\varepsilon/4m)\sum_{i=0}^{m-1}\lVert\nabla R_{S,i}(\bm{\theta})-\nabla R_{S}(\bm{\theta})\rVert^{2}+O(\varepsilon^{2})$, where $R_{S,i}(\bm{\theta})$ is the $i$th batch loss and the last term is also called the "non-uniform" term [20]. Our work shows that there exist several distinct features between dropout and SGD. Specifically, in the limit of vanishing learning rate, the modified gradient flow of dropout still has an additional implicit regularization term, whereas that of SGD converges to the full-batch gradient flow [21].

The parameter initialization of the network determines the final fitting result of the network. [6, 8] mainly identify the linear regime and the condensed regime for two-layer and three-layer wide ReLU NNs. In the linear regime, the training dynamics of NNs are approximately linear and similar to a random feature model with an exponential loss decay. In the condensed regime, active neurons are condensed at several discrete orientations, which may be an underlying reason why NNs outperform traditional algorithms.

[22, 23] show that NNs of different widths often exhibit similar condensation behavior, e.g., stagnating at a similar loss with almost the same output function. Based on this observation, they propose the embedding principle that the loss landscape of an NN contains all critical points of all narrower NNs. The embedding principle provides a basis for understanding why condensation occurs from the perspective of loss landscape.

Several works study the mechanism of condensation at the initial training stage, such as for ReLU network [24, 25] and network with continuously differentiable activation functions [7]. However, studying condensation throughout the whole training process is generally challenging, with dropout training being an exception. The regularization terms we derive in this work show that the dropout training tends to condense in the whole training process.

3 Preliminary

3.1 Deep Neural Networks

Consider an $L$-layer ($L\geq 2$) fully-connected neural network (FNN). We regard the input as the $0$th layer and the output as the $L$th layer. Let $m_{l}$ represent the number of neurons in the $l$th layer. In particular, $m_{0}=d$ and $m_{L}=d^{\prime}$. For any $i,k\in\mathbb{N}$ and $i<k$, we denote $[i:k]=\{i,i+1,\ldots,k\}$. In particular, we denote $[k]:=\{1,2,\ldots,k\}$.

Given weights $\bm{W}^{[l]}\in\mathbb{R}^{m_{l}\times m_{l-1}}$ and biases $\bm{b}^{[l]}\in\mathbb{R}^{m_{l}}$ for $l\in[L]$, we define the collection of parameters $\bm{\theta}$ as a $2L$-tuple (an ordered list of $2L$ elements) whose elements are matrices or vectors

\bm{\theta}=\Big{(}\bm{\theta}|_{1},\cdots,\bm{\theta}|_{L}\Big{)}=\Big{(}\bm{W}^{[1]},\bm{b}^{[1]},\ldots,\bm{W}^{[L]},\bm{b}^{[L]}\Big{)},

where the $l$th layer parameters of $\bm{\theta}$ are the ordered pair $\bm{\theta}|_{l}=\big(\bm{W}^{[l]},\bm{b}^{[l]}\big)$, $l\in[L]$. We may abuse notation and identify $\bm{\theta}$ with its vectorization $\mathrm{vec}(\bm{\theta})\in\mathbb{R}^{M}$, where $M=\sum_{l=0}^{L-1}(m_{l}+1)m_{l+1}$.

Given $\bm{\theta}\in\mathbb{R}^{M}$, the FNN function $\bm{f}_{\bm{\theta}}(\cdot)$ is defined recursively. First, we denote $\bm{f}^{[0]}_{\bm{\theta}}(\bm{x})=\bm{x}$ for all $\bm{x}\in\mathbb{R}^{d}$. Then, for $l\in[L-1]$, $\bm{f}^{[l]}_{\bm{\theta}}$ is defined recursively as $\bm{f}^{[l]}_{\bm{\theta}}(\bm{x})=\sigma(\bm{W}^{[l]}\bm{f}^{[l-1]}_{\bm{\theta}}(\bm{x})+\bm{b}^{[l]})$, where $\sigma$ is a non-linear activation function. Finally, we denote

\bm{f}_{\bm{\theta}}(\bm{x})=\bm{f}(\bm{x},\bm{\theta})=\bm{f}^{[L]}_{\bm{\theta}}(\bm{x})=\bm{W}^{[L]}\bm{f}^{[L-1]}_{\bm{\theta}}(\bm{x})+\bm{b}^{[L]}.

For notational simplicity, we denote

\bm{f}_{\bm{\theta}}^{j}(\bm{x}_{i})=\bm{W}^{[L]}_{j}f^{[L-1]}_{\bm{\theta},j}(\bm{x}_{i}),

where $\bm{f}_{\bm{\theta}}^{j}(\bm{x}_{i}),\bm{W}^{[L]}_{j}\in\mathbb{R}^{m_{L}}$, $\bm{W}^{[L]}_{j}$ is the $j$th column of $\bm{W}^{[L]}$, and $f^{[L-1]}_{\bm{\theta},j}(\bm{x}_{i})$ is the $j$th element of the vector $\bm{f}^{[L-1]}_{\bm{\theta}}(\bm{x}_{i})$. In this work, we denote the $l_{2}$-norm as $\lVert\cdot\rVert$ for convenience.

3.2 Loss Function

The training data set is denoted as $S=\{(\bm{x}_{i},\bm{y}_{i})\}_{i=1}^{n}$, where $\bm{x}_{i}\in\mathbb{R}^{d}$ and $\bm{y}_{i}\in\mathbb{R}^{d^{\prime}}$. For simplicity, we assume an unknown function $\bm{y}$ satisfying $\bm{y}(\bm{x}_{i})=\bm{y}_{i}$ for $i\in[n]$. The empirical risk reads as

R_{S}(\bm{\theta})=\frac{1}{n}\sum_{i=1}^{n}\ell(\bm{f}(\bm{x}_{i},\bm{\theta}),\bm{y}(\bm{x}_{i})), (1)

where the loss function $\ell(\cdot,\cdot)$ is differentiable and the derivative of $\ell$ with respect to its first argument is denoted by $\nabla\ell(\bm{y},\bm{y}^{*})$. The error with respect to data sample $(\bm{x}_{i},\bm{y}_{i})$ is defined as

\bm{e}(\bm{f}_{\bm{\theta}}(\bm{x}_{i}),\bm{y}_{i})=\bm{f}_{\bm{\theta}}(\bm{x}_{i})-\bm{y}_{i}.

For notational simplicity, we denote $\bm{e}(\bm{f}_{\bm{\theta}}(\bm{x}_{i}),\bm{y}_{i})=\bm{e}_{\bm{\theta},i}$.

3.3 Dropout

For $\bm{f}_{\bm{\theta}}^{[l]}(\bm{x})\in\mathbb{R}^{m_{l}}$, we randomly sample a scaling vector $\bm{\eta}\in\mathbb{R}^{m_{l}}$ whose coordinates are sampled i.i.d. as

(\bm{\eta})_{k}=\begin{cases}\frac{1-p}{p}&\text{ with probability }p\\ -1&\text{ with probability }1-p,\end{cases}

where $p\in(0,1]$ and $k\in[m_{l}]$ indexes the coordinates of $\bm{f}_{\bm{\theta}}^{[l]}(\bm{x})$. It is important to note that $\bm{\eta}$ is a zero-mean random variable. We then apply dropout by computing

\bm{f}_{\bm{\theta},\bm{\eta}}^{[l]}(\bm{x})=(\bm{1}+\bm{\eta})\odot\bm{f}_{\bm{\theta}}^{[l]}(\bm{x}),

and use $\bm{f}_{\bm{\theta},\bm{\eta}}^{[l]}(\bm{x})$ instead of $\bm{f}_{\bm{\theta}}^{[l]}(\bm{x})$. Here we use $\odot$ for the Hadamard product of two matrices of the same dimension. To simplify notation, we let $\bm{\eta}$ denote the collection of such vectors over all layers. We denote the output of model $\bm{f}_{\bm{\theta}}(\bm{x})$ on input $\bm{x}$ using dropout noise $\bm{\eta}$ as $\bm{f}_{\bm{\theta},\bm{\eta}}^{\mathrm{drop}}(\bm{x})$. The empirical risk associated with the network with dropout layers $\bm{f}_{\bm{\theta},\bm{\eta}}^{\mathrm{drop}}$ is denoted by $R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})$, given by

R_{S}^{\mathrm{drop}}\left(\bm{\theta},\bm{\eta}\right)=\frac{1}{n}\sum_{i=1}^{n}\ell(\bm{f}_{\bm{\theta},\bm{\eta}}^{\mathrm{drop}}(\bm{x}_{i}),\bm{y}(\bm{x}_{i})). (2)
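For concreteness, the scaling described above can be sketched in a few lines of NumPy. This is a minimal illustration of the inverted-dropout convention of Section 3.3 (the mask $\bm{1}+\bm{\eta}$ keeps a coordinate with probability $p$ and rescales it by $1/p$), not the code used for the experiments in this paper; the layer sizes are arbitrary.

```python
import numpy as np

def dropout_mask(size, p, rng):
    """Sample 1 + eta: equals 1/p with probability p and 0 with probability 1 - p."""
    keep = rng.random(size) < p
    return keep / p  # inverted dropout scaling

def two_layer_forward_with_dropout(x, W1, b1, W2, b2, p, rng):
    """Two-layer tanh network with dropout applied to the last hidden layer,
    matching Setting 1 (dropout only after the (L-1)th layer)."""
    h = np.tanh(W1 @ x + b1)                      # hidden features f^{[1]}(x)
    h_drop = dropout_mask(h.shape, p, rng) * h    # (1 + eta) applied elementwise
    return W2 @ h_drop + b2                       # network output f^{drop}(x)

# minimal usage example with arbitrary sizes
rng = np.random.default_rng(0)
d, m = 3, 8
W1, b1 = rng.standard_normal((m, d)), rng.standard_normal(m)
W2, b2 = rng.standard_normal((1, m)), rng.standard_normal(1)
x = rng.standard_normal(d)
print(two_layer_forward_with_dropout(x, W1, b1, W2, b2, p=0.9, rng=rng))
```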

4 Modified Gradient Flow

In this section, we theoretically analyze the implicit regularization effect of dropout. We derive the modified gradient flow of dropout in the sense of expectation. We first summarize the settings and provide the necessary definitions used for our theoretical results below. Note that the settings of our experiments are much more general.

Setting 1 (dropout structure).

Consider an $L$-layer ($L\geq 2$) FNN with only one dropout layer after the $(L-1)$th layer of the network,

\bm{f}_{\bm{\theta},\bm{\eta}}^{\mathrm{drop}}(\bm{x})=\bm{W}^{[L]}(\bm{1}+\bm{\eta})\odot\bm{f}^{[L-1]}_{\bm{\theta}}(\bm{x})+\bm{b}^{[L]}.
Setting 2 (loss function).

Take the mean squared error (MSE) as our loss function,

R_{S}(\bm{\theta})=\frac{1}{2n}\sum_{i=1}^{n}(\bm{f}(\bm{x}_{i},\bm{\theta})-\bm{y}_{i})^{2}.
Setting 3 (network structure).

For convenience, we set the model output dimension to one, i.e., $m_{L}=1$.

In the following, we introduce two key terms that play an important role in our theoretical results:

R_{1}(\bm{\theta}):=\frac{1-p}{2np}\sum_{i=1}^{n}\sum_{j=1}^{m_{L-1}}\|\bm{W}^{[L]}_{j}f^{[L-1]}_{\bm{\theta},j}(\bm{x}_{i})\|^{2}, (3)

R_{2}(\bm{\theta}):=\frac{\varepsilon}{4}\mathbb{E}_{\bm{\eta}}\|\nabla_{\bm{\theta}}R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})\|^{2}, (4)

where $\bm{W}^{[L]}_{j}\in\mathbb{R}^{m_{L}}$ is the $j$-th column of $\bm{W}^{[L]}$, $f^{[L-1]}_{\bm{\theta},j}(\bm{x}_{i})$ is the $j$-th element of $\bm{f}^{[L-1]}_{\bm{\theta}}(\bm{x}_{i})$, $\mathbb{E}_{\bm{\eta}}$ is the expectation with respect to $\bm{\eta}$, and $\varepsilon$ is the learning rate.

Based on the above settings, we obtain a modified equation for the gradient flow of dropout training.

Lemma 1 (the expectation of dropout loss).

Given an $L$-layer FNN with dropout $\bm{f}_{\bm{\theta},\bm{\eta}}^{\mathrm{drop}}(\bm{x})$, under Settings 1–3, we have the expectation of the dropout MSE:

\mathbb{E}_{\bm{\eta}}(R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta}))=R_{S}(\bm{\theta})+R_{1}(\bm{\theta}).

Based on the above lemma, we proceed to study the discrete iterate training of gradient descent with dropout, resulting in the derivation of the modified gradient flow of dropout training.

Modified gradient flow of dropout. Under Settings 1–3, the mean iterate of $\bm{\theta}$, with a learning rate $\varepsilon\ll 1$, stays close to the path of gradient flow on a modified loss, $\dot{\bm{\theta}}=-\nabla_{\bm{\theta}}\tilde{R}_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})$, where the modified loss $\tilde{R}_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})$ satisfies:

\mathbb{E}_{\bm{\eta}}\tilde{R}_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})\approx R_{S}(\bm{\theta})+R_{1}(\bm{\theta})+R_{2}(\bm{\theta}). (5)

Contrary to SGD [19], the $R_{1}(\bm{\theta})$ term is independent of the learning rate $\varepsilon$, so the implicit regularization of dropout still affects the gradient flow even as the learning rate $\varepsilon$ approaches zero. In Section 6, we show that the $R_{1}(\bm{\theta})$ term makes the network tend to find solutions with lower complexity, that is, solutions with weight condensation, which is also illustrated and supported numerically. In Section 7, we show that the $R_{1}(\bm{\theta})$ term plays a more important role in improving the generalization and flatness of the model than the $R_{2}(\bm{\theta})$ term, which explicitly aims to find a flatter solution.

5 Numerical Verification of Implicit Regularization Terms

In this section, we numerically verify the validity of the two implicit regularization terms, i.e., $R_{1}(\bm{\theta})$ defined in Equation (3) and $R_{2}(\bm{\theta})$ defined in Equation (4), under more general settings than our theoretical results. The detailed experimental settings can be found in Appendix A.

5.1 Validation of the Effect of $R_{1}(\bm{\theta})$

Figure 2: Two-layer NNs of width 1000 for the classification of the first 1000 images of the MNIST dataset, utilizing two distinct loss functions: $R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})$ and $R_{S}(\bm{\theta})+R_{1}(\bm{\theta})$. To study the impact of different dropout rates on the performance of the networks, we conduct experiments with varying dropout rates while maintaining a constant learning rate of $\varepsilon=5\times 10^{-3}$. (a) The test accuracy. (b) The value of $R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})$ and $R_{1}(\bm{\theta})$.

As $R_{1}(\bm{\theta})$ is independent of the learning rate and $R_{2}(\bm{\theta})$ vanishes in the limit of zero learning rate, we select a small learning rate to verify the validity of $R_{1}(\bm{\theta})$. According to Equation (5), the modified loss of the dropout training dynamics can be approximated by $R_{S}(\bm{\theta})+R_{1}(\bm{\theta})$ when the learning rate $\varepsilon$ is sufficiently small. Therefore, we verify the validity of $R_{1}(\bm{\theta})$ through the similarity of the NNs trained by the two loss functions, i.e., $R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})$ and $R_{S}(\bm{\theta})+R_{1}(\bm{\theta})$, under a small learning rate.

Fig. 2(a) presents the test accuracy of the two losses trained under different dropout rates. For the network trained with $R_{S}(\bm{\theta})+R_{1}(\bm{\theta})$, there is no dropout layer, and the dropout rate affects the weight of $R_{1}(\bm{\theta})$ in the loss function. For different dropout rates, the networks obtained by the two losses above exhibit similar test accuracy. It is worth mentioning that for the network trained with $R_{S}(\bm{\theta})$ alone, the obtained accuracy is only $79\%$, which is significantly lower than the accuracy of the networks trained with the two loss functions above (over $88\%$ in Fig. 2(a)). In Fig. 2(b), we show the values of $R_{S}(\bm{\theta})$ and $R_{1}(\bm{\theta})$ for the two networks at different dropout rates. Note that for the network obtained by $R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})$ training, we can calculate the two terms through the network's parameters. It can be seen that for different dropout rates, the values of $R_{S}(\bm{\theta})$ and $R_{1}(\bm{\theta})$ of the two networks are almost indistinguishable.

Figure 3: Classification of the first 1000 images of CIFAR-10 by training VGG-9 under a specific loss function by GD. For the loss function $R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})$, we train the NNs with various learning rates $\varepsilon$. For the loss function $R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})+(\lambda/4)\|\nabla_{\bm{\theta}}R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})\|^{2}$, we train the NNs with various regularization coefficients $\lambda$, while keeping the learning rate fixed at a small value of $\varepsilon=5\times 10^{-3}$. (a) The test accuracy of the network under different learning rates and regularization coefficients. The red dots indicate the location of the maximum test accuracy of the NNs obtained by training with the two loss functions. (b) The value of $(\mathbb{E}_{\bm{\eta}}\lVert R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})\rVert)/(\mathbb{E}_{\bm{\eta}}\lVert\nabla_{\bm{\theta}}R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})\rVert^{2})$ of the resulting model in (a) under different learning rates $\varepsilon$ or regularization coefficients $\lambda$.

5.2 Validation of the Effect of $R_{2}(\bm{\theta})$

As shown in Theorem 4, the modified loss $\tilde{R}_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})$ satisfies the equation:

\mathbb{E}_{\bm{\eta}}\tilde{R}_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})=\mathbb{E}_{\bm{\eta}}\left(R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})+\frac{\varepsilon}{4}\|\nabla_{\bm{\theta}}R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})\|^{2}\right).

In order to validate the effect of $R_{2}(\bm{\theta})$ in the training process, we verify the equivalence of the following two training methods: (i) training networks with a dropout layer by the MSE $R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})$ with different learning rates $\varepsilon$; (ii) training networks with a dropout layer by the MSE with an explicit regularization:

R_{S}^{\mathrm{regu}}(\bm{\theta},\bm{\eta}):=R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})+(\lambda/4)\|\nabla_{\bm{\theta}}R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})\|^{2}

with different values of $\lambda$ and a fixed learning rate much smaller than $\varepsilon$. The exact form of $R_{2}(\bm{\theta})$ has an expectation with respect to $\bm{\eta}$, but in this subsection, we ignore this expectation in experiments for convenience.
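A minimal sketch of training method (ii) is given below, assuming a PyTorch model with a dropout layer; the gradient-norm penalty is obtained by differentiating through the gradient itself (create_graph=True). The model, data, and hyperparameters are placeholders, not the exact configuration used in Fig. 3.

```python
import torch

def regularized_dropout_loss(model, x, y, lam):
    """R_S^regu = R_S^drop + (lam/4) * ||grad_theta R_S^drop||^2
    for one fixed realization of the dropout noise."""
    loss = 0.5 * torch.mean((model(x).squeeze() - y) ** 2)   # MSE with dropout active
    grads = torch.autograd.grad(loss, list(model.parameters()),
                                create_graph=True)           # keep graph for the penalty
    grad_sq = sum((g ** 2).sum() for g in grads)
    return loss + 0.25 * lam * grad_sq

# usage sketch with a toy network and random data
# note: PyTorch's p is the drop probability, i.e. 1 - p in this paper's notation
model = torch.nn.Sequential(torch.nn.Linear(10, 100), torch.nn.Tanh(),
                            torch.nn.Dropout(p=0.1), torch.nn.Linear(100, 1))
x, y = torch.randn(64, 10), torch.randn(64)
opt = torch.optim.SGD(model.parameters(), lr=5e-3)
for _ in range(10):
    opt.zero_grad()
    regularized_dropout_loss(model, x, y, lam=0.05).backward()
    opt.step()
```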

As shown in Fig. 3, we train the NNs by the MSE $R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})$ with different learning rates (blue), and by the regularized MSE $R_{S}^{\mathrm{regu}}(\bm{\theta},\bm{\eta})$ with a fixed small learning rate and different values of $\lambda$ (orange). In Fig. 3(a), the learning rate $\varepsilon$ and the regularization coefficient $\lambda$ are close when they reach their corresponding maximum test accuracy (red point). In addition, as shown in Fig. 3(b), we study the value of $\mathbb{E}_{\bm{\eta}}\lVert R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})\rVert/\mathbb{E}_{\bm{\eta}}\lVert\nabla_{\bm{\theta}}R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})\rVert^{2}$ under different learning rates (blue) and regularization coefficients (orange). In practical experiments, we take 3000 different dropout noises $\bm{\eta}$ to approximate the expectation after the training process. The results indicate that the same learning rate $\varepsilon$ and regularization coefficient $\lambda$ result in similar ratios.

Due to the computational cost of full-batch GD, we only use a few training samples in the above experiments. We conduct similar experiments with dropout under different learning rates and regularization coefficients using SGD as detailed in Appendix C.1.

6 Dropout Facilitates Condensation

A condensed network, which refers to a network whose neurons have aligned input weights, is equivalent to another network with a reduced width [7, 6]. Therefore, the effective complexity of the network is smaller than its apparent size. Such low effective complexity may be an underlying reason for good generalization. In addition, the embedding principle [22, 23, 26] shows that although a condensed network is equivalent to a smaller one in the sense of approximation, it has more degeneracy and more descent directions, which may lead to a simpler training process.

In this section, we experimentally and theoretically study the effect of dropout on the condensation phenomenon.

6.1 Experimental Results

To empirically validate the effect of dropout on condensation, we examine ReLU and tanh activations in one-dimensional and high-dimensional fitting problems, as well as image classification problems. Due to space limitations, some experimental results and detailed experimental settings are left in Appendices A, C.

6.1.1 Network with One-dimensional Input

We train a tanh NN with 1000 hidden neurons on the one-dimensional fitting problem shown in Fig. 4 with MSE. Additional experimental verification on ReLU NNs is provided in Appendix C. Networks trained with and without dropout under the same initialization can both fit the training data well. In order to clearly study the effect of dropout on condensation, we take the parameter initialization distribution in the linear regime [6], where condensation does not occur without additional constraints. The dropout layer is used after the hidden layer of the two-layer network (top row) and between the hidden layers and after the last hidden layer of the three-layer network (bottom row). Upon close inspection of the fitting process, we find that the output of NNs trained without dropout in Fig. 4(a, e) has much more oscillation than the output of NNs trained with dropout in Fig. 4(b, f). To better understand the underlying effect of dropout, we study the feature of the parameters.

(a) $p=1$, output; (b) $p=0.9$, output; (c) $p=1$, feature; (d) $p=0.9$, feature; (e) $p=1$, output; (f) $p=0.9$, output; (g) $p=1$, feature; (h) $p=0.9$, feature.
Figure 4: tanh NN outputs and features under different dropout rates. The width of the hidden layers is $1000$, and the learning rate for all experiments is $1\times 10^{-3}$. In (c, d, g, h), blue dots and orange dots show the weight feature distribution at the initial and final training stages, respectively. The top row is the result of two-layer networks, with the dropout layer after the hidden layer. The bottom row is the result of three-layer networks, with the dropout layer between the two hidden layers and after the last hidden layer. Refer to Appendix C.2 for further experiments on ReLU NNs.

The parameter pair $(a_{j},\bm{w}_{j})$ of each neuron can be separated into a unit orientation feature $\hat{\bm{w}}_{j}=\bm{w}_{j}/\lVert\bm{w}_{j}\rVert_{2}$ and an amplitude $A_{j}=|a_{j}|\lVert\bm{w}_{j}\rVert_{2}$ indicating its contribution to the output, i.e., $(\hat{\bm{w}}_{j},A_{j})$. (Due to the homogeneity of ReLU neurons, this amplitude can accurately describe the contribution of ReLU neurons; for tanh neurons, the amplitude has a certain positive correlation with the contribution of each neuron.) For a one-dimensional input, $\bm{w}_{j}$ is two-dimensional due to the incorporation of the bias. Therefore, we use the angle to the $x$-axis in $[-\pi,\pi)$ to indicate the orientation of each $\hat{\bm{w}}_{j}$. For simplicity, for the three-layer network with one-dimensional input, we only consider the input weights of the first hidden layer. The scatter plots of $\{(\hat{\bm{w}}_{j},|a_{j}|)\}_{j=1}^{m}$ and $\{(\hat{\bm{w}}_{j},\lVert\bm{w}_{j}\rVert_{2})\}_{j=1}^{m}$ for the tanh activation are presented in Appendix C.3 to eliminate the impact of the non-homogeneity of the tanh activation.
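The decomposition into orientation and amplitude can be computed directly from the first-layer parameters. The sketch below, assuming a two-layer network with one-dimensional input whose parameters are stored as NumPy arrays, extracts the angle of $\hat{\bm{w}}_{j}$ and the normalized amplitude $A_{j}$ used in the scatter plots of Fig. 4; variable names are illustrative.

```python
import numpy as np

def orientation_amplitude(W, b, a):
    """W: (m,) input weights, b: (m,) biases, a: (m,) output weights
    of a two-layer net with one-dimensional input.
    Returns the angle of each w_j = (W_j, b_j) and the normalized amplitude A_j."""
    w = np.stack([W, b], axis=1)                 # (m, 2) input-weight vectors
    angle = np.arctan2(w[:, 1], w[:, 0])         # orientation angle to the x-axis
    amp = np.abs(a) * np.linalg.norm(w, axis=1)  # A_j = |a_j| * ||w_j||_2
    return angle, amp / amp.max()                # normalize so the maximum amplitude is 1

# usage sketch with random parameters
rng = np.random.default_rng(0)
m = 1000
angle, amp = orientation_amplitude(rng.standard_normal(m),
                                   rng.standard_normal(m),
                                   rng.standard_normal(m))
```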

The scatter plots of $\{(\hat{\bm{w}}_{j},A_{j})\}_{j=1}^{m}$ of the NNs are shown in Fig. 4(c, d, g, h). For convenience, we normalize the feature distribution of each model so that the maximum amplitude of neurons in each model is $1$. Compared with the initial weight distribution (blue), the weights trained without dropout (orange) stay close to their initial values. However, for the NNs trained with dropout, the parameters after training are significantly different from the initialization, and the non-zero parameters tend to condense on several discrete orientations, showing a condensation tendency.

In addition, we study the stability of the model trained with the loss function $R_{S}(\bm{\theta})$ under the two loss functions $R_{S}^{\mathrm{drop}}(\bm{\theta})$ and $R_{S}(\bm{\theta})+R_{1}(\bm{\theta})$. As shown in the left panel of Fig. 5, we use $R_{S}(\bm{\theta})$ as the loss function to train the model until $R_{S}(\bm{\theta})$ is small (before the dashed line), and we then replace the loss function by $R_{S}^{\mathrm{drop}}(\bm{\theta})$ or $R_{S}(\bm{\theta})+R_{1}(\bm{\theta})$. The outputs and features of the models trained with these three loss functions are shown in the middle and right panels of Fig. 5, respectively. The results reveal that dropout (the $R_{1}(\bm{\theta})$ term) helps the training process escape from the minima obtained by $R_{S}(\bm{\theta})$ training and find a condensed solution.

Figure 5: Compared dynamics are initialized at the model found by $R_{S}(\bm{\theta})$ training, marked by the vertical dashed line at iteration 200000, with a two-layer tanh NN. Left: The loss trajectory under different loss functions. Middle: The output of the model trained by $R_{S}(\bm{\theta})$ (blue) and of the models trained by $R_{S}^{\mathrm{drop}}(\bm{\theta})$ (orange) and $R_{S}(\bm{\theta})+R_{1}(\bm{\theta})$ (green) initialized at the model found by $R_{S}(\bm{\theta})$. The black points are the target points. Right: The features of the model trained by $R_{S}(\bm{\theta})$ (blue) and of the models trained by $R_{S}^{\mathrm{drop}}(\bm{\theta})$ (orange) and $R_{S}(\bm{\theta})+R_{1}(\bm{\theta})$ (green) initialized at the model found by $R_{S}(\bm{\theta})$.

One may wonder whether any noise injected into the training process could lead to condensation. We perform similar experiments for SGD. As shown in Fig. 6, no significant condensation occurs even in the presence of SGD noise during training. Therefore, the experiments in this section reveal a special characteristic of dropout that facilitates condensation.

(a) batch size $=2$, output; (b) batch size $=2$, feature.
Figure 6: Two-layer tanh NN output and feature with a batch size of 2. The width of the hidden layer is $1000$, and the learning rate is $1\times 10^{-3}$. In (b), blue dots and orange dots show the weight feature distribution at the initial and final training stages, respectively.

6.1.2 Network with High-dimensional Input

We further investigate the effect of dropout on high-dimensional two-layer tanh NNs under the teacher-student setting. Specifically, we utilize a two-layer tanh NN with only one hidden neuron and 10-dimensional input as the target function. The orientation similarity of two neurons is calculated by taking the inner product of their normalized weights. As shown in Fig. 7(a, b), for the NN with dropout, the neurons of the network have only two orientations, indicating the occurrence of condensation, while the NN without dropout does not exhibit such a phenomenon.

(a) $p=0.5$, feature; (b) $p=1$, feature; (c) effective ratio.
Figure 7: Sparsity in high-dimensional NNs with different dropout rates. (a, b) Parameter features of the two-layer tanh NNs with and without dropout. (c) Effective ratio with and without dropout for the task of CIFAR-10 classification with ResNet-18. Conv2-1 and conv3-1 represent the parameters of the first convolutional layer of the second block and the third block of the ResNet, respectively.

To visualize the condensation during the training process, we define the ratio of effective neurons as follows.

Definition 1 (effective ratio).

For a given NN, the input weight of neuron $j$ in the $l$-th layer is vectorized as $\bm{\theta}_{j}^{[l]}\in\mathbb{R}^{m_{l}}$. Let $U^{[l]}=\{\bm{u}_{k}^{[l]}\}_{k=1}^{m_{l}}$ be a set of vectors such that for any $\bm{\theta}_{j}^{[l]}$, there exists an element $\bm{u}\in U^{[l]}$ satisfying $\bm{u}\cdot\bm{\theta}_{j}^{[l]}>0.95$. The effective neuron number $m^{\rm eff}_{l}$ of the $l$-th layer is defined as the minimal size of all possible $U^{[l]}$. The effective ratio is defined as $m^{\rm eff}_{l}/m_{l}$.
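Definition 1 can be turned into a simple greedy computation: normalize the input-weight vectors and count how many representative directions are needed so that every neuron has cosine similarity above 0.95 with one of them. The sketch below is one such greedy approximation (finding a truly minimal set is a covering problem), and it assumes the inner product in Definition 1 is taken between normalized vectors; array shapes are placeholders.

```python
import numpy as np

def effective_ratio(weights, threshold=0.95):
    """weights: (m, k) array, one row per neuron's vectorized input weight.
    Greedy approximation of the effective neuron number of Definition 1."""
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)  # unit directions
    representatives = []
    for v in w:
        # reuse an existing representative if it is close enough in orientation
        if not any(v @ u > threshold for u in representatives):
            representatives.append(v)
    return len(representatives) / len(w)

# usage sketch: condensed weights give a small effective ratio
rng = np.random.default_rng(0)
base = rng.standard_normal((3, 20))                      # 3 underlying orientations
condensed = base[rng.integers(0, 3, size=1000)] + 0.01 * rng.standard_normal((1000, 20))
print(effective_ratio(condensed))                        # close to 3 / 1000
```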

We study the training process of using ResNet-18 to learn CIFAR-10. As shown in Fig. 7(c), NNs with dropout tend to have lower effective ratios, and thus tend to exhibit condensation.

6.1.3 Dropout Improves Generalization

As the effective neuron number of a condensed network is much smaller than its actual neuron number, it is expected to generalize better. To verify this, we use a two-layer tanh network with $1000$ hidden neurons to learn a teacher two-layer tanh network with two neurons. The number of free parameters in the teacher network is $6$. As shown in Fig. 8, the model with dropout generalizes well when the number of samples is larger than 6, while the model without dropout generalizes badly. This result is consistent with the rank analysis of non-linear models [27].

Figure 8: Average test error of the two-layer tanh NNs (color) vs. the number of samples (abscissa) for different dropout rates (ordinate). For all experiments, the width of the hidden layer is $1000$, and the learning rate is $1\times 10^{-4}$ with the Adam optimizer. Each test error is averaged over $10$ trials with random initialization. Refer to Appendix C.2 for further experiments on ReLU NNs.

6.2 The Effect of $R_{1}(\bm{\theta})$ on Condensation

As can be seen from the implicit regularization term $R_{1}(\bm{\theta})$, dropout imposes an additional $l_{2}$-norm constraint on the output of each neuron. This constraint promotes condensation. We illustrate the effect of $R_{1}(\bm{\theta})$ with a toy example of a two-layer ReLU network.

We use the following two-layer ReLU network to fit a one-dimensional function:

f_{\bm{\theta}}(x)=\sum_{j=1}^{m}a_{j}\sigma(\bm{w}_{j}\cdot\bm{x})=\sum_{j=1}^{m}a_{j}\sigma(w_{j}x+b_{j}),

where $\bm{x}:=(x,1)^{\intercal}\in\mathbb{R}^{2}$, $\bm{w}_{j}:=(w_{j},b_{j})\in\mathbb{R}^{2}$, and $\sigma(x)=\mathrm{ReLU}(x)$. For simplicity, we set $m=2$ and suppose the network can perfectly fit a training data set of two data points generated by a target function $\sigma(\bm{w}^{*}\cdot\bm{x})$, denoted as $\bm{o}^{*}:=(\sigma(\bm{w}^{*}\cdot\bm{x}_{1}),\sigma(\bm{w}^{*}\cdot\bm{x}_{2}))$. We further assume $\bm{w}^{*}\cdot\bm{x}_{i}>0$, $i=1,2$. Denote the output of the $j$-th neuron over the samples as

\bm{o}_{j}=(a_{j}\sigma(\bm{w}_{j}\cdot\bm{x}_{1}),a_{j}\sigma(\bm{w}_{j}\cdot\bm{x}_{2})).

The network output should equal the target on the training data points after long enough training, i.e.,

\bm{o}^{*}=\bm{o}_{1}+\bm{o}_{2}.

There are infinitely many pairs of $\bm{o}_{1}$ and $\bm{o}_{2}$ that fit $\bm{o}^{*}$ well. However, the $R_{1}(\bm{\theta})$ term leads the training to a specific pair. Up to a positive constant factor, $R_{1}(\bm{\theta})$ can be written as

R_{1}(\bm{\theta})=\lVert\bm{o}_{1}\rVert^{2}+\lVert\bm{o}_{2}\rVert^{2},

and the components of $\bm{o}_{j}$ perpendicular to $\bm{o}^{*}$ need to cancel each other at the well-trained stage to minimize $R_{1}(\bm{\theta})$. As a result, $\bm{o}_{1}$ and $\bm{o}_{2}$ need to be parallel to $\bm{o}^{*}$, i.e., $\bm{w}_{1}\,//\,\bm{w}_{2}\,//\,\bm{w}^{*}$, which is the condensation phenomenon.
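This cancellation argument can be made explicit with a one-line computation. The following is a sketch under the constraint $\bm{o}_{1}+\bm{o}_{2}=\bm{o}^{*}$; the scalar $\alpha$ and the vector $\bm{v}$ are introduced here only for this illustration. Write $\bm{o}_{1}=\alpha\bm{o}^{*}+\bm{v}$ and $\bm{o}_{2}=(1-\alpha)\bm{o}^{*}-\bm{v}$ with $\bm{v}\perp\bm{o}^{*}$. Then

\lVert\bm{o}_{1}\rVert^{2}+\lVert\bm{o}_{2}\rVert^{2}=\big(\alpha^{2}+(1-\alpha)^{2}\big)\lVert\bm{o}^{*}\rVert^{2}+2\lVert\bm{v}\rVert^{2}\geq\frac{1}{2}\lVert\bm{o}^{*}\rVert^{2},

with equality if and only if $\bm{v}=\bm{0}$ and $\alpha=1/2$, i.e., $\bm{o}_{1}=\bm{o}_{2}=\bm{o}^{*}/2$, so both neuron outputs align with $\bm{o}^{*}$ and hence both input weights align with $\bm{w}^{*}$.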

In the following, we show that minimizing the $R_{1}(\bm{\theta})$ term can lead to condensation under several settings. We first give some definitions that capture the characteristics of ReLU neurons (also shown in Fig. 9).

Figure 9: Schematic diagram of some definitions associated with ReLU neurons.
Definition 2 (convexity change of ReLU NNs).

Consider a piecewise linear function $f(t)$, $t\in\mathbb{R}$, and its linear interval set $\{[t_{i},t_{i+1}]\}_{i=1}^{T}$. For any two intervals $[t_{i},t_{i+2}]$ and $[t_{i+1},t_{i+3}]$, $i\in[T-3]$, if $f$ is convex on one of the intervals and concave on the other, then we say that a convexity change exists.

Definition 3 (direction and intercept point of ReLU neurons).

For a one-dimensional ReLU neuron $a_{j}\sigma(w_{j}x+b_{j})$, its direction is defined as $\mathrm{sign}(w_{j})$, and its intercept point is defined as $x=-\frac{b_{j}}{w_{j}}$.

Drawing inspiration from the methodology employed to establish the regularization effect of label-noise SGD [28], we show that, under the setting of a two-layer ReLU NN and one-dimensional input data, the implicit bias of the $R_{1}(\bm{\theta})$ term corresponds to "simple" functions that satisfy two conditions: (i) they have the minimum number of convexity changes required to fit the training points, and (ii) if the intercept points of neurons are in the same inner interval and the neurons have the same direction, then their intercept points are identical.

Theorem 1 (the effect of $R_{1}(\bm{\theta})$ on facilitating condensation).

Consider the following two-layer ReLU NN,

f_{\bm{\theta}}(x)=\sum_{j=1}^{m}a_{j}\sigma(w_{j}x+b_{j})+ax+b,

trained with a one-dimensional dataset $S=\{(x_{i},y_{i})\}_{i=1}^{n}$, where $x_{1}<x_{2}<\cdots<x_{n}$. When the MSE on the training data satisfies $R_{S}(\bm{\theta})=0$, if either of the following two conditions holds:

(i) the number of convexity changes of the NN in $(x_{1},x_{n})$ can be reduced while keeping $R_{S}(\bm{\theta})=0$;

(ii) there exist two neurons with indexes $k_{1}\neq k_{2}$ such that they have the same sign, i.e., $\mathrm{sign}(w_{k_{1}})=\mathrm{sign}(w_{k_{2}})$, and different intercept points in the same interval, i.e., $-b_{k_{1}}/w_{k_{1}},-b_{k_{2}}/w_{k_{2}}\in[x_{i},x_{i+1}]$ and $-b_{k_{1}}/w_{k_{1}}\neq-b_{k_{2}}/w_{k_{2}}$ for some $i\in[2:n-1]$;

then there exist parameters $\bm{\theta}^{\prime}$, an infinitesimal perturbation of $\bm{\theta}$, such that

(i) $R_{S}(\bm{\theta}^{\prime})=0$;

(ii) $R_{1}(\bm{\theta}^{\prime})<R_{1}(\bm{\theta})$.

It should be noted that not all functions trained with dropout exhibit obvious condensation. For example, a network trained with dropout shows no condensation when the training set consists of only one data point. However, for general datasets, such as the examples shown in Fig. 1 and Fig. 4, NNs reach a condensed solution due to the constraints on the convexity changes and the intercept points (also illustrated in Fig. 10).

Figure 10: Schematic diagram of the effect of $R_{1}(\bm{\theta})$ on constraining the convexity changes and the intercept points.

Although the current study only demonstrates this result for ReLU NNs, we expect that for general activation functions, such as tanh, the $R_{1}(\bm{\theta})$ term also facilitates condensation, which is left for future work. This is also supported by the experimental results on tanh NNs above. Furthermore, we believe the linear term $ax+b$, used to ensure $R_{S}(\bm{\theta})=0$ in certain cases, is not a fundamental requirement. Our experiments confirm that neural networks without the linear term also exhibit the condensation phenomenon.

7 Implicit Regularization of Dropout on the Flatness of Solution

Understanding the mechanism by which dropout improves the generalization of NNs is of great interest and significance. In this section, we study the flatness of the minima found by dropout, inspired by the study of the effect of SGD on generalization [10]. Our primary focus is to study the effect of $R_{1}(\bm{\theta})$ and $R_{2}(\bm{\theta})$ on the flatness of the loss landscape and on network generalization.

7.1 Dropout Finds Flatter Minima

We first study the effect of dropout on model flatness and generalization. For a fair comparison of the flatness between different models, we employ the approach used in [29] as follows. To obtain a direction for a network with parameters $\bm{\theta}$, we begin by producing a random Gaussian direction vector $\bm{d}$ with dimensions compatible with $\bm{\theta}$. Then, we normalize each filter in $\bm{d}$ to have the same norm as the corresponding filter in $\bm{\theta}$. For FNNs, each layer can be regarded as a filter, and the normalization process is equivalent to normalizing the layer, while for convolutional neural networks (CNNs), each convolution kernel may have multiple filters, and each filter is normalized individually. Thus, we obtain a normalized direction vector $\bm{d}$ by replacing $\bm{d}_{i,j}$ with $\frac{\bm{d}_{i,j}}{\lVert\bm{d}_{i,j}\rVert}\lVert\bm{\theta}_{i,j}\rVert$, where $\bm{d}_{i,j}$ and $\bm{\theta}_{i,j}$ represent the $j$th filter of the $i$th layer of the random direction $\bm{d}$ and the network parameters $\bm{\theta}$, respectively. Here, $\lVert\cdot\rVert$ denotes the Frobenius norm. It is crucial to note that $j$ refers to the filter index. We use the function $L(\alpha)=R_{S}(\bm{\theta}+\alpha\bm{d})$ to characterize the loss landscape around the minima obtained with and without dropout layers.
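A minimal sketch of this filter-normalized perturbation, assuming a PyTorch model in which each weight tensor of a fully-connected layer is treated as one filter, is given below; it is meant to illustrate the $L(\alpha)=R_{S}(\bm{\theta}+\alpha\bm{d})$ curves of Fig. 1(b) and Fig. 11 rather than to reproduce the exact plotting code (for CNNs, the normalization would instead be applied per convolution filter).

```python
import copy
import torch

def filter_normalized_direction(model):
    """One random Gaussian direction, rescaled so that each parameter tensor
    (treated as a filter) has the same norm as the corresponding tensor of theta."""
    direction = []
    for p in model.parameters():
        d = torch.randn_like(p)
        direction.append(d / d.norm() * p.norm())   # match the Frobenius norm of p
    return direction

def loss_along_direction(model, direction, alphas, loss_fn, data, target):
    """Evaluate L(alpha) = R_S(theta + alpha * d) for a list of step sizes."""
    losses = []
    for alpha in alphas:
        perturbed = copy.deepcopy(model)
        with torch.no_grad():
            for p, d in zip(perturbed.parameters(), direction):
                p.add_(alpha * d)
            losses.append(loss_fn(perturbed(data), target).item())
    return losses

# usage sketch with a toy FNN and random data
model = torch.nn.Sequential(torch.nn.Linear(10, 100), torch.nn.ReLU(),
                            torch.nn.Linear(100, 1))
data, target = torch.randn(128, 10), torch.randn(128, 1)
d = filter_normalized_direction(model)
print(loss_along_direction(model, d, [-1.0, -0.5, 0.0, 0.5, 1.0],
                           torch.nn.MSELoss(), data, target))
```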

For all network structures shown in Fig. 11, dropout improves the generalization of the network and finds flatter minima. In Fig. 11(a, b), for both networks trained with and without dropout layers, the training loss values are all close to zero, but their flatness and generalization still differ. In Fig. 11(c, d), due to the complexity of the datasets, i.e., CIFAR-100 and Multi30k, and of the network structures, i.e., ResNet-20 and transformer, the networks do not achieve zero training error, but the ones with dropout find flatter minima with much better generalization. The accuracy of different network structures is shown in Table 1.

Table 1: The effect of dropout on model accuracy.
Structure Dataset With Dropout Without Dropout
FNN MNIST 98.7% 98.1%
VGG-9 CIFAR-10 60.6% 59.2%
ResNet-20 CIFAR-100 54.7% 34.1%
Transformer Multi30k 49.3% 34.7%
(a) flatness of FNN; (b) flatness of VGG-9; (c) flatness of ResNet-20; (d) flatness of transformer.
Figure 11: The 1D visualization of solutions of different network structures obtained with or without dropout layers. (a) The FNN is trained on the MNIST dataset. The test accuracy for the model with dropout layers is $98.7\%$, while it is $98.1\%$ for the model without dropout layers. (b) The VGG-9 network is trained on the CIFAR-10 dataset using the first 2048 examples as the training dataset. The test accuracy for the model with dropout layers is $60.6\%$, while it is $59.2\%$ for the model without dropout layers. (c) The ResNet-20 network is trained on the CIFAR-100 dataset using all examples as the training dataset. The test accuracy for the model with dropout layers is $54.7\%$, while it is $34.1\%$ for the model without dropout layers. (d) The transformer is trained on the Multi30k dataset using the first 2048 examples as the training dataset. The test accuracy for the model with dropout layers is $49.3\%$, while it is $34.7\%$ for the model without dropout layers.

7.2 The Effect of $R_{1}(\bm{\theta})$ on Flatness

In this subsection, we study the effect of $R_{1}(\bm{\theta})$ on flatness under the two-layer ReLU NN setting. Different from the flatness described above by loss interpolation, in this section we define the flatness of a minimum as the sum of the eigenvalues of the Hessian matrix $H$, i.e., $\mathrm{Tr}(H)$. Note that when $R_{S}(\bm{\theta})=0$, we have

\mathrm{Tr}(H)=\mathrm{Tr}\left(\frac{1}{n}\sum_{i=1}^{n}\nabla_{\bm{\theta}}f_{\bm{\theta}}(\bm{x}_{i})\nabla_{\bm{\theta}}^{\intercal}f_{\bm{\theta}}(\bm{x}_{i})\right)=\frac{1}{n}\sum_{i=1}^{n}\lVert\nabla_{\bm{\theta}}f_{\bm{\theta}}(\bm{x}_{i})\rVert_{2}^{2},

thus the definition of flatness above is equivalent to $\frac{1}{n}\sum_{i=1}^{n}\lVert\nabla_{\bm{\theta}}f_{\bm{\theta}}(\bm{x}_{i})\rVert_{2}^{2}$.
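This flatness measure only requires per-sample gradients of the network output, not the full Hessian. The sketch below, assuming a PyTorch model with scalar output evaluated at an interpolating minimum as in the derivation above, computes $\frac{1}{n}\sum_{i}\lVert\nabla_{\bm{\theta}}f_{\bm{\theta}}(\bm{x}_{i})\rVert_{2}^{2}$ as an estimate of $\mathrm{Tr}(H)$; the model and data are placeholders.

```python
import torch

def trace_hessian_proxy(model, X):
    """Flatness proxy Tr(H) = (1/n) * sum_i ||grad_theta f_theta(x_i)||^2,
    valid at interpolating minima (R_S(theta) = 0) for the MSE loss."""
    params = list(model.parameters())
    total = 0.0
    for x in X:                                   # loop over samples
        out = model(x.unsqueeze(0)).squeeze()     # scalar output f_theta(x_i)
        grads = torch.autograd.grad(out, params)
        total += sum((g ** 2).sum().item() for g in grads)
    return total / len(X)

# usage sketch
model = torch.nn.Sequential(torch.nn.Linear(5, 50), torch.nn.ReLU(),
                            torch.nn.Linear(50, 1))
X = torch.randn(32, 5)
print(trace_hessian_proxy(model, X))
```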

Theorem 2 (the effect of $R_{1}(\bm{\theta})$ on facilitating flatness).

Under Settings 1–3, consider a two-layer ReLU NN,

f_{\bm{\theta}}(x)=\sum_{j=1}^{m}a_{j}\sigma(\bm{w}_{j}\cdot\bm{x}),

trained with a dataset $S=\{(\bm{x}_{i},y_{i})\}_{i=1}^{n}$. Under gradient flow training with the loss function $R_{S}(\bm{\theta})+R_{1}(\bm{\theta})$, if $\bm{\theta}_{0}$ satisfies $R_{S}(\bm{\theta}_{0})=0$ and $\nabla_{\bm{\theta}}R_{1}(\bm{\theta}_{0})\neq 0$, we have

\frac{{\rm d}\left(\frac{1}{n}\sum_{i=1}^{n}\lVert\nabla_{\bm{\theta}}f_{\bm{\theta}_{0}}(\bm{x}_{i})\rVert_{2}^{2}\right)}{{\rm d}t}<0.

The regularization effect of $R_{2}(\bm{\theta})$ also has a positive effect on flatness by constraining the norm of the gradient. In the next subsection, we compare the effect of these two regularization terms on generalization and flatness.

7.3 Effect of Two Implicit Regularization Terms on Generalization and Flatness

Although the modified gradient flow is noise-free during training, the model trained with the modified gradient flow can also find a flat minimum that generalizes well, due to the effect of $R_{1}(\bm{\theta})$ and $R_{2}(\bm{\theta})$. However, the magnitude of their impact on flatness is not yet fully understood. In this subsection, we study the effect of each regularization term by training networks with the following four loss functions:

L_{1}(\bm{\theta}):=R_{S}(\bm{\theta})+R_{1}(\bm{\theta})=R_{S}(\bm{\theta})+\frac{1-p}{2np}\sum_{i=1}^{n}\sum_{j=1}^{m_{L-1}}\|\bm{W}^{[L]}_{j}f^{[L-1]}_{\bm{\theta},j}(\bm{x}_{i})\|^{2}, (6)

L_{2}(\bm{\theta},\bm{\eta}):=R_{S}(\bm{\theta})+\tilde{R}_{2}(\bm{\theta},\bm{\eta})=R_{S}(\bm{\theta})+\frac{\varepsilon}{4}\left\|\nabla_{\bm{\theta}}R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})\right\|^{2},

L_{3}(\bm{\theta},\bm{\eta}):=R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})-\tilde{R}_{2}(\bm{\theta},\bm{\eta})=R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})-\frac{\varepsilon}{4}\left\|\nabla_{\bm{\theta}}R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})\right\|^{2},

L_{4}(\bm{\theta},\bm{\eta}):=R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})-R_{1}(\bm{\theta})=R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})-\frac{1-p}{2np}\sum_{i=1}^{n}\sum_{j=1}^{m_{L-1}}\|\bm{W}^{[L]}_{j}f^{[L-1]}_{\bm{\theta},j}(\bm{x}_{i})\|^{2},

where $\tilde{R}_{2}(\bm{\theta},\bm{\eta})$ is defined as $(\varepsilon/4)\left\|\nabla_{\bm{\theta}}R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})\right\|^{2}$ for convenience, and we have $\mathbb{E}_{\bm{\eta}}\tilde{R}_{2}(\bm{\theta},\bm{\eta})=R_{2}(\bm{\theta})$. For each $L_{i}$, $i\in[4]$, we explicitly add or subtract the penalty term of either $R_{1}(\bm{\theta})$ or $\tilde{R}_{2}(\bm{\theta},\bm{\eta})$ to study their effect on dropout regularization. Therefore, $L_{1}(\bm{\theta})$ and $L_{3}(\bm{\theta},\bm{\eta})$ are used to study the effect of $R_{1}(\bm{\theta})$, while $L_{2}(\bm{\theta},\bm{\eta})$ and $L_{4}(\bm{\theta},\bm{\eta})$ are for $R_{2}(\bm{\theta})$.

We first study the effect of the two regularization terms on the generalization of NNs. As shown in Fig. 12, we compare the test accuracy obtained by training with the above four distinct loss functions under different dropout rates and utilize the results of $R_{S}(\bm{\theta})$ and $R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})$ as reference benchmarks. Two different learning rates are considered, with the solid and dashed lines corresponding to $\varepsilon=0.05$ and $\varepsilon=0.005$, respectively. As shown in Fig. 12(a), both approaches show that training with the $R_{1}(\bm{\theta})$ regularization term finds a solution with almost the same test accuracy as training with dropout. For $\tilde{R}_{2}(\bm{\theta},\bm{\eta})$, as shown in Fig. 12(b), its effect only marginally improves the generalization ability of full-batch gradient descent training in comparison to the utilization of $R_{1}(\bm{\theta})$.

Figure 12: The classification task on the MNIST dataset (the first 1000 images) using the FNN with size $784$-$1000$-$10$. The test accuracy obtained by training with $L_{1}(\bm{\theta}),\ldots,L_{4}(\bm{\theta},\bm{\eta})$ and $R_{S}(\bm{\theta})$, $R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})$ under different dropout rates and learning rates. The solid line represents the test accuracy of the network under a large learning rate ($\varepsilon=0.05$), and the dashed line represents the test accuracy of the network under a small learning rate ($\varepsilon=0.005$).

Then we study the effect of the two regularization terms on flatness. To this end, we show a one-dimensional cross-section of the loss $R_{S}(\bm{\theta})$ by interpolating between two minima found by training with two different loss functions. For either $R_{1}(\bm{\theta})$ or $\tilde{R}_{2}(\bm{\theta},\bm{\eta})$, we use addition or subtraction to study its effect. As shown in Fig. 13(a), for $R_{1}(\bm{\theta})$, the loss value of the interpolation between the minima found by the addition approach ($L_{1}$) and the subtraction approach ($L_{3}$) stays near zero, and the same holds for $\tilde{R}_{2}(\bm{\theta},\bm{\eta})$ in Fig. 13(b), showing that the higher-order terms of the learning rate $\varepsilon$ in the modified equation have less influence on the training process. We then compare the flatness of the minima found by training with $R_{1}(\bm{\theta})$ and with $\tilde{R}_{2}(\bm{\theta},\bm{\eta})$, as illustrated in Fig. 13(c-f). The results indicate that the minima obtained by training with $R_{1}(\bm{\theta})$ exhibit greater flatness than those obtained by training with $\tilde{R}_{2}(\bm{\theta},\bm{\eta})$.
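The cross-sections in Fig. 13 are plain linear interpolations in parameter space. A minimal sketch, assuming two trained PyTorch models with identical architecture (one from each loss function) and random data standing in for the training set, is:

```python
import copy
import torch

def interpolation_losses(model_a, model_b, alphas, loss_fn, data, target):
    """Evaluate R_S at (1 - alpha) * theta_A + alpha * theta_B for each alpha."""
    losses = []
    for alpha in alphas:
        mixed = copy.deepcopy(model_a)
        with torch.no_grad():
            for p, pa, pb in zip(mixed.parameters(),
                                 model_a.parameters(), model_b.parameters()):
                p.copy_((1 - alpha) * pa + alpha * pb)
            losses.append(loss_fn(mixed(data), target).item())
    return losses

# usage sketch: alpha = 0 recovers model_a, alpha = 1 recovers model_b
def make_model():
    return torch.nn.Sequential(torch.nn.Linear(10, 100), torch.nn.ReLU(),
                               torch.nn.Linear(100, 1))

model_a, model_b = make_model(), make_model()
data, target = torch.randn(128, 10), torch.randn(128, 1)
print(interpolation_losses(model_a, model_b, [0.0, 0.25, 0.5, 0.75, 1.0],
                           torch.nn.MSELoss(), data, target))
```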

The experiments in this section show that, compared with SGD, the unique implicit regularization of dropout, $R_{1}(\bm{\theta})$, plays a significant role in improving generalization and finding flat minima.

(a) $L_{1}\,\&\,L_{3}$; (b) $L_{2}\,\&\,L_{4}$; (c) $L_{1}\,\&\,L_{2}$; (d) $L_{1}\,\&\,L_{4}$; (e) $L_{2}\,\&\,L_{3}$; (f) $L_{3}\,\&\,L_{4}$.
Figure 13: The classification task on the MNIST dataset (the first 1000 images) using the FNN with size $784$-$1000$-$10$. The $R_{S}(\bm{\theta})$ value for the interpolation between models with interpolation factor $\alpha$. For $L_{i}\,\&\,L_{j}$, one trained model is at $\alpha=0$ (trained with loss function $L_{i}$), and the other is at $\alpha=1$ (trained with loss function $L_{j}$). Different curves represent different dropout rates used for training.

8 Conclusion and Discussion

In this work, we theoretically study the implicit regularization of dropout and its role in improving the generalization performance of neural networks. Specifically, we derive two implicit regularization terms, $R_{1}(\bm{\theta})$ and $R_{2}(\bm{\theta})$, and validate their efficacy through numerical experiments. One important finding of this work is that the implicit regularization term $R_{1}(\bm{\theta})$, which is unique to dropout and absent in SGD, is a key factor in improving the generalization and flatness of the dropout solution. We also find that $R_{1}(\bm{\theta})$ can facilitate weight condensation during training, which may establish a link among weight condensation, flatness, and generalization for further study. This work reveals rich and unique properties of dropout, which are fundamental to a comprehensive understanding of dropout.

Our study also sheds light on the broader issue of simplicity bias in deep learning. We observed that dropout regularization tends to impose a bias toward simple solutions during training, as evidenced by the weight condensation and flatness effects. This is consistent with other perspectives on simplicity bias in deep learning, such as the frequency principle [30, 31, 32, 33, 34], which reveals that neural networks often learn data from low to high frequency. Our analysis of dropout regularization provides a detailed understanding of how simplicity bias works in practice, which is essential for understanding why over-parameterized neural networks can fit the training data well and generalize effectively to new data.

Finally, our work highlights the potential benefits of dropout regularization in training neural networks, particularly in the linear regime. As we have shown, dropout regularization can induce weight condensation and avoid the slow training speed often encountered in highly nonlinear networks due to the fact that the training trajectory is close to the stationary point [22, 23]. This may have important implications for the development of more efficient and effective deep learning algorithms.

Acknowledgments

This work is sponsored by the National Key R&D Program of China Grant No. 2022YFA1008200, the Shanghai Sailing Program, the Natural Science Foundation of Shanghai Grant No. 20ZR1429000, the National Natural Science Foundation of China Grant No. 62002221, Shanghai Municipal of Science and Technology Major Project No. 2021SHZDZX0102, and the HPC of School of Mathematical Sciences and the Student Innovation Center, and the Siyuan-1 cluster supported by the Center for High Performance Computing at Shanghai Jiao Tong University.

References

  • [1] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
  • [2] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
  • [3] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019.
  • [4] David P Helmbold and Philip M Long. On the inductive bias of dropout. The Journal of Machine Learning Research, 16(1):3403–3454, 2015.
  • [5] Ernst Hairer, Christian Lubich, and Gerhard Wanner. Geometric numerical integration illustrated by the störmer–verlet method. Acta numerica, 12:399–450, 2003.
  • [6] Tao Luo, Zhi-Qin John Xu, Zheng Ma, and Yaoyu Zhang. Phase diagram for two-layer relu neural networks at infinite-width limit. Journal of Machine Learning Research, 22(71):1–47, 2021.
  • [7] Hanxu Zhou, Qixuan Zhou, Tao Luo, Yaoyu Zhang, and Zhi-Qin John Xu. Towards understanding the condensation of neural networks at initial training. arXiv preprint arXiv:2105.11686, 2021.
  • [8] Hanxu Zhou, Qixuan Zhou, Zhenyuan Jin, Tao Luo, Yaoyu Zhang, and Zhi-Qin John Xu. Empirical phase diagram for three-layer neural networks with infinite width. Advances in Neural Information Processing Systems, 2022.
  • [9] Arthur Jacot, Clément Hongler, and Franck Gabriel. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8580–8589, 2018.
  • [10] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
  • [11] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning. arXiv preprint arXiv:1706.08947, 2017.
  • [12] Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects. arXiv preprint arXiv:1803.00195, 2018.
  • [13] David McAllester. A pac-bayesian tutorial with a dropout bound. arXiv preprint arXiv:1307.2118, 2013.
  • [14] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Lecun, and Rob Fergus. Regularization of neural networks using dropconnect. In Proceedings of the International Conference on Machine Learning. Citeseer, 2013.
  • [15] Wenlong Mou, Yuchen Zhou, Jun Gao, and Liwei Wang. Dropout training, data-dependent regularization, and generalization bounds. In International conference on machine learning, pages 3645–3653. PMLR, 2018.
  • [16] Stefan Wager, Sida Wang, and Percy S Liang. Dropout training as adaptive regularization. Advances in neural information processing systems, 26:351–359, 2013.
  • [17] Poorya Mianjy, Raman Arora, and Rene Vidal. On the implicit bias of dropout. In International Conference on Machine Learning, pages 3540–3548. PMLR, 2018.
  • [18] David Barrett and Benoit Dherin. Implicit gradient regularization. In International Conference on Learning Representations, 2020.
  • [19] Samuel L Smith, Benoit Dherin, David Barrett, and Soham De. On the origin of implicit regularization in stochastic gradient descent. In International Conference on Learning Representations, 2020.
  • [20] Lei Wu, Chao Ma, and Weinan E. How sgd selects the global minima in over-parameterized learning: A dynamical stability perspective. Advances in Neural Information Processing Systems, 31, 2018.
  • [21] Sho Yaida. Fluctuation-dissipation relations for stochastic gradient descent. In International Conference on Learning Representations, 2018.
  • [22] Yaoyu Zhang, Zhongwang Zhang, Tao Luo, and Zhiqin J Xu. Embedding principle of loss landscape of deep neural networks. Advances in Neural Information Processing Systems, 34:14848–14859, 2021.
  • [23] Yaoyu Zhang, Yuqing Li, Zhongwang Zhang, Tao Luo, and Zhi-Qin John Xu. Embedding principle: a hierarchical structure of loss landscape of deep neural networks. Journal of Machine Learning, 1:1–45, 2022.
  • [24] Hartmut Maennel, Olivier Bousquet, and Sylvain Gelly. Gradient descent quantizes relu network features. arXiv preprint arXiv:1803.08367, 2018.
  • [25] Franco Pellegrini and Giulio Biroli. An analytic theory of shallow networks dynamics for hinge loss classification. Advances in Neural Information Processing Systems, 33, 2020.
  • [26] Zhiwei Bai, Tao Luo, Zhi-Qin John Xu, and Yaoyu Zhang. Embedding principle in depth for the loss landscape analysis of deep neural networks. arXiv preprint arXiv:2205.13283, 2022.
  • [27] Yaoyu Zhang, Zhongwang Zhang, Leyang Zhang, Zhiwei Bai, Tao Luo, and Zhi-Qin John Xu. Linear stability hypothesis and rank stratification for nonlinear models. arXiv preprint arXiv:2211.11623, 2022.
  • [28] Guy Blanc, Neha Gupta, Gregory Valiant, and Paul Valiant. Implicit regularization for deep neural networks driven by an ornstein-uhlenbeck like process. In Conference on learning theory, pages 483–513. PMLR, 2020.
  • [29] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. arXiv preprint arXiv:1712.09913, 2017.
  • [30] Zhi-Qin John Xu, Yaoyu Zhang, and Yanyang Xiao. Training behavior of deep neural network in frequency domain. In International Conference on Neural Information Processing, pages 264–274. Springer, 2019.
  • [31] Zhi-Qin John Xu, Yaoyu Zhang, Tao Luo, Yanyang Xiao, and Zheng Ma. Frequency principle: Fourier analysis sheds light on deep neural networks. Communications in Computational Physics, 28(5):1746–1767, 2020.
  • [32] Yaoyu Zhang, Tao Luo, Zheng Ma, and Zhi-Qin John Xu. A linear frequency principle model to understand the absence of overfitting in neural networks. Chinese Physics Letters, 38(3):038701, 2021.
  • [33] Tao Luo, Zheng Ma, Zhi-Qin John Xu, and Yaoyu Zhang. Theory of the frequency principle for general deep neural networks. CSIAM Transactions on Applied Mathematics, 2(3):484–507, 2021.
  • [34] Zhi-Qin John Xu, Yaoyu Zhang, and Tao Luo. Overview frequency principle/spectral bias in deep learning. arXiv preprint arXiv:2201.07395, 2022.
  • [35] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [36] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR (Poster), 2015.
  • [37] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.

Appendix A Experimental Setups

For Fig. 1, Fig. 15, Fig. 16 and Fig. 17, we use the ReLU FNN with a hidden-layer width of 1000 to fit the following target function,

f(x)=\frac{1}{2}\sigma\left(-x-\frac{1}{3}\right)+\frac{1}{2}\sigma\left(x-\frac{1}{3}\right),

where $\sigma(x)=\mathrm{ReLU}(x)$. For Fig. 16, we train the network using Adam with a learning rate of $1\times 10^{-4}$ and a batch size of $2$. For Fig. 15, we add a dropout layer behind the hidden layer with $p=0.9$ for the two-layer experiments, and dropout layers between the two hidden layers and behind the last hidden layer with $p=0.9$ for the three-layer experiments. We train the network using Adam with a learning rate of $1\times 10^{-4}$. We initialize the parameters in the linear regime, $\bm{\theta}\sim N\left(0,\frac{1}{m^{0.2}}\right)$, where $m=1000$ is the width of the hidden layer.
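The following is a minimal PyTorch sketch of this two-layer ReLU setup, assuming a small synthetic one-dimensional training set; the sampling range, training-set size, and number of steps are illustrative choices rather than the original configuration. Note that `torch.nn.Dropout` takes the drop probability, i.e., $1-p$ in our notation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def target(x):
    # f(x) = 0.5 * relu(-x - 1/3) + 0.5 * relu(x - 1/3)
    return 0.5 * torch.relu(-x - 1 / 3) + 0.5 * torch.relu(x - 1 / 3)

m = 1000  # hidden-layer width
model = nn.Sequential(
    nn.Linear(1, m),
    nn.ReLU(),
    nn.Dropout(p=0.1),  # keep probability p = 0.9 in the paper's notation
    nn.Linear(m, 1),
)

# linear-regime initialization: theta ~ N(0, 1/m^0.2), i.e. std = m^{-0.1}
with torch.no_grad():
    for param in model.parameters():
        param.normal_(0.0, m ** -0.1)

x = torch.linspace(-1, 1, 40).unsqueeze(1)  # illustrative 1D training inputs
y = target(x)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for step in range(10000):
    optimizer.zero_grad()
    loss = 0.5 * ((model(x) - y) ** 2).mean()  # MSE, matching R_S
    loss.backward()
    optimizer.step()
```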

For Fig. 2, Fig. 12, and Fig. 13, we use the FNN with size 784-1000-10 to classify the MNIST dataset (the first 1000 images). We add a dropout layer behind the hidden layer with different dropout rates. We train the network using GD with a learning rate of $5\times 10^{-3}$.

For Fig. 3 and Fig. 14(b), we use VGG-9 [35] with and without dropout layers to classify CIFAR-10 using GD or SGD. For experiments with dropout, we add dropout layers after the pooling layers with a dropout rate of 0.5. For the experiments shown in Fig. 3, we use only the first 1000 images for training to reduce the computational burden.

For Fig. 4, Fig. 5, Fig. 6 and Fig. 8, we use the tanh FNN with a hidden-layer width of 1000 to fit the following target function,

f(x)=\sigma(x-6)+\sigma(x+6),

where $\sigma(x)=\mathrm{tanh}(x)$. We add a dropout layer behind the hidden layer with $p=0.9$ for the two-layer experiments, and dropout layers between the two hidden layers and behind the last hidden layer with $p=0.9$ for the three-layer experiments. We train the network using Adam with a learning rate of $1\times 10^{-4}$. We initialize the parameters in the linear regime, $\bm{\theta}\sim N\left(0,\frac{1}{m^{0.2}}\right)$, where $m=1000$ is the width of the hidden layer. For Fig. 6, we train the network using Adam with a batch size of $2$. For Fig. 8, each test error is averaged over $10$ trials with random initialization.

For Fig. 7(a,b), we use the FNN with size 10-100-1 to fit the target function,

f(x)=\sigma\left(\sum_{i=1}^{10}x_{i}\right),

where $\sigma(x)=\mathrm{tanh}(x)$, $\bm{x}\in\mathbb{R}^{10}$, $x_{i}$ is the $i$th component of $\bm{x}$, and the training set size is $30$. We add a dropout layer behind the hidden layer with $p=0.5$. We train the network using Adam with a learning rate of $0.001$.

For Fig. 7(c), we use ResNet-18 to classify CIFAR-10 with and without dropout. We add dropout layers behind the last activation function of each block with $p=0.5$. We train the network using Adam with a learning rate of $0.001$ and a batch size of $128$.

For Fig. 11(a), we use the FNN with size 784-1024-1024-10. We add dropout layers behind the first and second hidden layers with $p=0.8$ and $p=0.5$, respectively. We train the network using the default Adam optimizer [36] with a learning rate of $1\times 10^{-4}$.

For Fig. 11(b), we use VGG-9 to compare the loss landscape flatness with and without dropout layers. For experiments with dropout, we add dropout layers after the pooling layers with $p=0.8$. Models are trained using full-batch GD with Nesterov momentum on the first 2048 images for 300 epochs. The learning rate is initialized at 0.1 and divided by 10 at epochs 150, 225, and 275.

For Fig. 11(c), we use ResNet-20 [37] to compare the loss landscape flatness with and without dropout layers. For experiments with dropout, we add dropout layers after the convolutional layers with $p=0.8$. We only consider the parameter matrix corresponding to the weight of the first convolutional layer of the first block of the ResNet-20. Models are trained using full-batch GD with a training set size of 50000 for 1200 epochs. The learning rate is initialized at 0.01.

For Fig. 11(d), we use a transformer [38] with $d_{\mathrm{model}}=50$, $d_{k}=d_{v}=20$, $d_{\mathrm{ff}}=256$, $h=4$, $N=3$; the meaning of the parameters is consistent with the original paper. For experiments with dropout layers, we apply dropout to the output of each sub-layer before it is added to the sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. We set $p=0.9$ for the dropout layers. For the English-German translation problem, we use the cross-entropy loss with label smoothing, trained by full-batch Adam on the Multi30k dataset. The learning rate schedule is the same as that in [38], with 4000 warm-up epochs and 10000 training epochs. We use only the first 2048 examples for training to reduce the computational burden.
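A minimal sketch of the sub-layer dropout placement described above, assuming a post-norm residual connection as in [38]; the class and argument names are illustrative rather than the original implementation.

```python
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Residual connection in which dropout is applied to the sub-layer output
    before it is added to the sub-layer input and normalized (post-norm)."""

    def __init__(self, d_model=50, keep_prob=0.9):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        # torch's Dropout argument is the drop probability, i.e. 1 - p in our notation
        self.dropout = nn.Dropout(p=1 - keep_prob)

    def forward(self, x, sublayer):
        return self.norm(x + self.dropout(sublayer(x)))
```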

For Fig. 14(a), we classify the Fashion-MNIST dataset by training four-layer NNs of width 4096 with a training batch size of 64. We add dropout layers behind the hidden layers with different dropout rates. The learning rate is $5\times 10^{-3}$.

Appendix B Proofs for Main Paper

B.1 Proof for Lemma 1

Lemma (the expectation of dropout loss).

Given an $L$-layer FNN with dropout $\bm{f}_{\bm{\theta},\bm{\eta}}^{\mathrm{drop}}(\bm{x})$, under Settings 1–3, the expectation of the dropout MSE satisfies:

\mathbb{E}_{\bm{\eta}}\left(R_{S}^{\mathrm{drop}}\left(\bm{\theta},\bm{\eta}\right)\right)=R_{S}\left(\bm{\theta}\right)+R_{1}(\bm{\theta}).
Proof.

With the definition of MSE, we have,

R_{S}^{\mathrm{drop}}\left(\bm{\theta},\bm{\eta}\right)=\frac{1}{2n}\sum_{i=1}^{n}\left(\bm{f}_{\bm{\theta},\bm{\eta}}^{\mathrm{drop}}(\bm{x}_{i})-\bm{y}_{i}\right)^{2}
=\frac{1}{2n}\sum_{i=1}^{n}\left(\bm{f}_{\bm{\theta}}(\bm{x}_{i})-\bm{y}_{i}+\sum_{j=1}^{m_{L-1}}\bm{W}^{[L]}_{j}(\bm{\eta})_{j}f^{[L-1]}_{\bm{\theta},j}(\bm{x}_{i})\right)^{2}.

With the definition of $\bm{\eta}$, we have $\mathbb{E}(\bm{\eta})=\bm{0}$, thus,

\mathbb{E}_{\bm{\eta}}R_{S}^{\mathrm{drop}}\left(\bm{\theta},\bm{\eta}\right)=R_{S}\left(\bm{\theta}\right)+\frac{1}{2n}\mathbb{E}_{\bm{\eta}}\sum_{i=1}^{n}\left(\sum_{j=1}^{m_{L-1}}\bm{W}^{[L]}_{j}(\bm{\eta})_{j}f^{[L-1]}_{\bm{\theta},j}(\bm{x}_{i})\right)^{2}.

At the same time, we have $\mathbb{E}((\bm{\eta})_{k}(\bm{\eta})_{j})=0$ for $k\neq j$ and $\mathbb{E}((\bm{\eta})_{k}^{2})=\frac{1-p}{p}$. Thus we have,

\mathbb{E}_{\bm{\eta}}R_{S}^{\mathrm{drop}}\left(\bm{\theta},\bm{\eta}\right)=R_{S}\left(\bm{\theta}\right)+\frac{1-p}{2np}\sum_{i=1}^{n}\sum_{j=1}^{m_{L-1}}\|\bm{W}^{[L]}_{j}f^{[L-1]}_{\bm{\theta},j}(\bm{x}_{i})\|^{2}.
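This identity can be checked numerically. The following sketch averages the dropout MSE over many sampled masks for a toy two-layer network and compares it with $R_{S}(\bm{\theta})+R_{1}(\bm{\theta})$; the network sizes and data are arbitrary illustrations, not the settings used in our experiments.

```python
import torch

torch.manual_seed(0)
n, d, m, p = 20, 5, 50, 0.9  # samples, input dim, hidden width, keep probability

X, Y = torch.randn(n, d), torch.randn(n, 1)
W1, b1 = torch.randn(m, d) / d ** 0.5, torch.zeros(m)
W2 = torch.randn(1, m) / m ** 0.5

hidden = torch.relu(X @ W1.T + b1)  # f^{[L-1]}_j(x_i), shape (n, m)
out = hidden @ W2.T                 # f_theta(x_i)

R_S = 0.5 * ((out - Y) ** 2).mean()
R_1 = (1 - p) / (2 * n * p) * ((hidden * W2) ** 2).sum()

# Monte Carlo estimate of E_eta[R_S^drop]: each hidden output is multiplied by a
# Bernoulli(p)/p mask, i.e. by 1 + eta with E[eta] = 0 and E[eta^2] = (1 - p)/p
trials, acc = 20000, 0.0
for _ in range(trials):
    mask = torch.bernoulli(torch.full((m,), p)) / p
    acc += 0.5 * (((hidden * mask) @ W2.T - Y) ** 2).mean()

print(float(acc / trials), float(R_S + R_1))  # the two values should be close
```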

B.2 Proof for the modified gradient flow of dropout

Modified gradient flow of dropout. Under Settings 1–3, the mean iterate of $\bm{\theta}$, with a learning rate $\varepsilon\ll 1$, stays close to the path of the gradient flow $\dot{\bm{\theta}}=-\nabla_{\bm{\theta}}\tilde{R}_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})$ on a modified loss $\tilde{R}_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})$, which satisfies:

\mathbb{E}_{\bm{\eta}}\tilde{R}_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})=R_{S}\left(\bm{\theta}\right)+R_{1}(\bm{\theta})+R_{2}(\bm{\theta})+O(\varepsilon^{2}).
Proof.

We assume the ODE of GD with dropout has the form $\dot{\bm{\theta}}=f(\bm{\theta})$, and introduce a modified gradient flow $\dot{\bm{\theta}}=\tilde{f}(\bm{\theta})$, where,

\tilde{f}(\bm{\theta})=f(\bm{\theta})+\varepsilon f_{1}(\bm{\theta})+\varepsilon^{2}f_{2}(\bm{\theta})+O(\varepsilon^{3}).

Thus, by Taylor expansion, for small but finite $\varepsilon$, we have,

\bm{\theta}(t+\varepsilon)=\bm{\theta}(t)+\varepsilon\tilde{f}(\bm{\theta}(t))+\frac{\varepsilon^{2}}{2}\nabla_{\bm{\theta}}\tilde{f}(\bm{\theta}(t))\tilde{f}(\bm{\theta}(t))+O\left(\varepsilon^{3}\right)
=\bm{\theta}(t)+\varepsilon f(\bm{\theta}(t))+\varepsilon^{2}\left(f_{1}(\bm{\theta}(t))+\frac{1}{2}\nabla f(\bm{\theta}(t))f(\bm{\theta}(t))\right)+O\left(\varepsilon^{3}\right).

For GD, we have,

\bm{\theta}_{t+1}=\bm{\theta}_{t}+\varepsilon f(\bm{\theta}_{t}).

Combining $\bm{\theta}(t+\varepsilon)=\bm{\theta}_{t+1}$ and $\bm{\theta}(t)=\bm{\theta}_{t}$, we have,

f(\bm{\theta})=-\nabla_{\bm{\theta}}R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta}),\qquad
f_{1}(\bm{\theta})=-\frac{1}{2}\nabla_{\bm{\theta}}f(\bm{\theta}(t))f(\bm{\theta}(t))=-\frac{1}{4}\nabla_{\bm{\theta}}\|f(\bm{\theta}(t))\|^{2}=-\frac{1}{4}\nabla_{\bm{\theta}}\|\nabla_{\bm{\theta}}R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})\|^{2}.

Thus, with the random variable $\bm{\eta}$,

\tilde{f}(\bm{\theta})=-\nabla_{\bm{\theta}}R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})-\frac{\varepsilon}{4}\nabla_{\bm{\theta}}\|\nabla_{\bm{\theta}}R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})\|^{2}+O(\varepsilon^{2}).

Taking the average over $\bm{\eta}$ and combining Lemma 1, we have,

\dot{\bm{\theta}}=\mathbb{E}_{\bm{\eta}}\left(-\nabla_{\bm{\theta}}R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})-\frac{\varepsilon}{4}\nabla_{\bm{\theta}}\|\nabla_{\bm{\theta}}R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})\|^{2}+O(\varepsilon^{2})\right)
=-\nabla_{\bm{\theta}}\mathbb{E}_{\bm{\eta}}\left(R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})+\frac{\varepsilon}{4}\|\nabla_{\bm{\theta}}R_{S}^{\mathrm{drop}}(\bm{\theta},\bm{\eta})\|^{2}+O(\varepsilon^{2})\right)
=-\nabla_{\bm{\theta}}\Big(R_{S}\left(\bm{\theta}\right)+\frac{1-p}{2np}\sum_{i=1}^{n}\sum_{j=1}^{m_{L-1}}\|\bm{W}^{[L]}_{j}f^{[L-1]}_{\bm{\theta},j}(\bm{x}_{i})\|^{2}+\frac{\varepsilon}{4}\mathbb{E}_{\bm{\eta}}\|\nabla_{\bm{\theta}}R_{S}^{\mathrm{drop}}\left(\bm{\theta},\bm{\eta}\right)\|^{2}+O(\varepsilon^{2})\Big).

We have,

\tilde{R}_{S}^{\mathrm{drop}}(\bm{\theta})=R_{S}\left(\bm{\theta}\right)+\frac{1-p}{2np}\sum_{i=1}^{n}\sum_{j=1}^{m_{L-1}}\|\bm{W}^{[L]}_{j}f^{[L-1]}_{\bm{\theta},j}(\bm{x}_{i})\|^{2}+\frac{\varepsilon}{4}\mathbb{E}_{\bm{\eta}}\|\nabla_{\bm{\theta}}R_{S}^{\mathrm{drop}}\left(\bm{\theta},\bm{\eta}\right)\|^{2}+O(\varepsilon^{2}).
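The term $\frac{\varepsilon}{4}\mathbb{E}_{\bm{\eta}}\|\nabla_{\bm{\theta}}R_{S}^{\mathrm{drop}}\left(\bm{\theta},\bm{\eta}\right)\|^{2}$ can also be imposed explicitly, which is how the comparison in Fig. 14 is set up. Below is a hedged PyTorch sketch of one such training step for a generic `model`, `loss_fn`, and data batch; `lam` plays the role of the coefficient $\lambda$, and the function is an illustration rather than our actual training code.

```python
import torch

def step_with_gradient_penalty(model, loss_fn, x, y, optimizer, lam):
    """One update on R_S^drop + (lam / 4) * ||grad R_S^drop||^2 for one sampled dropout mask."""
    model.train()  # dropout masks are sampled in the forward pass
    optimizer.zero_grad()

    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
    grad_norm_sq = sum((g ** 2).sum() for g in grads)

    total = loss + 0.25 * lam * grad_norm_sq  # the backward pass also differentiates the penalty
    total.backward()
    optimizer.step()
    return loss.item(), grad_norm_sq.item()
```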

B.3 Proof for Theorem 1

Theorem (the effect of $R_{1}(\bm{\theta})$ on facilitating condensation).

Consider the following two-layer ReLU NN,

f_{\bm{\theta}}(x)=\sum_{j=1}^{m}a_{j}\sigma(w_{j}x+b_{j})+ax+b,

trained with a one-dimensional dataset $S=\{(x_{i},y_{i})\}_{i=1}^{n}$, where $x_{1}<x_{2}<\cdots<x_{n}$. When the training MSE satisfies $R_{S}(\bm{\theta})=0$, if either of the following two conditions holds:

(i) the number of convexity changes of the NN in $(x_{1},x_{n})$ can be reduced while keeping $R_{S}(\bm{\theta})=0$;

(ii) there exist two neurons with indices $k_{1}\neq k_{2}$ such that they have the same sign, i.e., $\mathrm{sign}(w_{k_{1}})=\mathrm{sign}(w_{k_{2}})$, and different intercept points in the same interval, i.e., $-b_{k_{1}}/w_{k_{1}},-b_{k_{2}}/w_{k_{2}}\in[x_{i},x_{i+1}]$ and $-b_{k_{1}}/w_{k_{1}}\neq-b_{k_{2}}/w_{k_{2}}$ for some $i\in[2:n-1]$;
then there exist parameters $\bm{\theta}^{\prime}$, an infinitesimal perturbation of $\bm{\theta}$, such that

(i) $R_{S}(\bm{\theta}^{\prime})=0$;

(ii) $R_{1}(\bm{\theta}^{\prime})<R_{1}(\bm{\theta})$.

Proof.

The implicit regularization term $R_{1}(\bm{\theta})$ can be expressed as follows:

R_{1}(\bm{\theta})=\frac{1-p}{2np}\sum_{i=1}^{n}\sum_{j=1}^{m}(a_{j}\sigma(w_{j}x_{i}+b_{j}))^{2},

where $m$ is the width of the NN and $\sigma(x)=\mathrm{ReLU}(x)$. It is worth noting that the linear term $ax+b$ is absorbed into $R_{S}(\bm{\theta})$ and does not enter $R_{1}(\bm{\theta})$, so it does not affect its value.
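For concreteness, $R_{1}(\bm{\theta})$ for this one-dimensional two-layer ReLU parameterization can be evaluated as in the following sketch; the parameter values are random placeholders used only to illustrate the computation.

```python
import torch

def r1_term(a, w, b, x, p=0.9):
    """R_1(theta) = (1 - p) / (2 n p) * sum_i sum_j (a_j * relu(w_j * x_i + b_j))^2."""
    n = x.shape[0]
    contrib = a[None, :] * torch.relu(x[:, None] * w[None, :] + b[None, :])  # shape (n, m)
    return (1 - p) / (2 * n * p) * (contrib ** 2).sum()

m, n = 1000, 40
a, w, b = torch.randn(m), torch.randn(m), torch.randn(m)
x = torch.linspace(-1, 1, n)
print(float(r1_term(a, w, b, x)))
```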

First, we outline the proof, which is inspired by [28]. However, [28] studies the setting of SGD with noise and only considers the limitation of convexity changes. For the first part, i.e., the limitation of the number of convexity changes, the proof outline is as follows. For any set of consecutive data points $\{(x_{i},y_{i}),(x_{i+1},y_{i+1}),(x_{i+2},y_{i+2})\}$, we assume the linear interpolation of the three data points is convex. If $f_{\bm{\theta}}(x)$ is not convex in $(x_{i},x_{i+2})$, there exist two neurons, one with a positive output-layer weight and the other with a negative one, whose intercept points both lie within the interval $(x_{i},x_{i+2})$. We can then move the intercept points of the two neurons toward the two sides by applying a specific perturbation direction, through which the intercept points gather at the data points $x_{i},x_{i+2}$. At the same time, we can verify that this movement reduces the value of the $R_{1}(\bm{\theta})$ term while keeping the training error at 0. The main proof constructs different types of perturbation directions according to the sign patterns of the two neurons. For the second part, i.e., the limitation of the number of intercept points in the same inner interval, the proof outline is as follows. Consider two neurons with the same direction and different intercept points in the same interval $[x_{i},x_{i+1}]$, $i\in[n-1]$; without loss of generality, suppose $\{x_{\tau}\}_{\tau=i+1}^{n}$ is the set of inputs activated by both neurons. According to the mean value inequality, when the total output of the two neurons is held unchanged at each data point in $\{x_{\tau}\}_{\tau=i+1}^{n}$, the $R_{1}(\bm{\theta})$ term is minimized. We can achieve this by moving the two intercept points of the neurons into the interval between the two intercept points through a perturbation.

The limitation of convexity changes. Without loss of generality, suppose the linear interpolation of the three data points on the interval $(x_{i},x_{i+2})$ is convex. If $f_{\bm{\theta}}(x)$ fits these three points but is not convex, then it must have a convexity change. This convexity change corresponds to at least two ReLU neurons $a_{j}\sigma(w_{j}x+b_{j})$, $j\in[2]$, whose output weights have opposite signs. WLOG, we assume the intercept point of the first neuron is less than that of the second, i.e., $-b_{1}/w_{1}<-b_{2}/w_{2}$. After reasonably assuming that the output weight of the neuron with the smaller intercept point is negative, we classify the signs of the two neurons as follows,

Case (1): $w_{1}>0,a_{1}<0,w_{2}>0,a_{2}>0$;
Case (2): $w_{1}>0,a_{1}<0,w_{2}<0,a_{2}>0$;
Case (3): $w_{1}<0,a_{1}<0,w_{2}>0,a_{2}>0$;
Case (4): $w_{1}<0,a_{1}<0,w_{2}<0,a_{2}>0$.

Case (1): For any sufficiently small $\varepsilon>0$, we perturb the parameters of the two neurons as follows:

\begin{array}{cl}\tilde{w}_{1}=w_{1}(1-\varepsilon),&\tilde{b}_{1}=b_{1}+x_{i+1}w_{1}\varepsilon,\\ \tilde{w}_{2}=w_{2}-\frac{a_{1}}{a_{2}}\left(\tilde{w}_{1}-w_{1}\right),&\tilde{b}_{2}=b_{2}-\frac{a_{1}}{a_{2}}\left(\tilde{b}_{1}-b_{1}\right).\end{array}

Here we only consider the case where the intercept point of the first neuron satisfies $-\frac{b_{1}}{w_{1}}\leq x_{i+1}$. The case where $-\frac{b_{1}}{w_{1}}>x_{i+1}$ follows easily from the proof of the limitation of the number of intercept points below. In the same way, we only need to consider the case where $-\frac{b_{2}}{w_{2}}\geq x_{i+1}$. We first consider the case where $-\frac{b_{1}}{w_{1}}<x_{i+1}$ and $-\frac{b_{2}}{w_{2}}>x_{i+1}$. Studying the movement of the intercept points of the two neurons, we have,

-\frac{\tilde{b}_{1}}{\tilde{w}_{1}}-\left(-\frac{b_{1}}{w_{1}}\right)=-\frac{\varepsilon\left(w_{1}x_{i+1}+b_{1}\right)}{w_{1}(1-\varepsilon)}<0,

where $0<\varepsilon<1$, $w_{1}>0$ and $w_{1}x_{i+1}+b_{1}>0$. As for the second neuron, we have,

-\frac{\tilde{b}_{2}}{\tilde{w}_{2}}-\left(-\frac{b_{2}}{w_{2}}\right)=\frac{w_{1}a_{1}\varepsilon\left(w_{2}x_{i+1}+b_{2}\right)}{w_{2}\left(a_{2}w_{2}+a_{1}w_{1}\varepsilon\right)}>0,

where $0<\varepsilon<1$, $w_{1}>0$, $w_{2}>0$, $a_{1}<0$, $w_{2}x_{i+1}+b_{2}<0$ and $a_{2}w_{2}+a_{1}w_{1}\varepsilon>0$. Thus the intercept point of the first neuron moves left and that of the second moves right under the above perturbation.

We then verify the invariance of the NN's output on the input data. We study this invariance within three data point sets $\{x_{\tau}\}_{\tau=1}^{i}$, $\{x_{i+1}\}$ and $\{x_{\tau}\}_{\tau=i+2}^{n}$. For $\{x_{\tau}\}_{\tau=1}^{i}$, the outputs of both neurons remain zero, so the output of the network is unchanged. For $\{x_{i+1}\}$, the output of the second neuron remains zero. As for the first neuron, we have,

\tilde{a}_{1}\sigma(\tilde{w}_{1}x_{i+1}+\tilde{b}_{1})=a_{1}\sigma(w_{1}(1-\varepsilon)x_{i+1}+b_{1}+w_{1}\varepsilon x_{i+1})=a_{1}\sigma(w_{1}x_{i+1}+b_{1}),

thus the output of the network on $\{x_{i+1}\}$ is unchanged. For $\{x_{\tau}\}_{\tau=i+2}^{n}$, we study the output value changes of the two neurons separately. For the first neuron, we have,

\tilde{a}_{1}\sigma(\tilde{w}_{1}x_{\tau}+\tilde{b}_{1})-a_{1}\sigma(w_{1}x_{\tau}+b_{1})=a_{1}\left(w_{1}(1-\varepsilon)x_{\tau}+b_{1}+w_{1}\varepsilon x_{i+1}-w_{1}x_{\tau}-b_{1}\right)=\varepsilon a_{1}w_{1}\left(x_{i+1}-x_{\tau}\right).

For the second neuron, we have,

\tilde{a}_{2}\sigma(\tilde{w}_{2}x_{\tau}+\tilde{b}_{2})-a_{2}\sigma(w_{2}x_{\tau}+b_{2})=-\frac{a_{1}}{a_{2}}a_{2}\left(\left(\tilde{w}_{1}-w_{1}\right)x_{\tau}+\tilde{b}_{1}-b_{1}\right)=-\varepsilon a_{1}w_{1}\left(x_{i+1}-x_{\tau}\right).

Thus the output changes of the two neurons cancel, and the total output change remains zero. After verifying that the output of the network remains unchanged on the input data points, we study the effect of this perturbation on the $R_{1}(\bm{\theta})$ term. Noting that the output values of these two neurons on the input data $\{x_{\tau}\}_{\tau=1}^{i+1}$ remain unchanged, we only need to study the effect of the output changes on the input data $\{x_{\tau}\}_{\tau=i+2}^{n}$. For $\{x_{\tau}\}_{\tau=i+2}^{n}$, we have,

a_{1}^{2}=\tilde{a}_{1}^{2},\quad a_{2}^{2}=\tilde{a}_{2}^{2},\qquad(7)
\sigma(\tilde{w}_{1}x_{\tau}+\tilde{b}_{1})^{2}<\sigma(w_{1}x_{\tau}+b_{1})^{2},
\sigma(\tilde{w}_{2}x_{\tau}+\tilde{b}_{2})^{2}<\sigma(w_{2}x_{\tau}+b_{2})^{2}.

Thus we have,

\tilde{R}_{1}(\bm{\theta})-R_{1}(\bm{\theta})=\sum_{\tau=i+2}^{n}\Big(\tilde{a}_{1}^{2}\sigma(\tilde{w}_{1}x_{\tau}+\tilde{b}_{1})^{2}+\tilde{a}_{2}^{2}\sigma(\tilde{w}_{2}x_{\tau}+\tilde{b}_{2})^{2}-a_{1}^{2}\sigma(w_{1}x_{\tau}+b_{1})^{2}-a_{2}^{2}\sigma(w_{2}x_{\tau}+b_{2})^{2}\Big)<-\Theta(\varepsilon).

As for $-\frac{b_{1}}{w_{1}}=x_{i+1}$ (or $-\frac{b_{2}}{w_{2}}=x_{i+1}$), we can still reach the same conclusion through the above perturbation; the only difference is that the intercept point of the first (or second) neuron does not move to the left (or right).

For the remaining three cases, we give the perturbation directions below, and one can similarly verify that the loss value remains unchanged and that the $R_{1}(\bm{\theta})$ term decreases by $\Theta(\varepsilon)$.

Case (2) and Case (3): The perturbation is shown as follows:

\begin{array}{cc}\tilde{w}_{1}=w_{1}(1-\varepsilon),&\tilde{b}_{1}=b_{1}+x_{i+1}w_{1}\varepsilon,\\ \tilde{w}_{2}=w_{2}+\frac{a_{1}}{a_{2}}\left(\tilde{w}_{1}-w_{1}\right),&\tilde{b}_{2}=b_{2}+\frac{a_{1}}{a_{2}}\left(\tilde{b}_{1}-b_{1}\right),\\ a=-a_{1}\left(\tilde{w}_{1}-w_{1}\right),&b=-a_{1}\left(\tilde{b}_{1}-b_{1}\right).\end{array}

Case (4): The perturbation is shown as follows:

\begin{array}{cl}\tilde{w}_{2}=w_{2}(1-\varepsilon),&\tilde{b}_{2}=b_{2}+x_{i+1}w_{2}\varepsilon,\\ \tilde{w}_{1}=w_{1}-\frac{a_{2}}{a_{1}}\left(\tilde{w}_{2}-w_{2}\right),&\tilde{b}_{1}=b_{1}-\frac{a_{2}}{a_{1}}\left(\tilde{b}_{2}-b_{2}\right).\end{array}

The limitation of the number of intercept points. For any inner interval $[x_{i},x_{i+1}]\cap(x_{2},x_{n-1})$, $i\in[n-1]$, we take two neurons with the same direction and different intercept points in this inner interval, denoted as $a_{j}\sigma(w_{j}x+b_{j})$, $j\in[2]$. To keep consistency with the above proof, we denote the boundary points of the inner interval by $x_{i+1}$ and $x_{i+2}$. Without loss of generality, we assume the input weights of the two neurons satisfy $w_{1},w_{2}>0$, the output weight of the first neuron satisfies $a_{1}<0$, and the intercept point of the first neuron is less than that of the second, i.e., $-b_{1}/w_{1}<-b_{2}/w_{2}$. We study $a_{2}>0$ and $a_{2}<0$ respectively, where the first case corresponds to the case $-\frac{b_{1}}{w_{1}}>x_{i+1}$ in Case (1) above. We first study the case where $a_{2}>0$ and perturb the parameters of the two neurons as follows:

\begin{array}{cl}\tilde{w}_{1}=w_{1}(1-\varepsilon),&\tilde{b}_{1}=b_{1}+x_{i+1}w_{1}\varepsilon,\\ \tilde{w}_{2}=w_{2}-\frac{a_{1}}{a_{2}}\left(\tilde{w}_{1}-w_{1}\right),&\tilde{b}_{2}=b_{2}-\frac{a_{1}}{a_{2}}\left(\tilde{b}_{1}-b_{1}\right).\end{array}

By studying the movement of the intercept points of the two neurons, we have,

-\frac{\tilde{b}_{1}}{\tilde{w}_{1}}-\left(-\frac{b_{1}}{w_{1}}\right)=-\frac{\varepsilon\left(w_{1}x_{i+1}+b_{1}\right)}{w_{1}(1-\varepsilon)}>0,

where $0<\varepsilon<1$, $w_{1}>0$ and $w_{1}x_{i+1}+b_{1}<0$. As for the second neuron, we have,

-\frac{\tilde{b}_{2}}{\tilde{w}_{2}}-\left(-\frac{b_{2}}{w_{2}}\right)=\frac{w_{1}a_{1}\varepsilon\left(w_{2}x_{i+1}+b_{2}\right)}{w_{2}\left(a_{2}w_{2}+a_{1}w_{1}\varepsilon\right)}>0,

where $0<\varepsilon<1$, $w_{1}>0$, $w_{2}>0$, $a_{1}<0$, $w_{2}x_{i+1}+b_{2}<0$ and $a_{2}w_{2}+a_{1}w_{1}\varepsilon>0$. Thus the intercept points of both neurons move right under the above perturbation.

We then verify the invariance of the NN's output on the input data. We study this invariance within two data point sets $\{x_{\tau}\}_{\tau=1}^{i+1}$ and $\{x_{\tau}\}_{\tau=i+2}^{n}$. For $\{x_{\tau}\}_{\tau=1}^{i+1}$, the outputs of both neurons remain zero, so the output of the network is unchanged. For $\{x_{\tau}\}_{\tau=i+2}^{n}$, we study the output value changes of the two neurons separately. For the first neuron, we have,

\tilde{a}_{1}\sigma(\tilde{w}_{1}x_{\tau}+\tilde{b}_{1})-a_{1}\sigma(w_{1}x_{\tau}+b_{1})=a_{1}\left(w_{1}(1-\varepsilon)x_{\tau}+b_{1}+w_{1}\varepsilon x_{i+1}-w_{1}x_{\tau}-b_{1}\right)=\varepsilon a_{1}w_{1}\left(x_{i+1}-x_{\tau}\right).

For the second neuron, we have,

\tilde{a}_{2}\sigma(\tilde{w}_{2}x_{\tau}+\tilde{b}_{2})-a_{2}\sigma(w_{2}x_{\tau}+b_{2})=-\frac{a_{1}}{a_{2}}a_{2}\left(\left(\tilde{w}_{1}-w_{1}\right)x_{\tau}+\tilde{b}_{1}-b_{1}\right)=-\varepsilon a_{1}w_{1}\left(x_{i+1}-x_{\tau}\right).

With the same relation as in Equation (7), we have,

\tilde{R}_{1}(\bm{\theta})-R_{1}(\bm{\theta})<-\Theta(\varepsilon).

For $a_{2}<0$, we consider three cases:

Case (1): $a_{1}w_{1}=a_{2}w_{2}$;
Case (2): $a_{1}w_{1}>a_{2}w_{2}$;
Case (3): $a_{1}w_{1}<a_{2}w_{2}$.

For Case (1), we perturb the parameters of two neurons as follows:

\begin{array}{cl}\tilde{w}_{1}=w_{1},&\tilde{b}_{1}=b_{1}-\varepsilon,\\ \tilde{w}_{2}=w_{2},&\tilde{b}_{2}=b_{2}-\frac{a_{1}}{a_{2}}\left(\tilde{b}_{1}-b_{1}\right).\end{array}

We first study the movement of the intercept points of the two neurons,

-\frac{\tilde{b}_{1}}{\tilde{w}_{1}}-\left(-\frac{b_{1}}{w_{1}}\right)=\frac{\varepsilon}{w_{1}}>0,

where $0<\varepsilon<1$ and $w_{1}>0$. In the same way, we have,

-\frac{\tilde{b}_{2}}{\tilde{w}_{2}}-\left(-\frac{b_{2}}{w_{2}}\right)<0.

For the invariance of the NN's output on the input dataset $\{x_{\tau}\}_{\tau=i+2}^{n}$, we have,

\tilde{a}_{1}\sigma(\tilde{w}_{1}x_{\tau}+\tilde{b}_{1})-a_{1}\sigma(w_{1}x_{\tau}+b_{1})=a_{1}(w_{1}x_{\tau}+b_{1}-\varepsilon-w_{1}x_{\tau}-b_{1})=-\varepsilon a_{1},
\tilde{a}_{2}\sigma(\tilde{w}_{2}x_{\tau}+\tilde{b}_{2})-a_{2}\sigma(w_{2}x_{\tau}+b_{2})=-\frac{a_{1}}{a_{2}}a_{2}\left(\left(\tilde{w}_{1}-w_{1}\right)x_{\tau}+\tilde{b}_{1}-b_{1}\right)=\varepsilon a_{1}.

Thus, for each input point $x_{\tau}$ in $\{x_{\tau}\}_{\tau=i+2}^{n}$, we easily obtain,

\tilde{a}_{1}\sigma(\tilde{w}_{1}x_{\tau}+\tilde{b}_{1})+\tilde{a}_{2}\sigma(\tilde{w}_{2}x_{\tau}+\tilde{b}_{2})-a_{1}\sigma(w_{1}x_{\tau}+b_{1})-a_{2}\sigma(w_{2}x_{\tau}+b_{2})=0,
|\tilde{a}_{1}\sigma(\tilde{w}_{1}x_{\tau}+\tilde{b}_{1})-\tilde{a}_{2}\sigma(\tilde{w}_{2}x_{\tau}+\tilde{b}_{2})|<|a_{1}\sigma(w_{1}x_{\tau}+b_{1})-a_{2}\sigma(w_{2}x_{\tau}+b_{2})|.

Then, we have,

\tilde{R}_{1}(\bm{\theta})-R_{1}(\bm{\theta})=\sum_{\tau=i+2}^{n}\Big(\tilde{a}_{1}^{2}\sigma(\tilde{w}_{1}x_{\tau}+\tilde{b}_{1})^{2}+\tilde{a}_{2}^{2}\sigma(\tilde{w}_{2}x_{\tau}+\tilde{b}_{2})^{2}-a_{1}^{2}\sigma(w_{1}x_{\tau}+b_{1})^{2}-a_{2}^{2}\sigma(w_{2}x_{\tau}+b_{2})^{2}\Big)<-\Theta(\varepsilon).

For the remaining two cases, we give the perturbation directions below, and one can similarly verify that the loss value remains unchanged and that the $R_{1}(\bm{\theta})$ term decreases by $\Theta(\varepsilon)$.

Case (2): The perturbation is shown as follows:

\begin{array}{cc}\tilde{w}_{1}=w_{1}(1+\varepsilon),&\tilde{b}_{1}=b_{1}-\frac{a_{2}b_{2}-a_{1}b_{1}}{a_{1}w_{1}-a_{2}w_{2}}w_{1}\varepsilon,\\ \tilde{w}_{2}=w_{2}-\frac{a_{1}}{a_{2}}\left(\tilde{w}_{1}-w_{1}\right),&\tilde{b}_{2}=b_{2}-\frac{a_{1}}{a_{2}}\left(\tilde{b}_{1}-b_{1}\right).\end{array}

Case (3): The perturbation is shown as follows:

\begin{array}{cl}\tilde{w}_{1}=w_{1}(1-\varepsilon),&\tilde{b}_{1}=b_{1}+x_{i+1}w_{1}\varepsilon,\\ \tilde{w}_{2}=w_{2}-\frac{a_{1}}{a_{2}}\left(\tilde{w}_{1}-w_{1}\right),&\tilde{b}_{2}=b_{2}-\frac{a_{1}}{a_{2}}\left(\tilde{b}_{1}-b_{1}\right).\end{array}
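The Case (1) construction above can be checked numerically. The following sketch builds two ReLU neurons with the stated sign pattern, applies the perturbation, and verifies that the outputs at the data points are unchanged while the contribution to $R_{1}(\bm{\theta})$ decreases; all numerical values are arbitrary illustrations.

```python
import torch

def pair_output(w, b, a, x):
    # combined output of the two neurons: sum_j a_j * relu(w_j * x + b_j)
    return (a[:, None] * torch.relu(w[:, None] * x[None, :] + b[:, None])).sum(0)

def pair_r1(w, b, a, x, p=0.9):
    n = x.shape[0]
    act = a[:, None] * torch.relu(w[:, None] * x[None, :] + b[:, None])
    return (1 - p) / (2 * n * p) * (act ** 2).sum()

# Case (1): w1 > 0, a1 < 0, w2 > 0, a2 > 0, with -b1/w1 < x_{i+1} < -b2/w2
x_ip1 = 0.0
x_data = torch.tensor([x_ip1, 0.5, 1.0, 1.5])  # x_{i+1} and the points x_tau, tau >= i+2
w = torch.tensor([1.0, 2.0])
b = torch.tensor([0.3, -0.4])                  # intercepts at -0.3 and 0.2
a = torch.tensor([-0.7, 0.5])

eps = 1e-2
w_t, b_t = w.clone(), b.clone()
w_t[0] = w[0] * (1 - eps)
b_t[0] = b[0] + x_ip1 * w[0] * eps
w_t[1] = w[1] - a[0] / a[1] * (w_t[0] - w[0])
b_t[1] = b[1] - a[0] / a[1] * (b_t[0] - b[0])

print(torch.allclose(pair_output(w, b, a, x_data), pair_output(w_t, b_t, a, x_data)))  # True
print(float(pair_r1(w_t, b_t, a, x_data)) < float(pair_r1(w, b, a, x_data)))           # True
```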

Theorem (the effect of $R_{1}(\bm{\theta})$ on facilitating flatness).

Under Settings 1–3, consider a two-layer ReLU NN,

f_{\bm{\theta}}(\bm{x})=\sum_{j=1}^{m}a_{j}\sigma(\bm{w}_{j}^{\intercal}\bm{x}),

trained with dataset $S=\{(\bm{x}_{i},y_{i})\}_{i=1}^{n}$. Under gradient flow training with the loss function $R_{S}(\bm{\theta})+R_{1}(\bm{\theta})$, if $\bm{\theta}_{0}$ satisfies $R_{S}(\bm{\theta}_{0})=0$ and $\nabla_{\bm{\theta}}R_{1}(\bm{\theta}_{0})\neq 0$, we have

\frac{\mathrm{d}\left(\frac{1}{n}\sum_{i=1}^{n}\lVert\nabla_{\bm{\theta}}f_{\bm{\theta}_{0}}(\bm{x}_{i})\rVert_{2}^{2}\right)}{\mathrm{d}t}<0.
Proof.

Recall the quantity that characterizes the flatness of the model $f_{\bm{\theta}}$,

\frac{1}{n}\sum_{i=1}^{n}\lVert\nabla_{\bm{\theta}}f_{\bm{\theta}}(\bm{x}_{i})\rVert_{2}^{2}=\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m}\left(\sigma^{2}(\bm{w}_{j}^{\intercal}\bm{x}_{i})+a_{j}^{2}\sigma^{\prime}(\bm{w}_{j}^{\intercal}\bm{x}_{i})^{2}\left\lVert\bm{x}_{i}\right\rVert_{2}^{2}\right).

Then, under the assumption $R_{S}(\bm{\theta})=0$, we obtain $\nabla_{\bm{\theta}}\left(R_{S}(\bm{\theta})+R_{1}(\bm{\theta})\right)=\nabla_{\bm{\theta}}R_{1}(\bm{\theta})$. We also notice that for a sufficiently small amount of time $t$, the sign of $\bm{w}_{j}^{\intercal}\bm{x}_{i}$ remains the same. Hence, given data $\bm{x}_{i}$, $i\in[n]$, and the parameter set $\bm{\theta}_{j}:=(a_{j},\bm{w}_{j})$, $j\in[m]$, if $\bm{w}_{j}^{\intercal}\bm{x}_{i}<0$ for some indices $i,j$, then $\lVert\nabla_{\bm{\theta}_{j}}f_{\bm{\theta}}(\bm{x}_{i})\rVert_{2}^{2}\equiv 0$. This analysis reveals that for a given neuron with index $j$, we only need to focus on the data $\bm{x}_{i}$ satisfying $\bm{w}_{j}^{\intercal}\bm{x}_{i}>0$, i.e., the following set constitutes our data of interest for the $j$-th neuron,

D_{j}:=\{(\bm{x}_{k},y_{k})\}_{k=1}^{n_{j}}:=\{(\bm{x}_{i},y_{i})\mid\bm{w}_{j}^{\intercal}\bm{x}_{i}>0\}\subseteq S,

As $\bm{\theta}_{j}$ is trained under the gradient flow of $R_{1}(\bm{\theta})$, i.e.,

\frac{\mathrm{d}a_{j}}{\mathrm{d}t}=-\frac{1-p}{2np}\sum_{i=1}^{n}a_{j}\sigma^{2}(\bm{w}_{j}^{\intercal}\bm{x}_{i}),\qquad
\frac{\mathrm{d}\bm{w}_{j}}{\mathrm{d}t}=-\frac{1-p}{2np}\sum_{i=1}^{n}a_{j}^{2}\sigma(\bm{w}_{j}^{\intercal}\bm{x}_{i})\bm{x}_{i}^{\intercal},

then the gradient flow of the flatness reads

\frac{\mathrm{d}}{\mathrm{d}t}\sum_{k=1}^{n_{j}}\lVert\nabla_{\bm{\theta}_{j}}f_{\bm{\theta}}(\bm{x}_{k})\rVert_{2}^{2}
=\sum_{k=1}^{n_{j}}\left(\left<2(\bm{w}_{j}^{\intercal}\bm{x}_{k})\bm{x}_{k},\frac{\mathrm{d}\bm{w}_{j}}{\mathrm{d}t}\right>+2a_{j}\left\lVert\bm{x}_{k}\right\rVert_{2}^{2}\frac{\mathrm{d}a_{j}}{\mathrm{d}t}\right)
=-\frac{1-p}{np}\sum_{i=1}^{n}\sum_{k=1}^{n_{j}}a_{j}^{2}(\bm{w}_{j}^{\intercal}\bm{x}_{k})(\bm{w}_{j}^{\intercal}\bm{x}_{i})\left<\bm{x}_{k},\bm{x}_{i}\right>\mathbf{1}_{\bm{w}_{j}^{\intercal}\bm{x}_{i}>0}
\quad-\frac{1-p}{np}\sum_{i=1}^{n}\sum_{k=1}^{n_{j}}a^{2}_{j}\left\lVert\bm{x}_{k}\right\rVert_{2}^{2}\sigma^{2}(\bm{w}_{j}^{\intercal}\bm{x}_{i}),

By the definition of $D_{j}$, the first term can be written as

\frac{1-p}{np}\sum_{i=1}^{n}\sum_{k=1}^{n_{j}}a_{j}^{2}(\bm{w}_{j}^{\intercal}\bm{x}_{k})(\bm{w}_{j}^{\intercal}\bm{x}_{i})\left<\bm{x}_{k},\bm{x}_{i}\right>\mathbf{1}_{\bm{w}_{j}^{\intercal}\bm{x}_{i}>0}
=\frac{1-p}{np}\sum_{i=1}^{n_{j}}\sum_{k=1}^{n_{j}}a_{j}^{2}(\bm{w}_{j}^{\intercal}\bm{x}_{k})(\bm{w}_{j}^{\intercal}\bm{x}_{i})\left<\bm{x}_{k},\bm{x}_{i}\right>
=\frac{1-p}{np}\left\lVert\sum_{k=1}^{n_{j}}a_{j}(\bm{w}_{j}^{\intercal}\bm{x}_{k})\bm{x}_{k}\right\rVert_{2}^{2}\geq 0,

hence

\frac{\mathrm{d}}{\mathrm{d}t}\sum_{k=1}^{n_{j}}\lVert\nabla_{\bm{\theta}_{j}}f_{\bm{\theta}}(\bm{x}_{k})\rVert_{2}^{2}<0,

which finishes the proof.
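The flatness quantity $\frac{1}{n}\sum_{i=1}^{n}\lVert\nabla_{\bm{\theta}}f_{\bm{\theta}}(\bm{x}_{i})\rVert_{2}^{2}$ used above can also be measured numerically for a generic scalar-output network, as in the following hedged sketch; the model and data below are placeholders, not the networks used in our experiments.

```python
import torch
import torch.nn as nn

def gradient_flatness(model, inputs):
    """Average squared norm of the parameter gradient of the model output, per sample."""
    params = [p for p in model.parameters() if p.requires_grad]
    total = 0.0
    for x in inputs:
        out = model(x.unsqueeze(0)).squeeze()
        grads = torch.autograd.grad(out, params)
        total += sum((g ** 2).sum().item() for g in grads)
    return total / len(inputs)

model = nn.Sequential(nn.Linear(2, 100), nn.ReLU(), nn.Linear(100, 1))
x = torch.randn(16, 2)
print(gradient_flatness(model, x))
```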

Appendix C Additional Experimental Results

C.1 Verification of $R_{2}(\bm{\theta})$ on Complete Datasets

As shown in Fig. 14, the learning rate $\varepsilon$ and the regularization coefficient $\lambda$ take similar values when the maximum test accuracy is reached (red points).

Figure 14: For different training tasks, the test accuracy obtained by training the network with SGD under different learning rates and regularization coefficients. The red dots indicate the location of the maximum test accuracy of the NNs obtained by training with the two loss functions. For the loss function $R_{S}^{\mathrm{drop}}\left(\bm{\theta},\bm{\eta}\right)$, we train the NNs with different learning rates $\varepsilon$. For the loss function $R_{S}^{\mathrm{drop}}\left(\bm{\theta},\bm{\eta}\right)+(\lambda/4)\|\nabla_{\bm{\theta}}R_{S}^{\mathrm{drop}}\left(\bm{\theta},\bm{\eta}\right)\|^{2}$, we train the NNs with different regularization coefficients $\lambda$ and a small, fixed learning rate ($\varepsilon=5\times 10^{-3}$). (a) Classification of the Fashion-MNIST dataset by training four-layer NNs of width 4096 with a training batch size of 64. (b) Classification of the CIFAR-10 dataset by training VGG-9 with a training batch size of 32.

C.2 Dropout Facilitates Condensation on ReLU NNs

In this subsection, we verify that ReLU networks also exhibit condensation under dropout, as shown in Fig. 15, similar to the tanh NNs shown in Fig. 4 in the main text. For the ReLU NNs, we only plot the neurons with non-zero output on the data interval $[x_{1},x_{n}]$; neurons whose output is identically zero on the data interval affect neither the training process nor the NN's output. Further, we study the results of ReLU NNs trained with SGD and the relationship between the model rank and the generalization error, as shown in Figs. 16 and 17, which correspond to the tanh NNs shown in Figs. 6 and 8 in the main text.
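Condensation in these figures can also be quantified directly from the trained parameters, e.g., by the pairwise cosine similarity of the hidden neurons' input weights (including the bias). The following is a hedged sketch of one such check; the weight matrices are random placeholders standing in for a trained network.

```python
import torch

def condensation_similarity(W, bias):
    """Pairwise cosine similarity of hidden-neuron input weights (w_j, b_j);
    values near +/-1 for most pairs indicate condensation onto few orientations."""
    V = torch.cat([W, bias[:, None]], dim=1)        # (m, d + 1)
    V = V / V.norm(dim=1, keepdim=True).clamp_min(1e-12)
    return V @ V.T                                  # (m, m) cosine-similarity matrix

W, bias = torch.randn(1000, 1), torch.randn(1000)   # placeholder two-layer, 1D-input weights
sim = condensation_similarity(W, bias)
print(float((sim.abs() > 0.99).float().mean()))     # fraction of near-parallel neuron pairs
```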

(a) $p=1$, output; (b) $p=0.9$, output; (c) $p=1$, feature; (d) $p=0.9$, feature; (e) $p=1$, output; (f) $p=0.9$, output; (g) $p=1$, feature; (h) $p=0.9$, feature.
Figure 15: ReLU NN outputs and features under different dropout rates. The width of the hidden layers is 1000, and the learning rate for all experiments is $1\times 10^{-3}$. In (c,d,g,h), blue dots and orange dots show the weight feature distribution at the initial and final training stages, respectively. The top row shows results for two-layer networks, with the dropout layer after the hidden layer. The bottom row shows results for three-layer networks, with dropout layers between the two hidden layers and after the last hidden layer.
(a) batch size $=2$, output; (b) batch size $=2$, feature.
Figure 16: Two-layer ReLU NN output and feature under a batch size of 2. The width of the hidden layer is 1000, and the learning rate is $1\times 10^{-3}$. In (b), blue dots and orange dots show the weight feature distribution at the initial and final training stages, respectively.
Figure 17: Average test error of the two-layer ReLU NNs (color) vs. the number of samples (abscissa) for different dropout rates (ordinate). For all experiments, the width of the hidden layer is 1000, and the learning rate is $1\times 10^{-4}$ with the Adam optimizer. Each test error is averaged over 10 trials with random initialization.

C.3 Detailed Features of Tanh NNs

To eliminate the influence of the inhomogeneity of the tanh activation function on the parameter features in Fig. 4, we draw normalized scatter diagrams of $\lVert a_{j}\rVert$ and $\lVert\bm{w}_{j}\rVert$ against the orientation, as shown in Fig. 18. Clearly, for the network trained with dropout, both the input weights and the output weights condense, while the network trained without dropout shows no weight condensation.
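The scatter data in Fig. 18 can be extracted from a trained two-layer network as in the following sketch; the attribute names `fc1` and `fc2` are assumed for illustration and are not the original code.

```python
import torch

def neuron_scatter_data(model):
    """Per hidden neuron: orientation of the input weight (w_j, b_j) and the
    normalized norms of the input and output weights, for a 1D-input network."""
    W = torch.cat([model.fc1.weight.data, model.fc1.bias.data[:, None]], dim=1)  # (m, 2)
    a = model.fc2.weight.data.squeeze(0)                                          # (m,)
    angle = torch.atan2(W[:, 1], W[:, 0])
    w_norm, a_norm = W.norm(dim=1), a.abs()
    return angle, w_norm / w_norm.max(), a_norm / a_norm.max()
```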

(a) two-layer NN, initialization; (b) three-layer NN, initialization; (c) two-layer NN, $p=0.9$; (d) three-layer NN, $p=0.9$; (e) two-layer NN, $p=1$; (f) three-layer NN, $p=1$.
Figure 18: Normalized scatter diagrams of $\lVert a_{j}\rVert$ and $\lVert\bm{w}_{j}\rVert$ against the orientation for tanh NNs, for the initial parameters and the parameters trained with and without dropout. Blue dots and orange dots show the output weight distribution and the input weight distribution, respectively.