Dropout Regularization Versus ℓ₂-Penalization in the Linear Model
Abstract
We investigate the statistical behavior of gradient descent iterates with dropout in the linear regression model. In particular, non-asymptotic bounds for the convergence of expectations and covariance matrices of the iterates are derived. The results shed more light on the widely cited connection between dropout and ℓ₂-regularization in the linear model. We indicate a more subtle relationship, owing to interactions between the gradient descent dynamics and the additional randomness induced by dropout. Further, we study a simplified variant of dropout which does not have a regularizing effect and converges to the least squares estimator.
1 Introduction
Dropout is a simple, yet effective, algorithmic regularization technique, intended to prevent neural networks from overfitting. First introduced in [46], the method is implemented via random masking of neurons at training time. Specifically, during every gradient descent iteration, the output of each neuron is replaced by zero based on the outcome of an independently sampled Bernoulli-distributed variable, temporarily removing each neuron with a fixed probability; see Figure 1 for an illustration. The method has demonstrated effectiveness in various applications, see for example [28, 46]. On the theoretical side, dropout is often studied by exhibiting connections with explicit regularizers [3, 6, 9, 31, 32, 34, 45, 46, 48]. Rather than analyzing the gradient descent iterates with dropout noise, these results consider the marginalized training loss with marginalization over the dropout noise. Within this framework, [46] established a connection between dropout and weighted ℓ₂-penalization in the linear regression model. This connection is now cited in popular textbooks [14, 18].
However, [51] show empirically that injecting dropout noise in the gradient descent iterates also induces an implicit regularization effect that is not captured by the link between the marginalized loss and explicit regularization. This motivates our approach to directly derive the statistical properties of gradient descent iterates with dropout. We study the linear regression model due to its mathematical tractability and because the minimizer of the explicit regularizer is unique and admits a closed-form expression. In line with the implicit regularization observed in [51], our main result provides a theoretical bound quantifying the amount of randomness in the gradient descent scheme that is ignored by the previously considered minimizers of the marginalized training loss. More specifically, Theorem 2 shows that for a fixed learning rate there is additional randomness which fails to vanish in the limit, while Theorem 3 characterizes the gap between dropout and ℓ₂-penalization with respect to the learning rate, the dropout probability, the design matrix, and the distribution of the initial iterate. Theorem 4 shows that this gap disappears for the Ruppert-Polyak averages of the iterates.
To provide a clearer understanding of the interplay between gradient descent and variance, we also investigate a simplified variant of dropout featuring more straightforward interactions between the two. Applying the same analytical techniques to this simplified variant, Theorem 5 establishes convergence in quadratic mean to the conventional linear least-squares estimator. This analysis illustrates the sensitivity of gradient descent to small changes in the way noise is injected during training.
Many randomized optimization methods can be formulated as noisy gradient descent schemes. The developed strategy to treat gradient descent with dropout may be generalized to other settings. An example is the recent analysis of forward gradient descent in [8].
The article is organized as follows. After discussing related results below, Section 2 contains preliminaries and introduces two different variants of dropout. Section 3 discusses some extensions of previous results for averaged dropout obtained by marginalizing over the dropout distribution in the linear model considered in [46]. Section 4 illustrates the main results on gradient descent with dropout in the linear model, and examines its statistical optimality. Section 5 contains further discussion and mentions a number of natural follow-up problems. All proofs are deferred to the Appendix.
1.1 Other Related Work
Considering linear regression and the marginalized training loss with marginalization over the dropout noise, the initial dropout article [46] already connects dropout with ℓ₂-regularization. This connection was also noted by [6] and by [31]. As this argument is crucial in our own analysis, we will discuss it in more detail in Section 3.
[48] extends the reasoning to generalized linear models and more general forms of injected noise. Employing a quadratic approximation to the loss function after marginalization over the injected noise, the authors exhibit an explicit regularizer. In the case of dropout noise, this regularizer induces, to first order, an ℓ₂-penalty after rescaling of the data by the estimated inverse of the diagonal Fisher information.
For two-layer models, marginalizing the dropout noise leads to a nuclear norm penalty on the product matrix, both in matrix factorization [9] and linear neural networks [34]. The latter may be seen as a special case of a particular “ℓ₂-path regularizer”, which appears in deep linear networks [32] and shallow ReLU-activated networks [3]. Further, [3] exhibit a data distribution-dependent regularizer in two-layer matrix sensing/completion problems. This regularizer collapses to a nuclear norm penalty for specific distributions.
[16] show that empirical risk minimization in deep neural networks with dropout may be recast as performing Bayesian variational inference to approximate the intractable posterior resulting from a deep Gaussian process prior. The Bayesian viewpoint also allows for the quantification of uncertainty. [15] further generalizes this technique to recurrent and long short-term memory (LSTM) networks. [52] analyze dropout applied to the max-pooling layers in convolutional neural networks. [50] present a Gaussian approximation to the gradient noise induced by dropout.
Generalization results for dropout training exist in various settings. Given bounds on the norms of weight vectors, [49], [17], and [53] prove decreasing Rademacher complexity bounds as the dropout rate increases. [3] bound the Rademacher complexity of shallow ReLU-activated networks with dropout. [31] obtains a PAC-Bayes bound for dropout and illustrates a trade-off between large and small dropout probabilities for different terms in the bound.
Recently, [30] demonstrated a universal approximation result in the vein of classic results [11, 25, 29], stating that any function in some generic semi-normed space that can be -approximated by a deterministic neural network may also be stochastically approximated in -norm by a sufficiently large network with dropout.
Less is known about gradient descent training with dropout. [45] study the gradient flow associated with the explicit regularizer obtained by marginalizing the dropout noise in a shallow linear network. In particular, the flow converges exponentially fast within a neighborhood of a parameter vector satisfying a balancing condition. [33] study gradient descent with dropout on the logistic loss of a shallow ReLU-activated network in a binary classification task. Their main result includes an explicit rate for the misclassification error, assuming an overparametrized network operating in the so-called lazy regime, where the trained weights stay relatively close to their initializations, and two well-separated classes.
1.2 Notation
Column vectors are denoted by bold letters. We define , , and the Euclidean norm . The identity matrix is symbolized by , or simply , when the dimension is clear from context. For matrices of the same dimension, denotes the Hadamard/entry-wise product . We write for the diagonal matrix with the same main diagonal as . Given , we define the matrices
In particular, , so results from rescaling the off-diagonal entries of by .
The smallest eigenvalue of a symmetric matrix is denoted by . The operator norm of a linear operator between normed linear spaces is given by . We write for the spectral norm of matrices, which is the operator norm induced by . For symmetric matrices, the relation signifies for all non-zero vectors . The strict operator inequality is defined analogously.
2 Gradient Descent and Dropout
We consider a linear regression model with fixed design matrix and outcomes , so that
(1)
with unknown parameter . We assume and . The task is to estimate from the observed data . As the Gram matrix appears throughout our analysis, we introduce the shorthand
Recovery of in the linear regression model (1) may be interpreted as training a neural network without intermediate hidden layers, see Figure 2. If were to have a zero column, the corresponding regression coefficient would not affect the response vector . Consequently, both the zero column and the regression coefficient may be eliminated from the linear regression model. Without zero columns, the model is said to be in reduced form.
The least squares criterion for the estimation of refers to the objective function . Given a fixed learning rate , performing gradient descent on the least squares objective leads to the iterative scheme
(2)
with and (possibly random) initialization .
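To fix ideas, the following minimal sketch implements the recursion (2) numerically. The variable names, problem sizes, and the choice of learning rate are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem instance: n observations, d parameters.
n, d = 100, 5
X = rng.standard_normal((n, d))            # fixed design matrix
Y = X @ rng.standard_normal(d) + rng.standard_normal(n)

G = X.T @ X                                # Gram matrix
alpha = 1.0 / np.linalg.eigvalsh(G).max()  # small fixed learning rate
theta = np.zeros(d)                        # initialization

# Gradient descent on the least squares objective, cf. (2):
# theta_{k+1} = theta_k + alpha * X^T (Y - X theta_k).
for _ in range(2000):
    theta = theta + alpha * X.T @ (Y - X @ theta)

# For a sufficiently small learning rate the iterates approach the
# least squares estimator.
print(np.allclose(theta, np.linalg.solve(G, X.T @ Y)))
```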
For standard gradient descent as defined in (2), the estimate is updated with the gradient of the full model. Dropout, as introduced in [46], replaces the gradient of the full model with the gradient of a randomly reduced model during each iteration of training. To make this notion more precise, we call a random diagonal matrix a -dropout matrix, or simply a dropout matrix, if its diagonal entries satisfy for some . We note that the Bernoulli distribution may alternatively be parametrized with the failure probability , but following [46] we choose the success probability .
On average, has diagonal entries equal to and diagonal entries equal to . Given any vector , the coordinates of are randomly set to with probability . For simplicity, the dependence of on will be omitted.
Now, let , be a sequence of i.i.d. dropout matrices, where refers to the dropout matrix applied in the th iteration. Gradient descent with dropout takes the form
(3)
with and (possibly random) initialization . In contrast with (2), the gradient in (3) is taken on the model reduced by the action of multiplying with . Alternatively, (3) may be interpreted as replacing the design matrix with the reduced matrix during the th iteration. The columns of the reduced matrix are randomly deleted with a probability of . Observe that the dropout matrix appears inside the squared norm, making the gradient quadratic in .
Dropout, as defined in (3), considers the full gradient of the reduced model, whereas another variant is obtained through reduction of the full gradient. The resulting iterative scheme takes the form
(4)
with and (possibly random) initialization . As opposed to defined above, the dropout matrix only occurs once in the updates, so we shall call this method simplified dropout from here on. As we will illustrate, the quadratic dependence of on creates various challenges, whereas the analysis of is more straightforward.
Both versions (3) and (4) coincide when the Gram matrix is diagonal, meaning when the columns of are orthogonal. To see this, note that diagonal matrices commute, so and hence .
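The two noisy recursions (3) and (4) can be written down in a few lines. The sketch below, with illustrative sizes and constants, runs both schemes with shared Bernoulli dropout draws on a design with orthogonal columns and confirms numerically that the iterates then coincide, as argued above; for a general design the two paths differ.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, p, alpha, iters = 50, 4, 0.8, 0.01, 500

# Design with orthogonal columns, so that the Gram matrix X^T X is diagonal.
X, _ = np.linalg.qr(rng.standard_normal((n, d)))
X *= 3.0                                    # rescaling keeps the columns orthogonal
Y = X @ rng.standard_normal(d) + rng.standard_normal(n)

theta_dropout = np.zeros(d)                 # iterates of (3)
theta_simplified = np.zeros(d)              # iterates of (4)

for _ in range(iters):
    # Bernoulli dropout matrix, shared by both recursions in this iteration.
    D = np.diag(rng.binomial(1, p, size=d).astype(float))
    # (3): gradient of the randomly reduced model with design X D.
    theta_dropout = theta_dropout + alpha * D @ X.T @ (Y - X @ D @ theta_dropout)
    # (4): dropout applied to the gradient of the full model.
    theta_simplified = theta_simplified + alpha * D @ X.T @ (Y - X @ theta_simplified)

print(np.allclose(theta_dropout, theta_simplified))   # True for diagonal X^T X
```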
We note that dropout need not require the complete removal of neurons. Each neuron may be multiplied by an arbitrary (not necessarily Bernoulli distributed) random variable. For instance, [46] also report good performance for Gaussian-distributed diagonal entries of the dropout matrix. However, the Bernoulli variant seems well-motivated from a model averaging perspective. [46] propose dropout with the explicit aim of approximating a Bayesian model averaging procedure over all possible combinations of connections in the network. The random removal of nodes during training is thought to prevent the neurons from co-adapting, recreating the model averaging effect. This is the main variant implemented in popular software libraries, such as Caffe [26], TensorFlow [1], Keras [10], and PyTorch [38].
Numerous variations and extensions of dropout exist. [49] show state-of-the-art results for networks with DropConnect, a generalization of dropout where connections are randomly dropped, instead of neurons. In the linear model, this coincides with standard dropout. [4] analyze the case of varying dropout probabilities, where the dropout probability for each neuron is computed using binary belief networks that share parameters with the underlying fully connected network. An adaptive procedure for the choice of dropout probabilities is presented in [27], while also giving a Bayesian justification for dropout.
3 Analysis of Averaged Dropout
Before presenting our main results on iterative dropout schemes, we further discuss some properties of the marginalized loss minimizer that was first analyzed by [46]. For the linear regression model (1), marginalizing the dropout noise leads to
(5)
One may hope that the dropout gradient descent recursion for in (3) leads to a minimizer of (5), so that the marginalized loss minimizer may be studied as a surrogate for the behaviour of in the long run.
Intuitively, the gradient descent iterates with dropout represent a Monte-Carlo estimate of some deterministic algorithm [50]. This can be motivated by separating the gradient descent update into a part without algorithmic randomness and a centered noise term, meaning
(6)
Notably, the stochastic terms form a martingale difference sequence with respect to . It seems conceivable that the noise in (6) averages out; despite the random variables being neither independent, nor identically distributed, one may hope that a law of large numbers still holds, see [2]. In this case, after a sufficient number of gradient steps,
(7)
The latter sequence could plausibly converge to the marginalized loss minimizer . While this motivates studying , the main conclusion of our work is that this heuristic is not entirely correct and additional noise terms occur in the limit .
As the marginalized loss minimizer still plays a pivotal role in our analysis, we briefly recount and expand on some of the properties derived in [46]. Recall that , so we have
Since is diagonal, , and by Lemma 10(a), ,
(8)
The right-hand side may be identified with a Tikhonov functional, or an ℓ₂-penalized least squares objective. Its gradient with respect to is given by
Recall from the discussion following Equation (1) that the model is assumed to be in reduced form, meaning . In turn,
is bounded away from , making invertible. Setting the gradient to zero and solving for the minimizer now leads to
(9)
where . If the columns of are orthogonal, then is diagonal and hence . In this case, matches the usual linear least squares estimator . Alternatively, minimizing the marginalized loss can also be deduced from the identity
(10)
which holds for all estimators . See Appendix A for a proof of (10). We now mention several other relevant properties of .
Calibration: [46] recommend multiplying by , which may be motivated as follows: Since , a small squared error in (3) implies . Moreover, multiplying by leads to which may be identified with the minimizer of the objective function
This recasts as resulting from a weighted form of ridge regression. Comparing the objective function to the original marginalized loss , the rescaling replaces with the normalized dropout matrix , which has the identity matrix as its expected value. In popular machine learning software, the sampled dropout matrices are usually rescaled by [1, 10, 26, 38].
In some settings multiplication by may worsen as a statistical estimator. As an example, consider the case with a multiple of the identity matrix. Now, (9) turns into . If the noise vector consists of independent standard normal random variables, then has mean squared error . In contrast, , so converges to at the rate while cannot be consistent as , unless .
The correct rescaling may also depend on the parameter dimension and the spectrum of . Suppose that all columns of have the same Euclidean norm, so that . Let denote a singular value decomposition of , with singular values . Now, satisfies
(11)
For a proof of these identities, see Appendix A. To get an unbiased estimator for , we must undo the effect of the spectral multipliers which take values in the interval . Consequently, the proper rescaling depends on the eigenspace. Multiplication of the estimator by addresses the case where the singular values are large. In particular, if with , , and , then is the matrix with all entries equaling . Now and so , meaning the correct scaling factor depends explicitly on the parameter dimension and converges to the dropout probability as .
Invariance properties: The minimizer is scale invariant in the sense that and may be replaced with and for some arbitrary , without changing . This does not hold for the gradient descent iterates (3) and (4), since rescaling by changes the learning rate from to . Moreover, as well as the gradient descent iterates in (3) and in (4) are invariant under replacement of and by and for any orthogonal matrix . See [21] for further results on scale-invariance of dropout in deep networks.
Overparametrization: Dropout has been successfully applied in the overparametrized regime, see for example [28]. For the overparametrized linear regression model, the data-misfit term in (3) suggests that should be close to the data vector . However,
(12)
See Appendix A for a proof. Hence, also shrinks towards zero in the overparametrized regime and does not interpolate the data. The variational formulation reveals that is a minimum-norm solution in the sense
which explains the induced shrinkage.
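For concreteness, the following sketch spells out the marginalized-loss computation of this section under the parametrization in which the diagonal entries of the dropout matrix are Bernoulli with success probability p. The closed form below, (p X^T X + (1-p) diag(X^T X))^{-1} X^T Y, is the standard expression for the minimizer of the marginalized loss under this parametrization and is intended to correspond to (9); the Monte Carlo check and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, p = 200, 6, 0.7
X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal(d) + rng.standard_normal(n)
G = X.T @ X

# Closed-form minimizer of the marginalized loss E_D ||Y - X D b||^2,
# assuming Bernoulli(p) diagonal entries of D (intended to match (9)).
beta_tilde = np.linalg.solve(p * G + (1 - p) * np.diag(np.diag(G)), X.T @ Y)

def exact_marginalized_loss(b):
    # ||Y - p X b||^2 + p (1 - p) b^T diag(X^T X) b, cf. the Tikhonov form (8).
    return np.sum((Y - p * X @ b) ** 2) + p * (1 - p) * b @ (np.diag(G) * b)

def mc_marginalized_loss(b, reps=20_000):
    # Monte Carlo average of ||Y - X D b||^2 over fresh dropout draws.
    total = 0.0
    for _ in range(reps):
        mask = rng.binomial(1, p, size=d)
        total += np.sum((Y - X @ (mask * b)) ** 2)
    return total / reps

# The Monte Carlo average agrees with the exact formula (up to simulation
# error), and beta_tilde has a smaller exact marginalized loss than nearby
# perturbations.
print(mc_marginalized_loss(beta_tilde) / exact_marginalized_loss(beta_tilde))
print(all(exact_marginalized_loss(beta_tilde)
          <= exact_marginalized_loss(beta_tilde + 0.1 * rng.standard_normal(d))
          for _ in range(5)))
```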
4 Analysis of Iterative Dropout Schemes
In the linear model, gradient descent with a small but fixed learning rate, as in (2), leads to exponential convergence in the number of iterations. Accordingly, we analyze the iterative dropout schemes (3) and (4) for fixed learning rate and only briefly discuss the algebraically less tractable case of decaying learning rates.
4.1 Convergence of Dropout
We proceed by assessing convergence of the iterative dropout scheme (3), as well as some of its statistical properties. Recall that gradient descent with dropout takes the form
(13)
As alluded to in the beginning of Section 3, the gradient descent iterates should be related to the minimizer of (5). It then seems natural to study the difference , with as an “anchoring point”. Comparing and demands an explicit analysis, without marginalization of the dropout noise.
To start, we rewrite the updating formula (13) in terms of . Using , , and that diagonal matrices always commute, we obtain . As defined in (9), and thus
(14)
In both representations, the second term is centered and uncorrelated for different values of . Vanishing of the mean follows from the independence of and , combined with , the latter being shown in (28). If , independence of and , as well as , imply and , which proves uncorrelatedness.
Defining , , and , the first representation in (14) may be identified with a lag one vector autoregressive (VAR) process
(15)
with i.i.d. random coefficients and noise/innovation process . As just shown, and whenever , so the noise process is centered and serially uncorrelated. The random coefficients and are, however, dependent. While most authors do not allow for random coefficients in VAR processes, such processes are special cases of a random autoregressive process (RAR) [41].
In the VAR literature, identifiability and estimation of the random coefficients is considered [37, 41]. In contrast, we aim to obtain bounds for the convergence of and . Difficulties arise from the involved structure and coupled randomness of and . Estimation of coefficients under dependence of and is treated in [22].
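As a purely illustrative toy example (none of the quantities below are the ones induced by dropout), the following snippet simulates a generic lag-one autoregression with i.i.d. random coefficient matrices and a coupled, centered noise term, mirroring the structure of (15): the iterates do not settle at a point but fluctuate around a stationary distribution, whose mean and covariance are the kind of quantities analyzed below.

```python
import numpy as np

rng = np.random.default_rng(3)
d, steps = 3, 5000
A_mean = 0.9 * np.eye(d)              # illustrative mean coefficient, a contraction

x = np.zeros(d)
trajectory = np.empty((steps, d))
for k in range(steps):
    Z = rng.standard_normal((d, d))
    A = A_mean + 0.05 * Z             # i.i.d. random coefficient matrix
    b = 0.1 * Z @ np.ones(d)          # centered noise, dependent on A (coupled)
    x = A @ x + b                     # lag-one recursion with random coefficients
    trajectory[k] = x

# After a burn-in the iterates fluctuate around zero; the empirical covariance
# estimates the stationary covariance of the process.
print(trajectory[1000:].mean(axis=0))
print(np.cov(trajectory[1000:].T))
```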
For a sufficiently small learning rate , the random matrices and in both representations in (14) are contractive maps in expectation. By Lemma 10(a), their expected values coincide since
For the subsequent analysis, we impose the following mild conditions that, among other things, establish contractivity of as a linear map.
Assumption 1.
The learning rate and the dropout probability are chosen such that , the initialization is a square-integrable random vector that is independent of the data, and the model is in reduced form, meaning that does not have zero columns.
For gradient descent without dropout and fixed learning rate, as defined in (2), guarantees convergence of the scheme in expectation. We will see shortly that dropout essentially replaces the expected learning rate with , which motivates the condition .
As a straightforward consequence of the definitions, we are now able to show that vanishes in expectation at a geometric rate. For a proof of this as well as subsequent results, see Appendix B.
Lemma 1 (Convergence of Expectation).
Given Assumption 1, and for any ,
Before turning to the analysis of the covariance structure, we highlight a property of the sequence . As mentioned, these conditional expectations may be viewed as gradient descent iterates generated by the marginalized objective that gives rise to . Indeed, combining (13) with , from Lemma 10(a), and (3) yields
This establishes a connection between the dropout iterates and the averaged analysis of the previous section. However, the relationship between the (unconditional) covariance matrices and the added noise remains unclear. A new dropout matrix is sampled for each iteration, whereas results from minimization only after applying the conditional expectation to the randomized objective function. Hence, we may expect that features smaller variance than the iterates as the latter also depend on the noise added via dropout.
As a first result for the covariance analysis, we establish an extension of the Gauss-Markov theorem stating that the covariance matrix of a linear estimator lower-bounds the covariance matrix of an affine estimator, provided that both estimators have the same asymptotic mean. Moreover, the covariance matrix of their difference characterizes the gap. We believe that a similar result may already be known, but we are not aware of any reference, so a full proof is provided in Appendix B for completeness.
Theorem 1.
Since , Theorem 1 may be interpreted as follows: if the estimators and are nearly the same in expectation, then and must be nearly uncorrelated. In turn, may be decomposed into and (nearly) orthogonal noise , so that is lower bounded by . Therefore, the covariance matrix quantifies the gap in the bound.
Taking and considering linear estimators with recovers the usual Gauss-Markov theorem, stating that is the best linear unbiased estimator (BLUE) for the linear model. Applying the generalized Gauss-Markov theorem with , where is a positive definite matrix, we obtain the following statement about ℓ₂-penalized estimators.
Corollary 1.
The minimizer of the ℓ₂-penalized functional has the smallest covariance matrix among all affine estimators with the same expectation as .
We now return to our analysis of the covariance structure induced by dropout. If , then in Theorem 1. Further, the dropout iterates may be rewritten as affine estimators with
By construction, and are independent, so Theorem 1 applies. As shown in Lemma 1, vanishes exponentially fast, so we conclude that is asymptotically lower-bounded by . Further, the covariance structure of is optimal in the sense of Corollary 1.
We proceed by studying , with the aim of quantifying the gap between the covariance matrices. To this end, we exhibit a particular recurrence for the second moments . Recall that denotes the Hadamard product, , and .
Lemma 2 (Second Moment - Recursive Formula).
Intuitively, the lemma states that the second moment of evolves as an affine dynamical system, up to some exponentially decaying remainder. This may be associated with the implicit regularization of the dropout noise, as illustrated empirically in [51].
Mathematically, the result may be motivated via the representation of the dropout iterates as a random autoregressive process in (15). Writing out and comparing with the proof of the lemma, we see that the remainder term, denoted by in the proof, coincides with the expected value of the cross terms . Moreover, the operator is obtained by computing
As are i.i.d., does not depend on . Moreover, independence of and implies
Inserting the definition results in the statement of the lemma. The random vector depends on and by Theorem 1 the correlation between and decreases as . This leads to the exponentially decaying bound for the remainder term .
The previous lemma entails equality between and , up to the remainder terms. The latter may be computed further by decomposing the affine operator into its intercept and linear part
(16)
If were to have operator norm less than one, then the Neumann series for (see Lemma 12) gives
with the identity operator on matrices. Surprisingly, the operator “forgets” in the sense that the limit does not depend on anymore. The argument shows that should behave like in the first order. The next result makes this precise, taking into account the remainder terms and approximation errors.
Theorem 2 (Second Moment - Limit Formula).
In short, converges exponentially fast to the limit . Combining the generalized Gauss-Markov Theorem 1 with Theorem 2 also establishes
with exponential rate of convergence. Recall the intuition gained from Theorem 1 that may be decomposed into a sum of and (approximately) orthogonal centered noise. We now conclude that up to exponentially decaying terms, the covariance matrix of this orthogonal noise must be given by , which fully describes the (asymptotic) gap between the covariance matrices of and .
Taking the trace and noting that , we obtain a bound for the convergence of with respect to the squared Euclidean loss,
(17)
Since is a matrix, the term describing the asymptotic discrepancy between and can be large in high dimensions , even if the spectral norm of is small. Since is a positive definite operator, the matrix is zero if, and only if, is zero. By (16), . The explicit form , shows that and provided that , meaning whenever is diagonal. To give a more precise quantification, we show that the operator norm of is of order .
Lemma 3.
In addition to Assumption 1 suppose , then, for any
and
where and are constants that are independent of .
The first bound describes the interplay between and . Making small will decrease the second term in the bound, but conversely requires a larger number of iterations for the first term to decay.
In the second bound, matches the covariance matrix of the marginalized loss minimizer up to a term of order . Consequently, the covariance structures induced by dropout and ℓ₂-regularization approximately coincide for sufficiently small . However, in this regime we have , and becomes extremely biased whenever the Gram matrix is not diagonal.
Theorem 1 already establishes as the optimal covariance among all affine estimators that are asymptotically unbiased for . To conclude our study of the gap between and , we provide a lower-bound.
Theorem 3 (Sub-Optimality of Variance).
In addition to the assumptions of Theorem 2, suppose for every there exists such that , then
The lower-bound is positive whenever is invertible. In general, Theorem 3 entails asymptotic statistical sub-optimality of the gradient descent iterates for a large class of design matrices. Moreover, the result does not require any further assumptions on the tuning parameters and , other than being sufficiently small.
To summarize, compared with the marginalized loss minimizer , the covariance matrix of the gradient descent iterates with dropout may be larger. The difference may be significant, especially if the data dimension is large. Proving the results requires a refined second moment analysis, based on explicit computation of the dynamics of . Simple heuristics such as (7) do not fully reveal the properties of the underlying dynamics.
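The covariance gap can also be seen in a small simulation. The sketch below, with illustrative constants, repeatedly draws fresh observation noise, runs the dropout recursion (3) until it is approximately stationary, and compares the empirical covariance of the final iterates with that of the marginalized loss minimizer computed from the same data; the closed form used for the latter is the standard Bernoulli(p) expression discussed in Section 3 and is an assumption of this sketch. Theorems 2 and 3 predict the difference to be positive semi-definite up to simulation error, and to shrink as the learning rate decreases.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, p, alpha, iters, reps = 50, 3, 0.6, 0.002, 2000, 200

X = rng.standard_normal((n, d)) @ np.diag([1.0, 2.0, 3.0])   # non-orthogonal columns
beta = np.array([1.0, -1.0, 0.5])
G = X.T @ X
A_marg = p * G + (1 - p) * np.diag(np.diag(G))   # matrix behind the assumed closed form

final_iterates, minimizers = [], []
for _ in range(reps):
    Y = X @ beta + rng.standard_normal(n)        # fresh observation noise
    theta = np.zeros(d)
    for _ in range(iters):
        D = np.diag(rng.binomial(1, p, size=d).astype(float))
        theta = theta + alpha * D @ X.T @ (Y - X @ D @ theta)   # recursion (3)
    final_iterates.append(theta)
    minimizers.append(np.linalg.solve(A_marg, X.T @ Y))

gap = np.cov(np.array(final_iterates).T) - np.cov(np.array(minimizers).T)
print(np.linalg.eigvalsh(gap))   # eigenvalues of the empirical covariance gap
```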
4.1.1 Ruppert-Polyak Averaging
To reduce the gradient noise induced by dropout, one may consider the running average over the gradient descent iterates. This technique is also known as Ruppert-Polyak averaging [42, 39]. The th Ruppert-Polyak average of the gradient descent iterates is given by
Averages of this type are well-studied in the stochastic approximation literature, see [40, 19] for results on linear regression and [55, 12] for stochastic gradient descent. The next theorem illustrates convergence of towards .
Theorem 4.
The first term in the upper bound is independent of and decays at the rate , whereas the second term scales with . Accordingly, for small , the second term will dominate initially, until grows sufficiently large.
Since the right-hand side eventually tends to zero, the theorem implies convergence of the Ruppert-Polyak averaged iterates to the marginalized loss minimizer , so the link between dropout and ℓ₂-regularization persists at the variance level. The averaging comes at the price of a slower convergence rate of the remainder terms, as opposed to the exponentially fast convergence in Theorem 2. As in (17), the bound can be converted into a convergence rate for by taking the trace,
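A quick illustration of the averaging effect, under the same illustrative setup and with the same assumed closed form for the marginalized loss minimizer as before: for a single draw of the data, the raw dropout iterate keeps fluctuating, while its running average moves much closer to the minimizer, in line with Theorem 4.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, p, alpha, iters = 50, 3, 0.6, 0.002, 20_000

X = rng.standard_normal((n, d))
Y = X @ np.array([1.0, -1.0, 0.5]) + rng.standard_normal(n)
G = X.T @ X
beta_tilde = np.linalg.solve(p * G + (1 - p) * np.diag(np.diag(G)), X.T @ Y)

theta = np.zeros(d)
running_average = np.zeros(d)
for k in range(1, iters + 1):
    D = np.diag(rng.binomial(1, p, size=d).astype(float))
    theta = theta + alpha * D @ X.T @ (Y - X @ D @ theta)   # recursion (3)
    running_average += (theta - running_average) / k        # Ruppert-Polyak average

print(np.linalg.norm(theta - beta_tilde))            # raw iterate: persistent noise
print(np.linalg.norm(running_average - beta_tilde))  # running average: much closer
```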
4.2 Convergence of Simplified Dropout
To further illustrate how dropout and gradient descent are coupled, we will now study the simplified dropout iterates
(18)
as defined in (4). While the original dropout reduces the model before taking the gradient, this version takes the gradient first and applies dropout afterwards. As shown in Section 2, both versions coincide if the Gram matrix is diagonal. Recall from the discussion preceding Lemma 3 that for diagonal , converges to the optimal covariance matrix. This suggests that for the simplified dropout, no additional randomness in the limit occurs.
The least squares objective always admits a minimizer, with any minimizer necessarily solving the so-called normal equations . Provided is invertible, the least-squares estimator gives the unique solution. We will not assume invertibility for all results below, so we let denote any solution of the normal equations, unless specified otherwise. In turn, (18) may be rewritten as
(19)
which is simpler than the analogous representation of as a VAR process in (15).
As a first result, we will show that the expectation of the simplified dropout iterates converges to the mean of the unregularized least squares estimator , provided that is invertible. Indeed, using (19), independence of and , and , observe that
(20)
Induction on now shows and so
Assuming , invertibility of implies . Consequently, the convergence is exponential in the number of iterations.
Invertibility of may be avoided if the initialization lies in the orthogonal complement of the kernel of and is the ℓ₂-minimal solution to the normal equations. We can then argue that always stays in a linear subspace on which still acts as a contraction.
We continue with our study of by employing the same techniques as in the previous section to analyze the second moment. The linear operator
(21)
defined on matrices, takes over the role of the affine operator encountered in Lemma 7. In particular, the second moments evolve as the linear dynamical system
(22)
without remainder terms. To see this, observe via (19) the identity . Taking the expectation on both sides, conditioning on , and recalling that is independent of gives . We have and by Lemma 10, . Together with the definition of , this proves (22).
Further results are based on analyzing the recursion in (22). It turns out that convergence of to in second mean requires a non-singular Gram matrix .
Lemma 4.
Suppose the initialization is independent of all other sources of randomness and the number of parameters satisfies , then there exists a singular , such that for any positive integer , and
For invertible , we can apply Theorem 1 to show that is the optimal covariance matrix for the sequence of affine estimators . The simplified dropout iterates actually achieve the optimal variance when is invertible, which stands in contrast with the situation in Theorem 3.
Theorem 5.
Suppose is invertible, , and let be square-integrable, then, for any
Intuitively, the result holds due to the operator in (21) being linear, as opposed to affine like in the case of Lemma 2. Choosing sufficiently small ensures that acts as a contraction, meaning for any matrix . Hence, linearity of serves as an algebraic expression of the simplified dynamics. As in (17), we may take the trace to obtain the bound
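The convergence in Theorem 5 can be seen directly in a single simulated run. In the sketch below (illustrative constants, invertible Gram matrix), centering the simplified recursion (4)/(18) at the least squares estimator leaves no additive noise term, cf. (19), so one trajectory already reaches the estimator up to numerical precision.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, p, alpha, iters = 80, 4, 0.7, 0.002, 20_000

X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal(d) + rng.standard_normal(n)
beta_ls = np.linalg.solve(X.T @ X, X.T @ Y)      # least squares estimator

theta = np.zeros(d)
for _ in range(iters):
    D = np.diag(rng.binomial(1, p, size=d).astype(float))
    # Simplified dropout (4)/(18): dropout applied to the full gradient.
    theta = theta + alpha * D @ X.T @ (Y - X @ theta)

# Centered at beta_ls, the error is only multiplied by random matrices and no
# additive dropout noise remains, so the distance decays geometrically.
print(np.linalg.norm(theta - beta_ls))
```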
5 Discussion and Outlook
Our main contributions may be summarized as follows: We studied dropout in the linear regression model, but unlike previous results, we explicitly analyzed the gradient descent dynamics with new dropout noise being sampled in each iteration. This allows us to characterize the limiting variance of the gradient descent iterates exactly (Theorem 2), which sheds light on the covariance structure induced via dropout. Our main tool in the analysis is a particular recursion (Lemma 2), which may be exhibited by “anchoring” the gradient descent iterates around the marginalized loss minimizer . To further understand the interaction between noise and gradient descent dynamics, we analyze the running average of the process (Theorem 4) and a simplified version of dropout (Theorem 5).
We view our analysis of the linear model as a fundamental first step towards understanding the dynamics of gradient descent with dropout. Analyzing the linear model has been a fruitful approach to study other phenomena in deep learning, such as overfitting [47], sharpness of local minima [7], and in-context learning [54]. We conclude by proposing some natural directions for future work.
Random minibatch sampling: For yet another way of incorporating dropout, we may compute the gradient based on a random subset of the data (minibatches). In this case, the updating formula satisfies
(23)
The dropout matrices are now of dimension and select a random subset of the data points in every iteration. This version of dropout also relates to randomly weighted least squares and resampling methods [13]. The update formula may be written in the form with solving the normal equations . Similarly to the corresponding reformulation (14) of the original dropout scheme, this defines a vector autoregressive process with random coefficients and lag one.
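A sketch of this variant, with illustrative constants: here the Bernoulli selection acts on observations rather than on parameters. The update used below, which applies the selection inside the gradient, is the natural reading of (23) and is an assumption of this sketch; with a fixed learning rate the iterates hover around a least squares solution rather than converging to it, the fluctuations shrinking with the learning rate.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, p, alpha, iters = 100, 4, 0.5, 0.002, 20_000

X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal(d) + rng.standard_normal(n)
beta_ls = np.linalg.solve(X.T @ X, X.T @ Y)

theta = np.zeros(d)
for _ in range(iters):
    # Each observation is kept independently with probability p in this
    # iteration's gradient, i.e. an n x n diagonal selection matrix.
    keep = rng.binomial(1, p, size=n).astype(float)
    theta = theta + alpha * X.T @ (keep * (Y - X @ theta))

# The residual Y - X beta_ls is not annihilated by the random selection, so
# the iterates fluctuate around the least squares solution instead of
# reaching it exactly.
print(np.linalg.norm(theta - beta_ls))
```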
Learning rates: The proof ideas may be generalized to a sequence of iteration-dependent learning rates . We expect this to come at the cost of more involved formulas. Specifically, the operator in Lemma 2 will depend on the iteration number through , so the limit in Theorem 2 cannot be expressed as anymore.
Random design and stochastic gradients: We considered a fixed design matrix and (full) gradient descent, whereas in machine learning practice inputs are typically assumed to be random and parameters are updated via stochastic gradient descent (SGD). The recent works [8, 44] derive convergence rates for SGD considering linear regression and another form of noisy gradient descent. We believe that parts of these analyses carry over to dropout.
Generic dropout distributions: As already mentioned, [46] carry out simulations with dropout where . Gaussian dropout distributions are currently supported, or easily implemented, in major software libraries [1, 10, 26, 38]. Analyzing a generic dropout distribution with mean and variance may also paint a clearer picture of how the dropout noise interacts with the gradient descent dynamics. For the linear regression model, results that marginalize over the dropout noise generalize to arbitrary dropout distributions. In particular, (9) turns into
If , then .
In contrast, treatment of the corresponding iterative dropout scheme seems more involved. The analysis of dropout with Bernoulli distributions relies in part on the projection property . Without it, additional terms occur in the moments in Lemma 10, which is required for the computation of the covariance matrix. For example, the formula turns into
where denotes the th moment of the dropout distribution and its variance. For the Bernoulli distribution, all moments equal , so and the last term disappears. Similarly, more terms will appear in the fourth moment of , making the expression for the operator corresponding to in Lemma 2 more complicated.
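To see how a non-Bernoulli dropout distribution changes the second moment, the following check (illustrative sizes; the general identity is a standard computation, stated here as an assumption rather than quoted from the paper) compares Monte Carlo estimates of E[D A D] with the formula mu^2 A + sigma^2 diag(A) for i.i.d. diagonal entries with mean mu and variance sigma^2. For Bernoulli entries this is consistent with the idempotence property exploited in Appendix C, while for Gaussian entries the variance term no longer ties to the mean.

```python
import numpy as np

rng = np.random.default_rng(8)
d, reps = 4, 100_000
A = rng.standard_normal((d, d))
A = A + A.T                                   # arbitrary symmetric test matrix

def mc_second_moment(sampler):
    # Monte Carlo estimate of E[D A D] for a diagonal D with i.i.d. entries.
    acc = np.zeros((d, d))
    for _ in range(reps):
        w = sampler()
        acc += np.outer(w, w) * A             # (D A D)_{ij} = w_i A_{ij} w_j
    return acc / reps

p = 0.7
bernoulli = mc_second_moment(lambda: rng.binomial(1, p, size=d).astype(float))
gaussian = mc_second_moment(lambda: rng.normal(1.0, 0.5, size=d))

# Claimed identity for i.i.d. entries with mean mu and variance s2:
#   E[D A D] = mu**2 * A + s2 * diag(A).
diag_A = np.diag(np.diag(A))
print(np.abs(bernoulli - (p**2 * A + p * (1 - p) * diag_A)).max())
print(np.abs(gaussian - (1.0**2 * A + 0.5**2 * diag_A)).max())
```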
Inducing robustness via dropout: Among the possible ways of explaining the data, dropout should, by design, favor an explanation that is robust against setting a random subset of the parameters to zero. [34] indicate that dropout in two-layer linear networks tends to equalize the norms of different weight vectors.
To study the robustness properties of dropout, one may consider loss functions measuring how well the response vector is predicted if each estimated regression coefficient is deleted with probability . Given an estimator , a natural choice would be the loss
with a new draw of the dropout matrix, independent of all other randomness. This loss depends on the unknown true regression vector . Since , an empirical version of the loss may replace with , considering . As shown in (9), minimizes this loss function. This suggests that may possess some optimality properties for the loss defined above.
Shallow networks with linear activation function: Multi-layer neural networks do not admit unique minimizers. In comparison with the linear regression model, this poses a major challenge for the analysis of dropout. [34] consider shallow linear networks of the form with an matrix and a matrix. Suppose is an dropout matrix. Assuming the random design vector satisfies , and marginalizing over dropout noise applied to the columns of (or equivalently to the rows of ) leads to an -penalty via
(24)
As an extension of our approach, it seems natural to investigate whether gradient descent with dropout will converge to the same minimizer or involve additional terms in the variance. In contrast with linear regression, the marginalized loss (24) is non-convex and does not admit a unique minimizer. Hence, we cannot simply center the gradient descent iterates around a specific closed-form estimator, as in Section 4. To extend our techniques we may expect to replace the centering estimator with the gradient descent iterates for the marginalized loss function, demanding a more careful analysis.
[45] study the gradient flow associated with (24) and exhibit exponential convergence of and towards a minimizer. Extending the existing result on gradient flows to gradient descent is, however, non-trivial, see for example the gradient descent version of Theorem 3.1 in [5] provided in Theorem 2.4 of [36].
To be more precise, suppose , where and are independent random vectors, so the task reduces to learning a factorization based on noisy evaluation of . Consider the randomized loss
with respective gradients
Given observations and independent dropout matrices , the factorized structure leads to two coupled dynamical systems and , which are linked through the appearance of in and in . Due to non-convexity of the underlying marginalized objective (24), the resulting dynamics should be sensitive to initialization. Suppose and are given as random matrix polynomials in , meaning finite sums of expressions like , where and the are random coefficient matrices. Now, the gradient descent recursions lead to
(25)
so is also a polynomial in with random coefficients. A difficulty in analyzing this recursion is that the degree of the polynomial increases exponentially fast. Indeed, since includes the term , the degree of is the degree of plus twice the degree of . During each gradient descent step, additional randomness is introduced via the newly sampled dropout matrix and the training data . Accordingly, the coefficients of and fluctuate around the coefficients of and . A principled analysis of the resulting dynamics requires careful accounting of how these fluctuations propagate through the iterations. [45] show that the gradient flow trajectories and minimizers of (24) satisfy specific symmetries, so one should hope to reduce the algebraic complexity of the problem by finding analogous symmetries in the stochastic recursions (25).
Alternatively, one may consider layer-wise training of the weight matrices to break the dependence between and . Given , suppose we keep fixed while taking gradient steps
followed by gradient steps of the form
In each phase, the gradient descent recursion solves a linear regression problem similar to our analysis of the linear model. We leave the details for future work.
Appendix A Proofs for Section 3
A.1 Proof of Equation (10)
A.2 Proof of Equation (11)
Let and consider a singular value decomposition . The left-singular vectors , are orthonormal, meaning and since ,
Each left-singular vector is an eigenvector of , with associated eigenvalue . If , suppose , complete the orthonormal basis, then . Consequently, admits as an eigenvector for every and
By definition, and . Combining these facts now leads to
and
∎
A.3 Proof of Equation (12)
Set and consider with and . Observe that
and thus . If is the zero vector, . Otherwise, implies being positive definite, so we have the strict inequality whenever .
Now, suppose that has an eigenvector with corresponding eigenvalue , which implies . Then, we have . Equality only holds if . This contradicts the strict inequality so all eigenvalues have to be strictly smaller than one. ∎
Appendix B Proofs for Section 4
B.1 Proof of Lemma 1
To show that , note that by Lemma 13 and recall from Assumption 1. Hence,
(27)
For any vector with and any two positive semi-definite matrices and , . Hence, and . By Assumption 1, the design matrix has no zero columns, guaranteeing . Combined with (27), we now obtain .
To prove the bound on the expectation, recall from (14) that equals . Lemma 10(a) shows that and Lemma 9(b) gives . In turn, we have and
(28)
Conditioning on all randomness except now implies
(29)
By the tower rule , so induction on gives
Sub-multiplicativity of the spectral norm implies , proving that . ∎
B.2 Proof of Theorem 1
We have , so the triangle inequality implies
(30)
Write , then . When conditioned on , the estimator is deterministic. Hence, the law of total covariance yields
Further, implies . Using , note that with . Combining these identities, sub-multiplicativity of the spectral norm, and the triangle inequality leads to
Together with (30), this proves the result. ∎
B.3 Proof of Lemma 2
Given a random vector and a random element , observe that . Inserting and , as well as defining , leads to
(31)
Recall from (29), and so
(32)
where .
Evaluating the conditional covariance is the more challenging part, requiring moments up to fourth order in , see Lemma 11. Recall that
Lemma 5.
For every positive integer ,
with remainder vanishing at the rate
Proof Recall from (14) that . The covariance is invariant under deterministic shifts and sign flips, so
Applying Lemma 11 with , , and , we find
(33)
Set and recall . Note the identities and . Taking the expectation of (33), multiplying both sides with , and using the definition of proves the claimed expression for with remainder term
(34)
For any matrices and , Lemma 13 provides the inequalities , , and . If is moreover positive semi-definite, then also . Combined with the sub-multiplicativity of the spectral norm, this implies
By Assumption 1 also , so combining the upper-bounds with (34) leads to
(35)
B.4 Proof of Theorem 2
Let be the affine operator introduced in Lemma 2 and recall the definitions and . First, the operator norm of will be analyzed.
Lemma 6.
The linear operator satisfies , provided that
where denotes the smallest eigenvalue of .
Proof Let be a matrix. Applying the triangle inequality, Lemma 13, and sub-multiplicativity of the spectral norm,
where the second inequality follows from .
As shown in (27), . Lemma 13 now implies , so that
If , then also , so that in turn . The constraint now enforces , which implies . ∎
As before, set for each and let . Using induction on , we now prove
(36)
Taking , , so the claimed identity holds. Assuming the identity is true for , the recursion leads to
thereby establishing the induction step and proving (36).
Assuming , we move on to show the bound
with the identity operator on , and . By linearity, . Since , Lemma 12 asserts that and
(37)
Lemma 2 ensures for all . Moreover, and hence for every , so the triangle inequality implies
(38)
B.5 Proof of Lemma 3
Applying Theorem 1, Lemma 1, and the triangle inequality,
(40)
Lemma 13 implies , so that . Next, entails equality between and . The second term on the right-hand side of (40) is then bounded by , for some constant independent of .
To prove the first claim of the lemma, it now suffices to show
(41)
where and are constants independent of . As , the constant in Theorem 2 satisfies , with depending only on the distribution of . Consequently, Theorem 2 and the triangle inequality imply
(42)
Consider a bounded linear operator on satisfying . For an arbitrary matrix , Lemma 12 asserts and therefore . Theorem 1 states that . As shown following (27), . Therefore,
Taking in Lemma 2, . Using Lemma 13 and ,
proving that
(43)
Note that by Assumption 1. Applying these bounds in (42) leads to
which proves (41). Combined with (40), this proves the first claim of the lemma since
(44)
To start proving the second claim, recall that . Hence, the triangle inequality leads to
Let , , and be square matrices of the same dimension, with and invertible. Observe the identity , so sub-multiplicativity implies . Using , and inserting , , and results in
with independent of . Combined with (44), this results in
which proves the second claim of the lemma by enlarging , if necessary. ∎
B.6 Proof of Theorem 3
Start by noting that for symmetric matrices, see [24], Theorem 4.2.6. Using super-additivity of infima, observe the lower bound
(45)
Combining Lemma 1 and Theorem 1, the limit superior in (45) vanishes. Further, converges to by Theorem 2, so it suffices to analyze the latter matrix.
For the next step, the matrix in Theorem 2 will be lower-bounded. Taking in Lemma 2 and exchanging the expectation with the operator results in
(46)
The first matrix in (46) is always positive semi-definite and we will now lower bound the matrix . Given distinct , symmetry of implies
In turn, for any unit-length vector ,
(47)
where the last equality follows by noting that each square appears twice in (47) since the expression is symmetric in . Every summand in (47) is non-negative. If , then there exists such that . Write for the vector with entries
By construction, . Recall that and note that . Together with and , (47) now satisfies
As , this proves the matrix inequality
(48)
Next, let be a centered random vector with covariance matrix and suppose is a dropout matrix, independent of . Conditioning on , the law of total variance states
(49)
(50)
Applying Lemma 11 with , , and now shows that . The second term in (50) is always positive semi-definite, proving that . As and , this implies
Lemma 13 moreover gives . Together with the lower-bound (48) for , this proves the result. ∎
B.7 Proof of Theorem 4
As in Section 4.1, write for the running average of the iterates and define
(51)
Suppose and take . Using induction on , we now prove that . The identity always holds when . Next, suppose the identity holds for some . Taking in (14), . Since , is by assumption independent of . Recall from (28) that . Conditioning on and applying the tower rule now gives
Together with the induction hypothesis, this proves the desired equality
For , transposing and flipping the roles of and also shows that with .
Defining and taking , (51) may now be rewritten as
(52)
Set , then by Lemma 1. Note also that . Using the triangle inequality and sub-multiplicativity of the spectral norm, (52) then satisfies
As shown in Theorem 2, for some constant . Observing that , this implies
To complete the proof note that may be rewritten as and by (43). ∎
B.8 Proof of Lemma 4
Recall the definition .
Lemma 7.
For every
with equality if almost surely.
Proof Recall the definition , so that . As shown in (22), and hence
By definition, . Recall from (20) that , so implies
Together, these identities prove the first claim.
The lower bound follows from being positive semi-definite. A positive semi-definite matrix has non-negative diagonal entries, meaning is also positive semi-definite. Next, note that and in turn
(53)
Together with the first part of the lemma and the definition of , the lower-bound follows.
Consider arbitrary positive semi-definite matrices , then for all vectors . Given any vector , this implies
with the th standard basis vector. Accordingly, is operator monotone with respect to the ordering of positive semi-definite matrices, in the sense that whenever . Using induction on , Lemma 7 may now be rewritten as
(54)
To complete the proof, the right-hand side of (54) will be analyzed for a suitable choice of .
Lemma 8.
Suppose is independent of all other sources of randomness. Consider the linear regression model with a single observation , number of parameters , and design matrix . Then, and for any -dimensional vector satisfying and every
Proof By definition, is the -matrix with all entries equal to one. Consequently, for all and , so satisfies the normal equations .
To prove , note that whenever . By conditioning on , the identity was shown in (20). The same argument also proves . Induction on and the assumed independence between and now lead to
(55)
Next, note that and . As , (55) satisfies
which proves the first claim.
To prove the second claim, we first show that there are real sequences and , not depending on the distribution of , such that
(56)
for all , with equality if almost surely. When , independence between and , as well as , imply . Moreover, equality holds whenever is deterministic.
For the sake of induction suppose the claim is true for some . Lemma 7 and operator monotonicity of then imply
In case almost surely, Lemma 7 and the induction hypothesis assert equality in the previous display. Recall , so as well as . Note also that . Setting , expanding the definition now results in
(57)
This establishes the induction step and thereby proves (56) for all .
As do not depend on the distribution of , taking shows that for all since
Consequently, (57) implies , proving that is non-decreasing in .
Lastly, we show that . To this end, recall that and . As is operator monotone and , independence of and results in
so .
To complete the proof, observe that implies . Accordingly,
which yields the second claim of the lemma. ∎
B.9 Proof of Theorem 5
For an arbitrary matrix , the triangle inequality, Lemma 13, and submultiplicativity of the spectral norm imply
(58)
As is positive definite, implies . If , then
Together with (58) this leads to . By induction on , , completing the proof. ∎
Appendix C Higher Moments of Dropout Matrices
Deriving concise closed-form expressions for third and fourth order expectations of the dropout matrices presents one of the main technical challenges encountered in Section B.
All matrices in this section will be of dimension and all vectors of length . Moreover, always denotes a random diagonal matrix such that for all . The diagonal entries of are elements of , meaning for all positive powers .
Given a matrix and , recall the definitions
The first lemma contains some simple identities.
Lemma 9.
For arbitrary matrices and , , and a diagonal matrix ,
(a)
(b)
(c)
Proof
(a) By definition, for all , which equals . The second equality then follows by transposition.
(b) Clearly, and in turn . On the other hand, and hence equals as well.
(c) Observe that equals . As , the claim follows.
∎
With these basic properties at hand, higher moments involving the dropout matrix may be computed by carefully accounting for equalities between the involved indices.
Lemma 10.
Given arbitrary matrices , , and , the following hold:
(a)
(b)
(c)
Proof
(a) Recall that and hence , meaning . On the other hand, implies due to independence of and . Combining both identities,
(b) First, note that and commutativity of diagonal matrices imply
(59)
Applying Lemma 9(a) twice, has no non-zero diagonal entries. Moreover, taking distinct,
so that both and . Therefore,
Reinserting this expression into the expectation of (59) and applying part (a) of the lemma now results in the claimed identity
where by Lemma 9(a).
(c) Following a similar strategy as in part (b), observe that
(60)
By construction of the latter matrix,
meaning is always distinct from the other indices. The index corresponds to the third -matrix from the left, so this proves . Reversing the overlines of the latter expression in order, note that
Note that the subtracted terms match those added to in (60) exactly. In turn, these identities prove
(61)
The first and second term in the last equality may be computed via part (b) of the lemma, whereas the third term remains to be treated.
By definition, the diagonal entries of and are all zero, so equals for all . Moreover, taking implies
On the set , the entry is independent of and . Consequently,
In matrix form, the previous equation reads
where denotes the Hadamard product.
Reinserting the computed expressions into (61), as well as noting that and yields
(62) Next, using the identity of Lemma 9(b), we combine the first and third terms of the latter display into
Regarding the second term of (62), observe that
where the second equality follows from . Lastly, by Lemma 9(c), so the fourth and fifth term of (62) combine into
where the common factor is omitted in the display.
∎
In principle, any computations involving higher moments of may be accomplished with the proof strategy of Lemma 10. A particular covariance matrix is needed in Section B, which will be given in the next lemma.
Lemma 11.
Given a symmetric matrix , as well as vectors and ,
Proof The covariance of the sum is given by
(63)
where each of the latter terms will be treated separately. To this end, set , , and .
First, recall that by Lemma 10(a) and , so the definition of covariance implies
(64)
Moving on to , observe that
By the same argument as in (28),
(65)
so that in turn
Recall from Lemma 9(b). Applying Lemma 10(b) for the first term in the previous display and Lemma 10(a) for the second term now leads to
(66)
Using a completely analogous argument, the reflected term satisfies
(67)
The last term necessitates another decomposition into four sub-problems. First, recall from (65) that vanishes, which leads to
(68)
where the last term is negative since . Recall once more the identity from Lemma 9(b) and apply Lemma 10(c), to rewrite as
As for , start by noting that
where Lemma 10(a) computes the second expectation and Lemma 10(b) the first expectation. To progress, note first the identities
so that
By symmetry, the reflected term then also satisfies . Lastly, applying Lemma 10(a) to results in . To finish the treatment of , inserting the computed expressions for , and into (C) and combining like terms now leads to
To conclude the proof, insert this expression for into (C), together with as in (64), as in (67), and as in (C) to obtain the desired identity
∎
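As a numerical illustration of the index-counting strategy used in this appendix, the sketch below compares a Monte Carlo estimate of the third-order moment E[D A D B D] for a Bernoulli dropout matrix with a direct computation that only uses the idempotence of the diagonal entries, so that E[d_i d_j d_l] equals p raised to the number of distinct indices; the matrices and sizes are arbitrary and the check is a sanity test, not a statement from the paper.

```python
import numpy as np

rng = np.random.default_rng(9)
d, p, reps = 3, 0.6, 100_000
A = rng.standard_normal((d, d))
B = rng.standard_normal((d, d))

# Monte Carlo estimate of E[D A D B D] with Bernoulli(p) diagonal entries.
mc = np.zeros((d, d))
for _ in range(reps):
    D = np.diag(rng.binomial(1, p, size=d).astype(float))
    mc += D @ A @ D @ B @ D
mc /= reps

# Direct computation: (D A D B D)_{il} = sum_j d_i A_{ij} d_j B_{jl} d_l and,
# by idempotence and independence, E[d_i d_j d_l] = p ** #{distinct indices}.
exact = np.zeros((d, d))
for i in range(d):
    for l in range(d):
        for j in range(d):
            exact[i, l] += A[i, j] * B[j, l] * p ** len({i, j, l})

print(np.abs(mc - exact).max())   # small Monte Carlo error
```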
Appendix D Auxiliary Results
Below we collect identities and definitions referenced in other sections.
Neumann series: Let denote a real or complex Banach space with norm . Recall that the operator norm of a linear operator on is given by .
Lemma 12 ([20], Proposition 5.3.4).
Suppose is bounded, linear, and satisfies , then is invertible and .
Bounds on singular values: Recall that denotes the Euclidean norm on and the spectral norm on , which is given by the largest singular value . The spectral norm is sub-multiplicative in the sense that . The spectral norm of a vector , viewed as a linear functional on , is given by , proving that
(69)
for any vectors and of the same length.
Recall the definitions and with .
Lemma 13.
Given matrices and , the inequalities , , and hold. Moreover, if is symmetric and positive semi-definite, then also .
Proof For any matrix , the maximal singular value can be computed from the variational formulation, see [24], Theorem 4.2.6.
Let denote the th standard basis vector. The variational formulation of the maximal singular value implies which is bounded by . The latter is further bounded by , proving the first statement. The second inequality follows from the first since
For the inequality concerning the Hadamard product, see Theorem 5.5.7 of [23].
For the last inequality, note that semi-definiteness entails . For fixed , this ensures , which completes the proof. ∎
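The Hadamard-product inequality invoked above can also be probed numerically. The sketch below checks the standard bound that the spectral norm of A ∘ B is at most the product of the spectral norms of A and B; it is offered only as an illustration and may differ from the precise inequality of Theorem 5.5.7 of [23] used in the proof.

```python
import numpy as np

rng = np.random.default_rng(4)

def spec(M):
    """Spectral norm, i.e. the largest singular value."""
    return np.linalg.norm(M, 2)

# Random trials of the bound spec(A ∘ B) <= spec(A) * spec(B),
# where A * B below denotes the entrywise (Hadamard) product.
for _ in range(1000):
    A, B = rng.standard_normal((6, 6)), rng.standard_normal((6, 6))
    assert spec(A * B) <= spec(A) * spec(B) + 1e-10

print("Hadamard-product bound held in all 1000 trials")
```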
References
- [1] Martín Abadi et al. “TensorFlow: A System for Large-Scale Machine Learning” In 12th USENIX Symposium on Operating Systems Design and Implementation USENIX Association, 2016, pp. 265–283
- [2] Donald W. K. Andrews “Laws of Large Numbers for Dependent Non-Identically Distributed Random Variables” In Econometric Theory 4.3, 1988, pp. 458–467
- [3] Raman Arora, Peter Bartlett, Poorya Mianjy and Nathan Srebro “Dropout: Explicit Forms and Capacity Control” In 38th International Conference on Machine Learning Proceedings of Machine Learning Research, 2021, pp. 351–361
- [4] Jimmy Ba and Brendan Frey “Adaptive dropout for training deep neural networks” In Advances in Neural Information Processing Systems 26 Curran Associates, Inc., 2013, pp. 3084–3092
- [5] Bubacarr Bah, Holger Rauhut, Ulrich Terstiege and Michael Westdickenberg “Learning deep linear neural networks: Riemannian gradient flows and convergence to global minimizers” In Information and Inference: A Journal of the IMA 11.1, 2022, pp. 307–353
- [6] Pierre Baldi and Peter J. Sadowski “Understanding Dropout” In Advances in Neural Information Processing Systems 26 Curran Associates, Inc., 2013, pp. 2814–2822
- [7] Peter L. Bartlett, Philip M. Long and Olivier Bousquet “The Dynamics of Sharpness-Aware Minimization: Bouncing Across Ravines and Drifting Towards Wide Minima” In Journal of Machine Learning Research 24.316, 2023, pp. 1–36
- [8] Thijs Bos and Johannes Schmidt-Hieber “Convergence guarantees for forward gradient descent in the linear regression model”, arXiv:2309.15001 [math.ST], 2023
- [9] Jacopo Cavazza et al. “Dropout as a Low-Rank Regularizer for Matrix Factorization” In 21st International Conference on Artificial Intelligence and Statistics Proceedings of Machine Learning Research, 2018, pp. 435–444
- [10] François Chollet “Keras”, https://keras.io, 2015
- [11] George Cybenko “Approximation by superpositions of a sigmoidal function” In Mathematics of Control, Signals, and Systems 2.4, 1989, pp. 303–314
- [12] Steffen Dereich and Sebastian Kassing “Central limit theorems for stochastic gradient descent with averaging for stable manifolds” In Electronic Journal of Probability 28, 2023, pp. 1–48
- [13] Lutz Dümbgen, Richard J. Samworth and Dominic Schuhmacher “Stochastic search for semiparametric linear regression models” In From Probability to Statistics and Back: High-Dimensional Models and Processes. A Festschrift in Honor of Jon A. Wellner Institute of Mathematical Statistics, 2013, pp. 78–90
- [14] Bradley Efron and Trevor Hastie “Computer Age Statistical Inference. Algorithms, Evidence, and Data Science” 5, Institute of Mathematical Statistics Monographs Cambridge University Press, 2016
- [15] Yarin Gal and Zoubin Ghahramani “A Theoretically Grounded Application of Dropout in Recurrent Neural Networks” In Advances in Neural Information Processing Systems 29 Curran Associates, Inc., 2016, pp. 1027–1035
- [16] Yarin Gal and Zoubin Ghahramani “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning” In 33rd International Conference on International Conference on Machine Learning Proceedings of Machine Learning Research, 2016, pp. 1050–1059
- [17] Wei Gao and Zhi-Hua Zhou “Dropout Rademacher complexity of deep neural networks” In Science China Information Sciences 59.7, 2016, pp. 2104:1–12
- [18] Ian Goodfellow, Yoshua Bengio and Aaron Courville “Deep learning”, Adaptive Computation and Machine Learning MIT Press, 2016
- [19] László Györfi and Harro Walk “On the Averaged Stochastic Approximation for Linear Regression” In SIAM Journal on Control and Optimization 34.1, 1996, pp. 31–61
- [20] Aleksandr Ya. Helemskii “Lectures and Exercises on Functional Analysis” 233, Translations of Mathematical Monographs American Mathematical Society, 2006
- [21] David P. Helmbold and Philip M. Long “Surprising properties of dropout in deep networks” In Conference on Learning Theory Proceedings of Machine Learning Research, 2017, pp. 1123–1146
- [22] Jonathan Hill and Liang Peng “Unified interval estimation for random coefficient autoregressive models” In Journal of Time Series Analysis 35.3, 2014, pp. 282–297
- [23] Roger A. Horn and Charles R. Johnson “Topics in Matrix Analysis” Cambridge University Press, 1991
- [24] Roger A. Horn and Charles R. Johnson “Matrix Analysis” Cambridge University Press, 2013
- [25] Kurt Hornik “Approximation capabilities of multilayer feedforward networks” In Neural Networks 4.2, 1991, pp. 251–257
- [26] Yangqing Jia et al. “Caffe: Convolutional Architecture for Fast Feature Embedding” In 22nd ACM International Conference on Multimedia Association for Computing Machinery, 2014, pp. 675–678
- [27] Diederik P. Kingma, Tim Salimans and Max Welling “Variational Dropout and the Local Reparameterization Trick” In Advances in Neural Information Processing Systems 28 Curran Associates, Inc., 2015, pp. 2575–2583
- [28] Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton “ImageNet Classification with Deep Convolutional Neural Networks” In Advances in Neural Information Processing Systems 25 Curran Associates, Inc., 2012, pp. 1097–1105
- [29] Moshe Leshno, Vladimir Ya. Lin, Allan Pinkus and Shimon Schocken “Multilayer feedforward networks with a nonpolynomial activation function can approximate any function” In Neural Networks 6.6, 1993, pp. 861–867
- [30] Oxana A. Manita et al. “Universal Approximation in Dropout Neural Networks” In Journal of Machine Learning Research 23.19, 2022, pp. 1–46
- [31] David McAllester “A PAC-Bayesian Tutorial with A Dropout Bound”, arXiv:1307.2118 [cs.LG], 2013
- [32] Poorya Mianjy and Raman Arora “On Dropout and Nuclear Norm Regularization” In 36th International Conference on Machine Learning Proceedings of Machine Learning Research, 2019, pp. 4575–4584
- [33] Poorya Mianjy and Raman Arora “On Convergence and Generalization of Dropout Training” In Advances in Neural Information Processing Systems 33 Curran Associates, Inc., 2020, pp. 21151–21161
- [34] Poorya Mianjy, Raman Arora and Rene Vidal “On the Implicit Bias of Dropout” In 35th International Conference on Machine Learning Proceedings of Machine Learning Research, 2018, pp. 3540–3548
- [35] Reza Moradi, Reza Berangi and Behrouz Minaei “A Survey of Regularization Strategies for Deep Models” In Artificial Intelligence Review 53.6, 2020, pp. 3947–3986
- [36] Gabin Maxime Nguegnang, Holger Rauhut and Ulrich Terstiege “Convergence of gradient descent for learning linear neural networks”, arXiv:2108.02040 [cs.LG], 2021
- [37] Des F. Nicholls and Barry G. Quinn “Random Coefficient Autoregressive Models: An Introduction” 11, Lecture Notes in Statistics Springer New York, 1982
- [38] Adam Paszke et al. “PyTorch: An Imperative Style, High-Performance Deep Learning Library” In Advances in Neural Information Processing Systems 32 Curran Associates, Inc., 2019
- [39] Boris Polyak “New method of stochastic approximation type” In Avtomatika i Telemekhanika 7, 1990, pp. 98–107
- [40] Boris Polyak and Anatoli B. Juditsky “Acceleration of stochastic approximation by averaging” In SIAM Journal on Control and Optimization 30.4, 1992, pp. 838–855
- [41] Marta Regis, Paulo Serra and Edwin R. van den Heuvel “Random autoregressive models: a structured overview” In Econometric Reviews 41.2, 2022, pp. 207–230
- [42] David Ruppert “Efficient Estimations from a Slowly Convergent Robbins-Monro Process”, Technical Report 781. Cornell University Operations Research and Industrial Engineering, 1988
- [43] Claudio Filipi Gonçalves Dos Santos and João Paulo Papa “Avoiding Overfitting: A Survey on Regularization Methods for Convolutional Neural Networks” In ACM Computing Surveys 54.10s, 2022, pp. 213:1–25
- [44] Johannes Schmidt-Hieber and Wouter M. Koolen “Hebbian learning inspired estimation of the linear regression parameters from queries”, arXiv:2311.03483 [math.ST], 2023
- [45] Albert Senen-Cerda and Jaron Sanders “Asymptotic Convergence Rate of Dropout on Shallow Linear Neural Networks” In Proceedings of the ACM on Measurement and Analysis of Computing Systems 6.2, 2022, pp. 32:1–53
- [46] Nitish Srivastava et al. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” In Journal of Machine Learning Research 15.56, 2014, pp. 1929–1958
- [47] Alexander Tsigler and Peter L. Bartlett “Benign Overfitting in Ridge Regression” In Journal of Machine Learning Research 24.123, 2023, pp. 1–76
- [48] Stefan Wager, Sida Wang and Percy S. Liang “Dropout Training as Adaptive Regularization” In Advances in Neural Information Processing Systems 26 Curran Associates, Inc., 2013, pp. 351–359
- [49] Li Wan et al. “Regularization of Neural Networks using DropConnect” In 30th International Conference on Machine Learning Proceedings of Machine Learning Research, 2013, pp. 1058–1066
- [50] Sida Wang and Christopher Manning “Fast dropout training” In 30th International Conference on Machine Learning Proceedings of Machine Learning Research, 2013, pp. 118–126
- [51] Colin Wei, Sham Kakade and Tengyu Ma “The Implicit and Explicit Regularization Effects of Dropout” In 37th International Conference on Machine Learning Proceedings of Machine Learning Research, 2020, pp. 10181–10192
- [52] Haibing Wu and Xiaodong Gu “Towards dropout training for convolutional neural networks” In Neural Networks 71, 2015, pp. 1–10
- [53] Ke Zhai and Huan Wang “Adaptive Dropout with Rademacher Complexity Regularization” In 6th International Conference on Learning Representations, 2018
- [54] Ruiqi Zhang, Spencer Frei and Peter L. Bartlett “Trained Transformers Learn Linear Models In-Context” In Journal of Machine Learning Research 25.49, 2024, pp. 1–55
- [55] Wanrong Zhu, Xi Chen and Wei Biao Wu “Online Covariance Matrix Estimation in Stochastic Gradient Descent” In Journal of the American Statistical Association 118.541, 2021, pp. 393–404