Dropout Regularization Versus ℓ₂-Penalization in the Linear Model
Abstract
We investigate the statistical behavior of gradient descent iterates with dropout in the linear regression model. In particular, non-asymptotic bounds for the convergence of expectations and covariance matrices of the iterates are derived. The results shed more light on the widely cited connection between dropout and ℓ₂-regularization in the linear model. We indicate a more subtle relationship, owing to interactions between the gradient descent dynamics and the additional randomness induced by dropout. Further, we study a simplified variant of dropout which does not have a regularizing effect and converges to the least squares estimator.
1 Introduction
Dropout is a simple, yet effective, algorithmic regularization technique, intended to prevent neural networks from overfitting. First introduced in [46], the method is implemented via random masking of neurons at training time. Specifically, during every gradient descent iteration, the output of each neuron is replaced by zero based on the outcome of an independently sampled Bernoulli-distributed variable, temporarily removing each neuron with a fixed probability; see Figure 1 for an illustration. The method has demonstrated effectiveness in various applications, see for example [28, 46]. On the theoretical side, dropout is often studied by exhibiting connections with explicit regularizers [3, 6, 9, 31, 32, 34, 45, 46, 48]. Rather than analyzing the gradient descent iterates with dropout noise, these results consider the marginalized training loss with marginalization over the dropout noise. Within this framework, [46] established a connection between dropout and weighted ℓ₂-penalization in the linear regression model. This connection is now cited in popular textbooks [14, 18].
However, [51] show empirically that injecting dropout noise in the gradient descent iterates also induces an implicit regularization effect that is not captured by the link between the marginalized loss and explicit regularization. This motivates our approach to directly derive the statistical properties of gradient descent iterates with dropout. We study the linear regression model due to its mathematical tractability and because the minimizer of the explicit regularizer is unique and admits a closed-form expression. In line with the implicit regularization observed in [51], our main result provides a theoretical bound quantifying the amount of randomness in the gradient descent scheme that is ignored by the previously considered minimizers of the marginalized training loss. More specifically, Theorem 2 shows that for a fixed learning rate there is additional randomness which fails to vanish in the limit, while Theorem 3 characterizes the gap between dropout and ℓ₂-penalization with respect to the learning rate, the dropout probability, the design matrix, and the distribution of the initial iterate. Theorem 4 shows that this gap disappears for the Ruppert-Polyak averages of the iterates.
To provide a clearer understanding of the interplay between gradient descent and variance, we also investigate a simplified variant of dropout featuring more straightforward interactions between the two. Applying the same analytical techniques to this simplified variant, Theorem 5 establishes convergence in quadratic mean to the conventional linear least-squares estimator. This analysis illustrates the sensitivity of gradient descent to small changes in the way noise is injected during training.
Many randomized optimization methods can be formulated as noisy gradient descent schemes. The developed strategy to treat gradient descent with dropout may be generalized to other settings. An example is the recent analysis of forward gradient descent in [8].
The article is organized as follows. After discussing related results below, Section 2 contains preliminaries and introduces two different variants of dropout. Section 3 discusses some extensions of previous results for averaged dropout obtained by marginalizing over the dropout distribution in the linear model considered in [46]. Section 4 illustrates the main results on gradient descent with dropout in the linear model, and examines its statistical optimality. Section 5 contains further discussion and mentions a number of natural follow-up problems. All proofs are deferred to the Appendix.
1.1 Other Related Work
Considering linear regression and the marginalized training loss with marginalization over the dropout noise, the initial dropout article [46] already connects dropout with ℓ₂-regularization. This connection was also noted by [6] and by [31]. As this argument is crucial in our own analysis, we will discuss it in more detail in Section 3.
[48] extends the reasoning to generalized linear models and more general forms of injected noise. Employing a quadratic approximation to the loss function after marginalization over the injected noise, the authors exhibit an explicit regularizer. In the case of dropout noise, this regularizer induces, to first order, an ℓ₂-penalty after rescaling of the data by the estimated inverse of the diagonal Fisher information.
For two-layer models, marginalizing the dropout noise leads to a nuclear norm penalty on the product matrix, both in matrix factorization [9] and linear neural networks [34]. The latter may be seen as a special case of a particular “ℓ₂-path regularizer”, which appears in deep linear networks [32] and shallow ReLU-activated networks [3]. Further, [3] exhibit a data distribution-dependent regularizer in two-layer matrix sensing/completion problems. This regularizer collapses to a nuclear norm penalty for specific distributions.
[16] show that empirical risk minimization in deep neural networks with dropout may be recast as performing Bayesian variational inference to approximate the intractable posterior resulting from a deep Gaussian process prior. The Bayesian viewpoint also allows for the quantification of uncertainty. [15] further generalizes this technique to recurrent and long short-term memory (LSTM) networks. [52] analyze dropout applied to the max-pooling layers in convolutional neural networks. [50] present a Gaussian approximation to the gradient noise induced by dropout.
Generalization results for dropout training exist in various settings. Given bounds on the norms of weight vectors, [49], [17], and [53] prove decreasing Rademacher complexity bounds as the dropout rate increases. [3] bound the Rademacher complexity of shallow ReLU-activated networks with dropout. [31] obtains a PAC-Bayes bound for dropout and illustrates a trade-off between large and small dropout probabilities for different terms in the bound.
Recently, [30] demonstrated a universal approximation result in the vein of classic results [11, 25, 29], stating that any function in some generic semi-normed space that can be -approximated by a deterministic neural network may also be stochastically approximated in -norm by a sufficiently large network with dropout.
Less is known about gradient descent training with dropout. [45] study the gradient flow associated with the explicit regularizer obtained by marginalizing the dropout noise in a shallow linear network. In particular, the flow converges exponentially fast within a neighborhood of a parameter vector satisfying a balancing condition. [33] study gradient descent with dropout on the logistic loss of a shallow ReLU-activated network in a binary classification task. Their main result includes an explicit rate for the misclassification error, assuming an overparametrized network operating in the so-called lazy regime, where the trained weights stay relatively close to their initializations, and two well-separated classes.
1.2 Notation
Column vectors are denoted by bold letters. We define , , and the Euclidean norm . The identity matrix is symbolized by , or simply , when the dimension is clear from context. For matrices of the same dimension, denotes the Hadamard/entry-wise product . We write for the diagonal matrix with the same main diagonal as . Given , we define the matrices
In particular, , so results from rescaling the off-diagonal entries of by .
The smallest eigenvalue of a symmetric matrix is denoted by . The operator norm of a linear operator between normed linear spaces is given by . We write for the spectral norm of matrices, which is the operator norm induced by . For symmetric matrices, the relation signifies for all non-zero vectors . The strict operator inequality is defined analogously.
2 Gradient Descent and Dropout
We consider a linear regression model with fixed design matrix and outcomes , so that
(1)
with unknown parameter . We assume and . The task is to estimate from the observed data . As the Gram matrix appears throughout our analysis, we introduce the shorthand
Recovery of in the linear regression model (1) may be interpreted as training a neural network without intermediate hidden layers, see Figure 2. If were to have a zero column, the corresponding regression coefficient would not affect the response vector . Consequently, both the zero column and the regression coefficient may be eliminated from the linear regression model. Without zero columns, the model is said to be in reduced form.
The least squares criterion for the estimation of refers to the objective function . Given a fixed learning rate , performing gradient descent on the least squares objective leads to the iterative scheme
(2)
with and (possibly random) initialization .
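To fix ideas, the following minimal sketch implements the recursion (2) numerically. The variable names, problem sizes, and the choice of learning rate are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem instance: n observations, d parameters.
n, d = 100, 5
X = rng.standard_normal((n, d))            # fixed design matrix
Y = X @ rng.standard_normal(d) + rng.standard_normal(n)

G = X.T @ X                                # Gram matrix
alpha = 1.0 / np.linalg.eigvalsh(G).max()  # small fixed learning rate
theta = np.zeros(d)                        # initialization

# Gradient descent on the least squares objective, cf. (2):
# theta_{k+1} = theta_k + alpha * X^T (Y - X theta_k).
for _ in range(2000):
    theta = theta + alpha * X.T @ (Y - X @ theta)

# For a sufficiently small learning rate the iterates approach the
# least squares estimator.
print(np.allclose(theta, np.linalg.solve(G, X.T @ Y)))
```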
For standard gradient descent as defined in (2), the estimate is updated with the gradient of the full model. Dropout, as introduced in [46], replaces the gradient of the full model with the gradient of a randomly reduced model during each iteration of training. To make this notion more precise, we call a random diagonal matrix a -dropout matrix, or simply a dropout matrix, if its diagonal entries satisfy for some . We note that the Bernoulli distribution may alternatively be parametrized with the failure probability , but following [46] we choose the success probability .
On average, has diagonal entries equal to and diagonal entries equal to . Given any vector , the coordinates of are randomly set to with probability . For simplicity, the dependence of on will be omitted.
Now, let , be a sequence of i.i.d. dropout matrices, where refers to the dropout matrix applied in the th iteration. Gradient descent with dropout takes the form
(3)
with and (possibly random) initialization . In contrast with (2), the gradient in (3) is taken on the model reduced by the action of multiplying with . Alternatively, (3) may be interpreted as replacing the design matrix with the reduced matrix during the th iteration. The columns of the reduced matrix are randomly deleted with a probability of . Observe that the dropout matrix appears inside the squared norm, making the gradient quadratic in .
Dropout, as defined in (3), considers the full gradient of the reduced model, whereas another variant is obtained through reduction of the full gradient. The resulting iterative scheme takes the form
(4)
with and (possibly random) initialization . As opposed to defined above, the dropout matrix only occurs once in the updates, so we shall call this method simplified dropout from here on. As we will illustrate, the quadratic dependence of on creates various challenges, whereas the analysis of is more straightforward.
Both versions (3) and (4) coincide when the Gram matrix is diagonal, meaning when the columns of are orthogonal. To see this, note that diagonal matrices commute, so and hence .
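The two noisy recursions (3) and (4) can be written down in a few lines. The sketch below, with illustrative sizes and constants, runs both schemes with shared Bernoulli dropout draws on a design with orthogonal columns and confirms numerically that the iterates then coincide, as argued above; for a general design the two paths differ.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, p, alpha, iters = 50, 4, 0.8, 0.01, 500

# Design with orthogonal columns, so that the Gram matrix X^T X is diagonal.
X, _ = np.linalg.qr(rng.standard_normal((n, d)))
X *= 3.0                                    # rescaling keeps the columns orthogonal
Y = X @ rng.standard_normal(d) + rng.standard_normal(n)

theta_dropout = np.zeros(d)                 # iterates of (3)
theta_simplified = np.zeros(d)              # iterates of (4)

for _ in range(iters):
    # Bernoulli dropout matrix, shared by both recursions in this iteration.
    D = np.diag(rng.binomial(1, p, size=d).astype(float))
    # (3): gradient of the randomly reduced model with design X D.
    theta_dropout = theta_dropout + alpha * D @ X.T @ (Y - X @ D @ theta_dropout)
    # (4): dropout applied to the gradient of the full model.
    theta_simplified = theta_simplified + alpha * D @ X.T @ (Y - X @ theta_simplified)

print(np.allclose(theta_dropout, theta_simplified))   # True for diagonal X^T X
```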
We note that dropout need not require the complete removal of neurons. Each neuron may be multiplied by an arbitrary (not necessarily Bernoulli distributed) random variable. For instance, [46] also report good performance for Gaussian-distributed diagonal entries of the dropout matrix. However, the Bernoulli variant seems well-motivated from a model averaging perspective. [46] propose dropout with the explicit aim of approximating a Bayesian model averaging procedure over all possible combinations of connections in the network. The random removal of nodes during training is thought to prevent the neurons from co-adapting, recreating the model averaging effect. This is the main variant implemented in popular software libraries, such as Caffe [26], TensorFlow [1], Keras [10], and PyTorch [38].
Numerous variations and extensions of dropout exist. [49] show state-of-the-art results for networks with DropConnect, a generalization of dropout where connections are randomly dropped, instead of neurons. In the linear model, this coincides with standard dropout. [4] analyze the case of varying dropout probabilities, where the dropout probability for each neuron is computed using binary belief networks that share parameters with the underlying fully connected network. An adaptive procedure for the choice of dropout probabilities is presented in [27], while also giving a Bayesian justification for dropout.
3 Analysis of Averaged Dropout
Before presenting our main results on iterative dropout schemes, we further discuss some properties of the marginalized loss minimizer that was first analyzed by [46]. For the linear regression model (1), marginalizing the dropout noise leads to
(5)
One may hope that the dropout gradient descent recursion for in (3) leads to a minimizer of (5), so that the marginalized loss minimizer may be studied as a surrogate for the behaviour of in the long run.
Intuitively, the gradient descent iterates with dropout represent a Monte-Carlo estimate of some deterministic algorithm [50]. This can be motivated by separating the gradient descent update into a part without algorithmic randomness and a centered noise term, meaning
(6)
Notably, the stochastic terms form a martingale difference sequence with respect to . It seems conceivable that the noise in (6) averages out; despite the random variables being neither independent, nor identically distributed, one may hope that a law of large numbers still holds, see [2]. In this case, after a sufficient number of gradient steps,
(7)
The latter sequence could plausibly converge to the marginalized loss minimizer . While this motivates studying , the main conclusion of our work is that this heuristic is not entirely correct and additional noise terms occur in the limit .
As the marginalized loss minimizer still plays a pivotal role in our analysis, we briefly recount and expand on some of the properties derived in [46]. Recall that , so we have
Since is diagonal, , and by Lemma 10(a), ,
(8)
The right-hand side may be identified with a Tikhonov functional, or an ℓ₂-penalized least squares objective. Its gradient with respect to is given by
Recall from the discussion following Equation (1) that the model is assumed to be in reduced form, meaning . In turn,
is bounded away from , making invertible. Setting the gradient to zero and solving for the minimizer now leads to
(9)
where . If the columns of are orthogonal, then is diagonal and hence . In this case, matches the usual linear least squares estimator . Alternatively, minimizing the marginalized loss can also be deduced from the identity
(10)
which holds for all estimators . See Appendix A for a proof of (10). We now mention several other relevant properties of .
Calibration: [46] recommend multiplying by , which may be motivated as follows: Since , a small squared error in (3) implies . Moreover, multiplying by leads to which may be identified with the minimizer of the objective function
This recasts as resulting from a weighted form of ridge regression. Comparing the objective function to the original marginalized loss , the rescaling replaces with the normalized dropout matrix , which has the identity matrix as its expected value. In popular machine learning software, the sampled dropout matrices are usually rescaled by [1, 10, 26, 38].
In some settings multiplication by may worsen as a statistical estimator. As an example, consider the case with a multiple of the identity matrix. Now, (9) turns into . If the noise vector consists of independent standard normal random variables, then has mean squared error . In contrast, , so converges to at the rate while cannot be consistent as , unless .
The correct rescaling may also depend on the parameter dimension and the spectrum of . Suppose that all columns of have the same Euclidean norm, so that . Let denote a singular value decomposition of , with singular values . Now, satisfies
(11)
For a proof of these identities, see Appendix A. To get an unbiased estimator for , we must undo the effect of the spectral multipliers which take values in the interval . Consequently, the proper rescaling depends on the eigenspace. Multiplication of the estimator by addresses the case where the singular values are large. In particular, if with , , and , then is the matrix with all entries equaling . Now and so , meaning the correct scaling factor depends explicitly on the parameter dimension and converges to the dropout probability as .
Invariance properties: The minimizer is scale invariant in the sense that and may be replaced with and for some arbitrary , without changing . This does not hold for the gradient descent iterates (3) and (4), since rescaling by changes the learning rate from to . Moreover, as well as the gradient descent iterates in (3) and in (4) are invariant under replacement of and by and for any orthogonal matrix . See [21] for further results on scale-invariance of dropout in deep networks.
Overparametrization: Dropout has been successfully applied in the overparametrized regime, see for example [28]. For the overparametrized linear regression model, the data-misfit term in (3) suggests that should be close to the data vector . However,
(12)
See Appendix A for a proof. Hence, also shrinks towards zero in the overparametrized regime and does not interpolate the data. The variational formulation reveals that is a minimum-norm solution in the sense
which explains the induced shrinkage.
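For concreteness, the following sketch spells out the marginalized-loss computation of this section under the parametrization in which the diagonal entries of the dropout matrix are Bernoulli with success probability p. The closed form below, (p X^T X + (1-p) diag(X^T X))^{-1} X^T Y, is the standard expression for the minimizer of the marginalized loss under this parametrization and is intended to correspond to (9); the Monte Carlo check and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, p = 200, 6, 0.7
X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal(d) + rng.standard_normal(n)
G = X.T @ X

# Closed-form minimizer of the marginalized loss E_D ||Y - X D b||^2,
# assuming Bernoulli(p) diagonal entries of D (intended to match (9)).
beta_tilde = np.linalg.solve(p * G + (1 - p) * np.diag(np.diag(G)), X.T @ Y)

def exact_marginalized_loss(b):
    # ||Y - p X b||^2 + p (1 - p) b^T diag(X^T X) b, cf. the Tikhonov form (8).
    return np.sum((Y - p * X @ b) ** 2) + p * (1 - p) * b @ (np.diag(G) * b)

def mc_marginalized_loss(b, reps=20_000):
    # Monte Carlo average of ||Y - X D b||^2 over fresh dropout draws.
    total = 0.0
    for _ in range(reps):
        mask = rng.binomial(1, p, size=d)
        total += np.sum((Y - X @ (mask * b)) ** 2)
    return total / reps

# The Monte Carlo average agrees with the exact formula (up to simulation
# error), and beta_tilde has a smaller exact marginalized loss than nearby
# perturbations.
print(mc_marginalized_loss(beta_tilde) / exact_marginalized_loss(beta_tilde))
print(all(exact_marginalized_loss(beta_tilde)
          <= exact_marginalized_loss(beta_tilde + 0.1 * rng.standard_normal(d))
          for _ in range(5)))
```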
4 Analysis of Iterative Dropout Schemes
In the linear model, gradient descent with a small but fixed learning rate, as in (2), leads to exponential convergence in the number of iterations. Accordingly, we analyze the iterative dropout schemes (3) and (4) for fixed learning rate and only briefly discuss the algebraically less tractable case of decaying learning rates.
4.1 Convergence of Dropout
We proceed by assessing convergence of the iterative dropout scheme (3), as well as some of its statistical properties. Recall that gradient descent with dropout takes the form
(13)
As alluded to in the beginning of Section 3, the gradient descent iterates should be related to the minimizer of (5). It then seems natural to study the difference , with as an “anchoring point”. Comparing and demands an explicit analysis, without marginalization of the dropout noise.
To start, we rewrite the updating formula (13) in terms of . Using , , and that diagonal matrices always commute, we obtain . As defined in (9), and thus
(14)
In both representations, the second term is centered and uncorrelated for different values of . Vanishing of the mean follows from the independence of and , combined with , the latter being shown in (28). If , independence of and , as well as , imply and , which proves uncorrelatedness.
Defining , , and , the first representation in (14) may be identified with a lag one vector autoregressive (VAR) process
(15)
with i.i.d. random coefficients and noise/innovation process . As just shown, and whenever , so the noise process is centered and serially uncorrelated. The random coefficients and are, however, dependent. While most authors do not allow for random coefficients in VAR processes, such processes are special cases of a random autoregressive process (RAR) [41].
In the VAR literature, identifiability and estimation of the random coefficients is considered [37, 41]. In contrast, we aim to obtain bounds for the convergence of and . Difficulties arise from the involved structure and coupled randomness of and . Estimation of coefficients under dependence of and is treated in [22].
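As a purely illustrative toy example (none of the quantities below are the ones induced by dropout), the following snippet simulates a generic lag-one autoregression with i.i.d. random coefficient matrices and a coupled, centered noise term, mirroring the structure of (15): the iterates do not settle at a point but fluctuate around a stationary distribution, whose mean and covariance are the kind of quantities analyzed below.

```python
import numpy as np

rng = np.random.default_rng(3)
d, steps = 3, 5000
A_mean = 0.9 * np.eye(d)              # illustrative mean coefficient, a contraction

x = np.zeros(d)
trajectory = np.empty((steps, d))
for k in range(steps):
    Z = rng.standard_normal((d, d))
    A = A_mean + 0.05 * Z             # i.i.d. random coefficient matrix
    b = 0.1 * Z @ np.ones(d)          # centered noise, dependent on A (coupled)
    x = A @ x + b                     # lag-one recursion with random coefficients
    trajectory[k] = x

# After a burn-in the iterates fluctuate around zero; the empirical covariance
# estimates the stationary covariance of the process.
print(trajectory[1000:].mean(axis=0))
print(np.cov(trajectory[1000:].T))
```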
For a sufficiently small learning rate , the random matrices and in both representations in (14) are contractive maps in expectation. By Lemma 10(a), their expected values coincide since
For the subsequent analysis, we impose the following mild conditions that, among other things, establish contractivity of as a linear map.
Assumption 1.
The learning rate and the dropout probability are chosen such that , the initialization is a square-integrable random vector that is independent of the data, and the model is in reduced form, meaning that does not have zero columns.
For gradient descent without dropout and fixed learning rate, as defined in (2), guarantees convergence of the scheme in expectation. We will see shortly that dropout essentially replaces the expected learning rate with , which motivates the condition .
As a straightforward consequence of the definitions, we are now able to show that vanishes in expectation at a geometric rate. For a proof of this as well as subsequent results, see Appendix B.
Lemma 1 (Convergence of Expectation).
Given Assumption 1, and for any ,
Before turning to the analysis of the covariance structure, we highlight a property of the sequence . As mentioned, these conditional expectations may be viewed as gradient descent iterates generated by the marginalized objective that gives rise to . Indeed, combining (13) with , from Lemma 10(a), and (3) yields
This establishes a connection between the dropout iterates and the averaged analysis of the previous section. However, the relationship between the (unconditional) covariance matrices and the added noise remains unclear. A new dropout matrix is sampled for each iteration, whereas results from minimization only after applying the conditional expectation to the randomized objective function. Hence, we may expect that features smaller variance than the iterates as the latter also depend on the noise added via dropout.
As a first result for the covariance analysis, we establish an extension of the Gauss-Markov theorem stating that the covariance matrix of a linear estimator lower-bounds the covariance matrix of an affine estimator, provided that both estimators have the same asymptotic mean. Moreover, the covariance matrix of their difference characterizes the gap. We believe that a similar result may already be known, but we are not aware of any reference, so a full proof is provided in Appendix B for completeness.
Theorem 1.
Since , Theorem 1 may be interpreted as follows: if the estimators and are nearly the same in expectation, then and must be nearly uncorrelated. In turn, may be decomposed into and (nearly) orthogonal noise , so that is lower bounded by . Therefore, the covariance matrix quantifies the gap in the bound.
Taking and considering linear estimators with recovers the usual Gauss-Markov theorem, stating that is the best linear unbiased estimator (BLUE) for the linear model. Applying the generalized Gauss-Markov theorem with , where is a positive definite matrix, we obtain the following statement about ℓ₂-penalized estimators.
Corollary 1.
The minimizer of the ℓ₂-penalized functional has the smallest covariance matrix among all affine estimators with the same expectation as .
We now return to our analysis of the covariance structure induced by dropout. If , then in Theorem 1. Further, the dropout iterates may be rewritten as affine estimators with
By construction, and are independent, so Theorem 1 applies. As shown in Lemma 1, vanishes exponentially fast, so we conclude that is asymptotically lower-bounded by . Further, the covariance structure of is optimal in the sense of Corollary 1.
We proceed by studying , with the aim of quantifying the gap between the covariance matrices. To this end, we exhibit a particular recurrence for the second moments . Recall that denotes the Hadamard product, , and .
Lemma 2 (Second Moment - Recursive Formula).
Intuitively, the lemma states that the second moment of evolves as an affine dynamical system, up to some exponentially decaying remainder. This may be associated with the implicit regularization of the dropout noise, as illustrated empirically in [51].
Mathematically, the result may be motivated via the representation of the dropout iterates as a random autoregressive process in (15). Writing out and comparing with the proof of the lemma, we see that the remainder term, denoted by in the proof, coincides with the expected value of the cross terms . Moreover, the operator is obtained by computing
As are i.i.d., does not depend on . Moreover, independence of and implies
Inserting the definition results in the statement of the lemma. The random vector depends on and by Theorem 1 the correlation between and decreases as . This leads to the exponentially decaying bound for the remainder term .
The previous lemma entails equality between and , up to the remainder terms. The latter may be computed further by decomposing the affine operator into its intercept and linear part
(16)
If were to have operator norm less than one, then the Neumann series for (see Lemma 12) gives
with the identity operator on matrices. Surprisingly, the operator “forgets” in the sense that the limit does not depend on anymore. The argument shows that should behave like in the first order. The next result makes this precise, taking into account the remainder terms and approximation errors.
Theorem 2 (Second Moment - Limit Formula).
In short, converges exponentially fast to the limit . Combining the generalized Gauss-Markov Theorem 1 with Theorem 2 also establishes
with exponential rate of convergence. Recall the intuition gained from Theorem 1 that may be decomposed into a sum of and (approximately) orthogonal centered noise. We now conclude that up to exponentially decaying terms, the covariance matrix of this orthogonal noise must be given by , which fully describes the (asymptotic) gap between the covariance matrices of and .
Taking the trace and noting that , we obtain a bound for the convergence of with respect to the squared Euclidean loss,
(17)
Since is a matrix, the term describing the asymptotic discrepancy between and can be large in high dimensions , even if the spectral norm of is small. Since is a positive definite operator, the matrix is zero if, and only if, is zero. By (16), . The explicit form , shows that and provided that , meaning whenever is diagonal. To give a more precise quantification, we show that the operator norm of is of order .
Lemma 3.
In addition to Assumption 1 suppose , then, for any
and
where and are constants that are independent of .
The first bound describes the interplay between and . Making small will decrease the second term in the bound, but conversely requires a larger number of iterations for the first term to decay.
In the second bound, matches the covariance matrix of the marginalized loss minimizer up to a term of order . Consequently, the covariance structures induced by dropout and ℓ₂-regularization approximately coincide for sufficiently small . However, in this regime we have , and becomes extremely biased whenever the Gram matrix is not diagonal.
Theorem 1 already establishes as the optimal covariance among all affine estimators that are asymptotically unbiased for . To conclude our study of the gap between and , we provide a lower-bound.
Theorem 3 (Sub-Optimality of Variance).
In addition to the assumptions of Theorem 2, suppose for every there exists such that , then
The lower-bound is positive whenever is invertible. In general, Theorem 3 entails asymptotic statistical sub-optimality of the gradient descent iterates for a large class of design matrices. Moreover, the result does not require any further assumptions on the tuning parameters and , other than being sufficiently small.
To summarize, compared with the marginalized loss minimizer , the covariance matrix of the gradient descent iterates with dropout may be larger. The difference may be significant, especially if the data dimension is large. Proving the results requires a refined second moment analysis, based on explicit computation of the dynamics of . Simple heuristics such as (7) do not fully reveal the properties of the underlying dynamics.
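The covariance gap can also be seen in a small simulation. The sketch below, with illustrative constants, repeatedly draws fresh observation noise, runs the dropout recursion (3) until it is approximately stationary, and compares the empirical covariance of the final iterates with that of the marginalized loss minimizer computed from the same data; the closed form used for the latter is the standard Bernoulli(p) expression discussed in Section 3 and is an assumption of this sketch. Theorems 2 and 3 predict the difference to be positive semi-definite up to simulation error, and to shrink as the learning rate decreases.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, p, alpha, iters, reps = 50, 3, 0.6, 0.002, 2000, 200

X = rng.standard_normal((n, d)) @ np.diag([1.0, 2.0, 3.0])   # non-orthogonal columns
beta = np.array([1.0, -1.0, 0.5])
G = X.T @ X
A_marg = p * G + (1 - p) * np.diag(np.diag(G))   # matrix behind the assumed closed form

final_iterates, minimizers = [], []
for _ in range(reps):
    Y = X @ beta + rng.standard_normal(n)        # fresh observation noise
    theta = np.zeros(d)
    for _ in range(iters):
        D = np.diag(rng.binomial(1, p, size=d).astype(float))
        theta = theta + alpha * D @ X.T @ (Y - X @ D @ theta)   # recursion (3)
    final_iterates.append(theta)
    minimizers.append(np.linalg.solve(A_marg, X.T @ Y))

gap = np.cov(np.array(final_iterates).T) - np.cov(np.array(minimizers).T)
print(np.linalg.eigvalsh(gap))   # eigenvalues of the empirical covariance gap
```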
4.1.1 Ruppert-Polyak Averaging
To reduce the gradient noise induced by dropout, one may consider the running average over the gradient descent iterates. This technique is also known as Ruppert-Polyak averaging [42, 39]. The th Ruppert-Polyak average of the gradient descent iterates is given by
Averages of this type are well-studied in the stochastic approximation literature, see [40, 19] for results on linear regression and [55, 12] for stochastic gradient descent. The next theorem illustrates convergence of towards .
Theorem 4.
The first term in the upper bound is independent of and decays at the rate , whereas the second term scales with . Accordingly, for small , the second term will dominate initially, until grows sufficiently large.
Since the right-hand side eventually tends to zero, the theorem implies convergence of the Ruppert-Polyak averaged iterates to the marginalized loss minimizer , so the link between dropout and ℓ₂-regularization persists at the variance level. The averaging comes at the price of a slower convergence rate of the remainder terms, as opposed to the exponentially fast convergence in Theorem 2. As in (17), the bound can be converted into a convergence rate for by taking the trace,
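A quick illustration of the averaging effect, under the same illustrative setup and with the same assumed closed form for the marginalized loss minimizer as before: for a single draw of the data, the raw dropout iterate keeps fluctuating, while its running average moves much closer to the minimizer, in line with Theorem 4.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, p, alpha, iters = 50, 3, 0.6, 0.002, 20_000

X = rng.standard_normal((n, d))
Y = X @ np.array([1.0, -1.0, 0.5]) + rng.standard_normal(n)
G = X.T @ X
beta_tilde = np.linalg.solve(p * G + (1 - p) * np.diag(np.diag(G)), X.T @ Y)

theta = np.zeros(d)
running_average = np.zeros(d)
for k in range(1, iters + 1):
    D = np.diag(rng.binomial(1, p, size=d).astype(float))
    theta = theta + alpha * D @ X.T @ (Y - X @ D @ theta)   # recursion (3)
    running_average += (theta - running_average) / k        # Ruppert-Polyak average

print(np.linalg.norm(theta - beta_tilde))            # raw iterate: persistent noise
print(np.linalg.norm(running_average - beta_tilde))  # running average: much closer
```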
4.2 Convergence of Simplified Dropout
To further illustrate how dropout and gradient descent are coupled, we will now study the simplified dropout iterates
(18)
as defined in (4). While the original dropout reduces the model before taking the gradient, this version takes the gradient first and applies dropout afterwards. As shown in Section 2, both versions coincide if the Gram matrix is diagonal. Recall from the discussion preceding Lemma 3 that for diagonal , converges to the optimal covariance matrix. This suggests that for the simplified dropout, no additional randomness in the limit occurs.
The least squares objective always admits a minimizer, with any minimizer necessarily solving the so-called normal equations . Provided is invertible, the least-squares estimator gives the unique solution. We will not assume invertibility for all results below, so we let denote any solution of the normal equations, unless specified otherwise. In turn, (18) may be rewritten as
(19)
which is simpler than the analogous representation of as a VAR process in (15).
As a first result, we will show that the expectation of the simplified dropout iterates converges to the mean of the unregularized least squares estimator , provided that is invertible. Indeed, using (19), independence of and , and , observe that
(20)
Induction on now shows and so
Assuming , invertibility of implies . Consequently, the convergence is exponential in the number of iterations.
Invertibility of may be avoided if the initialization lies in the orthogonal complement of the kernel of and is the ℓ₂-minimal solution to the normal equations. We can then argue that always stays in a linear subspace on which still acts as a contraction.
We continue with our study of by employing the same techniques as in the previous section to analyze the second moment. The linear operator
(21)
defined on matrices, takes over the role of the affine operator encountered in Lemma 7. In particular, the second moments evolve as the linear dynamical system
(22)
without remainder terms. To see this, observe via (19) the identity . Taking the expectation on both sides, conditioning on , and recalling that is independent of gives . We have and by Lemma 10, . Together with the definition of , this proves (22).
Further results are based on analyzing the recursion in (22). It turns out that convergence of to in second mean requires a non-singular Gram matrix .
Lemma 4.
Suppose the initialization is independent of all other sources of randomness and the number of parameters satisfies , then there exists a singular , such that for any positive integer , and
For invertible , we can apply Theorem 1 to show that is the optimal covariance matrix for the sequence of affine estimators . The simplified dropout iterates actually achieve the optimal variance when is invertible, which stands in contrast with the situation in Theorem 3.
Theorem 5.
Suppose is invertible, , and let be square-integrable, then, for any
Intuitively, the result holds due to the operator in (21) being linear, as opposed to affine like in the case of Lemma 2. Choosing sufficiently small ensures that acts as a contraction, meaning for any matrix . Hence, linearity of serves as an algebraic expression of the simplified dynamics. As in (17), we may take the trace to obtain the bound
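The convergence in Theorem 5 can be seen directly in a single simulated run. In the sketch below (illustrative constants, invertible Gram matrix), centering the simplified recursion (4)/(18) at the least squares estimator leaves no additive noise term, cf. (19), so one trajectory already reaches the estimator up to numerical precision.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, p, alpha, iters = 80, 4, 0.7, 0.002, 20_000

X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal(d) + rng.standard_normal(n)
beta_ls = np.linalg.solve(X.T @ X, X.T @ Y)      # least squares estimator

theta = np.zeros(d)
for _ in range(iters):
    D = np.diag(rng.binomial(1, p, size=d).astype(float))
    # Simplified dropout (4)/(18): dropout applied to the full gradient.
    theta = theta + alpha * D @ X.T @ (Y - X @ theta)

# Centered at beta_ls, the error is only multiplied by random matrices and no
# additive dropout noise remains, so the distance decays geometrically.
print(np.linalg.norm(theta - beta_ls))
```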
5 Discussion and Outlook
Our main contributions may be summarized as follows: We studied dropout in the linear regression model, but unlike previous results, we explicitly analyzed the gradient descent dynamics with new dropout noise being sampled in each iteration. This allows us to characterize the limiting variance of the gradient descent iterates exactly (Theorem 2), which sheds light on the covariance structure induced via dropout. Our main tool in the analysis is a particular recursion (Lemma 2), which may be exhibited by “anchoring” the gradient descent iterates around the marginalized loss minimizer . To further understand the interaction between noise and gradient descent dynamics, we analyze the running average of the process (Theorem 4) and a simplified version of dropout (Theorem 5).
We view our analysis of the linear model as a fundamental first step towards understanding the dynamics of gradient descent with dropout. Analyzing the linear model has been a fruitful approach to study other phenomena in deep learning, such as overfitting [47], sharpness of local minima [7], and in-context learning [54]. We conclude by proposing some natural directions for future work.
Random minibatch sampling: For yet another way of incorporating dropout, we may compute the gradient based on a random subset of the data (minibatches). In this case, the updating formula satisfies
(23)
The dropout matrices are now of dimension and select a random subset of the data points in every iteration. This version of dropout also relates to randomly weighted least squares and resampling methods [13]. The update formula may be written in the form with solving the normal equations . Similarly to the corresponding reformulation (14) of the original dropout scheme, this defines a vector autoregressive process with random coefficients and lag one.
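A sketch of this variant, with illustrative constants: here the Bernoulli selection acts on observations rather than on parameters. The update used below, which applies the selection inside the gradient, is the natural reading of (23) and is an assumption of this sketch; with a fixed learning rate the iterates hover around a least squares solution rather than converging to it, the fluctuations shrinking with the learning rate.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, p, alpha, iters = 100, 4, 0.5, 0.002, 20_000

X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal(d) + rng.standard_normal(n)
beta_ls = np.linalg.solve(X.T @ X, X.T @ Y)

theta = np.zeros(d)
for _ in range(iters):
    # Each observation is kept independently with probability p in this
    # iteration's gradient, i.e. an n x n diagonal selection matrix.
    keep = rng.binomial(1, p, size=n).astype(float)
    theta = theta + alpha * X.T @ (keep * (Y - X @ theta))

# The residual Y - X beta_ls is not annihilated by the random selection, so
# the iterates fluctuate around the least squares solution instead of
# reaching it exactly.
print(np.linalg.norm(theta - beta_ls))
```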
Learning rates: The proof ideas may be generalized to a sequence of iteration-dependent learning rates . We expect this to come at the cost of more involved formulas. Specifically, the operator in Lemma 2 will depend on the iteration number through , so the limit in Theorem 2 cannot be expressed as anymore.
Random design and stochastic gradients: We considered a fixed design matrix and (full) gradient descent, whereas in machine learning practice inputs are typically assumed to be random and parameters are updated via stochastic gradient descent (SGD). The recent works [8, 44] derive convergence rates for SGD considering linear regression and another form of noisy gradient descent. We believe that parts of these analyses carry over to dropout.
Generic dropout distributions: As already mentioned, [46] carry out simulations with dropout where . Gaussian dropout distributions are currently supported, or easily implemented, in major software libraries [1, 10, 26, 38]. Analyzing a generic dropout distribution with mean and variance may also paint a clearer picture of how the dropout noise interacts with the gradient descent dynamics. For the linear regression model, results that marginalize over the dropout noise generalize to arbitrary dropout distributions. In particular, (9) turns into
If , then .
In contrast, treatment of the corresponding iterative dropout scheme seems more involved. The analysis of dropout with Bernoulli distributions relies in part on the projection property . Without it, additional terms occur in the moments in Lemma 10, which is required for the computation of the covariance matrix. For example, the formula turns into
where denotes the th moment of the dropout distribution and its variance. For the Bernoulli distribution, all moments equal , so and the last term disappears. Similarly, more terms will appear in the fourth moment of , making the expression for the operator corresponding to in Lemma 2 more complicated.
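To see how a non-Bernoulli dropout distribution changes the second moment, the following check (illustrative sizes; the general identity is a standard computation, stated here as an assumption rather than quoted from the paper) compares Monte Carlo estimates of E[D A D] with the formula mu^2 A + sigma^2 diag(A) for i.i.d. diagonal entries with mean mu and variance sigma^2. For Bernoulli entries this is consistent with the idempotence property exploited in Appendix C, while for Gaussian entries the variance term no longer ties to the mean.

```python
import numpy as np

rng = np.random.default_rng(8)
d, reps = 4, 100_000
A = rng.standard_normal((d, d))
A = A + A.T                                   # arbitrary symmetric test matrix

def mc_second_moment(sampler):
    # Monte Carlo estimate of E[D A D] for a diagonal D with i.i.d. entries.
    acc = np.zeros((d, d))
    for _ in range(reps):
        w = sampler()
        acc += np.outer(w, w) * A             # (D A D)_{ij} = w_i A_{ij} w_j
    return acc / reps

p = 0.7
bernoulli = mc_second_moment(lambda: rng.binomial(1, p, size=d).astype(float))
gaussian = mc_second_moment(lambda: rng.normal(1.0, 0.5, size=d))

# Claimed identity for i.i.d. entries with mean mu and variance s2:
#   E[D A D] = mu**2 * A + s2 * diag(A).
diag_A = np.diag(np.diag(A))
print(np.abs(bernoulli - (p**2 * A + p * (1 - p) * diag_A)).max())
print(np.abs(gaussian - (1.0**2 * A + 0.5**2 * diag_A)).max())
```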
Inducing robustness via dropout: Among the possible ways of explaining the data, dropout should, by design, favor an explanation that is robust against setting a random subset of the parameters to zero. [34] indicate that dropout in two-layer linear networks tends to equalize the norms of different weight vectors.
To study the robustness properties of dropout, one may consider loss functions measuring how well the response vector is predicted if each estimated regression coefficient is deleted with probability . Given an estimator , a natural choice would be the loss
with a new draw of the dropout matrix, independent of all other randomness. This loss depends on the unknown true regression vector . Since , an empirical version of the loss may replace with , considering . As shown in (9), minimizes this loss function. This suggests that may possess some optimality properties for the loss defined above.
Shallow networks with linear activation function: Multi-layer neural networks do not admit unique minimizers. In comparison with the linear regression model, this poses a major challenge for the analysis of dropout. [34] consider shallow linear networks of the form with an matrix and a matrix. Suppose is an dropout matrix. Assuming the random design vector satisfies , and marginalizing over dropout noise applied to the columns of (or equivalently to the rows of ) leads to an -penalty via
(24)
As an extension of our approach, it seems natural to investigate whether gradient descent with dropout will converge to the same minimizer or involve additional terms in the variance. In contrast with linear regression, the marginalized loss (24) is non-convex and does not admit a unique minimizer. Hence, we cannot simply center the gradient descent iterates around a specific closed-form estimator, as in Section 4. To extend our techniques we may expect to replace the centering estimator with the gradient descent iterates for the marginalized loss function, demanding a more careful analysis.
[45] study the gradient flow associated with (24) and exhibit exponential convergence of and towards a minimizer. Extending the existing result on gradient flows to gradient descent is, however, non-trivial, see for example the gradient descent version of Theorem 3.1 in [5] provided in Theorem 2.4 of [36].
To be more precise, suppose , where and are independent random vectors, so the task reduces to learning a factorization based on noisy evaluation of . Consider the randomized loss
with respective gradients
Given observations and independent dropout matrices , the factorized structure leads to two coupled dynamical systems and , which are linked through the appearance of in and in . Due to non-convexity of the underlying marginalized objective (24), the resulting dynamics should be sensitive to initialization. Suppose and are given as random matrix polynomials in , meaning finite sums of expressions like , where and the are random coefficient matrices. Now, the gradient descent recursions lead to
(25)
so is also a polynomial in with random coefficients. A difficulty in analyzing this recursion is that the degree of the polynomial increases exponentially fast. Indeed, since includes the term , the degree of is the degree of plus twice the degree of . During each gradient descent step, additional randomness is introduced via the newly sampled dropout matrix and the training data . Accordingly, the coefficients of and fluctuate around the coefficients of and . A principled analysis of the resulting dynamics requires careful accounting of how these fluctuations propagate through the iterations. [45] show that the gradient flow trajectories and minimizers of (24) satisfy specific symmetries, so one should hope to reduce the algebraic complexity of the problem by finding analogous symmetries in the stochastic recursions (25).
Alternatively, one may consider layer-wise training of the weight matrices to break the dependence between and . Given , suppose we keep fixed while taking gradient steps
followed by gradient steps of the form
In each phase, the gradient descent recursion solves a linear regression problem similar to our analysis of the linear model. We leave the details for future work.
Appendix A Proofs for Section 3
A.1 Proof of Equation (10)
A.2 Proof of Equation (11)
Let and consider a singular value decomposition . The left-singular vectors , are orthonormal, meaning and since ,
Each left-singular vector is an eigenvector of , with associated eigenvalue . If , suppose , complete the orthonormal basis, then . Consequently, admits as an eigenvector for every and
By definition, and . Combining these facts now leads to
and
∎
A.3 Proof of Equation (12)
Set and consider with and . Observe that
and thus . If is the zero vector, . Otherwise, implies being positive definite, so we have the strict inequality whenever .
Now, suppose that has an eigenvector with corresponding eigenvalue , which implies . Then, we have . Equality only holds if . This contradicts the strict inequality so all eigenvalues have to be strictly smaller than one. ∎
Appendix B Proofs for Section 4
B.1 Proof of Lemma 1
To show that , note that by Lemma 13 and recall from Assumption 1. Hence,
(27)
For any vector with and any two positive semi-definite matrices and , . Hence, and . By Assumption 1, the design matrix has no zero columns, guaranteeing . Combined with (27), we now obtain .
To prove the bound on the expectation, recall from (14) that equals . Lemma 10(a) shows that and Lemma 9(b) gives . In turn, we have and
(28)
Conditioning on all randomness except now implies
(29)
By the tower rule , so induction on gives
Sub-multiplicativity of the spectral norm implies , proving that . ∎
B.2 Proof of Theorem 1
We have , so the triangle inequality implies
(30)
Write , then . When conditioned on , the estimator is deterministic. Hence, the law of total covariance yields
Further, implies . Using , note that with . Combining these identities, sub-multiplicativity of the spectral norm, and the triangle inequality leads to
Together with (30), this proves the result. ∎
B.3 Proof of Lemma 2
Given a random vector and a random element , observe that . Inserting and , as well as defining , leads to
(31)
Recall from (29), and so
(32)
where .
Evaluating the conditional covariance is the more challenging part, requiring moments up to fourth order in , see Lemma 11. Recall that
Lemma 5.
For every positive integer ,
with remainder vanishing at the rate
Proof Recall from (14) that . The covariance is invariant under deterministic shifts and sign flips, so
Applying Lemma 11 with , , and , we find
(33)
Set and recall . Note the identities and . Taking the expectation of (33), multiplying both sides with , and using the definition of proves the claimed expression for with remainder term
(34)
For any matrices and , Lemma 13 provides the inequalities , , and . If is moreover positive semi-definite, then also . Combined with the sub-multiplicativity of the spectral norm, this implies
By Assumption 1 also , so combining the upper-bounds with (34) leads to
(35)
B.4 Proof of Theorem 2
Let be the affine operator introduced in Lemma 2 and recall the definitions and . First, the operator norm of will be analyzed.
Lemma 6.
The linear operator satisfies , provided that
where denotes the smallest eigenvalue of .
Proof Let be a matrix. Applying the triangle inequality, Lemma 13, and sub-multiplicativity of the spectral norm,
where the second inequality follows from .
As shown in (27), . Lemma 13 now implies , so that
If , then also , so that in turn . The constraint now enforces , which implies . ∎
As before, set for each and let . Using induction on , we now prove
(36)
Taking , , so the claimed identity holds. Assuming the identity is true for , the recursion leads to
thereby establishing the induction step and proving (36).
Assuming , we move on to show the bound
with the identity operator on , and . By linearity, . Since , Lemma 12 asserts that and
(37)
Lemma 2 ensures for all . Moreover, and hence for every , so the triangle inequality implies
(38)
B.5 Proof of Lemma 3
Applying Theorem 1, Lemma 1, and the triangle inequality,
(40)
Lemma 13 implies , so that . Next, entails equality between and . The second term on the right-hand side of (40) is then bounded by , for some constant independent of .
To prove the first claim of the lemma, it now suffices to show
(41)
where and are constants independent of . As , the constant in Theorem 2 satisfies , with depending only on the distribution of . Consequently, Theorem 2 and the triangle inequality imply
(42)
Consider a bounded linear operator on satisfying . For an arbitrary matrix , Lemma 12 asserts and therefore . Theorem 1 states that . As shown following (27), . Therefore,
Taking in Lemma 2, . Using Lemma 13 and ,
proving that
(43)
Note that by Assumption 1. Applying these bounds in (42) leads to
which proves (41). Combined with (40), this proves the first claim of the lemma since
(44)
To start proving the second claim, recall that . Hence, the triangle inequality leads to
Let , , and be square matrices of the same dimension, with and invertible. Observe the identity , so sub-multiplicativity implies . Using , and inserting , , and results in
with independent of . Combined with (44), this results in
which proves the second claim of the lemma by enlarging , if necessary. ∎
B.6 Proof of Theorem 3
Start by noting that for symmetric matrices, see [24], Theorem 4.2.6. Using super-additivity of infima, observe the lower bound
(45)
Combining Lemma 1 and Theorem 1, the limit superior in (45) vanishes. Further, converges to by Theorem 2, so it suffices to analyze the latter matrix.
For the next step, the matrix in Theorem 2 will be lower-bounded. Taking in Lemma 2 and exchanging the expectation with the operator results in
(46)
The first matrix in (46) is always positive semi-definite and we will now lower bound the matrix . Given distinct , symmetry of implies
In turn, for any unit-length vector ,
(47)
where the last equality follows by noting that each square appears twice in (47) since the expression is symmetric in . Every summand in (47) is non-negative. If , then there exists such that . Write for the vector with entries
By construction, . Recall that and note that . Together with and , (47) now satisfies
As , this proves the matrix inequality
(48)
Next, let be a centered random vector with covariance matrix and suppose is a dropout matrix, independent of . Conditioning on , the law of total variance states
(49)
(50)
Applying Lemma 11 with , , and now shows that . The second term in (50) is always positive semi-definite, proving that . As and , this implies
Lemma 13 moreover gives . Together with the lower-bound (48) for , this proves the result. ∎
B.7 Proof of Theorem 4
As in Section 4.1, write for the running average of the iterates and define
(51)
Suppose and take . Using induction on , we now prove that . The identity always holds when . Next, suppose the identity holds for some . Taking in (14), . Since , is by assumption independent of . Recall from (28) that . Conditioning on and applying the tower rule now gives
Together with the induction hypothesis, this proves the desired equality
For , transposing and flipping the roles of and also shows that with .
Defining and taking , (51) may now be rewritten as
(52)
Set , then by Lemma 1. Note also that . Using the triangle inequality and sub-multiplicativity of the spectral norm, (52) then satisfies
As shown in Theorem 2, for some constant . Observing that , this implies
To complete the proof note that may be rewritten as and by (43). ∎
B.8 Proof of Lemma 4
Recall the definition .
Lemma 7.
For every
with equality if almost surely.
Proof Recall the definition , so that . As shown in (22), and hence
By definition, . Recall from (20) that , so implies
Together, these identities prove the first claim.
The lower bound follows from being positive semi-definite. A positive semi-definite matrix has non-negative diagonal entries, meaning is also positive semi-definite. Next, note that and in turn
(53)
Together with the first part of the lemma and the definition of , the lower-bound follows.
Consider arbitrary positive semi-definite matrices , then for all vectors . Given any vector , this implies
with the th standard basis vector. Accordingly, is operator monotone with respect to the ordering of positive semi-definite matrices, in the sense that whenever . Using induction on , Lemma 7 may now be rewritten as
(54)
To complete the proof, the right-hand side of (54) will be analyzed for a suitable choice of .
Lemma 8.
Suppose is independent of all other sources of randomness. Consider the linear regression model with a single observation , number of parameters , and design matrix . Then, and for any -dimensional vector satisfying and every
Proof By definition, is the -matrix with all entries equal to one. Consequently, for all and , so satisfies the normal equations .
To prove , note that whenever . By conditioning on , the identity was shown in (20). The same argument also proves . Induction on and the assumed independence between and now lead to
(55)
Next, note that and . As , (55) satisfies
which proves the first claim.
To prove the second claim, we first show that there are real sequences and , not depending on the distribution of , such that
(56)
for all , with equality if almost surely. When , independence between and , as well as , imply . Moreover, equality holds whenever is deterministic.
For the sake of induction suppose the claim is true for some . Lemma 7 and operator monotonicity of then imply
In case almost surely, Lemma 7 and the induction hypothesis assert equality in the previous display. Recall , so as well as . Note also that . Setting , expanding the definition now results in
(57)
This establishes the induction step and thereby proves (56) for all .
As do not depend on the distribution of , taking shows that for all since
Consequently, (57) implies , proving that is non-decreasing in .
Lastly, we show that . To this end, recall that and . As is operator monotone and , independence of and results in
so .
To complete the proof, observe that implies . Accordingly,
which yields the second claim of the lemma. ∎
B.9 Proof of Theorem 5
For an arbitrary matrix , the triangle inequality, Lemma 13, and submultiplicativity of the spectral norm imply
(58)
As is positive definite, implies . If , then
Together with (58) this leads to . By induction on , , completing the proof. ∎
Appendix C Higher Moments of Dropout Matrices
Deriving concise closed-form expressions for third and fourth order expectations of the dropout matrices presents one of the main technical challenges encountered in Section B.
All matrices in this section will be of dimension and all vectors of length . Moreover, always denotes a random diagonal matrix such that for all . The diagonal entries of are elements of , meaning for all positive powers .
Given a matrix and , recall the definitions
The first lemma contains some simple identities.
Lemma 9.
For arbitrary matrices and , , and a diagonal matrix ,
(a)
(b)
(c)
Proof
(a) By definition, for all , which equals . The second equality then follows by transposition.
(b) Clearly, and in turn . On the other hand, and hence equals as well.
(c) Observe that equals . As , the claim follows.
∎
With these basic properties at hand, higher moments involving the dropout matrix may be computed by carefully accounting for equalities between the involved indices.
Lemma 10.
Given arbitrary matrices , , and , the following hold:
(a)
(b)
(c)
Proof
(a) Recall that and hence , meaning . On the other hand, implies due to independence of and . Combining both identities,
(b) First, note that and commutativity of diagonal matrices imply
(59)
Applying Lemma 9(a) twice, has no non-zero diagonal entries. Moreover, taking distinct,
so that both and . Therefore,
Reinserting this expression into the expectation of (59) and applying part (a) of the lemma now results in the claimed identity
where by Lemma 9(a).
(c) Following a similar strategy as in part (b), observe that
(60)
By construction of the latter matrix,
meaning is always distinct from the other indices. The index corresponds to the third -matrix from the left, so this proves . Reversing the overlines of the latter expression in order, note that
Note that the subtracted terms match those added to in (60) exactly. In turn, these identities prove
(61)
The first and second term in the last equality may be computed via part (b) of the lemma, whereas the third term remains to be treated.
By definition, the diagonal entries of and are all zero, so equals for all . Moreover, taking implies
On the set , the entry is independent of and . Consequently,
In matrix form, the previous equation reads
where denotes the Hadamard product.
Reinserting the computed expressions into (61), as well as noting that and yields
(62) Next, using the identity of Lemma 9(b), we combine the first and third terms of the latter display into
Regarding the second term of (62), observe that
where the second equality follows from . Lastly, by Lemma 9(c), so the fourth and fifth term of (62) combine into
where the common factor is omitted in the display.
∎
In principle, any computations involving higher moments of may be accomplished with the proof strategy of Lemma 10. A particular covariance matrix is needed in Section B, which will be given in the next lemma.
Lemma 11.
Given a symmetric matrix , as well as vectors and ,
Proof The covariance of the sum is given by
(63)
where each of the latter terms will be treated separately. To this end, set , , and .
First, recall that by Lemma 10(a) and , so the definition of covariance implies
(64)
Moving on to , observe that
By the same argument as in (28),
(65)
so that in turn
Recall from Lemma 9(b). Applying Lemma 10(b) for the first term in the previous display and Lemma 10(a) for the second term now leads to
(66)
Using a completely analogous argument, the reflected term satisfies
(67)
The last term necessitates another decomposition into four sub-problems. First, recall from (65) that vanishes, which leads to
(68)
where the last term is negative since . Recall once more the identity from Lemma 9(b) and apply Lemma 10(c), to rewrite as
As for , start by noting that
where Lemma 10(a) computes the second expectation and Lemma 10(b) the first expectation. To progress, note first the identities
so that
By symmetry, the reflected term then also satisfies . Lastly, applying Lemma 10(a) to results in . To finish the treatment of , inserting the computed expressions for , and into (C) and combining like terms now leads to
To conclude the proof, insert this expression for into (C), together with as in (64), as in (67), and as in (C) to obtain the desired identity
∎
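As a numerical illustration of the index-counting strategy used in this appendix, the sketch below compares a Monte Carlo estimate of the third-order moment E[D A D B D] for a Bernoulli dropout matrix with a direct computation that only uses the idempotence of the diagonal entries, so that E[d_i d_j d_l] equals p raised to the number of distinct indices; the matrices and sizes are arbitrary and the check is a sanity test, not a statement from the paper.

```python
import numpy as np

rng = np.random.default_rng(9)
d, p, reps = 3, 0.6, 100_000
A = rng.standard_normal((d, d))
B = rng.standard_normal((d, d))

# Monte Carlo estimate of E[D A D B D] with Bernoulli(p) diagonal entries.
mc = np.zeros((d, d))
for _ in range(reps):
    D = np.diag(rng.binomial(1, p, size=d).astype(float))
    mc += D @ A @ D @ B @ D
mc /= reps

# Direct computation: (D A D B D)_{il} = sum_j d_i A_{ij} d_j B_{jl} d_l and,
# by idempotence and independence, E[d_i d_j d_l] = p ** #{distinct indices}.
exact = np.zeros((d, d))
for i in range(d):
    for l in range(d):
        for j in range(d):
            exact[i, l] += A[i, j] * B[j, l] * p ** len({i, j, l})

print(np.abs(mc - exact).max())   # small Monte Carlo error
```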
Appendix D Auxiliary Results
Below we collect identities and definitions referenced in other sections.
Neumann series: Let denote a real or complex Banach space with norm . Recall that the operator norm of a linear operator on is given by .
Lemma 12 ([20], Proposition 5.3.4).
Suppose is bounded, linear, and satisfies , then is invertible and .
Bounds on singular values: Recall that denotes the Euclidean norm on and the spectral norm on , which is given by the largest singular value . The spectral norm is sub-multiplicative in the sense that . The spectral norm of a vector , viewed as a linear functional on , is given by , proving that
(69)
for any vectors and of the same length.
Recall the definitions and with .
Lemma 13.
Given matrices and , the inequalities , , and hold. Moreover, if is symmetric and positive semi-definite, then also .
Proof For any matrix , the maximal singular value can be computed from the variational formulation, see [24], Theorem 4.2.6.
Let denote the th standard basis vector. The variational formulation of the maximal singular value implies which is bounded by . The latter is further bounded by , proving the first statement. The second inequality follows from the first since
For the inequality concerning the Hadamard product, see Theorem 5.5.7 of [23].
For the last inequality, note that semi-definiteness entails . For fixed , this ensures , which completes the proof. ∎
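The Hadamard-product inequality invoked above can also be probed numerically. The sketch below checks the standard bound that the spectral norm of A ∘ B is at most the product of the spectral norms of A and B; it is offered only as an illustration and may differ from the precise inequality of Theorem 5.5.7 of [23] used in the proof.

```python
import numpy as np

rng = np.random.default_rng(4)

def spec(M):
    """Spectral norm, i.e. the largest singular value."""
    return np.linalg.norm(M, 2)

# Random trials of the bound spec(A ∘ B) <= spec(A) * spec(B),
# where A * B below denotes the entrywise (Hadamard) product.
for _ in range(1000):
    A, B = rng.standard_normal((6, 6)), rng.standard_normal((6, 6))
    assert spec(A * B) <= spec(A) * spec(B) + 1e-10

print("Hadamard-product bound held in all 1000 trials")
```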
References
- [1] Martín Abadi et al. “TensorFlow: A System for Large-Scale Machine Learning” In 12th USENIX Symposium on Operating Systems Design and Implementation USENIX Association, 2016, pp. 265–283
- [2] Donald W. K. Andrews “Laws of Large Numbers for Dependent Non-Identically Distributed Random Variables” In Econometric Theory 4.3, 1988, pp. 458–467
- [3] Raman Arora, Peter Bartlett, Poorya Mianjy and Nathan Srebro “Dropout: Explicit Forms and Capacity Control” In 38th International Conference on Machine Learning Proceedings of Machine Learning Research, 2021, pp. 351–361
- [4] Jimmy Ba and Brendan Frey “Adaptive dropout for training deep neural networks” In Advances in Neural Information Processing Systems 26 Curran Associates, Inc., 2013, pp. 3084–3092
- [5] Bubacarr Bah, Holger Rauhut, Ulrich Terstiege and Michael Westdickenberg “Learning deep linear neural networks: Riemannian gradient flows and convergence to global minimizers” In Information and Inference: A Journal of the IMA 11.1, 2022, pp. 307–353
- [6] Pierre Baldi and Peter J. Sadowski “Understanding Dropout” In Advances in Neural Information Processing Systems 26 Curran Associates, Inc., 2013, pp. 2814–2822
- [7] Peter L. Bartlett, Philip M. Long and Olivier Bousquet “The Dynamics of Sharpness-Aware Minimization: Bouncing Across Ravines and Drifting Towards Wide Minima” In Journal of Machine Learning Research 24.316, 2023, pp. 1–36
- [8] Thijs Bos and Johannes Schmidt-Hieber “Convergence guarantees for forward gradient descent in the linear regression model”, arXiv:2309.15001 [math.ST], 2023
- [9] Jacopo Cavazza et al. “Dropout as a Low-Rank Regularizer for Matrix Factorization” In 21st International Conference on Artificial Intelligence and Statistics Proceedings of Machine Learning Research, 2018, pp. 435–444
- [10] François Chollet “Keras”, https://keras.io, 2015
- [11] George Cybenko “Approximation by superpositions of a sigmoidal function” In Mathematics of Control, Signals, and Systems 2.4, 1989, pp. 303–314
- [12] Steffen Dereich and Sebastian Kassing “Central limit theorems for stochastic gradient descent with averaging for stable manifolds” In Electronic Journal of Probability 28, 2023, pp. 1–48
- [13] Lutz Dümbgen, Richard J. Samworth and Dominic Schuhmacher “Stochastic search for semiparametric linear regression models” In From Probability to Statistics and Back: High-Dimensional Models and Processes. A Festschrift in Honor of Jon A. Wellner Institute of Mathematical Statistics, 2013, pp. 78–90
- [14] Bradley Efron and Trevor Hastie “Computer Age Statistical Inference. Algorithms, Evidence, and Data Science” 5, Institute of Mathematical Statistics Monographs Cambridge University Press, 2016
- [15] Yarin Gal and Zoubin Ghahramani “A Theoretically Grounded Application of Dropout in Recurrent Neural Networks” In Advances in Neural Information Processing Systems 29 Curran Associates, Inc., 2016, pp. 1027–1035
- [16] Yarin Gal and Zoubin Ghahramani “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning” In 33rd International Conference on International Conference on Machine Learning Proceedings of Machine Learning Research, 2016, pp. 1050–1059
- [17] Wei Gao and Zhi-Hua Zhou “Dropout Rademacher complexity of deep neural networks” In Science China Information Sciences 59.7, 2016, pp. 2104:1–12
- [18] Ian Goodfellow, Yoshua Bengio and Aaron Courville “Deep learning”, Adaptive Computation and Machine Learning MIT Press, 2016
- [19] László Györfi and Harro Walk “On the Averaged Stochastic Approximation for Linear Regression” In SIAM Journal on Control and Optimization 34.1, 1996, pp. 31–61
- [20] Aleksandr Ya. Helemskii “Lectures and Exercises on Functional Analysis” 233, Translations of Mathematical Monographs American Mathematical Society, 2006
- [21] David P. Helmbold and Philip M. Long “Surprising properties of dropout in deep networks” In Conference on Learning Theory Proceedings of Machine Learning Research, 2017, pp. 1123–1146
- [22] Jonathan Hill and Liang Peng “Unified interval estimation for random coefficient autoregressive models” In Journal of Time Series Analysis 35.3, 2014, pp. 282–297
- [23] Roger A. Horn and Charles R. Johnson “Topics in Matrix Analysis” Cambridge University Press, 1991
- [24] Roger A. Horn and Charles R. Johnson “Matrix Analysis” Cambridge University Press, 2013
- [25] Kurt Hornik “Approximation capabilities of multilayer feedforward networks” In Neural Networks 4.2, 1991, pp. 251–257
- [26] Yangqing Jia et al. “Caffe: Convolutional Architecture for Fast Feature Embedding” In 22nd ACM International Conference on Multimedia Association for Computing Machinery, 2014, pp. 675–678
- [27] Diederik P. Kingma, Tim Salimans and Max Welling “Variational Dropout and the Local Reparameterization Trick” In Advances in Neural Information Processing Systems 28 Curran Associates, Inc., 2015, pp. 2575–2583
- [28] Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton “ImageNet Classification with Deep Convolutional Neural Networks” In Advances in Neural Information Processing Systems 25 Curran Associates, Inc., 2012, pp. 1097–1105
- [29] Moshe Leshno, Vladimir Ya. Lin, Allan Pinkus and Shimon Schocken “Multilayer feedforward networks with a nonpolynomial activation function can approximate any function” In Neural Networks 6.6, 1993, pp. 861–867
- [30] Oxana A. Manita et al. “Universal Approximation in Dropout Neural Networks” In Journal of Machine Learning Research 23.19, 2022, pp. 1–46
- [31] David McAllester “A PAC-Bayesian Tutorial with A Dropout Bound”, arXiv:1307.2118 [cs.LG], 2013
- [32] Poorya Mianjy and Raman Arora “On Dropout and Nuclear Norm Regularization” In 36th International Conference on Machine Learning Proceedings of Machine Learning Research, 2019, pp. 4575–4584
- [33] Poorya Mianjy and Raman Arora “On Convergence and Generalization of Dropout Training” In Advances in Neural Information Processing Systems 33 Curran Associates, Inc., 2020, pp. 21151–21161
- [34] Poorya Mianjy, Raman Arora and Rene Vidal “On the Implicit Bias of Dropout” In 35th International Conference on Machine Learning Proceedings of Machine Learning Research, 2018, pp. 3540–3548
- [35] Reza Moradi, Reza Berangi and Behrouz Minaei “A Survey of Regularization Strategies for Deep Models” In Artificial Intelligence Review 53.6, 2020, pp. 3947–3986
- [36] Gabin Maxime Nguegnang, Holger Rauhut and Ulrich Terstiege “Convergence of gradient descent for learning linear neural networks”, arXiv:2108.02040 [cs.LG], 2021
- [37] Des F. Nicholls and Barry G. Quinn “Random Coefficient Autoregressive Models: An Introduction” 11, Lecture Notes in Statistics Springer New York, 1982
- [38] Adam Paszke et al. “PyTorch: An Imperative Style, High-Performance Deep Learning Library” In Advances in Neural Information Processing Systems 32 Curran Associates, Inc., 2019
- [39] Boris Polyak “New method of stochastic approximation type” In Avtomatika i Telemekhanika 7, 1990, pp. 98–107
- [40] Boris Polyak and Anatoli B. Juditsky “Acceleration of stochastic approximation by averaging” In SIAM Journal on Control and Optimization 30.4, 1992, pp. 838–855
- [41] Marta Regis, Paulo Serra and Edwin R. van den Heuvel “Random autoregressive models: a structured overview” In Econometric Reviews 41.2, 2022, pp. 207–230
- [42] David Ruppert “Efficient Estimations from a Slowly Convergent Robbins-Monro Process”, Technical Report 781. Cornell University Operations Research and Industrial Engineering, 1988
- [43] Claudio Filipi Gonçalves Dos Santos and João Paulo Papa “Avoiding Overfitting: A Survey on Regularization Methods for Convolutional Neural Networks” In ACM Computing Surveys 54.10s, 2022, pp. 213:1–25
- [44] Johannes Schmidt-Hieber and Wouter M. Koolen “Hebbian learning inspired estimation of the linear regression parameters from queries”, arXiv:2311.03483 [math.ST], 2023
- [45] Albert Senen-Cerda and Jaron Sanders “Asymptotic Convergence Rate of Dropout on Shallow Linear Neural Networks” In Proceedings of the ACM on Measurement and Analysis of Computing Systems 6.2, 2022, pp. 32:1–53
- [46] Nitish Srivastava et al. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” In Journal of Machine Learning Research 15.56, 2014, pp. 1929–1958
- [47] Alexander Tsigler and Peter L. Bartlett “Benign Overfitting in Ridge Regression” In Journal of Machine Learning Research 24.123, 2023, pp. 1–76
- [48] Stefan Wager, Sida Wang and Percy S. Liang “Dropout Training as Adaptive Regularization” In Advances in Neural Information Processing Systems 26 Curran Associates, Inc., 2013, pp. 351–359
- [49] Li Wan et al. “Regularization of Neural Networks using DropConnect” In 30th International Conference on Machine Learning Proceedings of Machine Learning Research, 2013, pp. 1058–1066
- [50] Sida Wang and Christopher Manning “Fast dropout training” In 30th International Conference on Machine Learning Proceedings of Machine Learning Research, 2013, pp. 118–126
- [51] Colin Wei, Sham Kakade and Tengyu Ma “The Implicit and Explicit Regularization Effects of Dropout” In 37th International Conference on Machine Learning Proceedings of Machine Learning Research, 2020, pp. 10181–10192
- [52] Haibing Wu and Xiaodong Gu “Towards dropout training for convolutional neural networks” In Neural Networks 71, 2015, pp. 1–10
- [53] Ke Zhai and Huan Wang “Adaptive Dropout with Rademacher Complexity Regularization” In 6th International Conference on Learning Representations, 2018
- [54] Ruiqi Zhang, Spencer Frei and Peter L. Bartlett “Trained Transformers Learn Linear Models In-Context” In Journal of Machine Learning Research 25.49, 2024, pp. 1–55
- [55] Wanrong Zhu, Xi Chen and Wei Biao Wu “Online Covariance Matrix Estimation in Stochastic Gradient Descent” In Journal of the American Statistical Association 118.541, 2021, pp. 393–404