
Dropout Regularization Versus $\ell_2$-Penalization in the Linear Model

Gabriel Clara (corresponding author: [email protected])
Faculty of Electrical Engineering, Mathematics, and Computer Science, University of Twente

Sophie Langer
Faculty of Electrical Engineering, Mathematics, and Computer Science, University of Twente

Johannes Schmidt-Hieber
Faculty of Electrical Engineering, Mathematics, and Computer Science, University of Twente

The authors thank an anonymous editor and three anonymous referees for their valuable time and effort; their comments improved the article tremendously. This publication is part of the project Statistical foundation for multilayer neural networks (project number VI.Vidi.192.021 of the Vidi ENW programme) financed by the Dutch Research Council (NWO).
Abstract

We investigate the statistical behavior of gradient descent iterates with dropout in the linear regression model. In particular, non-asymptotic bounds for the convergence of expectations and covariance matrices of the iterates are derived. The results shed more light on the widely cited connection between dropout and $\ell_2$-regularization in the linear model. We indicate a more subtle relationship, owing to interactions between the gradient descent dynamics and the additional randomness induced by dropout. Further, we study a simplified variant of dropout which does not have a regularizing effect and converges to the least squares estimator.

1 Introduction

Dropout is a simple, yet effective, algorithmic regularization technique, intended to prevent neural networks from overfitting. First introduced in [46], the method is implemented via random masking of neurons at training time. Specifically, during every gradient descent iteration, the output of each neuron is replaced by zero based on the outcome of an independently sampled $\mathrm{Ber}(p)$-distributed variable. This temporarily removes each neuron with a probability of $1-p$, see Figure 1 for an illustration. The method has demonstrated effectiveness in various applications, see for example [28, 46]. On the theoretical side, dropout is often studied by exhibiting connections with explicit regularizers [3, 6, 9, 31, 32, 34, 45, 46, 48]. Rather than analyzing the gradient descent iterates with dropout noise, these results consider the marginalized training loss with marginalization over the dropout noise. Within this framework, [46] established a connection between dropout and weighted $\ell_2$-penalization in the linear regression model. This connection is now cited in popular textbooks [14, 18].

However, [51] show empirically that injecting dropout noise in the gradient descent iterates also induces an implicit regularization effect that is not captured by the link between the marginalized loss and explicit regularization. This motivates our approach to directly derive the statistical properties of gradient descent iterates with dropout. We study the linear regression model due to mathematical tractability and because the minimizer of the explicit regularizer is unique and admits a closed-form expression. In line with the implicit regularization observed in [51], our main result provides a theoretical bound quantifying the amount of randomness in the gradient descent scheme that is ignored by the previously considered minimizers of the marginalized training loss. More specifically, Theorem 2 shows that for a fixed learning rate there is additional randomness which fails to vanish in the limit, while Theorem 3 characterizes the gap between dropout and $\ell_2$-penalization with respect to the learning rate, the dropout parameter $p$, the design matrix, and the distribution of the initial iterate. Theorem 4 shows that this gap disappears for the Ruppert-Polyak averages of the iterates.

Figure 1: Regular neurons (left) with all connections active. Sample of the same neurons with dropout (right). The dashed connections are ignored during the current iteration of training.

To provide a clearer understanding of the interplay between gradient descent and variance, we also investigate a simplified variant of dropout featuring more straightforward interactions between the two. Applying the same analytical techniques to this simplified variant, Theorem 5 establishes convergence in quadratic mean to the conventional linear least-squares estimator. This analysis illustrates the sensitivity of gradient descent to small changes in the way noise is injected during training.

Many randomized optimization methods can be formulated as noisy gradient descent schemes. The developed strategy to treat gradient descent with dropout may be generalized to other settings. An example is the recent analysis of forward gradient descent in [8].

The article is organized as follows. After discussing related results below, Section 2 contains preliminaries and introduces two different variants of dropout. Section 3 discusses some extensions of previous results for averaged dropout obtained by marginalizing over the dropout distribution in the linear model considered in [46]. Section 4 presents the main results on gradient descent with dropout in the linear model and examines their statistical optimality. Section 5 contains further discussion and mentions a number of natural follow-up problems. All proofs are deferred to the Appendix.

1.1 Other Related Work

Considering linear regression and the marginalized training loss with marginalization over the dropout noise, the initial dropout article [46] already connects dropout with $\ell_2$-regularization. This connection was also noted by [6] and by [31]. As this argument is crucial in our own analysis, we will discuss it in more detail in Section 3.

[48] extends the reasoning to generalized linear models and more general forms of injected noise. Employing a quadratic approximation to the loss function after marginalization over the injected noise, the authors exhibit an explicit regularizer. In the case of dropout noise, this regularizer induces, to first order, an $\ell_2$-penalty after rescaling of the data by the estimated inverse of the diagonal Fisher information.

For two-layer models, marginalizing the dropout noise leads to a nuclear norm penalty on the product matrix, both in matrix factorization [9] and linear neural networks [34]. The latter may be seen as a special case of a particular “$\ell_2$-path regularizer”, which appears in deep linear networks [32] and shallow ReLU-activated networks [3]. Further, [3] exhibit a data distribution-dependent regularizer in two-layer matrix sensing/completion problems. This regularizer collapses to a nuclear norm penalty for specific distributions.

[16] show that empirical risk minimization in deep neural networks with dropout may be recast as performing Bayesian variational inference to approximate the intractable posterior resulting from a deep Gaussian process prior. The Bayesian viewpoint also allows for the quantification of uncertainty. [15] further generalizes this technique to recurrent and long short-term memory (LSTM) networks. [52] analyze dropout applied to the max-pooling layers in convolutional neural networks. [50] present a Gaussian approximation to the gradient noise induced by dropout.

Generalization results for dropout training exist in various settings. Given bounds on the norms of weight vectors, [49], [17], and [53] prove decreasing Rademacher complexity bounds as the dropout rate increases. [3] bound the Rademacher complexity of shallow ReLU-activated networks with dropout. [31] obtains a PAC-Bayes bound for dropout and illustrates a trade-off between large and small dropout probabilities for different terms in the bound.

Recently, [30] demonstrated a universal approximation result in the vein of classic results [11, 25, 29], stating that any function in some generic semi-normed space that can be $\varepsilon$-approximated by a deterministic neural network may also be stochastically approximated in $L^q$-norm by a sufficiently large network with dropout.

Less is known about gradient descent training with dropout. [45] study the gradient flow associated with the explicit regularizer obtained by marginalizing the dropout noise in a shallow linear network. In particular, the flow converges exponentially fast within a neighborhood of a parameter vector satisfying a balancing condition. [33] study gradient descent with dropout on the logistic loss of a shallow ReLU-activated network in a binary classification task. Their main result includes an explicit rate for the misclassification error, assuming an overparametrized network operating in the so-called lazy regime, where the trained weights stay relatively close to their initializations, and two well-separated classes.

1.2 Notation

Column vectors $\mathbf{x}=(x_1,\dots,x_d)^{\top}$ are denoted by bold letters. We define $\mathbf{0}:=(0,\ldots,0)^{\top}$, $\mathbf{1}:=(1,\ldots,1)^{\top}$, and the Euclidean norm $\lVert\mathbf{x}\rVert_2:=\sqrt{\mathbf{x}^{\top}\mathbf{x}}$. The $d\times d$ identity matrix is symbolized by $I_d$, or simply $I$, when the dimension $d$ is clear from context. For matrices $A,B$ of the same dimension, $A\odot B$ denotes the Hadamard/entry-wise product $(A\odot B)_{ij}=A_{ij}B_{ij}$. We write $\mathrm{Diag}(A):=I\odot A$ for the diagonal matrix with the same main diagonal as $A$. Given $p\in(0,1)$, we define the matrices

\overline{A} := A - \mathrm{Diag}(A),
A_p := pA + (1-p)\mathrm{Diag}(A).

In particular, $A_p = p\overline{A} + \mathrm{Diag}(A)$, so $A_p$ results from rescaling the off-diagonal entries of $A$ by $p$.
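
As a concrete illustration (our own sketch, not part of the paper), these definitions translate directly into NumPy; the function and variable names below are ours:

import numpy as np

def off_diagonal(A):                       # \overline{A} = A - Diag(A)
    return A - np.diag(np.diag(A))

def rescaled(A, p):                        # A_p = p*A + (1-p)*Diag(A)
    return p * A + (1 - p) * np.diag(np.diag(A))

A = np.array([[2.0, 1.0], [1.0, 3.0]])
p = 0.8
# A_p also equals p*\overline{A} + Diag(A), as noted in the text
assert np.allclose(rescaled(A, p), p * off_diagonal(A) + np.diag(np.diag(A)))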

The smallest eigenvalue of a symmetric matrix $A$ is denoted by $\lambda_{\mathrm{min}}(A)$. The operator norm of a linear operator $T:V\to W$ between normed linear spaces is given by $\lVert T\rVert_{\mathrm{op}}:=\sup_{v\in V:\lVert v\rVert_V\leq 1}\lVert Tv\rVert_W$. We write $\lVert\,\cdot\,\rVert$ for the spectral norm of matrices, which is the operator norm induced by $\lVert\,\cdot\,\rVert_2$. For symmetric matrices, the relation $A\geq B$ signifies $\mathbf{x}^{\top}(A-B)\mathbf{x}\geq 0$ for all non-zero vectors $\mathbf{x}$. The strict operator inequality $A>B$ is defined analogously.

2 Gradient Descent and Dropout

We consider a linear regression model with fixed $n\times d$ design matrix $X$ and $n$ outcomes $\mathbf{Y}$, so that

\mathbf{Y} = X\bm{\beta}_{\star} + \bm{\varepsilon}, \qquad (1)

with unknown parameter $\bm{\beta}_{\star}$. We assume $\mathbb{E}[\bm{\varepsilon}]=\mathbf{0}$ and $\mathrm{Cov}(\bm{\varepsilon})=I_n$. The task is to estimate $\bm{\beta}_{\star}$ from the observed data $(X,\mathbf{Y})$. As the Gram matrix $X^{\top}X$ appears throughout our analysis, we introduce the shorthand

\mathbb{X} := X^{\top}X.

Recovery of $\bm{\beta}_{\star}$ in the linear regression model (1) may be interpreted as training a neural network without intermediate hidden layers, see Figure 2. If $X$ were to have a zero column, the corresponding regression coefficient would not affect the response vector $\mathbf{Y}$. Consequently, both the zero column and the regression coefficient may be eliminated from the linear regression model. Without zero columns, the model is said to be in reduced form.

Figure 2: The linear regression model $y=\sum_{i=1}^{d}\beta_i x_i$, viewed as a neural network without hidden layers.

The least squares criterion for the estimation of $\bm{\beta}_{\star}$ refers to the objective function $\bm{\beta}\mapsto\tfrac{1}{2}\lVert\mathbf{Y}-X\bm{\beta}\rVert^2_2$. Given a fixed learning rate $\alpha>0$, performing gradient descent on the least squares objective leads to the iterative scheme

\widetilde{\bm{\beta}}^{\mathrm{gd}}_{k+1} = \widetilde{\bm{\beta}}^{\mathrm{gd}}_{k} - \alpha\nabla_{\widetilde{\bm{\beta}}^{\mathrm{gd}}_{k}}\dfrac{1}{2}\big\lVert\mathbf{Y}-X\widetilde{\bm{\beta}}^{\mathrm{gd}}_{k}\big\rVert^{2}_{2} = \widetilde{\bm{\beta}}^{\mathrm{gd}}_{k} + \alpha X^{\top}\big(\mathbf{Y}-X\widetilde{\bm{\beta}}^{\mathrm{gd}}_{k}\big) \qquad (2)

with $k=0,1,2,\ldots$ and (possibly random) initialization $\widetilde{\bm{\beta}}^{\mathrm{gd}}_{0}$.
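
For illustration only (not from the paper), a minimal NumPy sketch of the recursion (2) on synthetic data; all names and the choice of learning rate are our own:

import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=d) + rng.normal(size=n)       # Y = X beta_star + eps

alpha = 0.9 / np.linalg.norm(X.T @ X, 2)              # step size below 1/||X^T X||
beta = np.zeros(d)                                    # initialization beta_0^gd
for _ in range(2000):
    beta = beta + alpha * X.T @ (Y - X @ beta)        # update (2)
# beta now approximates the least squares estimator (X^T X)^{-1} X^T Y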

For standard gradient descent as defined in (2), the estimate is updated with the gradient of the full model. Dropout, as introduced in [46], replaces the gradient of the full model with the gradient of a randomly reduced model during each iteration of training. To make this notion more precise, we call a random diagonal $d\times d$ matrix $D$ a $p$-dropout matrix, or simply a dropout matrix, if its diagonal entries satisfy $D_{ii}\overset{\scriptscriptstyle i.i.d.}{\sim}\mathrm{Ber}(p)$ for some $p\in(0,1)$. We note that the Bernoulli distribution may alternatively be parametrized with the failure probability $q:=1-p$, but following [46] we choose the success probability $p$.

On average, $D$ has $pd$ diagonal entries equal to $1$ and $(1-p)d$ diagonal entries equal to $0$. Given any vector $\bm{\beta}$, the coordinates of $D\bm{\beta}$ are randomly set to $0$ with probability $1-p$. For simplicity, the dependence of $D$ on $p$ will be omitted.

Now, let $D_k$, $k=1,2,\ldots$ be a sequence of i.i.d. dropout matrices, where $D_k$ refers to the dropout matrix applied in the $k$th iteration. Gradient descent with dropout takes the form

\widetilde{\bm{\beta}}_{k+1} = \widetilde{\bm{\beta}}_{k} - \alpha\nabla_{\widetilde{\bm{\beta}}_{k}}\dfrac{1}{2}\big\lVert\mathbf{Y}-XD_{k+1}\widetilde{\bm{\beta}}_{k}\big\rVert^{2}_{2} = \widetilde{\bm{\beta}}_{k} + \alpha D_{k+1}X^{\top}\big(\mathbf{Y}-XD_{k+1}\widetilde{\bm{\beta}}_{k}\big) \qquad (3)

with $k=0,1,2,\ldots$ and (possibly random) initialization $\widetilde{\bm{\beta}}_0$. In contrast with (2), the gradient in (3) is taken on the model reduced by the action of multiplying $\widetilde{\bm{\beta}}_k$ with $D_{k+1}$. Alternatively, (3) may be interpreted as replacing the design matrix $X$ with the reduced matrix $XD_{k+1}$ during the $(k+1)$th iteration. The columns of the reduced matrix are randomly deleted with a probability of $1-p$. Observe that the dropout matrix appears inside the squared norm, making the gradient quadratic in $D_{k+1}$.
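
A minimal sketch of update (3), again with synthetic data and our own variable names (not from the paper); a fresh dropout matrix is sampled in every iteration:

import numpy as np

rng = np.random.default_rng(1)
n, d, p = 50, 5, 0.8
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=d) + rng.normal(size=n)

alpha = 0.9 / (p * np.linalg.norm(X.T @ X, 2))               # alpha * p * ||X^T X|| < 1
beta = np.zeros(d)
for _ in range(5000):
    D = np.diag(rng.binomial(1, p, size=d).astype(float))    # p-dropout matrix D_{k+1}
    beta = beta + alpha * D @ X.T @ (Y - X @ D @ beta)       # update (3)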

Dropout, as defined in (3), considers the full gradient of the reduced model, whereas another variant is obtained through reduction of the full gradient. The resulting iterative scheme takes the form

\widehat{\bm{\beta}}_{k+1} = \widehat{\bm{\beta}}_{k} - \alpha D_{k+1}\nabla_{\widehat{\bm{\beta}}_{k}}\dfrac{1}{2}\big\lVert\mathbf{Y}-X\widehat{\bm{\beta}}_{k}\big\rVert^{2}_{2} = \widehat{\bm{\beta}}_{k} + \alpha D_{k+1}X^{\top}\big(\mathbf{Y}-X\widehat{\bm{\beta}}_{k}\big) \qquad (4)

with $k=0,1,2,\ldots$ and (possibly random) initialization $\widehat{\bm{\beta}}_0$. As opposed to $\widetilde{\bm{\beta}}_k$ defined above, the dropout matrix only occurs once in the updates, so we shall call this method simplified dropout from here on. As we will illustrate, the quadratic dependence of $\widetilde{\bm{\beta}}_k$ on $D_{k+1}$ creates various challenges, whereas the analysis of $\widehat{\bm{\beta}}_k$ is more straightforward.

Both versions (3) and (4) coincide when the Gram matrix $\mathbb{X}=X^{\top}X$ is diagonal, meaning when the columns of $X$ are orthogonal. To see this, note that diagonal matrices commute and $D_{k+1}^2=D_{k+1}$, hence $D_{k+1}\mathbb{X}D_{k+1}=D_{k+1}^2\mathbb{X}=D_{k+1}\mathbb{X}$.
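
The coincidence of (3) and (4) for orthogonal columns can be checked numerically; a sketch under our own synthetic setup (not from the paper):

import numpy as np

rng = np.random.default_rng(2)
n, d, p, alpha = 20, 4, 0.7, 0.01
X, _ = np.linalg.qr(rng.normal(size=(n, d)))               # orthonormal columns, so X^T X = I_d
Y = rng.normal(size=n)

b_drop = np.zeros(d)                                        # iterates of (3)
b_simp = np.zeros(d)                                        # iterates of (4)
for _ in range(200):
    D = np.diag(rng.binomial(1, p, size=d).astype(float))   # same dropout matrix for both
    b_drop = b_drop + alpha * D @ X.T @ (Y - X @ D @ b_drop)
    b_simp = b_simp + alpha * D @ X.T @ (Y - X @ b_simp)
assert np.allclose(b_drop, b_simp)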

We note that dropout need not require the complete removal of neurons. Each neuron may be multiplied by an arbitrary (not necessarily Bernoulli distributed) random variable. For instance, [46] also report good performance for $\mathcal{N}(1,1)$-distributed diagonal entries of the dropout matrix. However, the Bernoulli variant seems well-motivated from a model averaging perspective. [46] propose dropout with the explicit aim of approximating a Bayesian model averaging procedure over all possible combinations of connections in the network. The random removal of nodes during training is thought to prevent the neurons from co-adapting, recreating the model averaging effect. This is the main variant implemented in popular software libraries, such as Caffe [26], TensorFlow [1], Keras [10], and PyTorch [38].

Numerous variations and extensions of dropout exist. [49] show state-of-the-art results for networks with DropConnect, a generalization of dropout where connections are randomly dropped, instead of neurons. In the linear model, this coincides with standard dropout. [4] analyze the case of varying dropout probabilities, where the dropout probability for each neuron is computed using binary belief networks that share parameters with the underlying fully connected network. An adaptive procedure for the choice of dropout probabilities is presented in [27], while also giving a Bayesian justification for dropout.

For a comprehensive overview of established methods and cutting-edge variants, see [35] and [43].

3 Analysis of Averaged Dropout

Before presenting our main results on iterative dropout schemes, we further discuss some properties of the marginalized loss minimizer that was first analyzed by [46]. For the linear regression model (1), marginalizing the dropout noise leads to

\widetilde{\bm{\beta}} := \operatorname*{arg\,min}_{\bm{\beta}}\mathbb{E}\Big[\big\lVert\mathbf{Y}-XD\bm{\beta}\big\rVert^{2}_{2}\mid\mathbf{Y}\Big]. \qquad (5)

One may hope that the dropout gradient descent recursion for $\widetilde{\bm{\beta}}_k$ in (3) leads to a minimizer of (5), so that the marginalized loss minimizer may be studied as a surrogate for the behaviour of $\widetilde{\bm{\beta}}_k$ in the long run.

Intuitively, the gradient descent iterates with dropout represent a Monte-Carlo estimate of some deterministic algorithm [50]. This can be motivated by separating the gradient descent update into a part without algorithmic randomness and a centered noise term, meaning

\widetilde{\bm{\beta}}_{k+1} = \widetilde{\bm{\beta}}_{k} - \dfrac{\alpha}{2}\mathbb{E}\bigg[\nabla_{\widetilde{\bm{\beta}}_{k}}\big\lVert\mathbf{Y}-XD_{k+1}\widetilde{\bm{\beta}}_{k}\big\rVert^{2}_{2}\ \big|\ \mathbf{Y},\widetilde{\bm{\beta}}_{k}\bigg] - \dfrac{\alpha}{2}\bigg(\nabla_{\widetilde{\bm{\beta}}_{k}}\big\lVert\mathbf{Y}-XD_{k+1}\widetilde{\bm{\beta}}_{k}\big\rVert^{2}_{2} - \mathbb{E}\Big[\nabla_{\widetilde{\bm{\beta}}_{k}}\big\lVert\mathbf{Y}-XD_{k+1}\widetilde{\bm{\beta}}_{k}\big\rVert^{2}_{2}\ \big|\ \mathbf{Y},\widetilde{\bm{\beta}}_{k}\Big]\bigg). \qquad (6)

Notably, the stochastic terms form a martingale difference sequence with respect to $(\mathbf{Y},\widetilde{\bm{\beta}}_k)$. It seems conceivable that the noise in (6) averages out; despite the random variables being neither independent, nor identically distributed, one may hope that a law of large numbers still holds, see [2]. In this case, after a sufficient number of gradient steps,

\widetilde{\bm{\beta}}_{k+1} = \widetilde{\bm{\beta}}_{0} - \dfrac{\alpha}{2}\sum_{\ell=1}^{k}\nabla_{\widetilde{\bm{\beta}}_{\ell}}\big\lVert\mathbf{Y}-XD_{\ell+1}\widetilde{\bm{\beta}}_{\ell}\big\rVert^{2}_{2} \approx \widetilde{\bm{\beta}}_{0} - \dfrac{\alpha}{2}\sum_{\ell=1}^{k}\mathbb{E}\bigg[\nabla_{\widetilde{\bm{\beta}}_{\ell}}\big\lVert\mathbf{Y}-XD_{\ell+1}\widetilde{\bm{\beta}}_{\ell}\big\rVert^{2}_{2}\ \big|\ \mathbf{Y},\widetilde{\bm{\beta}}_{k}\bigg]. \qquad (7)

The latter sequence could plausibly converge to the marginalized loss minimizer $\widetilde{\bm{\beta}}$. While this motivates studying $\widetilde{\bm{\beta}}$, the main conclusion of our work is that this heuristic is not entirely correct and additional noise terms occur in the limit $k\to\infty$.

As the marginalized loss minimizer $\widetilde{\bm{\beta}}$ still plays a pivotal role in our analysis, we briefly recount and expand on some of the properties derived in [46]. Recall that $\mathbb{X}=X^{\top}X$, so we have

\big\lVert\mathbf{Y}-XD\bm{\beta}\big\rVert^{2}_{2} = \lVert\mathbf{Y}\rVert^{2}_{2} - 2\mathbf{Y}^{\top}XD\bm{\beta} + \bm{\beta}^{\top}D\mathbb{X}D\bm{\beta}.

Since $D$ is diagonal, $\mathbb{E}[D]=pI_d$, and by Lemma 10(a), $\mathbb{E}[D\mathbb{X}D]=p^{2}\mathbb{X}+p(1-p)\mathrm{Diag}(\mathbb{X})$,

\mathbb{E}\Big[\big\lVert\mathbf{Y}-XD\bm{\beta}\big\rVert^{2}_{2}\ \big|\ \mathbf{Y}\Big] = \lVert\mathbf{Y}\rVert_{2}^{2} - 2p\mathbf{Y}^{\top}X\bm{\beta} + p^{2}\bm{\beta}^{\top}\mathbb{X}\bm{\beta} + p(1-p)\bm{\beta}^{\top}\mathrm{Diag}(\mathbb{X})\bm{\beta} = \big\lVert\mathbf{Y}-pX\bm{\beta}\big\rVert^{2}_{2} + p(1-p)\bm{\beta}^{\top}\mathrm{Diag}(\mathbb{X})\bm{\beta}. \qquad (8)

The right-hand side may be identified with a Tikhonov functional, or an $\ell_2$-penalized least squares objective. Its gradient with respect to $\bm{\beta}$ is given by

\nabla_{\bm{\beta}}\mathbb{E}\big[\lVert\mathbf{Y}-XD\bm{\beta}\rVert^{2}_{2}\mid\mathbf{Y}\big] = -2pX^{\top}\mathbf{Y} + 2\big(p^{2}\mathbb{X}+p(1-p)\mathrm{Diag}(\mathbb{X})\big)\bm{\beta}.

Recall from the discussion following Equation (1) that the model is assumed to be in reduced form, meaning $\min_i\mathbb{X}_{ii}>0$. In turn,

p^{2}\mathbb{X}+p(1-p)\mathrm{Diag}(\mathbb{X}) \geq p(1-p)\mathrm{Diag}(\mathbb{X}) \geq p(1-p)\min_{i}\mathbb{X}_{ii}\cdot I_{d}

is bounded away from $0$, making $p^{2}\mathbb{X}+p(1-p)\mathrm{Diag}(\mathbb{X})$ invertible. Setting the gradient to zero and solving for the minimizer $\widetilde{\bm{\beta}}$ now leads to

\widetilde{\bm{\beta}} = \operatorname*{arg\,min}_{\bm{\beta}\in\mathbb{R}^{d}}\mathbb{E}\Big[\big\lVert\mathbf{Y}-XD\bm{\beta}\big\rVert^{2}_{2}\ \big|\ \mathbf{Y}\Big] = p\Big(p^{2}\mathbb{X}+p(1-p)\mathrm{Diag}(\mathbb{X})\Big)^{-1}X^{\top}\mathbf{Y} = \mathbb{X}_{p}^{-1}X^{\top}\mathbf{Y}, \qquad (9)

where $\mathbb{X}_p := p\mathbb{X}+(1-p)\mathrm{Diag}(\mathbb{X})$. If the columns of $X$ are orthogonal, then $\mathbb{X}$ is diagonal and hence $\mathbb{X}_p=\mathbb{X}$. In this case, $\widetilde{\bm{\beta}}$ matches the usual linear least squares estimator $\mathbb{X}^{-1}X^{\top}\mathbf{Y}$. Alternatively, the fact that $\widetilde{\bm{\beta}}$ minimizes the marginalized loss can also be deduced from the identity

\mathbb{E}\Big[\big\lVert\mathbf{Y}-XD\widehat{\bm{\beta}}\big\rVert_{2}^{2}\ \big|\ \mathbf{Y}\Big] = \mathbb{E}\Big[\big\lVert\mathbf{Y}-XD\widetilde{\bm{\beta}}\big\rVert_{2}^{2}\ \big|\ \mathbf{Y}\Big] + \mathbb{E}\Big[\big\lVert XD\big(\widetilde{\bm{\beta}}-\widehat{\bm{\beta}}\big)\big\rVert_{2}^{2}\ \big|\ \mathbf{Y}\Big], \qquad (10)

which holds for all estimators $\widehat{\bm{\beta}}$. See Appendix A for a proof of (10). We now mention several other relevant properties of $\widetilde{\bm{\beta}}$.
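
First, though, the identities (8) and (9) admit a quick numerical sanity check; a minimal sketch with synthetic data (ours, not the paper's):

import numpy as np

rng = np.random.default_rng(3)
n, d, p = 30, 4, 0.6
X = rng.normal(size=(n, d)); Y = rng.normal(size=n); beta = rng.normal(size=d)
XtX = X.T @ X

# Monte Carlo estimate of the marginalized loss E[||Y - X D beta||^2 | Y]
D = rng.binomial(1, p, size=(100_000, d))
mc = np.mean(np.sum((Y - (D * beta) @ X.T) ** 2, axis=1))
closed = np.sum((Y - p * X @ beta) ** 2) + p * (1 - p) * beta @ (np.diag(XtX) * beta)
# mc and closed agree up to Monte Carlo error, illustrating (8)

# Closed-form minimizer (9): beta_tilde = X_p^{-1} X^T Y
X_p = p * XtX + (1 - p) * np.diag(np.diag(XtX))
beta_tilde = np.linalg.solve(X_p, X.T @ Y)
print(mc, closed, beta_tilde)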

Calibration: [46] recommend multiplying $\widetilde{\bm{\beta}}$ by $p$, which may be motivated as follows: Since $\mathbf{Y}=X\bm{\beta}_{\star}+\bm{\varepsilon}$, a small squared error $\big\lVert\mathbf{Y}-pX\widetilde{\bm{\beta}}\big\rVert^{2}_{2}$ in (8) implies $\bm{\beta}_{\star}\approx p\widetilde{\bm{\beta}}$. Moreover, multiplying $\widetilde{\bm{\beta}}$ by $p$ leads to $p\widetilde{\bm{\beta}}=\big(\mathbb{X}+(1/p-1)\mathrm{Diag}(\mathbb{X})\big)^{-1}X^{\top}\mathbf{Y}$, which may be identified with the minimizer of the objective function

\bm{\beta}\mapsto\big\lVert\mathbf{Y}-X\bm{\beta}\big\rVert^{2}_{2} + (p^{-1}-1)\bm{\beta}^{\top}\mathrm{Diag}(\mathbb{X})\bm{\beta} = \mathbb{E}\big[\lVert\mathbf{Y}-Xp^{-1}D\bm{\beta}\rVert^{2}_{2}\mid\mathbf{Y}\big].

This recasts $p\widetilde{\bm{\beta}}$ as resulting from a weighted form of ridge regression. Comparing the objective function to the original marginalized loss $\mathbb{E}\big[\lVert\mathbf{Y}-XD\bm{\beta}\rVert^{2}_{2}\mid\mathbf{Y}\big]$, the rescaling replaces $D$ with the normalized dropout matrix $p^{-1}D$, which has the identity matrix as its expected value. In popular machine learning software, the sampled dropout matrices are usually rescaled by $p^{-1}$ [1, 10, 26, 38].
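
A short numerical check (ours, not from the paper) that $p\widetilde{\bm{\beta}}$ coincides with the weighted ridge estimator described above:

import numpy as np

rng = np.random.default_rng(4)
n, d, p = 30, 4, 0.6
X = rng.normal(size=(n, d)); Y = rng.normal(size=n)
XtX = X.T @ X
Dg = np.diag(np.diag(XtX))                                             # Diag(X^T X)

beta_tilde = np.linalg.solve(p * XtX + (1 - p) * Dg, X.T @ Y)          # closed form (9)
ridge = np.linalg.solve(XtX + (1 / p - 1) * Dg, X.T @ Y)               # weighted ridge estimator
assert np.allclose(p * beta_tilde, ridge)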

In some settings multiplication by $p$ may worsen $\widetilde{\bm{\beta}}$ as a statistical estimator. As an example, consider the case $n=d$ with $X=nI_n$ a multiple of the identity matrix. Now, (9) turns into $\widetilde{\bm{\beta}}=n^{-1}\mathbf{Y}=\bm{\beta}_{\star}+n^{-1}\bm{\varepsilon}$. If the noise vector $\bm{\varepsilon}$ consists of independent standard normal random variables, then $\widetilde{\bm{\beta}}$ has mean squared error $\mathbb{E}\big[\lVert\widetilde{\bm{\beta}}-\bm{\beta}_{\star}\rVert_{2}^{2}\big]=n^{-1}$. In contrast, $\mathbb{E}\big[\lVert p\widetilde{\bm{\beta}}-\bm{\beta}_{\star}\rVert_{2}^{2}\big]=(1-p)^{2}\lVert\bm{\beta}_{\star}\rVert_{2}^{2}+p^{2}n^{-1}$, so $\widetilde{\bm{\beta}}$ converges to $\bm{\beta}_{\star}$ at the rate $n^{-1}$ while $p\widetilde{\bm{\beta}}$ cannot be consistent as $n\to\infty$, unless $\bm{\beta}_{\star}=\mathbf{0}$.

The correct rescaling may also depend on the parameter dimension $d$ and the spectrum of $\mathbb{X}$. Suppose that all columns of $X$ have the same Euclidean norm, so that $\mathrm{Diag}(\mathbb{X})=\mathbb{X}_{11}\cdot I_d$. Let $X=\sum_{\ell=1}^{\mathrm{rank}(X)}\sigma_{\ell}\mathbf{v}_{\ell}\mathbf{w}_{\ell}^{\top}$ denote a singular value decomposition of $X$, with singular values $\sigma_1,\ldots,\sigma_{\mathrm{rank}(X)}$. Now, $\widetilde{\bm{\beta}}$ satisfies

\widetilde{\bm{\beta}} = \sum_{\ell=1}^{\mathrm{rank}(X)}\dfrac{1}{p\sigma_{\ell}^{2}+(1-p)\mathbb{X}_{11}}\big(\sigma_{\ell}\mathbf{w}_{\ell}\mathbf{v}_{\ell}^{\top}\big)\mathbf{Y}, \qquad \mathbb{E}\big[X\widetilde{\bm{\beta}}\big] = \sum_{\ell=1}^{\mathrm{rank}(X)}\dfrac{1}{p+(1-p)\mathbb{X}_{11}/\sigma_{\ell}^{2}}\big(\sigma_{\ell}\mathbf{v}_{\ell}\mathbf{w}_{\ell}^{\top}\big)\bm{\beta}_{\star}. \qquad (11)

For a proof of these identities, see Appendix A. To get an unbiased estimator for $X\bm{\beta}_{\star}=\sum_{\ell=1}^{\mathrm{rank}(X)}\sigma_{\ell}\mathbf{v}_{\ell}\mathbf{w}_{\ell}^{\top}\bm{\beta}_{\star}$, we must undo the effect of the spectral multipliers $1/(p+(1-p)\mathbb{X}_{11}/\sigma_{\ell}^{2})$, which take values in the interval $[0,1/p]$. Consequently, the proper rescaling depends on the eigenspace. Multiplication of the estimator by $p$ addresses the case where the singular values $\sigma_{\ell}$ are large. In particular, if $X=\sigma\mathbf{v}\mathbf{w}^{\top}$ with $\sigma=\sqrt{nd}$, $\mathbf{v}=(n^{-1/2},\ldots,n^{-1/2})^{\top}$, and $\mathbf{w}=(d^{-1/2},\ldots,d^{-1/2})^{\top}$, then $X$ is the $n\times d$ matrix with all entries equal to $1$. Now $\mathbb{X}_{11}=n$ and so $\mathbb{X}_{11}/\sigma^{2}=d^{-1}$, meaning the correct scaling factor depends explicitly on the parameter dimension $d$ and converges to the dropout probability $p$ as $d\to\infty$.

Invariance properties: The minimizer $\widetilde{\bm{\beta}}=\mathbb{X}_p^{-1}X^{\top}\mathbf{Y}$ is scale invariant in the sense that $\mathbf{Y}$ and $X$ may be replaced with $\gamma\mathbf{Y}$ and $\gamma X$ for some arbitrary $\gamma\neq 0$, without changing $\widetilde{\bm{\beta}}$. This does not hold for the gradient descent iterates (3) and (4), since rescaling by $\gamma$ changes the learning rate from $\alpha$ to $\alpha\gamma^{2}$. Moreover, $\widetilde{\bm{\beta}}$ as well as the gradient descent iterates $\widetilde{\bm{\beta}}_k$ in (3) and $\widehat{\bm{\beta}}_k$ in (4) are invariant under replacement of $\mathbf{Y}$ and $X$ by $Q\mathbf{Y}$ and $QX$ for any orthogonal $n\times n$ matrix $Q$. See [21] for further results on scale-invariance of dropout in deep networks.

Overparametrization: Dropout has been successfully applied in the overparametrized regime, see for example [28]. For the overparametrized linear regression model, the data-misfit term in (8) suggests that $pX\widetilde{\bm{\beta}}=X\big(\mathbb{X}+(p^{-1}-1)\mathrm{Diag}(\mathbb{X})\big)^{-1}X^{\top}\mathbf{Y}$ should be close to the data vector $\mathbf{Y}$. However,

\Big\lVert X\big(\mathbb{X}+(p^{-1}-1)\mathrm{Diag}(\mathbb{X})\big)^{-1}X^{\top}\Big\rVert < 1. \qquad (12)

See Appendix A for a proof. Hence, $pX\widetilde{\bm{\beta}}$ also shrinks $\mathbf{Y}$ towards zero in the overparametrized regime and does not interpolate the data. The variational formulation $\widetilde{\bm{\beta}}\in\operatorname*{arg\,min}_{\bm{\beta}\in\mathbb{R}^{d}}\big\lVert\mathbf{Y}-pX\bm{\beta}\big\rVert^{2}_{2}+p(1-p)\bm{\beta}^{\top}\mathrm{Diag}(\mathbb{X})\bm{\beta}$ reveals that $\widetilde{\bm{\beta}}$ is a minimum-norm solution in the sense that

\widetilde{\bm{\beta}} \in \operatorname*{arg\,min}_{\bm{\beta}:X\bm{\beta}=X\widetilde{\bm{\beta}}}\ \bm{\beta}^{\top}\mathrm{Diag}(\mathbb{X})\bm{\beta},

which explains the induced shrinkage.

4 Analysis of Iterative Dropout Schemes

In the linear model, gradient descent with a small but fixed learning rate, as in (2), leads to exponential convergence in the number of iterations. Accordingly, we analyze the iterative dropout schemes (3) and (4) for fixed learning rate $\alpha$ and only briefly discuss the algebraically less tractable case of decaying learning rates.

4.1 Convergence of Dropout

We proceed by assessing convergence of the iterative dropout scheme (3), as well as some of its statistical properties. Recall that gradient descent with dropout takes the form

\widetilde{\bm{\beta}}_{k} = \widetilde{\bm{\beta}}_{k-1} + \alpha D_{k}X^{\top}\big(\mathbf{Y}-XD_{k}\widetilde{\bm{\beta}}_{k-1}\big) = \big(I-\alpha D_{k}\mathbb{X}D_{k}\big)\widetilde{\bm{\beta}}_{k-1} + \alpha D_{k}X^{\top}\mathbf{Y}. \qquad (13)

As alluded to in the beginning of Section 3, the gradient descent iterates should be related to the minimizer $\widetilde{\bm{\beta}}$ of (5). It then seems natural to study the difference $\widetilde{\bm{\beta}}_k-\widetilde{\bm{\beta}}$, with $\widetilde{\bm{\beta}}$ as an “anchoring point”. Comparing $\widetilde{\bm{\beta}}_k$ and $\widetilde{\bm{\beta}}$ demands an explicit analysis, without marginalization of the dropout noise.

To start, we rewrite the updating formula (13) in terms of $\widetilde{\bm{\beta}}_k-\widetilde{\bm{\beta}}$. Using $D_k^{2}=D_k$, $\mathrm{Diag}(\mathbb{X})=\mathbb{X}_p-p\overline{\mathbb{X}}$, and the fact that diagonal matrices always commute, we obtain $D_k\mathbb{X}D_k=D_k\overline{\mathbb{X}}D_k+D_k\mathrm{Diag}(\mathbb{X})=D_k\overline{\mathbb{X}}D_k+D_k\mathbb{X}_p-pD_k\overline{\mathbb{X}}$. As defined in (9), $\mathbb{X}_p\widetilde{\bm{\beta}}=X^{\top}\mathbf{Y}$ and thus

\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}} = \big(I-\alpha D_{k}\mathbb{X}D_{k}\big)\big(\widetilde{\bm{\beta}}_{k-1}-\widetilde{\bm{\beta}}\big) + \alpha D_{k}\overline{\mathbb{X}}\big(pI-D_{k}\big)\widetilde{\bm{\beta}} = \big(I-\alpha D_{k}\mathbb{X}_{p}\big)\big(\widetilde{\bm{\beta}}_{k-1}-\widetilde{\bm{\beta}}\big) + \alpha D_{k}\overline{\mathbb{X}}\big(pI-D_{k}\big)\widetilde{\bm{\beta}}_{k-1}. \qquad (14)

In both representations, the second term is centered and uncorrelated for different values of $k$. Vanishing of the mean follows from the independence of $D_k$ and $\big(\widetilde{\bm{\beta}},\widetilde{\bm{\beta}}_{k-1}\big)$, combined with $\mathbb{E}\big[D_k\overline{\mathbb{X}}(pI-D_k)\big]=0$, the latter being shown in (28). If $k>\ell$, independence of $D_k$ and $\big(\widetilde{\bm{\beta}},\widetilde{\bm{\beta}}_{k-1},\widetilde{\bm{\beta}}_{\ell-1}\big)$, as well as $\mathbb{E}\big[D_k\overline{\mathbb{X}}(pI-D_k)\big]=0$, imply $\mathrm{Cov}\big(D_k\overline{\mathbb{X}}(pI-D_k)\widetilde{\bm{\beta}},D_{\ell}\overline{\mathbb{X}}(pI-D_{\ell})\widetilde{\bm{\beta}}\big)=0$ and $\mathrm{Cov}\big(D_k\overline{\mathbb{X}}(pI-D_k)\widetilde{\bm{\beta}}_{k-1},D_{\ell}\overline{\mathbb{X}}(pI-D_{\ell})\widetilde{\bm{\beta}}_{\ell-1}\big)=0$, which proves uncorrelatedness.
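
The one-step rewriting (14) can also be verified numerically for a single sampled dropout matrix; a minimal sketch under our own synthetic setup (not from the paper):

import numpy as np

rng = np.random.default_rng(5)
n, d, p, alpha = 15, 4, 0.6, 0.01
X = rng.normal(size=(n, d)); Y = rng.normal(size=n)
XtX = X.T @ X
X_bar = XtX - np.diag(np.diag(XtX))                      # \overline{X^T X}
X_p = p * XtX + (1 - p) * np.diag(np.diag(XtX))
beta_tilde = np.linalg.solve(X_p, X.T @ Y)

beta_prev = rng.normal(size=d)                           # plays the role of beta_{k-1}
D = np.diag(rng.binomial(1, p, size=d).astype(float))
beta_next = beta_prev + alpha * D @ X.T @ (Y - X @ D @ beta_prev)        # update (13)

lhs = beta_next - beta_tilde
rhs = (np.eye(d) - alpha * D @ XtX @ D) @ (beta_prev - beta_tilde) \
      + alpha * D @ X_bar @ (p * np.eye(d) - D) @ beta_tilde             # first line of (14)
assert np.allclose(lhs, rhs)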

Defining $Z_k:=\widetilde{\bm{\beta}}_k-\widetilde{\bm{\beta}}$, $G_k:=I-\alpha D_k\mathbb{X}D_k$, and $\bm{\xi}_k:=\alpha D_k\overline{\mathbb{X}}\big(pI-D_k\big)\widetilde{\bm{\beta}}$, the first representation in (14) may be identified with a lag one vector autoregressive (VAR) process

Z_{k} = G_{k}Z_{k-1} + \bm{\xi}_{k} \qquad (15)

with i.i.d. random coefficients $G_k$ and noise/innovation process $\bm{\xi}_k$. As just shown, $\mathbb{E}[\bm{\xi}_k]=0$ and $\mathrm{Cov}(\bm{\xi}_k,\bm{\xi}_{\ell})=0$ whenever $k\neq\ell$, so the noise process is centered and serially uncorrelated. The random coefficients $G_k$ and $\bm{\xi}_k$ are, however, dependent. While most authors do not allow for random coefficients $G_k$ in VAR processes, such processes are special cases of a random autoregressive process (RAR) [41].

In the VAR literature, identifiability and estimation of the random coefficients $G_k$ are considered in [37, 41]. In contrast, we aim to obtain bounds for the convergence of $\mathbb{E}\big[\widetilde{\bm{\beta}}_k-\widetilde{\bm{\beta}}\big]$ and $\mathrm{Cov}\big(\widetilde{\bm{\beta}}_k-\widetilde{\bm{\beta}}\big)$. Difficulties arise from the involved structure and coupled randomness of $G_k$ and $\bm{\xi}_k$. Estimation of coefficients under dependence of $G_k$ and $\bm{\xi}_k$ is treated in [22].

For a sufficiently small learning rate $\alpha$, the random matrices $I-\alpha D_k\mathbb{X}D_k$ and $I-\alpha D_k\mathbb{X}_p$ in both representations in (14) are contractive maps in expectation. By Lemma 10(a), their expected values coincide since

\mathbb{E}\big[I-\alpha D_{k}\mathbb{X}D_{k}\big] = \mathbb{E}\big[I-\alpha D_{k}\mathbb{X}_{p}\big] = I - \alpha p\mathbb{X}_{p}.

For the subsequent analysis, we impose the following mild conditions that, among other things, establish contractivity of $I-\alpha p\mathbb{X}_p$ as a linear map.

Assumption 1.

The learning rate $\alpha$ and the dropout probability $p$ are chosen such that $\alpha p\lVert\mathbb{X}\rVert<1$, the initialization $\widetilde{\bm{\beta}}_0$ is a square integrable random vector that is independent of the data $\mathbf{Y}$, and the model is in reduced form, meaning that $X$ does not have zero columns.

For gradient descent without dropout and fixed learning rate, as defined in (2), $\alpha\lVert\mathbb{X}\rVert<1$ guarantees convergence of the scheme in expectation. We will see shortly that, in expectation, dropout essentially replaces the learning rate $\alpha$ with $\alpha p$, which motivates the condition $\alpha p\lVert\mathbb{X}\rVert<1$.

As a straightforward consequence of the definitions, we are now able to show that $\widetilde{\bm{\beta}}_k-\widetilde{\bm{\beta}}$ vanishes in expectation at a geometric rate. For a proof of this as well as subsequent results, see Appendix B.

Lemma 1 (Convergence of Expectation).

Given Assumption 1, $\big\lVert I-\alpha p\mathbb{X}_p\big\rVert\leq 1-\alpha p(1-p)\min_i\mathbb{X}_{ii}<1$ and for any $k=0,1,\ldots$,

\Big\lVert\mathbb{E}\big[\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\big]\Big\rVert_{2} \leq \big\lVert I-\alpha p\mathbb{X}_{p}\big\rVert^{k}\,\Big\lVert\mathbb{E}\big[\widetilde{\bm{\beta}}_{0}-\widetilde{\bm{\beta}}\big]\Big\rVert_{2}.
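
Lemma 1 can be illustrated by a small Monte Carlo experiment (our sketch, not part of the paper): averaging update (3) over many independent runs while holding $\mathbf{Y}$ fixed, the norm of the mean error shrinks geometrically.

import numpy as np

rng = np.random.default_rng(6)
n, d, p = 20, 3, 0.7
X = rng.normal(size=(n, d)); Y = rng.normal(size=n)
XtX = X.T @ X
X_p = p * XtX + (1 - p) * np.diag(np.diag(XtX))
beta_tilde = np.linalg.solve(X_p, X.T @ Y)
alpha = 0.5 / (p * np.linalg.norm(XtX, 2))               # satisfies Assumption 1

runs, k_max = 20_000, 30
B = np.zeros((runs, d))                                  # beta_0 = 0 in every run
for k in range(k_max):
    D = rng.binomial(1, p, size=(runs, d)).astype(float)
    B = B + alpha * D * ((Y - (D * B) @ X.T) @ X)        # update (3), all runs at once
    print(k + 1, np.linalg.norm(B.mean(axis=0) - beta_tilde))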

Before turning to the analysis of the covariance structure, we highlight a property of the sequence $\mathbb{E}\big[\widetilde{\bm{\beta}}_k\mid\mathbf{Y}\big]$. As mentioned, these conditional expectations may be viewed as gradient descent iterates generated by the marginalized objective $\tfrac{1}{2}\mathbb{E}\big[\lVert\mathbf{Y}-XD\bm{\beta}\rVert_2^2\mid\mathbf{Y}\big]$ that gives rise to $\widetilde{\bm{\beta}}$. Indeed, combining (13) with $\mathbb{E}[D_{k+1}]=pI_d$, $\mathbb{E}\big[D_{k+1}\mathbb{X}D_{k+1}\big]=p\mathbb{X}_p$ from Lemma 10(a), and (8) yields

\mathbb{E}\big[\widetilde{\bm{\beta}}_{k+1}\mid\mathbf{Y}\big] = \mathbb{E}\big[\widetilde{\bm{\beta}}_{k}\mid\mathbf{Y}\big] + \alpha pX^{\top}\mathbf{Y} - \alpha p\mathbb{X}_{p}\mathbb{E}\big[\widetilde{\bm{\beta}}_{k}\mid\mathbf{Y}\big] = \mathbb{E}\big[\widetilde{\bm{\beta}}_{k}\mid\mathbf{Y}\big] - \dfrac{\alpha}{2}\nabla_{\mathbb{E}[\widetilde{\bm{\beta}}_{k}\mid\mathbf{Y}]}\mathbb{E}\Big[\big\lVert\mathbf{Y}-XD\,\mathbb{E}[\widetilde{\bm{\beta}}_{k}\mid\mathbf{Y}]\big\rVert_{2}^{2}\mid\mathbf{Y}\Big].

This establishes a connection between the dropout iterates and the averaged analysis of the previous section. However, the relationship between the (unconditional) covariance matrices $\mathrm{Cov}(\widetilde{\bm{\beta}}_k)$ and the added noise remains unclear. A new dropout matrix is sampled for each iteration, whereas $\widetilde{\bm{\beta}}$ results from minimization only after applying the conditional expectation $\mathbb{E}[\,\cdot\mid\mathbf{Y}]$ to the randomized objective function. Hence, we may expect that $\widetilde{\bm{\beta}}$ features smaller variance than the iterates, as the latter also depend on the noise added via dropout.

As a first result for the covariance analysis, we establish an extension of the Gauss-Markov theorem stating that the covariance matrix of a linear estimator lower-bounds the covariance matrix of an affine estimator, provided that both estimators have the same asymptotic mean. Moreover, the covariance matrix of their difference characterizes the gap. We believe that a similar result may already be known, but we are not aware of any reference, so a full proof is provided in Appendix B for completeness.

Theorem 1.

In the linear regression model (1), consider estimators $\widetilde{\bm{\beta}}_A=AX^{\top}\mathbf{Y}$ and $\widetilde{\bm{\beta}}_{\mathrm{aff}}=B\mathbf{Y}+\mathbf{a}$, with $B\in\mathbb{R}^{d\times n}$ and $\mathbf{a}\in\mathbb{R}^{d}$ (possibly) random, but independent of $\mathbf{Y}$, and $A\in\mathbb{R}^{d\times d}$ deterministic. Then,

\Big\lVert\mathrm{Cov}\big(\widetilde{\bm{\beta}}_{\mathrm{aff}}\big)-\mathrm{Cov}\big(\widetilde{\bm{\beta}}_{A}\big)-\mathrm{Cov}\big(\widetilde{\bm{\beta}}_{\mathrm{aff}}-\widetilde{\bm{\beta}}_{A}\big)\Big\rVert \leq 4\lVert A\rVert\sup_{\bm{\beta}_{\star}:\lVert\bm{\beta}_{\star}\rVert_{2}\leq 1}\big\lVert\mathbb{E}_{\bm{\beta}_{\star}}\big[\widetilde{\bm{\beta}}_{\mathrm{aff}}-\widetilde{\bm{\beta}}_{A}\big]\big\rVert_{2},

where $\mathbb{E}_{\bm{\beta}_{\star}}$ denotes the expectation with respect to $\bm{\beta}_{\star}$ being the true regression vector in the linear regression model (1).

Since $\mathrm{Cov}\big(\widetilde{\bm{\beta}}_{\mathrm{aff}}\big)-\mathrm{Cov}\big(\widetilde{\bm{\beta}}_A\big)-\mathrm{Cov}\big(\widetilde{\bm{\beta}}_{\mathrm{aff}}-\widetilde{\bm{\beta}}_A\big)=\mathrm{Cov}\big(\widetilde{\bm{\beta}}_{\mathrm{aff}}-\widetilde{\bm{\beta}}_A,\widetilde{\bm{\beta}}_A\big)+\mathrm{Cov}\big(\widetilde{\bm{\beta}}_A,\widetilde{\bm{\beta}}_{\mathrm{aff}}-\widetilde{\bm{\beta}}_A\big)$, Theorem 1 may be interpreted as follows: if the estimators $\widetilde{\bm{\beta}}_{\mathrm{aff}}$ and $\widetilde{\bm{\beta}}_A$ are nearly the same in expectation, then $\widetilde{\bm{\beta}}_{\mathrm{aff}}-\widetilde{\bm{\beta}}_A$ and $\widetilde{\bm{\beta}}_A$ must be nearly uncorrelated. In turn, $\widetilde{\bm{\beta}}_{\mathrm{aff}}$ may be decomposed into $\widetilde{\bm{\beta}}_A$ and (nearly) orthogonal noise $\widetilde{\bm{\beta}}_{\mathrm{aff}}-\widetilde{\bm{\beta}}_A$, so that $\mathrm{Cov}\big(\widetilde{\bm{\beta}}_{\mathrm{aff}}\big)\approx\mathrm{Cov}\big(\widetilde{\bm{\beta}}_A\big)+\mathrm{Cov}\big(\widetilde{\bm{\beta}}_{\mathrm{aff}}-\widetilde{\bm{\beta}}_A\big)$ is lower bounded by $\mathrm{Cov}\big(\widetilde{\bm{\beta}}_A\big)$. Therefore, the covariance matrix $\mathrm{Cov}\big(\widetilde{\bm{\beta}}_{\mathrm{aff}}-\widetilde{\bm{\beta}}_A\big)$ quantifies the gap in the bound.

Taking $A:=\mathbb{X}^{-1}$ and considering linear estimators with $\mathbf{a}=\mathbf{0}$ recovers the usual Gauss-Markov theorem, stating that $\mathbb{X}^{-1}X^{\top}\mathbf{Y}$ is the best linear unbiased estimator (BLUE) for the linear model. Applying the generalized Gauss-Markov theorem with $A=(\mathbb{X}+\Gamma)^{-1}$, where $\Gamma$ is a positive definite matrix, we obtain the following statement about $\ell_2$-penalized estimators.

Corollary 1.

The minimizer $\widetilde{\bm{\beta}}_{\Gamma}:=(\mathbb{X}+\Gamma)^{-1}X^{\top}\mathbf{Y}$ of the $\ell_2$-penalized functional $\big\lVert\mathbf{Y}-X\bm{\beta}\big\rVert_2^2+\bm{\beta}^{\top}\Gamma\bm{\beta}$ has the smallest covariance matrix among all affine estimators with the same expectation as $\widetilde{\bm{\beta}}_{\Gamma}$.

We now return to our analysis of the covariance structure induced by dropout. If $A:=\mathbb{X}_p^{-1}$, then $\widetilde{\bm{\beta}}=\mathbb{X}_p^{-1}X^{\top}\mathbf{Y}=AX^{\top}\mathbf{Y}=\widetilde{\bm{\beta}}_A$ in Theorem 1. Further, the dropout iterates may be rewritten as affine estimators $\widetilde{\bm{\beta}}_k=B_k\mathbf{Y}+\mathbf{a}_k$ with

B_{k} := \sum_{j=1}^{k-1}\left(\prod_{\ell=0}^{k-j-1}\Big(I-\alpha D_{k-\ell}\mathbb{X}D_{k-\ell}\Big)\right)\alpha D_{j}X^{\top} + \alpha D_{k}X^{\top}, \qquad \mathbf{a}_{k} := \left(\prod_{\ell=0}^{k-1}\Big(I-\alpha D_{k-\ell}\mathbb{X}D_{k-\ell}\Big)\right)\widetilde{\bm{\beta}}_{0}.

By construction, $(B_k,\mathbf{a}_k)$ and $\mathbf{Y}$ are independent, so Theorem 1 applies. As shown in Lemma 1, $\mathbb{E}\big[\widetilde{\bm{\beta}}_k-\widetilde{\bm{\beta}}\big]$ vanishes exponentially fast, so we conclude that $\mathrm{Cov}\big(\widetilde{\bm{\beta}}_k\big)$ is asymptotically lower-bounded by $\mathrm{Cov}\big(\widetilde{\bm{\beta}}\big)$. Further, the covariance structure of $\widetilde{\bm{\beta}}$ is optimal in the sense of Corollary 1.

We proceed by studying $\mathrm{Cov}(\widetilde{\bm{\beta}}_k-\widetilde{\bm{\beta}})$, with the aim of quantifying the gap between the covariance matrices. To this end, we exhibit a particular recurrence for the second moments $\mathbb{E}\big[(\widetilde{\bm{\beta}}_k-\widetilde{\bm{\beta}})(\widetilde{\bm{\beta}}_k-\widetilde{\bm{\beta}})^{\top}\big]$. Recall that $\odot$ denotes the Hadamard product, $B_p=pB+(1-p)\mathrm{Diag}(B)$, and $\overline{B}=B-\mathrm{Diag}(B)$.

Lemma 2 (Second Moment - Recursive Formula).

Under Assumption 1, for all positive integers $k$,

\Bigg\lVert\mathbb{E}\Big[\big(\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\big)\big(\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\big)^{\top}\Big] - S\bigg(\mathbb{E}\Big[\big(\widetilde{\bm{\beta}}_{k-1}-\widetilde{\bm{\beta}}\big)\big(\widetilde{\bm{\beta}}_{k-1}-\widetilde{\bm{\beta}}\big)^{\top}\Big]\bigg)\Bigg\rVert \leq 6\big\lVert I-\alpha p\mathbb{X}_{p}\big\rVert^{k-1}\Big\lVert\mathbb{E}\big[(\widetilde{\bm{\beta}}_{0}-\widetilde{\bm{\beta}})\widetilde{\bm{\beta}}^{\top}\big]\Big\rVert,

where $S:\mathbb{R}^{d\times d}\to\mathbb{R}^{d\times d}$ denotes the affine operator

S(A) = \big(I-\alpha p\mathbb{X}_{p}\big)A\big(I-\alpha p\mathbb{X}_{p}\big) + \alpha^{2}p(1-p)\mathrm{Diag}\big(\mathbb{X}_{p}A\mathbb{X}_{p}\big) + \alpha^{2}p^{2}(1-p)^{2}\,\mathbb{X}\odot\overline{A+\mathbb{E}\big[\widetilde{\bm{\beta}}\widetilde{\bm{\beta}}^{\top}\big]}\odot\mathbb{X} + \alpha^{2}p^{2}(1-p)\bigg(\Big(\overline{\mathbb{X}}\,\mathrm{Diag}\Big(A+\mathbb{E}\big[\widetilde{\bm{\beta}}\widetilde{\bm{\beta}}^{\top}\big]\Big)\overline{\mathbb{X}}\Big)_{p} + \overline{\mathbb{X}}\,\mathrm{Diag}\big(\mathbb{X}_{p}A\big) + \mathrm{Diag}\big(\mathbb{X}_{p}A\big)\overline{\mathbb{X}}\bigg).

Intuitively, the lemma states that the second moment of $\widetilde{\bm{\beta}}_k-\widetilde{\bm{\beta}}$ evolves as an affine dynamical system, up to some exponentially decaying remainder. This may be associated with the implicit regularization of the dropout noise, as illustrated empirically in [51].

Mathematically, the result may be motivated via the representation of the dropout iterates as a random autoregressive process $Z_k=G_kZ_{k-1}+\bm{\xi}_k$ in (15). Writing out $Z_kZ_k^{\top}=G_kZ_{k-1}Z_{k-1}^{\top}G_k+\bm{\xi}_k\bm{\xi}_k^{\top}+G_kZ_{k-1}\bm{\xi}_k^{\top}+\bm{\xi}_kZ_{k-1}^{\top}G_k$ and comparing with the proof of the lemma, we see that the remainder term, denoted by $\rho_{k-1}$ in the proof, coincides with the expected value of the cross terms $G_kZ_{k-1}\bm{\xi}_k^{\top}+\bm{\xi}_kZ_{k-1}^{\top}G_k$. Moreover, the operator $S$ is obtained by computing

S(A) = \mathbb{E}\Big[G_{k}AG_{k} + \bm{\xi}_{k}\bm{\xi}_{k}^{\top}\Big].

As the pairs $(G_k,\bm{\xi}_k)$ are i.i.d., $S$ does not depend on $k$. Moreover, independence of $G_k$ and $Z_{k-1}$ implies

\mathbb{E}\Big[Z_{k}Z_{k}^{\top}\Big] = \mathbb{E}\Big[G_{k}Z_{k-1}Z_{k-1}^{\top}G_{k} + \bm{\xi}_{k}\bm{\xi}_{k}^{\top}\Big] + \rho_{k-1} = \mathbb{E}\Big[G_{k}\mathbb{E}\big[Z_{k-1}Z_{k-1}^{\top}\big]G_{k} + \bm{\xi}_{k}\bm{\xi}_{k}^{\top}\Big] + \rho_{k-1} = S\Big(\mathbb{E}\big[Z_{k-1}Z_{k-1}^{\top}\big]\Big) + \rho_{k-1}.

Inserting the definition $Z_k=\widetilde{\bm{\beta}}_k-\widetilde{\bm{\beta}}$ results in the statement of the lemma. The random vector $\bm{\xi}_k$ depends on $\widetilde{\bm{\beta}}$, and by Theorem 1 the correlation between $Z_k=\widetilde{\bm{\beta}}_k-\widetilde{\bm{\beta}}$ and $\widetilde{\bm{\beta}}$ decreases as $k\to\infty$. This leads to the exponentially decaying bound for the remainder term $\rho_{k-1}$.

The previous lemma entails equality between $\mathbb{E}\big[Z_kZ_k^{\top}\big]$ and $S^k\big(\mathbb{E}[Z_0Z_0^{\top}]\big)$, up to the remainder terms. The latter may be computed further by decomposing the affine operator $S$ into its intercept and linear part

S_{0} := S(0) = \mathbb{E}\big[\bm{\xi}_{k}\bm{\xi}_{k}^{\top}\big] \qquad\text{and}\qquad S_{\mathrm{lin}}(A) := S(A) - S_{0} = \mathbb{E}\big[G_{k}AG_{k}\big]. \qquad (16)

If $S_{\mathrm{lin}}$ were to have operator norm less than one, then the Neumann series for $(\mathrm{id}-S_{\mathrm{lin}})^{-1}$ (see Lemma 12) gives

S^{k}(A) = \sum_{j=0}^{k-1}S_{\mathrm{lin}}^{j}(S_{0}) + S_{\mathrm{lin}}^{k}(A) \to \sum_{j=0}^{\infty}S_{\mathrm{lin}}^{j}(S_{0}) = \big(\mathrm{id}-S_{\mathrm{lin}}\big)^{-1}S_{0},

with $\mathrm{id}$ the identity operator on $d\times d$ matrices. Surprisingly, the operator “forgets” $A$ in the sense that the limit does not depend on $A$ anymore. The argument shows that $\mathbb{E}\big[Z_kZ_k^{\top}\big]=\mathbb{E}\big[(\widetilde{\bm{\beta}}_k-\widetilde{\bm{\beta}})(\widetilde{\bm{\beta}}_k-\widetilde{\bm{\beta}})^{\top}\big]$ should behave, to first order, like $\big(\mathrm{id}-S_{\mathrm{lin}}\big)^{-1}S_0$. The next result makes this precise, taking into account the remainder terms and approximation errors.

Theorem 2 (Second Moment - Limit Formula).

In addition to Assumption 1, suppose $\alpha<\tfrac{\lambda_{\mathrm{min}}(\mathbb{X}_p)}{3\lVert\mathbb{X}\rVert^{2}}$. Then, for any $k=1,2,\ldots$

\bigg\lVert\mathbb{E}\Big[\big(\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\big)\big(\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\big)^{\top}\Big] - \big(\mathrm{id}-S_{\mathrm{lin}}\big)^{-1}S_{0}\bigg\rVert \leq Ck\big\lVert I-\alpha p\mathbb{X}_{p}\big\rVert^{k-1}

and

\Big\lVert\mathrm{Cov}\big(\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\big) - \big(\mathrm{id}-S_{\mathrm{lin}}\big)^{-1}S_{0}\Big\rVert \leq Ck\big\lVert I-\alpha p\mathbb{X}_{p}\big\rVert^{k-1}

with the constant $C$ given by

C := \bigg\lVert\mathbb{E}\Big[\big(\widetilde{\bm{\beta}}_{0}-\widetilde{\bm{\beta}}\big)\big(\widetilde{\bm{\beta}}_{0}-\widetilde{\bm{\beta}}\big)^{\top}\Big] - \big(\mathrm{id}-S_{\mathrm{lin}}\big)^{-1}S_{0}\bigg\rVert + 6\Big\lVert\mathbb{E}\big[(\widetilde{\bm{\beta}}_{0}-\widetilde{\bm{\beta}})\widetilde{\bm{\beta}}^{\top}\big]\Big\rVert + \Big\lVert\mathbb{E}\big[\widetilde{\bm{\beta}}_{0}-\widetilde{\bm{\beta}}\big]\Big\rVert_{2}^{2}.

In short, $\mathrm{Cov}\big(\widetilde{\bm{\beta}}_k-\widetilde{\bm{\beta}}\big)$ converges exponentially fast to the limit $(\mathrm{id}-S_{\mathrm{lin}})^{-1}S_0$. Combining the generalized Gauss-Markov Theorem 1 with Theorem 2 also establishes

\mathrm{Cov}\big(\widetilde{\bm{\beta}}_{k}\big) \to \mathrm{Cov}\big(\widetilde{\bm{\beta}}\big) + (\mathrm{id}-S_{\mathrm{lin}})^{-1}S_{0}, \qquad\text{as}\ k\to\infty,

with exponential rate of convergence. Recall the intuition gained from Theorem 1 that $\widetilde{\bm{\beta}}_k$ may be decomposed into a sum of $\widetilde{\bm{\beta}}$ and (approximately) orthogonal centered noise. We now conclude that up to exponentially decaying terms, the covariance matrix of this orthogonal noise must be given by $(\mathrm{id}-S_{\mathrm{lin}})^{-1}S_0$, which fully describes the (asymptotic) gap between the covariance matrices of $\widetilde{\bm{\beta}}$ and $\widetilde{\bm{\beta}}_k$.
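
For small $d$, the limit $(\mathrm{id}-S_{\mathrm{lin}})^{-1}S_0$ can be computed directly by enumerating all $2^d$ dropout patterns and iterating the affine map $A\mapsto S_{\mathrm{lin}}(A)+S_0$ until it stabilizes, mirroring the Neumann series argument above. The following sketch is ours, not part of the paper, and assumes a known $\bm{\beta}_{\star}$ (and unit noise covariance) in order to form $\mathbb{E}[\widetilde{\bm{\beta}}\widetilde{\bm{\beta}}^{\top}]$:

import itertools
import numpy as np

rng = np.random.default_rng(7)
n, d, p = 20, 3, 0.7
X = rng.normal(size=(n, d)); beta_star = rng.normal(size=d)
XtX = X.T @ X
X_bar = XtX - np.diag(np.diag(XtX))
X_p = p * XtX + (1 - p) * np.diag(np.diag(XtX))
alpha = 0.2 * np.linalg.eigvalsh(X_p)[0] / (3 * np.linalg.norm(XtX, 2) ** 2)

# E[beta_tilde beta_tilde^T] for Y = X beta_star + eps with Cov(eps) = I_n
Xp_inv = np.linalg.inv(X_p)
M2 = Xp_inv @ (XtX @ np.outer(beta_star, beta_star) @ XtX + XtX) @ Xp_inv

Gs, ws, S0 = [], [], np.zeros((d, d))
for mask in itertools.product([0.0, 1.0], repeat=d):      # all 2^d dropout patterns
    D = np.diag(mask)
    w = p ** sum(mask) * (1 - p) ** (d - sum(mask))        # probability of the pattern
    G = np.eye(d) - alpha * D @ XtX @ D                    # G_k for this pattern
    noise = alpha * D @ X_bar @ (p * np.eye(d) - D)        # xi_k = noise @ beta_tilde
    S0 += w * noise @ M2 @ noise.T                         # S_0 = E[xi_k xi_k^T]
    Gs.append(G); ws.append(w)

A = np.zeros((d, d))
for _ in range(5000):                                      # fixed-point iteration
    A = sum(w * G @ A @ G for w, G in zip(ws, Gs)) + S0    # A -> S_lin(A) + S_0
# A now approximates the limit (id - S_lin)^{-1} S_0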

Taking the trace and noting that $\big\lvert\mathrm{Tr}(A)\big\rvert\leq d\lVert A\rVert$, we obtain a bound for the convergence of $\widetilde{\bm{\beta}}_k$ with respect to the squared Euclidean loss,

\bigg\lvert\mathbb{E}\Big[\big\lVert\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\big\rVert_{2}^{2}\Big] - \mathrm{Tr}\Big(\big(\mathrm{id}-S_{\mathrm{lin}}\big)^{-1}S_{0}\Big)\bigg\rvert \leq Cdk\big\lVert I-\alpha p\mathbb{X}_{p}\big\rVert^{k-1}. \qquad (17)

Since $(\mathrm{id}-S_{\mathrm{lin}})^{-1}S_0$ is a $d\times d$ matrix, the term $\mathrm{Tr}\big((\mathrm{id}-S_{\mathrm{lin}})^{-1}S_0\big)$ describing the asymptotic discrepancy between $\widetilde{\bm{\beta}}_k$ and $\widetilde{\bm{\beta}}$ can be large in high dimensions $d$, even if the spectral norm of $(\mathrm{id}-S_{\mathrm{lin}})^{-1}S_0$ is small. Since $\mathrm{id}-S_{\mathrm{lin}}$ is a positive definite operator, the matrix $(\mathrm{id}-S_{\mathrm{lin}})^{-1}S_0$ is zero if, and only if, $S_0$ is zero. By (16), $S_0=\mathbb{E}[\bm{\xi}_k\bm{\xi}_k^{\top}]$. The explicit form $\bm{\xi}_k=\alpha D_k\overline{\mathbb{X}}\big(pI-D_k\big)\widetilde{\bm{\beta}}$ shows that $\bm{\xi}_k=0$ and $S_0=0$ provided that $\overline{\mathbb{X}}=0$, meaning whenever $\mathbb{X}$ is diagonal. To give a more precise quantification, we show that the operator norm of $(\mathrm{id}-S_{\mathrm{lin}})^{-1}S_0$ is of order $\alpha p/(1-p)^{2}$.

Lemma 3.

In addition to Assumption 1, suppose $\alpha<\tfrac{\lambda_{\mathrm{min}}(\mathbb{X}_p)}{3\lVert\mathbb{X}\rVert^{2}}$. Then, for any $k=1,2,\ldots$

\Big\lVert\mathrm{Cov}\big(\widetilde{\bm{\beta}}_{k}\big) - \mathrm{Cov}\big(\widetilde{\bm{\beta}}\big)\Big\rVert \leq \dfrac{k\lVert I-\alpha p\mathbb{X}_{p}\rVert^{k-1}C^{\prime} + \alpha pC^{\prime\prime}}{(1-p)^{2}}

and

\Big\lVert\mathrm{Cov}\big(\widetilde{\bm{\beta}}_{k}\big) - \mathrm{Diag}(\mathbb{X})^{-1}\mathbb{X}\,\mathrm{Diag}(\mathbb{X})^{-1}\Big\rVert \leq \dfrac{k\lVert I-\alpha p\mathbb{X}_{p}\rVert^{k-1}C^{\prime} + p(1+\alpha)C^{\prime\prime}}{(1-p)^{2}},

where $C^{\prime}$ and $C^{\prime\prime}$ are constants that are independent of $(\alpha,p,k)$.

The first bound describes the interplay between $\alpha p$ and $k$. Making $\alpha p$ small will decrease the second term in the bound, but conversely requires a larger number of iterations $k$ for the first term to decay.

In the second bound, $\mathrm{Diag}(\mathbb{X})^{-1}\mathbb{X}\,\mathrm{Diag}(\mathbb{X})^{-1}$ matches the covariance matrix $\mathrm{Cov}\big(\widetilde{\bm{\beta}}\big)$ of the marginalized loss minimizer $\widetilde{\bm{\beta}}$ up to a term of order $p$. Consequently, the covariance structures induced by dropout and $\ell_2$-regularization approximately coincide for sufficiently small $p$. However, in this regime we have $\mathbb{X}_p=p\mathbb{X}+(1-p)\mathrm{Diag}(\mathbb{X})\approx\mathrm{Diag}(\mathbb{X})$, and $\widetilde{\bm{\beta}}=\mathbb{X}_p^{-1}X^{\top}\mathbf{Y}\approx\mathrm{Diag}(\mathbb{X})^{-1}X^{\top}\mathbf{Y}$ becomes extremely biased whenever the Gram matrix $\mathbb{X}$ is not diagonal.

Theorem 1 already establishes $\mathrm{Cov}(\widetilde{\bm{\beta}})$ as the optimal covariance among all affine estimators that are asymptotically unbiased for $\widetilde{\bm{\beta}}$. To conclude our study of the gap between $\mathrm{Cov}(\widetilde{\bm{\beta}}_k)$ and $\mathrm{Cov}(\widetilde{\bm{\beta}})$, we provide a lower bound.

Theorem 3 (Sub-Optimality of Variance).

In addition to the assumptions of Theorem 2, suppose for every =1,,d\ell=1,\ldots,d there exists mm\neq\ell such that 𝕏m0\mathbb{X}_{\ell m}\neq 0, then

limkCov(𝜷~k)Cov(𝜷~)αp(1p)2λmin(𝕏)2𝕏3minij:𝕏ij0𝕏ij2Id.\displaystyle\lim_{k\to\infty}\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{k}\big{)}-\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}\big{)}\geq\dfrac{\alpha p(1-p)^{2}\lambda_{\mathrm{min}}(\mathbb{X})}{2\lVert\mathbb{X}\rVert^{3}}\min_{i\neq j:\mathbb{X}_{ij}\neq 0}\mathbb{X}_{ij}^{2}\cdot I_{d}.

The lower-bound is positive whenever 𝕏\mathbb{X} is invertible. In general, Theorem 3 entails asymptotic statistical sub-optimality of the gradient descent iterates 𝜷~k\widetilde{\bm{\beta}}_{k} for a large class of design matrices. Moreover, the result does not require any further assumptions on the tuning parameters α\alpha and pp, other than α\alpha being sufficiently small.

To summarize, compared with the marginalized loss minimizer 𝜷~\widetilde{\bm{\beta}}, the covariance matrix of the gradient descent iterates with dropout may be larger. The difference may be significant, especially if the data dimension dd is large. Proving the results requires a refined second moment analysis, based on explicit computation of the dynamics of 𝜷~k𝜷~\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}. Simple heuristics such as (7) do not fully reveal the properties of the underlying dynamics.

4.1.1 Ruppert-Polyak Averaging

To reduce the gradient noise induced by dropout, one may consider the running average over the gradient descent iterates. This technique is also known as Ruppert-Polyak averaging [42, 39]. The kkth Ruppert-Polyak average of the gradient descent iterates is given by

𝜷~krp:=1k=1k𝜷~.\displaystyle\widetilde{\bm{\beta}}^{\mathrm{rp}}_{k}:=\dfrac{1}{k}\sum_{\ell=1}^{k}\widetilde{\bm{\beta}}_{\ell}.
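
For illustration, a minimal sketch (our own, not from the paper) of the dropout gradient descent recursion with the Ruppert-Polyak average maintained incrementally is given below; the inputs X, Y, beta0, alpha, p, and num_iters are placeholders.

\begin{verbatim}
# Minimal sketch of dropout gradient descent with Ruppert-Polyak averaging.
# All inputs (X, Y, beta0, alpha, p, num_iters) are placeholders.
import numpy as np

def dropout_gd_with_averaging(X, Y, beta0, alpha, p, num_iters, rng):
    d = X.shape[1]
    beta = beta0.astype(float).copy()
    avg = np.zeros(d)
    for k in range(1, num_iters + 1):
        mask = rng.binomial(1, p, size=d).astype(float)       # diagonal of D_k
        grad_step = mask * (X.T @ (Y - X @ (mask * beta)))    # D_k X'(Y - X D_k beta)
        beta = beta + alpha * grad_step                       # dropout GD iterate
        avg += (beta - avg) / k                               # running RP average
    return beta, avg
\end{verbatim}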

Averages of this type are well-studied in the stochastic approximation literature, see [40, 19] for results on linear regression and [55, 12] for stochastic gradient descent. The next theorem illustrates convergence of 𝜷~krp\widetilde{\bm{\beta}}^{\mathrm{rp}}_{k} towards 𝜷~\widetilde{\bm{\beta}}.

Theorem 4.

In addition to Assumption 1, suppose α<λmin(𝕏p)3𝕏2\alpha<\tfrac{\lambda_{\mathrm{min}}(\mathbb{X}_{p})}{3\lVert\mathbb{X}\rVert^{2}}, then, for any k=1,2,k=1,2,\ldots

𝔼[(𝜷~krp𝜷~)(𝜷~krp𝜷~)]2𝕏2𝔼[X𝐘𝐘X]k(1p)(mini𝕏ii)4+2Ck2(αp(1p)mini𝕏ii)3,\displaystyle\bigg{\lVert}\mathbb{E}\Big{[}\big{(}\widetilde{\bm{\beta}}^{\mathrm{rp}}_{k}-\widetilde{\bm{\beta}}\big{)}\big{(}\widetilde{\bm{\beta}}^{\mathrm{rp}}_{k}-\widetilde{\bm{\beta}}\big{)}^{\top}\Big{]}\bigg{\rVert}\leq\dfrac{2\lVert\mathbb{X}\rVert^{2}\cdot\big{\lVert}\mathbb{E}[X^{\top}\mathbf{Y}\mathbf{Y}^{\top}X]\big{\rVert}}{k(1-p)(\min_{i}\mathbb{X}_{ii})^{4}}+\dfrac{2C}{k^{2}(\alpha p(1-p)\min_{i}\mathbb{X}_{ii})^{3}},

where CC is the constant from Theorem 2.

The first term in the upper bound is independent of α\alpha and decays at the rate k1k^{-1}, whereas the second term scales with (αp)3k2(\alpha p)^{-3}k^{-2}. Accordingly, for small αp\alpha p, the second term will dominate initially, until kk grows sufficiently large.

Since the right hand side eventually tends to zero, the theorem implies convergence of the Ruppert-Polyak averaged iterates to the marginalized loss minimizer 𝜷~\widetilde{\bm{\beta}}, so the link between dropout and 2\ell_{2}-regularization persists at the variance level. The averaging comes at the price of a slower convergence rate k1k^{-1} of the remainder terms, as opposed to the exponentially fast convergence in Theorem 2. As in (17), the bound can be converted into a convergence rate for 𝔼[𝜷~krp𝜷~22]\mathbb{E}\big{[}\lVert\widetilde{\bm{\beta}}^{\mathrm{rp}}_{k}-\widetilde{\bm{\beta}}\rVert_{2}^{2}\big{]} by taking the trace,

𝔼[𝜷~krp𝜷~22]d(2𝕏2𝔼[X𝐘𝐘X]k(1p)(mini𝕏ii)4+2Ck2(αp(1p)mini𝕏ii)3).\displaystyle\mathbb{E}\Big{[}\big{\lVert}\widetilde{\bm{\beta}}^{\mathrm{rp}}_{k}-\widetilde{\bm{\beta}}\big{\rVert}_{2}^{2}\Big{]}\leq d\left(\dfrac{2\lVert\mathbb{X}\rVert^{2}\cdot\big{\lVert}\mathbb{E}[X^{\top}\mathbf{Y}\mathbf{Y}^{\top}X]\big{\rVert}}{k(1-p)(\min_{i}\mathbb{X}_{ii})^{4}}+\dfrac{2C}{k^{2}(\alpha p(1-p)\min_{i}\mathbb{X}_{ii})^{3}}\right).

4.2 Convergence of Simplified Dropout

To further illustrate how dropout and gradient descent are coupled, we will now study the simplified dropout iterates

𝜷^k=𝜷^k1+αDkX(𝐘X𝜷^k1),\displaystyle\widehat{\bm{\beta}}_{k}=\widehat{\bm{\beta}}_{k-1}+\alpha D_{k}X^{\top}\big{(}\mathbf{Y}-X\widehat{\bm{\beta}}_{k-1}\big{)}, (18)

as defined in (4). While the original dropout reduces the model before taking the gradient, this version takes the gradient first and applies dropout afterwards. As shown in Section 2, both versions coincide if the Gram matrix 𝕏\mathbb{X} is diagonal. Recall from the discussion preceding Lemma 3 that for diagonal 𝕏\mathbb{X}, Cov(𝜷~k)\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{k}\big{)} converges to the optimal covariance matrix. This suggests that for the simplified dropout, no additional randomness in the limit kk\to\infty occurs.
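
A minimal sketch of the simplified recursion (18) (our illustration; all inputs are placeholders) makes the difference explicit: the full least squares gradient is computed first and the dropout mask is applied afterwards.

\begin{verbatim}
# Minimal sketch of the simplified dropout iteration (18); compare with the
# original scheme, where the mask also enters the residual Y - X D_k beta.
import numpy as np

def simplified_dropout_gd(X, Y, beta0, alpha, p, num_iters, rng):
    d = X.shape[1]
    beta = beta0.astype(float).copy()
    for _ in range(num_iters):
        mask = rng.binomial(1, p, size=d).astype(float)   # diagonal of D_k
        beta = beta + alpha * mask * (X.T @ (Y - X @ beta))
    return beta
\end{verbatim}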

The least squares objective 𝜷𝐘X𝜷22\bm{\beta}\mapsto\lVert\mathbf{Y}-X\bm{\beta}\rVert_{2}^{2} always admits a minimizer, with any minimizer 𝜷^\widehat{\bm{\beta}} necessarily solving the so-called normal equations X𝐘=𝕏𝜷^X^{\top}\mathbf{Y}=\mathbb{X}\widehat{\bm{\beta}}. Provided 𝕏\mathbb{X} is invertible, the least-squares estimator 𝜷^=𝕏1X𝐘\widehat{\bm{\beta}}=\mathbb{X}^{-1}X^{\top}\mathbf{Y} gives the unique solution. We will not assume invertibility for all results below, so we let 𝜷^\widehat{\bm{\beta}} denote any solution of the normal equations, unless specified otherwise. In turn, (18) may be rewritten as

𝜷^k𝜷^=(IαDk𝕏)(𝜷^k1𝜷^),\displaystyle\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}=\big{(}I-\alpha D_{k}\mathbb{X}\big{)}\big{(}\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}}\big{)}, (19)

which is simpler than the analogous representation of 𝜷~k\widetilde{\bm{\beta}}_{k} as a VAR process in (15).

As a first result, we will show that the expectation of the simplified dropout iterates 𝜷^k\widehat{\bm{\beta}}_{k} converges to the mean of the unregularized least squares estimator 𝜷^\widehat{\bm{\beta}}, provided that 𝕏\mathbb{X} is invertible. Indeed, using (19), independence of DkD_{k} and (𝜷^k1𝜷^)\big{(}\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}}\big{)}, and 𝔼[Dk]=pI\mathbb{E}\big{[}D_{k}\big{]}=pI, observe that

𝔼[𝜷^k𝜷^]=𝔼[(IαDk𝕏)(𝜷^k1𝜷^)]=(Iαp𝕏)𝔼[𝜷^k1𝜷^].\displaystyle\mathbb{E}\big{[}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{]}=\mathbb{E}\Big{[}\big{(}I-\alpha D_{k}\mathbb{X}\big{)}\big{(}\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}}\big{)}\Big{]}=\big{(}I-\alpha p\mathbb{X}\big{)}\mathbb{E}\big{[}\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}}\big{]}. (20)

Induction on kk now shows 𝔼[𝜷^k𝜷^]=(Iαp𝕏)k𝔼[𝜷^0𝜷^]\mathbb{E}\big{[}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{]}=\big{(}I-\alpha p\mathbb{X}\big{)}^{k}\mathbb{E}\big{[}\widehat{\bm{\beta}}_{0}-\widehat{\bm{\beta}}\big{]} and so

𝔼[𝜷^k𝜷^]2=(Iαp𝕏)k𝔼[𝜷^0𝜷^]2Iαp𝕏k𝔼[𝜷^0𝜷^]2.\displaystyle\Big{\lVert}\mathbb{E}\big{[}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{]}\Big{\rVert}_{2}=\Big{\lVert}\big{(}I-\alpha p\mathbb{X}\big{)}^{k}\mathbb{E}\big{[}\widehat{\bm{\beta}}_{0}-\widehat{\bm{\beta}}\big{]}\Big{\rVert}_{2}\leq\big{\lVert}I-\alpha p\mathbb{X}\big{\rVert}^{k}\Big{\lVert}\mathbb{E}\big{[}\widehat{\bm{\beta}}_{0}-\widehat{\bm{\beta}}\big{]}\Big{\rVert}_{2}.

Assuming αp𝕏<1\alpha p\lVert\mathbb{X}\rVert<1, invertibility of 𝕏\mathbb{X} implies Iαp𝕏<1\big{\lVert}I-\alpha p\mathbb{X}\big{\rVert}<1. Consequently, the convergence is exponential in the number of iterations.
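
The geometric decay of the mean can be checked directly by iterating the recursion (20); the following sketch (our illustration, with an arbitrary invertible Gram matrix and placeholder values for the learning rate, dropout parameter, and initial mean error) does exactly that.

\begin{verbatim}
# Iterating (20): E[beta_k - beta_hat] = (I - alpha*p*X'X) E[beta_{k-1} - beta_hat].
# The Gram matrix, alpha, p, and the initial mean error are placeholder values.
import numpy as np

G = np.array([[2.0, 0.5],
              [0.5, 1.0]])                  # invertible Gram matrix X'X
alpha, p = 0.1, 0.5                         # alpha*p*||G|| < 1 here
M = np.eye(2) - alpha * p * G
m = np.array([1.0, -1.0])                   # E[beta_0 - beta_hat]
for _ in range(500):
    m = M @ m
print(np.linalg.norm(m))                    # of order ||I - alpha*p*G||^500, tiny
\end{verbatim}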

Invertibility of 𝕏\mathbb{X} may be avoided if the initialization 𝜷^0\widehat{\bm{\beta}}_{0} lies in the orthogonal complement of the kernel of 𝕏\mathbb{X} and 𝜷^\widehat{\bm{\beta}} is the 2\lVert\ \cdot\ \rVert_{2}-minimal solution to the normal equations. We can then argue that (Iαp𝕏)k1𝔼[𝜷^0𝜷^](I-\alpha p\mathbb{X})^{k-1}\mathbb{E}[\widehat{\bm{\beta}}_{0}-\widehat{\bm{\beta}}] always stays in a linear subspace on which (Iαp𝕏)(I-\alpha p\mathbb{X}) still acts as a contraction.

We continue with our study of 𝜷^k𝜷^\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}} by employing the same techniques as in the previous section to analyze the second moment. The linear operator

T(A):=(Iαp𝕏)A(Iαp𝕏)+α2p(1p)Diag(𝕏A𝕏),\displaystyle T(A):=\big{(}I-\alpha p\mathbb{X}\big{)}A\big{(}I-\alpha p\mathbb{X}\big{)}+\alpha^{2}p(1-p)\mathrm{Diag}\big{(}\mathbb{X}A\mathbb{X}\big{)}, (21)

defined on d×dd\times d matrices, takes over the role of the affine operator SS encountered in Lemma 7. In particular, the second moments Ak:=𝔼[(𝜷^k𝜷^)(𝜷^k𝜷^)]A_{k}:=\mathbb{E}\big{[}(\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}})(\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}})^{\top}\big{]} evolve as the linear dynamical system

Ak=T(Ak1),k=1,2,\displaystyle A_{k}=T\big{(}A_{k-1}\big{)},\qquad k=1,2,\ldots (22)

without remainder terms. To see this, observe via (19) the identity (𝜷^k𝜷^)(𝜷^k𝜷^)=(IαDk𝕏)(𝜷^k1𝜷^)(𝜷^k1𝜷^)(Iα𝕏Dk)(\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}})(\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}})^{\top}=(I-\alpha D_{k}\mathbb{X})(\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}})(\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}})^{\top}(I-\alpha\mathbb{X}D_{k}). Taking the expectation on both sides, conditioning on DkD_{k}, and recalling that DkD_{k} is independent of (𝜷^k1,𝜷^)(\widehat{\bm{\beta}}_{k-1},\widehat{\bm{\beta}}) gives Ak=𝔼[(IαDk𝕏)Ak1(Iα𝕏Dk)]A_{k}=\mathbb{E}\big{[}(I-\alpha D_{k}\mathbb{X})A_{k-1}(I-\alpha\mathbb{X}D_{k})\big{]}. We have 𝔼[Dk]=pId\mathbb{E}[D_{k}]=pI_{d} and by Lemma 10, 𝔼[Dk𝕏Ak1𝕏Dk]=p(𝕏Ak1𝕏)p=p2𝕏Ak1𝕏+p(1p)Diag(𝕏Ak1𝕏)\mathbb{E}\big{[}D_{k}\mathbb{X}A_{k-1}\mathbb{X}D_{k}\big{]}=p(\mathbb{X}A_{k-1}\mathbb{X})_{p}=p^{2}\mathbb{X}A_{k-1}\mathbb{X}+p(1-p)\mathrm{Diag}(\mathbb{X}A_{k-1}\mathbb{X}). Together with the definition of T(A)T(A), this proves (22).
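
The operator TT is straightforward to implement; the sketch below (our illustration, with an arbitrary invertible Gram matrix and placeholder tuning parameters) iterates the recursion (22) and shows the second moments contracting towards zero for a small learning rate, in line with Theorem 5 below.

\begin{verbatim}
# Minimal sketch of the linear operator T from (21) and the recursion (22).
# The Gram matrix G, alpha, p, and A_0 are arbitrary placeholder choices.
import numpy as np

def T(A, G, alpha, p):
    M = np.eye(G.shape[0]) - alpha * p * G
    return M @ A @ M + alpha**2 * p * (1 - p) * np.diag(np.diag(G @ A @ G))

G = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.5, 0.3],
              [0.0, 0.3, 1.0]])
alpha, p = 0.05, 0.5
A = np.eye(3)                      # A_0 = E[(beta_0 - beta_hat)(beta_0 - beta_hat)']
for _ in range(500):
    A = T(A, G, alpha, p)
print(np.linalg.norm(A, 2))        # the norm has contracted towards zero
\end{verbatim}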

Further results are based on analyzing the recursion in (22). It turns out that convergence of 𝜷^k\widehat{\bm{\beta}}_{k} to 𝜷^\widehat{\bm{\beta}} in second mean requires a non-singular Gram matrix 𝕏\mathbb{X}.

Lemma 4.

Suppose the initialization 𝛃^0\widehat{\bm{\beta}}_{0} is independent of all other sources of randomness and the number of parameters satisfies d2d\geq 2, then there exists a singular 𝕏\mathbb{X}, such that for any positive integer kk, Cov(𝛃^k)Cov(𝛃^k𝛃^)+Cov(𝛃^)\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{k}\big{)}\geq\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{)}+\mathrm{Cov}\big{(}\widehat{\bm{\beta}}\big{)} and

Cov(𝜷^k𝜷^)α2p(1p).\displaystyle\Big{\lVert}\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{)}\Big{\rVert}\geq\alpha^{2}p(1-p).

For invertible 𝕏\mathbb{X}, we can apply Theorem 1 to show that Cov(𝜷^)=𝕏1\mathrm{Cov}\big{(}\widehat{\bm{\beta}}\big{)}=\mathbb{X}^{-1} is the optimal covariance matrix for the sequence of affine estimators 𝜷^k\widehat{\bm{\beta}}_{k}. The simplified dropout iterates actually achieve the optimal variance when 𝕏\mathbb{X} is invertible, which stands in contrast with the situation in Theorem 3.

Theorem 5.

Suppose 𝕏\mathbb{X} is invertible, αmin{1p𝕏,λmin(𝕏)𝕏2}\alpha\leq\min\Big{\{}\tfrac{1}{p\lVert\mathbb{X}\rVert},\tfrac{\lambda_{\mathrm{min}}(\mathbb{X})}{\lVert\mathbb{X}\rVert^{2}}\Big{\}}, and let 𝛃^0\widehat{\bm{\beta}}_{0} be square-integrable, then, for any k=1,2,k=1,2,\ldots

𝔼[(𝜷^k𝜷^)(𝜷^k𝜷^)](1αpλmin(𝕏))k𝔼[(𝜷^0𝜷^)(𝜷^0𝜷^)].\displaystyle\bigg{\lVert}\mathbb{E}\Big{[}\big{(}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{)}\big{(}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{)}^{\top}\Big{]}\bigg{\rVert}\leq\big{(}1-\alpha p\lambda_{\mathrm{min}}(\mathbb{X})\big{)}^{k}\bigg{\lVert}\mathbb{E}\Big{[}\big{(}\widehat{\bm{\beta}}_{0}-\widehat{\bm{\beta}}\big{)}\big{(}\widehat{\bm{\beta}}_{0}-\widehat{\bm{\beta}}\big{)}^{\top}\Big{]}\bigg{\rVert}.

Intuitively, the result holds due to the operator TT in (21) being linear, as opposed to affine like in the case of Lemma 2. Choosing α\alpha sufficiently small ensures that TT acts as a contraction, meaning Tk(A)0T^{k}(A)\to 0 for any matrix AA. Hence, linearity of TT serves as an algebraic expression of the simplified dynamics. As in (17), we may take the trace to obtain the bound

𝔼[𝜷^k𝜷^22]d(1αpλmin(𝕏))k𝔼[(𝜷^0𝜷^)(𝜷^0𝜷^)].\displaystyle\mathbb{E}\Big{[}\big{\lVert}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{\rVert}_{2}^{2}\Big{]}\leq d\big{(}1-\alpha p\lambda_{\mathrm{min}}(\mathbb{X})\big{)}^{k}\bigg{\lVert}\mathbb{E}\Big{[}\big{(}\widehat{\bm{\beta}}_{0}-\widehat{\bm{\beta}}\big{)}\big{(}\widehat{\bm{\beta}}_{0}-\widehat{\bm{\beta}}\big{)}^{\top}\Big{]}\bigg{\rVert}.

5 Discussion and Outlook

Our main contributions may be summarized as follows: We studied dropout in the linear regression model, but unlike previous results, we explicitly analyzed the gradient descent dynamics with new dropout noise being sampled in each iteration. This allows us to characterize the limiting variance of the gradient descent iterates exactly (Theorem 2), which sheds light on the covariance structure induced via dropout. Our main tool in the analysis is a particular recursion (Lemma 2), which may be exhibited by “anchoring” the gradient descent iterates around the marginalized loss minimizer 𝜷~\widetilde{\bm{\beta}}. To further understand the interaction between noise and gradient descent dynamics, we analyze the running average of the process (Theorem 4) and a simplified version of dropout (Theorem 5).

We view our analysis of the linear model as a fundamental first step towards understanding the dynamics of gradient descent with dropout. Analyzing the linear model has been a fruitful approach to study other phenomena in deep learning, such as overfitting [47], sharpness of local minima [7], and in-context learning [54]. We conclude by proposing some natural directions for future work.

Random minibatch sampling: For yet another way of incorporating dropout, we may compute the gradient based on a random subset of the data (minibatches). In this case, the updating formula satisfies

𝜷¯k+1\displaystyle\overline{\bm{\beta}}_{k+1} =𝜷¯kα𝜷¯k12Dk+1(𝐘X𝜷¯k)22=𝜷¯k+αXDk+1(𝐘X𝜷¯k).\displaystyle=\overline{\bm{\beta}}_{k}-\alpha\nabla_{\overline{\bm{\beta}}_{k}}\dfrac{1}{2}\Big{\lVert}D_{k+1}\big{(}\mathbf{Y}-X\overline{\bm{\beta}}_{k}\big{)}\Big{\rVert}^{2}_{2}=\overline{\bm{\beta}}_{k}+\alpha X^{\top}D_{k+1}\big{(}\mathbf{Y}-X\overline{\bm{\beta}}_{k}\big{)}. (23)

The dropout matrices are now of dimension n×nn\times n and select a random subset of the data points in every iteration. This version of dropout also relates to randomly weighted least squares and resampling methods [13]. The update formula may be written in the form 𝜷¯k+1𝜷^=(IαXDk+1X)(𝜷¯k𝜷^)+αXDk+1(𝐘X𝜷^)\overline{\bm{\beta}}_{k+1}-\widehat{\bm{\beta}}=(I-\alpha X^{\top}D_{k+1}X)(\overline{\bm{\beta}}_{k}-\widehat{\bm{\beta}})+\alpha X^{\top}D_{k+1}(\mathbf{Y}-X\widehat{\bm{\beta}}) with 𝜷^\widehat{\bm{\beta}} solving the normal equations X𝐘=𝕏𝜷^X^{\top}\mathbf{Y}=\mathbb{X}\widehat{\bm{\beta}}. Similarly to the corresponding reformulation (14) of the original dropout scheme, this defines a vector autoregressive process with random coefficients and lag one.
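
A minimal sketch of the update (23) (our illustration; all inputs are placeholders) highlights that the Bernoulli mask now acts on the nn residuals rather than on the dd coefficients.

\begin{verbatim}
# Minimal sketch of the minibatch-style update (23): an n x n dropout matrix
# selects a random subset of observations in each iteration.
import numpy as np

def minibatch_dropout_gd(X, Y, beta0, alpha, p, num_iters, rng):
    n = X.shape[0]
    beta = beta0.astype(float).copy()
    for _ in range(num_iters):
        keep = rng.binomial(1, p, size=n).astype(float)   # diagonal of D_{k+1}
        beta = beta + alpha * X.T @ (keep * (Y - X @ beta))
    return beta
\end{verbatim}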

Learning rates: The proof ideas may be generalized to a sequence of iteration-dependent learning rates αk\alpha_{k}. We expect this to come at the cost of more involved formulas. Specifically, the operator SS in Lemma 2 will depend on the iteration number through αk\alpha_{k}, so the limit in Theorem 2 cannot be expressed as (idSlin)1S0(\mathrm{id}-S_{\mathrm{lin}})^{-1}S_{0} anymore.

Random design and stochastic gradients: We considered a fixed design matrix XX and (full) gradient descent, whereas in machine learning practice inputs are typically assumed to be random and parameters are updated via stochastic gradient descent (SGD). The recent works [8, 44] derive convergence rates for SGD, considering linear regression and another form of noisy gradient descent. We believe that parts of these analyses carry over to dropout.

Generic dropout distributions: As already mentioned, [46] carry out simulations with dropout where Diii.i.d.𝒩(1,1)D_{ii}\overset{\scriptscriptstyle i.i.d.}{\sim}\mathcal{N}(1,1). Gaussian dropout distributions are currently supported, or easily implemented, in major software libraries [1, 10, 26, 38]. Analyzing a generic dropout distribution with mean μ\mu and variance σ2\sigma^{2} may also paint a clearer picture of how the dropout noise interacts with the gradient descent dynamics. For the linear regression model, results that marginalize over the dropout noise generalize to arbitrary dropout distributions. In particular, (9) turns into

𝜷~:=\displaystyle\widetilde{\bm{\beta}}:= argmin𝜷𝔼[𝐘XD𝜷22|𝐘]\displaystyle\ \operatorname*{arg\,min}_{\bm{\beta}}\mathbb{E}\Big{[}\big{\lVert}\mathbf{Y}-XD\bm{\beta}\big{\rVert}^{2}_{2}\ \big{|}\ \mathbf{Y}\Big{]}
=\displaystyle= argmin𝜷𝐘μX𝜷22+σ2𝜷Diag(𝕏)𝜷\displaystyle\ \operatorname*{arg\,min}_{\bm{\beta}}\big{\lVert}\mathbf{Y}-\mu X\bm{\beta}\big{\rVert}^{2}_{2}+\sigma^{2}\bm{\beta}^{\top}\mathrm{Diag}(\mathbb{X})\bm{\beta}
=\displaystyle= μ(μ2𝕏+σ2Diag(𝕏))1X𝐘.\displaystyle\ \mu\big{(}\mu^{2}\mathbb{X}+\sigma^{2}\mathrm{Diag}(\mathbb{X})\big{)}^{-1}X^{\top}\mathbf{Y}.

If Diii.i.d.𝒩(1,1)D_{ii}\overset{\scriptscriptstyle i.i.d.}{\sim}\mathcal{N}(1,1), then 𝜷~=(𝕏+Diag(𝕏))1X𝐘\widetilde{\bm{\beta}}=\big{(}\mathbb{X}+\mathrm{Diag}(\mathbb{X})\big{)}^{-1}X^{\top}\mathbf{Y}.
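
The closed-form expression above is easy to evaluate numerically; a minimal sketch (our illustration, with a hypothetical helper name) is given below, with the Gaussian case μ=1\mu=1, σ2=1\sigma^{2}=1 recovering (𝕏+Diag(𝕏))1X𝐘\big{(}\mathbb{X}+\mathrm{Diag}(\mathbb{X})\big{)}^{-1}X^{\top}\mathbf{Y}.

\begin{verbatim}
# Minimal sketch: marginalized loss minimizer for a generic dropout
# distribution with mean mu and variance sigma2, following the display above.
import numpy as np

def marginalized_minimizer(X, Y, mu, sigma2):
    G = X.T @ X
    return mu * np.linalg.solve(mu**2 * G + sigma2 * np.diag(np.diag(G)), X.T @ Y)

# Gaussian N(1, 1) dropout (mu = 1, sigma2 = 1) yields (X'X + Diag(X'X))^{-1} X'Y.
# Bernoulli(p) dropout (mu = p, sigma2 = p*(1-p)) recovers X_p^{-1} X'Y from (9).
\end{verbatim}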

In contrast, treatment of the corresponding iterative dropout scheme seems more involved. The analysis of dropout with Bernoulli distributions relies in parts on the projection property D2=DD^{2}=D. Without it, additional terms occur in the moments in Lemma 10, which is required for the computation of the covariance matrix. For example, the formula 𝔼[DADBD]=pApBp+p2(1p)Diag(A¯B)\mathbb{E}\big{[}DADBD\big{]}=pA_{p}B_{p}+p^{2}(1-p)\mathrm{Diag}(\overline{A}B) turns into

𝔼[DADBD]\displaystyle\mathbb{E}\big{[}DADBD\big{]} =1μ1(μ12A+σ2Diag(A))(μ12B+σ2Diag(B))+σ2μ1Diag(A¯B¯)\displaystyle=\dfrac{1}{\mu_{1}}\big{(}\mu_{1}^{2}A+\sigma^{2}\mathrm{Diag}(A)\big{)}\big{(}\mu_{1}^{2}B+\sigma^{2}\mathrm{Diag}(B)\big{)}+\sigma^{2}\mu_{1}\mathrm{Diag}(\overline{A}\,\overline{B})
+(μ3μ22μ1)Diag(A)Diag(B),\displaystyle\qquad+\bigg{(}\mu_{3}-\frac{\mu_{2}^{2}}{\mu_{1}}\bigg{)}\mathrm{Diag}(A)\mathrm{Diag}(B),

where μr\mu_{r} denotes the rrth moment of the dropout distribution and σ2\sigma^{2} its variance. For the Bernoulli distribution, all moments equal pp, so μ3μ22/μ1=0\mu_{3}-\mu_{2}^{2}/\mu_{1}=0 and the last term disappears. Similarly, more terms will appear in the fourth moment of DD, making the expression for the operator corresponding to SS in Lemma 2 more complicated.

Inducing robustness via dropout: Among the possible ways of explaining the data, dropout should, by design, favor an explanation that is robust against setting a random subset of the parameters to zero. [34] indicate that dropout in two-layer linear networks tends to equalize the norms of different weight vectors.

To study the robustness properties of dropout, one may analyze loss functions measuring how well the response vector is predicted when each estimated regression coefficient is deleted with probability 1p1-p. Given an estimator 𝜷^\widehat{\bm{\beta}}, a natural choice would be the loss

L(𝜷^,𝜷):=𝔼[X(D𝜷^𝜷)22|𝐘]=(p𝜷^𝜷)𝕏(p𝜷^𝜷)+p(1p)𝜷^Diag(𝕏)𝜷^,\displaystyle L\big{(}\widehat{\bm{\beta}},\bm{\beta}_{\star}\big{)}:=\mathbb{E}\Big{[}\big{\lVert}X(D\widehat{\bm{\beta}}-\bm{\beta}_{\star})\big{\rVert}_{2}^{2}\ \big{|}\ \mathbf{Y}\Big{]}=\big{(}p\widehat{\bm{\beta}}-\bm{\beta}_{\star}\big{)}^{\top}\mathbb{X}\big{(}p\widehat{\bm{\beta}}-\bm{\beta}_{\star}\big{)}+p(1-p)\widehat{\bm{\beta}}^{\top}\mathrm{Diag}\big{(}\mathbb{X}\big{)}\widehat{\bm{\beta}},

with DD a new draw of the dropout matrix, independent of all other randomness. This loss depends on the unknown true regression vector 𝜷\bm{\beta}_{\star}. Since 𝔼[𝐘]=X𝜷\mathbb{E}[\mathbf{Y}]=X\bm{\beta}_{\star}, an empirical version of the loss may replace X𝜷X\bm{\beta}_{\star} with 𝐘\mathbf{Y}, considering 𝔼[XD𝜷^𝐘22𝐘]\mathbb{E}\big{[}\lVert XD\widehat{\bm{\beta}}-\mathbf{Y}\rVert_{2}^{2}\mid\mathbf{Y}\big{]}. As shown in (9), 𝜷~=𝕏p1X𝐘\widetilde{\bm{\beta}}=\mathbb{X}_{p}^{-1}X^{\top}\mathbf{Y} minimizes this loss function. This suggests that 𝜷~\widetilde{\bm{\beta}} may possess some optimality properties for the loss L(,𝜷)L(\ \cdot\ ,\bm{\beta}_{\star}) defined above.
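
The closed-form expression for the loss above can be verified by simulation; the following sketch (our illustration, treating 𝜷^\widehat{\bm{\beta}} as a fixed vector and using arbitrary sizes and parameter values) compares it against a Monte Carlo average over fresh dropout matrices.

\begin{verbatim}
# Monte Carlo check of the closed-form robustness loss displayed above;
# beta_hat is treated as a fixed vector and all values are placeholders.
import numpy as np

rng = np.random.default_rng(1)
n, d, p = 50, 4, 0.7
X = rng.standard_normal((n, d))
G = X.T @ X
beta_hat = rng.standard_normal(d)
beta_star = rng.standard_normal(d)

closed_form = ((p * beta_hat - beta_star) @ G @ (p * beta_hat - beta_star)
               + p * (1 - p) * beta_hat @ np.diag(np.diag(G)) @ beta_hat)

masks = rng.binomial(1, p, size=(100000, d))    # rows: diagonals of fresh draws of D
mc = np.mean(np.sum(((masks * beta_hat - beta_star) @ X.T) ** 2, axis=1))
print(closed_form, mc)                           # the two values nearly agree
\end{verbatim}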

Shallow networks with linear activation function: Multi-layer neural networks do not admit unique minimizers. In comparison with the linear regression model, this poses a major challenge for the analysis of dropout. [34] consider shallow linear networks of the form f(𝐱)=UV𝐱f(\mathbf{x})=UV^{\top}\mathbf{x} with U=(𝐮1,,𝐮m)U=(\mathbf{u}_{1},\ldots,\mathbf{u}_{m}) an n×mn\times m matrix and V=(𝐯1,,𝐯m)V=(\mathbf{v}_{1},\ldots,\mathbf{v}_{m}) a d×md\times m matrix. Suppose DD is an m×mm\times m dropout matrix. Assuming the random design vector 𝐗\mathbf{X} satisfies 𝔼[𝐗𝐗]=Id\mathbb{E}\big{[}\mathbf{X}\mathbf{X}^{\top}\big{]}=I_{d}, and marginalizing over dropout noise applied to the columns of UU (or equivalently to the rows of VV) leads to an 2\ell_{2}-penalty via

𝔼[𝐘p1UDV𝐗22𝐘]=𝔼[𝐘UV𝐗22𝐘]+1ppi=1m𝐮i22𝐯i22=𝔼[𝐘UV𝐗22𝐘]+1ppTr(Diag(UU)Diag(VV)).\begin{split}&\mathbb{E}\Big{[}\big{\lVert}\mathbf{Y}-p^{-1}UDV^{\top}\mathbf{X}\big{\rVert}_{2}^{2}\mid\mathbf{Y}\Big{]}\\ =\ &\mathbb{E}\Big{[}\big{\lVert}\mathbf{Y}-UV^{\top}\mathbf{X}\big{\rVert}_{2}^{2}\mid\mathbf{Y}\Big{]}+\frac{1-p}{p}\sum_{i=1}^{m}\lVert\mathbf{u}_{i}\rVert_{2}^{2}\lVert\mathbf{v}_{i}\rVert_{2}^{2}\\ =\ &\mathbb{E}\Big{[}\big{\lVert}\mathbf{Y}-UV^{\top}\mathbf{X}\big{\rVert}_{2}^{2}\mid\mathbf{Y}\Big{]}+\frac{1-p}{p}\mathrm{Tr}\Big{(}\mathrm{Diag}(U^{\top}U)\mathrm{Diag}(V^{\top}V)\Big{)}.\end{split} (24)
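
The penalty term in (24) can likewise be checked by simulation. The sketch below (our illustration, assuming 𝐗𝒩(0,Id)\mathbf{X}\sim\mathcal{N}(0,I_{d}) and arbitrary dimensions) compares a Monte Carlo estimate of the marginalization gap 𝔼p1UDV𝐗UV𝐗22\mathbb{E}\lVert p^{-1}UDV^{\top}\mathbf{X}-UV^{\top}\mathbf{X}\rVert_{2}^{2} with the explicit penalty.

\begin{verbatim}
# Monte Carlo check of the penalty term in (24) for X ~ N(0, I_d) and
# Bernoulli(p) dropout on the m columns of U (rows of V). Sizes are arbitrary.
import numpy as np

rng = np.random.default_rng(2)
n_out, d, m, p = 3, 4, 5, 0.8
U = rng.standard_normal((n_out, m))
V = rng.standard_normal((d, m))

penalty = (1 - p) / p * np.sum(np.sum(U**2, axis=0) * np.sum(V**2, axis=0))

reps = 200000
Xs = rng.standard_normal((reps, d))
Ds = rng.binomial(1, p, size=(reps, m)).astype(float)
gap = ((Xs @ V) * (Ds / p - 1.0)) @ U.T        # rows: U (p^{-1}D - I) V'X
mc = np.mean(np.sum(gap**2, axis=1))
print(penalty, mc)                              # the two values nearly agree
\end{verbatim}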

As an extension of our approach, it seems natural to investigate whether gradient descent with dropout will converge to the same minimizer or involve additional terms in the variance. In contrast with linear regression, the marginalized loss (24) is non-convex and does not admit a unique minimizer. Hence, we cannot simply center the gradient descent iterates around a specific closed-form estimator, as in Section 4. To extend our techniques we may expect to replace the centering estimator with the gradient descent iterates for the marginalized loss function, demanding a more careful analysis.

[45] study the gradient flow associated with (24) and exhibit exponential convergence of UU and VV towards a minimizer. Extending the existing result on gradient flows to gradient descent is, however, non-trivial, see for example the gradient descent version of Theorem 3.1 in [5] provided in Theorem 2.4 of [36].

To be more precise, suppose 𝐘=W𝐗+𝜺\mathbf{Y}=W_{\star}\mathbf{X}+\bm{\varepsilon}, where 𝐗\mathbf{X} and 𝜺\bm{\varepsilon} are independent random vectors, so the task reduces to learning a factorization WUVW_{\star}\approx UV^{\top} based on noisy evaluation of W𝐗W_{\star}\mathbf{X}. Consider the randomized loss

L(U,V):=12𝐘p1UDV𝐗22,\displaystyle L\big{(}U,V\big{)}:=\dfrac{1}{2}\Big{\lVert}\mathbf{Y}-p^{-1}UDV^{\top}\mathbf{X}\Big{\rVert}_{2}^{2},

with respective gradients

UL(U,V)\displaystyle\nabla_{U}L\big{(}U,V\big{)} =(𝐘p1UDV𝐗)𝐗Vp1D\displaystyle=-\big{(}\mathbf{Y}-p^{-1}UDV^{\top}\mathbf{X}\big{)}\mathbf{X}^{\top}Vp^{-1}D
VL(U,V)\displaystyle\nabla_{V^{\top}}L\big{(}U,V\big{)} =p1DU(𝐘Up1DV𝐗)𝐗.\displaystyle=-p^{-1}DU^{\top}\big{(}\mathbf{Y}-Up^{-1}DV^{\top}\mathbf{X}\big{)}\mathbf{X}^{\top}.

Given observations (𝐘k,𝐗k)i.i.d.(𝐘,𝐗)(\mathbf{Y}_{k},\mathbf{X}_{k})\overset{\scriptscriptstyle i.i.d.}{\sim}(\mathbf{Y},\mathbf{X}) and independent dropout matrices Dki.i.d.DD_{k}\overset{\scriptscriptstyle i.i.d.}{\sim}D, the factorized structure leads to two coupled dynamical systems Uk+1=UkαUkL(Uk,Vk)U_{k+1}=U_{k}-\alpha\nabla_{U_{k}}L(U_{k},V_{k}) and Vk+1=VkαVkL(Uk,Vk)V_{k+1}^{\top}=V_{k}^{\top}-\alpha\nabla_{V_{k}^{\top}}L(U_{k},V_{k}), which are linked through the appearance of VkV_{k} in UkL(Uk,Vk)\nabla_{U_{k}}L(U_{k},V_{k}) and UkU_{k} in VkL(Uk,Vk)\nabla_{V_{k}}L(U_{k},V_{k}). Due to non-convexity of the underlying marginalized objective (24), the resulting dynamics should be sensitive to initialization. Suppose Uk=Pk(U0,V0)U_{k}=P_{k}(U_{0},V_{0}) and Vk=Qk(U0,V0)V_{k}=Q_{k}(U_{0},V_{0}) are given as random matrix polynomials (Pk,Qk)(P_{k},Q_{k}) in (U0,V0)(U_{0},V_{0}), meaning finite sums of expressions like A1Xi1A2Xi2AnXinAn+1A_{1}X_{i_{1}}A_{2}X_{i_{2}}\cdots A_{n}X_{i_{n}}A_{n+1}, where Xij{U0,U0,V0,V0}X_{i_{j}}\in\{U_{0},U_{0}^{\top},V_{0},V_{0}^{\top}\} and the AjA_{j} are random coefficient matrices. Now, the gradient descent recursions lead to

Uk+1=Pk(U0,V0)+α(𝐘kPk(U0,V0)p1DkQk(U0,V0)𝐗k)𝐗kQk(U0,V0)p1Dk=:Pk+1(U0,V0),Vk+1=Qk(U0,V0)+αp1DkPk(U0,V0)(𝐘kPk(U0,V0)p1DkQk(U0,V0)𝐗k)𝐗k=:Qk+1(U0,V0),\begin{split}U_{k+1}&=P_{k}(U_{0},V_{0})+\alpha\Big{(}\mathbf{Y}_{k}-P_{k}(U_{0},V_{0})p^{-1}D_{k}Q_{k}^{\top}(U_{0},V_{0})\mathbf{X}_{k}\Big{)}\mathbf{X}_{k}^{\top}Q_{k}(U_{0},V_{0})p^{-1}D_{k}\\ &=:P_{k+1}(U_{0},V_{0}),\\ V_{k+1}^{\top}&=Q_{k}^{\top}(U_{0},V_{0})+\alpha p^{-1}D_{k}P_{k}^{\top}(U_{0},V_{0})\Big{(}\mathbf{Y}_{k}-P_{k}(U_{0},V_{0})p^{-1}D_{k}Q_{k}^{\top}(U_{0},V_{0})\mathbf{X}_{k}\Big{)}\mathbf{X}_{k}^{\top}\\ &=:Q_{k+1}^{\top}(U_{0},V_{0}),\end{split} (25)

so (Uk+1,Vk+1)(U_{k+1},V_{k+1}) is also a polynomial in (U0,V0)(U_{0},V_{0}) with random coefficients. A difficulty in analyzing this recursion is that the degree of the polynomial increases exponentially fast. Indeed, since Pk+1P_{k+1} includes the term p2PkDkQk𝐗k𝐗kQkDkp^{-2}P_{k}D_{k}Q_{k}^{\top}\mathbf{X}_{k}\mathbf{X}_{k}^{\top}Q_{k}D_{k}, the degree of Pk+1P_{k+1} is the degree of PkP_{k} plus twice the degree of QkQ_{k}. During each gradient descent step, additional randomness is introduced via the newly sampled dropout matrix DkD_{k} and the training data (𝐘k,𝐗k)(\mathbf{Y}_{k},\mathbf{X}_{k}). Accordingly, the coefficients of Pk+1P_{k+1} and Qk+1Q_{k+1} fluctuate around the coefficients of 𝔼[Pk+1Uk,Vk]\mathbb{E}[P_{k+1}\mid U_{k},V_{k}] and 𝔼[Qk+1Uk,Vk]\mathbb{E}[Q_{k+1}\mid U_{k},V_{k}]. A principled analysis of the resulting dynamics requires careful accounting of how these fluctuations propagate through the iterations. [45] show that the gradient flow trajectories and minimizers of (24) satisfy specific symmetries, so one should hope to reduce the algebraic complexity of the problem by finding analogous symmetries in the stochastic recursions (25).

Alternatively, one may consider layer-wise training of the weight matrices to break the dependence between UkU_{k} and VkV_{k}. Given K1>0K_{1}>0, suppose we keep U0U_{0} fixed while taking K1K_{1} gradient steps

Vk+1=Vk+αp1DkU0(𝐘U0p1DkVk𝐗k)𝐗k\displaystyle V_{k+1}^{\top}=V_{k}^{\top}+\alpha p^{-1}D_{k}U_{0}^{\top}\big{(}\mathbf{Y}-U_{0}p^{-1}D_{k}V_{k}^{\top}\mathbf{X}_{k}\big{)}\mathbf{X}_{k}^{\top}

followed by K2>0K_{2}>0 gradient steps of the form

Uk+1\displaystyle U_{k+1} =Uk+α(𝐘Ukp1DkVK1𝐗k)𝐗kVK1p1Dk.\displaystyle=U_{k}+\alpha\big{(}\mathbf{Y}-U_{k}p^{-1}D_{k}V_{K_{1}}^{\top}\mathbf{X}_{k}\big{)}\mathbf{X}_{k}^{\top}V_{K_{1}}p^{-1}D_{k}.

In each phase, the gradient descent recursion solves a linear regression problem similar to the one in our analysis of the linear model. We leave the details for future work.
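
For concreteness, a minimal sketch of this layer-wise scheme (our illustration; the data-generating callable and all parameters are placeholders) reads as follows.

\begin{verbatim}
# Minimal sketch of layer-wise training with dropout: K1 gradient steps on V
# with U_0 frozen, followed by K2 steps on U with V_{K1} frozen. The function
# sample(rng) is a placeholder returning one observation pair (Y, X).
import numpy as np

def layerwise_dropout_gd(sample, U0, V0, alpha, p, K1, K2, rng):
    m = U0.shape[1]
    U, V = U0.astype(float).copy(), V0.astype(float).copy()
    for _ in range(K1):                              # phase 1: update V only
        Y, X = sample(rng)
        D = rng.binomial(1, p, size=m) / p           # diagonal of p^{-1} D_k
        resid = Y - U @ (D * (V.T @ X))
        V += alpha * np.outer(X, D * (U.T @ resid))  # transposed V-update
    for _ in range(K2):                              # phase 2: update U only
        Y, X = sample(rng)
        D = rng.binomial(1, p, size=m) / p
        resid = Y - U @ (D * (V.T @ X))
        U += alpha * np.outer(resid, D * (V.T @ X))
    return U, V
\end{verbatim}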

Appendix A Proofs for Section 3

A.1 Proof of Equation (10)

Recall the definition 𝜷~=𝕏p1X𝐘\widetilde{\bm{\beta}}=\mathbb{X}_{p}^{-1}X^{\top}\mathbf{Y}. By Lemma 10(a), 𝔼[D𝕏D]=p𝕏p\mathbb{E}[D\mathbb{X}D]=p\mathbb{X}_{p} and so

𝔼[DX(𝐘XD𝜷~)|𝐘]=pX𝐘p𝕏p𝜷~=0.\displaystyle\mathbb{E}\Big{[}DX^{\top}\big{(}\mathbf{Y}-XD\widetilde{\bm{\beta}}\big{)}\ \big{|}\ \mathbf{Y}\Big{]}=pX^{\top}\mathbf{Y}-p\mathbb{X}_{p}\widetilde{\bm{\beta}}=0. (26)

Note the identity 𝐘=XD𝜷~+(𝐘XD𝜷~)\mathbf{Y}=XD\widetilde{\bm{\beta}}+(\mathbf{Y}-XD\widetilde{\bm{\beta}}). Hence, if 𝜷^\widehat{\bm{\beta}} denotes any estimator, XD𝜷^𝐘=XD(𝜷^𝜷~)(𝐘XD𝜷~)XD\widehat{\bm{\beta}}-\mathbf{Y}=XD(\widehat{\bm{\beta}}-\widetilde{\bm{\beta}})-(\mathbf{Y}-XD\widetilde{\bm{\beta}}). By (26), 𝔼[(𝜷^𝜷~)DX(𝐘XD𝜷~)𝐘]=0\mathbb{E}\big{[}(\widehat{\bm{\beta}}-\widetilde{\bm{\beta}})^{\top}DX^{\top}(\mathbf{Y}-XD\widetilde{\bm{\beta}})\mid\mathbf{Y}\big{]}=0 and

𝔼[XD𝜷^𝐘22|𝐘]\displaystyle\mathbb{E}\Big{[}\big{\lVert}XD\widehat{\bm{\beta}}-\mathbf{Y}\big{\rVert}_{2}^{2}\ \big{|}\ \mathbf{Y}\Big{]} =𝔼[XD𝜷~𝐘22|𝐘]+𝔼[XD(𝜷~𝜷^)22|𝐘]\displaystyle=\mathbb{E}\Big{[}\big{\lVert}XD\widetilde{\bm{\beta}}-\mathbf{Y}\big{\rVert}_{2}^{2}\ \big{|}\ \mathbf{Y}\Big{]}+\mathbb{E}\Big{[}\big{\lVert}XD\big{(}\widetilde{\bm{\beta}}-\widehat{\bm{\beta}}\big{)}\big{\rVert}_{2}^{2}\ \big{|}\ \mathbf{Y}\Big{]}
2𝔼[(𝜷^𝜷~)DX(𝐘XD𝜷~)|𝐘]\displaystyle\qquad-2\mathbb{E}\Big{[}\big{(}\widehat{\bm{\beta}}-\widetilde{\bm{\beta}}\big{)}^{\top}DX^{\top}\big{(}\mathbf{Y}-XD\widetilde{\bm{\beta}}\big{)}\ \big{|}\ \mathbf{Y}\Big{]}
=𝔼[XD𝜷~𝐘22|𝐘]+𝔼[XD(𝜷~𝜷^)22|𝐘],\displaystyle=\mathbb{E}\Big{[}\big{\lVert}XD\widetilde{\bm{\beta}}-\mathbf{Y}\big{\rVert}_{2}^{2}\ \big{|}\ \mathbf{Y}\Big{]}+\mathbb{E}\Big{[}\big{\lVert}XD\big{(}\widetilde{\bm{\beta}}-\widehat{\bm{\beta}}\big{)}\big{\rVert}_{2}^{2}\ \big{|}\ \mathbf{Y}\Big{]},

which is the claimed expression. ∎

A.2 Proof of Equation (11)

Let r:=rank(X)r:=\mathrm{rank}(X) and consider a singular value decomposition X==1rσ𝐯𝐰X=\sum_{\ell=1}^{r}\sigma_{\ell}\mathbf{v}_{\ell}\mathbf{w}_{\ell}^{\top}. The right-singular vectors 𝐰\mathbf{w}_{\ell}, =1,,r\ell=1,\ldots,r are orthonormal, meaning 𝕏==1rσ2𝐰𝐰\mathbb{X}=\sum_{\ell=1}^{r}\sigma_{\ell}^{2}\mathbf{w}_{\ell}\mathbf{w}_{\ell}^{\top} and since Diag(𝕏)=𝕏11Id\mathrm{Diag}(\mathbb{X})=\mathbb{X}_{11}\cdot I_{d},

𝕏p=p(=1rσ2𝐰𝐰)+(1p)𝕏11Id.\displaystyle\mathbb{X}_{p}=p\left(\sum_{\ell=1}^{r}\sigma_{\ell}^{2}\mathbf{w}_{\ell}\mathbf{w}_{\ell}^{\top}\right)+(1-p)\mathbb{X}_{11}I_{d}.

Each right-singular vector 𝐰\mathbf{w}_{\ell} is an eigenvector of 𝕏p\mathbb{X}_{p}, with associated eigenvalue pσ2+(1p)𝕏11p\sigma_{\ell}^{2}+(1-p)\mathbb{X}_{11}. If r<dr<d, suppose 𝐰m\mathbf{w}_{m}, m=r+1,,dm=r+1,\ldots,d complete the orthonormal basis, then 𝕏p𝐰m=(1p)𝕏11𝐰m\mathbb{X}_{p}\mathbf{w}_{m}=(1-p)\mathbb{X}_{11}\mathbf{w}_{m}. Consequently, 𝕏p1\mathbb{X}_{p}^{-1} admits 𝐰\mathbf{w}_{\ell} as an eigenvector for every =1,,d\ell=1,\ldots,d and

𝕏p1==1r1pσ2+(1p)𝕏11𝐰𝐰+m=r+1d1(1p)𝕏11𝐰m𝐰m.\displaystyle\mathbb{X}_{p}^{-1}=\sum_{\ell=1}^{r}\dfrac{1}{p\sigma_{\ell}^{2}+(1-p)\mathbb{X}_{11}}\mathbf{w}_{\ell}\mathbf{w}_{\ell}^{\top}+\sum_{m=r+1}^{d}\dfrac{1}{(1-p)\mathbb{X}_{11}}\mathbf{w}_{m}\mathbf{w}_{m}^{\top}.

By definition, X𝐘==1rσ𝐰𝐯𝐘X^{\top}\mathbf{Y}=\sum_{\ell=1}^{r}\sigma_{\ell}\mathbf{w}_{\ell}\mathbf{v}_{\ell}^{\top}\mathbf{Y} and 𝐘==1rσ𝐯𝐰𝜷+𝜺\mathbf{Y}=\sum_{\ell=1}^{r}\sigma_{\ell}\mathbf{v}_{\ell}\mathbf{w}_{\ell}^{\top}\bm{\beta}_{\star}+\bm{\varepsilon}. Combining these facts now leads to

𝜷~==1rσpσ2+(1p)𝕏11𝐰𝐯𝐘\displaystyle\widetilde{\bm{\beta}}=\sum_{\ell=1}^{r}\dfrac{\sigma_{\ell}}{p\sigma_{\ell}^{2}+(1-p)\mathbb{X}_{11}}\mathbf{w}_{\ell}\mathbf{v}_{\ell}^{\top}\mathbf{Y}

and

𝔼[X𝜷~]\displaystyle\mathbb{E}\big{[}X\widetilde{\bm{\beta}}\big{]} ==1rσ2pσ2+(1p)𝕏11𝐯𝐯𝔼[𝐘]\displaystyle=\sum_{\ell=1}^{r}\dfrac{\sigma^{2}_{\ell}}{p\sigma_{\ell}^{2}+(1-p)\mathbb{X}_{11}}\mathbf{v}_{\ell}\mathbf{v}_{\ell}^{\top}\mathbb{E}\big{[}\mathbf{Y}\big{]}
==1r1p+(1p)𝕏11/σ2𝐯𝐯X𝜷\displaystyle=\sum_{\ell=1}^{r}\dfrac{1}{p+(1-p)\mathbb{X}_{11}/\sigma_{\ell}^{2}}\mathbf{v}_{\ell}\mathbf{v}_{\ell}^{\top}X\bm{\beta}_{\star}
==1r1p+(1p)𝕏11/σ2σ𝐯𝐰𝜷.\displaystyle=\sum_{\ell=1}^{r}\dfrac{1}{p+(1-p)\mathbb{X}_{11}/\sigma_{\ell}^{2}}\sigma_{\ell}\mathbf{v}_{\ell}\mathbf{w}_{\ell}^{\top}\bm{\beta}_{\star}.

A.3 Proof of Equation (12)

Set A:=X(𝕏+(p11)Diag(𝕏))1XA:=X\big{(}\mathbb{X}+(p^{-1}-1)\mathrm{Diag}(\mathbb{X})\big{)}^{-1}X^{\top} and consider 𝐰=𝐮+𝐯\mathbf{w}=\mathbf{u}+\mathbf{v} with 𝐮𝐯=0\mathbf{u}^{\top}\mathbf{v}=0 and X𝐯=𝟎X^{\top}\mathbf{v}=\mathbf{0}. Observe that

A2=AX(𝕏+(p11)Diag(𝕏))1(p11)Diag(𝕏)(𝕏+(p11)Diag(𝕏))1X\displaystyle A^{2}=A-X\big{(}\mathbb{X}+(p^{-1}-1)\mathrm{Diag}(\mathbb{X})\big{)}^{-1}(p^{-1}-1)\mathrm{Diag}(\mathbb{X})\big{(}\mathbb{X}+(p^{-1}-1)\mathrm{Diag}(\mathbb{X})\big{)}^{-1}X^{\top}

and thus 𝐰A2𝐰=𝐮A2𝐮𝐮A𝐮=𝐰A𝐰\mathbf{w}^{\top}A^{2}\mathbf{w}=\mathbf{u}^{\top}A^{2}\mathbf{u}\leq\mathbf{u}^{\top}A\mathbf{u}=\mathbf{w}^{\top}A\mathbf{w}. If 𝐮\mathbf{u} is the zero vector, 0=𝐰A2𝐰=𝐰A𝐰0=\mathbf{w}^{\top}A^{2}\mathbf{w}=\mathbf{w}^{\top}A\mathbf{w}. Otherwise, Diag(𝕏)>0\mathrm{Diag}(\mathbb{X})>0 implies that (𝕏+(p11)Diag(𝕏))1(p11)Diag(𝕏)(𝕏+(p11)Diag(𝕏))1\big{(}\mathbb{X}+(p^{-1}-1)\mathrm{Diag}(\mathbb{X})\big{)}^{-1}(p^{-1}-1)\mathrm{Diag}(\mathbb{X})\big{(}\mathbb{X}+(p^{-1}-1)\mathrm{Diag}(\mathbb{X})\big{)}^{-1} is positive definite, so we have the strict inequality 𝐰A2𝐰<𝐰A𝐰\mathbf{w}^{\top}A^{2}\mathbf{w}<\mathbf{w}^{\top}A\mathbf{w} whenever 𝐮0\mathbf{u}\neq 0.

Now, suppose that AA has an eigenvector 𝐰\mathbf{w} with 𝐰2=1\lVert\mathbf{w}\rVert_{2}=1 and corresponding eigenvalue λ1\lambda\geq 1, which implies 𝐮𝟎\mathbf{u}\neq\mathbf{0}. Then, we have 𝐰A2𝐰=λ2λ=𝐰A𝐰\mathbf{w}^{\top}A^{2}\mathbf{w}=\lambda^{2}\geq\lambda=\mathbf{w}^{\top}A\mathbf{w}. Equality only holds if 𝐰A𝐰=1\mathbf{w}^{\top}A\mathbf{w}=1. This contradicts the strict inequality 𝐰A2𝐰<𝐰A𝐰\mathbf{w}^{\top}A^{2}\mathbf{w}<\mathbf{w}^{\top}A\mathbf{w}, so all eigenvalues have to be strictly smaller than one. ∎

Appendix B Proofs for Section 4

B.1 Proof of Lemma 1

To show that Iαp𝕏p1αp(1p)mini𝕏ii<1\big{\lVert}I-\alpha p\mathbb{X}_{p}\big{\rVert}\leq 1-\alpha p(1-p)\min_{i}\mathbb{X}_{ii}<1, note that 𝕏p𝕏\lVert\mathbb{X}_{p}\rVert\leq\lVert\mathbb{X}\rVert by Lemma 13 and recall αp𝕏<1\alpha p\lVert\mathbb{X}\rVert<1 from Assumption 1. Hence,

Iαp𝕏p=1αpλmin(𝕏p).\displaystyle\big{\lVert}I-\alpha p\mathbb{X}_{p}\big{\rVert}=1-\alpha p\lambda_{\min}(\mathbb{X}_{p}). (27)

For any vector 𝐯\mathbf{v} with 𝐯2=1\lVert\mathbf{v}\rVert_{2}=1 and any two d×dd\times d positive semi-definite matrices AA and BB, 𝐯(A+B)𝐯=𝐯A𝐯+𝐯B𝐯λmin(A)+λmin(B)\mathbf{v}^{\top}(A+B)\mathbf{v}=\mathbf{v}^{\top}A\mathbf{v}+\mathbf{v}^{\top}B\mathbf{v}\geq\lambda_{\mathrm{min}}(A)+\lambda_{\mathrm{min}}(B). Hence, λmin(A+B)λmin(A)+λmin(B)\lambda_{\mathrm{min}}(A+B)\geq\lambda_{\mathrm{min}}(A)+\lambda_{\mathrm{min}}(B) and λmin(𝕏p)(1p)λmin(Diag(𝕏))=(1p)mini𝕏ii\lambda_{\mathrm{min}}(\mathbb{X}_{p})\geq(1-p)\lambda_{\mathrm{min}}\big{(}\mathrm{Diag}(\mathbb{X})\big{)}=(1-p)\min_{i}\mathbb{X}_{ii}. By Assumption 1, the design matrix XX has no zero columns, guaranteeing mini𝕏ii>0\min_{i}\mathbb{X}_{ii}>0. Combined with (27), we now obtain Iαp𝕏p1αp(1p)mini𝕏ii<1\big{\lVert}I-\alpha p\mathbb{X}_{p}\big{\rVert}\leq 1-\alpha p(1-p)\min_{i}\mathbb{X}_{ii}<1.

To prove the bound on the expectation, recall from (14) that 𝜷~k𝜷~\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}} equals (IαDk𝕏p)(𝜷~k1𝜷~)+αDk𝕏¯(pIDk)𝜷~k1\big{(}I-\alpha D_{k}\mathbb{X}_{p}\big{)}\big{(}\widetilde{\bm{\beta}}_{k-1}-\widetilde{\bm{\beta}}\big{)}+\alpha D_{k}\overline{\mathbb{X}}\big{(}pI-D_{k}\big{)}\widetilde{\bm{\beta}}_{k-1}. Lemma 10(a) shows that 𝔼[DAD]=pAp\mathbb{E}[DAD]=pA_{p} and Lemma 9(b) gives A¯p=pA¯\overline{A}_{p}=p\overline{A}. In turn, we have 𝔼[Dk𝕏¯Dk]=p𝕏¯p=p2𝕏¯\mathbb{E}\big{[}D_{k}\overline{\mathbb{X}}D_{k}\big{]}=p\overline{\mathbb{X}}_{p}=p^{2}\overline{\mathbb{X}} and

𝔼[Dk𝕏¯(pIDk)]=p2𝕏¯p2𝕏¯=0.\displaystyle\mathbb{E}\big{[}D_{k}\overline{\mathbb{X}}(pI-D_{k})\big{]}=p^{2}\overline{\mathbb{X}}-p^{2}\overline{\mathbb{X}}=0. (28)

Conditioning on all randomness except DkD_{k} now implies

𝔼[𝜷~k𝜷~𝜷~,𝜷~k1]=(Iαp𝕏p)(𝜷~k1𝜷~).\displaystyle\mathbb{E}\big{[}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\mid\widetilde{\bm{\beta}},\widetilde{\bm{\beta}}_{k-1}\big{]}=\big{(}I-\alpha p\mathbb{X}_{p}\big{)}\big{(}\widetilde{\bm{\beta}}_{k-1}-\widetilde{\bm{\beta}}\big{)}. (29)

By the tower rule 𝔼[𝜷~k𝜷~]=(Iαp𝕏p)𝔼[𝜷~k1𝜷~]\mathbb{E}\big{[}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\big{]}=\big{(}I-\alpha p\mathbb{X}_{p}\big{)}\mathbb{E}\big{[}\widetilde{\bm{\beta}}_{k-1}-\widetilde{\bm{\beta}}\big{]}, so induction on kk gives

𝔼[𝜷~k𝜷~]=(Iαp𝕏p)k𝔼[𝜷~0𝜷~].\displaystyle\mathbb{E}\big{[}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\big{]}=\big{(}I-\alpha p\mathbb{X}_{p}\big{)}^{k}\mathbb{E}\big{[}\widetilde{\bm{\beta}}_{0}-\widetilde{\bm{\beta}}\big{]}.

Sub-multiplicativity of the spectral norm implies (Iαp𝕏p)kIαp𝕏pk\big{\lVert}(I-\alpha p\mathbb{X}_{p})^{k}\big{\rVert}\leq\left\lVert I-\alpha p\mathbb{X}_{p}\right\rVert^{k}, proving that 𝔼[𝜷~k𝜷~]2Iαp𝕏pk𝔼[𝜷~0𝜷~]2\big{\lVert}\mathbb{E}[\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}]\big{\rVert}_{2}\leq\big{\lVert}I-\alpha p\mathbb{X}_{p}\big{\rVert}^{k}\big{\lVert}\mathbb{E}[\widetilde{\bm{\beta}}_{0}-\widetilde{\bm{\beta}}]\big{\rVert}_{2}. ∎

B.2 Proof of Theorem 1

We have Cov(𝜷~aff)=Cov(𝜷~A+(𝜷~aff𝜷~A))=Cov(𝜷~A)+Cov(𝜷~aff𝜷~A)+Cov(𝜷~A,𝜷~aff𝜷~A)+Cov(𝜷~aff𝜷~A,𝜷~A)\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{\mathrm{aff}}\big{)}=\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{A}+(\widetilde{\bm{\beta}}_{\mathrm{aff}}-\widetilde{\bm{\beta}}_{A})\big{)}=\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{A}\big{)}+\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{\mathrm{aff}}-\widetilde{\bm{\beta}}_{A}\big{)}+\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{A},\widetilde{\bm{\beta}}_{\mathrm{aff}}-\widetilde{\bm{\beta}}_{A}\big{)}+\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{\mathrm{aff}}-\widetilde{\bm{\beta}}_{A},\widetilde{\bm{\beta}}_{A}\big{)}, so the triangle inequality implies

Cov(𝜷~aff)Cov(𝜷~A)Cov(𝜷~aff𝜷~A)2Cov(𝜷~aff𝜷~A,𝜷~A).\displaystyle\Big{\lVert}\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{\mathrm{aff}}\big{)}-\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{A}\big{)}-\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{\mathrm{aff}}-\widetilde{\bm{\beta}}_{A}\big{)}\Big{\rVert}\leq 2\Big{\lVert}\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{\mathrm{aff}}-\widetilde{\bm{\beta}}_{A},\widetilde{\bm{\beta}}_{A}\big{)}\Big{\rVert}. (30)

Write B:=BAXB^{\prime}:=B-AX^{\top}, then 𝜷~aff𝜷~A=B𝐘+𝐚\widetilde{\bm{\beta}}_{\mathrm{aff}}-\widetilde{\bm{\beta}}_{A}=B^{\prime}\mathbf{Y}+\mathbf{a}. When conditioned on 𝐘\mathbf{Y}, the estimator 𝜷~A=AX𝐘\widetilde{\bm{\beta}}_{A}=AX^{\top}\mathbf{Y} is deterministic. Hence, the law of total covariance yields

Cov(𝜷~aff𝜷~A,𝜷~A)\displaystyle\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{\mathrm{aff}}-\widetilde{\bm{\beta}}_{A},\widetilde{\bm{\beta}}_{A}\big{)} =𝔼[Cov(𝜷~aff𝜷~A,𝜷~A𝐘)]+Cov(𝔼[𝜷~aff𝜷~A𝐘],𝔼[𝜷~A𝐘])\displaystyle=\mathbb{E}\Big{[}\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{\mathrm{aff}}-\widetilde{\bm{\beta}}_{A},\widetilde{\bm{\beta}}_{A}\mid\mathbf{Y}\big{)}\Big{]}+\mathrm{Cov}\Big{(}\mathbb{E}\big{[}\widetilde{\bm{\beta}}_{\mathrm{aff}}-\widetilde{\bm{\beta}}_{A}\mid\mathbf{Y}\big{]},\mathbb{E}\big{[}\widetilde{\bm{\beta}}_{A}\mid\mathbf{Y}\big{]}\Big{)}
=0+Cov(𝔼[B]𝐘+𝔼[𝐚],AX𝐘)=Cov(𝔼[B]𝐘,AX𝐘).\displaystyle=0+\mathrm{Cov}\big{(}\mathbb{E}[B^{\prime}]\mathbf{Y}+\mathbb{E}[\mathbf{a}],AX^{\top}\mathbf{Y}\big{)}=\mathrm{Cov}\big{(}\mathbb{E}[B^{\prime}]\mathbf{Y},AX^{\top}\mathbf{Y}\big{)}.

Further, Cov(𝐘)=I\mathrm{Cov}(\mathbf{Y})=I implies Cov(𝔼[B]𝐘,AX𝐘)=𝔼[B]Cov(𝐘)XA=𝔼[B]XA\mathrm{Cov}\big{(}\mathbb{E}[B^{\prime}]\mathbf{Y},AX^{\top}\mathbf{Y}\big{)}=\mathbb{E}[B^{\prime}]\mathrm{Cov}(\mathbf{Y})XA^{\top}=\mathbb{E}[B^{\prime}]XA^{\top}. Using 𝜷~aff𝜷~A=B𝐘+𝐚\widetilde{\bm{\beta}}_{\mathrm{aff}}-\widetilde{\bm{\beta}}_{A}=B^{\prime}\mathbf{Y}+\mathbf{a}, note that 𝔼[B]X𝜷=𝔼𝜷[𝜷~aff𝜷~A]𝔼𝟎[𝜷~aff𝜷~A]\mathbb{E}[B^{\prime}]X\bm{\beta}_{\star}=\mathbb{E}_{\bm{\beta}_{\star}}\big{[}\widetilde{\bm{\beta}}_{\mathrm{aff}}-\widetilde{\bm{\beta}}_{A}\big{]}-\mathbb{E}_{\mathbf{0}}\big{[}\widetilde{\bm{\beta}}_{\mathrm{aff}}-\widetilde{\bm{\beta}}_{A}\big{]} with 𝟎=(0,,0)\mathbf{0}=(0,\ldots,0)^{\top}. Combining these identities, sub-multiplicativity of the spectral norm, and the triangle inequality leads to

Cov(𝜷~aff𝜷~A,𝜷~A)\displaystyle\Big{\lVert}\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{\mathrm{aff}}-\widetilde{\bm{\beta}}_{A},\widetilde{\bm{\beta}}_{A}\big{)}\Big{\rVert} A𝔼[B]X\displaystyle\leq\lVert A\rVert\cdot\big{\lVert}\mathbb{E}[B^{\prime}]X\big{\rVert}
=Asup𝜷:𝜷21𝔼[B]X𝜷2\displaystyle=\lVert A\rVert\sup_{\bm{\beta}_{\star}:\lVert\bm{\beta}_{\star}\rVert_{2}\leq 1}\big{\lVert}\mathbb{E}[B^{\prime}]X\bm{\beta}_{\star}\big{\rVert}_{2}
Asup𝜷:𝜷21(𝔼𝜷[𝜷~aff𝜷~A]2+𝔼𝟎[𝜷~aff𝜷~A]2)\displaystyle\leq\lVert A\rVert\sup_{\bm{\beta}_{\star}:\lVert\bm{\beta}_{\star}\rVert_{2}\leq 1}\bigg{(}\Big{\lVert}\mathbb{E}_{\bm{\beta}_{\star}}\big{[}\widetilde{\bm{\beta}}_{\mathrm{aff}}-\widetilde{\bm{\beta}}_{A}\big{]}\Big{\rVert}_{2}+\Big{\lVert}\mathbb{E}_{\mathbf{0}}\big{[}\widetilde{\bm{\beta}}_{\mathrm{aff}}-\widetilde{\bm{\beta}}_{A}\big{]}\Big{\rVert}_{2}\bigg{)}
2Asup𝜷:𝜷21𝔼𝜷[𝜷~aff𝜷~A]2.\displaystyle\leq 2\lVert A\rVert\sup_{\bm{\beta}_{\star}:\lVert\bm{\beta}_{\star}\rVert_{2}\leq 1}\Big{\lVert}\mathbb{E}_{\bm{\beta}_{\star}}\big{[}\widetilde{\bm{\beta}}_{\mathrm{aff}}-\widetilde{\bm{\beta}}_{A}\big{]}\Big{\rVert}_{2}.

Together with (30), this proves the result. ∎

B.3 Proof of Lemma 2

Given a random vector UU and a random element VV, observe that 𝔼[Cov(UV)]=𝔼[UU]𝔼[𝔼[UV]𝔼[UV]]\mathbb{E}\big{[}\mathrm{Cov}(U\mid V)\big{]}=\mathbb{E}\big{[}UU^{\top}\big{]}-\mathbb{E}\big{[}\mathbb{E}[U\mid V]\mathbb{E}[U\mid V]^{\top}\big{]}. Inserting U=𝜷~k𝜷~U=\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}} and V=(𝜷~k1,𝜷~)V=\big{(}\widetilde{\bm{\beta}}_{k-1},\widetilde{\bm{\beta}}\big{)}, as well as defining Ak:=𝔼[(𝜷~k𝜷~)(𝜷~k𝜷~)]A_{k}:=\mathbb{E}\big{[}(\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}})(\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}})^{\top}\big{]}, leads to

Ak=𝔼[𝔼[𝜷~k𝜷~𝜷~k1,𝜷~]𝔼[𝜷~k𝜷~𝜷~k1,𝜷~]]+𝔼[Cov(𝜷~k𝜷~𝜷~k1,𝜷~)]\displaystyle A_{k}=\mathbb{E}\Big{[}\mathbb{E}\big{[}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\mid\widetilde{\bm{\beta}}_{k-1},\widetilde{\bm{\beta}}\big{]}\mathbb{E}\big{[}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\mid\widetilde{\bm{\beta}}_{k-1},\widetilde{\bm{\beta}}\big{]}^{\top}\Big{]}+\mathbb{E}\Big{[}\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\mid\widetilde{\bm{\beta}}_{k-1},\widetilde{\bm{\beta}}\big{)}\Big{]} (31)

Recall 𝔼[𝜷~k𝜷~𝜷~,𝜷~k1]=(Iαp𝕏p)(𝜷~k1𝜷~)\mathbb{E}\big{[}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\mid\widetilde{\bm{\beta}},\widetilde{\bm{\beta}}_{k-1}\big{]}=(I-\alpha p\mathbb{X}_{p})\big{(}\widetilde{\bm{\beta}}_{k-1}-\widetilde{\bm{\beta}}\big{)} from (29), and so

𝔼[𝔼[𝜷~k𝜷~𝜷~k1,𝜷~]𝔼[𝜷~k𝜷~𝜷~k1,𝜷~]]=(Iαp𝕏p)Ak1(Iαp𝕏p),\displaystyle\mathbb{E}\Big{[}\mathbb{E}\big{[}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\mid\widetilde{\bm{\beta}}_{k-1},\widetilde{\bm{\beta}}\big{]}\mathbb{E}\big{[}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\mid\widetilde{\bm{\beta}}_{k-1},\widetilde{\bm{\beta}}\big{]}^{\top}\Big{]}=\big{(}I-\alpha p\mathbb{X}_{p}\big{)}A_{k-1}\big{(}I-\alpha p\mathbb{X}_{p}\big{)}, (32)

where Ak1:=𝔼[(𝜷~k1𝜷~)(𝜷~k1𝜷~)]A_{k-1}:=\mathbb{E}\big{[}(\widetilde{\bm{\beta}}_{k-1}-\widetilde{\bm{\beta}})(\widetilde{\bm{\beta}}_{k-1}-\widetilde{\bm{\beta}})^{\top}\big{]}.

Evaluating the conditional covariance Cov(𝜷~k𝜷~𝜷~k1,𝜷~)\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\mid\widetilde{\bm{\beta}}_{k-1},\widetilde{\bm{\beta}}\big{)} is the more challenging part, requiring moments up to fourth order in DkD_{k}, see Lemma 11. Recall that

S(A)\displaystyle S(A) =(Iαp𝕏p)A(Iαp𝕏p)\displaystyle=\big{(}I-\alpha p\mathbb{X}_{p}\big{)}A\big{(}I-\alpha p\mathbb{X}_{p}\big{)}
+α2p(1p)Diag(𝕏pA𝕏p)+α2p2(1p)2𝕏A+𝔼[𝜷~𝜷~]¯𝕏\displaystyle\quad+\alpha^{2}p(1-p)\mathrm{Diag}\big{(}\mathbb{X}_{p}A\mathbb{X}_{p}\big{)}+\alpha^{2}p^{2}(1-p)^{2}\mathbb{X}\odot\overline{A+\mathbb{E}\big{[}\widetilde{\bm{\beta}}\widetilde{\bm{\beta}}^{\top}\big{]}}\odot\mathbb{X}
+α2p2(1p)((𝕏¯Diag(A+𝔼[𝜷~𝜷~])𝕏¯)p+𝕏¯Diag(𝕏pA)+Diag(𝕏pA)𝕏¯).\displaystyle\quad+\alpha^{2}p^{2}(1-p)\bigg{(}\Big{(}\overline{\mathbb{X}}\mathrm{Diag}\Big{(}A+\mathbb{E}\big{[}\widetilde{\bm{\beta}}\widetilde{\bm{\beta}}^{\top}\big{]}\Big{)}\overline{\mathbb{X}}\Big{)}_{p}+\overline{\mathbb{X}}\mathrm{Diag}\big{(}\mathbb{X}_{p}A\big{)}+\mathrm{Diag}\big{(}\mathbb{X}_{p}A\big{)}\overline{\mathbb{X}}\bigg{)}.
Lemma 5.

For every positive integer kk,

𝔼[Cov(𝜷~k𝜷~𝜷~k1,𝜷~)]\displaystyle\mathbb{E}\Big{[}\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\mid\widetilde{\bm{\beta}}_{k-1},\widetilde{\bm{\beta}}\big{)}\Big{]} =S(Ak1)(Iαp𝕏p)Ak1(Iαp𝕏p)+ρk1,\displaystyle=S(A_{k-1})-\big{(}I-\alpha p\mathbb{X}_{p}\big{)}A_{k-1}\big{(}I-\alpha p\mathbb{X}_{p}\big{)}+\rho_{k-1},

with remainder ρk1\rho_{k-1} vanishing at the rate

ρk16Iαp𝕏pk1𝔼[(𝜷~0𝜷~)𝜷~].\displaystyle\big{\lVert}\rho_{k-1}\big{\rVert}\leq 6\big{\lVert}I-\alpha p\mathbb{X}_{p}\big{\rVert}^{k-1}\Big{\lVert}\mathbb{E}\big{[}(\widetilde{\bm{\beta}}_{0}-\widetilde{\bm{\beta}})\widetilde{\bm{\beta}}^{\top}\big{]}\Big{\rVert}.

Proof  Recall from (14) that 𝜷~k𝜷~=(IαDk𝕏p)(𝜷~k1𝜷~)+αDk𝕏¯(pIDk)𝜷~k1\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}=\big{(}I-\alpha D_{k}\mathbb{X}_{p}\big{)}\big{(}\widetilde{\bm{\beta}}_{k-1}-\widetilde{\bm{\beta}}\big{)}+\alpha D_{k}\overline{\mathbb{X}}\big{(}pI-D_{k}\big{)}\widetilde{\bm{\beta}}_{k-1}. The covariance is invariant under deterministic shifts and sign flips, so

Cov(𝜷~k𝜷~|𝜷~k1,𝜷~)=α2Cov(Dk𝕏p(𝜷~k1𝜷~)+Dk𝕏¯(DkpI)𝜷~k1|𝜷~k1,𝜷~).\displaystyle\mathrm{Cov}\Big{(}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\ \big{|}\ \widetilde{\bm{\beta}}_{k-1},\widetilde{\bm{\beta}}\Big{)}=\alpha^{2}\mathrm{Cov}\Big{(}D_{k}\mathbb{X}_{p}\big{(}\widetilde{\bm{\beta}}_{k-1}-\widetilde{\bm{\beta}}\big{)}+D_{k}\overline{\mathbb{X}}\big{(}D_{k}-pI\big{)}\widetilde{\bm{\beta}}_{k-1}\ \big{|}\ \widetilde{\bm{\beta}}_{k-1},\widetilde{\bm{\beta}}\Big{)}.

Applying Lemma 11 with 𝐮:=𝕏p(𝜷~k1𝜷~)\mathbf{u}:=\mathbb{X}_{p}\big{(}\widetilde{\bm{\beta}}_{k-1}-\widetilde{\bm{\beta}}\big{)}, A¯:=𝕏¯\overline{A}:=\overline{\mathbb{X}}, and 𝐯:=𝜷~k1\mathbf{v}:=\widetilde{\bm{\beta}}_{k-1}, we find

Cov(𝜷~k𝜷~|𝜷~k1,𝜷~)α2p(1p)=Diag(𝕏p(𝜷~k1𝜷~)(𝜷~k1𝜷~)𝕏p)+p𝕏¯Diag(𝕏p(𝜷~k1𝜷~)𝜷~k1)+pDiag(𝜷~k1(𝜷~k1𝜷~)𝕏p)𝕏¯+p(𝕏¯Diag(𝜷~k1𝜷~k1)𝕏¯)p+p(1p)𝕏𝜷~k1𝜷~k1¯𝕏.\begin{split}\dfrac{\mathrm{Cov}(\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\ |\ \widetilde{\bm{\beta}}_{k-1},\widetilde{\bm{\beta}})}{\alpha^{2}p(1-p)}&=\mathrm{Diag}\Big{(}\mathbb{X}_{p}\big{(}\widetilde{\bm{\beta}}_{k-1}-\widetilde{\bm{\beta}}\big{)}\big{(}\widetilde{\bm{\beta}}_{k-1}-\widetilde{\bm{\beta}}\big{)}^{\top}\mathbb{X}_{p}\Big{)}\\ &\quad+p\overline{\mathbb{X}}\mathrm{Diag}\Big{(}\mathbb{X}_{p}\big{(}\widetilde{\bm{\beta}}_{k-1}-\widetilde{\bm{\beta}}\big{)}\widetilde{\bm{\beta}}^{\top}_{k-1}\Big{)}\\ &\quad+p\mathrm{Diag}\Big{(}\widetilde{\bm{\beta}}_{k-1}\big{(}\widetilde{\bm{\beta}}_{k-1}-\widetilde{\bm{\beta}}\big{)}^{\top}\mathbb{X}_{p}\Big{)}\overline{\mathbb{X}}\\ &\quad+p\Big{(}\overline{\mathbb{X}}\mathrm{Diag}\big{(}\widetilde{\bm{\beta}}_{k-1}\widetilde{\bm{\beta}}_{k-1}^{\top}\big{)}\overline{\mathbb{X}}\Big{)}_{p}\\ &\quad+p(1-p)\mathbb{X}\odot\overline{\widetilde{\bm{\beta}}_{k-1}\widetilde{\bm{\beta}}^{\top}_{k-1}}\odot\mathbb{X}.\end{split} (33)

Set Bk1:=𝔼[(𝜷~k1𝜷~)𝜷~]B_{k-1}:=\mathbb{E}\big{[}(\widetilde{\bm{\beta}}_{k-1}-\widetilde{\bm{\beta}})\widetilde{\bm{\beta}}^{\top}\big{]} and recall Ak1=𝔼[(𝜷~k1𝜷~)(𝜷~k1𝜷~)]A_{k-1}=\mathbb{E}\big{[}(\widetilde{\bm{\beta}}_{k-1}-\widetilde{\bm{\beta}})(\widetilde{\bm{\beta}}_{k-1}-\widetilde{\bm{\beta}})^{\top}\big{]}. Note the identities 𝔼[(𝜷~k1𝜷~)𝜷~k1]=Ak1+Bk1\mathbb{E}\big{[}(\widetilde{\bm{\beta}}_{k-1}-\widetilde{\bm{\beta}})\widetilde{\bm{\beta}}_{k-1}^{\top}\big{]}=A_{k-1}+B_{k-1} and 𝔼[𝜷~k1𝜷~k1]=Ak1+𝔼[𝜷~𝜷~]+Bk1+Bk1\mathbb{E}\big{[}\widetilde{\bm{\beta}}_{k-1}\widetilde{\bm{\beta}}_{k-1}^{\top}\big{]}=A_{k-1}+\mathbb{E}\big{[}\widetilde{\bm{\beta}}\widetilde{\bm{\beta}}^{\top}\big{]}+B_{k-1}+B_{k-1}^{\top}. Taking the expectation of (33), multiplying both sides with α2p(1p)\alpha^{2}p(1-p), and using the definition of S(A)S(A) proves the claimed expression for 𝔼[Cov(𝜷~k𝜷~𝜷~k1,𝜷~)]\mathbb{E}\big{[}\mathrm{Cov}(\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\mid\widetilde{\bm{\beta}}_{k-1},\widetilde{\bm{\beta}})\big{]} with remainder term

ρk1=α2p(1p)(p𝕏¯Diag(𝕏pBk1)+pDiag(Bk1𝕏p)𝕏¯+p(𝕏¯Diag(Bk1+Bk1)𝕏¯)p+p(1p)𝕏(Bk1+Bk1)¯𝕏).\begin{split}\rho_{k-1}=\alpha^{2}p(1-p)\bigg{(}&p\overline{\mathbb{X}}\mathrm{Diag}\big{(}\mathbb{X}_{p}B_{k-1}\big{)}+p\mathrm{Diag}\big{(}B_{k-1}^{\top}\mathbb{X}_{p}\big{)}\overline{\mathbb{X}}+p\Big{(}\overline{\mathbb{X}}\mathrm{Diag}\big{(}B_{k-1}+B_{k-1}^{\top}\big{)}\overline{\mathbb{X}}\Big{)}_{p}\\ &+p(1-p)\mathbb{X}\odot\overline{\big{(}B_{k-1}+B_{k-1}^{\top}\big{)}}\odot\mathbb{X}\bigg{)}.\end{split} (34)

For any d×dd\times d matrices AA and BB, Lemma 13 provides the inequalities Diag(A)A\big{\lVert}\mathrm{Diag}(A)\big{\rVert}\leq\lVert A\rVert, ApA\lVert A_{p}\rVert\leq\lVert A\rVert, and ABAB\lVert A\odot B\rVert\leq\lVert A\rVert\cdot\lVert B\rVert. If AA is moreover positive semi-definite, then also A¯A\big{\lVert}\overline{A}\big{\rVert}\leq\left\lVert A\right\rVert. Combined with the sub-multiplicativity of the spectral norm, this implies

𝕏¯Diag(𝕏pBk1)\displaystyle\big{\lVert}\overline{\mathbb{X}}\mathrm{Diag}(\mathbb{X}_{p}B_{k-1})\big{\rVert} 𝕏¯Diag(𝕏pBk1)\displaystyle\leq\big{\lVert}\overline{\mathbb{X}}\big{\rVert}\cdot\big{\lVert}\mathrm{Diag}(\mathbb{X}_{p}B_{k-1})\big{\rVert}
𝕏𝕏pBk1𝕏2Bk1\displaystyle\leq\lVert\mathbb{X}\rVert\cdot\lVert\mathbb{X}_{p}\rVert\cdot\big{\lVert}B_{k-1}\big{\rVert}\leq\lVert\mathbb{X}\rVert^{2}\cdot\big{\lVert}B_{k-1}\big{\rVert}
(𝕏¯Diag(Bk1+Bk1)𝕏¯)p\displaystyle\Big{\lVert}\Big{(}\overline{\mathbb{X}}\mathrm{Diag}\big{(}B_{k-1}+B_{k-1}^{\top}\big{)}\overline{\mathbb{X}}\Big{)}_{p}\Big{\rVert} 𝕏¯Diag(Bk1+Bk1)𝕏¯\displaystyle\leq\big{\lVert}\overline{\mathbb{X}}\mathrm{Diag}\big{(}B_{k-1}+B_{k-1}^{\top}\big{)}\overline{\mathbb{X}}\big{\rVert}
𝕏¯2Bk1+Bk12𝕏2Bk1\displaystyle\leq\big{\lVert}\overline{\mathbb{X}}\big{\rVert}^{2}\cdot\big{\lVert}B_{k-1}+B_{k-1}^{\top}\big{\rVert}\leq 2\lVert\mathbb{X}\rVert^{2}\cdot\big{\lVert}B_{k-1}\big{\rVert}
𝕏(Bk1+Bk1)¯𝕏\displaystyle\Big{\lVert}\mathbb{X}\odot\overline{\big{(}B_{k-1}+B_{k-1}^{\top}\big{)}}\odot\mathbb{X}\Big{\rVert} 2𝕏2Bk1.\displaystyle\leq 2\lVert\mathbb{X}\rVert^{2}\cdot\big{\lVert}B_{k-1}\big{\rVert}.

By Assumption 1, we also have αp𝕏<1\alpha p\lVert\mathbb{X}\rVert<1, so combining the upper bounds with (34) leads to

ρk16(αp)2𝕏2Bk16Bk1.\displaystyle\big{\lVert}\rho_{k-1}\big{\rVert}\leq 6(\alpha p)^{2}\lVert\mathbb{X}\rVert^{2}\cdot\big{\lVert}B_{k-1}\big{\rVert}\leq 6\big{\lVert}B_{k-1}\big{\rVert}. (35)

The argument is to be completed by bounding Bk1\big{\lVert}B_{k-1}\big{\rVert}. Using (14), we have (𝜷~k1𝜷~)𝜷~=(IαDk1𝕏p)(𝜷~k2𝜷~)𝜷~+αDk1𝕏¯(pIDk1)𝜷~k1𝜷~\big{(}\widetilde{\bm{\beta}}_{k-1}-\widetilde{\bm{\beta}}\big{)}\widetilde{\bm{\beta}}^{\top}=\big{(}I-\alpha D_{k-1}\mathbb{X}_{p}\big{)}\big{(}\widetilde{\bm{\beta}}_{k-2}-\widetilde{\bm{\beta}}\big{)}\widetilde{\bm{\beta}}^{\top}+\alpha D_{k-1}\overline{\mathbb{X}}\big{(}pI-D_{k-1}\big{)}\widetilde{\bm{\beta}}_{k-1}\widetilde{\bm{\beta}}^{\top}. In (28), it was shown that 𝔼[Dk1𝕏¯(pIDk1)]=0\mathbb{E}\big{[}D_{k-1}\overline{\mathbb{X}}(pI-D_{k-1})\big{]}=0. Recalling that Dk1D_{k-1} is independent of (𝜷~,𝜷~k2)\big{(}\widetilde{\bm{\beta}},\widetilde{\bm{\beta}}_{k-2}\big{)} and 𝔼[Dk1]=pId\mathbb{E}[D_{k-1}]=pI_{d}, we obtain Bk1=(Iαp𝕏p)Bk2B_{k-1}=\big{(}I-\alpha p\mathbb{X}_{p}\big{)}B_{k-2}. By induction on kk, Bk1=(Iαp𝕏p)k1𝔼[(𝜷~0𝜷~)𝜷~]B_{k-1}=\big{(}I-\alpha p\mathbb{X}_{p}\big{)}^{k-1}\mathbb{E}\big{[}(\widetilde{\bm{\beta}}_{0}-\widetilde{\bm{\beta}}\big{)}\widetilde{\bm{\beta}}^{\top}\big{]}. Using sub-multiplicativity of the spectral norm,

Bk1Iαp𝕏pk1𝔼[(𝜷~0𝜷~)𝜷~].\displaystyle\big{\lVert}B_{k-1}\big{\rVert}\leq\lVert I-\alpha p\mathbb{X}_{p}\rVert^{k-1}\Big{\lVert}\mathbb{E}\big{[}(\widetilde{\bm{\beta}}_{0}-\widetilde{\bm{\beta}})\widetilde{\bm{\beta}}^{\top}\big{]}\Big{\rVert}.

Together with (35) this finishes the proof. ∎

Combining Lemma 5 with (31) and (32) leads to AkS(Ak1)=ρk1\big{\lVert}A_{k}-S(A_{k-1})\big{\rVert}=\big{\lVert}\rho_{k-1}\big{\rVert} with remainder ρk1\rho_{k-1} as above. This completes the proof of Lemma 2. ∎

B.4 Proof of Theorem 2

Let S:d×dd×dS:\mathbb{R}^{d\times d}\to\mathbb{R}^{d\times d} be the affine operator introduced in Lemma 2 and recall the definitions S0:=S(0)S_{0}:=S(0) and Slin(A):=S(A)S0S_{\mathrm{lin}}(A):=S(A)-S_{0}. First, the operator norm of SlinS_{\mathrm{lin}} will be analyzed.

Lemma 6.

The linear operator SlinS_{\mathrm{lin}} satisfies SlinopIαp𝕏p<1\left\lVert S_{\mathrm{lin}}\right\rVert_{\mathrm{op}}\leq\big{\lVert}I-\alpha p\mathbb{X}_{p}\big{\rVert}<1, provided that

α<min{1p𝕏,λmin(𝕏p)3𝕏2},\displaystyle\alpha<\min\left\{\dfrac{1}{p\lVert\mathbb{X}\rVert},\dfrac{\lambda_{\mathrm{min}}(\mathbb{X}_{p})}{3\lVert\mathbb{X}\rVert^{2}}\right\},

where λmin(𝕏p)\lambda_{\mathrm{min}}(\mathbb{X}_{p}) denotes the smallest eigenvalue of 𝕏p\mathbb{X}_{p}.

Proof  Let AA be a d×dd\times d matrix. Applying the triangle inequality, Lemma 13, and sub-multiplicativity of the spectral norm,

Slin(A)\displaystyle\big{\lVert}S_{\mathrm{lin}}(A)\big{\rVert} Iαp𝕏p2A+(α2p(1p)+3α2p2(1p)+α2p2(1p)2)𝕏2A\displaystyle\leq\big{\lVert}I-\alpha p\mathbb{X}_{p}\big{\rVert}^{2}\left\lVert A\right\rVert+\Big{(}\alpha^{2}p(1-p)+3\alpha^{2}p^{2}(1-p)+\alpha^{2}p^{2}(1-p)^{2}\Big{)}\left\lVert\mathbb{X}\right\rVert^{2}\left\lVert A\right\rVert
(Iαp𝕏p2+2α2p𝕏2)A,\displaystyle\leq\Big{(}\big{\lVert}I-\alpha p\mathbb{X}_{p}\big{\rVert}^{2}+2\alpha^{2}p\left\lVert\mathbb{X}\right\rVert^{2}\Big{)}\left\lVert A\right\rVert,

where the second inequality follows from p(1p)1/4p(1-p)\leq 1/4.

As shown in (27), Iαp𝕏p=1αpλmin(𝕏p)\big{\lVert}I-\alpha p\mathbb{X}_{p}\big{\rVert}=1-\alpha p\lambda_{\mathrm{min}}(\mathbb{X}_{p}). Lemma 13 now implies (1αpλmin(𝕏p))2=12αpλmin(𝕏p)+α2p2λmin(𝕏p)212αpλmin(𝕏p)+α2p𝕏2\big{(}1-\alpha p\lambda_{\mathrm{min}}(\mathbb{X}_{p})\big{)}^{2}=1-2\alpha p\lambda_{\mathrm{min}}(\mathbb{X}_{p})+\alpha^{2}p^{2}\lambda_{\mathrm{min}}(\mathbb{X}_{p})^{2}\leq 1-2\alpha p\lambda_{\mathrm{min}}(\mathbb{X}_{p})+\alpha^{2}p\lVert\mathbb{X}\rVert^{2}, so that

Slin(A)(12αpλmin(𝕏p)+3α2p𝕏2)A.\displaystyle\big{\lVert}S_{\mathrm{lin}}(A)\big{\rVert}\leq\big{(}1-2\alpha p\lambda_{\mathrm{min}}(\mathbb{X}_{p})+3\alpha^{2}p\lVert\mathbb{X}\rVert^{2}\big{)}\left\lVert A\right\rVert.

If α<λmin(𝕏p)/(3𝕏2)\alpha<\lambda_{\mathrm{min}}(\mathbb{X}_{p})/\big{(}3\lVert\mathbb{X}\rVert^{2}\big{)}, then also 3α2p𝕏2αpλmin(𝕏p)3\alpha^{2}p\lVert\mathbb{X}\rVert^{2}\leq\alpha p\lambda_{\mathrm{min}}(\mathbb{X}_{p}), so that in turn SlinopIαp𝕏p\lVert S_{\mathrm{lin}}\rVert_{\mathrm{op}}\leq\big{\lVert}I-\alpha p\mathbb{X}_{p}\big{\rVert}. The constraint α<1/(p𝕏)\alpha<1/\big{(}p\lVert\mathbb{X}\rVert\big{)} now enforces αp𝕏<1\alpha p\lVert\mathbb{X}\rVert<1, which implies Iαp𝕏p<1\big{\lVert}I-\alpha p\mathbb{X}_{p}\big{\rVert}<1. ∎

As before, set Ak:=𝔼[(𝜷~k𝜷~)(𝜷~k𝜷~)]A_{k}:=\mathbb{E}\big{[}(\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}})(\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}})^{\top}\big{]} for each k0k\geq 0 and let ρk:=Ak+1S(Ak)\rho_{k}:=A_{k+1}-S(A_{k}). Using induction on kk, we now prove

Ak\displaystyle A_{k} =Slink(A0)+=0k1Slin(S0+ρk1).\displaystyle=S_{\mathrm{lin}}^{k}(A_{0})+\sum_{\ell=0}^{k-1}S_{\mathrm{lin}}^{\ell}\big{(}S_{0}+\rho_{k-1-\ell}\big{)}. (36)

Taking k=1k=1, A1=S(A0)+ρ0=Slin(A0)+S0+ρ0A_{1}=S(A_{0})+\rho_{0}=S_{\mathrm{lin}}(A_{0})+S_{0}+\rho_{0}, so the claimed identity holds. Assuming the identity is true for k1k-1, the recursion Ak=S(Ak1)+ρk1A_{k}=S(A_{k-1})+\rho_{k-1} leads to

Ak\displaystyle A_{k} =S(Slink1(A0)+=0k2Slin(S0+ρk2))+ρk1\displaystyle=S\left(S_{\mathrm{lin}}^{k-1}(A_{0})+\sum_{\ell=0}^{k-2}S_{\mathrm{lin}}^{\ell}\Big{(}S_{0}+\rho_{k-2-\ell}\Big{)}\right)+\rho_{k-1}
=Slink(A0)+Slin(=0k2Slin(S0+ρk2))+S0+ρk1\displaystyle=S_{\mathrm{lin}}^{k}(A_{0})+S_{\mathrm{lin}}\left(\sum_{\ell=0}^{k-2}S_{\mathrm{lin}}^{\ell}\Big{(}S_{0}+\rho_{k-2-\ell}\Big{)}\right)+S_{0}+\rho_{k-1}
=Slink(A0)+=0k1Slin(S0+ρk1),\displaystyle=S_{\mathrm{lin}}^{k}(A_{0})+\sum_{\ell=0}^{k-1}S_{\mathrm{lin}}^{\ell}\Big{(}S_{0}+\rho_{k-1-\ell}\Big{)},

thereby establishing the induction step and proving (36).

Assuming SlinopIαp𝕏p<1\left\lVert S_{\mathrm{lin}}\right\rVert_{\mathrm{op}}\leq\lVert I-\alpha p\mathbb{X}_{p}\rVert<1, we move on to show the bound

Ak(idSlin)1S0Iαp𝕏pkA0(idSlin)1S0+C0kIαp𝕏pk1\displaystyle\Big{\lVert}A_{k}-\big{(}\mathrm{id}-S_{\mathrm{lin}}\big{)}^{-1}S_{0}\Big{\rVert}\leq\big{\lVert}I-\alpha p\mathbb{X}_{p}\big{\rVert}^{k}\Big{\lVert}A_{0}-\big{(}\mathrm{id}-S_{\mathrm{lin}}\big{)}^{-1}S_{0}\Big{\rVert}+C_{0}k\left\lVert I-\alpha p\mathbb{X}_{p}\right\rVert^{k-1}

with id\mathrm{id} the identity operator on d×d\mathbb{R}^{d\times d}, and C0:=6𝔼[(𝜷~0𝜷~)𝜷~]C_{0}:=6\big{\lVert}\mathbb{E}\big{[}(\widetilde{\bm{\beta}}_{0}-\widetilde{\bm{\beta}})\widetilde{\bm{\beta}}^{\top}\big{]}\big{\rVert}. By linearity, =0k1Slin(S0+ρk1)==0k1Slin(S0)+m=0k1Slinm(ρk1m)\sum_{\ell=0}^{k-1}S_{\mathrm{lin}}^{\ell}\big{(}S_{0}+\rho_{k-1-\ell}\big{)}=\sum_{\ell=0}^{k-1}S_{\mathrm{lin}}^{\ell}(S_{0})+\sum_{m=0}^{k-1}S_{\mathrm{lin}}^{m}(\rho_{k-1-m}). Since Slinop<1\left\lVert S_{\mathrm{lin}}\right\rVert_{\mathrm{op}}<1, Lemma 12 asserts that (idSlin)1==0Slin\big{(}\mathrm{id}-S_{\mathrm{lin}}\big{)}^{-1}=\sum_{\ell=0}^{\infty}S_{\mathrm{lin}}^{\ell} and

=0k1Slin(S0)\displaystyle\sum_{\ell=0}^{k-1}S_{\mathrm{lin}}^{\ell}\big{(}S_{0}\big{)} =(idSlin)1S0+(=0k1Slin(idSlin)1)S0\displaystyle=\big{(}\mathrm{id}-S_{\mathrm{lin}}\big{)}^{-1}S_{0}+\left(\sum_{\ell=0}^{k-1}S_{\mathrm{lin}}^{\ell}-\big{(}\mathrm{id}-S_{\mathrm{lin}}\big{)}^{-1}\right)S_{0}
=(idSlin)1S0=kSlin(S0)\displaystyle=\big{(}\mathrm{id}-S_{\mathrm{lin}}\big{)}^{-1}S_{0}-\sum_{\ell=k}^{\infty}S_{\mathrm{lin}}^{\ell}\big{(}S_{0}\big{)}
=(idSlin)1S0Slink((idSlin)1S0).\displaystyle=\big{(}\mathrm{id}-S_{\mathrm{lin}}\big{)}^{-1}S_{0}-S_{\mathrm{lin}}^{k}\Big{(}\big{(}\mathrm{id}-S_{\mathrm{lin}}\big{)}^{-1}S_{0}\Big{)}. (37)
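
The tail identity in (37) is a general property of partial sums of the Neumann series; a small sketch with a randomly generated operator rescaled to have operator norm below one (again an illustrative stand-in for S_{\mathrm{lin}}) illustrates it:

import numpy as np

rng = np.random.default_rng(1)
d, k = 3, 5

G = rng.standard_normal((d * d, d * d))
G *= 0.9 / np.linalg.norm(G, 2)          # rescale so the operator norm is 0.9 < 1
S0 = rng.standard_normal(d * d)          # a fixed matrix, stored as a vector

inv = np.linalg.inv(np.eye(d * d) - G)   # (id - G)^{-1}, the limit of the Neumann series
partial = sum(np.linalg.matrix_power(G, l) @ S0 for l in range(k))
tail = inv @ S0 - np.linalg.matrix_power(G, k) @ (inv @ S0)
print(np.max(np.abs(partial - tail)))    # numerically zero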

Lemma 2 ensures \lVert\rho_{k-1-m}\rVert\leq C_{0}\big{\lVert}I-\alpha p\mathbb{X}_{p}\big{\rVert}^{k-1-m} for all m\leq k-1. Moreover, \lVert S_{\mathrm{lin}}\rVert_{\mathrm{op}}\leq\big{\lVert}I-\alpha p\mathbb{X}_{p}\big{\rVert} and hence \big{\lVert}S_{\mathrm{lin}}^{m}(\rho_{k-1-m})\big{\rVert}\leq C_{0}\big{\lVert}I-\alpha p\mathbb{X}_{p}\big{\rVert}^{k-1} for every m=0,1,\ldots,k-1, so the triangle inequality implies

m=0k1Slinm(ρk1m)C0kIαp𝕏pk1.\displaystyle\left\lVert\sum_{m=0}^{k-1}S_{\mathrm{lin}}^{m}(\rho_{k-1-m})\right\rVert\leq C_{0}k\big{\lVert}I-\alpha p\mathbb{X}_{p}\big{\rVert}^{k-1}. (38)

Combining (36) and (37), as well as applying the triangle inequality and the bound (38), leads to the first bound asserted in Theorem 2,

Ak(idSlin)1S0\displaystyle\Big{\lVert}A_{k}-\big{(}\mathrm{id}-S_{\mathrm{lin}}\big{)}^{-1}S_{0}\Big{\rVert} Slink(A0(idSlin)1S0)+m=0k1Slinm(ρk1m)\displaystyle\leq\bigg{\lVert}S_{\mathrm{lin}}^{k}\Big{(}A_{0}-\big{(}\mathrm{id}-S_{\mathrm{lin}}\big{)}^{-1}S_{0}\Big{)}\bigg{\rVert}+\left\lVert\sum_{m=0}^{k-1}S_{\mathrm{lin}}^{m}(\rho_{k-1-m})\right\rVert
Iαp𝕏pkA0(idSlin)1S0+C0kIαp𝕏pk1.\displaystyle\leq\big{\lVert}I-\alpha p\mathbb{X}_{p}\big{\rVert}^{k}\Big{\lVert}A_{0}-\big{(}\mathrm{id}-S_{\mathrm{lin}}\big{)}^{-1}S_{0}\Big{\rVert}+C_{0}k\big{\lVert}I-\alpha p\mathbb{X}_{p}\big{\rVert}^{k-1}. (39)

To show the corresponding bound for the variance, observe that Cov(𝜷~k𝜷~)=Ak𝔼[𝜷~k𝜷~]𝔼[𝜷~k𝜷~]\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\big{)}=A_{k}-\mathbb{E}\big{[}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\big{]}\mathbb{E}\big{[}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\big{]}^{\top}. Lemma 1 and (69) imply

Cov(𝜷~k𝜷~)Ak\displaystyle\Big{\lVert}\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\big{)}-A_{k}\Big{\rVert} =𝔼[𝜷~k𝜷~]𝔼[𝜷~k𝜷~]\displaystyle=\Big{\lVert}\mathbb{E}\big{[}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\big{]}\mathbb{E}\big{[}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\big{]}^{\top}\Big{\rVert}
𝔼[𝜷~k𝜷~]22\displaystyle\leq\Big{\lVert}\mathbb{E}\big{[}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\big{]}\Big{\rVert}_{2}^{2}
Iαp𝕏pk1𝔼[𝜷~0𝜷~]22.\displaystyle\leq\big{\lVert}I-\alpha p\mathbb{X}_{p}\big{\rVert}^{k-1}\Big{\lVert}\mathbb{E}\big{[}\widetilde{\bm{\beta}}_{0}-\widetilde{\bm{\beta}}\big{]}\Big{\rVert}_{2}^{2}.

Together with (39) and the triangle inequality, this proves the second bound asserted in Theorem 2. ∎

B.5 Proof of Lemma 3

Applying Theorem 1, Lemma 1, and the triangle inequality,

Cov(𝜷~k)Cov(𝜷~)Cov(𝜷~k𝜷~)+4𝕏p1Iαp𝕏pksup𝜷:𝜷21𝔼𝜷[𝜷~0𝜷~]2.\displaystyle\Big{\lVert}\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{k}\big{)}-\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}\big{)}\Big{\rVert}\leq\Big{\lVert}\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\big{)}\Big{\rVert}+4\big{\lVert}\mathbb{X}_{p}^{-1}\big{\rVert}\big{\lVert}I-\alpha p\mathbb{X}_{p}\big{\rVert}^{k}\sup_{\bm{\beta}_{\star}:\lVert\bm{\beta}_{\star}\rVert_{2}\leq 1}\Big{\lVert}\mathbb{E}_{\bm{\beta}_{\star}}\big{[}\widetilde{\bm{\beta}}_{0}-\widetilde{\bm{\beta}}\big{]}\Big{\rVert}_{2}. (40)

Lemma 13 implies \mathbb{X}_{p}\geq(1-p)\mathrm{Diag}(\mathbb{X}), so that \big{\lVert}\mathbb{X}_{p}^{-1}\big{\rVert}=\lambda_{\mathrm{min}}(\mathbb{X}_{p})^{-1}\leq\big{(}(1-p)\min_{i}\mathbb{X}_{ii}\big{)}^{-1}. Next, \mathbb{E}_{\bm{\beta}_{\star}}\big{[}\widetilde{\bm{\beta}}\big{]}=\mathbb{X}_{p}^{-1}\mathbb{X}\bm{\beta}_{\star} entails \sup_{\bm{\beta}_{\star}:\lVert\bm{\beta}_{\star}\rVert_{2}\leq 1}\big{\lVert}\mathbb{E}_{\bm{\beta}_{\star}}[\widetilde{\bm{\beta}}]\big{\rVert}_{2}=\big{\lVert}\mathbb{X}_{p}^{-1}\mathbb{X}\big{\rVert}\leq\big{(}(1-p)\min_{i}\mathbb{X}_{ii}\big{)}^{-1}\lVert\mathbb{X}\rVert. The second term on the right-hand side of (40) is then bounded by C_{1}\big{\lVert}I-\alpha p\mathbb{X}_{p}\big{\rVert}^{k}/(1-p)^{2}, for some constant C_{1} independent of (\alpha,p,k).
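
The eigenvalue bound \lambda_{\mathrm{min}}(\mathbb{X}_{p})\geq(1-p)\min_{i}\mathbb{X}_{ii} is easy to confirm numerically; a minimal sketch with a random Gram matrix (the normalization of \mathbb{X} is irrelevant for the inequality):

import numpy as np

rng = np.random.default_rng(2)
n, d, p = 20, 4, 0.7

X = rng.standard_normal((n, d))
XX = X.T @ X                                     # plays the role of the Gram matrix X^T X
XXp = p * XX + (1 - p) * np.diag(np.diag(XX))    # the matrix denoted by the subscript p

print(np.linalg.eigvalsh(XXp).min() >= (1 - p) * np.diag(XX).min())   # True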

To prove the first claim of the lemma, it now suffices to show

Cov(𝜷~k𝜷~)1(1p)2(kIαp𝕏pk1C2+αpC3),\displaystyle\Big{\lVert}\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\big{)}\Big{\rVert}\leq\dfrac{1}{(1-p)^{2}}\Big{(}k\big{\lVert}I-\alpha p\mathbb{X}_{p}\big{\rVert}^{k-1}C_{2}+\alpha pC_{3}\Big{)}, (41)

where C2C_{2} and C3C_{3} are constants independent of (α,p,k)(\alpha,p,k). As 𝕏p1((1p)mini𝕏ii)1\big{\lVert}\mathbb{X}_{p}^{-1}\big{\rVert}\leq\big{(}(1-p)\min_{i}\mathbb{X}_{ii}\big{)}^{-1}, the constant CC in Theorem 2 satisfies CC4/(1p)2+(idSlin)1S0C\leq C_{4}/(1-p)^{2}+\big{\lVert}(\mathrm{id}-S_{\mathrm{lin}})^{-1}S_{0}\big{\rVert}, with C4C_{4} depending only on the distribution of (𝐘,𝜷~0,X)\big{(}\mathbf{Y},\widetilde{\bm{\beta}}_{0},X\big{)}. Consequently, Theorem 2 and the triangle inequality imply

Cov(𝜷~k𝜷~)Cov(𝜷~k𝜷~)(idSlin)1S0+(idSlin)1S0kIαp𝕏pk1(C4(1p)2+(idSlin)1S0)+(idSlin)1S0.\begin{split}\Big{\lVert}\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\big{)}\Big{\rVert}&\leq\Big{\lVert}\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}})-\big{(}\mathrm{id}-S_{\mathrm{lin}}\big{)}^{-1}S_{0}\Big{\rVert}+\Big{\lVert}\big{(}\mathrm{id}-S_{\mathrm{lin}}\big{)}^{-1}S_{0}\Big{\rVert}\\ &\leq k\big{\lVert}I-\alpha p\mathbb{X}_{p}\big{\rVert}^{k-1}\left(\dfrac{C_{4}}{(1-p)^{2}}+\Big{\lVert}\big{(}\mathrm{id}-S_{\mathrm{lin}}\big{)}^{-1}S_{0}\Big{\rVert}\right)+\Big{\lVert}\big{(}\mathrm{id}-S_{\mathrm{lin}}\big{)}^{-1}S_{0}\Big{\rVert}.\end{split} (42)

Consider a bounded linear operator GG on d×d\mathbb{R}^{d\times d} satisfying Gop<1\lVert G\rVert_{\mathrm{op}}<1. For an arbitrary d×dd\times d matrix AA, Lemma 12 asserts (idG)1A==0G(A)(\mathrm{id}-G)^{-1}A=\sum_{\ell=0}^{\infty}G^{\ell}(A) and therefore (idG)1A=0GopA=(1Gop)1A\big{\lVert}(\mathrm{id}-G)^{-1}A\big{\rVert}\leq\sum_{\ell=0}^{\infty}\lVert G\rVert_{\mathrm{op}}^{\ell}\cdot\lVert A\rVert=\big{(}1-\lVert G\rVert_{\mathrm{op}}\big{)}^{-1}\lVert A\rVert. Theorem 1 states that SlinopIαp𝕏p\left\lVert S_{\mathrm{lin}}\right\rVert_{\mathrm{op}}\leq\big{\lVert}I-\alpha p\mathbb{X}_{p}\big{\rVert}. As shown following (27), Iαp𝕏p1αp(1p)mini𝕏ii\big{\lVert}I-\alpha p\mathbb{X}_{p}\big{\rVert}\leq 1-\alpha p(1-p)\min_{i}\mathbb{X}_{ii}. Therefore,

(idSlin)1S0(1Slinop)1S0(αp(1p)mini𝕏ii)1S0.\displaystyle\Big{\lVert}\big{(}\mathrm{id}-S_{\mathrm{lin}}\big{)}^{-1}S_{0}\Big{\rVert}\leq\big{(}1-\lVert S_{\mathrm{lin}}\rVert_{\mathrm{op}}\big{)}^{-1}\lVert S_{0}\rVert\leq\big{(}\alpha p(1-p)\min_{i}\mathbb{X}_{ii}\big{)}^{-1}\lVert S_{0}\rVert.

Taking A=0A=0 in Lemma 2, S0=α2p2(1p)(𝕏¯Diag(𝔼[𝜷~𝜷~])𝕏¯)p+α2p2(1p)2𝕏𝔼[𝜷~𝜷~]¯𝕏S_{0}=\alpha^{2}p^{2}(1-p)\big{(}\overline{\mathbb{X}}\mathrm{Diag}\big{(}\mathbb{E}\big{[}\widetilde{\bm{\beta}}\widetilde{\bm{\beta}}^{\top}\big{]}\big{)}\overline{\mathbb{X}}\big{)}_{p}+\alpha^{2}p^{2}(1-p)^{2}\mathbb{X}\odot\overline{\mathbb{E}[\widetilde{\bm{\beta}}\widetilde{\bm{\beta}}^{\top}]}\odot\mathbb{X}. Using Lemma 13 and 𝕏p1((1p)mini𝕏ii)1\big{\lVert}\mathbb{X}_{p}^{-1}\big{\rVert}\leq\big{(}(1-p)\min_{i}\mathbb{X}_{ii}\big{)}^{-1},

S0\displaystyle\lVert S_{0}\rVert α2p2(1p)(𝕏¯Diag(𝔼[𝜷~𝜷~])𝕏¯+𝕏𝔼[𝜷~𝜷~]¯𝕏)\displaystyle\leq\alpha^{2}p^{2}(1-p)\Big{(}\Big{\lVert}\overline{\mathbb{X}}\mathrm{Diag}\big{(}\mathbb{E}\big{[}\widetilde{\bm{\beta}}\widetilde{\bm{\beta}}^{\top}\big{]}\big{)}\overline{\mathbb{X}}\Big{\rVert}+\Big{\lVert}\mathbb{X}\odot\overline{\mathbb{E}\big{[}\widetilde{\bm{\beta}}\widetilde{\bm{\beta}}^{\top}\big{]}}\odot\mathbb{X}\Big{\rVert}\Big{)}
(αp𝕏)2𝔼[X𝐘𝐘X](1p)(mini𝕏ii)2\displaystyle\leq\dfrac{(\alpha p\lVert\mathbb{X}\rVert)^{2}\big{\lVert}\mathbb{E}[X^{\top}\mathbf{Y}\mathbf{Y}^{\top}X]\big{\rVert}}{(1-p)(\min_{i}\mathbb{X}_{ii})^{2}}

proving that

(idSlin)1S0αp(1p)2(mini𝕏ii)3𝕏2𝔼[X𝐘𝐘X].\displaystyle\big{\lVert}(\mathrm{id}-S_{\mathrm{lin}})^{-1}S_{0}\big{\rVert}\leq\alpha p(1-p)^{-2}(\min_{i}\mathbb{X}_{ii})^{-3}\lVert\mathbb{X}\rVert^{2}\cdot\big{\lVert}\mathbb{E}[X^{\top}\mathbf{Y}\mathbf{Y}^{\top}X]\big{\rVert}. (43)

Note that αp𝕏2𝕏\alpha p\lVert\mathbb{X}\rVert^{2}\leq\lVert\mathbb{X}\rVert by Assumption 1. Applying these bounds in (42) leads to

Cov(𝜷~k𝜷~)\displaystyle\left\lVert\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\big{)}\right\rVert kIαp𝕏pk1(1p)2(C4+𝕏𝔼[X𝐘𝐘X](mini𝕏ii)3)\displaystyle\leq\dfrac{k\lVert I-\alpha p\mathbb{X}_{p}\rVert^{k-1}}{(1-p)^{2}}\left(C_{4}+\dfrac{\big{\lVert}\mathbb{X}\big{\rVert}\big{\lVert}\mathbb{E}[X^{\top}\mathbf{Y}\mathbf{Y}^{\top}X]\big{\rVert}}{(\min_{i}\mathbb{X}_{ii})^{3}}\right)
+αp𝕏2𝔼[X𝐘𝐘X](1p)2(mini𝕏ii)3,\displaystyle\qquad+\dfrac{\alpha p\big{\lVert}\mathbb{X}\big{\rVert}^{2}\big{\lVert}\mathbb{E}[X^{\top}\mathbf{Y}\mathbf{Y}^{\top}X]\big{\rVert}}{(1-p)^{2}(\min_{i}\mathbb{X}_{ii})^{3}},

which proves (41). Combined with (40), this proves the first claim of the lemma since

Cov(𝜷~k)Cov(𝜷~)kIαp𝕏pk1(C1+C2)+αpC3(1p)2.\displaystyle\Big{\lVert}\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{k}\big{)}-\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}\big{)}\Big{\rVert}\leq\dfrac{k\lVert I-\alpha p\mathbb{X}_{p}\rVert^{k-1}(C_{1}+C_{2})+\alpha pC_{3}}{(1-p)^{2}}. (44)

To start proving the second claim, recall that Cov(𝜷~)=𝕏p1𝕏𝕏p1\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}\big{)}=\mathbb{X}_{p}^{-1}\mathbb{X}\mathbb{X}_{p}^{-1}. Hence, the triangle inequality leads to

Cov(𝜷~k)Diag(𝕏)1𝕏Diag(𝕏)1\displaystyle\Big{\lVert}\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{k}\big{)}-\mathrm{Diag}(\mathbb{X})^{-1}\mathbb{X}\mathrm{Diag}(\mathbb{X})^{-1}\Big{\rVert}
Cov(𝜷~k)Cov(𝜷~)+Diag(𝕏)1𝕏Diag(𝕏)1𝕏p1𝕏𝕏p1.\displaystyle\leq\Big{\lVert}\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{k}\big{)}-\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}\big{)}\Big{\rVert}+\Big{\lVert}\mathrm{Diag}(\mathbb{X})^{-1}\mathbb{X}\mathrm{Diag}(\mathbb{X})^{-1}-\mathbb{X}_{p}^{-1}\mathbb{X}\mathbb{X}_{p}^{-1}\Big{\rVert}.

Let AA, BB, and CC be square matrices of the same dimension, with AA and BB invertible. Observe the identity A1CA1B1CB1=A1(BA)B1CA1+B1CA1(BA)B1A^{-1}CA^{-1}-B^{-1}CB^{-1}=A^{-1}(B-A)B^{-1}CA^{-1}+B^{-1}CA^{-1}(B-A)B^{-1}, so sub-multiplicativity implies A1CA1B1CB12max{A1,B1}A1B1ABC\big{\lVert}A^{-1}CA^{-1}-B^{-1}CB^{-1}\big{\rVert}\leq 2\max\big{\{}\lVert A^{-1}\rVert,\lVert B^{-1}\rVert\big{\}}\lVert A^{-1}\rVert\cdot\lVert B^{-1}\rVert\cdot\lVert A-B\rVert\cdot\lVert C\rVert. Using 𝕏p1((1p)mini𝕏ii)1\big{\lVert}\mathbb{X}_{p}^{-1}\big{\rVert}\leq\big{(}(1-p)\min_{i}\mathbb{X}_{ii}\big{)}^{-1}, and inserting A=Diag(𝕏)A=\mathrm{Diag}(\mathbb{X}), B=𝕏p=A+p𝕏¯B=\mathbb{X}_{p}=A+p\overline{\mathbb{X}}, and C=𝕏C=\mathbb{X} results in

Diag(𝕏)1𝕏Diag(𝕏)1𝕏p1𝕏𝕏p1pC5(1p)2\displaystyle\Big{\lVert}\mathrm{Diag}(\mathbb{X})^{-1}\mathbb{X}\mathrm{Diag}(\mathbb{X})^{-1}-\mathbb{X}_{p}^{-1}\mathbb{X}\mathbb{X}_{p}^{-1}\Big{\rVert}\leq\dfrac{pC_{5}}{(1-p)^{2}}

with C5C_{5} independent of (α,p,k)(\alpha,p,k). Combined with (44), this results in

Cov(𝜷~k)Diag(𝕏)1𝕏Diag(𝕏)1kIαp𝕏pk1(C1+C2)+αpC3+pC5(1p)2,\displaystyle\Big{\lVert}\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{k}\big{)}-\mathrm{Diag}(\mathbb{X})^{-1}\mathbb{X}\mathrm{Diag}(\mathbb{X})^{-1}\Big{\rVert}\leq\dfrac{k\lVert I-\alpha p\mathbb{X}_{p}\rVert^{k-1}(C_{1}+C_{2})+\alpha pC_{3}+pC_{5}}{(1-p)^{2}},

which proves the second claim of the lemma by enlarging C′′C^{\prime\prime}, if necessary. ∎
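
The resolvent-type identity A^{-1}CA^{-1}-B^{-1}CB^{-1}=A^{-1}(B-A)B^{-1}CA^{-1}+B^{-1}CA^{-1}(B-A)B^{-1} used above holds for arbitrary invertible A and B; a minimal numerical sketch:

import numpy as np

rng = np.random.default_rng(3)
d = 4

A = rng.standard_normal((d, d))
A = A @ A.T + np.eye(d)          # symmetric positive definite, hence invertible
B = rng.standard_normal((d, d))
B = B @ B.T + np.eye(d)
C = rng.standard_normal((d, d))

Ai, Bi = np.linalg.inv(A), np.linalg.inv(B)
lhs = Ai @ C @ Ai - Bi @ C @ Bi
rhs = Ai @ (B - A) @ Bi @ C @ Ai + Bi @ C @ Ai @ (B - A) @ Bi
print(np.max(np.abs(lhs - rhs)))                 # numerically zero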

B.6 Proof of Theorem 3

Start by noting that λmin(A)=inf𝐯:𝐯=1𝐯A𝐯\lambda_{\mathrm{min}}(A)=\inf_{\mathbf{v}:\lVert\mathbf{v}\rVert=1}\mathbf{v}^{\top}A\mathbf{v} for symmetric matrices, see [24], Theorem 4.2.6. Using super-additivity of infima, observe the lower bound

lim infkinf𝐯:𝐯=1𝐯(Cov(𝜷~k)Cov(𝜷~))𝐯\displaystyle\liminf_{k\to\infty}\inf_{\mathbf{v}:\lVert\mathbf{v}\rVert=1}\mathbf{v}^{\top}\Big{(}\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{k}\big{)}-\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}\big{)}\Big{)}\mathbf{v}
\displaystyle\geq lim infk(λmin(Cov(𝜷~k𝜷~))sup𝐯:𝐯=1|𝐯(Cov(𝜷~k)Cov(𝜷~)Cov(𝜷~k𝜷~))𝐯|)\displaystyle\liminf_{k\to\infty}\bigg{(}\lambda_{\mathrm{min}}\Big{(}\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\big{)}\Big{)}-\sup_{\mathbf{v}:\lVert\mathbf{v}\rVert=1}\Big{\lvert}\mathbf{v}^{\top}\Big{(}\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{k}\big{)}-\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}\big{)}-\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\big{)}\Big{)}\mathbf{v}\Big{\rvert}\bigg{)}
\displaystyle\geq lim infkλmin(Cov(𝜷~k𝜷~))lim supkCov(𝜷~k)Cov(𝜷~)Cov(𝜷~k𝜷~).\displaystyle\liminf_{k\to\infty}\lambda_{\mathrm{min}}\Big{(}\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\big{)}\Big{)}-\limsup_{k\to\infty}\Big{\lVert}\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{k}\big{)}-\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}\big{)}-\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\big{)}\Big{\rVert}. (45)

Combining Lemma 1 and Theorem 1, the limit superior in (45) vanishes. Further, Cov(𝜷~k𝜷~)\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}_{k}-\widetilde{\bm{\beta}}\big{)} converges to (idSlin)1S0\big{(}\mathrm{id}-S_{\mathrm{lin}}\big{)}^{-1}S_{0} by Theorem 2, so it suffices to analyze the latter matrix.

For the next step, the matrix S0:=S(0)S_{0}:=S(0) in Theorem 2 will be lower-bounded. Taking A=0A=0 in Lemma 2 and exchanging the expectation with the Diag\mathrm{Diag} operator results in

S0\displaystyle S_{0} =α2p2(1p)(𝕏¯Diag(𝔼[𝜷~𝜷~])𝕏¯)p+α2p2(1p)2𝕏𝔼[𝜷~𝜷~]¯𝕏\displaystyle=\alpha^{2}p^{2}(1-p)\bigg{(}\overline{\mathbb{X}}\mathrm{Diag}\Big{(}\mathbb{E}\big{[}\widetilde{\bm{\beta}}\widetilde{\bm{\beta}}^{\top}\big{]}\Big{)}\overline{\mathbb{X}}\bigg{)}_{p}+\alpha^{2}p^{2}(1-p)^{2}\mathbb{X}\odot\overline{\mathbb{E}\big{[}\widetilde{\bm{\beta}}\widetilde{\bm{\beta}}^{\top}\big{]}}\odot\mathbb{X}
=α2p2(1p)𝔼[p𝕏¯Diag(𝜷~𝜷~)𝕏¯+(1p)(Diag(𝕏¯Diag(𝜷~𝜷~)𝕏¯)+𝕏𝜷~𝜷~¯𝕏)].\displaystyle=\alpha^{2}p^{2}(1-p)\mathbb{E}\Bigg{[}p\overline{\mathbb{X}}\mathrm{Diag}\big{(}\widetilde{\bm{\beta}}\widetilde{\bm{\beta}}^{\top}\big{)}\overline{\mathbb{X}}+(1-p)\bigg{(}\mathrm{Diag}\Big{(}\overline{\mathbb{X}}\mathrm{Diag}\big{(}\widetilde{\bm{\beta}}\widetilde{\bm{\beta}}^{\top}\big{)}\overline{\mathbb{X}}\Big{)}+\mathbb{X}\odot\overline{\widetilde{\bm{\beta}}\widetilde{\bm{\beta}}^{\top}}\odot\mathbb{X}\bigg{)}\Bigg{]}. (46)

The first matrix in (46) is always positive semi-definite and we will now lower bound the matrix B:=Diag(𝕏¯Diag(𝜷~𝜷~)𝕏¯)+𝕏𝜷~𝜷~¯𝕏B:=\mathrm{Diag}\big{(}\overline{\mathbb{X}}\mathrm{Diag}(\widetilde{\bm{\beta}}\widetilde{\bm{\beta}}^{\top})\overline{\mathbb{X}}\big{)}+\mathbb{X}\odot\overline{\widetilde{\bm{\beta}}\widetilde{\bm{\beta}}^{\top}}\odot\mathbb{X}. Given distinct i,j=1,,di,j=1,\ldots,d, symmetry of 𝕏\mathbb{X} implies

(𝕏¯Diag(𝜷~𝜷~)𝕏¯)ii\displaystyle\Big{(}\overline{\mathbb{X}}\mathrm{Diag}\big{(}\widetilde{\bm{\beta}}\widetilde{\bm{\beta}}^{\top}\big{)}\overline{\mathbb{X}}\Big{)}_{ii} =k=1d𝕏¯ikDiag(𝜷~𝜷~)kk𝕏¯ki=k=1d𝟙{ki}𝕏ik2β~k2,\displaystyle=\sum_{k=1}^{d}\overline{\mathbb{X}}_{ik}\mathrm{Diag}\big{(}\widetilde{\bm{\beta}}\widetilde{\bm{\beta}}^{\top}\big{)}_{kk}\overline{\mathbb{X}}_{ki}=\sum_{k=1}^{d}\mathbbm{1}_{\{k\neq i\}}\mathbb{X}_{ik}^{2}\widetilde{\beta}_{k}^{2},
(𝕏𝜷~𝜷~¯𝕏)ij\displaystyle\Big{(}\mathbb{X}\odot\overline{\widetilde{\bm{\beta}}\widetilde{\bm{\beta}}^{\top}}\odot\mathbb{X}\Big{)}_{ij} =𝕏ij2β~iβ~j.\displaystyle=\mathbb{X}_{ij}^{2}\widetilde{\beta}_{i}\widetilde{\beta}_{j}.

In turn, for any unit-length vector 𝐯\mathbf{v},

𝐯𝔼[B]𝐯\displaystyle\mathbf{v}^{\top}\mathbb{E}[B]\mathbf{v} =𝔼[i=1dk=1d𝟙{ki}vi2𝕏ik2β~k2+=1dm=1d(𝟙{m}v𝕏mβ~)(𝟙{m}β~m𝕏mvm)]\displaystyle=\mathbb{E}\left[\sum_{i=1}^{d}\sum_{k=1}^{d}\mathbbm{1}_{\{k\neq i\}}v_{i}^{2}\mathbb{X}_{ik}^{2}\widetilde{\beta}_{k}^{2}+\sum_{\ell=1}^{d}\sum_{m=1}^{d}\Big{(}\mathbbm{1}_{\{\ell\neq m\}}v_{\ell}\mathbb{X}_{\ell m}\widetilde{\beta}_{\ell}\Big{)}\Big{(}\mathbbm{1}_{\{\ell\neq m\}}\widetilde{\beta}_{m}\mathbb{X}_{\ell m}v_{m}\Big{)}\right]
==1dm=1d𝟙{m}𝕏m2𝔼[v2β~m2+vβ~vmβ~m]\displaystyle=\sum_{\ell=1}^{d}\sum_{m=1}^{d}\mathbbm{1}_{\{\ell\neq m\}}\mathbb{X}_{\ell m}^{2}\mathbb{E}\Big{[}v_{\ell}^{2}\widetilde{\beta}_{m}^{2}+v_{\ell}\widetilde{\beta}_{\ell}v_{m}\widetilde{\beta}_{m}\Big{]}
=12=1dm=1d𝟙{m}𝕏m2𝔼[(vβ~m+vmβ~)2],\displaystyle=\dfrac{1}{2}\sum_{\ell=1}^{d}\sum_{m=1}^{d}\mathbbm{1}_{\{\ell\neq m\}}\mathbb{X}_{\ell m}^{2}\mathbb{E}\Big{[}\big{(}v_{\ell}\widetilde{\beta}_{m}+v_{m}\widetilde{\beta}_{\ell}\big{)}^{2}\Big{]}, (47)

where the last equality follows by noting that each square (vβ~m+vmβ~)2(v_{\ell}\widetilde{\beta}_{m}+v_{m}\widetilde{\beta}_{\ell})^{2} appears twice in (47) since the expression is symmetric in (,m)(\ell,m). Every summand in (47) is non-negative. If v0v_{\ell}\neq 0, then there exists m()m(\ell)\neq\ell such that 𝕏m0\mathbb{X}_{\ell m}\neq 0. Write 𝐰()\mathbf{w}(\ell) for the vector with entries

wi()={vif i=m(),vm()if i=,0otherwise.\displaystyle w_{i}(\ell)=\begin{cases}v_{\ell}&\mbox{if }i=m(\ell),\\ v_{m(\ell)}&\mbox{if }i=\ell,\\ 0&\mbox{otherwise.}\end{cases}

By construction, 𝔼[(vβ~m()+vm()β~)2]=𝐰()𝔼[𝜷~𝜷~]𝐰()𝐰()Cov(𝜷~)𝐰()\mathbb{E}\big{[}(v_{\ell}\widetilde{\beta}_{m(\ell)}+v_{m(\ell)}\widetilde{\beta}_{\ell})^{2}\big{]}=\mathbf{w}(\ell)^{\top}\mathbb{E}\big{[}\widetilde{\bm{\beta}}\widetilde{\bm{\beta}}^{\top}\big{]}\mathbf{w}(\ell)\geq\mathbf{w}(\ell)^{\top}\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}\big{)}\mathbf{w}(\ell). Recall that Cov(𝜷~)=𝕏p1𝕏𝕏p1\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}\big{)}=\mathbb{X}_{p}^{-1}\mathbb{X}\mathbb{X}_{p}^{-1} and note that λmin(𝕏p1𝕏𝕏p1)λmin(𝕏)/𝕏p2\lambda_{\mathrm{min}}\big{(}\mathbb{X}_{p}^{-1}\mathbb{X}\mathbb{X}_{p}^{-1}\big{)}\geq\lambda_{\mathrm{min}}(\mathbb{X})/\lVert\mathbb{X}_{p}\rVert^{2}. Together with 𝐰()22v2\big{\lVert}\mathbf{w}(\ell)\big{\rVert}_{2}^{2}\geq v_{\ell}^{2} and =1dv2=𝐯22=1\sum_{\ell=1}^{d}v_{\ell}^{2}=\lVert\mathbf{v}\rVert_{2}^{2}=1, (47) now satisfies

12=1dm=1d𝟙{m}𝕏m2𝔼[(vβ~m+vmβ~)2]\displaystyle\dfrac{1}{2}\sum_{\ell=1}^{d}\sum_{m=1}^{d}\mathbbm{1}_{\{\ell\neq m\}}\mathbb{X}_{\ell m}^{2}\mathbb{E}\Big{[}\big{(}v_{\ell}\widetilde{\beta}_{m}+v_{m}\widetilde{\beta}_{\ell}\big{)}^{2}\Big{]}
\displaystyle\geq\ 12=1d𝟙{v0}𝕏m()2𝐰()Cov(𝜷~)𝐰()\displaystyle\dfrac{1}{2}\sum_{\ell=1}^{d}\mathbbm{1}_{\{v_{\ell}\neq 0\}}\mathbb{X}_{\ell m(\ell)}^{2}\mathbf{w}(\ell)^{\top}\mathrm{Cov}\big{(}\widetilde{\bm{\beta}}\big{)}\mathbf{w}(\ell)
\displaystyle\geq\ λmin(𝕏)2𝕏p2=1d𝟙{v0}𝐰()22minm:𝕏m0𝕏m2\displaystyle\dfrac{\lambda_{\mathrm{min}}(\mathbb{X})}{2\lVert\mathbb{X}_{p}\rVert^{2}}\sum_{\ell=1}^{d}\mathbbm{1}_{\{v_{\ell}\neq 0\}}\big{\lVert}\mathbf{w}(\ell)\big{\rVert}_{2}^{2}\min_{m:\mathbb{X}_{\ell m}\neq 0}\mathbb{X}_{\ell m}^{2}
\displaystyle\geq\ λmin(𝕏)2𝕏p2minij:𝕏ij0𝕏ij2.\displaystyle\dfrac{\lambda_{\mathrm{min}}(\mathbb{X})}{2\lVert\mathbb{X}_{p}\rVert^{2}}\min_{i\neq j:\mathbb{X}_{ij}\neq 0}\mathbb{X}_{ij}^{2}.

As \lambda_{\mathrm{min}}(S_{0})\geq\alpha^{2}p^{2}(1-p)^{2}\lambda_{\mathrm{min}}\big{(}\mathbb{E}[B]\big{)}, this proves the matrix inequality

S0α2p2(1p)2λmin(𝕏)2𝕏p2minij:𝕏ij0𝕏ij2Id.\displaystyle S_{0}\geq\dfrac{\alpha^{2}p^{2}(1-p)^{2}\lambda_{\mathrm{min}}(\mathbb{X})}{2\lVert\mathbb{X}_{p}\rVert^{2}}\min_{i\neq j:\mathbb{X}_{ij}\neq 0}\mathbb{X}_{ij}^{2}\cdot I_{d}. (48)
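
The rewriting of the quadratic form in (47) holds pointwise in \widetilde{\bm{\beta}} (the expectation only enters linearly), so it can be checked deterministically; a minimal sketch with an arbitrary symmetric matrix in the role of \mathbb{X}:

import numpy as np

rng = np.random.default_rng(4)
d = 5

M = rng.standard_normal((d, d))
XX = M + M.T                                  # symmetric, plays the role of X^T X
beta = rng.standard_normal(d)                 # a fixed realization of beta-tilde
v = rng.standard_normal(d)

def diag(A):
    return np.diag(np.diag(A))

XXbar = XX - diag(XX)                         # off-diagonal part (the overline operation)
bb = np.outer(beta, beta)
B = diag(XXbar @ diag(bb) @ XXbar) + XX * (bb - diag(bb)) * XX

lhs = v @ B @ v
rhs = 0.5 * sum(XX[l, m] ** 2 * (v[l] * beta[m] + v[m] * beta[l]) ** 2
                for l in range(d) for m in range(d) if l != m)
print(abs(lhs - rhs))                         # numerically zero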

Next, let 𝝃\bm{\xi} be a centered random vector with covariance matrix MM and suppose DD is a d×dd\times d dropout matrix, independent of 𝝃\bm{\xi}. Conditioning on 𝝃\bm{\xi}, the law of total variance states

Cov((IαD𝕏p)𝝃+αD𝕏¯(pID)𝝃)\displaystyle\qquad\mathrm{Cov}\Big{(}\big{(}I-\alpha D\mathbb{X}_{p}\big{)}\bm{\xi}+\alpha D\overline{\mathbb{X}}\big{(}pI-D\big{)}\bm{\xi}\Big{)}
\displaystyle=\mathrm{Cov}\Big{(}\mathbb{E}\big{[}(I-\alpha D\mathbb{X}_{p})\bm{\xi}+\alpha D\overline{\mathbb{X}}(pI-D)\bm{\xi}\mid\bm{\xi}\big{]}\Big{)} (49)
+𝔼[Cov((IαD𝕏p)𝝃+αD𝕏¯(pID)𝝃𝝃)]\displaystyle\qquad+\mathbb{E}\Big{[}\mathrm{Cov}\big{(}(I-\alpha D\mathbb{X}_{p})\bm{\xi}+\alpha D\overline{\mathbb{X}}(pI-D)\bm{\xi}\mid\bm{\xi}\big{)}\Big{]}
=(Iαp𝕏p)M(Iαp𝕏p)+α2𝔼[Cov(D𝕏p𝝃+D𝕏¯(DpI)𝝃𝝃)].\displaystyle=\big{(}I-\alpha p\mathbb{X}_{p}\big{)}M\big{(}I-\alpha p\mathbb{X}_{p}\big{)}+\alpha^{2}\mathbb{E}\bigg{[}\mathrm{Cov}\Big{(}D\mathbb{X}_{p}\bm{\xi}+D\overline{\mathbb{X}}\big{(}D-pI\big{)}\bm{\xi}\mid\bm{\xi}\Big{)}\bigg{]}. (50)

Applying Lemma 11 with A:=𝕏A:=\mathbb{X}, 𝐮:=𝕏p𝝃\mathbf{u}:=\mathbb{X}_{p}\bm{\xi}, and 𝐯:=𝝃\mathbf{v}:=\bm{\xi} now shows that Slin(M)=Cov((IαD𝕏p)𝝃+αD𝕏¯(pID)𝝃)S_{\mathrm{lin}}(M)=\mathrm{Cov}\big{(}(I-\alpha D\mathbb{X}_{p}\big{)}\bm{\xi}+\alpha D\overline{\mathbb{X}}(pI-D)\bm{\xi}\big{)}. The second term in (50) is always positive semi-definite, proving that Slin(M)λmin(Iαp𝕏p)2λmin(M)IdS_{\mathrm{lin}}(M)\geq\lambda_{\mathrm{min}}(I-\alpha p\mathbb{X}_{p})^{2}\lambda_{\mathrm{min}}(M)\cdot I_{d}. As (idSlin)1==0Slin(\mathrm{id}-S_{\mathrm{lin}})^{-1}=\sum_{\ell=0}^{\infty}S_{\mathrm{lin}}^{\ell} and λmin(Iαp𝕏p)=1αp𝕏p\lambda_{\mathrm{min}}\big{(}I-\alpha p\mathbb{X}_{p}\big{)}=1-\alpha p\lVert\mathbb{X}_{p}\rVert, this implies

(idSlin)1M\displaystyle\big{(}\mathrm{id}-S_{\mathrm{lin}}\big{)}^{-1}M (λmin(M)=0λmin(Iαp𝕏p)2)Id\displaystyle\geq\left(\lambda_{\mathrm{min}}(M)\sum_{\ell=0}^{\infty}\lambda_{\mathrm{min}}\big{(}I-\alpha p\mathbb{X}_{p}\big{)}^{2\ell}\right)\cdot I_{d}
\displaystyle=\dfrac{\lambda_{\mathrm{min}}(M)}{2\alpha p\lVert\mathbb{X}_{p}\rVert-(\alpha p)^{2}\lVert\mathbb{X}_{p}\rVert^{2}}\cdot I_{d}\geq\dfrac{\lambda_{\mathrm{min}}(M)}{2\alpha p\lVert\mathbb{X}_{p}\rVert}\cdot I_{d}.

Lemma 13 moreover gives 𝕏p𝕏\lVert\mathbb{X}_{p}\rVert\leq\lVert\mathbb{X}\rVert. Together with the lower-bound (48) for λmin(S0)\lambda_{\mathrm{min}}(S_{0}), this proves the result. ∎

B.7 Proof of Theorem 4

As in Section 4.1, write 𝜷~krp:=k1j=1k𝜷~j\widetilde{\bm{\beta}}^{\mathrm{rp}}_{k}:=k^{-1}\sum_{j=1}^{k}\widetilde{\bm{\beta}}_{j} for the running average of the iterates and define

Akrp:=𝔼[(𝜷~krp𝜷~)(𝜷~krp𝜷~)]=1k2j,=1k𝔼[(𝜷~j𝜷~)(𝜷~𝜷~)].\displaystyle A^{\mathrm{rp}}_{k}:=\mathbb{E}\Big{[}\big{(}\widetilde{\bm{\beta}}^{\mathrm{rp}}_{k}-\widetilde{\bm{\beta}}\big{)}\big{(}\widetilde{\bm{\beta}}^{\mathrm{rp}}_{k}-\widetilde{\bm{\beta}}\big{)}^{\top}\Big{]}=\dfrac{1}{k^{2}}\sum_{j,\ell=1}^{k}\mathbb{E}\Big{[}\big{(}\widetilde{\bm{\beta}}_{j}-\widetilde{\bm{\beta}}\big{)}\big{(}\widetilde{\bm{\beta}}_{\ell}-\widetilde{\bm{\beta}}\big{)}^{\top}\Big{]}. (51)

Suppose j>j>\ell and take r=0,,jr=0,\ldots,j-\ell. Using induction on rr, we now prove that 𝔼[(𝜷~j𝜷~)(𝜷~𝜷~)]=(Iαp𝕏p)r𝔼[(𝜷~jr𝜷~)(𝜷~𝜷~)]\mathbb{E}\big{[}(\widetilde{\bm{\beta}}_{j}-\widetilde{\bm{\beta}})(\widetilde{\bm{\beta}}_{\ell}-\widetilde{\bm{\beta}})^{\top}\big{]}=(I-\alpha p\mathbb{X}_{p})^{r}\mathbb{E}\big{[}(\widetilde{\bm{\beta}}_{j-r}-\widetilde{\bm{\beta}})(\widetilde{\bm{\beta}}_{\ell}-\widetilde{\bm{\beta}})^{\top}\big{]}. The identity always holds when r=0r=0. Next, suppose the identity holds for some r1<jr-1<j-\ell. Taking k=j+1rk=j+1-r in (14), 𝜷~j+1r𝜷~=(IαDj+1r𝕏p)(𝜷~jr𝜷~)+αDj+1r𝕏¯(pIDj+1r)𝜷~jr\widetilde{\bm{\beta}}_{j+1-r}-\widetilde{\bm{\beta}}=\big{(}I-\alpha D_{j+1-r}\mathbb{X}_{p}\big{)}\big{(}\widetilde{\bm{\beta}}_{j-r}-\widetilde{\bm{\beta}}\big{)}+\alpha D_{j+1-r}\overline{\mathbb{X}}\big{(}pI-D_{j+1-r}\big{)}\widetilde{\bm{\beta}}_{j-r}. Since jrj-r\geq\ell, Dj+1rD_{j+1-r} is by assumption independent of (𝜷~,𝜷~jr,𝜷~)\big{(}\widetilde{\bm{\beta}},\widetilde{\bm{\beta}}_{j-r},\widetilde{\bm{\beta}}_{\ell}\big{)}. Recall from (28) that 𝔼[Dj+1r𝕏¯(pIDj+1r)]=0\mathbb{E}\big{[}D_{j+1-r}\overline{\mathbb{X}}(pI-D_{j+1-r})\big{]}=0. Conditioning on (𝜷~,𝜷~jr,𝜷~)\big{(}\widetilde{\bm{\beta}},\widetilde{\bm{\beta}}_{j-r},\widetilde{\bm{\beta}}_{\ell}\big{)} and applying tower rule now gives

𝔼[(𝜷~j+1r𝜷~)(𝜷~𝜷~)]\displaystyle\mathbb{E}\Big{[}\big{(}\widetilde{\bm{\beta}}_{j+1-r}-\widetilde{\bm{\beta}}\big{)}\big{(}\widetilde{\bm{\beta}}_{\ell}-\widetilde{\bm{\beta}}\big{)}^{\top}\Big{]} =𝔼[IαDj+1r𝕏p]𝔼[(𝜷~jr𝜷~)(𝜷~𝜷~)]\displaystyle=\mathbb{E}\Big{[}I-\alpha D_{j+1-r}\mathbb{X}_{p}\Big{]}\mathbb{E}\Big{[}\big{(}\widetilde{\bm{\beta}}_{j-r}-\widetilde{\bm{\beta}}\big{)}\big{(}\widetilde{\bm{\beta}}_{\ell}-\widetilde{\bm{\beta}}\big{)}^{\top}\Big{]}
+α𝔼[Dj+1r𝕏¯(pIDj+1r)]𝔼[𝜷~jr(𝜷~𝜷~)]\displaystyle\qquad+\alpha\mathbb{E}\Big{[}D_{j+1-r}\overline{\mathbb{X}}\big{(}pI-D_{j+1-r}\big{)}\Big{]}\mathbb{E}\Big{[}\widetilde{\bm{\beta}}_{j-r}\big{(}\widetilde{\bm{\beta}}_{\ell}-\widetilde{\bm{\beta}}\big{)}^{\top}\Big{]}
=(Iαp𝕏p)𝔼[(𝜷~jr𝜷~)(𝜷~𝜷~)].\displaystyle=\big{(}I-\alpha p\mathbb{X}_{p}\big{)}\mathbb{E}\Big{[}\big{(}\widetilde{\bm{\beta}}_{j-r}-\widetilde{\bm{\beta}}\big{)}\big{(}\widetilde{\bm{\beta}}_{\ell}-\widetilde{\bm{\beta}}\big{)}^{\top}\Big{]}.

Together with the induction hypothesis, this proves the desired equality

𝔼[(𝜷~j𝜷~)(𝜷~𝜷~)]\displaystyle\mathbb{E}\Big{[}\big{(}\widetilde{\bm{\beta}}_{j}-\widetilde{\bm{\beta}}\big{)}\big{(}\widetilde{\bm{\beta}}_{\ell}-\widetilde{\bm{\beta}}\big{)}^{\top}\Big{]} =(Iαp𝕏p)r1𝔼[(𝜷~j+1r𝜷~)(𝜷~𝜷~)]\displaystyle=\big{(}I-\alpha p\mathbb{X}_{p}\big{)}^{r-1}\mathbb{E}\Big{[}\big{(}\widetilde{\bm{\beta}}_{j+1-r}-\widetilde{\bm{\beta}}\big{)}\big{(}\widetilde{\bm{\beta}}_{\ell}-\widetilde{\bm{\beta}}\big{)}^{\top}\Big{]}
=(Iαp𝕏p)r𝔼[(𝜷~jr𝜷~)(𝜷~𝜷~)].\displaystyle=\big{(}I-\alpha p\mathbb{X}_{p}\big{)}^{r}\mathbb{E}\Big{[}\big{(}\widetilde{\bm{\beta}}_{j-r}-\widetilde{\bm{\beta}}\big{)}\big{(}\widetilde{\bm{\beta}}_{\ell}-\widetilde{\bm{\beta}}\big{)}^{\top}\Big{]}.

For j<j<\ell, transposing and flipping the roles of jj and \ell also shows that 𝔼[(𝜷~j𝜷~)(𝜷~𝜷~)]=𝔼[(𝜷~j𝜷~)(𝜷~r𝜷~)](Iαp𝕏p)r\mathbb{E}\big{[}(\widetilde{\bm{\beta}}_{j}-\widetilde{\bm{\beta}})(\widetilde{\bm{\beta}}_{\ell}-\widetilde{\bm{\beta}})^{\top}\big{]}=\mathbb{E}\big{[}(\widetilde{\bm{\beta}}_{j}-\widetilde{\bm{\beta}})(\widetilde{\bm{\beta}}_{\ell-r}-\widetilde{\bm{\beta}})^{\top}\big{]}(I-\alpha p\mathbb{X}_{p})^{r} with r=0,,jr=0,\ldots,\ell-j.

Defining A:=𝔼[(𝜷~𝜷~)(𝜷~𝜷~)]A_{\ell}:=\mathbb{E}\big{[}(\widetilde{\bm{\beta}}_{\ell}-\widetilde{\bm{\beta}})(\widetilde{\bm{\beta}}_{\ell}-\widetilde{\bm{\beta}})^{\top}\big{]} and taking r=|j|r=\lvert j-\ell\rvert, (51) may now be rewritten as

\displaystyle A^{\mathrm{rp}}_{k}=\dfrac{1}{k^{2}}\sum_{j=1}^{k}\left(\sum_{\ell=1}^{j}\big{(}I-\alpha p\mathbb{X}_{p}\big{)}^{j-\ell}A_{\ell}+\sum_{\ell=j+1}^{k}A_{j}\big{(}I-\alpha p\mathbb{X}_{p}\big{)}^{\ell-j}\right). (52)

Set γ:=Iαp𝕏p\gamma:=\lVert I-\alpha p\mathbb{X}_{p}\rVert, then γ1αp(1p)mini𝕏ii<1\gamma\leq 1-\alpha p(1-p)\min_{i}\mathbb{X}_{ii}<1 by Lemma 1. Note also that r=0jγrr=0γr=(1γ)1\sum_{r=0}^{j}\gamma^{r}\leq\sum_{r=0}^{\infty}\gamma^{r}=(1-\gamma)^{-1}. Using the triangle inequality and sub-multiplicativity of the spectral norm, (52) then satisfies

\displaystyle\big{\lVert}A^{\mathrm{rp}}_{k}\big{\rVert}\leq\dfrac{1}{k^{2}}\sum_{j=1}^{k}\left(\sum_{\ell=1}^{j}\gamma^{j-\ell}\big{\lVert}A_{\ell}\big{\rVert}+\sum_{\ell=j+1}^{k}\big{\lVert}A_{j}\big{\rVert}\gamma^{\ell-j}\right)\leq\dfrac{2}{k^{2}}\sum_{\ell=1}^{k}\big{\lVert}A_{\ell}\big{\rVert}\sum_{r=0}^{\infty}\gamma^{r}=\dfrac{2}{k^{2}(1-\gamma)}\sum_{\ell=1}^{k}\big{\lVert}A_{\ell}\big{\rVert}.

As shown in Theorem 2, A(idSlin)1S0+Cγ1\lVert A_{\ell}\rVert\leq\big{\lVert}(\mathrm{id}-S_{\mathrm{lin}})^{-1}S_{0}\big{\rVert}+C\ell\gamma^{\ell-1} for some constant CC. Observing that =1γ1=γ=1γ=γ((1γ)11)=(1γ)2\sum_{\ell=1}^{\infty}\ell\gamma^{\ell-1}=\partial_{\gamma}\sum_{\ell=1}^{\infty}\gamma^{\ell}=\partial_{\gamma}\big{(}(1-\gamma)^{-1}-1\big{)}=(1-\gamma)^{-2}, this implies

Akrp2k(1γ)(idSlin)1S0+2Ck2(1γ)3.\displaystyle\big{\lVert}A^{\mathrm{rp}}_{k}\big{\rVert}\leq\dfrac{2}{k(1-\gamma)}\big{\lVert}(\mathrm{id}-S_{\mathrm{lin}})^{-1}S_{0}\big{\rVert}+\dfrac{2C}{k^{2}(1-\gamma)^{3}}.

To complete the proof note that γ1αp(1p)mini𝕏ii\gamma\leq 1-\alpha p(1-p)\min_{i}\mathbb{X}_{ii} may be rewritten as (1γ)1(αp(1p)mini𝕏ii)1(1-\gamma)^{-1}\leq\big{(}\alpha p(1-p)\min_{i}\mathbb{X}_{ii}\big{)}^{-1} and (idSlin)1S0αp(1p)2(mini𝕏ii)3𝕏2𝔼[X𝐘𝐘X]\big{\lVert}(\mathrm{id}-S_{\mathrm{lin}})^{-1}S_{0}\big{\rVert}\leq\alpha p(1-p)^{-2}(\min_{i}\mathbb{X}_{ii})^{-3}\lVert\mathbb{X}\rVert^{2}\cdot\big{\lVert}\mathbb{E}[X^{\top}\mathbf{Y}\mathbf{Y}^{\top}X]\big{\rVert} by (43). ∎
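
As an illustration of Theorem 4, the dropout recursion can be simulated and the second-moment matrix of the Ruppert-Polyak averages estimated over independent repetitions. The sketch below assumes the iterates are generated by \widetilde{\bm{\beta}}_{k}=\widetilde{\bm{\beta}}_{k-1}+\alpha D_{k}X^{\top}(\mathbf{Y}-XD_{k}\widetilde{\bm{\beta}}_{k-1}) with \widetilde{\bm{\beta}}=\mathbb{X}_{p}^{-1}X^{\top}\mathbf{Y} (the centered form of this update is the recursion (14) quoted above); the design, the noise model, and all constants are illustrative choices only.

import numpy as np

rng = np.random.default_rng(5)
n, d, p, alpha = 50, 3, 0.8, 0.001
reps, k = 500, 200

X = rng.standard_normal((n, d))
XX = X.T @ X
XXp = p * XX + (1 - p) * np.diag(np.diag(XX))
beta_star = np.ones(d)

errors = np.zeros((reps, d))                   # realizations of beta_rp_k - beta_tilde
for r in range(reps):
    Y = X @ beta_star + rng.standard_normal(n)
    beta_tilde = np.linalg.solve(XXp, X.T @ Y)             # marginal fixed point
    beta = np.zeros(d)
    running_sum = np.zeros(d)
    for _ in range(k):
        D = rng.binomial(1, p, size=d)                     # diagonal of the dropout matrix
        beta = beta + alpha * D * (X.T @ (Y - X @ (D * beta)))
        running_sum += beta
    errors[r] = running_sum / k - beta_tilde

A_rp = errors.T @ errors / reps                # Monte Carlo estimate of A_k^rp
print(np.linalg.norm(A_rp, 2))                 # decreases as k grows, consistent with the 1/k bound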

B.8 Proof of Lemma 4

Recall the definition T(A):=(Iαp𝕏)A(Iαp𝕏)+α2p(1p)Diag(𝕏A𝕏)T(A):=\big{(}I-\alpha p\mathbb{X}\big{)}A\big{(}I-\alpha p\mathbb{X}\big{)}+\alpha^{2}p(1-p)\mathrm{Diag}\big{(}\mathbb{X}A\mathbb{X}\big{)}.

Lemma 7.

For every k=1,2,k=1,2,\ldots

Cov(𝜷^k𝜷^)\displaystyle\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{)} =(Iαp𝕏)Cov(𝜷^k1𝜷^)(Iαp𝕏)\displaystyle=\big{(}I-\alpha p\mathbb{X}\big{)}\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}}\big{)}\big{(}I-\alpha p\mathbb{X}\big{)}
+α2p(1p)Diag(𝕏𝔼[(𝜷^k1𝜷^)(𝜷^k1𝜷^)]𝕏)\displaystyle\qquad+\alpha^{2}p(1-p)\mathrm{Diag}\bigg{(}\mathbb{X}\mathbb{E}\Big{[}\big{(}\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}}\big{)}\big{(}\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}}\big{)}^{\top}\Big{]}\mathbb{X}\bigg{)}
T(Cov(𝜷^k1𝜷^)),\displaystyle\geq T\big{(}\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}}\big{)}\big{)},

with equality if 𝛃^0=𝔼[𝛃^]\widehat{\bm{\beta}}_{0}=\mathbb{E}\big{[}\widehat{\bm{\beta}}\big{]} almost surely.

Proof  Recall the definition Ak:=𝔼[(𝜷^k𝜷^)(𝜷^k𝜷^)]A_{k}:=\mathbb{E}\big{[}(\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}})(\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}})^{\top}\big{]}, so that Cov(𝜷^k𝜷^)=Ak𝔼[𝜷^k𝜷^]𝔼[𝜷^k𝜷^]\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{)}=A_{k}-\mathbb{E}\big{[}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{]}\mathbb{E}\big{[}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{]}^{\top}. As shown in (22), Ak=T(Ak1)A_{k}=T(A_{k-1}) and hence

\displaystyle\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{)}=T\big{(}A_{k-1}\big{)}-\mathbb{E}\big{[}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{]}\mathbb{E}\big{[}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{]}^{\top}.

By definition, T(Ak1)=(Iαp𝕏)Ak1(Iαp𝕏)+α2p(1p)Diag(𝕏Ak1𝕏)T(A_{k-1})=\big{(}I-\alpha p\mathbb{X}\big{)}A_{k-1}\big{(}I-\alpha p\mathbb{X})+\alpha^{2}p(1-p)\mathrm{Diag}\big{(}\mathbb{X}A_{k-1}\mathbb{X}\big{)}. Recall from (20) that 𝔼[𝜷^k𝜷^]=(Iαp𝕏)𝔼[𝜷^k1𝜷^]\mathbb{E}\big{[}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{]}=\big{(}I-\alpha p\mathbb{X}\big{)}\mathbb{E}\big{[}\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}}\big{]}, so Cov(𝜷^k1𝜷^)=Ak1𝔼[𝜷^k1𝜷^]𝔼[𝜷^k1𝜷^]\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}}\big{)}=A_{k-1}-\mathbb{E}\big{[}\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}}\big{]}\mathbb{E}\big{[}\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}}\big{]}^{\top} implies

(Iαp𝕏)Ak1(Iαp𝕏)=(Iαp𝕏)Cov(𝜷^k1𝜷^)(Iαp𝕏)+𝔼[𝜷^k𝜷^]𝔼[𝜷^k𝜷^].\displaystyle\big{(}I-\alpha p\mathbb{X}\big{)}A_{k-1}\big{(}I-\alpha p\mathbb{X})=\big{(}I-\alpha p\mathbb{X}\big{)}\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}}\big{)}\big{(}I-\alpha p\mathbb{X})+\mathbb{E}[\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}]\mathbb{E}[\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}]^{\top}.

Together, these identities prove the first claim.

The lower bound follows from 𝕏𝔼[𝜷^k1𝜷^]𝔼[𝜷^k1𝜷^]𝕏\mathbb{X}\mathbb{E}\big{[}\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}}\big{]}\mathbb{E}\big{[}\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}}\big{]}^{\top}\mathbb{X} being positive semi-definite. A positive semi-definite matrix has non-negative diagonal entries, meaning Diag(𝕏𝔼[𝜷^k1𝜷^]𝔼[𝜷^k1𝜷^]𝕏)\mathrm{Diag}\big{(}\mathbb{X}\mathbb{E}[\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}}]\mathbb{E}[\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}}]^{\top}\mathbb{X}\big{)} is also positive semi-definite. Next, note that Ak1=Cov(𝜷^k1𝜷^)+𝔼[𝜷^k1𝜷^]𝔼[𝜷^k1𝜷^]A_{k-1}=\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}}\big{)}+\mathbb{E}\big{[}\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}}\big{]}\mathbb{E}\big{[}\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}}\big{]}^{\top} and in turn

Diag(𝕏Ak1𝕏)=Diag(𝕏Cov(𝜷^k1𝜷^)𝕏)+Diag(𝕏𝔼[𝜷^k1𝜷^]𝔼[𝜷^k1𝜷^]𝕏)Diag(𝕏Cov(𝜷^k1𝜷^)𝕏).\begin{split}\mathrm{Diag}\big{(}\mathbb{X}A_{k-1}\mathbb{X}\big{)}&=\mathrm{Diag}\big{(}\mathbb{X}\mathrm{Cov}(\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}})\mathbb{X}\big{)}+\mathrm{Diag}\big{(}\mathbb{X}\mathbb{E}[\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}}]\mathbb{E}[\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}}]^{\top}\mathbb{X}\big{)}\\ &\geq\mathrm{Diag}\big{(}\mathbb{X}\mathrm{Cov}(\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}})\mathbb{X}\big{)}.\end{split} (53)

Together with the first part of the lemma and the definition of TT, the lower-bound follows.

Lastly, if 𝜷^0=𝔼[𝜷^]\widehat{\bm{\beta}}_{0}=\mathbb{E}[\widehat{\bm{\beta}}] almost surely, then (20) implies 𝔼[𝜷^k𝜷^]=(Iαp𝕏)k𝔼[𝜷^0𝜷^]=0\mathbb{E}\big{[}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{]}=\big{(}I-\alpha p\mathbb{X}\big{)}^{k}\mathbb{E}\big{[}\widehat{\bm{\beta}}_{0}-\widehat{\bm{\beta}}\big{]}=0, so equality holds in (53). ∎

Consider arbitrary positive semi-definite matrices ABA\geq B, then 𝐰(AB)𝐰0\mathbf{w}^{\top}(A-B)\mathbf{w}\geq 0 for all vectors 𝐰\mathbf{w}. Given any vector 𝐯\mathbf{v}, this implies

𝐯T(AB)𝐯\displaystyle\mathbf{v}^{\top}T(A-B)\mathbf{v} =𝐯(Iαp𝕏)(AB)(Iαp𝕏)𝐯+α2p(1p)=1dv2𝐞𝕏(AB)𝕏𝐞\displaystyle=\mathbf{v}^{\top}\big{(}I-\alpha p\mathbb{X}\big{)}\big{(}A-B\big{)}\big{(}I-\alpha p\mathbb{X}\big{)}\mathbf{v}+\alpha^{2}p(1-p)\sum_{\ell=1}^{d}v_{\ell}^{2}\mathbf{e}_{\ell}^{\top}\mathbb{X}\big{(}A-B\big{)}\mathbb{X}\mathbf{e}_{\ell}
0\displaystyle\geq 0

with 𝐞\mathbf{e}_{\ell} the \ellth standard basis vector. Accordingly, TT is operator monotone with respect to the ordering of positive semi-definite matrices, in the sense that T(A)T(B)T(A)\geq T(B) whenever ABA\geq B. Using induction on kk, Lemma 7 may now be rewritten as

Cov(𝜷^k𝜷^)Tk(Cov(𝜷^0𝜷^)).\displaystyle\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{)}\geq T^{k}\Big{(}\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{0}-\widehat{\bm{\beta}}\big{)}\Big{)}. (54)

To complete the proof, the right-hand side of (54) will be analyzed for a suitable choice of 𝕏\mathbb{X}.
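
The monotonicity T(A)\geq T(B) for A\geq B can also be checked numerically; a minimal sketch with random positive semi-definite matrices (an illustration, not part of the argument):

import numpy as np

rng = np.random.default_rng(6)
d, p, alpha = 4, 0.6, 0.01

M = rng.standard_normal((d, d))
XX = M @ M.T                                   # plays the role of X^T X

def T(A):
    Q = np.eye(d) - alpha * p * XX
    return Q @ A @ Q + alpha ** 2 * p * (1 - p) * np.diag(np.diag(XX @ A @ XX))

G = rng.standard_normal((d, d))
H = rng.standard_normal((d, d))
B = G @ G.T                                    # positive semi-definite
A = B + H @ H.T                                # A >= B in the positive semi-definite order
print(np.linalg.eigvalsh(T(A) - T(B)).min() >= -1e-10)   # True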

Lemma 8.

Suppose 𝛃^0\widehat{\bm{\beta}}_{0} is independent of all other sources of randomness. Consider the linear regression model with a single observation n=1n=1, number of parameters d2d\geq 2, and design matrix X=𝟏X=\mathbf{1}^{\top}. Then, Cov(𝛃^k)Cov(𝛃^k𝛃^)+Cov(𝛃^)\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{k}\big{)}\geq\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{)}+\mathrm{Cov}\big{(}\widehat{\bm{\beta}}\big{)} and for any dd-dimensional vector 𝐯\mathbf{v} satisfying 𝐯𝟏=0\mathbf{v}^{\top}\mathbf{1}=0 and every k=1,2,k=1,2,\ldots

\displaystyle\mathbf{v}^{\top}\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{)}\mathbf{v}\geq\alpha^{2}p(1-p)\left\lVert\mathbf{v}\right\rVert_{2}^{2}.

Proof  By definition, 𝕏=𝟏𝟏\mathbb{X}=\mathbf{1}\mathbf{1}^{\top} is the d×dd\times d-matrix with all entries equal to one. Consequently, 𝕏k=dk1𝕏\mathbb{X}^{k}=d^{k-1}\mathbb{X} for all k1k\geq 1 and 𝕏X=𝟏𝟏𝟏=d𝟏=dX\mathbb{X}X^{\top}=\mathbf{1}\mathbf{1}^{\top}\mathbf{1}=d\mathbf{1}=dX^{\top}, so 𝜷^:=d1X𝐘\widehat{\bm{\beta}}:=d^{-1}X^{\top}\mathbf{Y} satisfies the normal equations X𝐘=𝕏𝜷^X^{\top}\mathbf{Y}=\mathbb{X}\widehat{\bm{\beta}}.

To prove Cov(𝜷^k)Cov(𝜷^k𝜷^)+Cov(𝜷^)\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{k}\big{)}\geq\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{)}+\mathrm{Cov}\big{(}\widehat{\bm{\beta}}\big{)}, note that Cov(𝜷^k𝜷^+𝜷^)Cov(𝜷^k𝜷^)+Cov(𝜷^)\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}+\widehat{\bm{\beta}}\big{)}\geq\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{)}+\mathrm{Cov}\big{(}\widehat{\bm{\beta}}\big{)} whenever Cov(𝜷^k𝜷^,𝜷^)0\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}},\widehat{\bm{\beta}}\big{)}\geq 0. By conditioning on (𝜷^k,𝜷^)\big{(}\widehat{\bm{\beta}}_{k},\widehat{\bm{\beta}}\big{)}, the identity 𝔼[𝜷^k𝜷^]=(Iαp𝕏)𝔼[𝜷^k1𝜷^]\mathbb{E}\big{[}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{]}=\big{(}I-\alpha p\mathbb{X}\big{)}\mathbb{E}\big{[}\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}}\big{]} was shown in (20). The same argument also proves 𝔼[(𝜷^k𝜷^)𝜷^]=(Iαp𝕏)𝔼[(𝜷^k1𝜷^)𝜷^]\mathbb{E}\big{[}(\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}})\widehat{\bm{\beta}}^{\top}\big{]}=\big{(}I-\alpha p\mathbb{X}\big{)}\mathbb{E}\big{[}(\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}})\widehat{\bm{\beta}}^{\top}\big{]}. Induction on kk and the assumed independence between 𝜷^0\widehat{\bm{\beta}}_{0} and 𝜷^\widehat{\bm{\beta}} now lead to

Cov(𝜷^k𝜷^,𝜷^)\displaystyle\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}},\widehat{\bm{\beta}}\big{)} =𝔼[(𝜷^k𝜷^)𝜷^]𝔼[𝜷^k𝜷^]𝔼[𝜷^]\displaystyle=\mathbb{E}\big{[}(\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}})\widehat{\bm{\beta}}^{\top}\big{]}-\mathbb{E}\big{[}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{]}\mathbb{E}\big{[}\widehat{\bm{\beta}}^{\top}\big{]}
=(Iαp𝕏)Cov(𝜷^k1𝜷^,𝜷^)\displaystyle=\big{(}I-\alpha p\mathbb{X}\big{)}\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}},\widehat{\bm{\beta}}\big{)}
=(Iαp𝕏)kCov(𝜷^0𝜷^,𝜷^)\displaystyle=\big{(}I-\alpha p\mathbb{X}\big{)}^{k}\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{0}-\widehat{\bm{\beta}},\widehat{\bm{\beta}}\big{)}
=(Iαp𝕏)kCov(𝜷^).\displaystyle=\big{(}I-\alpha p\mathbb{X}\big{)}^{k}\mathrm{Cov}\big{(}\widehat{\bm{\beta}}\big{)}. (55)

Next, note that \mathrm{Cov}\big{(}\widehat{\bm{\beta}}\big{)}=d^{-2}\mathbb{X} and \big{(}I-\alpha p\mathbb{X}\big{)}\mathbb{X}=\big{(}1-\alpha pd\big{)}\mathbb{X}. As \alpha p\lVert\mathbb{X}\rVert=\alpha pd<1, (55) satisfies

(Iαp𝕏)kCov(𝜷^)\displaystyle\big{(}I-\alpha p\mathbb{X}\big{)}^{k}\mathrm{Cov}\big{(}\widehat{\bm{\beta}}\big{)} =1d2(Iαp𝕏)k𝕏=1d2(1αpd)k𝕏0\displaystyle=\dfrac{1}{d^{2}}\big{(}I-\alpha p\mathbb{X}\big{)}^{k}\mathbb{X}=\dfrac{1}{d^{2}}\big{(}1-\alpha pd\big{)}^{k}\mathbb{X}\geq 0

which proves the first claim.

To prove the second claim, we first show that there are real sequences {νk}k\{\nu_{k}\}_{k} and {λk}k\{\lambda_{k}\}_{k}, not depending on the distribution of 𝜷^0\widehat{\bm{\beta}}_{0}, such that

Cov(𝜷^k𝜷^)νkId+λkd𝕏\displaystyle\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{)}\geq\nu_{k}I_{d}+\tfrac{\lambda_{k}}{d}\mathbb{X} (56)

for all k0k\geq 0, with equality if 𝜷^0=𝔼[𝜷^]\widehat{\bm{\beta}}_{0}=\mathbb{E}\big{[}\widehat{\bm{\beta}}\big{]} almost surely. When k=0k=0, independence between 𝜷^0\widehat{\bm{\beta}}_{0} and 𝜷^\widehat{\bm{\beta}}, as well as Cov(𝜷^)=d2𝕏\mathrm{Cov}\big{(}\widehat{\bm{\beta}}\big{)}=d^{-2}\mathbb{X}, imply Cov(𝜷^0𝜷^)=Cov(𝜷^0)+Cov(𝜷^)d2𝕏\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{0}-\widehat{\bm{\beta}}\big{)}=\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{0}\big{)}+\mathrm{Cov}\big{(}\widehat{\bm{\beta}}\big{)}\geq d^{-2}\mathbb{X}. Moreover, equality holds whenever 𝜷^0\widehat{\bm{\beta}}_{0} is deterministic.

For the sake of induction suppose the claim is true for some k1k-1. Lemma 7 and operator monotonicity of TT then imply

Cov(𝜷^k𝜷^)T(Cov(𝜷^k1𝜷^))T(νk1Id+λk1d𝕏).\displaystyle\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{)}\geq T\Big{(}\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{k-1}-\widehat{\bm{\beta}}\big{)}\Big{)}\geq T\Big{(}\nu_{k-1}I_{d}+\tfrac{\lambda_{k-1}}{d}\mathbb{X}\Big{)}.

In case \widehat{\bm{\beta}}_{0}=\mathbb{E}\big{[}\widehat{\bm{\beta}}\big{]} almost surely, Lemma 7 and the induction hypothesis assert equality in the previous display. Recall \mathbb{X}^{\ell}=d^{\ell-1}\mathbb{X}, so \big{(}I-\alpha p\mathbb{X}\big{)}^{2}=I+\big{(}(\alpha p)^{2}d-2\alpha p\big{)}\mathbb{X} as well as \big{(}I-\alpha p\mathbb{X}\big{)}\mathbb{X}\big{(}I-\alpha p\mathbb{X}\big{)}=(1-\alpha pd)^{2}\mathbb{X}. Note also that \mathrm{Diag}(\mathbb{X})=I_{d}. Expanding the definition T(A)=\big{(}I-\alpha p\mathbb{X}\big{)}A\big{(}I-\alpha p\mathbb{X}\big{)}+\alpha^{2}p(1-p)\mathrm{Diag}\big{(}\mathbb{X}A\mathbb{X}\big{)} now results in

T(νk1Id+λk1d𝕏)\displaystyle T\left(\nu_{k-1}I_{d}+\dfrac{\lambda_{k-1}}{d}\mathbb{X}\right)
=\displaystyle=\ (νk1+α2p(1p)d(νk1+λk1))Id+(νk1((αp)2d2αp)+λk1(1αpd)2d)𝕏\displaystyle\Big{(}\nu_{k-1}+\alpha^{2}p(1-p)d(\nu_{k-1}+\lambda_{k-1})\Big{)}\cdot I_{d}+\left(\nu_{k-1}\Big{(}(\alpha p)^{2}d-2\alpha p\Big{)}+\dfrac{\lambda_{k-1}(1-\alpha pd)^{2}}{d}\right)\cdot\mathbb{X}
=\displaystyle=\ (νk1+α2p(1p)d(νk1+λk1))Idνk1(1αpd)2(νk1+λk1)d𝕏\displaystyle\Big{(}\nu_{k-1}+\alpha^{2}p(1-p)d(\nu_{k-1}+\lambda_{k-1})\Big{)}\cdot I_{d}-\dfrac{\nu_{k-1}-(1-\alpha pd)^{2}(\nu_{k-1}+\lambda_{k-1})}{d}\cdot\mathbb{X} (57)

This establishes the induction step and thereby proves (56) for all k0k\geq 0.
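
For \mathbb{X}=\mathbf{1}\mathbf{1}^{\top} the map T sends \nu_{k-1}I_{d}+\tfrac{\lambda_{k-1}}{d}\mathbb{X} to \nu_{k}I_{d}+\tfrac{\lambda_{k}}{d}\mathbb{X} with the coefficients read off from (57); a minimal numerical check of this closed form:

import numpy as np

rng = np.random.default_rng(7)
d, p, alpha = 4, 0.7, 0.03            # chosen so that alpha * p * d < 1
XX = np.ones((d, d))                  # X = 1^T, hence X^T X is the all-ones matrix

def T(A):
    Q = np.eye(d) - alpha * p * XX
    return Q @ A @ Q + alpha ** 2 * p * (1 - p) * np.diag(np.diag(XX @ A @ XX))

nu, lam = rng.random(), rng.random()
lhs = T(nu * np.eye(d) + lam / d * XX)

nu_new = nu + alpha ** 2 * p * (1 - p) * d * (nu + lam)
lam_new = -(nu - (1 - alpha * p * d) ** 2 * (nu + lam))
rhs = nu_new * np.eye(d) + lam_new / d * XX
print(np.max(np.abs(lhs - rhs)))      # numerically zero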

As (νk,λk)\big{(}\nu_{k},\lambda_{k}) do not depend on the distribution of 𝜷^0\widehat{\bm{\beta}}_{0}, taking 𝜷^0=𝔼[𝜷^]\widehat{\bm{\beta}}_{0}=\mathbb{E}\big{[}\widehat{\bm{\beta}}\big{]} shows that νk+λk0\nu_{k}+\lambda_{k}\geq 0 for all k0k\geq 0 since

0𝟏Cov(𝜷^k𝜷^)𝟏=𝟏(νkId+λkd𝕏)𝟏=d(νk+λk).\displaystyle 0\leq\mathbf{1}^{\top}\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{)}\mathbf{1}=\mathbf{1}^{\top}\Big{(}\nu_{k}I_{d}+\tfrac{\lambda_{k}}{d}\mathbb{X}\Big{)}\mathbf{1}=d\big{(}\nu_{k}+\lambda_{k}\big{)}.

Consequently, (57) implies νk=νk1+α2p(1p)d(νk1+λk1)νk1\nu_{k}=\nu_{k-1}+\alpha^{2}p(1-p)d(\nu_{k-1}+\lambda_{k-1})\geq\nu_{k-1}, proving that {νk}k\{\nu_{k}\}_{k} is non-decreasing in kk.

Lastly, we show that ν1α2p(1p)\nu_{1}\geq\alpha^{2}p(1-p). To this end, recall that 𝕏3=d2𝕏\mathbb{X}^{3}=d^{2}\mathbb{X} and Diag(𝕏)=I\mathrm{Diag}(\mathbb{X})=I. As TT is operator monotone and T(A)α2p(1p)Diag(𝕏A𝕏)T(A)\geq\alpha^{2}p(1-p)\mathrm{Diag}\big{(}\mathbb{X}A\mathbb{X}\big{)}, independence of 𝜷^0\widehat{\bm{\beta}}_{0} and 𝜷^\widehat{\bm{\beta}} results in

Cov(𝜷^1𝜷^)\displaystyle\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{1}-\widehat{\bm{\beta}}\big{)} T(Cov(𝜷^0𝜷^))T(Cov(𝜷^))α2p(1p)Diag(𝕏d2𝕏𝕏)\displaystyle\geq T\Big{(}\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{0}-\widehat{\bm{\beta}}\big{)}\Big{)}\geq T\Big{(}\mathrm{Cov}\big{(}\widehat{\bm{\beta}}\big{)}\Big{)}\geq\alpha^{2}p(1-p)\mathrm{Diag}\big{(}\mathbb{X}d^{-2}\mathbb{X}\mathbb{X}\big{)}
=α2p(1p)d2Diag(𝕏3)=α2p(1p)Diag(𝕏)=α2p(1p)Id,\displaystyle=\dfrac{\alpha^{2}p(1-p)}{d^{2}}\mathrm{Diag}\big{(}\mathbb{X}^{3}\big{)}=\alpha^{2}p(1-p)\mathrm{Diag}(\mathbb{X})=\alpha^{2}p(1-p)I_{d},

so ν1α2p(1p)\nu_{1}\geq\alpha^{2}p(1-p).

To complete the proof, observe that 𝐯𝟏=0\mathbf{v}^{\top}\mathbf{1}=0 implies 𝕏𝐯=0\mathbb{X}\mathbf{v}=0. Accordingly,

\displaystyle\mathbf{v}^{\top}\mathrm{Cov}\big{(}\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}}\big{)}\mathbf{v}\geq\mathbf{v}^{\top}\Big{(}\nu_{k}I_{d}+\tfrac{\lambda_{k}}{d}\mathbb{X}\Big{)}\mathbf{v}\geq\nu_{k}\left\lVert\mathbf{v}\right\rVert_{2}^{2}\geq\nu_{1}\left\lVert\mathbf{v}\right\rVert_{2}^{2}\geq\alpha^{2}p(1-p)\left\lVert\mathbf{v}\right\rVert_{2}^{2}

which yields the second claim of the lemma. ∎

B.9 Proof of Theorem 5

Recall that T(A):=(Iαp𝕏)A(Iαp𝕏)+α2p(1p)Diag(𝕏A𝕏)T(A):=\big{(}I-\alpha p\mathbb{X}\big{)}A\big{(}I-\alpha p\mathbb{X}\big{)}+\alpha^{2}p(1-p)\mathrm{Diag}\big{(}\mathbb{X}A\mathbb{X}\big{)}, as defined in (21). If Ak:=𝔼[(𝜷^k𝜷^)(𝜷^k𝜷^)]A_{k}:=\mathbb{E}\big{[}(\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}})(\widehat{\bm{\beta}}_{k}-\widehat{\bm{\beta}})^{\top}\big{]}, then Ak=T(Ak1)A_{k}=T\big{(}A_{k-1}\big{)} by (22).

For an arbitrary d×dd\times d matrix AA, the triangle inequality, Lemma 13, and submultiplicativity of the spectral norm imply

T(A)\displaystyle\big{\lVert}T(A)\big{\rVert} (Iαp𝕏2+α2p(1p)𝕏2)A.\displaystyle\leq\Big{(}\big{\lVert}I-\alpha p\mathbb{X}\big{\rVert}^{2}+\alpha^{2}p(1-p)\left\lVert\mathbb{X}\right\rVert^{2}\Big{)}\left\lVert A\right\rVert. (58)

As 𝕏\mathbb{X} is positive definite, αp𝕏<1\alpha p\left\lVert\mathbb{X}\right\rVert<1 implies Iαp𝕏=1αpλmin(𝕏)\big{\lVert}I-\alpha p\mathbb{X}\big{\rVert}=1-\alpha p\lambda_{\mathrm{min}}(\mathbb{X}). If αλmin(𝕏)𝕏2\alpha\leq\tfrac{\lambda_{\mathrm{min}}(\mathbb{X})}{\lVert\mathbb{X}\rVert^{2}}, then

Iαp𝕏2+α2p(1p)𝕏2\displaystyle\big{\lVert}I-\alpha p\mathbb{X}\big{\rVert}^{2}+\alpha^{2}p(1-p)\lVert\mathbb{X}\rVert^{2} =12αpλmin(𝕏)+α2(p2λmin(𝕏)2+p(1p)𝕏2)\displaystyle=1-2\alpha p\lambda_{\mathrm{min}}(\mathbb{X})+\alpha^{2}\big{(}p^{2}\lambda_{\mathrm{min}}(\mathbb{X})^{2}+p(1-p)\lVert\mathbb{X}\rVert^{2}\big{)}
(12αpλmin(𝕏))+α2p𝕏2\displaystyle\leq\big{(}1-2\alpha p\lambda_{\mathrm{min}}(\mathbb{X})\big{)}+\alpha^{2}p\lVert\mathbb{X}\rVert^{2}
1αpλmin(𝕏).\displaystyle\leq 1-\alpha p\lambda_{\mathrm{min}}(\mathbb{X}).

Together with (58) this leads to Ak(1αpλmin(𝕏))Ak1\lVert A_{k}\rVert\leq\big{(}1-\alpha p\lambda_{\mathrm{min}}(\mathbb{X})\big{)}\lVert A_{k-1}\rVert. By induction on kk, Ak(1αpλmin(𝕏))kA0\lVert A_{k}\rVert\leq\big{(}1-\alpha p\lambda_{\mathrm{min}}(\mathbb{X})\big{)}^{k}\lVert A_{0}\rVert, completing the proof. ∎
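
The contraction property derived above is straightforward to confirm numerically; a minimal sketch drawing a random positive definite \mathbb{X} and setting the learning rate to the boundary value \alpha=\lambda_{\mathrm{min}}(\mathbb{X})/\lVert\mathbb{X}\rVert^{2}:

import numpy as np

rng = np.random.default_rng(8)
d, p = 4, 0.5

M = rng.standard_normal((2 * d, d))
XX = M.T @ M                                    # positive definite with probability one
lam_min = np.linalg.eigvalsh(XX).min()
alpha = lam_min / np.linalg.norm(XX, 2) ** 2    # the admissible learning rate from the proof

def T(A):
    Q = np.eye(d) - alpha * p * XX
    return Q @ A @ Q + alpha ** 2 * p * (1 - p) * np.diag(np.diag(XX @ A @ XX))

A = rng.standard_normal((d, d))
print(np.linalg.norm(T(A), 2) <= (1 - alpha * p * lam_min) * np.linalg.norm(A, 2))   # True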

Appendix C Higher Moments of Dropout Matrices

Deriving concise closed-form expressions for third and fourth order expectations of the dropout matrices presents one of the main technical challenges encountered in Section B.

All matrices in this section will be of dimension d×dd\times d and all vectors of length dd. Moreover, DD always denotes a random diagonal matrix such that Diii.i.d.Ber(p)D_{ii}\overset{\scriptscriptstyle i.i.d.}{\sim}\mathrm{Ber}(p) for all i=1,,di=1,\ldots,d. The diagonal entries of DD are elements of {0,1}\{0,1\}, meaning D=DkD=D^{k} for all positive powers kk.

Given a matrix AA and p(0,1)p\in(0,1), recall the definitions

Ap\displaystyle A_{p} :=pA+(1p)Diag(A)\displaystyle:=pA+(1-p)\mathrm{Diag}(A)
A¯\displaystyle\overline{A} :=ADiag(A).\displaystyle:=A-\mathrm{Diag}(A).

The first lemma contains some simple identities.

Lemma 9.

For arbitrary matrices AA and BB, p(0,1)p\in(0,1), and a diagonal matrix FF,

  (a)

    AF¯=A¯F\overline{AF}=\overline{A}F and FA¯=FA¯\overline{FA}=F\overline{A}

  (b)

    A¯p=pA¯=Ap¯\overline{A}_{p}=p\overline{A}=\overline{A_{p}}

  (c)

    Diag(A¯B)=Diag(AB¯)\mathrm{Diag}\big{(}\overline{A}B\big{)}=\mathrm{Diag}\big{(}A\overline{B}\big{)}.

Proof

  (a)

    By definition, (A¯F)ij=Fjj𝟙{ij}Aij(\overline{A}F)_{ij}=F_{jj}\mathbbm{1}_{\{i\neq j\}}A_{ij} for all i,j{1,,d}i,j\in\{1,\dots,d\}, which equals AF¯\overline{AF}. The second equality then follows by transposition.

  (b)

    Clearly, Diag(A¯)=0\mathrm{Diag}(\overline{A})=0 and in turn A¯p=pA¯+(1p)Diag(A¯)=pA¯\overline{A}_{p}=p\overline{A}+(1-p)\mathrm{Diag}(\overline{A})=p\overline{A}. On the other hand, Diag(Ap)=Diag(A)\mathrm{Diag}(A_{p})=\mathrm{Diag}(A) and hence Ap¯=pA+(1p)Diag(A)Diag(A)=pApDiag(A)\overline{A_{p}}=pA+(1-p)\mathrm{Diag}(A)-\mathrm{Diag}(A)=pA-p\mathrm{Diag}(A) equals pA¯p\overline{A} as well.

  (c)

    Observe that Diag(A¯B)\mathrm{Diag}\big{(}\overline{A}B\big{)} equals Diag(AB)Diag(Diag(A)B)\mathrm{Diag}\big{(}AB\big{)}-\mathrm{Diag}\big{(}\mathrm{Diag}(A)B\big{)}. As Diag(Diag(A)B)=Diag(A)Diag(B)=Diag(ADiag(B))\mathrm{Diag}\big{(}\mathrm{Diag}(A)B\big{)}=\mathrm{Diag}(A)\mathrm{Diag}(B)=\mathrm{Diag}\big{(}A\mathrm{Diag}(B)\big{)}, the claim follows.

With these basic properties at hand, higher moments involving the dropout matrix DD may be computed by carefully accounting for equalities between the involved indices.
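
Both the identities of Lemma 9 and the moment formulas stated in Lemma 10 below lend themselves to a quick numerical sanity check; a minimal sketch (exact checks for Lemma 9, Monte Carlo for the first two parts of Lemma 10):

import numpy as np

rng = np.random.default_rng(9)
d, p, reps = 3, 0.6, 100_000

def diag(M):
    return np.diag(np.diag(M))

def bar(M):                      # the overline operation
    return M - diag(M)

def sub_p(M):                    # the subscript-p operation
    return p * M + (1 - p) * diag(M)

A = rng.standard_normal((d, d))
B = rng.standard_normal((d, d))
F = np.diag(rng.standard_normal(d))

# Lemma 9: exact identities
assert np.allclose(bar(A @ F), bar(A) @ F)
assert np.allclose(sub_p(bar(A)), p * bar(A))
assert np.allclose(bar(sub_p(A)), p * bar(A))
assert np.allclose(diag(bar(A) @ B), diag(A @ bar(B)))

# Lemma 10 (a) and (b): Monte Carlo average over dropout matrices D
DAD = np.zeros((d, d))
DADBD = np.zeros((d, d))
for _ in range(reps):
    D = np.diag(rng.binomial(1, p, size=d).astype(float))
    DAD += D @ A @ D
    DADBD += D @ A @ D @ B @ D
print(np.max(np.abs(DAD / reps - p * sub_p(A))))                      # Monte Carlo error, ~1e-2
print(np.max(np.abs(DADBD / reps - p * sub_p(A) @ sub_p(B)
                    - p ** 2 * (1 - p) * diag(bar(A) @ B))))          # Monte Carlo error, ~1e-2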

Lemma 10.

Given arbitrary matrices AA, BB, and CC, the following hold:

  (a)

    𝔼[DAD]=pAp\mathbb{E}\big{[}DAD\big{]}=pA_{p}

  (b)

    𝔼[DADBD]=pApBp+p2(1p)Diag(A¯B)\mathbb{E}\big{[}DADBD\big{]}=pA_{p}B_{p}+p^{2}(1-p)\mathrm{Diag}(\overline{A}B)

  (c)

    𝔼[DADBDCD]=pApBpCp+p2(1p)(Diag(A¯BpC¯)+ApDiag(B¯C)+Diag(AB¯)Cp+(1p)AB¯C)\mathbb{E}\big{[}DADBDCD\big{]}=pA_{p}B_{p}C_{p}+p^{2}(1-p)\Big{(}\mathrm{Diag}\big{(}\overline{A}B_{p}\overline{C}\big{)}+A_{p}\mathrm{Diag}(\overline{B}C)+\mathrm{Diag}(A\overline{B})C_{p}+(1-p)A\odot\overline{B}^{\top}\odot C\Big{)}

Proof

  (a)

    Recall that D=D2D=D^{2} and hence 𝔼[D]=𝔼[D2]=pI\mathbb{E}[D]=\mathbb{E}[D^{2}]=pI, meaning 𝔼[DDiag(A)D]=Diag(A)𝔼[D]=pDiag(A)\mathbb{E}\big{[}D\mathrm{Diag}(A)D\big{]}=\mathrm{Diag}(A)\mathbb{E}[D]=p\mathrm{Diag}(A). On the other hand, (DA¯D)ij=DiiDjjAij𝟙{ij}(D\overline{A}D)_{ij}=D_{ii}D_{jj}A_{ij}\mathbbm{1}_{\{i\neq j\}} implies 𝔼[DA¯D]=p2A¯\mathbb{E}\big{[}D\overline{A}D\big{]}=p^{2}\overline{A} due to independence of DiiD_{ii} and DjjD_{jj}. Combining both identities,

    𝔼[DAD]\displaystyle\mathbb{E}\big{[}DAD\big{]} =𝔼[DA¯D+DDiag(A)D]\displaystyle=\mathbb{E}\Big{[}D\overline{A}D+D\mathrm{Diag}(A)D\Big{]}
    =p2A¯+pDiag(A)\displaystyle=p^{2}\overline{A}+p\mathrm{Diag}(A)
    =p(pApDiag(A)+Diag(A))=pAp.\displaystyle=p\Big{(}pA-p\mathrm{Diag}(A)+\mathrm{Diag}(A)\Big{)}=pA_{p}.
  (b)

    First, note that D=D2D=D^{2} and commutativity of diagonal matrices imply

    DADBD=DA¯DBD+DDiag(A)DBD=DA¯DB¯D+Diag(A)DBD+DDiag(A¯DB)D=DA¯DB¯D+Diag(A)DBD+Diag(A¯DBD).\begin{split}DADBD&=D\overline{A}DBD+D\mathrm{Diag}(A)DBD\\ &=D\overline{\overline{A}DB}D+\mathrm{Diag}(A)DBD+D\mathrm{Diag}\big{(}\overline{A}DB\big{)}D\\ &=D\overline{\overline{A}DB}D+\mathrm{Diag}(A)DBD+\mathrm{Diag}(\overline{A}DBD).\end{split} (59)

    Applying Lemma 9(a) twice, DA¯DB¯D=DA¯DBD¯D\overline{\overline{A}DB}D=\overline{D\overline{A}DBD} has no non-zero diagonal entries. Moreover, taking i,j{1,,d}i,j\in\{1,\ldots,d\} distinct,

    (DA¯DB¯D)ij\displaystyle\Big{(}D\overline{\overline{A}DB}D\Big{)}_{ij} =DiiDjj(A¯DB¯)ij=DiiDjjk=1dAik𝟙{ik}DkkBkj,\displaystyle=D_{ii}D_{jj}\Big{(}\overline{\overline{A}DB}\Big{)}_{ij}=D_{ii}D_{jj}\sum_{k=1}^{d}A_{ik}\mathbbm{1}_{\{i\neq k\}}D_{kk}B_{kj},

    so that both iji\neq j and iki\neq k. Therefore,

    𝔼[DA¯DB¯D]\displaystyle\mathbb{E}\Big{[}D\overline{\overline{A}DB}D\Big{]} =𝔼[D]𝔼[A¯DB¯D]\displaystyle=\mathbb{E}[D]\mathbb{E}\Big{[}\overline{\overline{A}DB}D\Big{]}
    =p𝔼[A¯DBD]pDiag(A¯𝔼[DBD])\displaystyle=p\mathbb{E}\big{[}\overline{A}DBD\big{]}-p\mathrm{Diag}\big{(}\overline{A}\mathbb{E}[DBD]\big{)}
    =pA𝔼[DBD]pDiag(A)𝔼[DBD]pDiag(A¯𝔼[DBD]).\displaystyle=pA\mathbb{E}[DBD]-p\mathrm{Diag}(A)\mathbb{E}[DBD]-p\mathrm{Diag}\big{(}\overline{A}\mathbb{E}[DBD]\big{)}.

    Reinserting this expression into the expectation of (59) and applying Part (a) of the lemma now results in the claimed identity

    𝔼[DADBD]\displaystyle\mathbb{E}\big{[}DADBD\big{]} =pA𝔼[DBD]+(1p)Diag(A)𝔼[DBD]+(1p)Diag(A¯𝔼[DBD])\displaystyle=pA\mathbb{E}[DBD]+(1-p)\mathrm{Diag}(A)\mathbb{E}[DBD]+(1-p)\mathrm{Diag}\big{(}\overline{A}\mathbb{E}[DBD]\big{)}
    =p(pA+(1p)Diag(A))Bp+p(1p)Diag(A¯Bp)\displaystyle=p\big{(}pA+(1-p)\mathrm{Diag}(A)\big{)}B_{p}+p(1-p)\mathrm{Diag}\big{(}\overline{A}B_{p}\big{)}
    =pApBp+p(1p)Diag(A¯Bp)\displaystyle=pA_{p}B_{p}+p(1-p)\mathrm{Diag}\big{(}\overline{A}B_{p}\big{)}
    =pApBp+p2(1p)Diag(A¯B),\displaystyle=pA_{p}B_{p}+p^{2}(1-p)\mathrm{Diag}\big{(}\overline{A}B\big{)},

    where Diag(A¯Diag(B))=0\mathrm{Diag}(\overline{A}\mathrm{Diag}(B))=0 by Lemma 9(a).

  (c)

    Following a similar strategy as in Part (b), observe that

    DADBDCD=DADB¯DCD+DADDiag(B)DCD=DADB¯DC¯D+DADDiag(B)CD+DADB¯DDiag(C)D=DADB¯¯DC¯D+DADDiag(B)CD+DADB¯Diag(C)D+DDiag(ADB¯)C¯D.\begin{split}DADBDCD&=DAD\overline{B}DCD+DAD\mathrm{Diag}(B)DCD\\ &=DAD\overline{B}D\overline{C}D+DAD\mathrm{Diag}(B)CD+DAD\overline{B}D\mathrm{Diag}(C)D\\ &=D\overline{AD\overline{B}}D\overline{C}D+DAD\mathrm{Diag}(B)CD+DAD\overline{B}\mathrm{Diag}(C)D\\ &\qquad+D\mathrm{Diag}(AD\overline{B})\overline{C}D.\end{split} (60)

    By construction of the latter matrix,

    (DADB¯¯DC¯D)ij\displaystyle\big{(}D\overline{AD\overline{B}}D\overline{C}D\big{)}_{ij} =DiiDjjk=1d(ADB¯¯)ik(DC¯)kj\displaystyle=D_{ii}D_{jj}\sum_{k=1}^{d}\big{(}\overline{AD\overline{B}}\big{)}_{ik}\big{(}D\overline{C}\big{)}_{kj}
    =DiiDjjk=1d𝟙{kj}DkkCkj=1d𝟙{ik}AiD𝟙{k}Bk,\displaystyle=D_{ii}D_{jj}\sum_{k=1}^{d}\mathbbm{1}_{\{k\neq j\}}D_{kk}C_{kj}\sum_{\ell=1}^{d}\mathbbm{1}_{\{i\neq k\}}A_{i\ell}D_{\ell\ell}\mathbbm{1}_{\{\ell\neq k\}}B_{\ell k},

    meaning kk is always distinct from the other indices. The kk index corresponds to the third DD-matrix from the left, so this proves 𝔼[DADB¯¯DC¯D]=p𝔼[DADB¯¯C¯D]\mathbb{E}\big{[}D\overline{AD\overline{B}}D\overline{C}D\big{]}=p\mathbb{E}\big{[}D\overline{AD\overline{B}}\,\overline{C}D\big{]}. Reversing the overlines of the latter expression in order, note that

    DADB¯¯C¯D\displaystyle D\overline{AD\overline{B}}\,\overline{C}D =DADBCDDDiag(ADB¯)C¯DDADB¯Diag(C)D\displaystyle=DADBCD-D\mathrm{Diag}(AD\overline{B})\overline{C}D-DAD\overline{B}\mathrm{Diag}(C)D
    DADDiag(B)CD.\displaystyle\qquad-DAD\mathrm{Diag}(B)CD.

    Note that the subtracted terms match those added to DADB¯¯DC¯DD\overline{AD\overline{B}}D\overline{C}D in (60) exactly. In turn, these identities prove

    𝔼[DADBDCD]\displaystyle\mathbb{E}\big{[}DADBDCD\big{]}
    =\displaystyle=\ 𝔼[DADB¯¯DC¯D+DADDiag(B)CD+DADB¯Diag(C)D+DDiag(ADB¯)C¯D]\displaystyle\mathbb{E}\Big{[}D\overline{AD\overline{B}}D\overline{C}D+DAD\mathrm{Diag}(B)CD+DAD\overline{B}\mathrm{Diag}(C)D+D\mathrm{Diag}(AD\overline{B})\overline{C}D\Big{]}
    =\displaystyle=\ p𝔼[DADB¯¯C¯D]+𝔼[DADDiag(B)CD+DADB¯Diag(C)D+DDiag(ADB¯)C¯D]\displaystyle p\mathbb{E}\Big{[}D\overline{AD\overline{B}}\,\overline{C}D\Big{]}+\mathbb{E}\Big{[}DAD\mathrm{Diag}(B)CD+DAD\overline{B}\mathrm{Diag}(C)D+D\mathrm{Diag}(AD\overline{B})\overline{C}D\Big{]}
    =\displaystyle=\ p𝔼[DADBCD]+(1p)𝔼[DADDiag(B)CD]\displaystyle p\mathbb{E}\Big{[}DADBCD\Big{]}+(1-p)\mathbb{E}\Big{[}DAD\mathrm{Diag}(B)CD\Big{]}
    +(1p)𝔼[DADB¯Diag(C)D]+(1p)𝔼[DDiag(ADB¯)C¯D]\displaystyle\quad+(1-p)\mathbb{E}\Big{[}DAD\overline{B}\mathrm{Diag}(C)D\Big{]}+(1-p)\mathbb{E}\Big{[}D\mathrm{Diag}(AD\overline{B})\overline{C}D\Big{]}
    =\displaystyle=\ 𝔼[DADBpCD]+(1p)(𝔼[DADB¯Diag(C)D]+𝔼[DDiag(ADB¯)C¯D]).\displaystyle\mathbb{E}\big{[}DADB_{p}CD\big{]}+(1-p)\Big{(}\mathbb{E}\big{[}DAD\overline{B}\mathrm{Diag}(C)D\big{]}+\mathbb{E}\big{[}D\mathrm{Diag}(AD\overline{B})\overline{C}D\big{]}\Big{)}. (61)

    The first and second terms in the last equality may be computed via Part (b) of the lemma, whereas the third term remains to be treated.

    By definition, the diagonal entries of B¯\overline{B} and C¯\overline{C} are all zero, so (DDiag(ADB¯)C¯D)ii\big{(}D\mathrm{Diag}(AD\overline{B})\overline{C}D\big{)}_{ii} equals 0 for all i{1,,d}i\in\{1,\ldots,d\}. Moreover, taking iji\neq j implies

    (DDiag(ADB¯)C¯D)ij\displaystyle\big{(}D\mathrm{Diag}(AD\overline{B})\overline{C}D\big{)}_{ij} =Dii(ADB¯)iiCijDjj\displaystyle=D_{ii}\big{(}AD\overline{B}\big{)}_{ii}C_{ij}D_{jj}
    =DiiCijDjjk=1dAikDkkBki𝟙{ki}.\displaystyle=D_{ii}C_{ij}D_{jj}\sum_{k=1}^{d}A_{ik}D_{kk}B_{ki}\mathbbm{1}_{\{k\neq i\}}.

    On the set {ij}{ik}\{i\neq j\}\cap\{i\neq k\}, the entry DiiD_{ii} is independent of DjjD_{jj} and DkkD_{kk}. Consequently,

    𝔼[(DDiag(ADB¯)C¯D)ij]\displaystyle\mathbb{E}\Big{[}\big{(}D\mathrm{Diag}(AD\overline{B})\overline{C}D\big{)}_{ij}\Big{]} =𝔼[Dii]k=1dAikB¯kiCij𝔼[DjjDkk]\displaystyle=\mathbb{E}[D_{ii}]\sum_{k=1}^{d}A_{ik}\overline{B}_{ki}C_{ij}\mathbb{E}[D_{jj}D_{kk}]
    =k=1dAikB¯kiCij(p3+p2(1p)𝟙{k=j})\displaystyle=\sum_{k=1}^{d}A_{ik}\overline{B}_{ki}C_{ij}\big{(}p^{3}+p^{2}(1-p)\mathbbm{1}_{\{k=j\}}\big{)}
    =p3k=1dAikB¯kiCij+p2(1p)AijB¯jiCij.\displaystyle=p^{3}\sum_{k=1}^{d}A_{ik}\overline{B}_{ki}C_{ij}+p^{2}(1-p)A_{ij}\overline{B}_{ji}C_{ij}.

    In matrix form, the previous equation reads

    𝔼[DDiag(ADB¯)C¯D]=p3Diag(AB¯)C¯+p2(1p)AB¯C\displaystyle\mathbb{E}\Big{[}D\mathrm{Diag}\big{(}AD\overline{B}\big{)}\overline{C}D\Big{]}=p^{3}\mathrm{Diag}(A\overline{B})\overline{C}+p^{2}(1-p)A\odot\overline{B}^{\top}\odot C

    where \odot denotes the Hadamard product.

    Reinserting the computed expressions into (61), as well as noting that (B¯Diag(C))p=B¯pDiag(C)\big{(}\overline{B}\mathrm{Diag}(C)\big{)}_{p}=\overline{B}_{p}\mathrm{Diag}(C) and Diag(A¯B¯Diag(C))=Diag(A¯B¯)Diag(C)\mathrm{Diag}\big{(}\overline{A}\,\overline{B}\mathrm{Diag}(C)\big{)}=\mathrm{Diag}(\overline{A}\,\overline{B})\mathrm{Diag}(C) yields

    𝔼[DADBDCD]=pAp(BpC)p+p2(1p)Diag(A¯BpC)+p(1p)ApB¯pDiag(C)+p2(1p)2Diag(A¯B¯)Diag(C)+p3(1p)Diag(AB¯)C¯+p2(1p)2AB¯C.\begin{split}\mathbb{E}\big{[}DADBDCD\big{]}&=pA_{p}\big{(}B_{p}C\big{)}_{p}+p^{2}(1-p)\mathrm{Diag}\big{(}\overline{A}B_{p}C\big{)}\\ &\qquad+p(1-p)A_{p}\overline{B}_{p}\mathrm{Diag}(C)+p^{2}(1-p)^{2}\mathrm{Diag}(\overline{A}\,\overline{B})\mathrm{Diag}(C)\\ &\qquad+p^{3}(1-p)\mathrm{Diag}(A\overline{B})\overline{C}+p^{2}(1-p)^{2}A\odot\overline{B}^{\top}\odot C.\end{split} (62)

    Next, using the identity B¯p=BpDiag(B)\overline{B}_{p}=B_{p}-\mathrm{Diag}(B) of Lemma 9(b), we combine the first and third terms of the latter display into

    pAp(BpC)p+p(1p)ApB¯pDiag(C)\displaystyle pA_{p}\big{(}B_{p}C\big{)}_{p}+p(1-p)A_{p}\overline{B}_{p}\mathrm{Diag}(C) =pApBp(pC+(1p)Diag(C))\displaystyle=pA_{p}B_{p}\big{(}pC+(1-p)\mathrm{Diag}(C)\big{)}
    +p(1p)ApDiag(BpC)\displaystyle\qquad+p(1-p)A_{p}\mathrm{Diag}\big{(}B_{p}C\big{)}
    p(1p)ApDiag(B)Diag(C)\displaystyle\qquad-p(1-p)A_{p}\mathrm{Diag}(B)\mathrm{Diag}(C)
    =pApBpCp+p2(1p)ApDiag(B¯C).\displaystyle=pA_{p}B_{p}C_{p}+p^{2}(1-p)A_{p}\mathrm{Diag}\big{(}\overline{B}C\big{)}.

    Regarding the second term of (62), observe that

    Diag(A¯BpC)\displaystyle\mathrm{Diag}\big{(}\overline{A}B_{p}C\big{)} =Diag(A¯BpC¯)+Diag(A¯Bp)Diag(C)\displaystyle=\mathrm{Diag}\big{(}\overline{A}B_{p}\overline{C}\big{)}+\mathrm{Diag}\big{(}\overline{A}B_{p}\big{)}\mathrm{Diag}(C)
    =Diag(A¯BpC¯)+pDiag(A¯B)Diag(C),\displaystyle=\mathrm{Diag}\big{(}\overline{A}B_{p}\overline{C}\big{)}+p\mathrm{Diag}(\overline{A}B)\mathrm{Diag}(C),

    where the second equality follows from Diag(A¯Diag(B))=0\mathrm{Diag}\big{(}\overline{A}\mathrm{Diag}(B)\big{)}=0. Lastly, Diag(A¯B¯)=Diag(A¯B)=Diag(AB¯)\mathrm{Diag}(\overline{A}\,\overline{B})=\mathrm{Diag}(\overline{A}B)=\mathrm{Diag}(A\overline{B}) by Lemma 9(c), so the fourth and fifth term of (62) combine into

    pDiag(AB¯)C¯+(1p)Diag(A¯B¯)Diag(C)\displaystyle p\mathrm{Diag}(A\overline{B})\overline{C}+(1-p)\mathrm{Diag}(\overline{A}\,\overline{B})\mathrm{Diag}(C)
    =\displaystyle=\ Diag(AB¯)(pC¯+Diag(C))pDiag(AB¯)Diag(C)\displaystyle\mathrm{Diag}(A\overline{B})\big{(}p\overline{C}+\mathrm{Diag}(C)\big{)}-p\mathrm{Diag}(A\overline{B})\mathrm{Diag}(C)
    =\displaystyle=\ Diag(AB¯)CppDiag(AB¯)Diag(C),\displaystyle\mathrm{Diag}(A\overline{B})C_{p}-p\mathrm{Diag}(A\overline{B})\mathrm{Diag}(C),

    where the common factor p2(1p)p^{2}(1-p) is omitted in the display.

    Using these identities, Equation (62) now turns into

    𝔼[DADBDCD]\displaystyle\mathbb{E}\big{[}DADBDCD\big{]} =pApBpCp+p2(1p)2AB¯C\displaystyle=pA_{p}B_{p}C_{p}+p^{2}(1-p)^{2}A\odot\overline{B}^{\top}\odot C
    +p2(1p)(Diag(A¯BpC)pDiag(AB¯)Diag(C))\displaystyle\qquad+p^{2}(1-p)\Big{(}\mathrm{Diag}\big{(}\overline{A}B_{p}C\big{)}-p\mathrm{Diag}(A\overline{B})\mathrm{Diag}(C)\Big{)}
    +p2(1p)(ApDiag(B¯C)+Diag(AB¯)Cp).\displaystyle\qquad+p^{2}(1-p)\Big{(}A_{p}\mathrm{Diag}\big{(}\overline{B}C\big{)}+\mathrm{Diag}(A\overline{B})C_{p}\Big{)}.

    Noting that Diag(A¯BpC)pDiag(AB¯)Diag(C)\mathrm{Diag}\big{(}\overline{A}B_{p}C\big{)}-p\mathrm{Diag}(A\overline{B})\mathrm{Diag}(C) equals Diag(A¯BpC¯)\mathrm{Diag}\big{(}\overline{A}B_{p}\overline{C}\big{)} finishes the proof.

In principle, any computation involving higher moments of DD may be carried out with the proof strategy of Lemma 10. The next lemma provides a particular covariance matrix that is needed in Section B.
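Before turning to that lemma, we note for illustration only that the third-moment identity of Lemma 10(c) can be checked by simulation. The following NumPy sketch is our own and not part of the formal argument; the dimension d = 4, the dropout probability p = 0.7, the sample size, and the Gaussian test matrices are arbitrary choices. It compares a Monte Carlo estimate of E[DADBDCD] with the closed form derived at the end of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, n = 4, 0.7, 200_000

def diag_part(M):            # Diag(M): diagonal of M, zeros elsewhere
    return np.diag(np.diag(M))

def overline(M):             # M-bar = M - Diag(M)
    return M - diag_part(M)

def sub_p(M):                # M_p = p*M + (1 - p)*Diag(M)
    return p * M + (1 - p) * diag_part(M)

A, B, C = (rng.standard_normal((d, d)) for _ in range(3))

# Monte Carlo estimate of E[D A D B D C D], with D the diagonal matrix of
# independent Bernoulli(p) dropout variables, resampled for every summand
masks = rng.binomial(1, p, size=(n, d)).astype(float)
DAD = masks[:, :, None] * A * masks[:, None, :]          # D A D per sample
DCD = masks[:, :, None] * C * masks[:, None, :]          # D C D per sample
mc = np.einsum('nij,jk,nkl->il', DAD, B, DCD) / n

# Closed form from the end of the proof of Lemma 10(c)
closed = (
    p * sub_p(A) @ sub_p(B) @ sub_p(C)
    + p**2 * (1 - p) * diag_part(overline(A) @ sub_p(B) @ overline(C))
    + p**2 * (1 - p) * (sub_p(A) @ diag_part(overline(B) @ C)
                        + diag_part(A @ overline(B)) @ sub_p(C))
    + p**2 * (1 - p)**2 * A * overline(B).T * C          # Hadamard products
)

print(np.abs(mc - closed).max())   # Monte Carlo error of order n**(-1/2)
```

The discrepancy shrinks at the usual Monte Carlo rate as the number of sampled dropout matrices grows, while the closed form stays fixed.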

Lemma 11.

Given a symmetric matrix AA, as well as vectors 𝐮\mathbf{u} and 𝐯\mathbf{v},

Cov(D𝐮+DA¯(DpI)𝐯)p(1p)\displaystyle\dfrac{\mathrm{Cov}\big{(}D\mathbf{u}+D\overline{A}(D-pI)\mathbf{v}\big{)}}{p(1-p)} =Diag(𝐮𝐮)+pA¯Diag(𝐯𝐮)+pDiag(𝐮𝐯)A¯\displaystyle=\mathrm{Diag}(\mathbf{u}\mathbf{u}^{\top})+p\overline{A}\mathrm{Diag}(\mathbf{v}\mathbf{u}^{\top})+p\mathrm{Diag}(\mathbf{u}\mathbf{v}^{\top})\overline{A}
+p(A¯Diag(𝐯𝐯)A¯)p+p(1p)A𝐯𝐯¯A.\displaystyle\qquad+p\big{(}\overline{A}\mathrm{Diag}(\mathbf{v}\mathbf{v}^{\top})\overline{A}\big{)}_{p}+p(1-p)A\odot\overline{\mathbf{v}\mathbf{v}^{\top}}\odot A.

Proof  The covariance of the sum is given by

Cov(D𝐮+DA¯(DpI)𝐯)\displaystyle\mathrm{Cov}\big{(}D\mathbf{u}+D\overline{A}(D-pI)\mathbf{v}\big{)} =Cov(D𝐮)+Cov(DA¯(DpI)𝐯)+Cov(D𝐮,DA¯(DpI)𝐯)\displaystyle=\mathrm{Cov}\big{(}D\mathbf{u}\big{)}+\mathrm{Cov}\big{(}D\overline{A}(D-pI)\mathbf{v}\big{)}+\mathrm{Cov}\big{(}D\mathbf{u},D\overline{A}(D-pI)\mathbf{v}\big{)}
+Cov(DA¯(DpI)𝐯,D𝐮)\displaystyle\qquad+\mathrm{Cov}\big{(}D\overline{A}(D-pI)\mathbf{v},D\mathbf{u}\big{)}
=:T1+T2+T3+T4\displaystyle=:T_{1}+T_{2}+T_{3}+T_{4} (63)

where each of the latter terms will be treated separately. To this end, set B𝐮:=𝐮𝐮B_{\mathbf{u}}:=\mathbf{u}\mathbf{u}^{\top}, B𝐯:=𝐯𝐯B_{\mathbf{v}}:=\mathbf{v}\mathbf{v}^{\top}, B𝐮,𝐯:=𝐮𝐯B_{\mathbf{u},\mathbf{v}}:=\mathbf{u}\mathbf{v}^{\top} and B𝐯,𝐮:=𝐯𝐮B_{\mathbf{v},\mathbf{u}}:=\mathbf{v}\mathbf{u}^{\top}.

First, recall that 𝔼[DB𝐮D]=p(B𝐮)p\mathbb{E}\big{[}DB_{\mathbf{u}}D\big{]}=p(B_{\mathbf{u}})_{p} by Lemma 10(a) and 𝔼[D]=pI\mathbb{E}[D]=pI, so the definition of covariance implies

T1\displaystyle T_{1} =𝔼[DB𝐮D]𝔼[D]B𝐮𝔼[D]=p2B𝐮+p(1p)Diag(B𝐮)p2B𝐮\displaystyle=\mathbb{E}\big{[}DB_{\mathbf{u}}D\big{]}-\mathbb{E}[D]B_{\mathbf{u}}\mathbb{E}[D]=p^{2}B_{\mathbf{u}}+p(1-p)\mathrm{Diag}(B_{\mathbf{u}})-p^{2}B_{\mathbf{u}}
=p(1p)Diag(B𝐮).\displaystyle=p(1-p)\mathrm{Diag}(B_{\mathbf{u}}). (64)

Moving on to T4T_{4}, observe that

T4\displaystyle T_{4} =Cov(DA¯(DpI)𝐯,D𝐮)=𝔼[DA¯(DpI)B𝐯,𝐮D]𝔼[DA¯(DpI)𝐯]𝔼[D𝐮].\displaystyle=\mathrm{Cov}\big{(}D\overline{A}(D-pI)\mathbf{v},D\mathbf{u}\big{)}=\mathbb{E}\big{[}D\overline{A}(D-pI)B_{\mathbf{v},\mathbf{u}}D\big{]}-\mathbb{E}\big{[}D\overline{A}(D-pI)\mathbf{v}\big{]}\mathbb{E}\big{[}D\mathbf{u}\big{]}^{\top}.

By the same argument as in (28),

𝔼[DA¯(DpI)𝐯]\displaystyle\mathbb{E}\big{[}D\overline{A}(D-pI)\mathbf{v}\big{]} =𝔼[DA¯D]𝐯p𝔼[DA¯]𝐯=p2A¯𝐯p2A¯𝐯=0\displaystyle=\mathbb{E}\big{[}D\overline{A}D\big{]}\mathbf{v}-p\mathbb{E}\big{[}D\overline{A}\big{]}\mathbf{v}=p^{2}\overline{A}\mathbf{v}-p^{2}\overline{A}\mathbf{v}=0 (65)

so that in turn

T4=𝔼[DA¯DB𝐯,𝐮D]p𝔼[DA¯B𝐯,𝐮D].\displaystyle T_{4}=\mathbb{E}\big{[}D\overline{A}DB_{\mathbf{v},\mathbf{u}}D\big{]}-p\mathbb{E}\big{[}D\overline{A}B_{\mathbf{v},\mathbf{u}}D\big{]}.

Recall A¯p=pA¯\overline{A}_{p}=p\overline{A} from Lemma 9(b). Applying Lemma 10(b) for the first term in the previous display and Lemma 10(a) for the second term now leads to

T4\displaystyle T_{4} =p2A¯(B𝐯,𝐮)p+p2(1p)Diag(A¯B𝐯,𝐮)p2(A¯B𝐯,𝐮)p\displaystyle=p^{2}\overline{A}(B_{\mathbf{v},\mathbf{u}})_{p}+p^{2}(1-p)\mathrm{Diag}\big{(}\overline{A}B_{\mathbf{v},\mathbf{u}}\big{)}-p^{2}\big{(}\overline{A}B_{\mathbf{v},\mathbf{u}}\big{)}_{p}
=p3A¯B𝐯,𝐮+p2(1p)A¯Diag(B𝐯,𝐮)+p2(1p)Diag(A¯B𝐯,𝐮)p3A¯B𝐯,𝐮\displaystyle=p^{3}\overline{A}B_{\mathbf{v},\mathbf{u}}+p^{2}(1-p)\overline{A}\mathrm{Diag}(B_{\mathbf{v},\mathbf{u}})+p^{2}(1-p)\mathrm{Diag}\big{(}\overline{A}B_{\mathbf{v},\mathbf{u}}\big{)}-p^{3}\overline{A}B_{\mathbf{v},\mathbf{u}}
p2(1p)Diag(A¯B𝐯,𝐮)\displaystyle\qquad-p^{2}(1-p)\mathrm{Diag}\big{(}\overline{A}B_{\mathbf{v},\mathbf{u}}\big{)}
=p2(1p)A¯Diag(B𝐯,𝐮).\displaystyle=p^{2}(1-p)\overline{A}\mathrm{Diag}(B_{\mathbf{v},\mathbf{u}}). (66)

Using a completely analogous argument, the reflected term T3T_{3} satisfies

T3=p2(1p)Diag(B𝐮,𝐯)A¯.\displaystyle T_{3}=p^{2}(1-p)\mathrm{Diag}\big{(}B_{\mathbf{u},\mathbf{v}}\big{)}\overline{A}. (67)

The last term T2T_{2} necessitates another decomposition into four sub-problems. First, recall from (65) that 𝔼[DA¯(DpI)𝐯]\mathbb{E}\big{[}D\overline{A}(D-pI)\mathbf{v}\big{]} vanishes, which leads to

T2\displaystyle T_{2} =Cov(DA¯(DpI)𝐯)\displaystyle=\mathrm{Cov}\big{(}D\overline{A}(D-pI)\mathbf{v}\big{)}
=𝔼[DA¯(DpI)B𝐯(DpI)A¯D]\displaystyle=\mathbb{E}\big{[}D\overline{A}(D-pI)B_{\mathbf{v}}(D-pI)\overline{A}D\big{]}
=𝔼[DA¯DB𝐯DA¯D]p𝔼[DA¯B𝐯(DpI)A¯D]p𝔼[DA¯(DpI)B𝐯A¯D]\displaystyle=\mathbb{E}\big{[}D\overline{A}DB_{\mathbf{v}}D\overline{A}D\big{]}-p\mathbb{E}\big{[}D\overline{A}B_{\mathbf{v}}(D-pI)\overline{A}D\big{]}-p\mathbb{E}\big{[}D\overline{A}(D-pI)B_{\mathbf{v}}\overline{A}D\big{]}
p2𝔼[DA¯B𝐯A¯D]\displaystyle\qquad-p^{2}\mathbb{E}\big{[}D\overline{A}B_{\mathbf{v}}\overline{A}D\big{]}
=:T2,1T2,2T2,3T2,4\displaystyle=:T_{2,1}-T_{2,2}-T_{2,3}-T_{2,4} (68)

where the last term is negative since pB𝐯(DpI)p(DpI)B𝐯=pB𝐯DpDB𝐯+2p2B𝐯-pB_{\mathbf{v}}(D-pI)-p(D-pI)B_{\mathbf{v}}=-pB_{\mathbf{v}}D-pDB_{\mathbf{v}}+2p^{2}B_{\mathbf{v}}, so an extra p2B𝐯p^{2}B_{\mathbf{v}} must be subtracted to recover (DpI)B𝐯(DpI)(D-pI)B_{\mathbf{v}}(D-pI). Recall once more the identity A¯p=pA¯\overline{A}_{p}=p\overline{A} from Lemma 9(b) and apply Lemma 10(c) to rewrite T2,1T_{2,1} as

T2,1\displaystyle T_{2,1} =𝔼[DA¯DB𝐯DA¯D]\displaystyle=\mathbb{E}\big{[}D\overline{A}DB_{\mathbf{v}}D\overline{A}D\big{]}
=p3A¯(B𝐯)pA¯+p2(1p)(Diag(A¯(B𝐯)pA¯)+pA¯Diag(B𝐯¯A)+pDiag(AB𝐯¯)A¯)\displaystyle=p^{3}\overline{A}(B_{\mathbf{v}})_{p}\overline{A}+p^{2}(1-p)\Big{(}\mathrm{Diag}\big{(}\overline{A}(B_{\mathbf{v}})_{p}\overline{A}\big{)}+p\overline{A}\mathrm{Diag}\big{(}\overline{B_{\mathbf{v}}}A\big{)}+p\mathrm{Diag}\big{(}A\overline{B_{\mathbf{v}}}\big{)}\overline{A}\Big{)}
+p2(1p)2AB𝐯¯A.\displaystyle\qquad+p^{2}(1-p)^{2}A\odot\overline{B_{\mathbf{v}}}\odot A.

As for T2,2T_{2,2}, start by noting that

T2,2\displaystyle T_{2,2} =p𝔼[DA¯B𝐯(DpI)A¯D]\displaystyle=p\mathbb{E}\big{[}D\overline{A}B_{\mathbf{v}}(D-pI)\overline{A}D\big{]}
=p𝔼[DA¯B𝐯DA¯D]p2𝔼[DA¯B𝐯A¯D]\displaystyle=p\mathbb{E}\big{[}D\overline{A}B_{\mathbf{v}}D\overline{A}D\big{]}-p^{2}\mathbb{E}\big{[}D\overline{A}B_{\mathbf{v}}\overline{A}D\big{]}
=p3(A¯B𝐯)pA¯+p3(1p)Diag(A¯B𝐯¯A¯)p3(A¯B𝐯A¯)p,\displaystyle=p^{3}\big{(}\overline{A}B_{\mathbf{v}}\big{)}_{p}\overline{A}+p^{3}(1-p)\mathrm{Diag}\big{(}\overline{\overline{A}B_{\mathbf{v}}}\overline{A}\big{)}-p^{3}\big{(}\overline{A}B_{\mathbf{v}}\overline{A}\big{)}_{p},

where Lemma 10(a) computes the second expectation and Lemma 10(b) the first expectation. To progress, note first the identities

(A¯B𝐯)pA¯(A¯B𝐯A¯)p\displaystyle\big{(}\overline{A}B_{\mathbf{v}}\big{)}_{p}\overline{A}-\big{(}\overline{A}B_{\mathbf{v}}\overline{A}\big{)}_{p} =(1p)(Diag(A¯B𝐯)A¯Diag(A¯B𝐯A¯))\displaystyle=(1-p)\Big{(}\mathrm{Diag}\big{(}\overline{A}B_{\mathbf{v}}\big{)}\overline{A}-\mathrm{Diag}\big{(}\overline{A}B_{\mathbf{v}}\overline{A}\big{)}\Big{)}
Diag(A¯B𝐯¯A¯)\displaystyle\mathrm{Diag}\big{(}\overline{\overline{A}B_{\mathbf{v}}}\overline{A}\big{)} =Diag(A¯B𝐯A¯)\displaystyle=\mathrm{Diag}\big{(}\overline{A}B_{\mathbf{v}}\overline{A}\big{)}

so that

T2,2\displaystyle T_{2,2} =p3(1p)(Diag(A¯B𝐯)A¯Diag(A¯B𝐯A¯)+Diag(A¯B𝐯A¯))=p3(1p)Diag(A¯B𝐯)A¯.\displaystyle=p^{3}(1-p)\Big{(}\mathrm{Diag}\big{(}\overline{A}B_{\mathbf{v}}\big{)}\overline{A}-\mathrm{Diag}\big{(}\overline{A}B_{\mathbf{v}}\overline{A}\big{)}+\mathrm{Diag}\big{(}\overline{A}B_{\mathbf{v}}\overline{A}\big{)}\Big{)}=p^{3}(1-p)\mathrm{Diag}\big{(}\overline{A}B_{\mathbf{v}}\big{)}\overline{A}.

By symmetry, the reflected term T2,3T_{2,3} then also satisfies T2,3=p3(1p)A¯Diag(B𝐯A¯)T_{2,3}=p^{3}(1-p)\overline{A}\mathrm{Diag}\big{(}B_{\mathbf{v}}\overline{A}\big{)}. Lastly, applying Lemma 10(a) to T2,4T_{2,4} results in T2,4=p3(A¯B𝐯A¯)pT_{2,4}=p^{3}\big{(}\overline{A}B_{\mathbf{v}}\overline{A}\big{)}_{p}. To finish the treatment of T2T_{2}, inserting the computed expressions for T2,1,T2,2,T2,3T_{2,1},T_{2,2},T_{2,3}, and T2,4T_{2,4} into (68) and combining like terms now leads to

T2\displaystyle T_{2} =p3A¯(B𝐯)pA¯+p2(1p)2AB𝐯¯A\displaystyle=p^{3}\overline{A}(B_{\mathbf{v}})_{p}\overline{A}+p^{2}(1-p)^{2}A\odot\overline{B_{\mathbf{v}}}\odot A
+p2(1p)(Diag(A¯(B𝐯)pA¯)+pA¯Diag(B𝐯¯A)+Diag(AB𝐯¯)pA¯)\displaystyle\qquad+p^{2}(1-p)\Big{(}\mathrm{Diag}\big{(}\overline{A}(B_{\mathbf{v}})_{p}\overline{A}\big{)}+p\overline{A}\mathrm{Diag}\big{(}\overline{B_{\mathbf{v}}}A\big{)}+\mathrm{Diag}\big{(}A\overline{B_{\mathbf{v}}}\big{)}p\overline{A}\Big{)}
p3(1p)Diag(A¯B𝐯)A¯p3(1p)A¯Diag(B𝐯A¯)p3(A¯B𝐯A¯)p\displaystyle\qquad-p^{3}(1-p)\mathrm{Diag}\big{(}\overline{A}B_{\mathbf{v}}\big{)}\overline{A}-p^{3}(1-p)\overline{A}\mathrm{Diag}\big{(}B_{\mathbf{v}}\overline{A}\big{)}-p^{3}\big{(}\overline{A}B_{\mathbf{v}}\overline{A}\big{)}_{p}
=p3(A¯(B𝐯)pA¯(A¯B𝐯A¯)p)+p2(1p)Diag(A¯(B𝐯)pA¯)+p2(1p)2AB𝐯¯A\displaystyle=p^{3}\Big{(}\overline{A}(B_{\mathbf{v}})_{p}\overline{A}-\big{(}\overline{A}B_{\mathbf{v}}\overline{A}\big{)}_{p}\Big{)}+p^{2}(1-p)\mathrm{Diag}\big{(}\overline{A}(B_{\mathbf{v}})_{p}\overline{A}\big{)}+p^{2}(1-p)^{2}A\odot\overline{B_{\mathbf{v}}}\odot A
=p3(1p)A¯Diag(B𝐯)A¯+p2(1p)2Diag(A¯Diag(B𝐯)A¯)+p2(1p)2AB𝐯¯A\displaystyle=p^{3}(1-p)\overline{A}\mathrm{Diag}(B_{\mathbf{v}})\overline{A}+p^{2}(1-p)^{2}\mathrm{Diag}\big{(}\overline{A}\mathrm{Diag}(B_{\mathbf{v}})\overline{A}\big{)}+p^{2}(1-p)^{2}A\odot\overline{B_{\mathbf{v}}}\odot A
=p2(1p)(A¯Diag(B𝐯)A¯)p+p2(1p)2AB𝐯¯A.\displaystyle=p^{2}(1-p)\big{(}\overline{A}\mathrm{Diag}(B_{\mathbf{v}})\overline{A}\big{)}_{p}+p^{2}(1-p)^{2}A\odot\overline{B_{\mathbf{v}}}\odot A.

To conclude the proof, insert this expression for T2T_{2} into (63), together with T1T_{1} as in (64), T3T_{3} as in (67), and T4T_{4} as in (66) to obtain the desired identity

Cov(D𝐮+DA¯(DpI)𝐯)\displaystyle\mathrm{Cov}\big{(}D\mathbf{u}+D\overline{A}(D-pI)\mathbf{v}\big{)} =p(1p)Diag(B𝐮)+p2(1p)(A¯Diag(B𝐯)A¯)p\displaystyle=p(1-p)\mathrm{Diag}(B_{\mathbf{u}})+p^{2}(1-p)\big{(}\overline{A}\mathrm{Diag}(B_{\mathbf{v}})\overline{A}\big{)}_{p}
+p2(1p)2AB𝐯¯A\displaystyle\qquad+p^{2}(1-p)^{2}A\odot\overline{B_{\mathbf{v}}}\odot A
+p2(1p)(Diag(B𝐮,𝐯)A¯+A¯Diag(B𝐯,𝐮)).\displaystyle\qquad+p^{2}(1-p)\big{(}\mathrm{Diag}(B_{\mathbf{u},\mathbf{v}})\overline{A}+\overline{A}\mathrm{Diag}(B_{\mathbf{v},\mathbf{u}})\big{)}.
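As with Lemma 10, the covariance identity just obtained lends itself to a quick numerical check. The sketch below is again our own and purely illustrative (NumPy assumed; the dimension, dropout probability, and symmetric test matrix are arbitrary choices): it compares the empirical covariance of Du + DĀ(D − pI)v over many dropout samples with the closed form stated in Lemma 11.

```python
import numpy as np

rng = np.random.default_rng(1)
d, p, n = 4, 0.6, 500_000

def diag_part(M): return np.diag(np.diag(M))
def overline(M):  return M - diag_part(M)
def sub_p(M):     return p * M + (1 - p) * diag_part(M)

S = rng.standard_normal((d, d))
A = S + S.T                                   # Lemma 11 assumes A symmetric
u, v = rng.standard_normal(d), rng.standard_normal(d)

# Samples of X = D u + D A-bar (D - p I) v, one dropout matrix D per row
masks = rng.binomial(1, p, size=(n, d)).astype(float)
w = (masks - p) * v                           # rows: (D - p I) v
X = masks * u + masks * (w @ overline(A).T)   # rows: D u + D A-bar (D - p I) v

emp = np.cov(X, rowvar=False)                 # empirical covariance matrix

# Closed form of Lemma 11
closed = p * (1 - p) * (
    diag_part(np.outer(u, u))
    + p * overline(A) @ diag_part(np.outer(v, u))
    + p * diag_part(np.outer(u, v)) @ overline(A)
    + p * sub_p(overline(A) @ diag_part(np.outer(v, v)) @ overline(A))
    + p * (1 - p) * A * overline(np.outer(v, v)) * A
)

print(np.abs(emp - closed).max())             # small, of order n**(-1/2)
```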

Appendix D Auxiliary Results

Below we collect identities and definitions referenced in other sections.

Neumann series: Let VV denote a real or complex Banach space with norm \lVert\ \cdot\ \rVert. Recall that the operator norm of a linear operator λ\lambda on VV is given by λop=sup𝐯V:𝐯1λ(𝐯)\left\lVert\lambda\right\rVert_{\mathrm{op}}=\sup_{\mathbf{v}\in V:\lVert\mathbf{v}\rVert\leq 1}\left\lVert\lambda(\mathbf{v})\right\rVert.

Lemma 12 ([20], Proposition 5.3.4).

If λ:VV\lambda:V\to V is bounded, linear, and satisfies idλop<1\left\lVert\mathrm{id}-\lambda\right\rVert_{\mathrm{op}}<1, then λ\lambda is invertible and λ1=i=0(idλ)i\lambda^{-1}=\sum_{i=0}^{\infty}\big{(}\mathrm{id}-\lambda\big{)}^{i}.
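As a concrete finite-dimensional illustration of ours (not taken from [20]), one may take V = R^d with the Euclidean norm; the partial sums of the Neumann series then converge geometrically to the matrix inverse, as the following NumPy sketch with an arbitrary test matrix shows.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5

# Construct L with ||I - L|| < 1 by forcing the spectral norm of I - L to 0.5
E = rng.standard_normal((d, d))
E *= 0.5 / np.linalg.norm(E, 2)
L = np.eye(d) - E                              # so that I - L = E, ||E|| = 0.5

# Partial sums of the Neumann series sum_{i >= 0} (I - L)^i
approx, term = np.zeros((d, d)), np.eye(d)
for _ in range(60):
    approx += term
    term = term @ E                            # next power of I - L

print(np.abs(approx - np.linalg.inv(L)).max()) # essentially machine precision
```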

Bounds on singular values: Recall that 2\lVert\ \cdot\ \rVert_{2} denotes the Euclidean norm on d\mathbb{R}^{d} and \lVert\ \cdot\ \rVert the spectral norm on d×d\mathbb{R}^{d\times d}, which is given by the largest singular value σmax()\sigma_{\max}(\cdot). The spectral norm is sub-multiplicative in the sense that ABAB\lVert AB\rVert\leq\lVert A\rVert\lVert B\rVert. The spectral norm of a vector 𝐯d\mathbf{v}\in\mathbb{R}^{d}, viewed as a linear functional on d\mathbb{R}^{d}, is given by 𝐯=𝐯2\left\lVert\mathbf{v}\right\rVert=\left\lVert\mathbf{v}\right\rVert_{2}, proving that

𝐯𝐰𝐯2𝐰2\displaystyle\big{\lVert}\mathbf{v}\mathbf{w}^{\top}\big{\rVert}\leq\lVert\mathbf{v}\rVert_{2}\lVert\mathbf{w}\rVert_{2} (69)

for any vectors 𝐯\mathbf{v} and 𝐰\mathbf{w} of the same length.

Recall the definitions A¯=ADiag(A)\overline{A}=A-\mathrm{Diag}(A) and Ap=pA+(1p)Diag(A)A_{p}=pA+(1-p)\mathrm{Diag}(A) with p(0,1)p\in(0,1).

Lemma 13.

Given d×dd\times d matrices AA and BB, the inequalities Diag(A)A\big{\lVert}\mathrm{Diag}(A)\big{\rVert}\leq\lVert A\rVert, ApA\lVert A_{p}\rVert\leq\lVert A\rVert, and ABAB\lVert A\odot B\rVert\leq\lVert A\rVert\cdot\lVert B\rVert hold. Moreover, if AA is symmetric and positive semi-definite, then also A¯A\big{\lVert}\overline{A}\big{\rVert}\leq\lVert A\rVert.

Proof  For any matrix AA, the maximal singular value σmax(A)\sigma_{\max}(A) can be computed from the variational formulation σmax(A)=max𝐯d{𝟎}A𝐯2/𝐯2,\sigma_{\max}(A)=\max_{\mathbf{v}\in\mathbb{R}^{d}\setminus\{\mathbf{0}\}}\lVert A\mathbf{v}\rVert_{2}/\lVert\mathbf{v}\rVert_{2}, see [24], Theorem 4.2.6.

    Let 𝐞i\mathbf{e}_{i} denote the iith standard basis vector. The variational formulation of the maximal singular value implies Diag(A)2=maxiAii2\lVert\mathrm{Diag}(A)\rVert^{2}=\max_{i}A_{ii}^{2}, which is bounded by maxik=1dAki2=maxi(AA)ii=maxi𝐞iAA𝐞i\max_{i}\sum_{k=1}^{d}A_{ki}^{2}=\max_{i}\big{(}A^{\top}A\big{)}_{ii}=\max_{i}\mathbf{e}_{i}^{\top}A^{\top}A\mathbf{e}_{i}. The latter is further bounded by AA=A2\big{\lVert}A^{\top}A\big{\rVert}=\lVert A\rVert^{2}, proving the first statement. The second inequality follows from the first since

AppA+(1p)Diag(A)pA+(1p)A=A.\lVert A_{p}\rVert\leq p\lVert A\rVert+(1-p)\big{\lVert}\mathrm{Diag}(A)\big{\rVert}\leq p\lVert A\rVert+(1-p)\lVert A\rVert=\lVert A\rVert.

For the inequality concerning the Hadamard product, see Theorem 5.5.7 of [23].

    For the last inequality, note that positive semi-definiteness entails miniDiag(A)ii0\min_{i}\mathrm{Diag}(A)_{ii}\geq 0 as well as 𝐯A𝐯0\mathbf{v}^{\top}A\mathbf{v}\geq 0 for every 𝐯d\mathbf{v}\in\mathbb{R}^{d}. Fixing 𝐯d\mathbf{v}\in\mathbb{R}^{d}, this ensures 𝐯A¯𝐯𝐯A𝐯A𝐯22\mathbf{v}^{\top}\overline{A}\mathbf{v}\leq\mathbf{v}^{\top}A\mathbf{v}\leq\lVert A\rVert\lVert\mathbf{v}\rVert_{2}^{2} and also 𝐯A¯𝐯=𝐯A𝐯𝐯Diag(A)𝐯𝐯Diag(A)𝐯Diag(A)𝐯22A𝐯22\mathbf{v}^{\top}\overline{A}\mathbf{v}=\mathbf{v}^{\top}A\mathbf{v}-\mathbf{v}^{\top}\mathrm{Diag}(A)\mathbf{v}\geq-\mathbf{v}^{\top}\mathrm{Diag}(A)\mathbf{v}\geq-\big{\lVert}\mathrm{Diag}(A)\big{\rVert}\lVert\mathbf{v}\rVert_{2}^{2}\geq-\lVert A\rVert\lVert\mathbf{v}\rVert_{2}^{2}. As A¯\overline{A} is symmetric, A¯=sup𝐯2=1|𝐯A¯𝐯|A\big{\lVert}\overline{A}\big{\rVert}=\sup_{\lVert\mathbf{v}\rVert_{2}=1}\big{|}\mathbf{v}^{\top}\overline{A}\mathbf{v}\big{|}\leq\lVert A\rVert, which completes the proof. ∎
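The four norm bounds of Lemma 13 are also easy to test numerically. The following sketch is our own illustration (NumPy assumed, with arbitrary random matrices and dropout probability); it simply checks each inequality on a single draw, up to a small numerical tolerance.

```python
import numpy as np

rng = np.random.default_rng(3)
d, p, tol = 6, 0.4, 1e-12
spec = lambda M: np.linalg.norm(M, 2)                 # spectral norm
diag_part = lambda M: np.diag(np.diag(M))

A, B = rng.standard_normal((d, d)), rng.standard_normal((d, d))
A_p = p * A + (1 - p) * diag_part(A)

assert spec(diag_part(A)) <= spec(A) + tol            # ||Diag(A)|| <= ||A||
assert spec(A_p) <= spec(A) + tol                     # ||A_p|| <= ||A||
assert spec(A * B) <= spec(A) * spec(B) + tol         # Hadamard product bound

S = rng.standard_normal((d, d))
P = S @ S.T                                           # symmetric positive semi-definite
assert spec(P - diag_part(P)) <= spec(P) + tol        # ||P-bar|| <= ||P||
print("all four inequalities hold on this draw")
```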

References

  • [1] Martín Abadi et al. “TensorFlow: A System for Large-Scale Machine Learning” In 12th USENIX Symposium on Operating Systems Design and Implementation USENIX Association, 2016, pp. 265–283
  • [2] Donald W. K. Andrews “Laws of Large Numbers for Dependent Non-Identically Distributed Random Variables” In Econometric Theory 4.3, 1988, pp. 458–467
  • [3] Raman Arora, Peter Bartlett, Poorya Mianjy and Nathan Srebro “Dropout: Explicit Forms and Capacity Control” In 38th International Conference on Machine Learning Proceedings of Machine Learning Research, 2021, pp. 351–361
  • [4] Jimmy Ba and Brendan Frey “Adaptive dropout for training deep neural networks” In Advances in Neural Information Processing Systems 26 Curran Associates, Inc., 2013, pp. 3084–3092
  • [5] Bubacarr Bah, Holger Rauhut, Ulrich Terstiege and Michael Westdickenberg “Learning deep linear neural networks: Riemannian gradient flows and convergence to global minimizers” In Information and Inference: A Journal of the IMA 11.1, 2022, pp. 307–353
  • [6] Pierre Baldi and Peter J. Sadowski “Understanding Dropout” In Advances in Neural Information Processing Systems 26 Curran Associates, Inc., 2013, pp. 2814–2822
  • [7] Peter L. Bartlett, Philip M. Long and Olivier Bousquet “The Dynamics of Sharpness-Aware Minimization: Bouncing Across Ravines and Drifting Towards Wide Minima” In Journal of Machine Learning Research 24.316, 2023, pp. 1–36
  • [8] Thijs Bos and Johannes Schmidt-Hieber “Convergence guarantees for forward gradient descent in the linear regression model”, arXiv:2309.15001 [math.ST], 2023
  • [9] Jacopo Cavazza et al. “Dropout as a Low-Rank Regularizer for Matrix Factorization” In 21st International Conference on Artificial Intelligence and Statistics Proceedings of Machine Learning Research, 2018, pp. 435–444
  • [10] François Chollet “Keras”, https://keras.io, 2015
  • [11] George Cybenko “Approximation by superpositions of a sigmoidal function” In Mathematics of Control, Signals, and Systems 2.4, 1989, pp. 303–314
  • [12] Steffen Dereich and Sebastian Kassing “Central limit theorems for stochastic gradient descent with averaging for stable manifolds” In Electronic Journal of Probability 28, 2023, pp. 1–48
  • [13] Lutz Dümbgen, Richard J. Samworth and Dominic Schuhmacher “Stochastic search for semiparametric linear regression models” In From Probability to Statistics and Back: High-Dimensional Models and Processes. A Festschrift in Honor of Jon A. Wellner Institute of Mathematical Statistics, 2013, pp. 78–90
  • [14] Bradley Efron and Trevor Hastie “Computer Age Statistical Inference. Algorithms, Evidence, and Data Science” 5, Institute of Mathematical Statistics Monographs Cambridge University Press, 2016
  • [15] Yarin Gal and Zoubin Ghahramani “A Theoretically Grounded Application of Dropout in Recurrent Neural Networks” In Advances in Neural Information Processing Systems 29 Curran Associates, Inc., 2016, pp. 1027–1035
  • [16] Yarin Gal and Zoubin Ghahramani “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning” In 33rd International Conference on International Conference on Machine Learning Proceedings of Machine Learning Research, 2016, pp. 1050–1059
  • [17] Wei Gao and Zhi-Hua Zhou “Dropout Rademacher complexity of deep neural networks” In Science China Information Sciences 59.7, 2016, pp. 2104:1–12
  • [18] Ian Goodfellow, Yoshua Bengio and Aaron Courville “Deep learning”, Adaptive Computation and Machine Learning MIT Press, 2016
  • [19] László Györfi and Harro Walk “On the Averaged Stochastic Approximation for Linear Regression” In SIAM Journal on Control and Optimization 34.1, 1996, pp. 31–61
  • [20] Aleksandr Ya. Helemskii “Lectures and Exercises on Functional Analysis” 233, Translations of Mathematical Monographs American Mathematical Society, 2006
  • [21] David P. Helmbold and Philip M. Long “Surprising properties of dropout in deep networks” In Conference on Learning Theory Proceedings of Machine Learning Research, 2017, pp. 1123–1146
  • [22] Jonathan Hill and Liang Peng “Unified interval estimation for random coefficient autoregressive models” In Journal of Time Series Analysis 35.3, 2014, pp. 282–297
  • [23] Roger A. Horn and Charles R. Johnson “Topics in Matrix Analysis” Cambridge University Press, 1991
  • [24] Roger A. Horn and Charles R. Johnson “Matrix Analysis” Cambridge University Press, 2013
  • [25] Kurt Hornik “Approximation capabilities of multilayer feedforward networks” In Neural Networks 4.2, 1991, pp. 251–257
  • [26] Yangqing Jia et al. “Caffe: Convolutional Architecture for Fast Feature Embedding” In 22nd ACM International Conference on Multimedia Association for Computing Machinery, 2014, pp. 675–678
  • [27] Diederik P. Kingma, Tim Salimans and Max Welling “Variational Dropout and the Local Reparameterization Trick” In Advances in Neural Information Processing Systems 28 Curran Associates, Inc., 2015, pp. 2575–2583
  • [28] Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton “ImageNet Classification with Deep Convolutional Neural Networks” In Advances in Neural Information Processing Systems 25 Curran Associates, Inc., 2012, pp. 1097–1105
  • [29] Moshe Leshno, Vladimir Ya. Lin, Allan Pinkus and Shimon Schocken “Multilayer feedforward networks with a nonpolynomial activation function can approximate any function” In Neural Networks 6.6, 1993, pp. 861–867
  • [30] Oxana A. Manita et al. “Universal Approximation in Dropout Neural Networks” In Journal of Machine Learning Research 23.19, 2022, pp. 1–46
  • [31] David McAllester “A PAC-Bayesian Tutorial with A Dropout Bound”, arXiv:1307.2118 [cs.LG], 2013
  • [32] Poorya Mianjy and Raman Arora “On Dropout and Nuclear Norm Regularization” In 36th International Conference on Machine Learning Proceedings of Machine Learning Research, 2019, pp. 4575–4584
  • [33] Poorya Mianjy and Raman Arora “On Convergence and Generalization of Dropout Training” In Advances in Neural Information Processing Systems 33 Curran Associates, Inc., 2020, pp. 21151–21161
  • [34] Poorya Mianjy, Raman Arora and Rene Vidal “On the Implicit Bias of Dropout” In 35th International Conference on Machine Learning Proceedings of Machine Learning Research, 2018, pp. 3540–3548
  • [35] Reza Moradi, Reza Berangi and Behrouz Minaei “A Survey of Regularization Strategies for Deep Models” In Artificial Intelligence Review 53.6, 2020, pp. 3947–3986
  • [36] Gabin Maxime Nguegnang, Holger Rauhut and Ulrich Terstiege “Convergence of gradient descent for learning linear neural networks”, arXiv:2108.02040 [cs.LG], 2021
  • [37] Des F. Nicholls and Barry G. Quinn “Random Coefficient Autoregressive Models: An Introduction” 11, Lecture Notes in Statistics Springer New York, 1982
  • [38] Adam Paszke et al. “PyTorch: An Imperative Style, High-Performance Deep Learning Library” In Advances in Neural Information Processing Systems 32 Curran Associates, Inc., 2019
  • [39] Boris Polyak “New method of stochastic approximation type” In Avtomatica i Telemekhanika 7, 1990, pp. 98–107
  • [40] Boris Polyak and Anatoli B. Juditsky “Acceleration of stochastic approximation by averaging” In SIAM Journal on Control and Optimization 30.4, 1992, pp. 838–855
  • [41] Marta Regis, Paulo Serra and Edwin R. van den Heuvel “Random autoregressive models: a structured overview” In Econometric Reviews 41.2, 2022, pp. 207–230
  • [42] David Ruppert “Efficient Estimations from a Slowly Convergent Robbins-Monro Process”, Technical Report 781. Cornell University Operations Research and Industrial Engineering, 1988
  • [43] Claudio Filipi Gonçalves Dos Santos and João Paulo Papa “Avoiding Overfitting: A Survey on Regularization Methods for Convolutional Neural Networks” In ACM Computing Surveys 54.10s, 2022, pp. 213:1–25
  • [44] Johannes Schmidt-Hieber and Wouter M. Koolen “Hebbian learning inspired estimation of the linear regression parameters from queries”, arXiv:2311.03483 [math.ST], 2023
  • [45] Albert Senen-Cerda and Jaron Sanders “Asymptotic Convergence Rate of Dropout on Shallow Linear Neural Networks” In Proceedings of the ACM on Measurement and Analysis of Computing Systems 6.2, 2022, pp. 32:1–53
  • [46] Nitish Srivastava et al. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” In Journal of Machine Learning Research 15.56, 2014, pp. 1929–1958
  • [47] Alexander Tsigler and Peter L. Bartlett “Benign Overfitting in Ridge Regression” In Journal of Machine Learning Research 24.123, 2023, pp. 1–76
  • [48] Stefan Wager, Sida Wang and Percy S. Liang “Dropout Training as Adaptive Regularization” In Advances in Neural Information Processing Systems 26 Curran Associates, Inc., 2013, pp. 351–359
  • [49] Li Wan et al. “Regularization of Neural Networks using DropConnect” In 30th International Conference on Machine Learning Proceedings of Machine Learning Research, 2013, pp. 1058–1066
  • [50] Sida Wang and Christopher Manning “Fast dropout training” In 30th International Conference on Machine Learning Proceedings of Machine Learning Research, 2013, pp. 118–126
  • [51] Colin Wei, Sham Kakade and Tengyu Ma “The Implicit and Explicit Regularization Effects of Dropout” In 37th International Conference on Machine Learning Proceedings of Machine Learning Research, 2020, pp. 10181–10192
  • [52] Haibing Wu and Xiaodong Gu “Towards dropout training for convolutional neural networks” In Neural Networks 71, 2015, pp. 1–10
  • [53] Ke Zhai and Huan Wang “Adaptive Dropout with Rademacher Complexity Regularization” In 6th International Conference on Learning Representations, 2018
  • [54] Ruiqi Zhang, Spencer Frei and Peter L. Bartlett “Trained Transformers Learn Linear Models In-Context” In Journal of Machine Learning Research 25.49, 2024, pp. 1–55
  • [55] Wanrong Zhu, Xi Chen and Wei Biao Wu “Online Covariance Matrix Estimation in Stochastic Gradient Descent” In Journal of the American Statistical Association 118.541, 2021, pp. 393–404