
The Product of Gaussian Matrices is Close to Gaussian

Yi Li (supported in part by Singapore Ministry of Education (AcRF) Tier 2 grant MOE2018-T2-1-013)
Division of Mathematical Sciences
Nanyang Technological University
[email protected]
David P. Woodruff (supported in part by Office of Naval Research (ONR) grant N00014-18-1-256 and a Simons Investigator Award)
Department of Computer Science
Carnegie Mellon University
[email protected]
Abstract

We study the distribution of the matrix product $G_1G_2\cdots G_r$ of $r$ independent Gaussian matrices of various sizes, where $G_i$ is $d_{i-1}\times d_i$, and we denote $p=d_0$, $q=d_r$, and require $d_1=d_{r-1}$. Here the entries in each $G_i$ are standard normal random variables with mean $0$ and variance $1$. Such products arise in the study of wireless communication, dynamical systems, and quantum transport, among other places. We show that, provided each $d_i$, $i=1,\ldots,r$, satisfies $d_i\geq C p\cdot q$, where $C\geq C_0$ for a constant $C_0>0$ depending on $r$, then the matrix product $G_1G_2\cdots G_r$ has variation distance at most $\delta$ to a $p\times q$ matrix $G$ of i.i.d. normal random variables with mean $0$ and variance $\prod_{i=1}^{r-1}d_i$. Here $\delta\to 0$ as $C\to\infty$. Moreover, we show a converse for constant $r$: if $d_i<C'\max\{p,q\}^{1/2}\min\{p,q\}^{3/2}$ for some $i$, then this total variation distance is at least $\delta'$, for a constant $\delta'>0$ depending on $C'$ and $r$. This converse is best possible when $p=\Theta(q)$.

1 Introduction

Random matrices play a central role in many areas of theoretical, applied, and computational mathematics. One particular application is dimensionality reduction, whereby one often chooses a rectangular random matrix $G\in\mathbb{R}^{m\times n}$, $m\ll n$, and computes $G\cdot x$ for a fixed vector $x\in\mathbb{R}^n$. Indeed, this is the setting in compressed sensing and sparse recovery [12], randomized numerical linear algebra [18, 20, 36], and sketching algorithms for data streams [25]. Often $G$ is chosen to be a Gaussian matrix, and in particular, an $m\times n$ matrix with entries that are i.i.d. normal random variables with mean $0$ and variance $1$, denoted by $N(0,1)$. Indeed, in compressed sensing, such matrices can be shown to satisfy the Restricted Isometry Property (RIP) [10], while in randomized numerical linear algebra, in certain applications such as support vector machines [29] and non-negative matrix factorization [19], their performance often outperforms that of other sketching matrices.

Our focus in this paper will be on understanding the product of two or more Gaussian matrices. Such products arise naturally in different applications. For example, in the over-constrained ridge regression problem $\min_x\|Ax-b\|_2^2+\lambda\|x\|_2^2$, the design matrix $A\in\mathbb{R}^{n\times d}$, $n\gg d$, is itself often assumed to be Gaussian (see, e.g., [26]). In this case, the "sketch-and-solve" algorithmic framework for regression [32] would compute $G\cdot A$ and $G\cdot b$ for an $m\times n$ Gaussian matrix $G$ with $m\approx sd_\lambda$, where $sd_\lambda$ is the so-called statistical dimension [2], and solve for the $x$ which minimizes $\|G\cdot Ax-G\cdot b\|_2^2+\lambda\|x\|_2^2$. While computing $G\cdot A$ is slower than computing the corresponding matrix product for other kinds of sketching matrices $G$, it often has application-specific [29, 19] as well as statistical benefits [31]. Notice that $G\cdot A$ is the product of two independent Gaussian matrices, and in particular, $G$ has a small number of rows while $A$ has a small number of columns; this is precisely the rectangular case we will study below. Another application in randomized numerical linear algebra where the product of two Gaussian matrices arises is when one computes the product of a Gaussian sketching matrix and Gaussian noise in a spiked identity covariance model [37].
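To make the sketch-and-solve pipeline concrete, here is a minimal numerical sketch of it (our own illustration; the sizes $n,d,m$ and the regularization $\lambda$ are arbitrary choices, and $m$ is not tuned to the statistical dimension):

```python
# Minimal sketch-and-solve ridge regression with a Gaussian sketch G.
# All sizes and lambda are illustrative assumptions, not values from the paper.
import numpy as np

rng = np.random.default_rng(0)
n, d, m, lam = 5000, 50, 400, 1.0

A = rng.standard_normal((n, d))            # Gaussian design matrix
x_true = rng.standard_normal(d)
b = A @ x_true + 0.1 * rng.standard_normal(n)

def ridge(A, b, lam):
    k = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(k), A.T @ b)

x_exact = ridge(A, b, lam)

G = rng.standard_normal((m, n)) / np.sqrt(m)   # Gaussian sketching matrix
x_sketch = ridge(G @ A, G @ b, lam)            # solve the sketched problem

# relative error of the sketched solution; small when m is large enough
print(np.linalg.norm(x_sketch - x_exact) / np.linalg.norm(x_exact))
```

Note that $G\cdot A$ here is exactly a product of two independent Gaussian matrices of the rectangular shape studied in this paper.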

The product of two or more Gaussian matrices also arises in diverse fields such as multiple-input multiple-output (MIMO) wireless communication channels [24]. Indeed, similar to the above regression problem in which one wants to reconstruct an underlying vector $x$, in such settings one observes the vector $y=G_1\cdots G_r\cdot x+\eta$, where $x$ is the transmitted signal and $\eta$ is background noise. This setting corresponds to the situation in which there are $r$ scattering environments separated by major obstacles, and the dimensions of the $G_i$ correspond to the number of "keyholes" [24]. To determine the mutual information of this channel, one needs to understand the singular values of the matrix $G_1\cdots G_r$. If one can show the distribution of this product is close to that of a Gaussian distribution in total variation distance, then one can use the wide range of results known for the spectrum of a single Gaussian matrix (see, e.g., [35]). Other applications of products of Gaussian matrices include disordered spin chains [11, 3, 15], stability of large complex dynamical systems [22, 21], symplectic maps and Hamiltonian mechanics [11, 4, 28], quantum transport in disordered wires [23, 13], and quantum chromodynamics [27]; we refer the reader to [14, 1] for an overview.

The main question we ask in this work is:

What is the distribution of the product $G_1G_2\cdots G_r$ of $r$ independent Gaussian matrices of various sizes, where $G_i$ is $d_{i-1}\times d_i$?

Our main interest in the question above will be when $G_1$ has a small number $p=d_0$ of rows, and $G_r$ has a small number $q=d_r$ of columns. Despite the large body of work on random matrix theory (see, e.g., [34] for a survey), we are not aware of any work which attempts to bound the total variation distance of the entire distribution of $G_1G_2\cdots G_r$ to a Gaussian distribution itself.

1.1 Our Results

Formally, we consider the problem of distinguishing the product of normalized Gaussian matrices

A_r=\left(\frac{1}{\sqrt{d_1}}G_1\right)\left(\frac{1}{\sqrt{d_2}}G_2\right)\cdots\left(\frac{1}{\sqrt{d_{r-1}}}G_{r-1}\right)\left(\frac{1}{\sqrt{d_1}}G_r\right)

from a single normalized Gaussian matrix

A_1=\frac{1}{\sqrt{d_1}}G_1.

We show that, when $r$ is a constant, with constant probability we cannot distinguish the distributions of these two random matrices when $d_i\gg p,q$ for all $i$; and, conversely, with constant probability, we can distinguish these two distributions when the $d_i$ are not large enough.

Theorem 1 (Main theorem).

Suppose that $d_i\geq\max\{p,q\}$ for all $i$ and $d_{r-1}=d_1$.

(a)

    It holds that

d_{TV}(A_r,A_1)\leq C_1\sum_{i=1}^{r-1}\sqrt{\frac{pq}{d_i}},

where $d_{TV}(A_r,A_1)$ denotes the total variation distance between $A_r$ and $A_1$, and $C_1>0$ is an absolute constant.

(b)

If $p,q,d_1,\dots,d_r$ further satisfy that

\sum_{j=1}^{r-1}\frac{1}{d_j}\geq\frac{C_2^r}{\max\{p,q\}^{\frac{1}{2}}\min\{p,q\}^{\frac{3}{2}}},

where $C_2>0$ is an absolute constant, then $d_{TV}(A_r,A_1)\geq 2/3$.

Part (a) states that $d_{TV}(A_r,A_1)<2/3$ when $d_i\geq C_1'pq$ for all $i$, for a constant $C_1'$ depending on $r$. The converse in (b) implies that $d_{TV}(A_r,A_1)\geq 2/3$ when $d_i\leq C_2'\max\{p,q\}^{1/2}\min\{p,q\}^{3/2}$ for some $i$, for a constant $C_2'$ depending on $r$. When $p=\Theta(q)$ and $r$ is a constant, we obtain a dichotomy (up to a constant factor) for the conditions on $p,q$ and $d_i$.

1.2 Our Techniques

Upper Bound.

We start by explaining our main insight as to why the distribution of a product $G_1\cdot G_2$ of a $p\times d$ matrix $G_1$ of i.i.d. $N(0,1)$ random variables and a $d\times q$ matrix $G_2$ of i.i.d. $N(0,1)$ random variables has low variation distance to the distribution of a $p\times q$ matrix $A$ of i.i.d. $N(0,d)$ random variables. One could try to directly understand the probability density function as was done in the case of Wishart matrices in [7, 30], which corresponds to the setting when $G_1=G_2$. However, there are certain algebraic simplifications in the case of the Wishart distribution that seem much less tractable when manipulating the density function of the product of independent Gaussians [9]. Another approach would be to try to use entropic methods as in [8, 6]. Such arguments try to reveal entries of the product $G_1\cdot G_2$ one-by-one, arguing that for most conditionings of previous entries, the new entry still looks like an independent Gaussian. However, the entries are clearly not independent: if $(G_1\cdot G_2)_{i,j}$ has large absolute value, then $(G_1\cdot G_2)_{i,j'}$ is more likely to be large in absolute value, as it could indicate that the $i$-th row of $G_1$ has large norm. One could try to first condition on the norms of all rows of $G_1$ and columns of $G_2$, but additional issues arise when one looks at submatrices: if $(G_1\cdot G_2)_{i,j}$, $(G_1\cdot G_2)_{i,j'}$, and $(G_1\cdot G_2)_{i',j}$ are all large, then it could mean the $i$-th row of $G_1$ and the $i'$-th row of $G_1$ are correlated with each other, since they both are correlated with the $j$-th column of $G_2$. Consequently, since $(G_1\cdot G_2)_{i,j'}$ is large, it could make it more likely that $(G_1\cdot G_2)_{i',j'}$ has large absolute value. This makes the entropic method difficult to apply in this context.

Our upper bound instead leverages beautiful work of Jiang [16] and Jiang and Ma [17] which bounds the total variation distance between the distribution of an $r\times\ell$ submatrix of a random $d\times d$ orthogonal matrix (orthonormal rows and columns) and an $r\times\ell$ matrix with i.i.d. $N(0,1/d)$ entries. Their work shows that if $r\cdot\ell/d\to 0$ as $d\to\infty$, then the total variation distance between these two matrix ensembles goes to $0$. It is not immediately clear how to apply such results in our context. First of all, which submatrix should we be looking at? Note though, that if $V^T$ is a $p\times d$ uniformly random (Haar measure) matrix with orthonormal rows, and $E$ is a $d\times q$ uniformly random matrix with orthonormal columns, then by rotational invariance, $V^TE$ is identically distributed to a $p\times q$ submatrix of a $d\times d$ random orthonormal matrix. Thus, setting $r=p$ and $\ell=q$ in the above results, they imply that $V^TE$ is close in variation distance to a $p\times q$ matrix $H$ with i.i.d. $N(0,1/d)$ entries. Given $G_1$ and $G_2$, one could then write them in their singular value decomposition, obtaining $G_1=U\Sigma V^T$ and $G_2=E\mathrm{T}F^T$. Then $V^T$ and $E$ are independent and well-known to be uniformly random $p\times d$ and $d\times q$ orthonormal matrices, respectively. Thus $G_1\cdot G_2$ is close in total variation distance to $U\Sigma H\mathrm{T}F^T$. However, this does not immediately help either, as it is not clear what the distribution of this matrix is. Instead, the "right" way to utilize the results above is to (1) observe that $G_1\cdot G_2=U\Sigma V^TG_2$ is identically distributed as $U\Sigma X$, where $X$ is a matrix of i.i.d. normal random variables, given the rotational invariance of the Gaussian distribution. Then (2) $X$ is itself close to a product $W^TZ$, where $W^T$ is a random $p\times d$ matrix with orthonormal rows and $Z$ is a random $d\times q$ matrix with orthonormal columns, by the above results. Thus, $G_1\cdot G_2$ is close to $U\Sigma W^TZ$. Then (3) $U\Sigma W^T$ has the same distribution as $G_1$, so $U\Sigma W^TZ$ is close to $G_1'Z$, where $G_1'$ and $G_1$ are identically distributed, and $G_1'$ is independent of $Z$. Finally, (4) $G_1'Z$ is identically distributed as a matrix $A_1$ of standard normal random variables because $G_1'$ is Gaussian and $Z$ has orthonormal columns, by rotational invariance of the Gaussian distribution.

We hope that this provides a general method for arguments involving Gaussian matrices: in step (2) we had the quantity $U\Sigma X$, where $X$ was a Gaussian matrix, and then viewed $X$ as a product of a short-fat random orthonormal matrix $W^T$ and a tall-thin random orthonormal matrix $Z$. Our proof for the product of more than $2$ matrices recursively uses similar ideas, and bounds the growth in variation distance as a function of the number $r$ of matrices involved in the product.
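As a quick numerical sanity check of this phenomenon (an illustration only, and a much weaker statement than closeness in total variation), one can verify that the entrywise excess kurtosis of $G_1G_2/\sqrt{d}$ decays like $6/d$:

```python
# Empirical illustration (our own, not from the paper): entries of
# G1 @ G2 / sqrt(d) have excess kurtosis about 6/d, approaching the
# Gaussian value 0 as the inner dimension d grows. Sizes are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
p, q, trials = 3, 4, 1000

for d in [10, 100, 1000]:
    x = np.concatenate([
        ((rng.standard_normal((p, d)) @ rng.standard_normal((d, q)))
         / np.sqrt(d)).ravel()
        for _ in range(trials)
    ])
    excess_kurtosis = np.mean(x**4) / np.mean(x**2) ** 2 - 3.0
    print(d, excess_kurtosis)   # decays roughly like 6/d
```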

Lower Bound.

For our lower bound for constant $r$, we show that the fourth power of the Schatten 4-norm of a matrix, namely, $\|X\|_{S_4}^4=\operatorname{tr}((X^TX)^2)$, can be used to distinguish a product $A_r$ of $r$ Gaussian matrices from a single Gaussian matrix $A_1$. We use Chebyshev's inequality, for which we need to find the expectation and variance of $\operatorname{tr}((X^TX)^2)$ for $X=A_r$ and $X=A_1$.
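The following Monte Carlo sketch (with illustrative sizes of our choosing, not the paper's thresholds) shows the statistic $h(X)=\operatorname{tr}((X^TX)^2)$ separating $A_2$ from $A_1$ when the intermediate dimension is small:

```python
# Illustration of the distinguishing statistic h(X) = tr((X^T X)^2) for
# A_1 = G'/sqrt(d) versus A_2 = G1 @ G2 / d. Sizes are assumptions chosen
# so that the separation is visible; they are not the paper's constants.
import numpy as np

rng = np.random.default_rng(2)
p = q = 30
d = 60            # intermediate dimension, deliberately small
trials = 500

def h(X):
    S = X.T @ X
    return np.trace(S @ S)

h1 = [h(rng.standard_normal((p, q)) / np.sqrt(d)) for _ in range(trials)]
h2 = [h(rng.standard_normal((p, d)) @ rng.standard_normal((d, q)) / d)
      for _ in range(trials)]

print(np.mean(h1), np.std(h1))   # concentrated near pq(p+q+1)/d^2
print(np.mean(h2), np.std(h2))   # visibly larger mean when d is small
```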

Let us consider the expectation first. An idea is to calculate the expectation recursively, that is, for a fixed matrix $M$ and a Gaussian random matrix $G$ we express $\mathbb{E}\operatorname{tr}(((MG)^T(MG))^2)$ in terms of $\mathbb{E}\operatorname{tr}((M^TM)^2)$. The real situation turns out to be slightly more complicated. Instead of expressing $\mathbb{E}\operatorname{tr}(((MG)^T(MG))^2)$ in terms of $\mathbb{E}\operatorname{tr}((M^TM)^2)$ directly, we decompose $\mathbb{E}\operatorname{tr}(((MG)^T(MG))^2)$ into the sum of expectations of a few functions in terms of $M$, say,

\mathbb{E}\operatorname{tr}(((MG)^T(MG))^2)=\mathbb{E}f_1(M)+\mathbb{E}f_2(M)+\cdots+\mathbb{E}f_s(M)

and build up the recurrence relations for $\mathbb{E}f_1(MG),\dots,\mathbb{E}f_s(MG)$ in terms of $\mathbb{E}f_1(M)$, $\mathbb{E}f_2(M)$, ..., $\mathbb{E}f_s(M)$. It turns out that the recurrence relations are all linear, i.e.,

\mathbb{E}f_i(MG)=\sum_{j=1}^s a_{ij}\mathbb{E}f_j(M),\quad i=1,\dots,s,  (1)

whence we can solve for $\mathbb{E}f_i(A_r)$ and obtain the desired expectation $\mathbb{E}\operatorname{tr}((A_r^TA_r)^2)$.

Now we turn to the variance. One could try to apply the same idea of finding recurrence relations for $\operatorname{Var}(Q)=\mathbb{E}(Q^2)-(\mathbb{E}Q)^2$ (where $Q=\operatorname{tr}(((MG)^T(MG))^2)$), but it quickly becomes intractable for the $\mathbb{E}(Q^2)$ term, as it involves products of eight entries of $M$, which all need to be handled carefully so as to avoid any loose bounds; note that the subtraction of $(\mathbb{E}Q)^2$ is critically needed to obtain a small upper bound on $\operatorname{Var}(Q)$, and thus loose bounds on $\mathbb{E}(Q^2)$ would not suffice. For a tractable calculation, we keep the products of entries of $M$ to $4$th order throughout, without involving any terms of $8$th order. To do so, we invoke the law of total variance,

\operatorname{Var}_{M,G}(\operatorname{tr}(((MG)^T(MG))^2))=\mathbb{E}_M\left(\left.\operatorname{Var}_G(\operatorname{tr}((G^TM^TMG)^2))\right|M\right)+\operatorname{Var}_M\left(\left.\mathbb{E}_G\operatorname{tr}((G^TM^TMG)^2)\right|M\right).  (2)

For the first term on the right-hand side, we use Poincaré's inequality to upper bound it. Poincaré's inequality for the Gaussian measure states that for a differentiable function $f$ on $\mathbb{R}^m$,

\operatorname{Var}_{g\sim N(0,I_m)}(f(g))\leq C\,\mathbb{E}_{g\sim N(0,I_m)}\|\nabla f(g)\|_2^2.

Here we can simply let $f(X)=\operatorname{tr}(((MX)^T(MX))^2)$ and calculate $\mathbb{E}\|\nabla f(G)\|_2^2$. This is tractable since $\mathbb{E}\|\nabla f(G)\|_2^2$ involves the products of at most $4$ entries of $M$, and we can use the recursive idea for the expectation above to express

\mathbb{E}\|\nabla f(G)\|_2^2=\sum_i a_i\,\mathbb{E}g_i(M)

for a few functions $g_i$ and establish a recurrence relation for each $g_i$.

The second term on the right-hand side of (2) can be dealt with by plugging in (1), and turns out to depend on a new quantity $\operatorname{Var}(\operatorname{tr}^2(M^TM))$. We again apply the recursive idea and the law of total variance to

\operatorname{Var}_{M,G}(\operatorname{tr}^2(G^TM^TMG))=\mathbb{E}_M\left(\left.\operatorname{Var}_G(\operatorname{tr}^2(G^TM^TMG))\right|M\right)+\operatorname{Var}_M\left(\left.\mathbb{E}_G\operatorname{tr}^2(G^TM^TMG)\right|M\right).

Again, the first term on the right-hand side can be handled by Poincaré's inequality, and the second term turns out to depend on $\operatorname{Var}(\operatorname{tr}((M^TM)^2))$, which is crucial. We have now obtained a double recurrence involving inequalities on $\operatorname{Var}(\operatorname{tr}((M^TM)^2))$ and $\operatorname{Var}(\operatorname{tr}^2(M^TM))$, from which we can solve for an upper bound on $\operatorname{Var}(\operatorname{tr}((A_r^TA_r)^2))$. This upper bound, however, grows exponentially in $r$, which is impossible to improve due to our use of Poincaré's inequality.

2 Preliminaries

Notation.

For a random variable $X$ and a probability distribution $\mathcal{D}$, we use $X\sim\mathcal{D}$ to denote that $X$ is distributed according to $\mathcal{D}$. For two random variables $X$ and $Y$ defined on the same sample space, we write $X\overset{d}{=}Y$ if $X$ and $Y$ are identically distributed.

We use $\mathcal{G}_{m,n}$ to denote the distribution of $m\times n$ Gaussian random matrices of i.i.d. $N(0,1)$ entries and $\mathcal{O}_{m,n}$ to denote the uniform (Haar) distribution of an $m\times n$ random matrix with orthonormal rows. For a distribution $\mathcal{D}$ on a linear space and a scaling factor $\alpha\in\mathbb{R}$, we use $\alpha\mathcal{D}$ to denote the distribution of $\alpha X$, where $X\sim\mathcal{D}$.

For two probability measures $\mu$ and $\nu$ on the Borel algebra $\mathcal{F}$ of $\mathbb{R}^m$, the total variation distance between $\mu$ and $\nu$ is defined as

d_{TV}(\mu,\nu)=\sup_{A\in\mathcal{F}}|\mu(A)-\nu(A)|=\frac{1}{2}\int_{\mathbb{R}^m}\left|\frac{d\mu}{d\nu}-1\right|d\nu.

If $\mu$ is absolutely continuous with respect to $\nu$, one can define the Kullback-Leibler divergence between $\mu$ and $\nu$ as

D_{\mathrm{KL}}(\mu\|\nu)=\int_{\mathbb{R}^m}\frac{d\mu}{d\nu}\log\frac{d\mu}{d\nu}\,d\nu.

If $\mu$ is not absolutely continuous with respect to $\nu$, we define $D_{\mathrm{KL}}(\mu\|\nu)=\infty$.

When $\mu$ and $\nu$ correspond to two random variables $X$ and $Y$, respectively, we also write $d_{TV}(\mu,\nu)$ and $D_{\mathrm{KL}}(\mu\|\nu)$ as $d_{TV}(X,Y)$ and $D_{\mathrm{KL}}(X\|Y)$, respectively.

The following is the well-known relation between the Kullback-Leibler divergence and the total variation distance between two probability measures.

Lemma 1 (Pinsker’s Inequality [5, Theorem 4.19]).

d_{TV}(\mu,\nu)\leq\sqrt{\frac{1}{2}D_{\mathrm{KL}}(\mu\|\nu)}.
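As a quick numeric illustration of Lemma 1 (using the closed forms $d_{TV}(N(0,1),N(\mu,1))=2\Phi(\mu/2)-1$ and $D_{\mathrm{KL}}=\mu^2/2$ for two unit-variance Gaussians, and assuming SciPy is available):

```python
# Check Pinsker's inequality on a pair of shifted univariate Gaussians.
import numpy as np
from scipy.stats import norm

for mu in [0.1, 0.5, 1.0, 2.0]:
    tv = 2 * norm.cdf(mu / 2) - 1          # exact total variation distance
    kl = mu**2 / 2                         # exact KL divergence (natural log)
    print(mu, tv, np.sqrt(kl / 2), tv <= np.sqrt(kl / 2))   # always True
```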

The following result, concerning the distance between the submatrix of a properly scaled Gaussian random matrix and a submatrix of a random orthogonal matrix, is due to Jiang and Ma [17].

Lemma 2 ([17]).

Let $G\sim\mathcal{G}_{d,d}$ and $Z\sim\mathcal{O}_{d,d}$. Suppose that $p,q\leq d$, $\hat{G}$ is the top-left $p\times q$ block of $G$, and $\hat{Z}$ is the top-left $p\times q$ block of $Z$. Then

D_{\mathrm{KL}}\left(\left.\frac{1}{\sqrt{d}}\hat{G}\,\right\|\,\hat{Z}\right)\leq C\frac{pq}{d},  (3)

where $C>0$ is an absolute constant.

The original paper [17] does not state the bound (3) explicitly and only states that the Kullback-Leibler divergence tends to $0$ as $d\to\infty$. A careful examination of the proof of [17, Theorem 1(i)], keeping track of the order of the various $o(1)$ terms, reveals the quantitative bound (3).
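An empirical companion to Lemma 2 (an illustration only; this checks the marginal distribution of the block entries, which is far weaker than a KL bound, and assumes SciPy's ortho_group for Haar sampling):

```python
# Entries of the top-left p x q block of a Haar d x d orthogonal matrix,
# rescaled by sqrt(d), look increasingly like N(0,1) as d grows.
import numpy as np
from scipy.stats import kstest, ortho_group

rng = np.random.default_rng(3)
p, q = 2, 3

for d in [10, 50, 200]:
    entries = []
    for _ in range(200):
        Z = ortho_group.rvs(d, random_state=rng)    # Haar orthogonal matrix
        entries.append(np.sqrt(d) * Z[:p, :q].ravel())
    stat, _ = kstest(np.concatenate(entries), "norm")
    print(d, stat)   # KS statistic against N(0,1) shrinks as d grows
```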

Useful Inequalities.

We list two useful inequalities below.

Lemma 3 (Poincaré’s inequality for Gaussian measure [5, Theorem 3.20]).

Let $X\sim N(0,I_n)$ be a standard $n$-dimensional Gaussian random vector and $f:\mathbb{R}^n\to\mathbb{R}$ be any continuously differentiable function. Then

\operatorname{Var}(f(X))\leq\mathbb{E}\left(\left\|\nabla f(X)\right\|_2^2\right).
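A toy check of Lemma 3 (our own illustration) for the simple test function $f(x)=\|x\|_2^2$, for which $\operatorname{Var}(f(X))=2n$ and $\mathbb{E}\|\nabla f(X)\|_2^2=4n$:

```python
# Empirically verify Var f(X) <= E ||grad f(X)||^2 for f(x) = ||x||_2^2.
import numpy as np

rng = np.random.default_rng(4)
n, trials = 10, 200000
X = rng.standard_normal((trials, n))
f = (X**2).sum(axis=1)                  # f(x) = ||x||^2
grad_sq = (4 * X**2).sum(axis=1)        # ||grad f||^2 = ||2x||^2 = 4||x||^2
print(f.var(), grad_sq.mean())          # about 2n versus 4n
```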
Lemma 4 (Trace inequality, [33]).

Let $A$ and $B$ be symmetric, positive semidefinite matrices and $k$ be a positive integer. Then

\operatorname{tr}((AB)^k)\leq\min\left\{\left\|A\right\|_{op}^k\operatorname{tr}(B^k),\left\|B\right\|_{op}^k\operatorname{tr}(A^k)\right\}.

3 Upper Bound

Let $r\geq 2$ be an integer. Suppose that $G_1,\dots,G_r$ are independent Gaussian random matrices, where $G_i\sim\mathcal{G}_{d_{i-1},d_i}$ with $d_0=p$, $d_r=q$ and $d_{r-1}=d_1$. Consider the product of normalized Gaussian matrices

A_r=\left(\frac{1}{\sqrt{d_1}}G_1\right)\left(\frac{1}{\sqrt{d_2}}G_2\right)\cdots\left(\frac{1}{\sqrt{d_{r-1}}}G_{r-1}\right)\left(\frac{1}{\sqrt{d_1}}G_r\right)
and a single normalized Gaussian random matrix
A_1=\frac{1}{\sqrt{d_1}}G_1'

where $G_1'\sim\mathcal{G}_{p,q}$. In this section, we shall show that when $p,q\ll d_i$ for all $i$, we cannot distinguish $A_r$ from $A_1$ with constant probability.

For notational convenience, let $W_i=\frac{1}{\sqrt{d_i}}G_i$ for $i<r$ and $W_r=\frac{1}{\sqrt{d_1}}G_r$. Assume that $pq\leq\beta d_i$ for some constant $\beta$ for all $i$. Our goal is to bound the total variation distance between the matrix product $W_1W_2\cdots W_r$ and the product $W_1W_r$ of two matrices.

Lemma 5.

Let $p,q,d,d'$ be positive integers satisfying $pq\leq\beta d$ and $pq\leq\beta d'$ for some constant $\beta<1$. Suppose that $A\in\mathbb{R}^{p\times d}$, $G\sim\frac{1}{\sqrt{d}}\mathcal{G}_{d,d'}$, and $L$ is a $d'\times q$ random matrix with orthonormal columns, i.e., $L^T\sim\mathcal{O}_{q,d'}$. Further suppose that $G$ and $L$ are independent. Let $Z\sim\mathcal{O}_{q,d}$ be independent of $A$, $G$ and $L$. Then

d_{TV}(AGL,AZ^T)\leq C\sqrt{\frac{pq}{d}},

where $C>0$ is an absolute constant.

Proof.

Let $A=U\Sigma V^T$ be its singular value decomposition, where $V$ has dimension $d\times p$. Then

AGL=U\Sigma(V^TGL)\overset{d}{=}U\Sigma X,

where $X$ is a $p\times q$ random matrix of i.i.d. $N(0,1/d)$ entries. Suppose that $\tilde{Z}$ consists of the top $p$ rows of $Z^T$. Then

AZ^T=U\Sigma(V^TZ^T)\overset{d}{=}U\Sigma\tilde{Z}.

Note that $X$ and $Z$ are independent of $U$ and $\Sigma$. It follows from Lemma 2 that

D_{\mathrm{KL}}(AGL\|AZ^T)=D_{\mathrm{KL}}(U\Sigma X\|U\Sigma\tilde{Z})=D_{\mathrm{KL}}(X\|\tilde{Z})\leq C\frac{pq}{d},

where $C>0$ is an absolute constant. The result follows from Pinsker's inequality (Lemma 1). ∎

The next theorem follows from the above.

Theorem 2.

It holds that

d_{TV}(W_1\cdots W_r,W_1W_r)\leq C\sum_{i=1}^r\sqrt{\frac{pq}{d_i}},

where $C>0$ is an absolute constant.

Proof.

Let $W_r=U\Sigma V^T$ be its singular value decomposition, and let $X_i\sim\mathcal{O}_{q,d_i}$ be independent of each other and of the $W_i$'s. Applying the preceding lemma with $A=W_1\cdots W_{r-2}$, $G=W_{r-1}$ and $L=U$, we have

d_{TV}(W_1\cdots W_{r-2}W_{r-1}W_r,\,W_1\cdots W_{r-2}X_{r-2}^T\Sigma V^T)\leq C\sqrt{\frac{pq}{d_{r-2}}}.

Next, applying the preceding lemma with $A=W_1\cdots W_{r-3}$, $G=W_{r-2}$ and $L=X_{r-2}^T$, we have

d_{TV}(W_1\cdots W_{r-2}X_{r-2}^T\Sigma V^T,\,W_1\cdots W_{r-3}X_{r-3}^T\Sigma V^T)\leq C\sqrt{\frac{pq}{d_{r-3}}}.

Iterating this procedure, we have in the end that

d_{TV}(W_1W_2X_2^T\Sigma V^T,\,W_1X_1^T\Sigma V^T)\leq C\sqrt{\frac{pq}{d_1}}.

Since $U$, $\Sigma$ and $V$ are independent and $X_1\overset{d}{=}U^T$ (recall that $d_{r-1}=d_1$), it holds that $X_1^T\Sigma V^T\overset{d}{=}W_r$. Therefore,

d_{TV}(W_1\cdots W_r,W_1W_r)\leq C\sum_{i=1}^{r-2}\sqrt{\frac{pq}{d_i}}. ∎

Repeating the same argument for W1WrW_{1}W_{r}, we obtain the following corollary immediately.

Corollary 1.

It holds that

d_{TV}(A_r,A_1)\leq C\sum_{i=1}^{r-1}\sqrt{\frac{pq}{d_i}},

where $C>0$ is an absolute constant.

4 Lower Bound

Suppose that $r$ is a constant. We shall show that one can distinguish the product of $r$ Gaussian random matrices

A_r=\left(\frac{1}{\sqrt{d_1}}G_1\right)\left(\frac{1}{\sqrt{d_2}}G_2\right)\cdots\left(\frac{1}{\sqrt{d_{r-1}}}G_{r-1}\right)\left(\frac{1}{\sqrt{d_1}}G_r\right),

from one Gaussian random matrix

A_1=\frac{1}{\sqrt{d_1}}G_1'

when the intermediate dimensions $d_1,\dots,d_{r-1}$ are not large enough. Considering $h(X)=\operatorname{tr}((X^TX)^2)$, it suffices to show that one can distinguish $h(A_r)$ from $h(A_1)$ with constant probability for constant $r$. By Chebyshev's inequality, it suffices to show that

\max\left\{\sqrt{\operatorname{Var}(h(A_1))},\sqrt{\operatorname{Var}(h(A_r))}\right\}\leq c\left(\mathbb{E}h(A_r)-\mathbb{E}h(A_1)\right)

for a small constant cc. We calculate that:

Lemma 6.

Suppose that $r$ is a constant and $d_i\geq\max\{p,q\}$ for all $i=1,\dots,r$. When $p,q,d_1,\dots,d_r\to\infty$,

\mathbb{E}h(A_r)=\frac{pq(p+q+1)}{d_1^2}+(1+o(1))\frac{pq(p-1)(q-1)}{d_1^2}\sum_{j=1}^{r-1}\frac{1}{d_j}.
Lemma 7.

Suppose that $r$ is a constant and $d_i\geq\max\{p,q\}$ for all $i=1,\dots,r$. There exists an absolute constant $C$ such that, when $p,q,d_1,\dots,d_r$ are sufficiently large,

\operatorname{Var}(h(A_r))\leq\frac{C^r(p^3q+pq^3)}{d_1^4}.

We conclude with the following theorem, which can be seen as a tight converse to Corollary 1, up to a constant factor, on the conditions for $p,q,d_1,\dots,d_r$.

Theorem 3.

Suppose that $r$ is a constant and $d_i\geq\max\{p,q\}$ for all $i=1,\dots,r$. Further suppose that $d_{r-1}=d_1$. When $p,q,d_1,\dots,d_r$ are sufficiently large and satisfy

\sum_{j=1}^{r-1}\frac{1}{d_j}\geq\frac{C^r}{\max\{p,q\}^{\frac{1}{2}}\min\{p,q\}^{\frac{3}{2}}},

where $C>0$ is some absolute constant, then with probability at least $2/3$, one can distinguish $A_r$ from $A_1$.

4.1 Calculation of the Mean

Suppose that $A$ is a $p\times q$ random matrix whose distribution is invariant under left- and right-multiplication by orthogonal matrices. We define

S_1(p,q)=\mathbb{E}A_{11}^4 \quad\text{(diagonal)}
S_2(p,q)=\mathbb{E}A_{21}^4 \quad\text{(off-diagonal)}
S_3(p,q)=\mathbb{E}A_{i1}^2A_{j1}^2 \quad (i\neq j) \quad\text{(same column)}
S_4(p,q)=\mathbb{E}A_{1i}^2A_{1j}^2 \quad (i\neq j) \quad\text{(same row)}
S_5(p,q)=\mathbb{E}A_{1i}^2A_{2j}^2 \quad (i\neq j)
S_6(p,q)=\mathbb{E}A_{ik}A_{il}A_{jk}A_{jl} \quad (i\neq j,\ k\neq l) \quad\text{(rectangle)}

Since $A$ is left- and right-invariant under rotations, these quantities are well-defined. Then

\mathbb{E}\operatorname{tr}((A^TA)^2)=\mathbb{E}\sum_{1\leq i,j\leq q}(A^TA)_{ij}^2=\sum_{i=1}^q\mathbb{E}(A^TA)_{ii}^2+\sum_{1\leq i,j\leq q,\,i\neq j}\mathbb{E}(A^TA)_{ij}^2=q\,\mathbb{E}(A^TA)_{11}^2+q(q-1)\,\mathbb{E}(A^TA)_{12}^2
and
\mathbb{E}(A^TA)_{11}^2=\mathbb{E}\bigg(\sum_{i=1}^pA_{i1}^2\bigg)^2=\sum_{i=1}^p\mathbb{E}A_{i1}^4+\sum_{1\leq i,j\leq p,\,i\neq j}\mathbb{E}A_{i1}^2A_{j1}^2=\mathbb{E}A_{11}^4+(p-1)\mathbb{E}A_{21}^4+p(p-1)\mathbb{E}A_{11}^2A_{21}^2=:S_1(p,q)+(p-1)S_2(p,q)+p(p-1)S_3(p,q)
\mathbb{E}(A^TA)_{12}^2=\mathbb{E}\bigg(\sum_{i=1}^pA_{i1}A_{i2}\bigg)^2=\sum_{i=1}^p\mathbb{E}A_{i1}^2A_{i2}^2+\sum_{1\leq i,j\leq p,\,i\neq j}\mathbb{E}A_{i1}A_{i2}A_{j1}A_{j2}=pS_4(p,q)+p(p-1)S_6(p,q).

When $S_1(p,q)=S_2(p,q)$, we have

\mathbb{E}\operatorname{tr}((A^TA)^2)=q(pS_1(p,q)+p(p-1)S_3(p,q))+q(q-1)(pS_4(p,q)+p(p-1)S_6(p,q))=pqS_1(p,q)+pq(p-1)S_3(p,q)+pq(q-1)S_4(p,q)+p(p-1)q(q-1)S_6(p,q).

When $A=G$, we have

S_1(p,q)=S_2(p,q)=3,\quad S_3(p,q)=S_4(p,q)=S_5(p,q)=1,\quad S_6(p,q)=0
and so
\mathbb{E}\operatorname{tr}((A^TA)^2)=3pq+pq(p-1)+pq(q-1)=pq(p+q+1).
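This identity is easy to confirm by simulation (our own sanity check, with arbitrary small sizes):

```python
# Monte Carlo check of E tr((A^T A)^2) = pq(p+q+1) for A with i.i.d. N(0,1)
# entries; p, q and the trial count are illustrative.
import numpy as np

rng = np.random.default_rng(5)
p, q, trials = 4, 7, 50000
acc = 0.0
for _ in range(trials):
    A = rng.standard_normal((p, q))
    S = A.T @ A
    acc += np.trace(S @ S)
print(acc / trials, p * q * (p + q + 1))   # the two values should be close
```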

Next, consider $A=BG$, where $B$ is a $p\times d$ random matrix and $G$ is a $d\times q$ random matrix of i.i.d. $N(0,1)$ entries, independent of $B$. The following proposition is easy to verify.

Proposition 1.

It holds that $\mathbb{E}A_{21}^4=\mathbb{E}A_{11}^4$.

Proof.

We have

\mathbb{E}A_{11}^4=\mathbb{E}\bigg(\sum_iB_{1i}G_{i1}\bigg)^4=\sum_{i,j,k,l}\mathbb{E}B_{1i}B_{1j}B_{1k}B_{1l}\,\mathbb{E}G_{i1}G_{j1}G_{k1}G_{l1}=3\sum_i\mathbb{E}B_{1i}^4+3\sum_{i\neq j}\mathbb{E}B_{1i}^2B_{1j}^2
and
\mathbb{E}A_{21}^4=\mathbb{E}\bigg(\sum_iB_{2i}G_{i1}\bigg)^4=\sum_{i,j,k,l}\mathbb{E}B_{2i}B_{2j}B_{2k}B_{2l}\,\mathbb{E}G_{i1}G_{j1}G_{k1}G_{l1}=3\sum_i\mathbb{E}B_{2i}^4+3\sum_{i\neq j}\mathbb{E}B_{2i}^2B_{2j}^2=\mathbb{E}A_{11}^4. ∎

Suppose that the associated quantities of $B$ are named $T_1,T_2,T_3,T_4,T_5,T_6$. Then we can calculate that (detailed calculations can be found in Appendix A)

S_1(p,q)=3dT_1(p,d)+3d(d-1)T_4(p,d)
S_3(p,q)=3dT_3(p,d)+d(d-1)T_5(p,d)+2d(d-1)T_6(p,d)
S_4(p,q)=dT_1(p,d)+d(d-1)T_4(p,d)
S_5(p,q)=dT_3(p,d)+d(d-1)T_5(p,d)
S_6(p,q)=dT_3(p,d)+d(d-1)T_6(p,d)

It is clear that $S_1,S_3,S_4,S_5,S_6$ depend only on $d$ (not on $p$ and $q$) if $T_1,T_3,T_4,T_5,T_6$ do so. Furthermore, if $T_1=3T_4$ then we have $S_1=3S_4$ and thus $S_4=d(d+2)T_4$. If $T_3=2T_6+T_5$ then $S_3=d(d+2)T_3$ and $S_3=2S_6+S_5$. Hence, if $T_3=T_4$ then $S_3=S_4$. We can verify that all these conditions are satisfied for a single Gaussian matrix, and we can iterate to obtain these quantities for the product of $r$ Gaussian matrices with intermediate dimensions $d_1,d_2,\dots,d_{r-1}$. We have that

S_3=S_4=\prod_{i=1}^{r-1}d_i(d_i+2),\quad S_1=3S_4,\quad S_6=\sum_{j=1}^{r-1}\left(\prod_{i=1}^{j-1}d_i(d_i+2)\right)d_j\left(\prod_{i=j+1}^{r-1}d_i(d_i-1)\right).

Therefore, normalizing the $i$-th matrix by $1/\sqrt{d_i}$, that is,

A=\left(\frac{1}{\sqrt{d_1}}G_1\right)\left(\frac{1}{\sqrt{d_2}}G_2\right)\cdots\left(\frac{1}{\sqrt{d_{r-1}}}G_{r-1}\right)\left(\frac{1}{\sqrt{d_1}}G_r\right),

we have for constant $r$ that

\mathbb{E}\operatorname{tr}((A^TA)^2)=\frac{1}{d_1^2d_2^2\cdots d_{r-1}^2d_1^2}\left(pq(p+q+1)S_3+pq(p-1)(q-1)S_6\right)\approx\frac{pq(p+q+1)}{d_1^2}+\frac{pq(p-1)(q-1)}{d_1^2}\sum_{j=1}^{r-1}\frac{1}{d_j}.  (4)
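For $r=2$, the exact version of (4) reads $\mathbb{E}\operatorname{tr}((A^TA)^2)=(pq(p+q+1)d_1(d_1+2)+pq(p-1)(q-1)d_1)/d_1^4$, which the following Monte Carlo sketch (illustrative sizes of our choosing) confirms:

```python
# Check the r = 2 case of (4): A = G1 @ G2 / d1 with inner dimension d1.
import numpy as np

rng = np.random.default_rng(6)
p, q, d1, trials = 4, 5, 50, 50000
acc = 0.0
for _ in range(trials):
    A = rng.standard_normal((p, d1)) @ rng.standard_normal((d1, q)) / d1
    S = A.T @ A
    acc += np.trace(S @ S)

exact = (p*q*(p+q+1)*d1*(d1+2) + p*q*(p-1)*(q-1)*d1) / d1**4
approx = p*q*(p+q+1)/d1**2 + p*q*(p-1)*(q-1)/d1**3   # asymptotic form of (4)
print(acc / trials, exact, approx)
```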

4.2 Calculation of the Variance

Let $M\in\mathbb{R}^{p\times p}$ be a random symmetric matrix, and let $G\in\mathbb{R}^{p\times q}$ be a random matrix of i.i.d. $N(0,1)$ entries. We want to find the variance of $\operatorname{tr}((G^TMG)^2)$. The detailed calculations of some steps can be found in Appendix B.

Our starting point is the law of total variance, which states that

\operatorname{Var}(\operatorname{tr}((G^TMG)^2))=\mathbb{E}_M\left(\left.\operatorname{Var}_G(\operatorname{tr}((G^TMG)^2))\right|M\right)+\operatorname{Var}_M\left(\left.\mathbb{E}_G\operatorname{tr}((G^TMG)^2)\right|M\right)  (5)

Step 1a.

We shall handle each term separately. Consider the first term, which we shall bound using the Poincaré inequality for Gaussian measures. Define $f(X)=\operatorname{tr}((X^TMX)^2)$, where $X\in\mathbb{R}^{p\times q}$. We shall calculate $\nabla f$.

f(X)=\left\|X^TMX\right\|_F^2=\sum_{1\leq i,j\leq q}(X^TMX)_{ij}^2=\sum_{1\leq i,j\leq q}\bigg(\sum_{1\leq k,l\leq p}M_{kl}X_{ki}X_{lj}\bigg)^2.

Then

\frac{\partial f}{\partial X_{rs}}=\sum_{1\leq i,j\leq q}2\bigg(\sum_{1\leq u,v\leq p}M_{uv}X_{ui}X_{vj}\bigg)\bigg(\sum_{1\leq k,l\leq p}\frac{\partial}{\partial X_{rs}}(M_{kl}X_{ki}X_{lj})\bigg).

Since

\frac{\partial}{\partial X_{rs}}(M_{kl}X_{ki}X_{lj})=\begin{cases}M_{kl}X_{lj},&(k,i)=(r,s)\text{ and }(l,j)\neq(r,s)\\ M_{kl}X_{ki},&(k,i)\neq(r,s)\text{ and }(l,j)=(r,s)\\ 2M_{rr}X_{rs},&(k,i)=(r,s)\text{ and }(l,j)=(r,s)\\ 0,&\text{otherwise},\end{cases}

it follows that

\frac{\partial f}{\partial X_{rs}}=4\left(\sum_{1\leq u,v\leq p}M_{uv}X_{us}X_{vs}\right)M_{rr}X_{rs}+2\sum_{(l,j)\neq(r,s)}\left(\sum_{1\leq u,v\leq p}M_{uv}X_{us}X_{vj}\right)M_{rl}X_{lj}+2\sum_{(k,i)\neq(r,s)}\left(\sum_{1\leq u,v\leq p}M_{uv}X_{ui}X_{vs}\right)M_{kr}X_{ki}
=4\left[\left(\sum_{1\leq u,v\leq p}M_{uv}X_{us}X_{vs}\right)M_{rr}X_{rs}+\sum_{(l,j)\neq(r,s)}\left(\sum_{u,v}M_{uv}X_{us}X_{vj}\right)M_{rl}X_{lj}\right]
=4\sum_{l,j}\left(\sum_{u,v}M_{uv}X_{us}X_{vj}\right)M_{rl}X_{lj}.
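In matrix form the display above says $\nabla f(X)=4MX(X^TMX)$, which can be checked against finite differences (a sanity check we add for illustration):

```python
# Finite-difference check of grad f(X) = 4 M X (X^T M X) for
# f(X) = tr((X^T M X)^2) with symmetric M.
import numpy as np

rng = np.random.default_rng(7)
p, q, eps = 5, 3, 1e-6
B = rng.standard_normal((p, p))
M = (B + B.T) / 2                      # random symmetric M
X = rng.standard_normal((p, q))

def f(X):
    Y = X.T @ M @ X
    return np.trace(Y @ Y)

grad = 4 * M @ X @ (X.T @ M @ X)       # closed-form gradient
num = np.zeros_like(X)
for r in range(p):
    for s in range(q):
        E = np.zeros_like(X); E[r, s] = eps
        num[r, s] = (f(X + E) - f(X - E)) / (2 * eps)   # central difference
print(np.max(np.abs(grad - num)))      # tiny, up to finite-difference error
```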

Next we calculate $\mathbb{E}(\partial f/\partial X_{rs})^2$ when $X$ has i.i.d. Gaussian entries.

\mathbb{E}\left(\frac{1}{4}\frac{\partial f}{\partial X_{rs}}\right)^2=\sum_{\substack{l,j\\ l',j'}}\sum_{\substack{u,v\\ u',v'}}M_{uv}M_{u'v'}M_{rl}M_{rl'}\,\mathbb{E}X_{us}X_{u's}X_{vj}X_{lj}X_{v'j'}X_{l'j'}.

We discuss the different cases of $j,j',s$.

When $j$, $j'$ and $s$ are pairwise distinct, it must hold that $u=u'$, $v=l$ and $v'=l'$ for a possible nonzero contribution, and the total contribution in this case is at most $q(q-1)B^{(1)}_{r,s}$, where

B^{(1)}_{r,s}=\sum_{1\leq l,l'\leq p}\sum_uM_{ul}M_{ul'}M_{rl}M_{rl'}=\sum_u\langle M_{u,\cdot},M_{r,\cdot}\rangle^2.

When $j=j'\neq s$, it must hold that $u=u'$ for a possible nonzero contribution, and the total contribution in this case is at most $(q-1)B^{(2)}_{r,s}$, where

B^{(2)}_{r,s}=\sum_{l,l'}\sum_{u,v,v'}M_{uv}M_{uv'}M_{rl}M_{rl'}\,\mathbb{E}X_{us}^2X_{vj}X_{lj}X_{v'j}X_{l'j}=\left\|M\right\|_F^2\left\|M_{r,\cdot}\right\|_2^2+2\sum_u\langle M_{u,\cdot},M_{r,\cdot}\rangle^2.

When $j=s\neq j'$, it must hold that $v'=l'$ for a possible nonzero contribution, and the total contribution in this case is at most $(q-1)B^{(3)}_{r,s}$, where

B^{(3)}_{r,s}=\sum_{l,l'}\sum_{u,u',v}M_{uv}M_{u'l'}M_{rl}M_{rl'}\,\mathbb{E}X_{us}X_{u's}X_{vs}X_{ls}=\sum_{l,l'}(2\langle M_{l,\cdot},M_{l',\cdot}\rangle+\operatorname{tr}(M)M_{ll'})M_{rl}M_{rl'}.

When $j=j'=s$, the nonzero contribution is

B^{(4)}_{r,s}=\sum_{l,l'}\sum_{\substack{u,v\\u',v'}}M_{uv}M_{u'v'}M_{rl}M_{rl'}\,\mathbb{E}X_{us}X_{u's}X_{vs}X_{ls}X_{v's}X_{l's}.

Since $u,u',v,v',l,l'$ need to be paired, the only case which is not covered by $B^{(1)}_{r,s}$, $B^{(2)}_{r,s}$ and $B^{(3)}_{r,s}$ is when $u=v$, $u'=v'$ and $l=l'$, in which case the contribution is at most

\sum_l\sum_{u,u'}M_{uu}M_{u'u'}M_{rl}^2\,\mathbb{E}X_{us}^2X_{u's}^2X_{ls}^2\lesssim\operatorname{tr}^2(M)\left\|M_{r,\cdot}\right\|_2^2.

Hence

B^{(4)}_{r,s}\lesssim B^{(1)}_{r,s}+B^{(2)}_{r,s}+B^{(3)}_{r,s}+\operatorname{tr}^2(M)\left\|M_{r,\cdot}\right\|_2^2.

It follows that

\sum_{r,s}B^{(1)}_{r,s}=q\sum_{u,r}\langle M_{u,\cdot},M_{r,\cdot}\rangle^2=q\operatorname{tr}(M^4)
\sum_{r,s}B^{(2)}_{r,s}=q\sum_r\left\|M\right\|_F^2\left\|M_{r,\cdot}\right\|_2^2+2q\sum_{u,r}\langle M_{u,\cdot},M_{r,\cdot}\rangle^2=q\left\|M\right\|_F^4+2q\operatorname{tr}(M^4)
\sum_{r,s}B^{(3)}_{r,s}=\sum_{r,s}\sum_{l,l'}(2\langle M_{l,\cdot},M_{l',\cdot}\rangle+\operatorname{tr}(M)M_{ll'})M_{rl}M_{rl'}\leq 2q\operatorname{tr}(M^4)+q\operatorname{tr}(M)\left\|M\right\|_F\sqrt{\operatorname{tr}(M^4)}

Note that $\operatorname{tr}(M^4)\leq\operatorname{tr}^2(M^2)=\left\|M\right\|_F^4$. Hence

\frac{1}{16}\mathbb{E}\left\|\nabla f\right\|_2^2\leq\sum_{r,s}\left((q-1)(q-2)B_{r,s}^{(1)}+(q-1)B_{r,s}^{(2)}+(q-1)B_{r,s}^{(3)}+B_{r,s}^{(4)}\right)
\lesssim\sum_{r,s}\left(q^2B_{r,s}^{(1)}+qB_{r,s}^{(2)}+qB_{r,s}^{(3)}+\operatorname{tr}^2(M)\left\|M_{r,\cdot}\right\|_2^2\right)
\lesssim q^3\operatorname{tr}(M^4)+q^2\left\|M\right\|_F^4+q^2\operatorname{tr}(M)\left\|M\right\|_F\sqrt{\operatorname{tr}(M^4)}+q\operatorname{tr}^2(M)\left\|M\right\|_F^2.

By the Gaussian Poincaré inequality,

\operatorname{Var}_G(\operatorname{tr}((G^TMG)^2)|M)\lesssim\mathbb{E}\left\|\nabla f\right\|_2^2\lesssim q^3\operatorname{tr}(M^4)+q^2\left\|M\right\|_F^4+q^2\operatorname{tr}(M)\left\|M\right\|_F\sqrt{\operatorname{tr}(M^4)}+q\operatorname{tr}^2(M)\left\|M\right\|_F^2.  (6)

For the terms on the right-hand side, we calculate, using the trace inequality (Lemma 4), that

\mathbb{E}\operatorname{tr}((G^TMG)^4)=\mathbb{E}\operatorname{tr}((MGG^T)^4)\leq\mathbb{E}\left\|GG^T\right\|_{op}^4\operatorname{tr}(M^4)=\mathbb{E}\left\|G\right\|_{op}^8\operatorname{tr}(M^4)\lesssim\max\{p,q\}^4\operatorname{tr}(M^4),
\mathbb{E}\left\|G^TMG\right\|_F^4\leq\mathbb{E}\left\|G\right\|_{op}^8\left\|M\right\|_F^4\lesssim\max\{p,q\}^4\left\|M\right\|_F^4,
\mathbb{E}\operatorname{tr}^2(G^TMG)\left\|G^TMG\right\|_F^2\leq\mathbb{E}\left\|G\right\|_{op}^8\operatorname{tr}^2(M)\left\|M\right\|_F^2\lesssim\max\{p,q\}^4\operatorname{tr}^2(M)\left\|M\right\|_F^2

and

\mathbb{E}\operatorname{tr}(G^TMG)\left\|G^TMG\right\|_F\sqrt{\operatorname{tr}((G^TMG)^4)}
\leq\mathbb{E}\left\|G\right\|_{op}^2\operatorname{tr}(M)\cdot\left\|G\right\|_{op}^2\left\|M\right\|_F\cdot\sqrt{\left\|G\right\|_{op}^8\operatorname{tr}(M^4)}
=\mathbb{E}\left\|G\right\|_{op}^8\operatorname{tr}(M)\left\|M\right\|_F\sqrt{\operatorname{tr}(M^4)}
\lesssim\max\{p,q\}^4\operatorname{tr}(M)\left\|M\right\|_F\sqrt{\operatorname{tr}(M^4)}.

This implies that each term on the right-hand side of (6) grows geometrically under the recursion.

Step 1b.

Next we deal with the second term in (5). We have

\mathbb{E}_G\operatorname{tr}((G^TMG)^2)=\sum_{i,j}\mathbb{E}_G(G^TMG)_{ij}^2=\sum_{i,j}\mathbb{E}_G\bigg(\sum_{k,l}M_{kl}G_{ki}G_{lj}\bigg)^2=\sum_{i,j}\sum_{k,l,k',l'}M_{kl}M_{k'l'}\,\mathbb{E}_GG_{ki}G_{lj}G_{k'i}G_{l'j}.

When $i\neq j$, for a non-zero contribution it must hold that $k=k'$ and $l=l'$, and thus the total contribution is

\sum_{i\neq j}\sum_{k,l}M_{kl}^2=q(q-1)\left\|M\right\|_F^2.

When $i=j$, the contribution is

\sum_i\sum_{k,l,k',l'}M_{kl}M_{k'l'}\,\mathbb{E}G_{ki}G_{li}G_{k'i}G_{l'i}=2q\left\|M\right\|_F^2+q\operatorname{tr}^2(M).  (7)

Hence

\mathbb{E}_G\operatorname{tr}((G^TMG)^2)=q(q+1)\left\|M\right\|_F^2+q\operatorname{tr}^2(M)

and when $M$ is random,

\operatorname{Var}\left(\left.\mathbb{E}\operatorname{tr}((G^TMG)^2)\right|M\right)=\operatorname{Var}\left(q(q+1)\left\|M\right\|_F^2+q\operatorname{tr}^2(M)\right)\leq q^2(q+1)^2\operatorname{Var}(\left\|M\right\|_F^2)+q^2\operatorname{Var}(\operatorname{tr}^2(M))+2q^2(q+1)\sqrt{\operatorname{Var}(\left\|M\right\|_F^2)\operatorname{Var}(\operatorname{tr}^2(M))}.  (8)
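The identity for $\mathbb{E}_G\operatorname{tr}((G^TMG)^2)$ used above can be verified by simulation for a fixed symmetric $M$ (an illustrative check with arbitrary sizes):

```python
# Monte Carlo check of E tr((G^T M G)^2) = q(q+1)||M||_F^2 + q tr(M)^2.
import numpy as np

rng = np.random.default_rng(8)
p, q, trials = 4, 3, 100000
B = rng.standard_normal((p, p))
M = (B + B.T) / 2                       # fixed symmetric M
acc = 0.0
for _ in range(trials):
    G = rng.standard_normal((p, q))
    Y = G.T @ M @ G
    acc += np.trace(Y @ Y)
pred = q * (q + 1) * np.sum(M * M) + q * np.trace(M) ** 2
print(acc / trials, pred)               # the two values should be close
```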

Step 2a.

Note the $\operatorname{Var}(\operatorname{tr}^2(M))$ term on the right-hand side of (8). To bound this term, we examine the variance of $g(G)$, where $g(X)=\operatorname{tr}^2(X^TMX)$. We shall again calculate $\nabla g$. Note that

\frac{\partial g}{\partial X_{rs}}=2\operatorname{tr}(X^TMX)\sum_i\sum_{k,l}M_{kl}\frac{\partial}{\partial X_{rs}}(X_{ki}X_{li})

and

\frac{\partial}{\partial X_{rs}}(X_{ki}X_{li})=\begin{cases}X_{li},&(k,i)=(r,s)\text{ and }(l,i)\neq(r,s)\\ X_{ki},&(k,i)\neq(r,s)\text{ and }(l,i)=(r,s)\\ 2X_{rs},&(k,i)=(r,s)\text{ and }(l,i)=(r,s)\\ 0,&\text{otherwise}.\end{cases}

We have

\frac{\partial g}{\partial X_{rs}}=4\operatorname{tr}(X^TMX)\sum_lM_{rl}X_{ls}=4\sum_{\substack{1\leq j\leq q\\1\leq l,u,v\leq p}}M_{uv}M_{rl}X_{ls}X_{uj}X_{vj}.

Next we calculate $\mathbb{E}(\partial g/\partial X_{rs})^2$ when $X$ has i.i.d. Gaussian entries.

\mathbb{E}\left(\frac{1}{4}\frac{\partial g}{\partial X_{rs}}\right)^2=\sum_{\substack{j,l,u,v\\j',l',u',v'}}M_{uv}M_{u'v'}M_{rl}M_{rl'}\,\mathbb{E}X_{ls}X_{l's}X_{uj}X_{vj}X_{u'j'}X_{v'j'}.

In order for the expectation in the summand to be non-zero, we must be in one of the following cases: (1) $s\neq j\neq j'$ (all pairwise distinct), (2) $s=j\neq j'$, (3) $s=j'\neq j$, (4) $s\neq j=j'$, (5) $s=j=j'$. We calculate the contribution in each case below.

Case 1: it must hold that $l=l'$, $u=v$ and $u'=v'$. The contribution is $(q-1)(q-2)B^{(1)}_{rs}$, where

B^{(1)}_{rs}=\sum_{l,u,u'}M_{uu}M_{u'u'}M_{rl}^2=\operatorname{tr}^2(M)\left\|M_{r,\cdot}\right\|_2^2.

Case 2: it must hold that $u'=v'$. The contribution is $(q-1)B^{(2)}_{rs}$, where

B^{(2)}_{rs}=\sum_{l,l',u,u',v}M_{uv}M_{u'u'}M_{rl}M_{rl'}\,\mathbb{E}X_{ls}X_{l's}X_{us}X_{vs}X_{u'j'}^2=\operatorname{tr}(M)\bigg(\operatorname{tr}(M)\left\|M_{r,\cdot}\right\|_2^2+2\sum_{l,l'}M_{ll'}M_{rl}M_{rl'}\bigg)

Case 3: this gives the same bound as Case 2.

Case 4: it must hold that $l=l'$. The contribution is $(q-1)B^{(4)}_{rs}$, where

B^{(4)}_{rs}=\sum_{l,u,u',v,v'}M_{uv}M_{u'v'}M_{rl}^2\,\mathbb{E}X_{uj}X_{vj}X_{u'j}X_{v'j}=(\operatorname{tr}^2(M)+2\left\|M\right\|_F^2)\left\|M_{r,\cdot}\right\|_2^2

Case 5: the contribution is $B^{(5)}_{rs}$, where

B^{(5)}_{rs}=\sum_{\substack{l,u,v\\l',u',v'}}M_{uv}M_{u'v'}M_{rl}M_{rl'}\,\mathbb{E}X_{ls}X_{us}X_{vs}X_{l's}X_{u's}X_{v's}.

The only uncovered case is $l=u'$, $l'=v$, $u=v'$ and its symmetries. In such a case the contribution is at most

C\sum_{l,u,v}M_{uv}M_{lu}M_{rl}M_{rv}=C\sum_u\langle M_{r,\cdot},M_{u,\cdot}\rangle^2.

Note that

\sum_{r,s}B_{rs}^{(1)}=q\operatorname{tr}^2(M)\left\|M\right\|_F^2,
\sum_{r,s}B_{rs}^{(2)}=q\operatorname{tr}^2(M)\left\|M\right\|_F^2+2q\operatorname{tr}(M)\sum_{l,l'}M_{ll'}\langle M_{l,\cdot},M_{l',\cdot}\rangle\leq q\operatorname{tr}^2(M)\left\|M\right\|_F^2+2q\operatorname{tr}(M)\left\|M\right\|_F\sqrt{\operatorname{tr}(M^4)},
\sum_{r,s}B_{rs}^{(4)}=q\operatorname{tr}^2(M)\left\|M\right\|_F^2+2q\left\|M\right\|_F^4,
\sum_{r,s}B_{rs}^{(5)}\lesssim\sum_{r,s}B_{rs}^{(1)}+\sum_{r,s}B_{rs}^{(2)}+q\operatorname{tr}(M^4).

Therefore,

\frac{1}{16}\mathbb{E}\left\|\nabla g\right\|_2^2\leq\sum_{r,s}\left((q-1)(q-2)B_{rs}^{(1)}+2(q-1)B_{rs}^{(2)}+(q-1)B_{rs}^{(4)}+B_{rs}^{(5)}\right)\lesssim q^3\operatorname{tr}^2(M)\left\|M\right\|_F^2+q^2\operatorname{tr}(M)\left\|M\right\|_F\sqrt{\operatorname{tr}(M^4)}+q^2\left\|M\right\|_F^4+q\operatorname{tr}(M^4).

By Poincaré’s inequality,

\operatorname{Var}_G(\operatorname{tr}^2(G^TMG))\lesssim\mathbb{E}\left\|\nabla g\right\|_2^2\lesssim q^3\operatorname{tr}^2(M)\left\|M\right\|_F^2+q^2\operatorname{tr}(M)\left\|M\right\|_F\sqrt{\operatorname{tr}(M^4)}+q^2\left\|M\right\|_F^4+q\operatorname{tr}(M^4).  (9)

Similar to before, each term on the right-hand side grows geometrically.

Step 2b.

Next we deal with $\operatorname{Var}_M(\mathbb{E}_G\operatorname{tr}^2(G^TMG)|M)$. We have

\mathbb{E}\operatorname{tr}^2(G^TMG)=\mathbb{E}\bigg(\sum_{i,k,l}M_{kl}G_{ki}G_{li}\bigg)^2=\sum_{i,j}\sum_{k,l,k',l'}M_{kl}M_{k'l'}\,\mathbb{E}G_{ki}G_{li}G_{k'j}G_{l'j}.

When $i\neq j$, for a non-zero contribution it must hold that $k=l$ and $k'=l'$, and thus the total contribution is

\sum_{i\neq j}\sum_{k,k'}M_{kk}M_{k'k'}=q(q-1)\operatorname{tr}^2(M).

When $i=j$, the contribution is (exactly as in (7) in Step 1b)

\sum_i\sum_{k,k',l,l'}M_{kl}M_{k'l'}\,\mathbb{E}G_{ki}G_{li}G_{k'i}G_{l'i}=2q\left\|M\right\|_F^2+q\operatorname{tr}^2(M).

Hence

\mathbb{E}\operatorname{tr}^2(G^TMG)=2q\left\|M\right\|_F^2+q^2\operatorname{tr}^2(M)

and when $M$ is random,

\operatorname{Var}\left(\left.\mathbb{E}\operatorname{tr}^2(G^TMG)\right|M\right)=\operatorname{Var}\left(2q\left\|M\right\|_F^2+q^2\operatorname{tr}^2(M)\right)\leq 4q^2\operatorname{Var}(\left\|M\right\|_F^2)+q^4\operatorname{Var}(\operatorname{tr}^2(M))+2q^3\sqrt{\operatorname{Var}(\left\|M\right\|_F^2)\operatorname{Var}(\operatorname{tr}^2(M))}.  (10)
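A companion check of the identity $\mathbb{E}\operatorname{tr}^2(G^TMG)=2q\|M\|_F^2+q^2\operatorname{tr}^2(M)$ used above, under the same setup as the previous snippet:

```python
# Monte Carlo check of E tr^2(G^T M G) = 2q||M||_F^2 + q^2 tr(M)^2.
import numpy as np

rng = np.random.default_rng(9)
p, q, trials = 4, 3, 100000
B = rng.standard_normal((p, p))
M = (B + B.T) / 2                       # fixed symmetric M
acc = 0.0
for _ in range(trials):
    G = rng.standard_normal((p, q))
    acc += np.trace(G.T @ M @ G) ** 2
pred = 2 * q * np.sum(M * M) + q**2 * np.trace(M) ** 2
print(acc / trials, pred)               # the two values should be close
```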

Step 3.

Let $U_r$ denote the variance of $\operatorname{tr}((A_r^TA_r)^2)$ and $V_r$ the variance of $\operatorname{tr}^2(A_r^TA_r)$. Combining (5), (6), (8), (9) and (10), we have the following recurrence relations, where $C_1,C_2,C_3,C_4>0$ are absolute constants and $P_r$ and $Q_r$ denote bounds on the (properly normalized) Poincaré terms (6) and (9) at the $r$-th step.

U_{r+1}\leq C_1P_r+2U_r+\frac{1}{d_r^2}V_r+\frac{3}{d_r}\sqrt{U_rV_r}
V_{r+1}\leq C_2Q_r+\frac{1}{d_r^2}U_r+V_r+\frac{2}{d_r}\sqrt{U_rV_r}
P_{r+1}\leq C_3P_r
Q_{r+1}\leq C_4Q_r
U_0=V_0=0

In the base case, set $M=I_p$ (the $p\times p$ identity matrix) in (6) and note that the second term in (5) vanishes. We see that $P_1\lesssim(p^3q+pq^3)/d_1^4$ after proper normalization. (Alternatively, we can calculate this precisely; see Appendix C.) Similarly, we have $Q_1\lesssim p^3q^3/d_1^4$. Note that $Q_1/d_1^2\lesssim(p^3q+pq^3)/d_1^4$. Now we can solve that

U_{r+1}\leq C^r\frac{p^3q+pq^3}{d_1^4}

for some absolute constant $C>0$.
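The $C^r$-type growth of this bound can be seen by iterating the recurrences directly (a schematic illustration: the constants $C_1,\dots,C_4$ and the base values below are placeholders, not the paper's exact quantities):

```python
# Schematic iteration of the recurrences for U_r and V_r; constants are
# placeholders (C_1 = C_2 = 1, C_3 = C_4 = 2) chosen only to exhibit the
# geometric growth of the bound on U_r.
import numpy as np

p, q, d = 20, 20, 400
P = (p**3 * q + p * q**3) / d**4     # stand-in for P_1
Q = (p**3 * q**3) / d**4             # stand-in for Q_1
U = V = 0.0
for r in range(1, 8):
    U, V = (P + 2 * U + V / d**2 + 3 * np.sqrt(U * V) / d,
            Q + U / d**2 + V + 2 * np.sqrt(U * V) / d)
    P, Q = 2 * P, 2 * Q              # placeholder geometric growth of P_r, Q_r
    print(r, U)                      # U_r grows roughly like C^r
```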

References

  • [1] Gernot Akemann and Jesper R Ipsen. Recent exact and asymptotic results for products of independent random matrices. Acta Physica Polonica B, pages 1747–1784, 2015.
  • [2] Ahmed El Alaoui and Michael W. Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, page 775–783, Cambridge, MA, USA, 2015. MIT Press.
  • [3] Richard Bellman. Limit theorems for non-commutative operations. I. Duke Mathematical Journal, 21(3):491–500, 1954.
  • [4] Giancarlo Benettin. Power-law behavior of Lyapunov exponents in some conservative dynamical systems. Physica D: Nonlinear Phenomena, 13(1-2):211–220, 1984.
  • [5] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
  • [6] Matthew Brennan, Guy Bresler, and Dheeraj Nagaraj. Phase transitions for detecting latent geometry in random graphs. Probability Theory and Related Fields, pages 1215–1289, 2020. doi:10.1007/s00440-020-00998-3.
  • [7] Sébastien Bubeck, Jian Ding, Ronen Eldan, and Miklós Z. Rácz. Testing for high-dimensional geometry in random graphs. Random Structures & Algorithms, 49(3):503–532, 2016. doi:10.1002/rsa.20633.
  • [8] Sébastien Bubeck and Shirshendu Ganguly. Entropic CLT and Phase Transition in High-dimensional Wishart Matrices. International Mathematics Research Notices, 2018(2):588–606, 12 2016. doi:10.1093/imrn/rnw243.
  • [9] Z. Burda, R. A. Janik, and B. Waclaw. Spectrum of the product of independent random Gaussian matrices. Physical Review E, 81(4), Apr 2010. doi:10.1103/physreve.81.041132.
  • [10] Emmanuel J. Candès. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9-10):589–592, 2008.
  • [11] Andrea Crisanti, Giovanni Paladin, and Angelo Vulpiani. Products of random matrices: in Statistical Physics, volume 104. Springer Science & Business Media, 2012.
  • [12] David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
  • [13] S. Iida, H.A. Weidenmüller, and J.A. Zuk. Statistical scattering theory, the supersymmetry method and universal conductance fluctuations. Annals of Physics, 200(2):219–270, 1990.
  • [14] Jesper R. Ipsen. Products of independent Gaussian random matrices. PhD thesis, Bielefeld University, 2015.
  • [15] Hiroshi Ishitani. A central limit theorem for the subadditive process and its application to products of random matrices. Publications of the Research Institute for Mathematical Sciences, 12(3):565–575, 1977.
  • [16] Tiefeng Jiang. How many entries of a typical orthogonal matrix can be approximated by independent normals? The Annals of Probability, 34(4):1497–1529, Jul 2006. doi:10.1214/009117906000000205.
  • [17] Tiefeng Jiang and Yutao Ma. Distances between random orthogonal matrices and independent normals. Transactions of the American Mathematical Society, 372(3):1509–1553, August 2019. doi:10.1090/tran/7470.
  • [18] Ravindran Kannan and Santosh Vempala. Spectral algorithms. Now Publishers Inc, 2009.
  • [19] Michael Kapralov, Vamsi Potluru, and David Woodruff. How to fake multiply by a gaussian matrix. In International Conference on Machine Learning, pages 2101–2110. PMLR, 2016.
  • [20] Michael W. Mahoney. Randomized algorithms for matrices and data. Found. Trends Mach. Learn., 3(2):123–224, February 2011. doi:10.1561/2200000035.
  • [21] Satya N Majumdar and Grégory Schehr. Top eigenvalue of a random matrix: large deviations and third order phase transition. Journal of Statistical Mechanics: Theory and Experiment, 2014(1):P01012, 2014.
  • [22] Robert M May. Will a large complex system be stable? Nature, 238(5364):413–414, 1972.
  • [23] P.A. Mello, P. Pereyra, and N. Kumar. Macroscopic approach to multichannel disordered conductors. Annals of Physics, 181(2):290–317, 1988.
  • [24] Ralf R. Müller. On the asymptotic eigenvalue distribution of concatenated vector-valued fading channels. IEEE Transactions on Information Theory, 48(7):2086–2091, 2002.
  • [25] Shanmugavelayutham Muthukrishnan. Data streams: Algorithms and applications. Now Publishers Inc, 2005.
  • [26] Guillaume Obozinski, Martin J Wainwright, Michael I Jordan, et al. Support union recovery in high-dimensional multivariate regression. The Annals of Statistics, 39(1):1–47, 2011.
  • [27] James C Osborn. Universal results from an alternate random-matrix model for QCD with a baryon chemical potential. Physical Review Letters, 93(22):222001, 2004.
  • [28] G Paladin and A Vulpiani. Scaling law and asymptotic distribution of Lyapunov exponents in conservative dynamical systems with many degrees of freedom. Journal of Physics A: Mathematical and General, 19(10):1881, 1986.
  • [29] Saurabh Paul, Christos Boutsidis, Malik Magdon-Ismail, and Petros Drineas. Random projections for linear support vector machines. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(4):1–25, 2014.
  • [30] Miklós Z. Rácz and Jacob Richey. A smooth transition from Wishart to GOE. Journal of Theoretical Probability, pages 898–906, 2019. doi:10.1007/s10959-018-0808-2.
  • [31] Garvesh Raskutti and Michael W Mahoney. A statistical perspective on randomized sketching for ordinary least-squares. The Journal of Machine Learning Research, 17(1):7508–7538, 2016.
  • [32] Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), pages 143–152. IEEE, 2006.
  • [33] Khalid Shebrawi and Hussein Albadawi. Trace inequalities for matrices. Bulletin of the Australian Mathematical Society, 87(1):139–148, 2013. doi:10.1017/S0004972712000627.
  • [34] Terence Tao. Topics in random matrix theory, volume 132. American Mathematical Soc., 2012.
  • [35] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Yonina C. Eldar and Gitta Kutyniok, editors, Compressed Sensing: Theory and Applications, pages 210–268. Cambridge University Press, 2012. doi:10.1017/CBO9780511794308.006.
  • [36] David P. Woodruff. Sketching as a tool for numerical linear algebra. Found. Trends Theor. Comput. Sci., 10(1–2):1–157, October 2014. doi:10.1561/0400000060.
  • [37] Fan Yang, Sifan Liu, Edgar Dobriban, and David P Woodruff. How to reduce dimension with PCA and random projections? arXiv:2005.00511 [math.ST], 2020.

Appendix A Omitted Calculations in Section 4.1

\[
S_1(p,q)=3\sum_i\operatorname{\mathbb{E}}B_{1i}^4+3\sum_{i\neq j}\operatorname{\mathbb{E}}B_{1i}^2B_{1j}^2=3dT_1(p,d)+3d(d-1)T_4(p,d)
\]
\begin{align*}
S_3(p,q)=\operatorname{\mathbb{E}}A_{11}^2A_{21}^2&=\operatorname{\mathbb{E}}\bigg{(}\sum_iB_{1i}G_{i1}\bigg{)}^2\bigg{(}\sum_kB_{2k}G_{k1}\bigg{)}^2\\
&=\sum_{i,j,k,l}\operatorname{\mathbb{E}}B_{1i}B_{1j}B_{2k}B_{2l}\operatorname{\mathbb{E}}G_{i1}G_{j1}G_{k1}G_{l1}\\
&=3\sum_i\operatorname{\mathbb{E}}B_{1i}^2B_{2i}^2+\sum_{i\neq j}\operatorname{\mathbb{E}}B_{1i}^2B_{2j}^2+2\sum_{i\neq j}\operatorname{\mathbb{E}}B_{1i}B_{2i}B_{1j}B_{2j}\\
&=3dT_3(p,d)+d(d-1)T_5(p,d)+2d(d-1)T_6(p,d)
\end{align*}
\begin{align*}
S_4(p,q)=\operatorname{\mathbb{E}}A_{11}^2A_{12}^2&=\operatorname{\mathbb{E}}\bigg{(}\sum_iB_{1i}G_{i1}\bigg{)}^2\bigg{(}\sum_kB_{1k}G_{k2}\bigg{)}^2\\
&=\sum_{i,j,k,l}\operatorname{\mathbb{E}}B_{1i}B_{1j}B_{1k}B_{1l}\operatorname{\mathbb{E}}G_{i1}G_{j1}G_{k2}G_{l2}\\
&=\sum_i\operatorname{\mathbb{E}}B_{1i}^4+\sum_{i\neq j}\operatorname{\mathbb{E}}B_{1i}^2B_{1j}^2=dT_1(p,d)+d(d-1)T_4(p,d).
\end{align*}
\begin{align*}
S_5(p,q)=\operatorname{\mathbb{E}}A_{11}^2A_{22}^2&=\operatorname{\mathbb{E}}\bigg{(}\sum_iB_{1i}G_{i1}\bigg{)}^2\bigg{(}\sum_kB_{2k}G_{k2}\bigg{)}^2\\
&=\sum_{i,j,k,l}\operatorname{\mathbb{E}}B_{1i}B_{1j}B_{2k}B_{2l}\operatorname{\mathbb{E}}G_{i1}G_{j1}G_{k2}G_{l2}\\
&=\sum_{i,j}\operatorname{\mathbb{E}}B_{1i}^2B_{2j}^2=dT_3(p,d)+d(d-1)T_5(p,d)
\end{align*}
\begin{align*}
S_6(p,q)=\operatorname{\mathbb{E}}A_{11}A_{12}A_{21}A_{22}&=\sum_{i,j,k,l}\operatorname{\mathbb{E}}B_{1i}B_{1j}B_{2k}B_{2l}\operatorname{\mathbb{E}}G_{i1}G_{j2}G_{k1}G_{l2}\\
&=\sum_i\operatorname{\mathbb{E}}B_{1i}^2B_{2i}^2+\sum_{i\neq j}\operatorname{\mathbb{E}}B_{1i}B_{1j}B_{2i}B_{2j}\\
&=dT_3(p,d)+d(d-1)T_6(p,d)
\end{align*}

Appendix B Omitted Calculations in Section 4.2

In Step 1a.

\begin{align*}
B^{(2)}_{r,s}&=\sum_{l,l'}\sum_{u,v,v'}M_{uv}M_{uv'}M_{rl}M_{rl'}\operatorname{\mathbb{E}}X_{us}^2X_{vj}X_{lj}X_{v'j}X_{l'j}\\
&=\underbrace{\sum_{l\neq l'}\sum_uM_{ul}M_{ul'}M_{rl}M_{rl'}}_{v=l\neq v'=l'}+\underbrace{\sum_l\sum_{\begin{subarray}{c}u\\ v\neq l\end{subarray}}M_{uv}^2M_{rl}^2}_{v=v'\neq l=l'}+\underbrace{\sum_{l\neq l'}\sum_uM_{ul'}M_{ul}M_{rl}M_{rl'}}_{v=l'\neq l=v'}\\
&\qquad\qquad\qquad+3\underbrace{\sum_{l,u}M_{ul}^2M_{rl}^2}_{v=v'=l=l'}\\
&=\left(\sum_{u,v}M_{uv}^2\right)\left(\sum_lM_{rl}^2\right)+2\sum_{l,l',u}M_{ul'}M_{ul}M_{rl}M_{rl'}\\
&=\left\|M\right\|_F^2\left\|M_{r,\cdot}\right\|_2^2+2\sum_u\langle M_{u,\cdot},M_{r,\cdot}\rangle^2.
\end{align*}
\begin{align*}
B^{(3)}_{r,s}&=\sum_{j'\neq s}\bigg[\sum_{l,l'}\sum_{u,u',v}M_{uv}M_{u'l'}M_{rl}M_{rl'}\operatorname{\mathbb{E}}X_{us}X_{u's}X_{vs}X_{ls}X_{l'j'}^2\bigg]\\
&=\underbrace{\sum_{l,l'}\sum_{u\neq l}M_{ul}M_{ul'}M_{rl}M_{rl'}}_{u=u'\neq v=l}+\underbrace{\sum_{l,l'}\sum_{u\neq l}M_{uu}M_{ll'}M_{rl}M_{rl'}}_{u=v\neq u'=l}+\underbrace{\sum_{l,l'}\sum_vM_{lv}M_{vl'}M_{rl}M_{rl'}}_{u=l\neq u'=v}\\
&\qquad+3\underbrace{\sum_{l,l'}M_{ll}M_{ll'}M_{rl}M_{rl'}}_{u=u'=v=l}\\
&=\sum_{l,l'}\sum_u(2M_{ul}M_{ul'}+M_{uu}M_{ll'})M_{rl}M_{rl'}\\
&=\sum_{l,l'}(2\langle M_{l,\cdot},M_{l',\cdot}\rangle+\operatorname{tr}(M)M_{ll'})M_{rl}M_{rl'}.
\end{align*}
\begin{align*}
\sum_{r,s}B^{(3)}_{r,s}&=\sum_{r,s}\sum_{l,l'}(2\langle M_{l,\cdot},M_{l',\cdot}\rangle+\operatorname{tr}(M)M_{ll'})M_{rl}M_{rl'}\\
&=q\sum_{l,l'}(2\langle M_{l,\cdot},M_{l',\cdot}\rangle+\operatorname{tr}(M)M_{ll'})\sum_rM_{rl}M_{rl'}\\
&=q\sum_{l,l'}(2\langle M_{l,\cdot},M_{l',\cdot}\rangle+\operatorname{tr}(M)M_{ll'})\langle M_{l,\cdot},M_{l',\cdot}\rangle\\
&=2q\sum_{l,l'}\langle M_{l,\cdot},M_{l',\cdot}\rangle^2+q\operatorname{tr}(M)\sum_{l,l'}M_{ll'}\langle M_{l,\cdot},M_{l',\cdot}\rangle\\
&\leq 2q\operatorname{tr}(M^4)+q\operatorname{tr}(M)\bigg{(}\sum_{l,l'}M_{ll'}^2\bigg{)}^{\frac{1}{2}}\bigg{(}\sum_{l,l'}\langle M_{l,\cdot},M_{l',\cdot}\rangle^2\bigg{)}^{\frac{1}{2}}\\
&\leq 2q\operatorname{tr}(M^4)+q\operatorname{tr}(M)\left\|M\right\|_F\sqrt{\operatorname{tr}(M^4)}.
\end{align*}
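In the penultimate step we applied Cauchy--Schwarz to the second sum; in both of the last two steps we also used the identity $\sum_{l,l'}\langle M_{l,\cdot},M_{l',\cdot}\rangle^2=\|MM^T\|_F^2=\operatorname{tr}(M^4)$, which holds since $M$ is symmetric here.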

In Step 1b.

\begin{align*}
&\sum_i\sum_{\begin{subarray}{c}k,l\\ k',l'\end{subarray}}M_{kl}M_{k'l'}\operatorname{\mathbb{E}}G_{ki}G_{li}G_{k'i}G_{l'i}\\
&=\sum_i\bigg{(}\underbrace{\sum_{k\neq l}M_{kl}^2}_{k=k'\neq l=l'}+\underbrace{\sum_{k\neq l}M_{kl}^2}_{k=l'\neq k'=l}+\underbrace{\sum_{k\neq k'}M_{kk}M_{k'k'}}_{k=l\neq k'=l'}+3\underbrace{\sum_kM_{kk}^2}_{k=k'=l=l'}\bigg{)}\\
&=\sum_i\bigg{(}\sum_{k,l}M_{kl}^2+\sum_{k,l}M_{kl}^2+\sum_{k,k'}M_{kk}M_{k'k'}\bigg{)}\\
&=\sum_i(2\left\|M\right\|_F^2+\operatorname{tr}^2(M))\\
&=2q\left\|M\right\|_F^2+q\operatorname{tr}^2(M).
\end{align*}

In Step 2a.

\begin{align*}
B^{(2)}_{rs}&=\sum_{l,l',u,u',v}M_{uv}M_{u'u'}M_{rl}M_{rl'}\operatorname{\mathbb{E}}X_{ls}X_{l's}X_{us}X_{vs}X_{u'j'}^2\\
&=\underbrace{\sum_{\begin{subarray}{c}l\neq u\\ u'\end{subarray}}M_{uu}M_{u'u'}M_{rl}^2}_{l=l'\neq u=v}+\underbrace{\sum_{\begin{subarray}{c}l\neq l'\\ u'\end{subarray}}M_{ll'}M_{u'u'}M_{rl}M_{rl'}}_{l=u\neq l'=v}+\underbrace{\sum_{\begin{subarray}{c}l\neq l'\\ u'\end{subarray}}M_{ll'}M_{u'u'}M_{rl}M_{rl'}}_{l=v\neq l'=u}\\
&\qquad\qquad\qquad+3\underbrace{\sum_{l,u'}M_{ll}M_{u'u'}M_{rl}^2}_{l=u=l'=v}\\
&=\operatorname{tr}(M)\left(\sum_{l,u}M_{uu}M_{rl}^2+2\sum_{l,l'}M_{ll'}M_{rl}M_{rl'}\right)\\
&=\operatorname{tr}(M)\left(\operatorname{tr}(M)\left\|M_{r,\cdot}\right\|_2^2+2\sum_{l,l'}M_{ll'}M_{rl}M_{rl'}\right)
\end{align*}
\begin{align*}
B^{(4)}_{rs}&=\sum_{l,u,u',v,v'}M_{uv}M_{u'v'}M_{rl}^2\operatorname{\mathbb{E}}X_{uj}X_{vj}X_{u'j}X_{v'j}\\
&=\left(\sum_lM_{rl}^2\right)\left(\underbrace{\sum_{u,v}M_{uv}^2}_{u=u'\neq v=v'}+\underbrace{\sum_{u,v}M_{uv}^2}_{u=v'\neq u'=v}+\underbrace{\sum_{u,v}M_{uv}^2}_{u=v\neq u'=v'}+3\underbrace{\sum_uM_{uu}^2}_{u=v=u'=v'}\right)\\
&=3\left(\sum_lM_{rl}^2\right)\sum_{u,v}M_{u,v}^2\\
&=3\left\|M_{r,\cdot}\right\|_2^2\left\|M\right\|_F^2
\end{align*}

Appendix C Exact Variance when $r=2$

Suppose that $A$ is rotationally invariant under both left- and right-multiplication by an orthogonal matrix. Define

\begin{align*}
U_1(p,q)&=\operatorname{Var}((A^TA)_{ii}^2)\\
U_2(p,q)&=\operatorname{Var}((A^TA)_{ij}^2),\quad i\neq j\\
U_3(p,q)&=\operatorname{cov}((A^TA)_{ii}^2,(A^TA)_{ik}^2),\quad i\neq k\quad\text{(same row, one entry on diagonal)}\\
U_4(p,q)&=\operatorname{cov}((A^TA)_{ij}^2,(A^TA)_{ik}^2),\quad j\neq k\quad\text{(same row, both entries off-diagonal)}\\
U_5(p,q)&=\operatorname{cov}((A^TA)_{ii}^2,(A^TA)_{jj}^2),\quad i\neq j\quad\text{(diff.\ rows and cols, both entries on diagonal)}\\
U_6(p,q)&=\operatorname{cov}((A^TA)_{ii}^2,(A^TA)_{jk}^2),\quad i\neq j\neq k\quad\text{(diff.\ rows and cols, one entry on diagonal)}\\
U_7(p,q)&=\operatorname{cov}((A^TA)_{ij}^2,(A^TA)_{kl}^2),\quad i\neq j\neq k\neq l\quad\text{(diff.\ rows and cols, nonsymmetric around diag.)}
\end{align*}

These quantities are well-defined: by rotational invariance, each depends only on the pattern of coincidences among the indices and not on the particular indices chosen.

\begin{align*}
&\operatorname{Var}(\operatorname{tr}((A^TA)^2))\\
&=\operatorname{Var}\bigg{(}\sum_{i,j}(A^TA)_{ij}^2\bigg{)}\\
&=\sum_{i,j,k,l}\operatorname{cov}((A^TA)_{ij}^2,(A^TA)_{kl}^2)\\
&=\sum_{i,j}\operatorname{Var}((A^TA)_{ij}^2)+2\sum_i\sum_{j\neq l}\operatorname{cov}((A^TA)_{ij}^2,(A^TA)_{il}^2)+\sum_{\begin{subarray}{c}i\neq k\\ j\neq l\end{subarray}}\operatorname{cov}((A^TA)_{ij}^2,(A^TA)_{kl}^2)\\
&=q\operatorname{Var}((A^TA)_{11}^2)+q(q-1)\operatorname{Var}((A^TA)_{12}^2)\\
&\qquad+2\left[2q(q-1)\operatorname{cov}((A^TA)_{11}^2,(A^TA)_{12}^2)+q(q-1)(q-2)\operatorname{cov}((A^TA)_{12}^2,(A^TA)_{13}^2)\right]\\
&\qquad+q(q-1)\operatorname{cov}((A^TA)_{11}^2,(A^TA)_{22}^2)+q(q-1)\operatorname{cov}((A^TA)_{12}^2,(A^TA)_{21}^2)\\
&\qquad\quad+2q(q-1)(q-2)\operatorname{cov}((A^TA)_{11}^2,(A^TA)_{23}^2)\\
&\qquad\quad+2q(q-1)(q-2)\operatorname{cov}((A^TA)_{12}^2,(A^TA)_{31}^2)\\
&\qquad\quad+q(q-1)(q-2)(q-3)\operatorname{cov}((A^TA)_{12}^2,(A^TA)_{34}^2)\\
&=qU_1(p,q)+q(q-1)U_2(p,q)+2q(q-1)(2U_3(p,q)+(q-2)U_4(p,q))\\
&\qquad+q(q-1)(U_5(p,q)+U_2(p,q))+2q(q-1)(q-2)(U_6(p,q)+U_4(p,q))\\
&\qquad+q(q-1)(q-2)(q-3)U_7(p,q)\\
&=qU_1(p,q)+q(q-1)(2U_2(p,q)+4U_3(p,q)+U_5(p,q))\\
&\qquad+2q(q-1)(q-2)(2U_4(p,q)+U_6(p,q))+q(q-1)(q-2)(q-3)U_7(p,q).
\end{align*}

Let us calculate $U_1,\dots,U_7$ for a $p\times q$ Gaussian random matrix $G$.

\begin{align*}
U_1(p,q)=\operatorname{\mathbb{E}}(G^TG)_{11}^4-(\operatorname{\mathbb{E}}(G^TG)_{11}^2)^2&=\operatorname{\mathbb{E}}\left\|G_1\right\|_2^8-(\operatorname{\mathbb{E}}\left\|G_1\right\|_2^4)^2\\
&=p(p+2)(p+4)(p+6)-(p(p+2))^2\\
&=8p(p+2)(p+3)
\end{align*}
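Here and below we use the standard moments of $\|G_1\|_2^2\sim\chi_p^2$:
\[
\operatorname{\mathbb{E}}\left\|G_1\right\|_2^{2k}=p(p+2)(p+4)\cdots(p+2(k-1)),\qquad k=1,2,3,\ldots,
\]
so in particular $\operatorname{\mathbb{E}}\|G_1\|_2^4=p(p+2)$, $\operatorname{\mathbb{E}}\|G_1\|_2^6=p(p+2)(p+4)$, and $\operatorname{\mathbb{E}}\|G_1\|_2^8=p(p+2)(p+4)(p+6)$.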
\begin{align*}
U_2(p,q)=\operatorname{\mathbb{E}}(G^TG)_{12}^4-(\operatorname{\mathbb{E}}(G^TG)_{12}^2)^2&=\operatorname{\mathbb{E}}\left(\sum_rG_{r1}G_{r2}\right)^4-(\operatorname{\mathbb{E}}\langle G_1,G_2\rangle^2)^2\\
&=\sum_{r,s,t,u}\operatorname{\mathbb{E}}G_{r1}G_{s1}G_{t1}G_{u1}G_{r2}G_{s2}G_{t2}G_{u2}-p^2\\
&=3\sum_{r\neq t}\operatorname{\mathbb{E}}G_{r1}^2G_{t1}^2G_{r2}^2G_{t2}^2+\sum_r\operatorname{\mathbb{E}}G_{r1}^4G_{r2}^4-p^2\\
&=3p(p-1)+9p-p^2=2p(p+3).
\end{align*}
\begin{align*}
U_3(p,q)&=\operatorname{\mathbb{E}}(G^TG)_{11}^2(G^TG)_{12}^2-\operatorname{\mathbb{E}}(G^TG)_{11}^2\operatorname{\mathbb{E}}(G^TG)_{12}^2\\
&=\operatorname{\mathbb{E}}(G_1^TG_1)^2G_1^TG_2G_2^TG_1-\operatorname{\mathbb{E}}\left\|G_1\right\|_2^4\operatorname{\mathbb{E}}\langle G_1,G_2\rangle^2\\
&=\operatorname{\mathbb{E}}(G_1^TG_1)^2G_1^T(\operatorname{\mathbb{E}}G_2G_2^T)G_1-p(p+2)\cdot p\\
&=\operatorname{\mathbb{E}}(G_1^TG_1)^3-p^2(p+2)\\
&=\operatorname{\mathbb{E}}\left\|G_1\right\|_2^6-p^2(p+2)=p(p+2)(p+4)-p^2(p+2)=4p(p+2)
\end{align*}
\begin{align*}
U_4(p,q)&=\operatorname{\mathbb{E}}(G^TG)_{12}^2(G^TG)_{13}^2-\operatorname{\mathbb{E}}(G^TG)_{12}^2\operatorname{\mathbb{E}}(G^TG)_{13}^2\\
&=\operatorname{\mathbb{E}}G_1^TG_2G_2^TG_1G_1^TG_3G_3^TG_1-p^2\\
&=\operatorname{\mathbb{E}}\left[G_1^T\operatorname{\mathbb{E}}(G_2G_2^T)G_1\,G_1^T\operatorname{\mathbb{E}}(G_3G_3^T)G_1\right]-p^2\\
&=\operatorname{\mathbb{E}}(G_1^TG_1)^2-p^2=\operatorname{\mathbb{E}}\left\|G_1\right\|_2^4-p^2=p(p+2)-p^2=2p
\end{align*}
\begin{align*}
U_5(p,q)&=\operatorname{\mathbb{E}}(G^TG)_{11}^2(G^TG)_{22}^2-\operatorname{\mathbb{E}}(G^TG)_{11}^2\operatorname{\mathbb{E}}(G^TG)_{22}^2\\
&=\operatorname{\mathbb{E}}\left\|G_1\right\|_2^4\left\|G_2\right\|_2^4-\operatorname{\mathbb{E}}\left\|G_1\right\|_2^4\operatorname{\mathbb{E}}\left\|G_2\right\|_2^4=0\\
U_6(p,q)&=\operatorname{\mathbb{E}}(G^TG)_{11}^2(G^TG)_{23}^2-\operatorname{\mathbb{E}}(G^TG)_{11}^2\operatorname{\mathbb{E}}(G^TG)_{23}^2\\
&=\operatorname{\mathbb{E}}\left\|G_1\right\|_2^4\langle G_2,G_3\rangle^2-\operatorname{\mathbb{E}}\left\|G_1\right\|_2^4\operatorname{\mathbb{E}}\langle G_2,G_3\rangle^2=0\\
U_7(p,q)&=\operatorname{\mathbb{E}}(G^TG)_{12}^2(G^TG)_{34}^2-\operatorname{\mathbb{E}}(G^TG)_{12}^2\operatorname{\mathbb{E}}(G^TG)_{34}^2\\
&=\operatorname{\mathbb{E}}\langle G_1,G_2\rangle^2\langle G_3,G_4\rangle^2-\operatorname{\mathbb{E}}\langle G_1,G_2\rangle^2\operatorname{\mathbb{E}}\langle G_3,G_4\rangle^2=0
\end{align*}
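The last three quantities vanish because the entries involved are functions of disjoint, independent sets of columns of $G$: for instance, $\|G_1\|_2^4$ and $\langle G_2,G_3\rangle^2$ are independent, so their covariance is zero.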

Therefore

\begin{align*}
\operatorname{Var}(\operatorname{tr}((G^TG)^2))&=qU_1+q(q-1)(2U_2+4U_3+U_5)+2q(q-1)(q-2)(2U_4+U_6)\\
&\qquad+q(q-1)(q-2)(q-3)U_7\\
&=qU_1+q(q-1)(2U_2+4U_3)+4q(q-1)(q-2)U_4\\
&=4pq(5+5p+5q+2p^2+5pq+2q^2).
\end{align*}
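As a sanity check on the algebra above, one can estimate this variance by simulation (an illustration of ours; the parameters are arbitrary):

```python
import numpy as np

# Monte Carlo check (illustration only) of the closed form
#   Var(tr((G^T G)^2)) = 4pq(5 + 5p + 5q + 2p^2 + 5pq + 2q^2)
# for a p x q matrix G of i.i.d. N(0,1) entries.
rng = np.random.default_rng(1)
p, q, trials = 4, 3, 200_000

vals = np.empty(trials)
for t in range(trials):
    G = rng.standard_normal((p, q))
    W = G.T @ G
    vals[t] = np.trace(W @ W)

pred = 4 * p * q * (5 + 5 * p + 5 * q + 2 * p**2 + 5 * p * q + 2 * q**2)
print(vals.var(), pred)  # the two should agree up to Monte Carlo error
```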

When $r=2$, recalling that $\operatorname{\mathbb{E}}(A_2-A_1)=(1+o(1))p^2q^2/d^3$ (see (4)), we have that

\[
\frac{\sqrt{\operatorname{Var}\left(\operatorname{tr}\left(\left(\frac{1}{\sqrt{d}}G^T\cdot\frac{1}{\sqrt{d}}G\right)^2\right)\right)}}{p^2q^2/d^3}\leq\frac{6d}{\max\{p,q\}^{\frac{1}{2}}\min\{p,q\}^{\frac{3}{2}}}.
\]

If the right-hand side above is at most a small constant $c$, then we can distinguish $A_2$ from $A_1$ with at least constant probability.
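To illustrate the resulting distinguisher (our own experiment with assumed parameters, not code from the paper): for $d$ well below $\max\{p,q\}^{1/2}\min\{p,q\}^{3/2}$, the statistic $\operatorname{tr}((X^TX)^2)$ visibly separates the two distributions.

```python
import numpy as np

# Illustrative experiment (parameters are our assumptions): when d is well
# below max{p,q}^(1/2) min{p,q}^(3/2), the statistic tr((X^T X)^2) separates
# X = G1 G2 / sqrt(d) (a scaled product of Gaussians, entries of variance 1)
# from a p x q matrix of i.i.d. N(0,1) entries.
rng = np.random.default_rng(2)
p, q, d, trials = 30, 30, 100, 2000  # here d = 100 << 30^(1/2) * 30^(3/2) = 900

def stat(X):
    W = X.T @ X
    return np.trace(W @ W)

prod_vals = [stat(rng.standard_normal((p, d)) @ rng.standard_normal((d, q)) / np.sqrt(d))
             for _ in range(trials)]
gauss_vals = [stat(rng.standard_normal((p, q))) for _ in range(trials)]

# The two means differ by roughly p^2 q^2 / d, which dominates the standard
# deviation of the statistic in this regime, so thresholding at the midpoint
# succeeds with constant probability.
print(np.mean(prod_vals), np.mean(gauss_vals))
```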