The Product of Gaussian Matrices is Close to Gaussian
Abstract
We study the distribution of the matrix product of independent Gaussian matrices of various sizes, where is , and we denote , , and require . Here the entries in each are standard normal random variables with mean $0$ and variance $1$. Such products arise in the study of wireless communication, dynamical systems, and quantum transport, among other places. We show that, provided each , , satisfies , where for a constant depending on , then the matrix product has variation distance at most to a matrix of i.i.d. standard normal random variables with mean and variance . Here as . Moreover, we show a converse for constant that if for some , then this total variation distance is at least , for an absolute constant depending on and . This converse is best possible when .
1 Introduction
Random matrices play a central role in many areas of theoretical, applied, and computational mathematics. One particular application is dimensionality reduction, whereby one often chooses a rectangular random matrix , , and computes for a fixed vector . Indeed, this is the setting in compressed sensing and sparse recovery [12], randomized numerical linear algebra [18, 20, 36], and sketching algorithms for data streams [25]. Often is chosen to be a Gaussian matrix, and in particular, an matrix with entries that are i.i.d. normal random variables with mean and variance , denoted by . Indeed, in compressed sensing, such matrices can be shown to satisfy the Restricted Isometry Property (RIP) [10], while in randomized numerical linear algebra, in certain applications such as support vector machines [29] and non-negative matrix factorization [19], their performance is shown to often outperform that of other sketching matrices.
Our focus in this paper will be on understanding the product of two or more Gaussian matrices. Such products arise naturally in different applications. For example, in the over-constrained ridge regression problem , the design matrix , , is itself often assumed to be Gaussian (see, e.g., [26]). In this case, the “sketch-and-solve” algorithmic framework for regression [32] would compute and for an Gaussian matrix with , where is the so-called statistical dimension [2], and solve for the which minimizes . While computing is slower than computing the corresponding matrix product for other kinds of sketching matrices , it often has application-specific [29, 19] as well as statistical benefits [31]. Notice that is the product of two independent Gaussian matrices, and in particular, has a small number of rows while has a small number of columns – this is precisely the rectangular case we will study below. Another application in randomized numerical linear algebra where the product of two Gaussian matrices arises is when one computes the product of a Gaussian sketching matrix and Gaussian noise in a spiked identity covariance model [37].
The product of two or more Gaussian matrices also arises in diverse fields such as multiple-input multiple-output (MIMO) wireless communication channels [24]. Indeed, similar to the above regression problem in which one wants to reconstruct an underlying vector , in such settings one observes the vector where is the transmitted signal and is background noise. This setting corresponds to the situation in which there are scattering environments separated by major obstacles, and the dimensions of the correspond to the number of “keyholes” [24]. To determine the mutual information of this channel, one needs to understand the singular values of the matrix . If one can show the distribution of this product is close to that of a Gaussian distribution in total variation distance, then one can use the wide range of results known for the spectrum of a single Gaussian matrix (see, e.g., [35]). Other applications of products of Gaussian matrices include disordered spin chains [11, 3, 15], stability of large complex dynamical systems [22, 21], symplectic maps and Hamiltonian mechanics [11, 4, 28], quantum transport in disordered wires [23, 13], and quantum chromodynamics [27]; we refer the reader to [14, 1] for an overview.
The main question we ask in this work is:
What is the distribution of the product of independent Gaussian matrices of various sizes, where is ?
Our main interest in the question above will be when has a small number of rows, and has a small number of columns. Despite the large body of work on random matrix theory (see, e.g., [34] for a survey), we are not aware of any work which attempts to bound the total variation distance of the entire distribution of to a Gaussian distribution itself.
1.1 Our Results
Formally, we consider the problem of distinguishing the product of normalized Gaussian matrices
from a single normalized Gaussian matrix
We show that, when is a constant, with constant probability we cannot distinguish the distributions of these two random matrices when for all ; and, conversely, with constant probability, we can distinguish these two distributions when the are not large enough.
Theorem 1 (Main theorem).
Suppose that for all and .
(a) It holds that
where denotes the total variation distance between and , and is an absolute constant.
(b) If further satisfy that
where is an absolute constant, then .
Part (a) states that when for all for a constant depending on . The converse in (b) implies that when for some for a constant depending on . When and is a constant, we obtain a dichotomy (up to a constant factor) for the conditions on and .
1.2 Our Techniques
Upper Bound.
We start by explaining our main insight as to why the distribution of a product of a matrix of i.i.d. random variables and a matrix of i.i.d. random variables has low variation distance to the distribution of a matrix of i.i.d. random variables. One could try to directly understand the probability density function as was done in the case of Wishart matrices in [7, 30], which corresponds to the setting when . However, there are certain algebraic simplifications in the case of the Wishart distribution that seem much less tractable when manipulating the density function of the product of independent Gaussians [9]. Another approach would be to try to use entropic methods as in [8, 6]. Such arguments try to reveal entries of the product one-by-one, arguing that for most conditionings of previous entries, the new entry still looks like an independent Gaussian. However, the entries are clearly not independent – if has large absolute value, then is more likely to be large in absolute value, as it could indicate that the -th row of has large norm. One could try to first condition on the norms of all rows of and columns of , but additional issues arise when one looks at submatrices: if and are all large, then it could mean the -th row of and the -th row of are correlated with each other, since they both are correlated with the -th column of . Consequently, since is large, it could make it more likely that has large absolute value. This makes the entropic method difficult to apply in this context.
Our upper bound instead leverages beautiful work of Jiang [16] and Jiang and Ma [17] which bounds the total variation distance between the distribution of an submatrix of a random orthogonal matrix (orthonormal rows and columns) and an matrix with i.i.d. entries. Their work shows that if as , then the total variation distance between these two matrix ensembles goes to $0$. It is not immediately clear how to apply such results in our context. First of all, which submatrix should we be looking at? Note, though, that if is a uniformly random (Haar measure) matrix with orthonormal rows, and is a uniformly random matrix with orthonormal columns, then by rotational invariance, is identically distributed to a submatrix of a random orthonormal matrix. Thus, setting and in the above results, we conclude that is close in variation distance to a matrix with i.i.d. entries. Given and , one could then write them in their singular value decomposition, obtaining and . Then and are independent and well-known to be uniformly random and orthonormal matrices, respectively. Thus is close in total variation distance to . However, this does not immediately help either, as it is not clear what the distribution of this matrix is. Instead, the “right” way to utilize the results above is to (1) observe that is identically distributed as , where is a matrix of i.i.d. normal random variables, given the rotational invariance of the Gaussian distribution. Then (2) is itself close to a product where is a random matrix with orthonormal rows, and is a random matrix with orthonormal columns, by the above results. Thus, is close to . Then (3) has the same distribution as , so is close to , where and are identically distributed, and is independent of . Finally, (4) is identically distributed as a matrix of standard normal random variables because is Gaussian and has orthonormal columns, by rotational invariance of the Gaussian distribution.
We hope that this provides a general method for arguments involving Gaussian matrices - in step (2) we had the quantity , where was a Gaussian matrix, and then viewed as a product of a short-fat random orthonormal matrix and a tall-thin random orthonormal matrix . Our proof for the product of more than matrices recursively uses similar ideas, and bounds the growth in variation distance as a function of the number of matrices involved in the product.
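As a sanity check on this reduction (not part of the paper's argument), the following Python sketch compares the entrywise distribution of a normalized product of two Gaussian matrices with that of a single Gaussian matrix. The dimensions, number of trials, and use of NumPy/SciPy are illustrative assumptions, and a two-sample Kolmogorov–Smirnov test on pooled entries is only a heuristic proxy for closeness in total variation.

```python
import numpy as np
from scipy import stats

# Minimal numerical sanity check (not from the paper): compare entries of the
# normalized product (G1 G2)/sqrt(d) with entries of a single Gaussian matrix.
rng = np.random.default_rng(0)
p, q, d = 5, 5, 4000          # intermediate dimension d is much larger than p*q
trials = 200

prod_entries, gauss_entries = [], []
for _ in range(trials):
    G1 = rng.standard_normal((p, d))
    G2 = rng.standard_normal((d, q))
    prod_entries.append((G1 @ G2 / np.sqrt(d)).ravel())    # normalized product
    gauss_entries.append(rng.standard_normal((p, q)).ravel())

prod_entries = np.concatenate(prod_entries)
gauss_entries = np.concatenate(gauss_entries)

# Large p-value is consistent with (but does not prove) the indistinguishability
# regime of Theorem 1(a); the test only compares marginal entry distributions.
print(stats.ks_2samp(prod_entries, gauss_entries))
```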
Lower Bound.
For our lower bound for constant , we show that the fourth power of the Schatten 4-norm of a matrix $A$, namely, $\|A\|_{S_4}^4 = \operatorname{tr}\big((A^\top A)^2\big) = \sum_i \sigma_i(A)^4$, can be used to distinguish a product of Gaussian matrices and a single Gaussian matrix . We use Chebyshev’s inequality, for which we need to find the expectation and variance of for and .
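As a quick illustration (not from the paper), the following Python snippet numerically verifies the identity $\|A\|_{S_4}^4 = \sum_i \sigma_i(A)^4 = \operatorname{tr}((A^\top A)^2)$ for the statistic used here; the matrix size, seed, and NumPy usage are illustrative assumptions.

```python
import numpy as np

# Check that tr((A^T A)^2) equals the sum of fourth powers of singular values.
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 9))          # illustrative size

via_trace = np.trace((A.T @ A) @ (A.T @ A))
via_svd = (np.linalg.svd(A, compute_uv=False) ** 4).sum()
print(np.isclose(via_trace, via_svd))    # True
```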
Let us consider the expectation first. An idea is to calculate the expectation recursively, that is, for a fixed matrix and a Gaussian random matrix we express in terms of . The real situation turns out to be slightly more complicated. Instead of expressing in terms of directly, we decompose into the sum of expectations of a few functions in terms of , say,
and build up the recurrence relations for in terms of , , …, . It turns out that the recurrence relations are all linear, i.e.,
(1)
whence we can solve for and obtain the desired expectation .
Now we turn to the variance. One could try to apply the same idea of finding recurrence relations for (where ), but it quickly becomes intractable for the term , as it involves products of eight entries of , all of which need to be handled carefully so as to avoid loose bounds; note that the subtraction of is critical for obtaining a small upper bound on , and thus loose bounds on would not suffice. For a tractable calculation, we keep the products of entries of to th order throughout, without involving any terms of th order. To do so, we invoke the law of total variance,
\[ \operatorname{Var}(X) = \mathbb{E}\big[\operatorname{Var}(X \mid Y)\big] + \operatorname{Var}\big(\mathbb{E}[X \mid Y]\big). \tag{2} \]
For the first term on the right-hand side, we use Poincaré’s inequality to upper bound it. Poincaré’s inequality for the Gaussian measure states that for a differentiable function $f$ on $\mathbb{R}^n$ and a standard Gaussian vector $Z$ in $\mathbb{R}^n$, $\operatorname{Var}\big(f(Z)\big) \le \mathbb{E}\,\big\|\nabla f(Z)\big\|_2^2$.
Here we can simply let and calculate . This is tractable since involves the products of at most entries of , and we can use the recursive idea for the expectation above to express
for a few functions ’s and establish a recurrence relation for each .
The second term on the right-hand side of (2) can be dealt with by plugging in (1), and turns out to depend on a new quantity . We again apply the recursive idea and the law of total variance to
Again, the first term on the right-hand side can be handled by Poincaré’s inequality and the second term turns out to depend on , which is crucial. We have now obtained a double recurrence involving inequalities on and , from which we can solve for an upper bound on . This upper bound, however, grows exponentially in , which is impossible to improve due to our use of Poincaré’s inequality.
2 Preliminaries
Notation.
For a random variable and a probability distribution , we use to denote that follows the distribution . For two random variables and defined on the same sample space, we write if and are identically distributed.
We use to denote the distribution of Gaussian random matrices of i.i.d. entries and to denote the uniform distribution (Haar) of an random matrix with orthonormal rows. For a distribution on a linear space and a scaling factor , we use to denote the distribution of , where .
For two probability measures $\mu$ and $\nu$ defined on the same Borel $\sigma$-algebra, the total variation distance between $\mu$ and $\nu$ is defined as
\[ d_{\mathrm{TV}}(\mu, \nu) = \sup_{A} \lvert \mu(A) - \nu(A) \rvert, \]
where the supremum is taken over all measurable sets $A$.
If $\mu$ is absolutely continuous with respect to $\nu$, one can define the Kullback–Leibler divergence between $\mu$ and $\nu$ as
\[ D_{\mathrm{KL}}(\mu \,\|\, \nu) = \int \log\frac{d\mu}{d\nu}\, d\mu. \]
If $\mu$ is not absolutely continuous with respect to $\nu$, we define $D_{\mathrm{KL}}(\mu \,\|\, \nu) = +\infty$.
When $\mu$ and $\nu$ are the distributions of two random variables $X$ and $Y$, respectively, we also write $d_{\mathrm{TV}}(\mu, \nu)$ and $D_{\mathrm{KL}}(\mu \,\|\, \nu)$ as $d_{\mathrm{TV}}(X, Y)$ and $D_{\mathrm{KL}}(X \,\|\, Y)$, respectively.
The following is the well-known relation between the Kullback-Leibler divergence and the total variation distance between two probability measures.
Lemma 1 (Pinsker’s Inequality [5, Theorem 4.19]).
$d_{\mathrm{TV}}(\mu, \nu) \le \sqrt{\tfrac{1}{2}\, D_{\mathrm{KL}}(\mu \,\|\, \nu)}$.
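For intuition, here is a small numerical illustration (not from the paper) of Pinsker's inequality for two unit-variance Gaussians, using the closed forms $D_{\mathrm{KL}}(N(0,1)\,\|\,N(\mu,1)) = \mu^2/2$ and $d_{\mathrm{TV}} = 2\Phi(|\mu|/2) - 1$; the value of $\mu$ and the SciPy usage are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

# Illustration of Pinsker's inequality (Lemma 1) for N(0,1) versus N(mu,1).
mu = 0.7                              # arbitrary illustrative shift
kl = mu ** 2 / 2                      # KL( N(0,1) || N(mu,1) ) = mu^2 / 2
tv = 2 * norm.cdf(abs(mu) / 2) - 1    # exact total variation distance
print(tv, np.sqrt(kl / 2), tv <= np.sqrt(kl / 2))   # inequality holds (True)
```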
The following result, concerning the distance between the submatrix of a properly scaled Gaussian random matrix and a submatrix of a random orthogonal matrix, is due to Jiang and Ma [17].
Lemma 2 ([17]).
Let and . Suppose that and is the top-left block of and the top-left block of . Then
(3)
where is an absolute constant.
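The following Python sketch (illustrative only, not a procedure from [17]) samples a Haar orthogonal matrix via the QR decomposition of a Gaussian matrix, scales a small top-left block by $\sqrt{n}$, and compares its entries with the standard normal distribution; the sizes $n$, $p$, $q$ and the number of trials are assumptions chosen for a quick experiment.

```python
import numpy as np
from scipy import stats

# Numerical illustration of Lemma 2 (Jiang-Ma): a small top-left block of a
# Haar orthogonal matrix, scaled by sqrt(n), looks like an i.i.d. Gaussian
# block when the block is much smaller than n.
rng = np.random.default_rng(2)
n, p, q, trials = 500, 3, 3, 200

block_entries = []
for _ in range(trials):
    # Haar orthogonal matrix via QR of a Gaussian matrix (sign-corrected).
    Z = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(Z)
    Q = Q * np.sign(np.diag(R))
    block_entries.append((np.sqrt(n) * Q[:p, :q]).ravel())

block_entries = np.concatenate(block_entries)
# Compare the pooled block entries against the standard normal distribution.
print(stats.kstest(block_entries, "norm"))
```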
Useful Inequalities.
We list two useful inequalities below.
Lemma 3 (Poincaré’s inequality for Gaussian measure [5, Theorem 3.20]).
Let $\gamma_n$ be the standard $n$-dimensional Gaussian distribution and $f\colon \mathbb{R}^n \to \mathbb{R}$ be any continuously differentiable function. Then
\[ \operatorname{Var}_{X \sim \gamma_n}\big(f(X)\big) \le \mathbb{E}_{X \sim \gamma_n}\big[\|\nabla f(X)\|_2^2\big]. \]
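A Monte Carlo check of the Gaussian Poincaré inequality for one illustrative smooth function (the function, dimension, and sample size are assumptions, not from the paper):

```python
import numpy as np

# Monte Carlo check of Lemma 3 for f(x) = sum_i sin(x_i).
rng = np.random.default_rng(3)
n, samples = 20, 200_000
X = rng.standard_normal((samples, n))

f_vals = np.sin(X).sum(axis=1)             # f(X)
grad_sq = (np.cos(X) ** 2).sum(axis=1)     # ||grad f(X)||^2

print(f_vals.var(), grad_sq.mean())        # Var f(X) <= E ||grad f(X)||^2
```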
Lemma 4 (Trace inequality, [33]).
Let and be symmetric, positive semidefinite matrices and be a positive integer. Then
3 Upper Bound
Let be an integer. Suppose that are independent Gaussian random matrices, where and , and . Consider the product of normalized Gaussian matrices
and a single normalized Gaussian random matrix
where . In this section, we shall show that when for all , we cannot distinguish from with constant probability.
For notational convenience, let for and . Assume that for some constant for all . Our goal is to bound the total variation distance between the matrix product and the product of two matrices.
Lemma 5.
Let be positive integers satisfying that and for some constant . Suppose that , , and . Further suppose that and are independent. Let be independent of , and . Then
where is an absolute constant.
Proof.
Let be its singular value decomposition, where has dimension . Then
where is a random matrix of i.i.d. entries. Suppose that consists of the top rows of . Then
Note that and are independent of and . It follows from Lemma 2 that
where is an absolute constant. The result follows from Pinsker’s inequality (Lemma 1). ∎
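The rotational-invariance fact used in this proof, namely that multiplying an i.i.d. Gaussian matrix on the left by a fixed matrix with orthonormal rows again yields an i.i.d. Gaussian matrix, can be checked numerically as follows (all sizes and the NumPy/SciPy usage are illustrative assumptions):

```python
import numpy as np
from scipy import stats

# If L has orthonormal rows, then L @ G has i.i.d. N(0,1) entries whenever G does.
rng = np.random.default_rng(4)
k, n, m, trials = 4, 50, 6, 500

# A fixed matrix with orthonormal rows, obtained from the QR of a Gaussian matrix.
L = np.linalg.qr(rng.standard_normal((n, k)))[0].T    # k x n, with L @ L.T = I_k

entries = np.concatenate([
    (L @ rng.standard_normal((n, m))).ravel() for _ in range(trials)
])
print(stats.kstest(entries, "norm"))   # consistent with standard normal entries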
The next theorem follows from the above.
Theorem 2.
It holds that
where is an absolute constant.
Proof.
Let and , independent from each other and from the ’s. Applying the preceding lemma with , and , we have
Next, applying the preceding lemma with , and , we have
Iterating this procedure, we have in the end that
Since , and are independent and , it holds that . Therefore,
Repeating the same argument for , we obtain the following corollary immediately.
Corollary 1.
It holds that
where is an absolute constant.
4 Lower Bound
Suppose that is a constant. We shall show that one can distinguish the product of Gaussian random matrices
from one Gaussian random matrix
when the intermediate dimensions are not large enough. Considering , it suffices to show that one can distinguish and with a constant probability for constant . By Chebyshev’s inequality, it suffices to show that
for a small constant . We calculate that:
Lemma 6.
Suppose that is a constant, for all . When ,
Lemma 7.
Suppose that is a constant, for all . There exists an absolute constant such that, when are sufficiently large,
We conclude with the following theorem, which can be seen as a tight converse to Corollary 1 up to a constant factor on the conditions for .
Theorem 3.
Suppose that is a constant and for all . Further suppose that . When are sufficiently large and satisfy that
where is some absolute constant, with probability at least , one can distinguish from .
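A hedged sketch of the resulting test (not the paper's exact procedure): estimate the two means of the statistic $\operatorname{tr}\big((M^\top M)^2\big)$ by Monte Carlo rather than via the closed-form expressions of Sections 4.1 and 4.2, and threshold halfway between them. All sizes, the seed, and the number of trials are illustrative assumptions.

```python
import numpy as np

# Threshold test on the Schatten-4 statistic, with means estimated empirically.
rng = np.random.default_rng(5)
p, q, d, trials = 8, 8, 40, 4000      # d < p*q: the distinguishable regime

def stat(M):
    """T(M) = tr((M^T M)^2), the fourth power of the Schatten 4-norm."""
    B = M.T @ M
    return float(np.trace(B @ B))

t_prod = np.array([
    stat(rng.standard_normal((p, d)) @ rng.standard_normal((d, q)) / np.sqrt(d))
    for _ in range(trials)
])
t_gauss = np.array([stat(rng.standard_normal((p, q))) for _ in range(trials)])

threshold = (t_prod.mean() + t_gauss.mean()) / 2
# Empirical distinguishing advantage of the threshold test.
adv = abs((t_prod > threshold).mean() - (t_gauss > threshold).mean())
print("advantage of threshold test:", adv)
```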
4.1 Calculation of the Mean
Suppose that is a random matrix, and is rotationally invariant under left- and right-multiplication by orthogonal matrices. We define
Since is left- and right-invariant under rotations, these quantities are well-defined. Then
and
When , we have
When , we have
and so
Next, consider , where is a random matrix and a random matrix of i.i.d. entries. The following proposition is easy to verify, and its proof is postponed to the appendix.
Proposition 1.
It holds that .
Proof.
We have
and
Suppose that the associated functions of are named . Then we can calculate that (detailed calculations can be found in Appendix A)
It is clear that depend only on (not on and ) if do so. Furthermore, if then we have and thus . If then and . Hence, if then . We can verify that all these conditions are satisfied with one Gaussian matrix and we can iterate it to obtain these quantities for the product of Gaussian matrices with intermediate dimensions . We have that
Therefore, normalizing the -th matrix by , that is,
we have for constant that
(4)
4.2 Calculation of the Variance
Let be a random symmetric matrix, and let be a random matrix of i.i.d. entries. We want to find the variance of . The detailed calculations of some steps can be found in Appendix B.
Our starting point is the law of total variance, which states that, for random variables $X$ and $Y$,
\[ \operatorname{Var}(X) = \mathbb{E}\big[\operatorname{Var}(X \mid Y)\big] + \operatorname{Var}\big(\mathbb{E}[X \mid Y]\big). \tag{5} \]
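For concreteness, here is a nested Monte Carlo check of this identity in a toy setting (the mixture model, sample sizes, and NumPy usage are illustrative assumptions, not the quantities used in the proof):

```python
import numpy as np

# Law of total variance (5) in a toy model: X = S * Z with S, Z independent,
# conditioning on Y = S.
rng = np.random.default_rng(6)
outer, inner = 2000, 2000

S = rng.exponential(size=outer)            # the conditioning variable
Z = rng.standard_normal((outer, inner))
X = S[:, None] * Z                          # given S, X ~ N(0, S^2)

cond_mean = X.mean(axis=1)                  # estimates E[X | S] per row
cond_var = X.var(axis=1)                    # estimates Var(X | S) per row

lhs = X.ravel().var()                       # Var(X)
rhs = cond_var.mean() + cond_mean.var()     # E[Var(X|S)] + Var(E[X|S])
print(lhs, rhs)                             # approximately equal
```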
Step 1a.
We shall handle each term separately. Consider the first term, which we shall bound using the Poincaré inequality for Gaussian measures. Define , where . We shall calculate .
Then
Note that
we have that
Next we calculate when is i.i.d. Gaussian.
We discuss different cases of .
When , it must hold that , and for a possible nonzero contribution, and the total contribution in this case is at most , where
When , it must hold that for a possible nonzero contribution, and the total contribution in this case is at most , where
When , it must hold that for possible nonzero contribution, and the total contribution in this case is at most , where
When , the nonzero contribution is
Since needs to be paired, the only case which is not covered by and is when , and , in which case the contribution is at most
Hence
It follows that
Note that . Hence
By the Gaussian Poincaré inequality,
(6)
For the terms on the right-hand side, we calculate that (using the trace inequality (Lemma 4))
and
This implies that each term on the right-hand side of (6) grows geometrically.
Step 1b.
Next we deal with the second term in (5). We have
When , for non-zero contribution, it must hold that and and thus the nonzero contribution is
When , the contribution is
(7)
Hence
and when is random,
(8)
Step 2a.
Consider the term on the right-hand side of (8). To bound this term, we examine the variance of , where . We shall again calculate . Note that
and
We have
Next we calculate when is i.i.d. Gaussian.
In order for the expectation in the summand to be non-zero, we must have one of the following cases: (1) , (2) , (3) , (4) , (5) . We calculate the contribution in each case below.
Case 1: it must hold that , and . The contribution is , where
Case 2: it must hold that . The contribution is , where
Case 3: this gives the same bound as Case 2.
Case 4: it must hold that . The contribution is , where
Case 5: the contribution is , where
The only uncovered case is , , and its symmetries. In such a case the contribution is at most
Note that
Therefore,
By Poincaré’s inequality,
(9)
Similar to before, each term on the right-hand side grows geometrically.
Step 2b.
Next we deal with .
When , for non-zero contribution, it must hold that and and thus the nonzero contribution is
When , the contribution is (this is exactly the same as (7) in Step 1b.)
Hence
and when is random,
(10)
Step 3.
Let denote the variance of and the variance of . Combining (5), (6), (8), (9), (10), we have the following recurrence relations, where are absolute constants.
In the base case, set (the identity matrix in (6)) and note that the second term in (5) vanishes. We see that after proper normalization. (Alternatively we can calculate this precisely, see Appendix C.) Similarly we have . Note that . Now, we can solve that
for some absolute constant .
References
- [1] Gernot Akemann and Jesper R Ipsen. Recent exact and asymptotic results for products of independent random matrices. Acta Physica Polonica B, pages 1747–1784, 2015.
- [2] Ahmed El Alaoui and Michael W. Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, page 775–783, Cambridge, MA, USA, 2015. MIT Press.
- [3] Richard Bellman. Limit theorems for non-commutative operations. I. Duke Mathematical Journal, 21(3):491–500, 1954.
- [4] Giancarlo Benettin. Power-law behavior of Lyapunov exponents in some conservative dynamical systems. Physica D: Nonlinear Phenomena, 13(1-2):211–220, 1984.
- [5] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
- [6] Matthew Brennan, Guy Bresler, and Dheeraj Nagaraj. Phase transitions for detecting latent geometry in random graphs. Probability Theory and Related Fields, pages 1215–1289, 2020. doi:10.1007/s00440-020-00998-3.
- [7] Sébastien Bubeck, Jian Ding, Ronen Eldan, and Miklós Z. Rácz. Testing for high-dimensional geometry in random graphs. Random Structures & Algorithms, 49(3):503–532, 2016. doi:10.1002/rsa.20633.
- [8] Sébastien Bubeck and Shirshendu Ganguly. Entropic CLT and Phase Transition in High-dimensional Wishart Matrices. International Mathematics Research Notices, 2018(2):588–606, 12 2016. doi:10.1093/imrn/rnw243.
- [9] Z. Burda, R. A. Janik, and B. Waclaw. Spectrum of the product of independent random Gaussian matrices. Physical Review E, 81(4), Apr 2010. doi:10.1103/physreve.81.041132.
- [10] Emmanuel J Candes. The restricted isometry property and its implications for compressed sensing. Comptes rendus mathematique, 346(9-10):589–592, 2008.
- [11] Andrea Crisanti, Giovanni Paladin, and Angelo Vulpiani. Products of random matrices: in Statistical Physics, volume 104. Springer Science & Business Media, 2012.
- [12] David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
- [13] S. Iida, H.A. Weidenmüller, and J.A. Zuk. Statistical scattering theory, the supersymmetry method and universal conductance fluctuations. Annals of Physics, 200(2):219–270, 1990.
- [14] Jesper R. Ipsen. Products of independent Gaussian random matrices. PhD thesis, Bielefeld University, 2015.
- [15] Hiroshi Ishitani. A central limit theorem for the subadditive process and its application to products of random matrices. Publications of the Research Institute for Mathematical Sciences, 12(3):565–575, 1977.
- [16] Tiefeng Jiang. How many entries of a typical orthogonal matrix can be approximated by independent normals? The Annals of Probability, 34(4):1497–1529, Jul 2006. doi:10.1214/009117906000000205.
- [17] Tiefeng Jiang and Yutao Ma. Distances between random orthogonal matrices and independent normals. Transactions of the American Mathematical Society, 372(3):1509–1553, August 2019. doi:10.1090/tran/7470.
- [18] Ravindran Kannan and Santosh Vempala. Spectral algorithms. Now Publishers Inc, 2009.
- [19] Michael Kapralov, Vamsi Potluru, and David Woodruff. How to fake multiply by a gaussian matrix. In International Conference on Machine Learning, pages 2101–2110. PMLR, 2016.
- [20] Michael W. Mahoney. Randomized algorithms for matrices and data. Found. Trends Mach. Learn., 3(2):123–224, February 2011. doi:10.1561/2200000035.
- [21] Satya N Majumdar and Grégory Schehr. Top eigenvalue of a random matrix: large deviations and third order phase transition. Journal of Statistical Mechanics: Theory and Experiment, 2014(1):P01012, 2014.
- [22] Robert M May. Will a large complex system be stable? Nature, 238(5364):413–414, 1972.
- [23] P.A. Mello, P. Pereyra, and N. Kumar. Macroscopic approach to multichannel disordered conductors. Annals of Physics, 181(2):290–317, 1988.
- [24] Ralf R Muller. On the asymptotic eigenvalue distribution of concatenated vector-valued fading channels. IEEE Transactions on Information Theory, 48(7):2086–2091, 2002.
- [25] Shanmugavelayutham Muthukrishnan. Data streams: Algorithms and applications. Now Publishers Inc, 2005.
- [26] Guillaume Obozinski, Martin J Wainwright, Michael I Jordan, et al. Support union recovery in high-dimensional multivariate regression. The Annals of Statistics, 39(1):1–47, 2011.
- [27] James C Osborn. Universal results from an alternate random-matrix model for QCD with a baryon chemical potential. Physical review letters, 93(22):222001, 2004.
- [28] G Paladin and A Vulpiani. Scaling law and asymptotic distribution of Lyapunov exponents in conservative dynamical systems with many degrees of freedom. Journal of Physics A: Mathematical and General, 19(10):1881, 1986.
- [29] Saurabh Paul, Christos Boutsidis, Malik Magdon-Ismail, and Petros Drineas. Random projections for linear support vector machines. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(4):1–25, 2014.
- [30] Miklós Z. Rácz and Jacob Richey. A smooth transition from Wishart to GOE. Journal of Theoretical Probability, pages 898–906, 2019. doi:10.1007/s10959-018-0808-2.
- [31] Garvesh Raskutti and Michael W Mahoney. A statistical perspective on randomized sketching for ordinary least-squares. The Journal of Machine Learning Research, 17(1):7508–7538, 2016.
- [32] Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), pages 143–152. IEEE, 2006.
- [33] Khalid Shebrawi and Hussein Albadawi. Trace inequalities for matrices. Bulletin of the Australian Mathematical Society, 87(1):139–148, 2013. doi:10.1017/S0004972712000627.
- [34] Terence Tao. Topics in random matrix theory, volume 132. American Mathematical Soc., 2012.
- [35] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Yonina C. Eldar and Gitta Kutyniok, editors, Compressed Sensing: Theory and Applications, pages 210–268. Cambridge University Press, 2012. doi:10.1017/CBO9780511794308.006.
- [36] David P. Woodruff. Sketching as a tool for numerical linear algebra. Found. Trends Theor. Comput. Sci., 10(1–2):1–157, October 2014. doi:10.1561/0400000060.
- [37] Fan Yang, Sifan Liu, Edgar Dobriban, and David P Woodruff. How to reduce dimension with PCA and random projections? arXiv:2005.00511 [math.ST], 2020.
Appendix A Omitted Calculations in Section 4.1
Appendix B Omitted Calculations in Section 4.2
In Step 1a.
In Step 1b.
In Step 2a.
Appendix C Exact Variance when
Suppose that is rotationally invariant under both left- and right-multiplication of an orthogonal matrix. Define
It is clear that they are well-defined.
Let us calculate for a Gaussian random matrix .
Therefore
When , recalling that (see (4)), we have that
If the right-hand side above is at most a small constant , we can distinguish from with probability at least a constant.