Optimal Smoothed Analysis and Quantitative Universality for the Smallest Singular Value of Random Matrices
Abstract
The smallest singular value and condition number play important roles in numerical linear algebra and the analysis of algorithms. In numerical analysis with randomness, many previous works make Gaussian assumptions, which are not general enough to reflect the arbitrariness of the input. To overcome this drawback, we prove the first quantitative universality for the smallest singular value and condition number of random matrices.
Moreover, motivated by the study of smoothed analysis that random perturbation makes deterministic matrices well-conditioned, we consider an analog for random matrices. For a random matrix perturbed by independent Gaussian noise, we show that this matrix quickly becomes approximately Gaussian. In particular, we derive an optimal smoothed analysis for random matrices in terms of a sharp Gaussian approximation.
1 Introduction
In numerical analysis and the analysis of algorithms, the smallest singular value and condition number of matrices play important roles [TB97]. Broadly speaking, the smallest singular value reflects the invertibility of a matrix, and a better non-degeneracy condition of the matrix results in faster algorithms or better stability in computations. The condition number is a crucial quantity as well. It measures the hardness and accuracy of numerical computations, and its best-known practical meaning is the loss of precision in solving linear equations [Sma85]. For more concrete state-of-the-art applications, the smallest singular value affects the stability of fast algorithms for solving linear systems [PV21, Nie22]. In [GPV21], the condition number determines the iteration complexity in fast algorithms for -norm regression. There are many other applications of these two quantities, and we do not intend to provide a complete review here.
In the seminal work of Spielman and Teng [ST04], smoothed analysis was introduced to understand why some algorithms with poor worst-case performance work successfully in practice. Roughly speaking, smoothed analysis interpolates between worst-case analysis and average-case analysis. In the context of solving linear equations $Ax = b$, even if the matrix $A$ has a large condition number (and consequently a large loss of precision), a small random perturbation of $A$ will be well-conditioned with high probability. Smoothed analysis for random perturbations of deterministic matrices has been well studied since its invention [Wsc04, VH14, TV10b, FV16, SS20]. It has also been applied to many algorithms, for example Gaussian elimination [SST06], matrix inversion [BC10], the conjugate gradient method [MT16], tensor decomposition [BCMV14], and the $k$-means method [AMR11].
Classical results of smoothed analysis focus on random perturbations of deterministic matrices, and show that the perturbation indeed makes the original matrix better for the implementation of algorithms (for instance, by giving it a smaller condition number). Recently, numerical computations with random data have become more and more common. In the context of random matrices, we show that if a random matrix is perturbed by independent Gaussian noise, it quickly becomes approximately Gaussian. The smallest singular value and condition number of a Gaussian matrix have been well studied [Ede88], and therefore such Gaussian approximations enable us to analyze the more tractable Gaussian ensemble instead, with benign approximation error. Specifically, we prove an optimal smoothed analysis of random matrices in terms of a sharp Gaussian approximation.
Moreover, in numerical analysis with randomness, many previous works make Gaussian assumptions. Such assumptions are not general enough to reflect the arbitrariness of the input. To overcome this drawback, we prove the first quantitative universality for the smallest singular value and condition number of random matrices. Both the smoothed analysis for random matrices and the quantitative universality will play useful roles in computations with randomness.
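As a quick empirical illustration of the universality phenomenon (this experiment is ours and not part of the paper's statements; the dimensions, trial counts, and the Rademacher comparison ensemble are arbitrary choices), one can compare the law of the rescaled smallest singular value across entry distributions:

```python
# Monte Carlo sketch of universality for the smallest singular value:
# the law of n * sigma_min(X) looks the same for Gaussian and Rademacher
# entries. All parameters below are ad hoc choices for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 200, 300

def smallest_sv_samples(sampler):
    # samples of n * sigma_min(X) for X with i.i.d. entries of variance 1/n
    return np.array([n * np.linalg.svd(sampler((n, n)) / np.sqrt(n),
                                       compute_uv=False)[-1]
                     for _ in range(trials)])

gaussian = smallest_sv_samples(rng.standard_normal)
rademacher = smallest_sv_samples(lambda s: rng.choice([-1.0, 1.0], size=s))

# The two empirical medians should be close, reflecting universality.
print("medians:", np.median(gaussian), np.median(rademacher))
```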
1.1 Overview and our contributions
From the mathematical perspective, edge universality is a classical problem in random matrix theory, and it has seen tremendous progress in the past decade. However, most previous results are qualitative statements, and quantitative rates of convergence to the limiting law were only known for integrable models. Recently, the rate of convergence to the Tracy-Widom distribution for the largest eigenvalue was first obtained in [Bou22] for generalized Wigner matrices and in [Wan19] for sample covariance matrices. These results were further improved by Schnelli and Xu in [SX22a, SX21, SX22b]. Our work is the first result on quantitative universality for the smallest singular value and the condition number.
For an $N \times n$ matrix $X$ with $n/N \to \gamma \in (0, 1]$, the empirical spectral distribution of the sample covariance matrix $X^* X$ converges to the Marchenko-Pastur law
\[ \rho_\gamma(x) = \frac{1}{2\pi \gamma x}\sqrt{\big((\lambda_+ - x)(x - \lambda_-)\big)_+}\,, \qquad \lambda_\pm = (1 \pm \sqrt{\gamma})^2. \]
In the case $\gamma < 1$, the Marchenko-Pastur law has no singularity, and such situations are called the soft edge case in the literature of random matrix theory. Moreover, note that all eigenvalues are of constant order with high probability. In particular, thanks to the square-root decay of the Marchenko-Pastur law near the spectral edge, the extreme eigenvalues of the covariance matrix are highly concentrated at the spectral endpoints at the scale $N^{-2/3}$, and the fluctuation is given by the Tracy-Widom law. In this case, the optimal rate of convergence to the Tracy-Widom distribution has been established in [SX21], and the quantitative universality for the smallest singular value is an easy consequence.
However, in the case $\gamma = 1$, things become much more complicated. Note that in this case the Marchenko-Pastur law has a singularity at the origin:
\[ \rho_1(x) = \frac{1}{2\pi}\sqrt{\frac{(4 - x)_+}{x}}\,, \qquad 0 < x \le 4. \]
This singularity of the limiting spectral distribution makes the behaviour of the eigenvalues near the left endpoint very different. In particular, the typical eigenvalue spacing near the left edge is of order $N^{-2}$, which is much smaller than in the soft edge case, where the typical spacing is of order $N^{-2/3}$. In random matrix theory, this case is called the hard edge case. In this model, the left edge of the spectrum with the singularity is called the hard edge and the right edge is called the soft edge. The study of spectral statistics near the hard edge is a notoriously tricky problem. Due to technical difficulties, the study of the hard edge case is mostly restricted to square matrices with $N = n$. Because of this difficulty, we also focus on square random matrices.
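For intuition on the hard edge scale, the following quick simulation (ours, with ad hoc parameters) checks that the smallest singular value of a square matrix with variance-$1/n$ entries lives on scale $n^{-1}$, so that $n\,\sigma_{\min}$ stays of order one as $n$ grows:

```python
# Hard edge scale check: n * sigma_min should be O(1) across n for square
# matrices with i.i.d. entries of variance 1/n (illustration only).
import numpy as np

rng = np.random.default_rng(1)
for n in (100, 200, 400):
    vals = [n * np.linalg.svd(rng.standard_normal((n, n)) / np.sqrt(n),
                              compute_uv=False)[-1] for _ in range(100)]
    print(n, float(np.mean(vals)))   # roughly constant in n
```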
For the smallest singular value of an $n \times n$ random matrix, the first result was obtained by Edelman for the Gaussian matrix in [Ede88]. Later, the distribution for the Gaussian ensemble was shown to be universal by Tao and Vu in [TV10a], and the result was further generalized to sparse matrices by Che and Lopatto in [CL19]. However, both universality results are qualitative statements with unknown error terms.
In computer science and numerical computations, the accuracy or complexity of algorithms depends on the smallest singular value or condition number in a delicate way. Therefore, a qualitative universality is not enough to measure the performance of general random input, and quantitative estimates are necessary for practical purposes. In this work, we prove the first quantitative version of the universality for the smallest singular value and the condition number. This solves a long-standing open problem (Problem 11 of the open problems from the AIM workshop on random matrices, http://www.aimath.org/WWN/randommatrices/randommatrices.pdf). Along the way to quantitative universality, we obtain a sharp estimate for matrices with Gaussian perturbations, and hence establish the optimal smoothed analysis for random matrices.
1.2 Models and main results
Thanks to the motivation from theoretical computer science, we mainly focus on real matrices. However, as mentioned in [CL19], our whole proof works for complex matrices as well.
Let $X = (x_{ij})$ be an $n \times n$ matrix with independent real-valued entries with mean 0 and variance $n^{-1}$:
\[ \mathbb{E}\,x_{ij} = 0, \qquad \mathbb{E}\,x_{ij}^2 = \frac{1}{n}, \qquad 1 \le i, j \le n. \]
(1)
We assume the entries have sub-exponential decay; that is, there exists a constant $\vartheta > 0$ such that for $t > 1$,
\[ \mathbb{P}\big(\sqrt{n}\,|x_{ij}| > t\big) \le \vartheta^{-1} \exp\big(-t^{\vartheta}\big). \]
(2)
We remark that this assumption is mainly for convenience, and other conditions such as the existence of a sufficiently high moment would also be enough.
For an $n \times n$ matrix $X$, let $\sigma_1(X) \le \sigma_2(X) \le \cdots \le \sigma_n(X)$ denote the singular values in non-decreasing order, and let $\kappa(X) = \sigma_n(X)/\sigma_1(X)$ denote the condition number. Throughout this paper, we let $G$ be an $n \times n$ Gaussian matrix with i.i.d. entries $\mathcal{N}(0, n^{-1})$.
To state the main results, we first introduce two important probabilistic notions that are commonly used throughout the whole paper.
Definition 1 (Overwhelming probability).
Let $\{A_n\}$ be a sequence of events. We say that $A_n$ holds with overwhelming probability if for any (large) $D > 0$, there exists $n_0(D)$ such that for all $n \ge n_0(D)$ we have
\[ \mathbb{P}(A_n) \ge 1 - n^{-D}. \]
Definition 2 (Stochastic domination).
Let $\mathcal{X} = (\mathcal{X}_n)$ and $\mathcal{Y} = (\mathcal{Y}_n)$ be two families of nonnegative random variables. We say that $\mathcal{X}$ is stochastically dominated by $\mathcal{Y}$ if for all (small) $\varepsilon > 0$ and (large) $D > 0$, there exists $n_0(\varepsilon, D)$ such that for $n \ge n_0(\varepsilon, D)$ we have
\[ \mathbb{P}\big(\mathcal{X}_n > n^{\varepsilon}\,\mathcal{Y}_n\big) \le n^{-D}. \]
The stochastic domination is always uniform in all parameters. If $\mathcal{X}$ is stochastically dominated by $\mathcal{Y}$, we use the notation $\mathcal{X} \prec \mathcal{Y}$.
Our main result is the following.
Theorem 2.
For the smallest singular value, one aspect of the complex-valued case is particularly interesting: the complex Gaussian model is explicitly integrable, i.e., the distribution of its smallest singular value is given by an exact formula. Specifically, let $G^{\mathbb{C}}$ be an $n \times n$ matrix whose entries are i.i.d. complex Gaussians whose real and imaginary parts are i.i.d. copies of $\mathcal{N}(0, (2n)^{-1})$. For the complex Gaussian ensemble, Edelman proved in [Ede88] that the distribution of the (renormalized) smallest singular value is independent of $n$ and can be computed explicitly:
\[ \mathbb{P}\big( n\,\sigma_1(G^{\mathbb{C}}) \le r \big) = 1 - e^{-r^2}, \qquad r \ge 0. \]
Thanks to this exact formula for the integrable model, the edge universality for the smallest singular value can be quantified in terms of the Kolmogorov-Smirnov distance to the explicit law.
More precisely, let $X$ be an $n \times n$ complex random matrix whose entries satisfy the complex analog of the moment conditions (1) and the sub-exponential decay assumption (2). Then we have the following rate of convergence to the explicit law.
Corollary 1.
Let $X$ be a complex random matrix defined as above. Then for any $\varepsilon > 0$ we have
(5)
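The explicit law is also easy to probe numerically. The following Monte Carlo sanity check is ours and assumes the classical normalization under which, for the complex Gaussian ensemble with entry variance $1/n$, the rescaled smallest singular value $n\,\sigma_1$ has the exact distribution function $F(r) = 1 - e^{-r^2}$:

```python
# Empirical Kolmogorov-Smirnov distance to the explicit law for the complex
# Gaussian ensemble (a sketch; the normalization is stated in the lead-in).
import numpy as np

rng = np.random.default_rng(2)
n, trials = 100, 1000
samples = []
for _ in range(trials):
    X = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    X /= np.sqrt(2 * n)                          # E|x_ij|^2 = 1/n
    samples.append(n * np.linalg.svd(X, compute_uv=False)[-1])
samples = np.sort(np.array(samples))

F = 1 - np.exp(-samples**2)                      # exact CDF at the samples
i = np.arange(1, trials + 1)
ks = np.max(np.maximum(i / trials - F, F - (i - 1) / trials))
print("KS distance:", ks)                        # shrinks as trials, n grow
```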
Based on the analysis of the smallest singular value, combined with quantitative results for the largest singular value, we can also derive an optimal smoothed analysis and quantitative universality for the condition number.
As mentioned previously, the general hard edge case beyond strictly square matrices is a notoriously difficult problem in random matrix theory. To further generalize our results, we prove the following slight extension towards the general case.
1.3 Outline of proofs
The central idea of this paper is based on the Erdős-Schlein-Yau dynamical approach in random matrix theory. In their seminal work [ESYY12], the so-called three-step strategy was developed to prove universality phenomena for random matrices. Roughly speaking, this framework consists of the following three steps.
- (i) A priori estimates for spectral statistics. This is based on analyzing the resolvent of the matrix, and such an analysis is called a local law in random matrix theory. The local law states that the spectral density converges to the limiting law on microscopic scales. It implies the eigenvalue rigidity phenomenon, which states that the eigenvalues are close to their typical locations. Such a priori control of the eigenvalue locations plays a significant role in the further analysis.
- (ii) Local relaxation of eigenvalues. This step is designed to prove universality for matrices with a tiny Gaussian component. We perturb the matrix by independent Gaussian noise; under this perturbation, the dynamics of the eigenvalues is governed by the Dyson Brownian motion (DBM). Moreover, the spectral distribution of the Gaussian ensemble is the equilibrium measure of DBM. The ergodicity of DBM results in fast convergence to the local equilibrium, and hence implies universality for matrices with a small Gaussian perturbation.
- (iii) Density arguments. For any probability distribution of the matrix elements, there exists a distribution with a small Gaussian component (in the sense of Step (ii)) such that the two associated random matrices have asymptotically identical spectral statistics. Typically, such an asymptotic identity is guaranteed by moment matching conditions and a comparison of resolvents.
For a systematic discussion of this method, we refer to the monograph [EY17]. Following this strategy, our main techniques can also be divided into the following three steps.
- The first step is to establish a priori estimates: uniform rigidity of the singular values (Lemma 3.1), which follows from the local laws for the symmetrized matrix.
- The second step is to interpolate the general matrix $X$ with the Gaussian matrix $G$ and to estimate the dynamics of the singular values. More specifically, we consider the interpolation $X_t$, which solves the matrix Ornstein-Uhlenbeck stochastic differential equation (9) below.
Note that this interpolation is equivalent to the matrix perturbation in our smoothed analysis. We consider a weighted Stieltjes transform (defined in (15)). A key innovation of our work is that, combined with a symmetrization trick, the evolution of the weighted Stieltjes transform along the dynamics of $X_t$ satisfies a stochastic PDE that can be well approximated by a deterministic advection equation. This deterministic PDE yields a rough estimate. Finally, using a delicate bootstrap argument, we show that the estimates are self-improving. Iterating the bootstrap argument to the optimal scale, we derive the optimal smoothed analysis for the smallest singular value.
- The last step is a quantitative resolvent comparison. In particular, the difference between the resolvents of two random matrices is explicitly controlled in terms of the difference of their fourth moments. This comparison is proved via the Lindeberg exchange method. Together with the optimal smoothed analysis, this comparison theorem establishes the quantitative universality.
1.4 Notations and paper organizations
Throughout this paper, we denote by $C$ a generic constant which does not depend on any parameter but may vary from line to line. We write $A \lesssim B$ if $A \le CB$ holds for some constant $C > 0$, and similarly write $A \gtrsim B$ if $A \ge cB$ for some constant $c > 0$. We also denote $A \asymp B$ if both hold. When $A$ and $B$ are complex valued, the same notation refers to the corresponding bounds for both the real and imaginary parts. We use $[\![a, b]\!]$ to denote the set of integers between $a$ and $b$. We use $a \vee b$ and $a \wedge b$ to denote the maximum and minimum of $a$ and $b$, respectively.
The paper is organized as follows. In Section 2, we discuss some applications of our results in numerical analysis and algorithms. In Section 3, we discuss the smoothed analysis for the smallest singular value of random matrices via the study of singular value dynamics. In Section 4, we use the smoothed analysis to establish a full quantitative universality for the smallest singular value. In Section 5, we use the results on the smallest singular value to derive smoothed analysis and quantitative universality for the condition number. In Section 6, we extend the result for square matrices to a slightly more general non-square case. Finally, in the Appendix, we collect some auxiliary results and provide the deferred technical proofs.
2 Applications in Numerical Analysis and Algorithms
In this section, we discuss some applications of our results in numerical analysis and algorithms. There are numerous circumstances where the smallest singular value and condition number play important roles. We do not intend to cover all of them, and instead focus on two simple scenarios in the framework of solving linear systems to illustrate the usefulness of our results. We expect our results to apply to more complicated models and more advanced algorithms.
2.1 Accuracy of least-square solution
Consider the linear least-squares optimization
\[ \min_{x} \|Ax - b\|_2, \]
where $A$ is an $n \times n$ matrix and $b$ is a fixed vector. The loss of precision of this problem, denoted by $\mathrm{LoP}(A, b)$, is the number of correct digits in the entries of the data minus the same quantity for the computed solution. Let $\mathrm{LoP}(A)$ denote the loss of precision for the worst $b$. Then, as shown in [Hig02], up to lower-order terms we have
\[ \mathrm{LoP}(A) = O\big(\log_{10} \kappa(A)\big). \]
Let $X$ be an $n \times n$ random matrix satisfying (1) and (2), and let $G$ be an $n \times n$ Gaussian matrix. By Theorem 3 and Theorem 5, for any $\varepsilon > 0$, with overwhelming probability we have
Also, using Theorems 4 and 5, for any $\varepsilon > 0$, with probability close to one we have
The error terms are smaller than the main term. These results imply that a general random matrix and its Gaussian perturbation ensure accuracy as good as in the Gaussian case.
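The rule of thumb behind these statements can be observed directly. The following snippet (an illustration under our own ad hoc setup, not a statement from the paper) compares the digits lost when solving a random system in double precision with $\log_{10}\kappa$:

```python
# Precision-loss heuristic: solving X x = b in double precision loses
# roughly log10(cond(X)) decimal digits in the worst case (sketch only).
import numpy as np

rng = np.random.default_rng(3)
n = 500
X = rng.standard_normal((n, n)) / np.sqrt(n)
x_true = rng.standard_normal(n)
b = X @ x_true

x_hat = np.linalg.solve(X, b)
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print("digits lost    :", np.log10(rel_err) + 16)   # ~16 digits available
print("log10(cond(X)) :", np.log10(np.linalg.cond(X)))
# The observed loss is typically below the worst-case log10(cond) bound.
```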
2.2 Complexity of conjugate gradient method
Consider the linear equation
\[ Ax = b, \]
where $A$ is an $n \times n$ matrix and $b$ is a fixed vector. This linear system can be solved via the conjugate gradient algorithm. Let $k(\varepsilon)$ denote the number of iterations needed to obtain an $\varepsilon$-approximation of the true solution in the worst case. Then it is known (see e.g. [TB97]) that
\[ k(\varepsilon) = O\big(\sqrt{\kappa(A)}\,\log(1/\varepsilon)\big). \]
Let $X$ be an $n \times n$ random matrix satisfying (1) and (2), and let $G$ be an $n \times n$ Gaussian matrix. By Theorem 3 and Theorem 5, for any $\varepsilon > 0$, with overwhelming probability we have
This shows that, as long as the Gaussian component is not too small, the Gaussian perturbation has time complexity as good as that of the Gaussian ensemble.
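To see the iteration bound in action, here is a minimal conjugate gradient sketch (ours; it assumes a symmetric positive definite system, and the shift 0.1 below is an arbitrary choice to keep the condition number moderate):

```python
# Textbook conjugate gradient with an iteration counter, compared against
# the sqrt(kappa) * log(1/eps) bound (illustration only).
import numpy as np

rng = np.random.default_rng(4)
n = 300
X = rng.standard_normal((n, n)) / np.sqrt(n)
A = X.T @ X + 0.1 * np.eye(n)          # SPD test matrix
b = rng.standard_normal(n)

def cg(A, b, eps=1e-8):
    x = np.zeros_like(b)
    r = b.copy()                        # residual for the start point x = 0
    p = r.copy()
    rs = r @ r
    k = 0
    while np.sqrt(rs) > eps * np.linalg.norm(b):
        k += 1
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, k

_, iters = cg(A, b)
kappa = np.linalg.cond(A)
print("CG iterations           :", iters)
print("sqrt(kappa) * log(1/eps):", np.sqrt(kappa) * np.log(1e8))
```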
3 Smoothed Analysis and Gaussian Approximation
3.1 Singular value dynamics
In smoothed analysis, we are interested in matrix perturbations of the form $X + \sqrt{t}\,G$. After normalization of the variance, it is equivalent to study matrices of the form $X_t$ below. More specifically, we have
\[ X_t \overset{d}{=} e^{-t/2}\, X_0 + \sqrt{1 - e^{-t}}\; G. \]
(8)
Let $B_t$ be an $n \times n$ matrix Brownian motion, i.e. the entries $(B_{ij}(t))_{1 \le i, j \le n}$ are independent standard Brownian motions. Then the evolution of $X_t$ is governed by the following matrix-valued Ornstein-Uhlenbeck process:
\[ \mathrm{d}X_t = \frac{1}{\sqrt{n}}\,\mathrm{d}B_t - \frac{1}{2}\,X_t\,\mathrm{d}t. \]
(9)
Let $s_1(t) \le s_2(t) \le \cdots \le s_n(t)$ denote the singular values of $X_t$. Then they satisfy the following system of stochastic differential equations [ESYY12, equation (5.8)]:
(10)
To handle these SDEs, an important idea is the following symmetrization trick (see [CL19, equation (3.9)]): set $s_{-k} := -s_k$ for $1 \le k \le n$.
With this notation, we label the indices from $-n$ to $-1$ and from $1$ to $n$, so that the zero index is omitted. Unless otherwise stated, this will be the convention, and we will not emphasize it explicitly in the remainder of the paper. After symmetrization, for the real case we have
(11)
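These dynamics are straightforward to simulate. The following Euler-Maruyama sketch (ours; it assumes the Ornstein-Uhlenbeck normalization of (9), with ad hoc step sizes) tracks the singular values of the matrix flow directly rather than through the SDE system:

```python
# Euler-Maruyama simulation of the matrix Ornstein-Uhlenbeck flow (9),
# tracking the smallest singular value along the interpolation
# (illustration with ad hoc parameters).
import numpy as np

rng = np.random.default_rng(5)
n, dt, steps = 100, 1e-3, 500
X = rng.choice([-1.0, 1.0], size=(n, n)) / np.sqrt(n)   # general initial law

for _ in range(steps):
    dB = rng.standard_normal((n, n)) * np.sqrt(dt)       # Brownian increment
    X += dB / np.sqrt(n) - 0.5 * X * dt

print("sigma_min at t = 0.5:", np.linalg.svd(X, compute_uv=False)[-1])
```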
Now we use the coupling method introduced in [LSY19] to analyze these dynamics. Consider the interpolation between a general matrix and a Gaussian matrix . Let and be the (symmetrized) singular values of and , respectively. For , define
With this initial condition, we denote the unique solution of (11) by . Also, let and denote the solutions of (11) with initial conditions and , respectively.
It is well known that the empirical measure of the eigenvalues of converges to the Marchenko-Pastur distribution
For , we define the typical position of the singular value as the quantile satisfying
We also define . By a change of variable, we have
(12)
where $\rho_{sc}(x) = \frac{1}{2\pi}\sqrt{(4 - x^2)_+}$ is the semicircle law.
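As a consistency check (a standard computation, recorded here for the reader's convenience), the symmetrization is compatible with the square-case Marchenko-Pastur law: if $\lambda$ has density $\rho_1(\lambda) = \frac{1}{2\pi}\sqrt{(4-\lambda)/\lambda}$ on $[0, 4]$, then the symmetrized points $x = \pm\sqrt{\lambda}$ have density
\[ \rho_{\mathrm{sym}}(x) = |x|\,\rho_1(x^2) = \frac{|x|}{2\pi}\cdot\frac{\sqrt{4 - x^2}}{|x|} = \frac{1}{2\pi}\sqrt{4 - x^2}, \qquad x \in [-2, 2], \]
which is exactly the semicircle law. This is why the symmetrized singular value dynamics can be compared with semicircle quantities.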
An important input of our proof is the following uniform rigidity estimates. For any fixed , consider the set of good trajectories
(13)
Such rigidity estimates for fixed parameters were proved in [AEK14, AEK17, BEK+14, BYY14, CMS13]. The extension to estimates uniform in the parameters can be done by a discretization argument: (1) discretize in time and in the coupling parameter; (2) use Weyl's inequality to control increments over small time intervals; (3) use a maximum principle for the derivative with respect to the coupling parameter (see Lemma 3.2) to control increments over small parameter intervals. As a consequence, we have
Lemma 3.1.
For any $\varepsilon > 0$, the event happens with overwhelming probability, i.e. for any $D > 0$, there exists $n_0(\varepsilon, D)$ such that for any $n \ge n_0(\varepsilon, D)$ we have
We consider
For simplicity of notation, we omit the parameter when the context is clear. Then it satisfies the following non-local parabolic-type equation.
(14)
Let solve the same equation as in (14) but with initial condition . Following the same arguments as in [Wan19, Lemma 3.1], this equation satisfies a maximum principle.
Lemma 3.2.
For all and , we have
We consider the following weighted Stieltjes transform
(15)
Let and denote the Stieltjes transforms of the empirical measure for the singular values and of the semicircle law
A well-known result in random matrix theory is the following local semicircle law for the Stieltjes transform . Let be an arbitrarily fixed constant. Define the spectral domain
For any and , it was shown in [AEK14, AEK17] that
(16)
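Numerically, the local law is easy to visualize; the following sketch (ours, with an arbitrary test point and scales) compares the symmetrized empirical Stieltjes transform with $m_{sc}(z) = \frac{-z + \sqrt{z^2 - 4}}{2}$:

```python
# Local semicircle law sketch: the symmetrized empirical Stieltjes transform
# stays close to m_sc(z) down to small eta (illustration only).
import numpy as np

rng = np.random.default_rng(6)
n = 400
s = np.linalg.svd(rng.standard_normal((n, n)) / np.sqrt(n),
                  compute_uv=False)
sym = np.concatenate([s, -s])            # symmetrized singular values

def m_emp(z):
    return np.mean(1.0 / (sym - z))

def m_sc(z):
    r = np.sqrt(z * z - 4 + 0j)
    if r.imag < 0:                       # branch with Im m_sc > 0 for Im z > 0
        r = -r
    return (-z + r) / 2

for eta in (1e-1, 1e-2):
    z = 0.5 + 1j * eta
    print(eta, abs(m_emp(z) - m_sc(z)))  # error of order 1/(n * eta)
```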
As computed in [Wan19, Lemma 3.3], a key result is that and satisfy the following stochastic advection equation.
Lemma 3.3.
For , we have
(17)
Based on the local semicircle law (16), we expect that this stochastic differential equation can be approximated by the deterministic advection PDE
(18)
The above PDE has the following explicit characteristics:
(19)
This implies , and we will justify this approximation in the next subsection. Moreover, we remark that satisfies the same equation (17) with replaced by .
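For the reader's convenience, we recall the general mechanism behind this approximation (a standard PDE fact, stated here in generic notation): a solution of the linear advection equation $\partial_t u(t, z) = a(z)\,\partial_z u(t, z)$ is constant along any curve $z_t$ solving $\dot z_t = -a(z_t)$, since
\[ \frac{\mathrm{d}}{\mathrm{d}t}\, u(t, z_t) = \partial_t u(t, z_t) + \dot z_t\,\partial_z u(t, z_t) = 0. \]
Applying this mechanism with the drift in (18) produces the characteristics in (19) and the transport approximation for the weighted Stieltjes transform.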
Before moving to the main estimates, we first collect some basic results, including the geometry of the characteristics and a rough estimate for the initial condition. These can be proved via direct computations and the details can be found in [Bou22, Section 2].
Let . For any , we consider the curve and the domain
We also define and .
Lemma 3.4.
Uniformly in and satisfying and , we have
In particular, for , if , then .
Moreover, for any , and , we have .
Lemma 3.5.
Let be any small constant. In the set , for any , we have if , and otherwise. The same bounds also hold for .
A key ingredient of our proof is the following a priori estimate for , whose proof is deferred to Appendix B.1.
Proposition 3.1.
Let be any small constant. Uniformly for all and , with overwhelming probability we have
This estimate yields a rough control for the decay of , which will be an important input for more refined estimates.
Lemma 3.6.
For all and , we have
(20)
Proof.
By Lemma 3.2, it suffices to control . By the nonnegativity of , we have
which implies
Let . For , pick the point . In this case we have . Therefore, in the set , by Proposition 3.1, uniformly for all and , with overwhelming probability we have
For , without loss of generality we consider . In this case, let and consider . The same argument results in a similar bound with a larger factor. By the arbitrariness of , this completes the proof. ∎
3.2 Local relaxation at the hard edge
In this subsection we prove a quantitative estimate for the local relaxation flow (11) at the hard edge. The main estimate in this section is the following. We remark that Theorem 6 is equivalent to Theorem 1 using the rescaling (8).
Theorem 6.
For arbitrarily small and any , we have
(21)
To estimate near the hard edge, we introduce the following quantity to approximate it. Let with the convention , and define
(22)
Our goal is to prove the following estimates
Proposition 3.2.
Let be a fixed small constant. For arbitrarily small with any and , we have
To obtain the optimal control for the local relaxation flow, we need to carefully estimate near the hard edge. A first step towards such estimates is given in the following lemma.
Lemma 3.7.
Let and . For any , , and , in the set , for we have
(23)
(24)
Proof.
By the properties of the Stieltjes transform (see e.g. [EY17, Section 6]) and direct computation, we have
and
(25)
In the set , the rigidity estimates imply that
Then we have
(26)
By the triangle inequality, we have
Using (25) and (26), we obtain
For the second term , in the set , the rigidity and (25) imply that
Recall from Lemma 3.4 that and . Using a similar argument as in (26) we obtain
Hence we have proved (23). For the other part (24), it can be proved via the same arguments. ∎
As a consequence, we have a good control for the size of away from the soft edge. This is based on the symmetric structure of .
Lemma 3.8.
Let and . For any and , with overwhelming probability we have
(27)
Proof.
A key observation is the following
Therefore, we have
Using the symmetrization of the singular values, we further have
Consequently, by (23) we have
This shows the desired result. ∎
3.3 Bootstrap arguments
We will prove the main technical estimate Proposition 3.2 via a bootstrap argument.
Definition 3 (Hypothesis ).
Consider the following hypothesis: For any fixed small , the following holds for arbitrarily small. For any , and , we have
(28)
Proposition 3.2 is derived via a bootstrap of the hypothesis . Specifically, we have the following two lemmas.
Lemma 3.9.
The hypothesis is true.
Proof.
Lemma 3.10.
If is true, then is true, i.e.
The self-improving property of the hypothesis stated in Lemma 3.10 is the main technical part of the proof for Proposition 3.2. We defer its proof to Appendix B.2.
Finally, the optimal control (21) for the local relaxation flow at the hard edge follows from these two lemmas together with Lemma 3.8.
Proof of Proposition 3.2.
Note that
(29)
Consider an arbitrarily fixed . Based on Lemma 3.9 and Lemma 3.10, after finitely many iterations, with overwhelming probability we have
This shows that for any fixed and , and for large enough , we have
By (29) we obtain
We choose and , and then the Markov inequality yields
which completes the proof thanks to the arbitrariness of and . ∎
4 Quantitative Universality
4.1 Quantitative resolvent comparison
In classical random matrix theory, spectral universality is proved by comparing the resolvents of matrices under suitable moment matching conditions. To obtain a quantitative universality for the smallest singular value, we need a quantitative version of the resolvent comparison theorem.
For a fixed constant , let be a cutoff scaling. Let and consider two symmetric functions that are non-increasing in , given by
Also, consider a fixed non-increasing smooth function such that for and for .
A key observation is that the functions and can bound the distribution of the smallest singular value . For any function , we denote .
Lemma 4.1.
We have
(30)
Proof.
For the right-hand side, assume . By definition of the function , we have , which implies . Also note that . Therefore, we conclude and this yields
The left-hand side can be proved similarly. ∎
When estimating the distribution , thanks to the rigidity of singular values, we can assume without loss of generality, where is a constant that can be arbitrarily small. Based on Lemma 4.1, to compare the distribution of the smallest singular values of different random matrices, it suffices to compare the functions and . In the remaining part of this section, we provide a systematic treatment of such a comparison.
Pick a point with . Let be a smooth symmetric function that is non-increasing in satisfying
(31)
For the test functions and defined as above, we have the following quantitative comparison of the resolvents, whose proof is deferred to Appendix C.
Proposition 4.1.
Let and be two independent random matrices satisfying (1) and (2). Assume the first three moments of the entries are identical, i.e. for all and . Suppose also that for some parameter we have
(32)
With the test functions and defined as above, there exists a constant such that the following is true for any
(33)
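Proposition 4.1 is proved in Appendix C via the Lindeberg exchange method. The mechanism is simple to demonstrate numerically; the toy sketch below (ours, with an ad hoc test statistic and dimensions) swaps the entries of a Rademacher matrix for Gaussian ones, one at a time, and records how little a smooth spectral statistic moves per swap:

```python
# Toy Lindeberg exchange: replace entries one by one and track a smooth
# resolvent statistic of the symmetrization (illustration only).
import numpy as np

rng = np.random.default_rng(7)
n = 40
A = rng.choice([-1.0, 1.0], size=(n, n)) / np.sqrt(n)   # start: Rademacher
B = rng.standard_normal((n, n)) / np.sqrt(n)            # target: Gaussian

def stat(M, z=0.5j):
    s = np.linalg.svd(M, compute_uv=False)
    return np.mean(1.0 / (np.concatenate([s, -s]) - z)).imag

M = A.copy()
max_step = 0.0
for i in range(n):
    for j in range(n):
        before = stat(M)
        M[i, j] = B[i, j]                                # one Lindeberg swap
        max_step = max(max_step, abs(stat(M) - before))

print("largest single-swap change:", max_step)           # small per swap
print("total drift after all swaps:", abs(stat(A) - stat(M)))
```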
4.2 Proof of Theorem 2
Using the quantitative comparison theorem (Proposition 4.1) and the smoothed analysis (Theorem 6), we now prove the quantitative universality.
For a general random matrix satisfying Assumptions (1) and (2), there exists another matrix that also satisfies the same assumptions such that the matrix has the same first three moments as and the difference between the fourth moments (in the sense of (32)) is . This is guaranteed by [EYY11, Lemma 3.4].
Lemma 4.1 and Proposition 4.1 yield
(34)
Using Lemma 4.1 for with and shifted by , we have
(35)
Using the smoothed analysis (Theorem 6), we have
(36)
Taking , and setting , we obtain the optimal bounds
(37)
Hence, thanks to the arbitrariness of , we have proved Theorem 2.
Finally, for the complex case, using the exact formula for the distribution of , we obtain a rate of convergence to the limiting law. Recall that
5 Condition Number
5.1 Smoothed analysis
Note that the condition number is scale invariant in the sense that $\kappa(aX) = \kappa(X)$ for any $a \neq 0$. Therefore, in the smoothed analysis, it suffices to consider $X_t$, whose singular values satisfy the stochastic differential equation (10).
Recall from Theorem 6, we have shown that
Note that in Lemma 3.6, using the same arguments as in Proposition 3.2, we can derive that
Then for any large and small , there exists such that the following holds with probability at least .
Without loss of generality, we assume that for some that can be arbitrarily small. Then we have
where in the last inequality we use the fact that is of constant order with overwhelming probability. By the arbitrariness of , we can relabel the parameter and obtain
Similarly, we can also prove a lower bound and conclude that
For the matrix , we can write the matrix as
Therefore we have that . As a consequence, we obtain that
which completes the proof for Theorem 3.
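These estimates are again easy to corroborate numerically (the following quick experiment is ours and purely illustrative): since $\sigma_{\max}$ is of constant order while $\sigma_{\min}$ is of order $n^{-1}$, the condition number should grow linearly in $n$.

```python
# Condition number scale check: kappa(X)/n roughly constant across n for
# square random matrices (illustration only).
import numpy as np

rng = np.random.default_rng(8)
for n in (100, 200, 400):
    kappas = [np.linalg.cond(rng.standard_normal((n, n)) / np.sqrt(n))
              for _ in range(50)]
    print(n, float(np.median(kappas)) / n)
```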
5.2 Quantitative universality
In this section, we prove Theorem 4, which establishes the quantitative universality for the condition number. We use universality for both the smallest and the largest singular values. In particular, since the smallest singular value is typically of order , quantitative control is necessary. The largest singular value is of constant order and is therefore easier to deal with. Specifically, for the largest singular value, due to the Tracy-Widom limit for the largest eigenvalue of the sample covariance matrix, we know that holds for any with overwhelming probability.
Let be an arbitrarily small constant. For any large and sufficiently large , we have
Using Theorem 2, we obtain
where the third inequality follows from the fact that is of constant order with overwhelming probability. Taking sufficiently large , we have
This yields the upper bound in (7), and the lower bound can be proved similarly. Hence, we have proved Theorem 4.
6 Beyond Strictly Square Matrices
As mentioned in the Introduction, the optimal local law for an $N \times n$ random matrix with general $N \neq n$ is a notoriously hard problem. In particular, the optimal rigidity estimates of Lemma 3.1 are unknown unless we restrict the difference $N - n$. In this section, we discuss a slight extension of the strictly-square case. We show that in the regime , all of our theorems still hold.
This claim is based on the following important observation. All proofs in this paper rely only on the local law (as well as its consequences), and therefore it suffices to show that such a local law is still valid for a general matrix. More specifically, the main task is to show that the optimal local semicircle law (16) still holds for the Girko symmetrization of an $N \times n$ random matrix. Modulo the optimal local law, the optimal rigidity remains valid as a by-product via standard approaches in random matrix theory.
For an $N \times n$ matrix $X$, we consider the augmented matrix, an $N \times N$ matrix obtained by adding $N - n$ columns to $X$ with independent entries satisfying (1) and (2). Without loss of generality, we may assume that the added columns are the first $N - n$ columns. Since the augmented matrix is square, the local semicircle law (see [AEK14, Theorem 1.1]) still holds. Specifically, for any fixed , define the spectral domain
Then for any , define the resolvent and we have
and
For an matrix and a subset , we define as the matrix
Recall the Girko symmetrization $\widetilde{X} = \begin{pmatrix} 0 & X \\ X^* & 0 \end{pmatrix}$ of a matrix $X$. Then we have .
For with , let . In particular we have and . Then we have the following resolvent identity (see e.g. [BGK17, Lemma 3.5])
From the local semicircle law for the square matrix , with overwhelming probability we have
Using local law for again, this implies that
More generally, we have
By the same arguments as above, for any fixed , we can derive that
By a telescoping summation, we derive
Since , the second term can be absorbed into the first term with a larger factor . Thanks to the arbitrariness of , we have
(39)
Moreover, the resolvent identity also yields
This yields
Using the Ward identity, this implies that
From the local law for , with overwhelming probability we have
Again, using , the telescoping sum yields
The second term can be absorbed into the first term for any fixed . The arbitrariness of concludes that
(40)
Acknowledgment
The author thanks Paul Bourgade for suggesting this problem and for insightful comments on an early version of this manuscript.
Appendix A Auxiliary Results
To make this paper self-contained, we collect some well-known results that are used in the paper.
The first result is about controlling the size of a martingale, which is used in Proposition 3.1 to bound the martingale term in the stochastic dynamics (42), and also in Lemma B.1 to bound (45). This is from [SW09, Appendix B.6, Equation (18)].
Lemma A.1.
For any continuous martingale and any , we have
The second result is the Helffer-Sjöstrand formula, which is a classical result in functional calculus. This formula is used in Proposition 4.1 to compute the trace of functions via the Stieltjes transform. We use the version in [EY17, Section 11.2].
Lemma A.2 (Helffer-Sjöstrand formula).
Let $f \in C^2(\mathbb{R})$ with compact support and let $\chi$ be a smooth cutoff function with support in $[-1, 1]$, with $\chi(y) = 1$ for $|y| \le 1/2$ and with bounded derivatives. Then
\[ f(\lambda) = \frac{1}{2\pi} \int_{\mathbb{R}^2} \frac{\mathrm{i}\,y f''(x)\chi(y) + \mathrm{i}\,\big(f(x) + \mathrm{i}\,y f'(x)\big)\chi'(y)}{\lambda - x - \mathrm{i}\,y}\,\mathrm{d}x\,\mathrm{d}y. \]
We also have the following resolvent expansion identity. This is a well-known result in linear algebra, and it is used in Proposition 4.1 to compare the resolvents of two matrices.
Lemma A.3 (Resolvent expansion).
For any two matrices $A$ and $B$ and any integer $m \ge 0$, we have
\[ (A + B)^{-1} = \sum_{k=0}^{m} (-1)^k A^{-1} \big(B A^{-1}\big)^{k} + (-1)^{m+1} (A + B)^{-1} \big(B A^{-1}\big)^{m+1}, \]
provided that all the matrix inverses exist.
Finally, we have some estimates for the Stieltjes transform of the semicircle law. For with , recall that denotes the Stieltjes transform of the semicircle distribution. The following estimates are well known in random matrix theory (see e.g. [EY17, Lemma 6.2]).
Lemma A.4.
We have for all with that
Furthermore, there is a constant such that for and we have
as well as
where is the distance of to the spectral edge.
Appendix B Proofs for Smoothed Analysis
B.1 Proof of Proposition 3.1
The proof is essentially the same as [Wan19, Proposition 3.8], and we briefly describe the key steps here for completeness. For any , we define and , where . Consider the stopping times
with the convention . We claim that it suffices to show that with overwhelming probability.
To prove this claim, for any and , we pick and . Note that the maximum principle (Lemma 3.2) implies for all and . Then we have . Also, note that for we have , and
Consider the events
On the event , the above estimates imply that . It further shows that
Since this holds for all and , we have shown that
(41)
Moreover, note that
Using Lemma A.1, we conclude that the event happens with overwhelming probability. By a union bound, we further have that happens with overwhelming probability. Together with the set inclusion (41), we conclude that the claim is true, i.e. it suffices to prove with overwhelming probability.
To prove with overwhelming probability, consider some fixed and , and define the function . By Lemma 3.5, the initial condition is well controlled . To bound the increments, note that the dynamics (17) yields
(42)
where
By the local semicircle law (16), we have
(43)
Also, we have
and
For the martingale part
using the rigidity of singular values (Lemma 3.1), with overwhelming probability we have
To estimate this integral, we chop the interval into subintervals where . We can bound the summation in the integral in the following way
Using similar discretization arguments as above, we can derive
and
Therefore, we obtain
Combining this estimate for the martingale term with previous estimates, a union bound shows that with overwhelming probability we have
Now we have proved with overwhelming probability and hence the desired result is true.
B.2 Proof of Lemma 3.10
The proof of Lemma 3.10 is a delicate task. The key part of the proof is a careful analysis of the dynamics. The main idea is to approximate the dynamics with a short-range version, which will be easier to control. To do this, we show the finite speed of propagation estimate for the short-range kernel of the parabolic-type equation (14) satisfied by . Then we prove a short-range approximation of the original dynamics and introduce a regularized equation. Finally, we show that, with a well-behaved initial condition, the regularized equation gives us the desired good approximation.
To begin with, the core input of the bootstrap argument is the following technical lemma, which states that the estimate of the local average will improve along with the induction hypothesis .
Lemma B.1.
Assume . Let be any fixed small constant. For any , any arbitrarily small and satisfying , , we have
(44)
Proof.
Fix and consider the function
An observation is that satisfy the same stochastic advection equation (17) with replaced by . Therefore, we have
(45)
where
Similarly as in the proof of Proposition 3.1, for and , define and where and . We also pick such that . Assuming , let be the arbitrarily small scale in the hypothesis. Let be some suitably large constant. Recall the stopping times
and consider the new stopping times
Recall the convention . As shown in the proof of Proposition 3.1, it suffices to show that with overwhelming probability.
A key ingredient for the analysis of the dynamics of is the following estimates on . To do this, we fix some and with , and let and .
On the one hand, we have a direct a priori estimate. Since , we have
Moreover, note that for with in the bulk and , uniformly we have . By Lemma 3.5, this shows
As a consequence, we have
(46)
On the other hand, the estimate can also be obtained via approximation
For the first term, since we have
Choosing , the remaining two terms are controlled by Lemma 3.8
Together, we decompose the error terms into two parts and obtain the following
(47)
With the above control on , the dynamics (45) can be used to bound similarly as in Proposition 3.1. For the first term, we have
(48)
For the soft edge part , using (46) we obtain
(49)
For , note that
This yields
(50)
Moreover, without loss of generality, we may assume that . Then, using (46) we obtain
(51)
Together with previous estimates, this shows
(52)
Similarly, we have
and
It suffices to bound the martingale term
Again we decompose the integral into two parts
The contribution from the soft edge is easy to control
For the other term, we use both (46) and (47)
Note that
For the other term, without loss of generality we may assume . For the term, we have
For the contribution of , we have
The first two terms in the bracket give us
For the remaining term, we have
Combining these results shows
Using Lemma A.1 and a union bound, for any fixed large and sufficiently large , we have
Together with previous estimates, with overwhelming probability we have
This implies with overwhelming probability. Moreover, we have shown in Lemma 3.1, Lemma 3.6 that and with overwhelming probability. Assuming the hypothesis , we also have with overwhelming probability. These imply that with overwhelming probability. Hence we complete the proof. ∎
Now we move on to the short-range approximation of the dynamics. Recall that satisfies the parabolic equation (14), and we rewrite it as
where the time-dependent operator is defined in the following way: For ,
Consider some parameter which will be determined later; we decompose the operator into two parts. The operators and represent the short-range and long-range interactions, respectively, and are defined as follows
Note that the operators are also time dependent. Let denote the semigroup associated with the operator in the sense
Also, let denote the semigroup associated with .
To prove the short-range approximation, we need the following finite speed of propagation estimate for the semigroup. Such estimates were proved in [CL19], and they carry over to our setting with minor changes.
Lemma B.2.
For any fixed small and large , there exists such that the following holds with probability at least . For any , , , , and such that , we have
(53)
With Lemma B.2, we have the following short-range approximation estimate. In particular, this short-range approximation can be improved based on the hypothesis .
Lemma B.3.
Assume . Let be any fixed small constant. There exists a constant such that for any , , , and , we have
(54)
Proof.
The Duhamel’s principle implies
On the event that Lemma B.2 holds, for , the finite speed of propagation yields
where . Moreover, using the property that is a contraction, we have
(55)
For , assuming and on the event that Lemma 3.7 holds, there exists a constant so that
For , Lemma 3.6 yields
Therefore, using rigidity of singular values, we have
Combined with (55), this implies the desired claim with . Note that is arbitrary, and thus it concludes the proof. ∎
Further, we will show that we have good control of a regularization of the short-range dynamics with well-behaved initial data. To do this, we follow the techniques developed in [BY17]. Consider some fixed times , the short-range parameter , and define an averaging space window scale . Throughout the remaining parts of this section, for any fixed arbitrarily small , we make the following assumption on these parameters
(56)
For a fixed index , as in [BY17], we define the flattening operator with parameter by
and the averaging operator
As shown in [BY17, Equation (7.4)], the averaging operator can also be represented as a combination of Lipschitz functions, i.e. there exists a Lipschitz function with such that
(57)
Finally, for , consider the regularized dynamics
The following lemma shows that averaging the regularized dynamics gives a good approximation for .
Lemma B.4.
Assume . Let be any fixed small constant. There exists a constant such that for any , , , , such that , and , we have
(58)
Proof.
For the first term , note that
When , we have
and the finite speed of propagation (53) yields
This gives
Similarly, this bound also holds in the case . Now suppose . Applying (23) and Hypothesis , we obtain
Combined with the estimate above, this implies
(59)
For the term , note that implies . Therefore, using the Lipschitz representation of the averaging operator (57), the short-range approximation (54) gives us
This shows
(60)
Finally, for the term , by the Lipschitz representation of the averaging operator (57), it can be rewritten in the following way:
Using (24) and (44), we control in the following way
Applying (23) to estimate , we obtain
Similarly, by the Lipschitz property of , we estimate as follows
Using the same arguments, by (23), we have
Together with the previous estimates, this leads to
Combined with (59) and (60), we obtain the desired result. ∎
Finally, we have all the tools to prove Lemma 3.10.
Proof of Lemma 3.10.
We fix some small and consider an arbitrarily small . Throughout the whole proof, we do all estimates on the overwhelming probability event where Lemma B.2, Lemma B.3 and Lemma B.4 hold. For a fixed index , we have
By the definition of the averaging operator, we know that on the set . Therefore, combined with the finite speed of propagation estimate (53) for the second term and the short-range approximation (54) for the first term, we obtain
(61)
It suffices to estimate . Consider the function
Similarly to Lemma 3.2, we can show a parabolic maximum principle for , and consequently it decreases in time. Moreover, note that if .
Let denote the index that attains the maximum. If there exists a time such that , then the finite speed of propagation (53) gives us
(62)
On the other hand, now we assume that for all . In this case, we have
This gives us
and therefore
Applying Lemma B.4 and Lemma 3.7 yields
where the left-hand side represents the right derivative of at time . Let ; then the above inequality leads to
Choosing
(63)
then we have
Similarly, this bound also holds for . Combined with (61) and (62), this completes the proof. ∎
Appendix C Proofs for Quantitative Universality
In this section, we prove the quantitative resolvent comparison Proposition 4.1.
The key idea is based on the Lindeberg exchange method (for a detailed introduction we refer to the monographs [EY17, VH14]). We first fix an ordering map of the indices . For , let be the random matrix defined as
so that and . By telescoping summation, it suffices to show the following is true uniformly in ,
(64)
To prove (64), we use the Helffer-Sjöstrand formula. Let be a smooth symmetric cutoff function such that if and if , with . For any matrix , let denote its Girko symmetrization
Recall that the symmetrized singular values are the eigenvalues of . With the cutoff function , applying Lemma A.2 to yields
(65)
where is the Lebesgue measure on and
The analysis of the comparison proceeds in the following steps:
Step 1: Approximation of . We first truncate the integral in (65) and define
The approximation error can be bounded by
For singular values near the origin, i.e. , we have
On the other hand, for , by the rigidity of singular values, we have the following overwhelming probability bound
Combining the above two bounds together, we obtain
with overwhelming probability.
Step 2: Expansions and moment matching. With the approximation by , it suffices to show that
(66)
uniformly for all . Now consider a fixed corresponding to the index , i.e. . We rewrite the matrices and in the following way
where the matrix coincides with and except on the entry with . Then note that the matrices satisfy and and all other entries are zero. Recall the notation for the Girko symmetrization. Consider the resolvents of the matrices and
The Taylor expansion yields
(67)
We first control the term corresponding to the fifth derivative. By Lemma A.3, the first order resolvent expansion gives us
Consequently,
We can restrict the integral to the domain , as the contribution outside this region is negligible. Moreover, a key observation is that the matrix only has two non-zero entries. Thus,
Note that in this integral domain, the scale of is smaller than the natural size of the local law. Therefore, we will use a suboptimal version of the local semicircle law for a larger spectral domain, which was discussed in [EKYY13, LS18]. For in this integral domain, with overwhelming probability we have
The same result also holds for . By Lemma A.4, we have for in the integral domain. Note that the contribution of the diagonal resolvent entries is negligible. Therefore, with overwhelming probability we have
Similarly, this bound also holds for , and we obtain
Hence the fifth order term in (67) is bounded by
(68)
Now we consider the first term in the Taylor expansion (67). Denote
and also define
Using the resolvent expansion (Lemma A.3) up to the fifth order, we obtain
A similar expansion also holds for . Then we have
(69)
A key observation is that for , the terms and only depend on the first three moments of and . Recall that the first three moments of and are identical. Therefore, the terms corresponding to in (69) make no contribution.
Step 3: Higher order error. For the term in (69), note that
(70)
A similar formula is also true for . Note that typically we have and , but we may have , , . Moreover, the terms with either or are combinatorially negligible in the summation and therefore we can ignore these terms in the following computations. Recall that the difference between the fourth moments of and is bounded by . Thus, we have
As mentioned above, for the integral in (69) we can restrict the integral domain to and . In this region, the entries of the resolvent are bounded by and . As a consequence,
(71)
For the term , since these terms involve the higher moments of and , we simply bound it by the size of and . By a similar expansion as in (70) and the local law, we have . Therefore,
(72)
The same bound also holds for .
References
- [AEK14] Oskari Ajanki, László Erdős, and Torben Krüger. Local semicircle law with imprimitive variance matrix. Electron. Commun. Probab., 19:no. 33, 9, 2014.
- [AEK17] Johannes Alt, László Erdős, and Torben Krüger. Local law for random Gram matrices. Electron. J. Probab., 22:Paper No. 25, 41, 2017.
- [AMR11] David Arthur, Bodo Manthey, and Heiko Röglin. Smoothed analysis of the k-means method. Journal of the ACM (JACM), 58(5):1–31, 2011.
- [BC10] Peter Bürgisser and Felipe Cucker. Smoothed analysis of Moore–Penrose inversion. SIAM Journal on Matrix Analysis and Applications, 31(5):2769–2783, 2010.
- [BCMV14] Aditya Bhaskara, Moses Charikar, Ankur Moitra, and Aravindan Vijayaraghavan. Smoothed analysis of tensor decompositions. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 594–603, 2014.
- [BEK+14] Alex Bloemendal, László Erdős, Antti Knowles, Horng-Tzer Yau, and Jun Yin. Isotropic local laws for sample covariance and generalized Wigner matrices. Electron. J. Probab., 19:no. 33, 53, 2014.
- [BGK17] Florent Benaych-Georges and Antti Knowles. Local semicircle law for Wigner matrices. In Advanced topics in random matrices, volume 53 of Panor. Synthèses, pages 1–90. Soc. Math. France, Paris, 2017.
- [Bou22] Paul Bourgade. Extreme gaps between eigenvalues of Wigner matrices. Journal of the European Mathematical Society, 24(8):2823–2873, 2022.
- [BY17] P. Bourgade and H.-T. Yau. The eigenvector moment flow and local quantum unique ergodicity. Comm. Math. Phys., 350(1):231–278, 2017.
- [BYY14] Paul Bourgade, Horng-Tzer Yau, and Jun Yin. Local circular law for random matrices. Probability Theory and Related Fields, 159(3-4):545–595, 2014.
- [CL19] Ziliang Che and Patrick Lopatto. Universality of the least singular value for sparse random matrices. Electron. J. Probab., 24:Paper No. 9, 53, 2019.
- [CMS13] Claudio Cacciapuoti, Anna Maltsev, and Benjamin Schlein. Local Marchenko-Pastur law at the hard edge of sample covariance matrices. J. Math. Phys., 54(4):043302, 13, 2013.
- [Ede88] Alan Edelman. Eigenvalues and condition numbers of random matrices. SIAM J. Matrix Anal. Appl., 9(4):543–560, 1988.
- [EKYY13] László Erdős, Antti Knowles, Horng-Tzer Yau, and Jun Yin. Spectral statistics of Erdős–Rényi graphs I: Local semicircle law. The Annals of Probability, 41(3B):2279–2375, 2013.
- [ESYY12] László Erdős, Benjamin Schlein, Horng-Tzer Yau, and Jun Yin. The local relaxation flow approach to universality of the local statistics for random matrices. Ann. Inst. Henri Poincaré Probab. Stat., 48(1):1–46, 2012.
- [EY17] László Erdős and Horng-Tzer Yau. A dynamical approach to random matrix theory, volume 28 of Courant Lecture Notes in Mathematics. Courant Institute of Mathematical Sciences, New York; American Mathematical Society, Providence, RI, 2017.
- [EYY11] László Erdős, Horng-Tzer Yau, and Jun Yin. Universality for generalized Wigner matrices with Bernoulli distribution. J. Comb., 2(1):15–81, 2011.
- [FV16] Brendan Farrell and Roman Vershynin. Smoothed analysis of symmetric random matrices with continuous distributions. Proceedings of the American Mathematical Society, 144(5):2257–2261, 2016.
- [GPV21] Mehrdad Ghadiri, Richard Peng, and Santosh S Vempala. Sparse regression faster than $d^{\omega}$. arXiv preprint arXiv:2109.11537, 2021.
- [Hig02] Nicholas J Higham. Accuracy and stability of numerical algorithms. SIAM, 2002.
- [LS18] Ji Oon Lee and Kevin Schnelli. Local law and Tracy–Widom limit for sparse random matrices. Probability Theory and Related Fields, 171(1):543–616, 2018.
- [LSY19] Benjamin Landon, Philippe Sosoe, and Horng-Tzer Yau. Fixed energy universality of Dyson Brownian motion. Adv. Math., 346:1137–1332, 2019.
- [MT16] Govind Menon and Thomas Trogdon. Smoothed analysis for the conjugate gradient algorithm. SIGMA. Symmetry, Integrability and Geometry: Methods and Applications, 12:109, 2016.
- [Nie22] Zipei Nie. Matrix anti-concentration inequalities with applications. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, pages 568–581, 2022.
- [PV21] Richard Peng and Santosh Vempala. Solving sparse linear systems faster than matrix multiplication. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 504–521. SIAM, 2021.
- [Sma85] Steve Smale. On the efficiency of algorithms of analysis. Bulletin of the American Mathematical Society, 13(2):87–121, 1985.
- [SS20] Rikhav Shah and Sandeep Silwal. Smoothed analysis of the condition number under low-rank perturbations. arXiv preprint arXiv:2009.01986, 2020.
- [SST06] Arvind Sankar, Daniel A Spielman, and Shang-Hua Teng. Smoothed analysis of the condition numbers and growth factors of matrices. SIAM Journal on Matrix Analysis and Applications, 28(2):446–476, 2006.
- [ST04] Daniel A Spielman and Shang-Hua Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM (JACM), 51(3):385–463, 2004.
- [SW09] Galen R. Shorack and Jon A. Wellner. Empirical processes with applications to statistics, volume 59 of Classics in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2009. Reprint of the 1986 original [MR0838963].
- [SX21] Kevin Schnelli and Yuanyuan Xu. Convergence rate to the Tracy-Widom laws for the largest eigenvalue of sample covariance matrices. arXiv preprint arXiv:2108.02728, 2021.
- [SX22a] Kevin Schnelli and Yuanyuan Xu. Convergence rate to the Tracy-Widom laws for the largest eigenvalue of Wigner matrices. Comm. Math. Phys., 393(2):839–907, 2022.
- [SX22b] Kevin Schnelli and Yuanyuan Xu. Quantitative Tracy-Widom laws for the largest eigenvalue of generalized Wigner matrices. arXiv preprint arXiv:2207.00546, 2022.
- [TB97] Lloyd N. Trefethen and David Bau, III. Numerical linear algebra. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 1997.
- [TV10a] Terence Tao and Van Vu. Random matrices: the distribution of the smallest singular values. Geom. Funct. Anal., 20(1):260–297, 2010.
- [TV10b] Terence Tao and Van Vu. Smooth analysis of the condition number and the least singular value. Mathematics of computation, 79(272):2333–2352, 2010.
- [VH14] Ramon van Handel. Probability in high dimension. Technical report, Princeton University, 2014.
- [Wan19] Haoyu Wang. Quantitative universality for the largest eigenvalue of sample covariance matrices. arXiv preprint arXiv:1912.05473, 2019.
- [Wsc04] Mario Wschebor. Smoothed analysis of $\kappa(A)$. Journal of Complexity, 20(1):97–107, 2004.