
Resampling Sensitivity of High-Dimensional PCA

Haoyu Wang Department of Mathematics, Yale University, New Haven, CT 06511, USA, [email protected]
Abstract

The study of stability and sensitivity of statistical methods or algorithms with respect to their data is an important problem in machine learning and statistics. The performance of the algorithm under resampling of the data is a fundamental way to measure its stability and is closely related to generalization or privacy of the algorithm. In this paper, we study the resampling sensitivity for the principal component analysis (PCA). Given an n×pn\times p random matrix 𝐗\mathbf{X}, let 𝐗[k]\mathbf{X}^{[k]} be the matrix obtained from 𝐗\mathbf{X} by resampling kk randomly chosen entries of 𝐗\mathbf{X}. Let 𝐯\mathbf{v} and 𝐯[k]\mathbf{v}^{[k]} denote the principal components of 𝐗\mathbf{X} and 𝐗[k]\mathbf{X}^{[k]}. In the proportional growth regime p/nξ(0,1]p/n\to\xi\in(0,1], we establish the sharp threshold for the sensitivity/stability transition of PCA. When kn5/3k\gg n^{5/3}, the principal components 𝐯\mathbf{v} and 𝐯[k]\mathbf{v}^{[k]} are asymptotically orthogonal. On the other hand, when kn5/3k\ll n^{5/3}, the principal components 𝐯\mathbf{v} and 𝐯[k]\mathbf{v}^{[k]} are asymptotically colinear. In words, we show that PCA is sensitive to the input data in the sense that resampling even a negligible portion of the input may completely change the output.

1 Introduction

The study of stability and sensitivity of statistical methods and algorithms with respect to the input data is an important task in machine learning and statistics [BE02, EEPK05, MNPR06, HRS16, DHS21]. The notion of stability for algorithms is also closely related to differential privacy [DR14] and generalization error [KN02]. To measure algorithmic stability, one fundamental question is to study the performance of the algorithm under resampling of its input data [BCRT21, KB21]. Originating from the analysis of Boolean functions, resampling sensitivity (also called noise sensitivity) is an important concept in theoretical computer science, which refers to the phenomenon that resampling a small portion of the random input data may lead to decorrelation of the output. Such a remarkable phenomenon was first studied in the pioneering work of Benjamini, Kalai and Schramm [BKS99], and we refer to the monograph [GS14] for a systematic discussion on this topic.

In this work, we study the resampling sensitivity of principal component analysis (PCA). As one of the most commonly used statistical methods, PCA is widely applied for dimension reduction, feature extraction, etc. [Joh07, DT11]. It is also used in other fields such as economics [VK06], finance [ASX17], and genetics [Rin08]. The performance of PCA under additive or multiplicative independent perturbations of the data matrix has been well studied (see e.g. [BBAP05, BS06, Pau07, BGN11, CLMW11, FWZ18]).

However, how resampling of the data matrix affects the outcome remains unclear. In this paper, we address this problem for the first time. Here, we emphasize that the resampling of the input data need not have any structure; the specific resampling procedure is given in the next subsection. In our main results, we show that PCA is resampling sensitive, in the sense that, above a certain threshold, resampling even a negligible portion of the data may completely change the resulting principal component (i.e. make it orthogonal to the original direction).

Compared with previous work that mainly focused on PCA with additive or multiplicative independent noise, our setting is very different. In our model, if the resampling effect is written as an additive or multiplicative perturbation, then this noise is neither independent of the signal nor endowed with any special structure. In contrast, previous work sometimes imposed low-rank assumptions on the structure of the matrix or the noise, or certain incoherence conditions. In our work, we make almost no assumption on the data other than a sub-exponential decay condition. Moreover, we highlight that our results are universal. In particular, we do not need to know the specific distribution of the data, and we do not require the data to be i.i.d. sampled.

1.1 Model and main results

Let 𝐗=(𝐗ij)\mathbf{X}=(\mathbf{X}_{ij}) be an n×pn\times p data matrix with independent real-valued entries of mean 0 and variance p1p^{-1},

𝐗ij=p1/2xij,𝔼[xij]=0,𝔼[xij2]=1.\mathbf{X}_{ij}=p^{-1/2}x_{ij},\ \ \ \mathbb{E}[x_{ij}]=0,\ \ \ \mathbb{E}[x_{ij}^{2}]=1. (1)

Note that we do not require the i.i.d. condition for the data. Furthermore, we assume the entries xijx_{ij} have a sub-exponential decay, that is, there exists a constant θ>0\theta>0 such that for u>1u>1,

(|xij|>u)θ1exp(uθ).\mathbb{P}(|x_{ij}|>u)\leq\theta^{-1}\exp(-u^{\theta}). (2)

This sub-exponential decay assumption is mainly for convenience, and other conditions such as the finiteness of a sufficiently high moment would be enough.

Motivated by high-dimensional statistics, we will work in the proportional growth regime npn\asymp p. Throughout this paper, to avoid trivial eigenvalues, we will be working in the regime

limnp/n=ξ(0,1)orp/n1.\lim_{n\to\infty}p/n=\xi\in(0,1)\ \ \mbox{or}\ \ p/n\equiv 1.

In the case limp/n=1\lim p/n=1, our assumption p/n1p/n\equiv 1 is due to technical reasons in random matrix theory. Specifically, the proof relies on the delocalization of eigenvectors in the whole spectrum. As one of the major open problems in random matrix theory, delocalization of eigenvectors near the lower spectral edge is not known in the general case with just limp/n=1\lim p/n=1. The strictly square assumption pnp\equiv n can be slightly relaxed to |np|=po(1)|n-p|=p^{o(1)} (see e.g. [Wan22]), but we do not pursue such an extension for simplicity.

The sample covariance matrix corresponding to data matrix 𝐗\mathbf{X} is defined by 𝐇:=𝐗𝐗\mathbf{H}:=\mathbf{X}^{\top}\mathbf{X}. We order the eigenvalues of 𝐇\mathbf{H} as λ1λp\lambda_{1}\geq\cdots\geq\lambda_{p}, and use 𝐯ip\mathbf{v}_{i}\in\mathbb{R}^{p} to denote the unit eigenvector corresponding to the eigenvalue λi\lambda_{i}. If the context is clear, we just use λ:=λ1\lambda:=\lambda_{1} and 𝐯:=𝐯1\mathbf{v}:=\mathbf{v}_{1} to denote the largest eigenvalue and the top eigenvector. We also consider the eigenvalues and eigenvectors of the Gram matrix 𝐇^:=𝐗𝐗\widehat{\mathbf{H}}:=\mathbf{X}\mathbf{X}^{\top}. Note that 𝐇^\widehat{\mathbf{H}} and 𝐇\mathbf{H} have the same non-trivial eigenvalues, and the spectrum of 𝐇^\widehat{\mathbf{H}} is given by {λi}i=1n\{\lambda_{i}\}_{i=1}^{n} with λp+1==λn=0\lambda_{p+1}=\cdots=\lambda_{n}=0. We denote the unit eigenvectors of 𝐇^\widehat{\mathbf{H}} associated with the eigenvalue λi\lambda_{i} by 𝐮in\mathbf{u}_{i}\in\mathbb{R}^{n}.

Let 𝐔=[𝐮1,,𝐮p]n×p\mathbf{U}=[\mathbf{u}_{1},\cdots,\mathbf{u}_{p}]\in\mathbb{R}^{n\times p} and 𝐕=[𝐯1,,𝐯p]p×p\mathbf{V}=[\mathbf{v}_{1},\cdots,\mathbf{v}_{p}]\in\mathbb{R}^{p\times p}. These eigenvectors may be connected by the singular value decomposition of the data matrix 𝐗=𝐔𝚺𝐕\mathbf{X}=\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\top}, where 𝚺:=𝖽𝗂𝖺𝗀(σ1,,σp)\mathbf{\Sigma}:=\mathsf{diag}(\sigma_{1},\cdots,\sigma_{p}) with σi=λi\sigma_{i}=\sqrt{\lambda_{i}} is the diagonal matrix of singular values. For convenience, we also denote σ:=σ1\sigma:=\sigma_{1}. Therefore, up to the signs of the eigenvectors, we have

𝐗𝐯α=λα𝐮α,𝐗𝐮α=λα𝐯α.\mathbf{X}\mathbf{v}_{\alpha}=\sqrt{\lambda_{\alpha}}\mathbf{u}_{\alpha},\ \ \mathbf{X}^{\top}\mathbf{u}_{\alpha}=\sqrt{\lambda_{\alpha}}\mathbf{v}_{\alpha}.

We now describe the resampling procedure. For a positive integer knpk\leq np, define the random matrix 𝐗[k]\mathbf{X}^{[k]} in the following way. Let Sk={(i1,α1),,(ik,αk)}S_{k}=\{(i_{1},\alpha_{1}),\cdots,(i_{k},\alpha_{k})\} be a set of kk pairs chosen uniformly at random without replacement from the set of all ordered pairs (i,α)(i,\alpha) of indices with 1in1\leq i\leq n and 1αp1\leq\alpha\leq p. We assume that the set SkS_{k} is independent of the entries of 𝐗\mathbf{X}. The entries of 𝐗[k]\mathbf{X}^{[k]} are given by

𝐗i,α[k]={𝐗i,αif (i,α)Sk,𝐗i,αotherwise,\mathbf{X}^{[k]}_{i,\alpha}=\left\{\begin{aligned} &\mathbf{X}_{i,\alpha}^{\prime}&\quad&\mbox{if }(i,\alpha)\in S_{k},\\ &\mathbf{X}_{i,\alpha}&\quad&\mbox{otherwise},\end{aligned}\right.

where (𝐗i,α)1in,1αp(\mathbf{X}_{i,\alpha}^{\prime})_{1\leq i\leq n,1\leq\alpha\leq p} are independent random variables, independent of 𝐗\mathbf{X}, and 𝐗i,α\mathbf{X}_{i,\alpha}^{\prime} has the same distribution as 𝐗i,α\mathbf{X}_{i,\alpha}. In other words, 𝐗[k]\mathbf{X}^{[k]} is obtained from 𝐗\mathbf{X} by resampling kk random entries of the matrix, and therefore 𝐗[k]\mathbf{X}^{[k]} clearly has the same distribution as 𝐗\mathbf{X}. Let 𝐇[k]:=(𝐗[k])𝐗[k]\mathbf{H}^{[k]}:=(\mathbf{X}^{[k]})^{\top}\mathbf{X}^{[k]} be the sample covariance matrix corresponding to the resampled data matrix 𝐗[k]\mathbf{X}^{[k]}. Denote the eigenvalues and the corresponding normalized eigenvectors of 𝐇[k]\mathbf{H}^{[k]} by λ1[k]λp[k]\lambda^{[k]}_{1}\geq\cdots\geq\lambda^{[k]}_{p} and 𝐯1[k],,𝐯p[k]\mathbf{v}^{[k]}_{1},\cdots,\mathbf{v}^{[k]}_{p}. When the context is clear, the top eigenvalue and the top eigenvector are simply denoted by λ[k]\lambda^{[k]} and 𝐯[k]\mathbf{v}^{[k]}. Similarly, denote the eigenvector of the matrix 𝐇^[k]:=𝐗[k](𝐗[k])\widehat{\mathbf{H}}^{[k]}:=\mathbf{X}^{[k]}(\mathbf{X}^{[k]})^{\top} associated with the eigenvalue λi[k]\lambda^{[k]}_{i} by 𝐮i[k]\mathbf{u}^{[k]}_{i}.
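For concreteness, the resampling procedure is easy to implement. The short NumPy sketch below is only illustrative: the function name, the Gaussian choice for the entry distribution, and all parameters are ours and are not part of the model assumptions.

```python
import numpy as np

def resample_entries(X, k, rng):
    """Return X^{[k]}: a copy of X with k uniformly chosen entries redrawn
    independently from the same distribution, independently of X."""
    n, p = X.shape
    Xk = X.copy()
    # choose k distinct positions uniformly at random
    flat = rng.choice(n * p, size=k, replace=False)
    rows, cols = np.unravel_index(flat, (n, p))
    # fresh entries with the same law as the originals (here N(0, 1/p))
    Xk[rows, cols] = rng.standard_normal(k) / np.sqrt(p)
    return Xk

# usage: a 300 x 150 Gaussian data matrix with k = 1000 resampled entries
rng = np.random.default_rng(0)
n, p, k = 300, 150, 1000
X = rng.standard_normal((n, p)) / np.sqrt(p)
Xk = resample_entries(X, k, rng)
```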

With the resampling parameter in two different regimes, we have the following results.

Theorem 1 (Noise sensitivity under excessive resampling).

Let 𝐗\mathbf{X} be a random data matrix satisfying (1) and (2) and 𝐗[k]\mathbf{X}^{[k]} be the resampled matrix defined as above. If kn5/3k\gg n^{5/3}, then the associated principal components are asymptotically orthogonal, i.e.

limn𝔼|𝐯,𝐯[k]|=0,andlimn𝔼|𝐮,𝐮[k]|=0.\lim_{n\to\infty}\mathbb{E}\left|\langle\mathbf{v},\mathbf{v}^{[k]}\rangle\right|=0,\ \ \mbox{and}\ \ \lim_{n\to\infty}\mathbb{E}\left|\langle\mathbf{u},\mathbf{u}^{[k]}\rangle\right|=0. (3)
Theorem 2 (Noise stability under moderate resampling).

Let 𝐗\mathbf{X} be a random data matrix satisfying (1) and (2) and 𝐗[k]\mathbf{X}^{[k]} be the resampled matrix defined as above. For any ϵ0>0\epsilon_{0}>0,

max1kn5/3ϵ0mins{1,1}n𝐯s𝐯[k]0,\max_{1\leq k\leq n^{5/3-\epsilon_{0}}}\min_{s\in\{-1,1\}}\sqrt{n}\|\mathbf{v}-s\mathbf{v}^{[k]}\|_{\infty}\xrightarrow{\mathbb{P}}0, (4)

where \xrightarrow{\mathbb{P}} means convergence in probability. In particular, this implies

limn𝔼[max1kn5/3ϵ0mins{1,1}𝐯s𝐯[k]2]=0.\lim_{n\to\infty}\mathbb{E}\left[\max_{1\leq k\leq n^{5/3-\epsilon_{0}}}\min_{s\in\{-1,1\}}\|\mathbf{v}-s\mathbf{v}^{[k]}\|_{2}\right]=0.

The same result also holds for 𝐮\mathbf{u} and 𝐮[k]\mathbf{u}^{[k]}.

These two theorems together state that the critical threshold for the resampling strength is of order kn5/3k\asymp n^{5/3}. Note that n5/3n^{5/3} is negligible compared with the total number npn2np\asymp n^{2} of entries. We show that, above the threshold n5/3n^{5/3}, resampling even a negligible portion of the data results in a dramatic change of the principal component, in the sense that the new principal component is asymptotically orthogonal to the old one; below the threshold, a relatively mild resampling has almost no effect on the new principal component. In terms of the eigenvector overlaps |𝐯,𝐯[k]||\langle\mathbf{v},\mathbf{v}^{[k]}\rangle| and |𝐮,𝐮[k]||\langle\mathbf{u},\mathbf{u}^{[k]}\rangle|, these quantities exhibit a sharp phase transition from 11 to 0 near the critical threshold kn5/3k\asymp n^{5/3}.

We remark that the phase transition stated in the above theorems is not restricted to the top eigenvectors 𝐯,𝐯[k],𝐮,𝐮[k]\mathbf{v},\mathbf{v}^{[k]},\mathbf{u},\mathbf{u}^{[k]}. With essentially the same arguments, we can prove that for any fixed m>0m>0, the mm-th leading eigenvectors 𝐯m,𝐯m[k]\mathbf{v}_{m},\mathbf{v}^{[k]}_{m} and 𝐮m,𝐮m[k]\mathbf{u}_{m},\mathbf{u}^{[k]}_{m} exhibit the same phase transition at the critical threshold of the same order n5/3n^{5/3}.

1.2 High-Level Proof Scheme

The high-level idea is the “superconcentration implies chaos” phenomenon established by Chatterjee [Cha14], which means that a small perturbation (beyond a certain threshold) of a super-concentrated system (in the sense that it is characterized by some quantity with small variance) leads to a dramatic change of the system. For random matrix models, the super-concentrated quantities are usually the eigenvalues. Resampling sensitivity for random matrices was first studied for Wigner matrices and sparse Erdős-Rényi graphs in [BLZ20, BL22]. For the PCA model, the key difference is that the entries of the sample covariance matrix are correlated. Moreover, resampling a single entry of the data changes Θ(n)\Theta(n) entries of the sample covariance matrix. These two differences make the proofs more technical, and a linearization trick is important to reduce the interdependency of the matrix entries. In addition, we need a technical variance formula related to resampling to compute the variance of the top eigenvalue, as well as tools from random matrix theory, in particular the local Marchenko-Pastur law for the resolvent and the delocalization of eigenvectors.

For the sensitivity of the principal components, we show that the inner products 𝐯,𝐯[k]\langle\mathbf{v},\mathbf{v}^{[k]}\rangle and 𝐮,𝐮[k]\langle\mathbf{u},\mathbf{u}^{[k]}\rangle are closely related to the variance of the top eigenvalue of the sample covariance matrix. Using the variance formula for resampling, we obtain a precise relation between the inner product of the top eigenvectors and the concentration of the top eigenvalue. The perturbation effect of the resampling is studied via the variational representation of the eigenvalue.

For the stability under moderate resampling, the key idea of our proof is to study the stability of the resolvent. On the one hand, the stability of the resolvent implies the stability of the top eigenvalue. On the other hand, the resolvent can be used to approximate certain useful statistics of the top eigenvector. The stability of the resolvent is proved via a Lindeberg exchange strategy. The resampling procedure can be decomposed into a martingale, and the difference between the resolvents can therefore be bounded by martingale concentration inequalities, combined with the local Marchenko-Pastur law for the resolvent as an a priori estimate.

2 Sensitivity under Excessive Resampling

We provide a heuristic argument for deriving the threshold for the sensitivity regime. We consider the derivative of the top eigenvalue as a function of the matrix entries. Then we have the approximation

λ[1]λ𝐯[(𝐗[1])𝐗[1]𝐗𝐗]𝐯\lambda^{[1]}-\lambda\approx\mathbf{v}^{\top}\left[(\mathbf{X}^{[1]})^{\top}\mathbf{X}^{[1]}-\mathbf{X}^{\top}\mathbf{X}\right]\mathbf{v}

Note that the matrix in the parenthesis has only Θ(p)\Theta(p) non-zero entries, and each entry is roughly of size O(p1)O(p^{-1}). Also, the eigenvector 𝐯\mathbf{v} is delocalized in the sense that 𝐯(m)p1/2\mathbf{v}(m)\approx p^{-1/2} for all m=1,,pm=1,\dots,p. A central limit theorem yields that approximately we have

λ[1]λpp1p1=p3/2.\lambda^{[1]}-\lambda\approx\sqrt{p}p^{-1}p^{-1}=p^{-3/2}.

By this heuristic argument and the central limit theorem, we have

λ[k]λkp3/2.\lambda^{[k]}-\lambda\approx\sqrt{k}p^{-3/2}.

Note that from random matrix theory, we know that λ1λ2\lambda_{1}-\lambda_{2} is of order p2/3p^{-2/3}. Therefore, if kp3/2p2/3\sqrt{k}p^{-3/2}\ll p^{-2/3} (this corresponds to kn5/3k\ll n^{5/3}), then the difference between the two top eigenvalues λ\lambda and λ[k]\lambda^{[k]} is much smaller than the gap between the first two eigenvalues λ1\lambda_{1} and λ2\lambda_{2} of the matrix 𝐗𝐗\mathbf{X}^{\top}\mathbf{X}. This implies that the perturbation effect of the resampling is small, and therefore in this case it is plausible to believe that 𝐯[k]\mathbf{v}^{[k]} is just a small perturbation of 𝐯\mathbf{v}. Thus, for the sensitivity regime, we must have kn5/3k\gg n^{5/3}.
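The scaling in this heuristic can be probed numerically. The rough Monte Carlo check below (Gaussian data, modest size, all parameter choices ours) compares the average shift of the top eigenvalue after resampling kk entries with the prediction √k p^{-3/2} and with the spectral gap; since constants are not tracked, only the orders of magnitude are meaningful.

```python
import numpy as np

rng = np.random.default_rng(1)
n = p = 400
X = rng.standard_normal((n, p)) / np.sqrt(p)
evals = np.linalg.eigvalsh(X.T @ X)
lam1, lam2 = evals[-1], evals[-2]

def avg_top_eig_shift(k, trials=20):
    """Average |lambda^{[k]} - lambda| over independent resamplings of k entries."""
    shifts = []
    for _ in range(trials):
        Xk = X.copy()
        flat = rng.choice(n * p, size=k, replace=False)
        r, c = np.unravel_index(flat, (n, p))
        Xk[r, c] = rng.standard_normal(k) / np.sqrt(p)
        shifts.append(abs(np.linalg.eigvalsh(Xk.T @ Xk)[-1] - lam1))
    return np.mean(shifts)

for k in (10, 100, 1000):
    print(f"k={k:5d}  shift={avg_top_eig_shift(k):.2e}  "
          f"sqrt(k)*p^-1.5={np.sqrt(k) * p**-1.5:.2e}  gap={lam1 - lam2:.2e}")
```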

Our proof is essentially trying to make the above heuristics rigorous. To do this, a key observation is that the inner products 𝐯,𝐯[k]\langle\mathbf{v},\mathbf{v}^{[k]}\rangle and 𝐮,𝐮[k]\langle\mathbf{u},\mathbf{u}^{[k]}\rangle can be related to the variance of the leading eigenvalue.

2.1 Connection with Variance of Top Eigenvalue

As mentioned above, the key step in the proof for the sensitivity regime is to establish a connection between the inner products of the top eigenvectors and the variance of the top eigenvalue. Specifically, we will prove

𝔼[|𝐯,𝐯[k]|2]Cn3𝖵𝖺𝗋(σ)k+o(1),\mathbb{E}\left[|\langle\mathbf{v},\mathbf{v}^{[k]}\rangle|^{2}\right]\leq C\frac{n^{3}\mathsf{Var}(\sigma)}{k}+o(1),

where C>0C>0 is some universal constant and σ=λ\sigma=\sqrt{\lambda} is the leading singular value. A similar result is also true for 𝐮\mathbf{u} and 𝐮[k]\mathbf{u}^{[k]}. For more details, we refer to Section C.2.

From random matrix theory, we have 𝖵𝖺𝗋(σ)=O(n4/3)\mathsf{Var}(\sigma)=O(n^{-4/3}). Then, based on this inequality, we derive the threshold kn5/3k\gg n^{5/3} for the sensitivity regime.

3 Stability under Moderate Resampling

To establish the stability of PCA when the resampling strength is mild, we utilize tools from random matrix theory; specifically, the proof relies on the analysis of the resolvent matrix. Also, to simplify the Gram matrix structure of the sample covariance matrix, we use a linearization trick when considering the resolvent. For any zz\in\mathbb{C} with Imz>0\text{\rm Im}{\,}z>0, the resolvent is defined as

𝐑(z):=(𝐈n𝐗𝐗z𝐈p)1.\mathbf{R}(z):=\left(\begin{matrix}-\mathbf{I}_{n}&\mathbf{X}\\ \mathbf{X}^{\top}&-z\mathbf{I}_{p}\end{matrix}\right)^{-1}.

Similarly, we denote the resolvent of 𝐗[k]\mathbf{X}^{[k]} by 𝐑[k](z)\mathbf{R}^{[k]}(z). The key idea for the proofs in the stability regime is that eigenvectors can be approximated by resolvents and the resolvents are stable under moderate resampling.
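By the Schur complement formula (a standard fact not spelled out above), the lower-right p×p block of R(z) coincides with the usual resolvent (X^⊤X − zI_p)^{-1}, which is why eigenvector information about the sample covariance matrix can be read off from R(z). The snippet below is our own illustration and verifies this identity numerically.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 100
X = rng.standard_normal((n, p)) / np.sqrt(p)
z = 3.0 + 1e-2j

# linearized resolvent R(z) = [[-I_n, X], [X^T, -z I_p]]^{-1}
R = np.linalg.inv(np.block([[-np.eye(n), X], [X.T, -z * np.eye(p)]]))

# its lower-right p x p block equals (X^T X - z I_p)^{-1}
R_H = np.linalg.inv(X.T @ X - z * np.eye(p))
print(np.max(np.abs(R[n:, n:] - R_H)))   # ~ machine precision
```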

3.1 Resolvent Approximation

To illustrate the usefulness of the resolvent, we show that the entries of the resolvent can be used to approximate the product of entries in the eigenvector. For some small δ>0\delta>0, let z0=λ+iηz_{0}=\lambda+\mathrm{i}\eta with η=n2/3δ\eta=n^{-2/3-\delta}. In the regime kn5/3ϵ0k\leq n^{5/3-\epsilon_{0}} for some ϵ0>0\epsilon_{0}>0, there exists some small c>0c>0 such that for all α,β=1,,p\alpha,\beta=1,\dots,p, we have

|ηIm𝐑n+α,n+β(z0)𝐯(α)𝐯(β)|n1c,\left|{\eta\text{\rm Im}{\,}\mathbf{R}_{n+\alpha,n+\beta}(z_{0})-\mathbf{v}(\alpha)\mathbf{v}(\beta)}\right|\leq n^{-1-c},

and

|ηIm𝐑n+α,n+β[k](z0)𝐯[k](α)𝐯[k](β)|n1c.\left|{\eta\text{\rm Im}{\,}\mathbf{R}^{[k]}_{n+\alpha,n+\beta}(z_{0})-\mathbf{v}^{[k]}(\alpha)\mathbf{v}^{[k]}(\beta)}\right|\leq n^{-1-c}.

A similar result also holds for 𝐮\mathbf{u} and 𝐮[k]\mathbf{u}^{[k]}. For more details, we refer to Lemma D.5.

3.2 Stability of the Resolvent

Since the eigenvector can be approximated by the resolvent, it suffices to show the stability of the resolvent. Consider the regime kn5/3ϵ0k\leq n^{5/3-\epsilon_{0}} for some ϵ0>0\epsilon_{0}>0. For some small δ>0\delta>0 and all z=E+iηz=E+\mathrm{i}\eta close to the upper spectral edge with η=n2/3δ\eta=n^{-2/3-\delta}, there exists a small constant c>0c>0 such that the following holds for all i,j=1,,ni,j=1,\dots,n and α,β=1,,p\alpha,\beta=1,\dots,p,

|𝐑ij[k](z)𝐑ij(z)|1n1+cη,\left|{\mathbf{R}^{[k]}_{ij}(z)-\mathbf{R}_{ij}(z)}\right|\leq\frac{1}{n^{1+c}\eta},

and

|𝐑n+α,n+β[k](z)𝐑n+α,n+β(z)|1n1+cη.\left|{\mathbf{R}^{[k]}_{n+\alpha,n+\beta}(z)-\mathbf{R}_{n+\alpha,n+\beta}(z)}\right|\leq\frac{1}{n^{1+c}\eta}.

This is the main technical part of the whole argument, and its proof relies on the Lindeberg exchange method and a martingale concentration argument. For more details, we refer to Lemma D.3.

Combining the stability of the resolvents with the resolvent approximation for eigenvectors, we can conclude that 𝐯\mathbf{v} and 𝐯[k]\mathbf{v}^{[k]} must be close (similarly, also for 𝐮\mathbf{u} and 𝐮[k]\mathbf{u}^{[k]}).

3.3 Stability of the Top Eigenvalue

As a byproduct, we derive a stability estimate for the top eigenvalue, which may be of independent interest. For kn5/3ϵ0k\leq n^{5/3-\epsilon_{0}} with some arbitrary ϵ0>0\epsilon_{0}>0 and an arbitrary ε>0\varepsilon>0,

|λ[k]λ|n2/3δ+ε,|\lambda^{[k]}-\lambda|\leq n^{-2/3-\delta+\varepsilon},

where δ>0\delta>0 is some small constant. Since the fluctuation of the top eigenvalue around its typical location is of order O(n2/3+ε)O(n^{-2/3+\varepsilon}), this result shows that the top eigenvalue is indeed non-trivially stable under moderate resampling. For more details, we refer to Lemma D.4.

4 Discussions and Applications

We have shown that PCA is sensitive to the input data, in the sense that resampling kn5/3k\gg n^{5/3} entries, a vanishing fraction of the data, results in decorrelation between the new output and the old output. We further prove that this threshold is sharp: a moderate resampling below this threshold has asymptotically no effect on the outcome.

Moreover, besides demonstrating an exciting phenomenon, our results have broad implications in other related fields. We briefly discuss a few potential extensions and applications that would be worth further exploration.

4.1 Extensions to Broader PCA Models

Sparse PCA

A natural extension is to consider sparse data, which corresponds to sparse PCA, a topic that has received a lot of attention in the past decade (see e.g. [MWA05, ZHT06, CMW13]). However, establishing the sharp phase transition for sparse PCA requires several technical ingredients that are currently missing. In particular, for the sensitivity regime, the eigenvalue gap property (i.e. tail estimates for the eigenvalue gap, see Lemma B.3) is unknown. Also, to establish the sharp threshold for the stability regime, an improved local law for the resolvent is needed (cf. Lemma D.1). Though we expect these missing parts to hold, proving them would be beyond the scope of this paper and we leave them to future work. Assuming the eigenvalue gap property and the improved local law, the phase transition can be proved by the same arguments as in this paper.

PCA with General Population

For practical purposes, it would be significant to consider data with a general population covariance profile (see e.g. [Nad08]). The corresponding matrices were studied in [BPZ15, LS16]. The only missing ingredient we need is the eigenvalue gap property. Under reasonable assumptions for the population matrix so that the eigenvalue gap property is true, our arguments in this paper will yield the same stability-sensitivity transition.

4.2 Extensions to Other Statistical Methods

Within the general PCA framework, one important variant is kernel PCA, which is closely related to the widely used spectral clustering [NJW01, VL07, CBG16]. The corresponding kernel random matrices were studied in [EK10, CS13]. However, these kernel random matrices are far from being well understood. In particular, the study of their eigenvectors has been very limited, and the indispensable delocalization property is still unknown.

It would be interesting to explore whether other statistical methods share the same resampling sensitivity phenomenon. Random matrices associated with canonical correlation analysis (CCA) or multivariate analysis of variance (MANOVA) are well studied [HPZ16, HPY18, Yan22]. In particular, the Tracy-Widom concentration of the top eigenvalue, one of the most important ingredients that we need, has been proved. We anticipate that these models exhibit a stability-sensitivity transition similar to that of PCA.

4.3 Differential Privacy

In our paper, we study the stability of the top eigenvalue and the top eigenvector under resampling in terms of bounding the \ell_{\infty} distance. Such stability estimates can be regarded as the global sensitivity of PCA performed on neighboring datasets. This measurement is closely related to the analysis of differential privacy [DR14]. PCA under differential privacy was previously studied in [BDMN05, CSS13], among others. Our results revisit the problem of designing a private algorithm for computing the principal component. Here we remark that though the statements in Theorem 1 and Theorem 2 are qualitative, a careful examination of the proof can yield some quantitative estimates. Based on the stability estimates in terms of the \ell_{\infty} metric, a simple Laplace mechanism produces a differentially private version of PCA for computing the top eigenvalue or the top eigenvector. However, compared with [CSS13], our results are limited in the sense that their results are non-asymptotic for all sample sizes nn and data dimensions pp, while ours are restricted to the proportional growth regime.

Moreover, previous works on differentially private PCA focused on neighboring datasets that differ by one sample vector. Our result may be seen as a refined notion of privacy, since we can analyze the sensitivity of PCA over two “neighboring” datasets with kk different entries for any kk.

Meanwhile, the largest eigenvalue of the sample covariance matrix plays an important role in hypothesis testing. For example, Roy’s largest root test is used in many problems (see e.g. [JN17]). Our result may provide useful insights for constructing a differentially private test statistic based on the top eigenvalue.

4.4 Database Alignment

Database alignment (or, in some cases, graph matching) refers to the optimization problem in which we are given two datasets and the goal is to find an optimal correspondence between the samples and the features that maximally aligns the data. For datasets 𝐗,𝐘n×p\mathbf{X},\mathbf{Y}\in\mathbb{R}^{n\times p}, we look for permutations πs𝒮n\pi_{\mathrm{s}}\in{\mathcal{S}}_{n} and πf𝒮p\pi_{\mathrm{f}}\in{\mathcal{S}}_{p} to solve the optimization problem

maxπs,πfi=1nα=1p𝐗iα𝐘πs(i)πf(α),\max_{\pi_{\mathrm{s}},\pi_{\mathrm{f}}}\sum_{i=1}^{n}\sum_{\alpha=1}^{p}\mathbf{X}_{i\alpha}\mathbf{Y}_{\pi_{\mathrm{s}}(i)\pi_{\mathrm{f}}(\alpha)},

where 𝒮n{\mathcal{S}}_{n} and 𝒮p{\mathcal{S}}_{p} are the sets of all permutations on [n][n] and [p][p], respectively. This problem is closely related to the Quadratic Assignment Problem (QAP), which is known to be NP-hard to solve or even approximate.
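In code, the alignment objective for a candidate pair of permutations is essentially a one-liner; the helper below is only an illustration with names of our own choosing.

```python
import numpy as np

def alignment_objective(X, Y, perm_s, perm_f):
    """sum_{i,alpha} X[i, alpha] * Y[perm_s[i], perm_f[alpha]]."""
    return float(np.sum(X * Y[np.ix_(perm_s, perm_f)]))

# identity permutations recover the plain inner product <X, Y>
rng = np.random.default_rng(6)
n, p = 4, 3
X, Y = rng.standard_normal((n, p)), rng.standard_normal((n, p))
assert np.isclose(alignment_objective(X, Y, np.arange(n), np.arange(p)),
                  np.sum(X * Y))
```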

The study of the alignment problem for correlated random databases has a long history. Previous work mainly considered matrices that are correlated through some additive perturbation, and more general models were studied under a homogeneous correlation (i.e. the correlation between all corresponding pairs is the same). See for example [DCK19, WXS22] and many other works.

Our resampling procedure may be regarded as an adversarial corruption of the dataset, which is a different kind of correlation compared with previous work. To our knowledge, this is the first time that database alignment with adversarial corruption has been considered. The setting of the problem is as follows: we have two matrices

𝐗n×p,𝐘=𝚷s𝐗[k]𝚷f\mathbf{X}\in\mathbb{R}^{n\times p},\ \ \ \mathbf{Y}=\mathbf{\Pi}_{\mathrm{s}}\mathbf{X}^{[k]}\mathbf{\Pi}_{\mathrm{f}}^{\top}

where 𝐗\mathbf{X} is a random matrix satisfying (1) and (2), and 𝚷s\mathbf{\Pi}_{\mathrm{s}} and 𝚷f\mathbf{\Pi}_{\mathrm{f}} are permutation matrices of order nn and pp chosen uniformly at random. The goal is to recover the permutations 𝚷s\mathbf{\Pi}_{\mathrm{s}} and 𝚷f\mathbf{\Pi}_{\mathrm{f}} based on the observations 𝐗\mathbf{X} and 𝐘\mathbf{Y}. Here, we can think of 𝐘\mathbf{Y} as the unlabeled version of 𝐗\mathbf{X} with adversarial corruption. By considering the covariance matrices, we have

𝐀=𝐗𝐗,𝐁=𝐘𝐘=𝚷s(𝐗[k](𝐗[k]))𝚷s,\mathbf{A}=\mathbf{X}\mathbf{X}^{\top},\ \ \mathbf{B}=\mathbf{Y}\mathbf{Y}^{\top}=\mathbf{\Pi}_{\mathrm{s}}\left(\mathbf{X}^{[k]}(\mathbf{X}^{[k]})^{\top}\right)\mathbf{\Pi}_{\mathrm{s}}^{\top},

and similarly we consider

𝐀^=𝐗𝐗,𝐁^=𝐘𝐘=𝚷f((𝐗[k])𝐗[k])𝚷f.\widehat{\mathbf{A}}=\mathbf{X}^{\top}\mathbf{X},\ \ \widehat{\mathbf{B}}=\mathbf{Y}^{\top}\mathbf{Y}=\mathbf{\Pi}_{\mathrm{f}}\left((\mathbf{X}^{[k]})^{\top}\mathbf{X}^{[k]}\right)\mathbf{\Pi}_{\mathrm{f}}^{\top}.

A natural idea to reconstruct the permutations 𝚷s\mathbf{\Pi}_{\mathrm{s}} (and 𝚷f)\mathbf{\Pi}_{\mathrm{f}}) is to align the top eigenvectors of the matrices 𝐀\mathbf{A} and 𝐁\mathbf{B} (and 𝐀^\widehat{\mathbf{A}} and 𝐁^\widehat{\mathbf{B}}). See Algorithm 1 for details. This spectral method is a natural technique for database alignment and graph matching. We are interested in the resampling strengths under which the PCA-Recovery algorithm can almost perfectly reconstruct the permutations, and the conditions under which this method completely fails.

Algorithm 1 PCA-Recovery
  Input: data matrices 𝐗,𝐘n×p\mathbf{X},\mathbf{Y}\in\mathbb{R}^{n\times p}
  Output: permutation matrices 𝚷sn×n\mathbf{\Pi}_{\mathrm{s}}\in\mathbb{R}^{n\times n}, 𝚷fp×p\mathbf{\Pi}_{\mathrm{f}}\in\mathbb{R}^{p\times p}
  Compute 𝐮\mathbf{u} the unit leading left singular vector of 𝐗\mathbf{X}
  Compute 𝐯\mathbf{v} the unit leading right singular vector of 𝐗\mathbf{X}
  Compute 𝐮\mathbf{u}^{\prime} the unit leading left singular vector of 𝐘\mathbf{Y}
  Compute 𝐯\mathbf{v}^{\prime} the unit leading right singular vector of 𝐘\mathbf{Y}
  
  Compute 𝚷s+\mathbf{\Pi}_{\mathrm{s}}^{+} the permutation aligning 𝐮\mathbf{u} and 𝐮\mathbf{u}^{\prime}
  Compute 𝚷s\mathbf{\Pi}_{\mathrm{s}}^{-} the permutation aligning 𝐮\mathbf{u} and 𝐮-\mathbf{u}^{\prime}
  Compute 𝚷f+\mathbf{\Pi}_{\mathrm{f}}^{+} the permutation aligning 𝐯\mathbf{v} and 𝐯\mathbf{v}^{\prime}
  Compute 𝚷f\mathbf{\Pi}_{\mathrm{f}}^{-} the permutation aligning 𝐯\mathbf{v} and 𝐯-\mathbf{v}^{\prime}
  
  if 𝐀,𝚷s+𝐁(𝚷s+)𝐀,𝚷s𝐁(𝚷s)\langle{\mathbf{A}},\mathbf{\Pi}_{\mathrm{s}}^{+}{\mathbf{B}}(\mathbf{\Pi}_{\mathrm{s}}^{+})^{\top}\rangle\geq\langle{\mathbf{A}},\mathbf{\Pi}_{\mathrm{s}}^{-}{\mathbf{B}}(\mathbf{\Pi}_{\mathrm{s}}^{-})^{\top}\rangle then
     𝚷s𝚷s+\mathbf{\Pi}_{\mathrm{s}}\leftarrow\mathbf{\Pi}_{\mathrm{s}}^{+}
  else
     𝚷s𝚷s\mathbf{\Pi}_{\mathrm{s}}\leftarrow\mathbf{\Pi}_{\mathrm{s}}^{-}
  end if
  
  if 𝐀^,𝚷f+𝐁^(𝚷f+)𝐀^,𝚷f𝐁^(𝚷f)\langle\widehat{\mathbf{A}},\mathbf{\Pi}_{\mathrm{f}}^{+}\widehat{\mathbf{B}}(\mathbf{\Pi}_{\mathrm{f}}^{+})^{\top}\rangle\geq\langle\widehat{\mathbf{A}},\mathbf{\Pi}_{\mathrm{f}}^{-}\widehat{\mathbf{B}}(\mathbf{\Pi}_{\mathrm{f}}^{-})^{\top}\rangle then
     𝚷f𝚷f+\mathbf{\Pi}_{\mathrm{f}}\leftarrow\mathbf{\Pi}_{\mathrm{f}}^{+}
  else
     𝚷f𝚷f\mathbf{\Pi}_{\mathrm{f}}\leftarrow\mathbf{\Pi}_{\mathrm{f}}^{-}
  end if
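A compact NumPy transcription of Algorithm 1 might look as follows. The step “the permutation aligning two vectors” is interpreted here as matching entries by their order statistics (which minimizes the ℓ2 distance over permutations); this interpretation, as well as all variable names, is ours, so the snippet is a sketch rather than the definitive implementation.

```python
import numpy as np

def perm_aligning(a, b):
    """Index array perm with a[i] matched to b[perm[i]] by order statistics,
    i.e. minimizing ||a - b[perm]||_2 over all permutations."""
    perm = np.empty(len(a), dtype=int)
    perm[np.argsort(a)] = np.argsort(b)
    return perm

def pca_recovery(X, Y):
    """Sketch of Algorithm 1 (PCA-Recovery): estimate the sample and feature
    correspondences from the leading singular vectors of X and Y."""
    A, B = X @ X.T, Y @ Y.T              # Gram matrices for the samples
    A_hat, B_hat = X.T @ X, Y.T @ Y      # covariance matrices for the features

    Ux, _, Vtx = np.linalg.svd(X, full_matrices=False)
    Uy, _, Vty = np.linalg.svd(Y, full_matrices=False)
    u, v = Ux[:, 0], Vtx[0]
    u2, v2 = Uy[:, 0], Vty[0]

    def pick(a, b, M, N):
        # resolve the sign ambiguity of the singular vectors by keeping the
        # candidate with the larger QAP-type score <M, N re-indexed by perm>
        cands = [perm_aligning(a, b), perm_aligning(a, -b)]
        scores = [np.sum(M * N[np.ix_(perm, perm)]) for perm in cands]
        return cands[int(np.argmax(scores))]

    perm_s = pick(u, u2, A, B)        # perm_s[i]: row of Y matched to row i of X
    perm_f = pick(v, v2, A_hat, B_hat)
    return perm_s, perm_f
```

The returned index arrays encode the estimated correspondences; converting them into the permutation matrices of the pseudocode is straightforward.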

Spectral methods have been studied and applied in many scenarios (see e.g. [FMWX20, FMWX22] and [GLM22]). In particular, in [GLM22], a similar PCA method was studied to match two symmetric Gaussian matrices correlated via additive Gaussian noise. Their work proved a similar 010-1 transition for the inner product of the top eigenvectors, which leads to an all-or-nothing phenomenon in the alignment problem, i.e. the accuracy of the recovery undergoes a sharp transition from 0 to 11 near some critical threshold. However, the arguments in [GLM22] are not applicable in our case. Their proof heavily depends on the Gaussian assumption on the matrices and the additive structure of the noise. In particular, their proof crucially relies on the orthogonal invariance of the Gaussian noise. In our case, by contrast, the noise enters through the resampling, and there is no way to write the “noise” in an additive form that is independent of the “signal”. Even in the Gaussian case, a rigorous analysis of the PCA-Recovery algorithm seems difficult.

Nevertheless, our results on the sensitivity of the eigenvector inner products suggest that, when kn5/3k\gg n^{5/3}, the two eigenvectors are approximately de-correlated so that they share almost no common information. Consequently, recovery of the data via aligning the principal components would be basically random guessing. Therefore, we conjecture that if kn5/3k\gg n^{5/3}, PCA-Recovery fails to recover the latent permutations in the sense that it can only achieve o(1)o(1) fraction of correct matching with the ground truth. On the other hand, when kn5/3k\ll n^{5/3}, the performance of our algorithm seems mysterious.

In Section 5, we empirically check the performance of the PCA-Recovery algorithm. Numerical simulations suggest that when kn5/3k\gg n^{5/3}, the performance of PCA-Recovery is indeed poor in the sense that the accuracy of the recovery is almost 0. On the other hand, when kn5/3k\ll n^{5/3}, experiments show that we cannot expect a sharp all-or-nothing phenomenon similar to that in [GLM22].

Finally, we remark that PCA-Recovery actually addresses a more difficult task, as it does not need direct observations of 𝐗\mathbf{X} and 𝐘\mathbf{Y}. We can consider a harder problem (in both the statistical and the computational sense), which we call alignment from covariance profile. In this problem, we only have access to the covariances between the samples, and we aim to recover the correspondence between the samples from the two databases. A similar problem with Gaussian data and additive noise was considered in [WWXY22] as a prototype for matching random geometric dot-product graphs. The analysis of such database alignment problems with adversarial corruption will be an interesting direction for future studies.

5 Numerical Experiments

We validate our theoretical prediction by checking the performance of PCA on synthetic data. To highlight the universality of our results, we consider Gaussian data and Bernoulli data. The Gaussian data matrix consists of i.i.d. 𝒩(0,1){\mathcal{N}}(0,1) entries. The Bernoulli data matrix consists of i.i.d. entries taking values in {±1}\{\pm 1\} with equal probability 12\frac{1}{2}. To visualize the stability-sensitivity transition, we focus on the overlap of the leading eigenvectors |𝐯,𝐯[k]||\langle\mathbf{v},\mathbf{v}^{[k]}\rangle| and |𝐮,𝐮[k]||\langle\mathbf{u},\mathbf{u}^{[k]}\rangle| as the observable. Note that, in the stability regime, the asymptotic colinearity (4) implies that |𝐯,𝐯[k]|,|𝐮,𝐮[k]|1|\langle\mathbf{v},\mathbf{v}^{[k]}\rangle|,|\langle\mathbf{u},\mathbf{u}^{[k]}\rangle|\to 1. Therefore, we expect a phase transition from 1 to 0 at the critical threshold kn5/3k\asymp n^{5/3}.

We first focus on rectangular data matrices. For concreteness, we set p/n=0.25p/n=0.25 and n=1000n=1000. As shown in Figure 1, there is a clear phase transition in which the eigenvector overlap varies from 11 to 0. The figure also provides good evidence that the transition happens at the critical threshold kn5/3k\asymp n^{5/3}.
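The experiment behind Figure 1 can be reproduced with a short script. The sketch below uses a smaller matrix size and fewer repetitions than in the figure to keep the runtime modest, and all parameter choices are ours; at such small sizes the transition is smeared out, and larger n is needed to see it sharpen around k ≍ n^{5/3}.

```python
import numpy as np

def overlap_after_resampling(n, p, k, rng):
    """|<v, v^{[k]}>| for one Gaussian data matrix and one resampling of k entries."""
    X = rng.standard_normal((n, p)) / np.sqrt(p)
    Xk = X.copy()
    flat = rng.choice(n * p, size=k, replace=False)
    r, c = np.unravel_index(flat, (n, p))
    Xk[r, c] = rng.standard_normal(k) / np.sqrt(p)
    v = np.linalg.eigh(X.T @ X)[1][:, -1]
    vk = np.linalg.eigh(Xk.T @ Xk)[1][:, -1]
    return abs(v @ vk)

rng = np.random.default_rng(3)
n, p = 400, 100
for expo in (1.2, 1.4, 1.6, 1.8, 2.0):
    k = min(int(n ** expo), n * p)
    vals = [overlap_after_resampling(n, p, k, rng) for _ in range(10)]
    print(f"k = n^{expo:.1f} = {k:6d}:  mean overlap = {np.mean(vals):.3f}")
```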

(a) The inner product |𝐯,𝐯[k]||\langle\mathbf{v},\mathbf{v}^{[k]}\rangle|
(b) The inner product |𝐮,𝐮[k]||\langle\mathbf{u},\mathbf{u}^{[k]}\rangle|
Figure 1: Inner products of the leading eigenvectors for 1000×2501000\times 250 matrices with Gaussian and Bernoulli data. The horizontal axis is the resampling strength, given by logn(4k)\log_{n}(4k). Each experiment is averaged over 50 repetitions.

We also check square 1000×10001000\times 1000 data matrices. As shown in Figure 2, for both Gaussian data and Bernoulli data, we observe the same phase transition for the overlap of the leading eigenvectors. Again, this numerical simulation supports our theoretical prediction that the transition happens at the critical threshold kn5/3k\asymp n^{5/3}.

(a) The inner product |𝐯,𝐯[k]||\langle\mathbf{v},\mathbf{v}^{[k]}\rangle|
(b) The inner product |𝐮,𝐮[k]||\langle\mathbf{u},\mathbf{u}^{[k]}\rangle|
Figure 2: Inner products of the leading eigenvectors for 1000×10001000\times 1000 matrices with Gaussian and Bernoulli data. The horizontal axis is the resampling strength, given by logn(k)\log_{n}(k). Each experiment is averaged over 50 repetitions.

We also check the performance of the PCA-Recovery algorithm for database alignment. Again, we consider both Gaussian data and Bernoulli data. As shown in Figure 3, in both the rectangular case and the square case, the accuracy of the algorithm is almost 0 as long as the resampling strength kk surpasses n5/3n^{5/3}. This is consistent with the theoretical prediction that in this case the top eigenvectors approximately decorrelate. On the other hand, numerical simulations suggest that PCA-Recovery is brittle to resampling; in particular, we cannot expect an all-or-nothing phenomenon similar to that in [GLM22]. Identifying the critical threshold below which PCA-Recovery can achieve almost perfect recovery is an interesting open problem for future research.

(a) Fraction of correct matching for rectangular data
(b) Fraction of correct matching for square data
Figure 3: Performance of the PCA algorithm for database alignment with adversarial corruption.

Appendix A Notations and Organization

We use CC to denote a generic constant, which may differ from one appearance to another. We write ABA\lesssim B if there exists a universal C>0C>0 such that ACBA\leq CB, and ABA\gtrsim B if ACBA\geq CB for some universal C>0C>0. We write ABA\asymp B if ABA\lesssim B and BAB\lesssim A.

For the analysis of the sample covariance matrix, it is useful to apply the linearization trick (see e.g. [Tro12, DY18, FWZ18]). Specifically, we will also analyze the symmetrization of 𝐗\mathbf{X}, which is defined as

𝐗~:=(0𝐗𝐗0)\widetilde{\mathbf{X}}:=\left(\begin{matrix}0&\mathbf{X}\\ \mathbf{X}^{\top}&0\end{matrix}\right) (5)

The spectrum of the symmetrization 𝐗~\widetilde{\mathbf{X}} is given by the singular values {λm}m=1p\{\sqrt{\lambda_{m}}\}_{m=1}^{p}, the symmetrized singular values {λm}m=1p\{-\sqrt{\lambda_{m}}\}_{m=1}^{p}, and the trivial eigenvalue 0 with multiplicity npn-p. Let 𝐰i:=(𝐮i,𝐯i)n+p\mathbf{w}_{i}:=(\mathbf{u}_{i}^{\top},\mathbf{v}_{i}^{\top})^{\top}\in\mathbb{R}^{n+p} be the concatenation of the eigenvectors 𝐮i\mathbf{u}_{i} and 𝐯i\mathbf{v}_{i}. Then 𝐰i\mathbf{w}_{i} is an eigenvector of 𝐗~\widetilde{\mathbf{X}} associated with the eigenvalue λi\sqrt{\lambda_{i}}. Indeed, we have

(0𝐗𝐗0)(𝐮i𝐯i)=(𝐗vi𝐗𝐮i)=(λi𝐮iλi𝐯i).\left(\begin{matrix}0&\mathbf{X}\\ \mathbf{X}^{\top}&0\end{matrix}\right)\left(\begin{matrix}\mathbf{u}_{i}\\ \mathbf{v}_{i}\end{matrix}\right)=\left(\begin{matrix}\mathbf{X}v_{i}\\ \mathbf{X}^{\top}\mathbf{u}_{i}\end{matrix}\right)=\left(\begin{matrix}\sqrt{\lambda_{i}}\mathbf{u}_{i}\\ \sqrt{\lambda_{i}}\mathbf{v}_{i}\end{matrix}\right).
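The displayed eigen-relation is straightforward to confirm numerically; the minimal check below (with Gaussian data and variable names of our own) verifies it for the top singular triplet.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 120, 80
X = rng.standard_normal((n, p)) / np.sqrt(p)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_tilde = np.block([[np.zeros((n, n)), X], [X.T, np.zeros((p, p))]])

# w_1 = (u_1, v_1) satisfies X_tilde w_1 = sqrt(lambda_1) w_1, with sqrt(lambda_1) = s[0]
w = np.concatenate([U[:, 0], Vt[0]])
print(np.max(np.abs(X_tilde @ w - s[0] * w)))   # ~ machine precision
```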

An important probabilistic concept that will be used repeatedly is the notion of overwhelming probability.

Definition 1 (Overwhelming probability).

Let {N}\{{\mathcal{E}}_{N}\} be a sequence of events. We say that N{\mathcal{E}}_{N} holds with overwhelming probability if for any (large) D>0D>0, there exists N0(D)N_{0}(D) such that for all NN0(D)N\geq N_{0}(D) we have

(N)1ND.\mathbb{P}({\mathcal{E}}_{N})\geq 1-N^{-D}.

Organization

The appendix is organized as follows. In Section B, we collect some useful tools for the proof, including a variance formula for resampling and classical results from random matrix theory. In Section C, we prove the sensitivity of PCA under excessive resampling. In Section D, we prove that PCA is stable if resampling of the data is moderate.

Appendix B Preliminaries

B.1 Variance formula and resampling

An essential technique for our proof is a formula for the variance of a function of independent random variables. This formula represents the variance via resampling of the random variables. The idea is originally due to Chatterjee [Cha05], and in this paper we use a slight extension of it as in [BLZ20].

Let X1,,XNX_{1},\cdots,X_{N} be independent random variables taking values in some set 𝒳\mathcal{X}, and let f:𝒳Nf:\mathcal{X}^{N}\to\mathbb{R} be some measurable function. Let X=(X1,,XN)X=\left(X_{1},\cdots,X_{N}\right) and XX^{\prime} be an independent copy of XX. We denote

X(i)=(X1,,Xi1,Xi,Xi+1,,XN),andX[i]=(X1,,Xi,Xi+1,,XN).X^{(i)}=(X_{1},\cdots,X_{i-1},X_{i}^{\prime},X_{i+1},\cdots,X_{N}),\ \ \mbox{and}\ \ X^{[i]}=(X_{1}^{\prime},\cdots,X_{i}^{\prime},X_{i+1},\cdots,X_{N}).

In general, for A[N]A\subset[N], we define XAX^{A} to be the random vector obtained from XX by replacing the components indexed by AA with the corresponding components of XX^{\prime}. By a result of Chatterjee [Cha05], we have the following variance formula

𝖵𝖺𝗋(f(X))=12i=1N𝔼[(f(X)f(X(i)))(f(X[i1])f(X[i]))].\mathsf{Var}\left(f(X)\right)=\dfrac{1}{2}\sum_{i=1}^{N}\mathbb{E}\left[\left(f(X)-f(X^{(i)})\right)\left(f(X^{[i-1]})-f(X^{[i]})\right)\right].

We remark that this variance formula does not depend on the order of the random variables. Therefore, we can consider an arbitrary permutation of [N][N]. Specifically, let π=(π(1),,π(N))\pi=(\pi(1),\cdots,\pi(N)) be a random permutation sampled uniformly from the symmetric group 𝒮N{\mathcal{S}}_{N} and denote π([i]):={π(1),,π(i)}\pi([i]):=\{\pi(1),\cdots,\pi(i)\}. Then we have

𝖵𝖺𝗋(f(X))=12i=1N𝔼[(f(X)f(X(π(i))))(f(Xπ([i1]))f(Xπ([i])))].\mathsf{Var}\left(f(X)\right)=\dfrac{1}{2}\sum_{i=1}^{N}\mathbb{E}\left[\left(f(X)-f(X^{(\pi(i))})\right)\left(f(X^{\pi([i-1])})-f(X^{\pi([i])})\right)\right].

Note that, in the formula above, the expectation is taken with respect to XX, XX^{\prime}, and the random permutation π\pi.
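Chatterjee's identity can be sanity-checked by Monte Carlo on a toy function; the example below uses f(X) = max_i X_i with i.i.d. standard Gaussians and the identity ordering (a valid special case of the permuted formula). The choice of f, the sample sizes, and all names are ours; the two printed numbers should agree up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(5)
N, trials = 8, 50_000
f = lambda x: float(np.max(x))   # any measurable f works; max is a simple choice

acc = 0.0
for _ in range(trials):
    X = rng.standard_normal(N)
    Xp = rng.standard_normal(N)                            # independent copy X'
    for i in range(N):                                     # python i = math index m - 1
        X_m = X.copy()
        X_m[i] = Xp[i]                                     # X^{(m)}: only coordinate m resampled
        X_before = np.concatenate([Xp[:i], X[i:]])         # X^{[m-1]}: coords 1..m-1 resampled
        X_after = np.concatenate([Xp[:i + 1], X[i + 1:]])  # X^{[m]}: coords 1..m resampled
        acc += (f(X) - f(X_m)) * (f(X_before) - f(X_after))
var_formula = 0.5 * acc / trials

# direct Monte Carlo estimate of Var(f(X)) for comparison
direct = np.max(rng.standard_normal((trials, N)), axis=1).var()
print(var_formula, direct)
```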

Let jj be uniformly distributed over [N][N]. Let X(j)π([i1])X^{(j)\circ\pi([i-1])} denote the vector obtained from Xπ([i1])X^{\pi([i-1])} by replacing its jj-th component by another independent copy of the random variable XjX_{j} in the following way: if jj belongs to π([i1])\pi([i-1]), then we replace XjX_{j}^{\prime} by Xj′′X_{j}^{\prime\prime}; if jj is not in π([i1])\pi([i-1]), then we replace XjX_{j} by Xj′′′X_{j}^{\prime\prime\prime}. Here X′′X^{\prime\prime} and X′′′X^{\prime\prime\prime} are independent copies of XX such that (X,X,X′′,X′′′)(X,X^{\prime},X^{\prime\prime},X^{\prime\prime\prime}) are independent. With this notation, we have the following estimate.

Lemma B.1 (Lemma 3 in [BLZ20]).

Assume that jj is chosen uniformly at random from the set [N][N], independently of the other random variables involved. Then for any k[N]k\in[N],

Bk2𝖵𝖺𝗋(f(X))k(N+1N)B_{k}\leq\dfrac{2\mathsf{Var}\left(f(X)\right)}{k}\left(\dfrac{N+1}{N}\right)

where for any i[N]i\in[N],

Bi:=𝔼[(f(X)f(X(j)))(f(Xπ([i1]))f(X(j)π([i1])))]B_{i}:=\mathbb{E}\left[\left(f(X)-f(X^{(j)})\right)\left(f(X^{\pi([i-1])})-f(X^{(j)\circ\pi([i-1])})\right)\right]

and the expectation is taken with respect to the components of the vectors, the random permutation π\pi, and the random variable jj.

B.2 Tools from random matrix theory

In this section we collect some classical results in random matrix theory, which will be indispensable for proving the main theorems. These include concentration of the top eigenvalue, eigenvalue rigidity estimates, and eigenvector delocalization.

To begin with, we first state some basic settings and notations. It is well known that the empirical distribution of the spectrum of the sample covariance matrix converges to the Marchenko-Pastur distribution

ρ𝖬𝖯(x)=12πξ[(xλ)(λ+x)]+x2,\rho_{\mathsf{MP}}(x)=\dfrac{1}{2\pi\xi}\sqrt{\dfrac{[(x-\lambda_{-})(\lambda_{+}-x)]_{+}}{x^{2}}}, (6)

where the endpoints of the spectrum are given by

λ±=(1±ξ)2.\lambda_{\pm}=(1\pm\sqrt{\xi})^{2}. (7)

Define the typical locations of the eigenvalues:

γm:=inf{E>0:Eρ𝖬𝖯(x)dxmp}, 1mp.\gamma_{m}:=\inf\left\{E>0:\int_{-\infty}^{E}\rho_{\mathsf{MP}}(x){\rm d}x\geq\dfrac{m}{p}\right\},\ \ \ 1\leq m\leq p.

A classical result in random matrix theory is the rigidity estimate for the eigenvalues [PY14, BEK+14]. Let m^:=min(m,p+1m)\widehat{m}:=\min(m,p+1-m). For any small ε>0\varepsilon>0 and large D>0D>0, there exists n0(ε,D)n_{0}(\varepsilon,D) such that the following holds for any nn0n\geq n_{0},

(|λmγm|n23+ε(m^)13for all 1mp)>1nD.\mathbb{P}\left(|\lambda_{m}-\gamma_{m}|\leq n^{-\frac{2}{3}+\varepsilon}(\widehat{m})^{-\frac{1}{3}}\ \mbox{for all}\ 1\leq m\leq p\right)>1-n^{-D}. (8)

We remark that the square case ξ1\xi\equiv 1 is actually significantly different, due to the singularity of the Marchenko-Pastur law at the spectral edge x=0x=0. Near this edge, the typical eigenvalue spacing would be of order O(n2)O(n^{-2}). In this case, it would be more convenient to state the rigidity estimates in terms of the singular values of the n×nn\times n data matrix. The following result was proved in [AEK14]. For any small ε>0\varepsilon>0 and large D>0D>0, there exists n0(ε,D)n_{0}(\varepsilon,D) such that the following is true for any nn0n\geq n_{0},

(|λmγm|n23+ε(n+1m)13for all 1mn)>1nD.\mathbb{P}\left(|\sqrt{\lambda_{m}}-\sqrt{\gamma_{m}}|\leq n^{-\frac{2}{3}+\varepsilon}(n+1-m)^{-\frac{1}{3}}\ \mbox{for all}\ 1\leq m\leq n\right)>1-n^{-D}. (9)

The intuition for this case is that the singular values of a square data matrix behave like the eigenvalues of a Wigner matrix. More specifically, the singular values {λm}\{\sqrt{\lambda_{m}}\} and their symmetrization {λm}\{-\sqrt{\lambda_{m}}\} are the eigenvalues of the symmetrized matrix 𝐗~\widetilde{\mathbf{X}} defined in (5), and 𝐗~\widetilde{\mathbf{X}} can be seen as a 2n×2n2n\times 2n Wigner matrix with an imprimitive variance profile (see [AEK14]). This is explained in more detail in [Wan19, Wan22].

Another important result is the Tracy-Widom limit for the top eigenvalue (see e.g. [PY14, DY18, Wan19, SX21]). Specifically,

Lemma B.2.

For any ε>0\varepsilon>0, with overwhelming probability, we have

|λλ+|n2/3+ε,and𝖵𝖺𝗋(λ)n4/3.|\lambda-\lambda_{+}|\leq n^{-2/3+\varepsilon},\ \ \mbox{and}\ \ \mathsf{Var}(\lambda)\lesssim n^{-4/3}.

Moreover, for any δ>0\delta>0, there exists a constant c0>0c_{0}>0 such that

(λ1λ2c0n2/3)1δ.\mathbb{P}\left(\lambda_{1}-\lambda_{2}\geq c_{0}n^{-2/3}\right)\geq 1-\delta.
Proof.

The first result follows from the well-known Tracy-Widom limit for the top eigenvalue. Specifically,

limn(γp2/3(λλ+)s)=F1(s),\lim_{n\to\infty}\mathbb{P}\left(\gamma p^{2/3}(\lambda-\lambda_{+})\leq s\right)=F_{1}(s),

where γ\gamma is a constant depending only on the ratio ξ\xi and F1F_{1} is the type-1 Tracy-Widom distribution (in particular, [Wan19, SX21] provided quantitative rates of convergence). For the variance part, the Gaussian case was proved in [LR10], and the general case follows from universality, i.e. for any fixed mm

limn((p2/3(λλ+)s)1m)=limnG((p2/3(λλ+)s)1m),\lim_{n\to\infty}\mathbb{P}\left(\left(p^{2/3}(\lambda_{\ell}-\lambda_{+})\leq s_{\ell}\right)_{1\leq\ell\leq m}\right)=\lim_{n\to\infty}\mathbb{P}^{G}\left(\left(p^{2/3}(\lambda_{\ell}-\lambda_{+})\leq s_{\ell}\right)_{1\leq\ell\leq m}\right),

where G\mathbb{P}^{G} denotes the probability measure associated with the Gaussian matrix. The spectral gap estimate also follows from universality, namely that the edge spectral statistics of the sample covariance matrix agree with those of the Gaussian Orthogonal Ensemble (GOE), i.e. for any fixed mm

limn((γp2/3(λλ+)s)1m)=limn((p2/3(λGOE2)s)1m)\lim_{n\to\infty}\mathbb{P}\left(\left(\gamma p^{2/3}(\lambda_{\ell}-\lambda_{+})\leq s_{\ell}\right)_{1\leq\ell\leq m}\right)=\lim_{n\to\infty}\mathbb{P}\left(\left(p^{2/3}(\lambda_{\ell}^{GOE}-2)\leq s_{\ell}\right)_{1\leq\ell\leq m}\right)

For GOE, the desired spectral gap estimate was proved in e.g. [AGZ10]. ∎

Moreover, an estimate on the eigenvalue gap near the spectral edge is needed. The following result was proved in [TV12, Wan12].

Lemma B.3.

For any c>0c>0, there exists κ>0\kappa>0 such that for every 1ip1\leq i\leq p, with probability at least 1nκ1-n^{-\kappa}, we have

|λiλi+1|n1c.|\lambda_{i}-\lambda_{i+1}|\geq n^{-1-c}.

The property of eigenvectors is also a key ingredient for our proof. In particular, we extensively rely on the following delocalization property (see e.g. [PY14, BEK+14, DY18, Wan22])

Lemma B.4.

For any ε>0\varepsilon>0, with overwhelming probability, we have

max1mp𝐯m+max1in𝐮in12+ε.\max_{1\leq m\leq p}\|\mathbf{v}_{m}\|_{\infty}+\max_{1\leq i\leq n}\|\mathbf{u}_{i}\|_{\infty}\leq n^{-\frac{1}{2}+\varepsilon}.

Appendix C Proofs for the Sensitivity Regime

C.1 Sensitivity analysis for neighboring data matrices

As a first step, we show that resampling a single entry has little perturbation effect on the top eigenvectors. This will be helpful for controlling the single-entry resampling term in the variance formula (see Lemma B.1).

For any fixed 1in1\leq i\leq n and 1αp1\leq\alpha\leq p, let 𝐗(i,α)\mathbf{X}_{(i,\alpha)} be the matrix obtained from 𝐗\mathbf{X} by replacing the (i,α)(i,\alpha) entry 𝐗iα\mathbf{X}_{i\alpha} with 𝐗iα\mathbf{X}_{i\alpha}^{\prime}. Define the corresponding covariance matrix 𝐇(i,α):=𝐗(i,α)𝐗(i,α)\mathbf{H}_{(i,\alpha)}:=\mathbf{X}_{(i,\alpha)}^{\top}\mathbf{X}_{(i,\alpha)}, and use 𝐯(i,α)\mathbf{v}^{(i,\alpha)} to denote its unit top eigenvector. Similarly, we denote by 𝐮(i,α)\mathbf{u}^{(i,\alpha)} the unit top eigenvector of 𝐇^(i,α):=𝐗(i,α)𝐗(i,α)\widehat{\mathbf{H}}_{(i,\alpha)}:=\mathbf{X}_{(i,\alpha)}\mathbf{X}_{(i,\alpha)}^{\top}.

Lemma C.1.

Let c>0c>0 be small and 0<δ<12c0<\delta<\frac{1}{2}-c. For all 1in1\leq i\leq n and 1αp1\leq\alpha\leq p, on the event {λ1λ2n1c}\{\lambda_{1}-\lambda_{2}\geq n^{-1-c}\}, with overwhelming probability

maxi,αmins{±1}𝐯s𝐯(i,α)n12δ\max_{i,\alpha}\min_{s\in\left\{\pm 1\right\}}\|\mathbf{v}-s\mathbf{v}^{(i,\alpha)}\|_{\infty}\leq n^{-\frac{1}{2}-\delta} (10)

and similarly

maxi,αmins{±1}𝐮s𝐮(i,α)n12δ\max_{i,\alpha}\min_{s\in\left\{\pm 1\right\}}\|\mathbf{u}-s\mathbf{u}^{(i,\alpha)}\|_{\infty}\leq n^{-\frac{1}{2}-\delta}
Proof.

Let λ1(i,α)λp(i,α)\lambda_{1}^{(i,\alpha)}\geq\cdots\geq\lambda_{p}^{(i,\alpha)} denote the eigenvalues of the matrix 𝐇(i,α)\mathbf{H}_{(i,\alpha)}, and let 𝐯β(i,α)\mathbf{v}_{\beta}^{(i,\alpha)} denote the unit eigenvector associated with the eigenvalue λβ(i,α)\lambda_{\beta}^{(i,\alpha)}. Similarly, we define the unit eigenvectors {𝐮j(i,α)}\{\mathbf{u}_{j}^{(i,\alpha)}\} for the matrix 𝐇^(i,α)\widehat{\mathbf{H}}_{(i,\alpha)}. Using the variational characterization of the eigenvalues, we have

λ1𝐯1(i,α),𝐇𝐯1(i,α)=λ1(i,α)+𝐯1(i,α),(𝐇𝐇(i,α))𝐯1(i,α).\lambda_{1}\geq\langle\mathbf{v}_{1}^{(i,\alpha)},\mathbf{H}\mathbf{v}_{1}^{(i,\alpha)}\rangle=\lambda_{1}^{(i,\alpha)}+\langle\mathbf{v}_{1}^{(i,\alpha)},(\mathbf{H}-\mathbf{H}_{(i,\alpha)})\mathbf{v}_{1}^{(i,\alpha)}\rangle.

Since 𝐗\mathbf{X} and 𝐗(i,α)\mathbf{X}_{(i,\alpha)} differ only at the (i,α)(i,\alpha) entry, we have

(𝐇𝐇(i,α))βγ=(𝐗𝐗𝐗(i,α)𝐗(i,α))βγ={(𝐗iα𝐗iα)𝐗iγifβ=α,γα,(𝐗iα𝐗iα)𝐗iβifβα,γ=α,𝐗iα2(𝐗iα)2ifβ=α,γ=α,0otherwise.(\mathbf{H}-\mathbf{H}_{(i,\alpha)})_{\beta\gamma}=(\mathbf{X}^{\top}\mathbf{X}-\mathbf{X}_{(i,\alpha)}^{\top}\mathbf{X}_{(i,\alpha)})_{\beta\gamma}=\begin{cases}(\mathbf{X}_{i\alpha}-\mathbf{X}_{i\alpha}^{\prime})\mathbf{X}_{i\gamma}\ \ &\mbox{if}\ \beta=\alpha,\gamma\neq\alpha,\\ (\mathbf{X}_{i\alpha}-\mathbf{X}_{i\alpha}^{\prime})\mathbf{X}_{i\beta}\ \ &\mbox{if}\ \beta\neq\alpha,\gamma=\alpha,\\ \mathbf{X}_{i\alpha}^{2}-(\mathbf{X}_{i\alpha}^{\prime})^{2}\ \ &\mbox{if}\ \beta=\alpha,\gamma=\alpha,\\ 0\ \ &\mbox{otherwise}.\end{cases}

Thus, setting Δiα:=𝐗iα𝐗iα\Delta_{i\alpha}:=\mathbf{X}_{i\alpha}-\mathbf{X}_{i\alpha}^{\prime}, we have

𝐯1(i,α),(𝐇𝐇(i,α))𝐯1(i,α)\displaystyle\langle\mathbf{v}_{1}^{(i,\alpha)},(\mathbf{H}-\mathbf{H}_{(i,\alpha)})\mathbf{v}_{1}^{(i,\alpha)}\rangle (11)
=2𝐯1(i,α)(α)Δiα(β=1p(𝐗(i,α))iβ𝐯1(i,α)(β)𝐗iα𝐯1(i,α)(α))+(𝐯1(i,α)(α))2(𝐗iα2(𝐗iα)2)\displaystyle\quad=2\mathbf{v}_{1}^{(i,\alpha)}(\alpha)\Delta_{i\alpha}\left(\sum_{\beta=1}^{p}(\mathbf{X}_{(i,\alpha)})_{i\beta}\mathbf{v}_{1}^{(i,\alpha)}(\beta)-\mathbf{X}_{i\alpha}^{\prime}\mathbf{v}_{1}^{(i,\alpha)}(\alpha)\right)+\left(\mathbf{v}_{1}^{(i,\alpha)}(\alpha)\right)^{2}(\mathbf{X}_{i\alpha}^{2}-(\mathbf{X}_{i\alpha}^{\prime})^{2})
=2𝐯1(i,α)(α)Δiα(𝐗(i,α)𝐯1(i,α))(i)+(𝐯1(i,α)(α))2(𝐗iα2(𝐗iα)22(𝐗iα𝐗iα)𝐗iα)\displaystyle\quad=2\mathbf{v}_{1}^{(i,\alpha)}(\alpha)\Delta_{i\alpha}\left(\mathbf{X}_{(i,\alpha)}\mathbf{v}_{1}^{(i,\alpha)}\right)(i)+\left(\mathbf{v}_{1}^{(i,\alpha)}(\alpha)\right)^{2}\left(\mathbf{X}_{i\alpha}^{2}-(\mathbf{X}_{i\alpha}^{\prime})^{2}-2(\mathbf{X}_{i\alpha}-\mathbf{X}_{i\alpha}^{\prime})\mathbf{X}_{i\alpha}^{\prime}\right)
=2λ1(i,α)𝐯1(i,α)(α)𝐮1(i,α)(i)Δiα+(𝐯1(i,α)(α)2)Δiα2.\displaystyle\quad=2\sqrt{\lambda_{1}^{(i,\alpha)}}\mathbf{v}_{1}^{(i,\alpha)}(\alpha)\mathbf{u}_{1}^{(i,\alpha)}(i)\Delta_{i\alpha}+\left(\mathbf{v}_{1}^{(i,\alpha)}(\alpha)^{2}\right)\Delta_{i\alpha}^{2}.

This gives us

λ1λ1(i,α)2λ1(i,α)(|𝐗iα|+|𝐗iα|)𝐯1(i,α)𝐮1(i,α)(|𝐗iα|+|𝐗iα|)2𝐯1(i,α)2.\lambda_{1}\geq\lambda_{1}^{(i,\alpha)}-2\sqrt{\lambda_{1}^{(i,\alpha)}}\left(|\mathbf{X}_{i\alpha}|+|\mathbf{X}_{i\alpha}^{\prime}|\right)\|\mathbf{v}_{1}^{(i,\alpha)}\|_{\infty}\|\mathbf{u}_{1}^{(i,\alpha)}\|_{\infty}-\left(|\mathbf{X}_{i\alpha}|+|\mathbf{X}_{i\alpha}^{\prime}|\right)^{2}\|\mathbf{v}_{1}^{(i,\alpha)}\|_{\infty}^{2}. (12)

Similarly,

λ1(i,α)λ12λ1(|𝐗iα|+|𝐗iα|)𝐯1𝐮1(|𝐗iα|+|𝐗iα|)2𝐯12.\lambda_{1}^{(i,\alpha)}\geq\lambda_{1}-2\sqrt{\lambda_{1}}\left(|\mathbf{X}_{i\alpha}|+|\mathbf{X}_{i\alpha}^{\prime}|\right)\|\mathbf{v}_{1}\|_{\infty}\|\mathbf{u}_{1}\|_{\infty}-\left(|\mathbf{X}_{i\alpha}|+|\mathbf{X}_{i\alpha}^{\prime}|\right)^{2}\|\mathbf{v}_{1}\|_{\infty}^{2}. (13)

From the sub-exponential decay assumption (2), we know |𝐗iα|,|𝐗iα|n1/2+ε|\mathbf{X}_{i\alpha}|,|\mathbf{X}_{i\alpha}^{\prime}|\leq n^{-1/2+\varepsilon} with overwhelming probability for any ε>0\varepsilon>0. Moreover, by the delocalization of eigenvectors, with overwhelming probability, we have

max(𝐯1,𝐮1,𝐯1(i,α),𝐮1(i,α))n12+ε\max\left(\|\mathbf{v}_{1}\|_{\infty},\|\mathbf{u}_{1}\|_{\infty},\|\mathbf{v}_{1}^{(i,\alpha)}\|_{\infty},\|\mathbf{u}_{1}^{(i,\alpha)}\|_{\infty}\right)\leq n^{-\frac{1}{2}+\varepsilon}

Moreover, by the rigidity estimates (8), with overwhelming probability we have

|λ1λ+|n23+ε,|λ1(i,α)λ+|n23+ε|\lambda_{1}-\lambda_{+}|\leq n^{-\frac{2}{3}+\varepsilon},\ \ |\lambda_{1}^{(i,\alpha)}-\lambda_{+}|\leq n^{-\frac{2}{3}+\varepsilon}

Therefore, combining with a union bound, the above two inequalities (12) and (13) together yield

max1in,1αp|λ1λ1(i,α)|n3/2+3ε\max_{1\leq i\leq n,1\leq\alpha\leq p}|\lambda_{1}-\lambda_{1}^{(i,\alpha)}|\leq n^{-3/2+3\varepsilon} (14)

with overwhelming probability.

We write 𝐯1(i,α)\mathbf{v}_{1}^{(i,\alpha)} in the orthonormal basis of eigenvectors {𝐯β}\left\{\mathbf{v}_{\beta}\right\}:

𝐯1(i,α)=β=1paβ𝐯β.\mathbf{v}_{1}^{(i,\alpha)}=\sum_{\beta=1}^{p}a_{\beta}\mathbf{v}_{\beta}.

Using this representation and the spectral theorem,

β=1pλβaβ𝐯β=𝐇𝐯1(i,α)=(𝐇𝐇(i,α))𝐯1(i,α)+(λ1(i,α)λ1)𝐯1(i,α)+λ1𝐯1(i,α).\sum_{\beta=1}^{p}\lambda_{\beta}a_{\beta}\mathbf{v}_{\beta}=\mathbf{H}\mathbf{v}_{1}^{(i,\alpha)}=\left(\mathbf{H}-\mathbf{H}_{(i,\alpha)}\right)\mathbf{v}_{1}^{(i,\alpha)}+\left(\lambda_{1}^{(i,\alpha)}-\lambda_{1}\right)\mathbf{v}_{1}^{(i,\alpha)}+\lambda_{1}\mathbf{v}_{1}^{(i,\alpha)}.

As a consequence,

λ1𝐯1(i,α)=β=1pλβ𝐚β𝐯β+(𝐇(i,α)𝐇)𝐯1(i,α)+(λ1λ1(i,α))𝐯1(i,α).\lambda_{1}\mathbf{v}_{1}^{(i,\alpha)}=\sum_{\beta=1}^{p}\lambda_{\beta}\mathbf{a}_{\beta}\mathbf{v}_{\beta}+\left(\mathbf{H}_{(i,\alpha)}-\mathbf{H}\right)\mathbf{v}_{1}^{(i,\alpha)}+\left(\lambda_{1}-\lambda_{1}^{(i,\alpha)}\right)\mathbf{v}_{1}^{(i,\alpha)}.

For β1\beta\neq 1, taking inner product with 𝐯β\mathbf{v}_{\beta} yields

λ1aβ=𝐯β,λ1𝐯1(i,α)=λβaβ+𝐯β,(𝐇(i,α)𝐇)𝐯1(i,α)+(λ1λ1(i,α))aβ,\lambda_{1}a_{\beta}=\langle\mathbf{v}_{\beta},\lambda_{1}\mathbf{v}_{1}^{(i,\alpha)}\rangle=\lambda_{\beta}a_{\beta}+\langle\mathbf{v}_{\beta},(\mathbf{H}_{(i,\alpha)}-\mathbf{H})\mathbf{v}_{1}^{(i,\alpha)}\rangle+(\lambda_{1}-\lambda_{1}^{(i,\alpha)})a_{\beta},

which implies

((λ1λβ)+(λ1(i,α)λ1))aβ=𝐯β,(𝐇(i,α)𝐇)𝐯1(i,α).\left((\lambda_{1}-\lambda_{\beta})+(\lambda_{1}^{(i,\alpha)}-\lambda_{1})\right)a_{\beta}=\langle\mathbf{v}_{\beta},(\mathbf{H}_{(i,\alpha)}-\mathbf{H})\mathbf{v}_{1}^{(i,\alpha)}\rangle. (15)

By a similar computation as in (11), we have

|𝐯β,(𝐇(i,α)𝐇)𝐯1(i,α)|\displaystyle\left|{\langle\mathbf{v}_{\beta},(\mathbf{H}_{(i,\alpha)}-\mathbf{H})\mathbf{v}_{1}^{(i,\alpha)}\rangle}\right| =|Δiα(λ1(i,α)𝐯β(α)𝐮1(i,α)(i)+λβ𝐯1(i,α)(α)𝐮β(i))|\displaystyle=\left|{\Delta_{i\alpha}\left(\sqrt{\lambda_{1}^{(i,\alpha)}}\mathbf{v}_{\beta}(\alpha)\mathbf{u}_{1}^{(i,\alpha)}(i)+\sqrt{\lambda_{\beta}}\mathbf{v}_{1}^{(i,\alpha)}(\alpha)\mathbf{u}_{\beta}(i)\right)}\right| (16)
(|𝐗iα|+|𝐗iα|)(𝐯β𝐮1(i,α)+𝐯1(i,α)𝐮β)\displaystyle\lesssim\left(|\mathbf{X}_{i\alpha}|+|\mathbf{X}_{i\alpha}^{\prime}|\right)\left(\|\mathbf{v}_{\beta}\|_{\infty}\|\mathbf{u}_{1}^{(i,\alpha)}\|_{\infty}+\|\mathbf{v}_{1}^{(i,\alpha)}\|_{\infty}\|\mathbf{u}_{\beta}\|_{\infty}\right)
n32+3ε\displaystyle\leq n^{-\frac{3}{2}+3\varepsilon}

with overwhelming probability, where the second step follows from rigidity of eigenvalues and the last step follows from the sub-exponential decay assumption and delocalization of eigenvectors.

Consider the event :={λ1λ2n1c}{\mathcal{E}}:=\{\lambda_{1}-\lambda_{2}\geq n^{-1-c}\}. Fix some ω>0\omega>0 small. By rigidity of eigenvalues (8), on the event {\mathcal{E}}, with overwhelming probability

λ1λβ{n1cif 2βnω,β2/3n2/3ifnω<βp.\lambda_{1}-\lambda_{\beta}\gtrsim\begin{cases}n^{-1-c}&\mbox{if}\ \ 2\leq\beta\leq n^{\omega},\\ \beta^{2/3}n^{-2/3}&\mbox{if}\ \ n^{\omega}<\beta\leq p.\end{cases} (17)

On the event {\mathcal{E}}, using (14), (16) and (17), with overwhelming probability we have

|aβ|{n12+c+3εif 2βnω,β23n56+3εifnω<βp.|a_{\beta}|\lesssim\begin{cases}n^{-\frac{1}{2}+c+3\varepsilon}&\mbox{if}\ \ 2\leq\beta\leq n^{\omega},\\ \beta^{-\frac{2}{3}}n^{-\frac{5}{6}+3\varepsilon}&\mbox{if}\ \ n^{\omega}<\beta\leq p.\end{cases} (18)

Choose s=a1/|a1|s=a_{1}/|a_{1}|. Note that 1|a1|β=2p|aβ|1-|a_{1}|\leq\sum_{\beta=2}^{p}|a_{\beta}|. Thanks to the delocalization of eigenvectors, with overwhelming probability, we have

s𝐯1𝐯1(i,α)=(sa1)𝐯1β=2paβ𝐯β(1|a1|)𝐯1+β=2p|aβ|𝐯βn12+εβ=2p|aβ|.\|s\mathbf{v}_{1}-\mathbf{v}_{1}^{(i,\alpha)}\|_{\infty}=\bigg{\|}(s-a_{1})\mathbf{v}_{1}-\sum_{\beta=2}^{p}a_{\beta}\mathbf{v}_{\beta}\bigg{\|}_{\infty}\leq(1-|a_{1}|)\|\mathbf{v}_{1}\|_{\infty}+\sum_{\beta=2}^{p}|a_{\beta}|\|\mathbf{v}_{\beta}\|_{\infty}\lesssim n^{-\frac{1}{2}+\varepsilon}\sum_{\beta=2}^{p}|a_{\beta}|.

Thus, on the event {\mathcal{E}}, it follows from (18) that

s𝐯1𝐯1(i,α)\displaystyle\|s\mathbf{v}_{1}-\mathbf{v}_{1}^{(i,\alpha)}\|_{\infty} n12+ε(n12+3ε+c+ω+n56+3εβ=nωpβ23)\displaystyle\lesssim n^{-\frac{1}{2}+\varepsilon}\bigg{(}n^{-\frac{1}{2}+3\varepsilon+c+\omega}+n^{-\frac{5}{6}+3\varepsilon}\sum_{\beta=n^{\omega}}^{p}\beta^{-\frac{2}{3}}\bigg{)}
n1+4ε+c+ω+n1+4ε.\displaystyle\lesssim n^{-1+4\varepsilon+c+\omega}+n^{-1+4\varepsilon}.

Choosing ε\varepsilon and ω\omega small enough so that 4ε+c+ω<124\varepsilon+c+\omega<\frac{1}{2}, we conclude that (10) is true.

A similar bound for 𝐮\mathbf{u} can be shown by the same arguments for 𝐇^=𝐗𝐗\widehat{\mathbf{H}}=\mathbf{X}\mathbf{X}^{\top}. Hence, we have shown the desired results. ∎

C.2 Proof of Theorem 1

Now we are ready to prove the resampling sensitivity.

Let 𝐗′′n×p\mathbf{X}^{\prime\prime}\in\mathbb{R}^{n\times p} be a copy of 𝐗\mathbf{X} that is independent of 𝐗\mathbf{X} and 𝐗\mathbf{X}^{\prime}. For an arbitrary index (i,α)(i,\alpha) with 1in1\leq i\leq n and 1αp1\leq\alpha\leq p, we introduce another random matrix 𝐘(i,α)\mathbf{Y}_{(i,\alpha)} obtained from 𝐗\mathbf{X} by replacing the (i,α)(i,\alpha) entry 𝐗iα\mathbf{X}_{i\alpha} by 𝐗iα′′\mathbf{X}_{i\alpha}^{\prime\prime}. Similarly, we denote by 𝐘(i,α)[k]\mathbf{Y}^{[k]}_{(i,\alpha)} the matrix obtained via the analogous procedure from 𝐗[k]\mathbf{X}^{[k]}. For the matrix 𝐗[k]\mathbf{X}^{[k]}, the resampling is carried out in the following way: if (i,α)Sk(i,\alpha)\in S_{k}, then replace 𝐗iα[k]\mathbf{X}^{[k]}_{i\alpha} with 𝐗iα′′\mathbf{X}_{i\alpha}^{\prime\prime}; if (i,α)Sk(i,\alpha)\notin S_{k}, then replace 𝐗iα[k]\mathbf{X}^{[k]}_{i\alpha} with 𝐗iα′′′\mathbf{X}_{i\alpha}^{\prime\prime\prime}, where 𝐗′′′\mathbf{X}^{\prime\prime\prime} is another independent copy of 𝐗\mathbf{X}, 𝐗\mathbf{X}^{\prime} and 𝐗′′\mathbf{X}^{\prime\prime}.

In the following analysis, we choose an index (s,θ)(s,\theta) uniformly at random from the set of all pairs {(i,α):1in,1αp}\left\{(i,\alpha):1\leq i\leq n,1\leq\alpha\leq p\right\}. Let μ\mu be the top singular value of 𝐘(s,θ)\mathbf{Y}_{(s,\theta)} and use 𝐟n\mathbf{f}\in\mathbb{R}^{n} and 𝐠p\mathbf{g}\in\mathbb{R}^{p} to denote the normalized top left and right singular vectors of 𝐘(s,θ)\mathbf{Y}_{(s,\theta)}. Similarly, we define μ[k]\mu^{[k]}, 𝐟[k]\mathbf{f}^{[k]} and 𝐠[k]\mathbf{g}^{[k]} for 𝐘(s,θ)[k]\mathbf{Y}^{[k]}_{(s,\theta)}. We also denote by 𝐡\mathbf{h} and 𝐡[k]\mathbf{h}^{[k]} the concatenation of 𝐟,𝐠\mathbf{f},\mathbf{g} and 𝐟[k],𝐠[k]\mathbf{f}^{[k]},\mathbf{g}^{[k]}, respectively. Finally, let 𝐗~[k]\widetilde{\mathbf{X}}^{[k]}, 𝐘~\widetilde{\mathbf{Y}} and 𝐘~[k]\widetilde{\mathbf{Y}}^{[k]} be the symmetrization (5) of 𝐗[k]\mathbf{X}^{[k]}, 𝐘\mathbf{Y} and 𝐘[k]\mathbf{Y}^{[k]}, respectively. When the context is clear, we will omit the index (s,θ)(s,\theta) for the convenience of notations.
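The coupling just described can be made concrete in a few lines. The sketch below is purely illustrative: the Gaussian entries, their normalization, the matrix sizes and the helper top_triplet are assumptions of the sketch, not part of the proof.

import numpy as np

rng = np.random.default_rng(0)
n, p, k = 300, 200, 50

# Four independent copies of the data matrix (illustrative Gaussian entries).
X, Xp, Xpp, Xppp = [rng.normal(0.0, 1.0 / np.sqrt(n), (n, p)) for _ in range(4)]

# S_k: k entries chosen uniformly at random without replacement.
flat = rng.choice(n * p, size=k, replace=False)
rows, cols = np.unravel_index(flat, (n, p))
S_k = set(zip(rows.tolist(), cols.tolist()))

# X^[k]: resample the entries in S_k by the corresponding entries of X'.
X_k = X.copy()
X_k[rows, cols] = Xp[rows, cols]

# Uniform index (s, theta); Y replaces X_{s,theta} by X''_{s,theta}, while Y^[k]
# replaces X^[k]_{s,theta} by X'' (if (s, theta) is in S_k) or by X''' (otherwise).
s, theta = int(rng.integers(n)), int(rng.integers(p))
Y, Y_k = X.copy(), X_k.copy()
Y[s, theta] = Xpp[s, theta]
Y_k[s, theta] = Xpp[s, theta] if (s, theta) in S_k else Xppp[s, theta]

def top_triplet(M):
    # top singular value and the corresponding unit left/right singular vectors
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return S[0], U[:, 0], Vt[0]

sigma, u, v = top_triplet(X)
mu, f, g = top_triplet(Y)
sigma_k, u_k, v_k = top_triplet(X_k)
print(sigma - mu)                  # a single resampled entry barely moves sigma
print(abs(u @ u_k), abs(v @ v_k))  # overlaps of the principal components of X and X^[k]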

Step 1.

By Lemma B.1, we have

2𝖵𝖺𝗋(σ)knp+1np𝔼[(σμ)(σ[k]μ[k])].\frac{2\mathsf{Var}(\sigma)}{k}\cdot\frac{np+1}{np}\geq\mathbb{E}\left[(\sigma-\mu)\left(\sigma^{[k]}-\mu^{[k]}\right)\right]. (19)

Using the variational characterization of the top singular value

𝐟,𝐗𝐠σ=𝐮,𝐗𝐯,𝐮,𝐘𝐯μ=𝐟,𝐘𝐠.\langle\mathbf{f},\mathbf{X}\mathbf{g}\rangle\leq\sigma=\langle\mathbf{u},\mathbf{X}\mathbf{v}\rangle,\ \ \ \langle\mathbf{u},\mathbf{Y}\mathbf{v}\rangle\leq\mu=\langle\mathbf{f},\mathbf{Y}\mathbf{g}\rangle.

This implies

𝐟,(𝐗𝐘)𝐠σμ𝐮,(𝐗𝐘)𝐯.\langle\mathbf{f},(\mathbf{X}-\mathbf{Y})\mathbf{g}\rangle\leq\sigma-\mu\leq\langle\mathbf{u},(\mathbf{X}-\mathbf{Y})\mathbf{v}\rangle. (20)

Applying the same arguments to 𝐗[k]\mathbf{X}^{[k]} and 𝐘[k]\mathbf{Y}^{[k]}, we have

𝐟[k],(𝐗[k]𝐘[k])𝐠[k]σ[k]μ[k]𝐮[k],(𝐗[k]𝐘[k])𝐯[k].\left\langle\mathbf{f}^{[k]},\left(\mathbf{X}^{[k]}-\mathbf{Y}^{[k]}\right)\mathbf{g}^{[k]}\right\rangle\leq\sigma^{[k]}-\mu^{[k]}\leq\left\langle\mathbf{u}^{[k]},\left(\mathbf{X}^{[k]}-\mathbf{Y}^{[k]}\right)\mathbf{v}^{[k]}\right\rangle.

Since the matrices 𝐗\mathbf{X} and 𝐘\mathbf{Y} (and likewise 𝐗[k]\mathbf{X}^{[k]} and 𝐘[k]\mathbf{Y}^{[k]}) differ only at the (s,θ)(s,\theta) entry, for any vectors 𝐚n\mathbf{a}\in\mathbb{R}^{n} and 𝐛p\mathbf{b}\in\mathbb{R}^{p} we have

𝐚,(𝐗𝐘)𝐛=Δsθ𝐚(s)𝐛(θ),𝐚,(𝐗[k]𝐘[k])𝐛=Δsθ[k]𝐚(s)𝐛(θ),\langle\mathbf{a},(\mathbf{X}-\mathbf{Y})\mathbf{b}\rangle=\Delta_{s\theta}\,\mathbf{a}(s)\mathbf{b}(\theta),\ \ \left\langle\mathbf{a},\left(\mathbf{X}^{[k]}-\mathbf{Y}^{[k]}\right)\mathbf{b}\right\rangle=\Delta^{[k]}_{s\theta}\,\mathbf{a}(s)\mathbf{b}(\theta),

where

Δsθ:=𝐗sθ𝐗sθ′′,andΔsθ[k]:={𝐗sθ𝐗sθ′′if(s,θ)Sk,𝐗sθ𝐗sθ′′′if(s,θ)Sk.\Delta_{s\theta}:=\mathbf{X}_{s\theta}-\mathbf{X}_{s\theta}^{\prime\prime},\ \ \mbox{and}\ \ \Delta^{[k]}_{s\theta}:=\begin{cases}\mathbf{X}_{s\theta}^{\prime}-\mathbf{X}_{s\theta}^{\prime\prime}&\mbox{if}\ (s,\theta)\in S_{k},\\ \mathbf{X}_{s\theta}-\mathbf{X}_{s\theta}^{\prime\prime\prime}&\mbox{if}\ (s,\theta)\notin S_{k}.\end{cases}

Therefore,

Δsθ𝐟(s)𝐠(θ)σμΔsθ𝐮(s)𝐯(θ),Δsθ[k]𝐟[k](s)𝐠[k](θ)σ[k]μ[k]Δsθ[k]𝐮[k](s)𝐯[k](θ).\Delta_{s\theta}\,\mathbf{f}(s)\mathbf{g}(\theta)\leq\sigma-\mu\leq\Delta_{s\theta}\mathbf{u}(s)\mathbf{v}(\theta),\ \ \Delta^{[k]}_{s\theta}\,\mathbf{f}^{[k]}(s)\mathbf{g}^{[k]}(\theta)\leq\sigma^{[k]}-\mu^{[k]}\leq\Delta^{[k]}_{s\theta}\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta).

Consider

T1:=(Δsθ𝐮(s)𝐯(θ))(Δsθ[k]𝐮[k](s)𝐯[k](θ)),T2:=(Δsθ𝐟(s)𝐠(θ))(Δsθ[k]𝐟[k](s)𝐠[k](θ)),T_{1}:=\left(\Delta_{s\theta}\mathbf{u}(s)\mathbf{v}(\theta)\right)\left(\Delta^{[k]}_{s\theta}\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta)\right),\ \ T_{2}:=\left(\Delta_{s\theta}\,\mathbf{f}(s)\mathbf{g}(\theta)\right)\left(\Delta^{[k]}_{s\theta}\,\mathbf{f}^{[k]}(s)\mathbf{g}^{[k]}(\theta)\right),
T3:=(Δsθ𝐮(s)𝐯(θ))(Δsθ[k]𝐟[k](s)𝐠[k](θ)),T4:=(Δsθ𝐟(s)𝐠(θ))(Δsθ[k]𝐮[k](s)𝐯[k](θ)).T_{3}:=\left(\Delta_{s\theta}\mathbf{u}(s)\mathbf{v}(\theta)\right)\left(\Delta^{[k]}_{s\theta}\,\mathbf{f}^{[k]}(s)\mathbf{g}^{[k]}(\theta)\right),\ \ T_{4}:=\left(\Delta_{s\theta}\,\mathbf{f}(s)\mathbf{g}(\theta)\right)\left(\Delta^{[k]}_{s\theta}\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta)\right).

Then we have

min(T1,T2,T3,T4)(σμ)(σ[k]μ[k])max(T1,T2,T3,T4).\min(T_{1},T_{2},T_{3},T_{4})\leq(\sigma-\mu)\left(\sigma^{[k]}-\mu^{[k]}\right)\leq\max(T_{1},T_{2},T_{3},T_{4}). (21)

To estimate (21), we introduce the following events

1:={max(𝐯,𝐮,𝐟,𝐠,𝐯[k],𝐮[k],𝐟[k],𝐠[k])n12+ε},{\mathcal{E}}_{1}:=\left\{\max\left(\|\mathbf{v}\|_{\infty},\|\mathbf{u}\|_{\infty},\|\mathbf{f}\|_{\infty},\|\mathbf{g}\|_{\infty},\|\mathbf{v}^{[k]}\|_{\infty},\|\mathbf{u}^{[k]}\|_{\infty},\|\mathbf{f}^{[k]}\|_{\infty},\|\mathbf{g}^{[k]}\|_{\infty}\right)\leq n^{-\frac{1}{2}+\varepsilon}\right\}, (22)

and

2:={max(𝐯𝐠,𝐮𝐟,𝐯[k]𝐠[k],𝐮[k]𝐟[k])n12δ}.{\mathcal{E}}_{2}:=\left\{\max\left(\|\mathbf{v}-\mathbf{g}\|_{\infty},\|\mathbf{u}-\mathbf{f}\|_{\infty},\|\mathbf{v}^{[k]}-\mathbf{g}^{[k]}\|_{\infty},\|\mathbf{u}^{[k]}-\mathbf{f}^{[k]}\|_{\infty}\right)\leq n^{-\frac{1}{2}-\delta}\right\}. (23)

Define the event :=12{\mathcal{E}}:={\mathcal{E}}_{1}\cap{\mathcal{E}}_{2}. On the event {\mathcal{E}}, for all

J{𝐮(s)𝐯(θ)𝐮[k](s)𝐯[k](θ),𝐮(s)𝐯(θ)𝐟[k](s)𝐠[k](θ),𝐟(s)𝐠(θ)𝐮[k](s)𝐯[k](θ),𝐟(s)𝐠(θ)𝐟[k](s)𝐠[k](θ)}J\in\left\{\mathbf{u}(s)\mathbf{v}(\theta)\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta),\mathbf{u}(s)\mathbf{v}(\theta)\mathbf{f}^{[k]}(s)\mathbf{g}^{[k]}(\theta),\mathbf{f}(s)\mathbf{g}(\theta)\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta),\mathbf{f}(s)\mathbf{g}(\theta)\mathbf{f}^{[k]}(s)\mathbf{g}^{[k]}(\theta)\right\}

we have

|J𝐮(s)𝐯(θ)𝐮[k](s)𝐯[k](θ)|=O(n2δ+3ε).\left|{J-\mathbf{u}(s)\mathbf{v}(\theta)\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta)}\right|=O\left(n^{-2-\delta+3\varepsilon}\right). (24)

Let T:=min(T1,T2,T3,T4)T:=\min(T_{1},T_{2},T_{3},T_{4}). On the event {\mathcal{E}}, using (24) we have

T(ΔsθΔsθ[k])𝐮(s)𝐯(θ)𝐮[k](s)𝐯[k](θ)O(|ΔsθΔsθ[k]|n2δ+3ε).T\geq\left(\Delta_{s\theta}\Delta^{[k]}_{s\theta}\right)\mathbf{u}(s)\mathbf{v}(\theta)\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta)-O\left(\left|{\Delta_{s\theta}\Delta^{[k]}_{s\theta}}\right|n^{-2-\delta+3\varepsilon}\right). (25)

Step 2.

Next we claim that the contribution of TT when {\mathcal{E}} does not hold is negligible. Specifically, we have

𝔼[T𝟙c]=o(n3).\mathbb{E}\left[T\mathbbm{1}_{{\mathcal{E}}^{c}}\right]=o(n^{-3}). (26)

Since all four terms are estimated in the same way, it suffices to show that

𝔼[ΔsθΔsθ[k]𝐮(s)𝐯(θ)𝐮[k](s)𝐯[k](θ)𝟙c]=o(n3).\mathbb{E}\left[\Delta_{s\theta}\,\Delta^{[k]}_{s\theta}\mathbf{u}(s)\mathbf{v}(\theta)\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta)\mathbbm{1}_{{\mathcal{E}}^{c}}\right]=o(n^{-3}). (27)

To see this, using 𝟙c=𝟙1\+𝟙1c\mathbbm{1}_{{\mathcal{E}}^{c}}=\mathbbm{1}_{{\mathcal{E}}_{1}\backslash{\mathcal{E}}}+\mathbbm{1}_{{\mathcal{E}}_{1}^{c}}, we decompose the expectation into two parts

𝔼[ΔsθΔsθ[k]𝐮(s)𝐯(θ)𝐮[k](s)𝐯[k](θ)𝟙c]=I1+I2,\mathbb{E}\left[\Delta_{s\theta}\,\Delta^{[k]}_{s\theta}\mathbf{u}(s)\mathbf{v}(\theta)\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta)\mathbbm{1}_{{\mathcal{E}}^{c}}\right]=I_{1}+I_{2},

where

I1:=𝔼[ΔsθΔsθ[k]𝐮(s)𝐯(θ)𝐮[k](s)𝐯[k](θ)𝟙1\],I2:=𝔼[ΔsθΔsθ[k]𝐮(s)𝐯(θ)𝐮[k](s)𝐯[k](θ)𝟙1c].I_{1}:=\mathbb{E}\left[\Delta_{s\theta}\,\Delta^{[k]}_{s\theta}\mathbf{u}(s)\mathbf{v}(\theta)\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta)\mathbbm{1}_{{\mathcal{E}}_{1}\backslash{\mathcal{E}}}\right],\ \ I_{2}:=\mathbb{E}\left[\Delta_{s\theta}\,\Delta^{[k]}_{s\theta}\mathbf{u}(s)\mathbf{v}(\theta)\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta)\mathbbm{1}_{{\mathcal{E}}_{1}^{c}}\right].

For the first term I1I_{1}, by delocalization and the relation 1\=1\2{\mathcal{E}}_{1}\backslash{\mathcal{E}}={\mathcal{E}}_{1}\backslash{\mathcal{E}}_{2}, we have

|I1|n2+4ε𝔼[|ΔsθΔsθ[k]|𝟙1\2]n2+4ε𝔼[|ΔsθΔsθ[k]|𝟙2c].|I_{1}|\leq n^{-2+4\varepsilon}\mathbb{E}\left[\left|{\Delta_{s\theta}\,\Delta^{[k]}_{s\theta}}\right|\mathbbm{1}_{{\mathcal{E}}_{1}\backslash{\mathcal{E}}_{2}}\right]\leq n^{-2+4\varepsilon}\mathbb{E}\left[\left|{\Delta_{s\theta}\,\Delta^{[k]}_{s\theta}}\right|\mathbbm{1}_{{\mathcal{E}}_{2}^{c}}\right]. (28)

Note that the random variable ΔsθΔsθ[k]\Delta_{s\theta}\,\Delta^{[k]}_{s\theta} and the event 2{\mathcal{E}}_{2} are dependent. To decouple these variables, we introduce a new event. Consider the event 3:=𝒜{\mathcal{E}}_{3}:={\mathcal{A}}\cup{\mathcal{B}}, where

𝒜:={min(σ1σ2,σ1[k]σ2[k])n1c},:={min(μ1μ2,μ1[k]μ2[k])n1c}{\mathcal{A}}:=\left\{\min\left(\sigma_{1}-\sigma_{2},\sigma^{[k]}_{1}-\sigma^{[k]}_{2}\right)\geq n^{-1-c}\right\},\ \ {\mathcal{B}}:=\left\{\min\left(\mu_{1}-\mu_{2},\mu^{[k]}_{1}-\mu^{[k]}_{2}\right)\geq n^{-1-c}\right\}

Then,

𝔼[|ΔsθΔsθ[k]|𝟙3c]\displaystyle\mathbb{E}\left[\left|{\Delta_{s\theta}\,\Delta^{[k]}_{s\theta}}\right|\mathbbm{1}_{{\mathcal{E}}_{3}^{c}}\right] 𝔼[(Δsθ2+(Δsθ[k])2)𝟙3c]\displaystyle\lesssim\mathbb{E}\left[\left(\Delta_{s\theta}^{2}+(\Delta^{[k]}_{s\theta})^{2}\right)\mathbbm{1}_{{\mathcal{E}}_{3}^{c}}\right]
𝔼[(𝐗sθ2+(𝐗sθ)2+(𝐗sθ′′)2+(𝐗sθ′′′)2)𝟙3c]\displaystyle\lesssim\mathbb{E}\left[\left(\mathbf{X}_{s\theta}^{2}+(\mathbf{X}_{s\theta}^{\prime})^{2}+(\mathbf{X}_{s\theta}^{\prime\prime})^{2}+(\mathbf{X}_{s\theta}^{\prime\prime\prime})^{2}\right)\mathbbm{1}_{{\mathcal{E}}_{3}^{c}}\right]
𝔼[(𝐗sθ2+(𝐗sθ)2)𝟙c]+𝔼[((𝐗sθ′′)2+(𝐗sθ′′′)2)𝟙𝒜c].\displaystyle\lesssim\mathbb{E}\left[\left(\mathbf{X}_{s\theta}^{2}+(\mathbf{X}_{s\theta}^{\prime})^{2}\right)\mathbbm{1}_{{\mathcal{B}}^{c}}\right]+\mathbb{E}\left[\left((\mathbf{X}_{s\theta}^{\prime\prime})^{2}+(\mathbf{X}_{s\theta}^{\prime\prime\prime})^{2}\right)\mathbbm{1}_{{\mathcal{A}}^{c}}\right].

Observe that the random variables 𝐗sθ\mathbf{X}_{s\theta} and 𝐗sθ\mathbf{X}_{s\theta}^{\prime} are independent of the event {\mathcal{B}}, and the random variables 𝐗sθ′′\mathbf{X}_{s\theta}^{\prime\prime} and 𝐗sθ′′′\mathbf{X}_{s\theta}^{\prime\prime\prime} are independent of 𝒜{\mathcal{A}}. Therefore, by Lemma B.3,

𝔼[(𝐗sθ2+(𝐗sθ)2)𝟙c]=O(n1κ),𝔼[((𝐗sθ′′)2+(𝐗sθ′′′)2)𝟙𝒜c]=O(n1κ).\mathbb{E}\left[\left(\mathbf{X}_{s\theta}^{2}+(\mathbf{X}_{s\theta}^{\prime})^{2}\right)\mathbbm{1}_{{\mathcal{B}}^{c}}\right]=O(n^{-1-\kappa}),\ \ \ \mathbb{E}\left[\left((\mathbf{X}_{s\theta}^{\prime\prime})^{2}+(\mathbf{X}_{s\theta}^{\prime\prime\prime})^{2}\right)\mathbbm{1}_{{\mathcal{A}}^{c}}\right]=O(n^{-1-\kappa}).

By Lemma C.1, we have (3\2)=O(ND)\mathbb{P}({\mathcal{E}}_{3}\backslash{\mathcal{E}}_{2})=O(N^{-D}) for any fixed large D>0D>0. Using the Cauchy-Schwarz inequality, we have

𝔼[|ΔsθΔsθ[k]|𝟙2c]\displaystyle\mathbb{E}\left[\left|{\Delta_{s\theta}\,\Delta^{[k]}_{s\theta}}\right|\mathbbm{1}_{{\mathcal{E}}_{2}^{c}}\right] =𝔼[|ΔsθΔsθ[k]|𝟙3c]+𝔼[|ΔsθΔsθ[k]|𝟙3\2]\displaystyle=\mathbb{E}\left[\left|{\Delta_{s\theta}\,\Delta^{[k]}_{s\theta}}\right|\mathbbm{1}_{{\mathcal{E}}_{3}^{c}}\right]+\mathbb{E}\left[\left|{\Delta_{s\theta}\,\Delta^{[k]}_{s\theta}}\right|\mathbbm{1}_{{\mathcal{E}}_{3}\backslash{\mathcal{E}}_{2}}\right]
=O(n1κ)+𝔼[|ΔsθΔsθ[k]|2](3\2)\displaystyle=O(n^{-1-\kappa})+\sqrt{\mathbb{E}\left[\left|{\Delta_{s\theta}\,\Delta^{[k]}_{s\theta}}\right|^{2}\right]}\sqrt{\mathbb{P}({\mathcal{E}}_{3}\backslash{\mathcal{E}}_{2})}
=O(n1κ)+O(ND)\displaystyle=O(n^{-1-\kappa})+O(N^{-D})
=O(n1κ).\displaystyle=O(n^{-1-\kappa}).

Choosing 4ε<κ4\varepsilon<\kappa, we deduce from (28) that

|I1|O(n2+4ε1κ)=o(n3).|I_{1}|\leq O(n^{-2+4\varepsilon-1-\kappa})=o(n^{-3}).

For the term I2I_{2}, note that 𝐮\mathbf{u}, 𝐯\mathbf{v}, 𝐮[k]\mathbf{u}^{[k]} and 𝐯[k]\mathbf{v}^{[k]} are unit vectors. We have that

max(𝐮,𝐯,𝐮[k],𝐯[k])1.\max(\|\mathbf{u}\|_{\infty},\|\mathbf{v}\|_{\infty},\|\mathbf{u}^{[k]}\|_{\infty},\|\mathbf{v}^{[k]}\|_{\infty})\leq 1.

Recall that 1{\mathcal{E}}_{1} holds with overwhelming probability. By the Cauchy-Schwarz inequality, for any large D>0D>0, we have

|I2|𝔼[|ΔsθΔsθ[k]|𝟙1c]𝔼[|ΔsθΔsθ[k]|2](1c)=O(ND).|I_{2}|\leq\mathbb{E}\left[\left|{\Delta_{s\theta}\,\Delta^{[k]}_{s\theta}}\right|\mathbbm{1}_{{\mathcal{E}}_{1}^{c}}\right]\leq\sqrt{\mathbb{E}\left[\left|{\Delta_{s\theta}\,\Delta^{[k]}_{s\theta}}\right|^{2}\right]}\sqrt{\mathbb{P}({\mathcal{E}}_{1}^{c})}=O(N^{-D}).

Hence we have shown the desired claim (27).

Step 3.

Combining (21), (25) and (26), we obtain

𝔼[(σμ)(σ[k]μ[k])]𝔼[ΔsθΔsθ[k]𝐮(s)𝐯(θ)𝐮[k](s)𝐯[k](θ)]+o(n3).\mathbb{E}\left[(\sigma-\mu)\left(\sigma^{[k]}-\mu^{[k]}\right)\right]\geq\mathbb{E}\left[\Delta_{s\theta}\Delta^{[k]}_{s\theta}\,\mathbf{u}(s)\mathbf{v}(\theta)\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta)\right]+o(n^{-3}).

Since np+1np2\frac{np+1}{np}\leq 2, by (19) we have

𝔼[ΔsθΔsθ[k]𝐮(s)𝐯(θ)𝐮[k](s)𝐯[k](θ)]4𝖵𝖺𝗋(σ)k+o(n3).\mathbb{E}\left[\Delta_{s\theta}\Delta^{[k]}_{s\theta}\,\mathbf{u}(s)\mathbf{v}(\theta)\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta)\right]\leq\frac{4\mathsf{Var}(\sigma)}{k}+o(n^{-3}). (29)

Since the random index (s,θ)(s,\theta) is uniformly sampled, we have

𝔼[ΔsθΔsθ[k]𝐮(s)𝐯(θ)𝐮[k](s)𝐯[k](θ)]=1np𝔼[1in,1αpΔiαΔiα[k]𝐮(i)𝐯(α)𝐮[k](i)𝐯[k](α)].\mathbb{E}\left[\Delta_{s\theta}\Delta^{[k]}_{s\theta}\,\mathbf{u}(s)\mathbf{v}(\theta)\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta)\right]=\frac{1}{np}\mathbb{E}\left[\sum_{1\leq i\leq n,1\leq\alpha\leq p}\Delta_{i\alpha}\Delta^{[k]}_{i\alpha}\,\mathbf{u}(i)\mathbf{v}(\alpha)\mathbf{u}^{[k]}(i)\mathbf{v}^{[k]}(\alpha)\right]. (30)

Note that

ΔiαΔiα[k]={(𝐗iα𝐗iα′′)(𝐗iα𝐗iα′′)if(i,α)Sk,(𝐗iα𝐗iα′′)(𝐗iα𝐗iα′′′)if(i,α)Sk.\Delta_{i\alpha}\Delta^{[k]}_{i\alpha}=\begin{cases}(\mathbf{X}_{i\alpha}-\mathbf{X}_{i\alpha}^{\prime\prime})(\mathbf{X}_{i\alpha}^{\prime}-\mathbf{X}_{i\alpha}^{\prime\prime})&\mbox{if}\ (i,\alpha)\in S_{k},\\ (\mathbf{X}_{i\alpha}-\mathbf{X}_{i\alpha}^{\prime\prime})(\mathbf{X}_{i\alpha}-\mathbf{X}_{i\alpha}^{\prime\prime\prime})&\mbox{if}\ (i,\alpha)\notin S_{k}.\end{cases}

In either case, we have 𝔼[ΔiαΔiα[k]]=p1\mathbb{E}[\Delta_{i\alpha}\Delta^{[k]}_{i\alpha}]=p^{-1}. Therefore,

1in,1αp𝔼[ΔiαΔiα[k]|Sk]𝐮(i)𝐯(α)𝐮[k](i)𝐯[k](α)=1p𝐯,𝐯[k]𝐮,𝐮[k].\sum_{1\leq i\leq n,1\leq\alpha\leq p}\mathbb{E}\left[\Delta_{i\alpha}\Delta^{[k]}_{i\alpha}\left|\right.S_{k}\right]\mathbf{u}(i)\mathbf{v}(\alpha)\mathbf{u}^{[k]}(i)\mathbf{v}^{[k]}(\alpha)=\frac{1}{p}\langle\mathbf{v},\mathbf{v}^{[k]}\rangle\langle\mathbf{u},\mathbf{u}^{[k]}\rangle.

Consequently,

𝔼[1in,1αp𝔼[ΔiαΔiα[k]|Sk]𝐮(i)𝐯(α)𝐮[k](i)𝐯[k](α)]=1p𝔼[𝐯,𝐯[k]𝐮,𝐮[k]].\mathbb{E}\left[\sum_{1\leq i\leq n,1\leq\alpha\leq p}\mathbb{E}\left[\Delta_{i\alpha}\Delta^{[k]}_{i\alpha}\left.\right|S_{k}\right]\mathbf{u}(i)\mathbf{v}(\alpha)\mathbf{u}^{[k]}(i)\mathbf{v}^{[k]}(\alpha)\right]=\frac{1}{p}\mathbb{E}\left[\langle\mathbf{v},\mathbf{v}^{[k]}\rangle\langle\mathbf{u},\mathbf{u}^{[k]}\rangle\right]. (31)

Moreover, we claim that

𝔼[1in,1αp(ΔiαΔiα[k]𝔼[ΔiαΔiα[k]|Sk])𝐮(i)𝐯(α)𝐮[k](i)𝐯[k](α)]=o(n1).\mathbb{E}\left[\sum_{1\leq i\leq n,1\leq\alpha\leq p}\left(\Delta_{i\alpha}\Delta^{[k]}_{i\alpha}-\mathbb{E}\left[\Delta_{i\alpha}\Delta^{[k]}_{i\alpha}\left.\right|S_{k}\right]\right)\mathbf{u}(i)\mathbf{v}(\alpha)\mathbf{u}^{[k]}(i)\mathbf{v}^{[k]}(\alpha)\right]=o(n^{-1}). (32)

For the ease of notations, we set Ξiα:=ΔiαΔiα[k]𝔼[ΔiαΔiα[k]|Sk]\Xi_{i\alpha}:=\Delta_{i\alpha}\Delta^{[k]}_{i\alpha}-\mathbb{E}[\Delta_{i\alpha}\Delta^{[k]}_{i\alpha}|S_{k}]. It suffices to show that for all pairs (i,α)(i,\alpha) we have

𝔼[Ξiα𝐮(i)𝐯(α)𝐮[k](i)𝐯[k](α)]=o(n3).\mathbb{E}\left[\Xi_{i\alpha}\mathbf{u}(i)\mathbf{v}(\alpha)\mathbf{u}^{[k]}(i)\mathbf{v}^{[k]}(\alpha)\right]=o(n^{-3}). (33)

To see this, we introduce another copy of 𝐗\mathbf{X}, denoted by 𝐗′′′′\mathbf{X}^{\prime\prime\prime\prime}, which is independent of all previous random variables (𝐗,𝐗,𝐗′′,𝐗′′′)(\mathbf{X},\mathbf{X}^{\prime},\mathbf{X}^{\prime\prime},\mathbf{X}^{\prime\prime\prime}). For an arbitrarily fixed index (i,α)(i,\alpha), we define matrices 𝐗^(i,α)\widehat{\mathbf{X}}_{(i,\alpha)} and 𝐗^(i,α)[k]\widehat{\mathbf{X}}^{[k]}_{(i,\alpha)} by resampling the (i,α)(i,\alpha) entry of 𝐗\mathbf{X} and 𝐗[k]\mathbf{X}^{[k]} with 𝐗iα′′′′\mathbf{X}_{i\alpha}^{\prime\prime\prime\prime}. Let 𝐮^\widehat{\mathbf{u}}, 𝐯^\widehat{\mathbf{v}} be the left and right top singular vector of 𝐗^\widehat{\mathbf{X}}, and similarly 𝐮^[k]\widehat{\mathbf{u}}^{[k]}, 𝐯^[k]\widehat{\mathbf{v}}^{[k]} for 𝐗^[k]\widehat{\mathbf{X}}^{[k]}. Define

ψiα:=𝐮(i)𝐯(α)𝐮[k](i)𝐯[k](α),ψ^iα:=𝐮^(i)𝐯^(α)𝐮^[k](i)𝐯^[k](α).\psi_{i\alpha}:=\mathbf{u}(i)\mathbf{v}(\alpha)\mathbf{u}^{[k]}(i)\mathbf{v}^{[k]}(\alpha),\ \ \widehat{\psi}_{i\alpha}:=\widehat{\mathbf{u}}(i)\widehat{\mathbf{v}}(\alpha)\widehat{\mathbf{u}}^{[k]}(i)\widehat{\mathbf{v}}^{[k]}(\alpha).

A crucial observation is that Ξiα\Xi_{i\alpha} and ψ^iα\widehat{\psi}_{i\alpha} are independent. This is because, conditioned on SkS_{k}, the matrices 𝐗^\widehat{\mathbf{X}} and 𝐗^[k]\widehat{\mathbf{X}}^{[k]} are independent of (𝐗iα,𝐗iα,𝐗iα′′,𝐗iα′′′)(\mathbf{X}_{i\alpha},\mathbf{X}_{i\alpha}^{\prime},\mathbf{X}_{i\alpha}^{\prime\prime},\mathbf{X}_{i\alpha}^{\prime\prime\prime}). Such a conditional independence is also true for the singular vectors, and hence also holds for ψ^iα\widehat{\psi}_{i\alpha}. On the other hand, by definition, the variable Ξiα\Xi_{i\alpha} only depends on (𝐗iα,𝐗iα,𝐗iα′′,𝐗iα′′′)(\mathbf{X}_{i\alpha},\mathbf{X}_{i\alpha}^{\prime},\mathbf{X}_{i\alpha}^{\prime\prime},\mathbf{X}_{i\alpha}^{\prime\prime\prime}). Therefore,

𝔼[Ξiαψ^iα]=𝔼[𝔼[Ξiα|Sk]𝔼[ψ^iα|Sk]]=0\mathbb{E}\left[\Xi_{i\alpha}\widehat{\psi}_{i\alpha}\right]=\mathbb{E}\left[\mathbb{E}[\Xi_{i\alpha}|S_{k}]\,\mathbb{E}[\widehat{\psi}_{i\alpha}|S_{k}]\right]=0

Thus, we reduce (33) to showing

𝔼[Ξiα(ψiαψ^iα)]=o(n3).\mathbb{E}\left[\Xi_{i\alpha}\left(\psi_{i\alpha}-\widehat{\psi}_{i\alpha}\right)\right]=o(n^{-3}). (34)

The proof of (34) is similar to the previous arguments. Consider the events

^1:={max(𝐯,𝐮,𝐮^,𝐯^,𝐯[k],𝐮[k],𝐮^[k],𝐯^[k])n12+ε},\widehat{{\mathcal{E}}}_{1}:=\left\{\max\left(\|\mathbf{v}\|_{\infty},\|\mathbf{u}\|_{\infty},\|\widehat{\mathbf{u}}\|_{\infty},\|\widehat{\mathbf{v}}\|_{\infty},\|\mathbf{v}^{[k]}\|_{\infty},\|\mathbf{u}^{[k]}\|_{\infty},\|\widehat{\mathbf{u}}^{[k]}\|_{\infty},\|\widehat{\mathbf{v}}^{[k]}\|_{\infty}\right)\leq n^{-\frac{1}{2}+\varepsilon}\right\},
^2:={max(𝐯𝐯^,𝐮𝐮^,𝐯[k]𝐯^[k],𝐮[k]𝐮^[k])n12δ}.\widehat{{\mathcal{E}}}_{2}:=\left\{\max\left(\|\mathbf{v}-\widehat{\mathbf{v}}\|_{\infty},\|\mathbf{u}-\widehat{\mathbf{u}}\|_{\infty},\|\mathbf{v}^{[k]}-\widehat{\mathbf{v}}^{[k]}\|_{\infty},\|\mathbf{u}^{[k]}-\widehat{\mathbf{u}}^{[k]}\|_{\infty}\right)\leq n^{-\frac{1}{2}-\delta}\right\}.

On the event ^:=^1^2\widehat{{\mathcal{E}}}:=\widehat{{\mathcal{E}}}_{1}\cap\widehat{{\mathcal{E}}}_{2}, we have |ψiαψ^iα|=O(n2δ+3ε)|\psi_{i\alpha}-\widehat{\psi}_{i\alpha}|=O(n^{-2-\delta+3\varepsilon}). Note that 𝔼[|Ξiα|]=O(n1)\mathbb{E}[|\Xi_{i\alpha}|]=O(n^{-1}) since 𝔼[|ΔiαΔiα[k]|]=O(n1)\mathbb{E}[|\Delta_{i\alpha}\Delta^{[k]}_{i\alpha}|]=O(n^{-1}). As a consequence,

𝔼[|Ξiα(ψiαψ^iα)|𝟙^]=O(n3δ+3ε)=o(n3).\mathbb{E}\left[\left|{\Xi_{i\alpha}(\psi_{i\alpha}-\widehat{\psi}_{i\alpha})}\right|\mathbbm{1}_{\widehat{{\mathcal{E}}}}\right]=O(n^{-3-\delta+3\varepsilon})=o(n^{-3}). (35)

Using the same argument as in (27), we have

𝔼[|Ξiα(ψiαψ^iα)|𝟙^c]N2+4ε𝔼[|Ξiα|𝟙^c]=O(N2+4εN1κ)=o(n3),\mathbb{E}\left[\left|{\Xi_{i\alpha}(\psi_{i\alpha}-\widehat{\psi}_{i\alpha})}\right|\mathbbm{1}_{\widehat{{\mathcal{E}}}^{c}}\right]\lesssim N^{-2+4\varepsilon}\mathbb{E}\left[|\Xi_{i\alpha}|\mathbbm{1}_{\widehat{{\mathcal{E}}}^{c}}\right]=O(N^{-2+4\varepsilon}N^{-1-\kappa})=o(n^{-3}), (36)

where κ\kappa is the constant in the gap property (Lemma B.3). Thus, by (35) and (36), we have shown the desired claim (34).

Based on (29) and (30), combining (31) and (32) yields

1np2𝔼[𝐯,𝐯[k]𝐮,𝐮[k]]4𝖵𝖺𝗋(σ)k+o(1n3)+o(1n2p)\frac{1}{np^{2}}\mathbb{E}\left[\langle\mathbf{v},\mathbf{v}^{[k]}\rangle\langle\mathbf{u},\mathbf{u}^{[k]}\rangle\right]\leq\frac{4\mathsf{Var}(\sigma)}{k}+o\left(\frac{1}{n^{3}}\right)+o\left(\frac{1}{n^{2}p}\right)

By Lemma B.2 and the assumption kn5/3k\gg n^{5/3}, we have

𝔼[𝐯,𝐯[k]𝐮,𝐮[k]]np2kO(n4/3)+o(1)=o(1).\mathbb{E}\left[\langle\mathbf{v},\mathbf{v}^{[k]}\rangle\langle\mathbf{u},\mathbf{u}^{[k]}\rangle\right]\leq\frac{np^{2}}{k}O(n^{-4/3})+o(1)=o(1).

This implies

𝔼[|𝐯,𝐯[k]𝐮,𝐮[k]|]0.\mathbb{E}\left[|\langle\mathbf{v},\mathbf{v}^{[k]}\rangle\langle\mathbf{u},\mathbf{u}^{[k]}\rangle|\right]\to 0. (37)

Step 4.

Consider the symmetrization matrix 𝐗~\widetilde{\mathbf{X}} defined in (5). The variational representation of the top eigenvalue yields

σ=𝐰,𝐗~𝐰𝐰22,σ[k]=𝐰[k],𝐗~[k]𝐰[k]𝐰[k]22with𝐰22=𝐰[k]22=2.\sigma=\frac{\langle\mathbf{w},\widetilde{\mathbf{X}}\mathbf{w}\rangle}{\|\mathbf{w}\|_{2}^{2}},\ \ \sigma^{[k]}=\frac{\langle\mathbf{w}^{[k]},\widetilde{\mathbf{X}}^{[k]}\mathbf{w}^{[k]}\rangle}{\|\mathbf{w}^{[k]}\|_{2}^{2}}\ \ \mbox{with}\ \|\mathbf{w}\|_{2}^{2}=\|\mathbf{w}^{[k]}\|_{2}^{2}=2.
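As a sanity check of this variational step, the following sketch verifies numerically that the top eigenvalue of the symmetrization equals the top singular value σ of 𝐗, and that with the normalization ‖w‖² = 2 the corresponding eigenvector w is the concatenation of u and v, so that ⟨w, w^{[k]}⟩ = ⟨u, u^{[k]}⟩ + ⟨v, v^{[k]}⟩. It assumes, as the identities used here suggest, that the symmetrization (5) is the block matrix with X and X^⊤ in the off-diagonal blocks; the Gaussian entries and sizes are illustrative.

import numpy as np

rng = np.random.default_rng(1)
n, p = 300, 200
X = rng.normal(0.0, 1.0 / np.sqrt(n), (n, p))

# assumed form of the symmetrization (5): off-diagonal blocks X and X^T
X_tilde = np.block([[np.zeros((n, n)), X], [X.T, np.zeros((p, p))]])
eigvals, eigvecs = np.linalg.eigh(X_tilde)
top_eig, top_vec = eigvals[-1], eigvecs[:, -1]

U, S, Vt = np.linalg.svd(X, full_matrices=False)
sigma, u, v = S[0], U[:, 0], Vt[0]

print(abs(top_eig - sigma))   # top eigenvalue of the symmetrization = top singular value of X
# with ||w||^2 = 2, w is the concatenation of u and v (up to a global sign)
w = np.sqrt(2.0) * top_vec * np.sign(top_vec[:n] @ u)
print(np.linalg.norm(w[:n] - u), np.linalg.norm(w[n:] - v))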

Using the same arguments as in Step 1-3, we can conclude that

𝔼[|𝐰,𝐰[k]|2]=𝔼[|𝐯,𝐯[k]+𝐮,𝐮[k]|2]0\mathbb{E}\left[|\langle\mathbf{w},\mathbf{w}^{[k]}\rangle|^{2}\right]=\mathbb{E}\left[|\langle\mathbf{v},\mathbf{v}^{[k]}\rangle+\langle\mathbf{u},\mathbf{u}^{[k]}\rangle|^{2}\right]\to 0

Combined with (37), this gives us

𝔼[|𝐯,𝐯[k]|2+|𝐮,𝐮[k]|2]0,\mathbb{E}\left[|\langle\mathbf{v},\mathbf{v}^{[k]}\rangle|^{2}+|\langle\mathbf{u},\mathbf{u}^{[k]}\rangle|^{2}\right]\to 0,

which proves the desired results.

Appendix D Proofs for the Stability Regime

Throughout the whole section, we will focus on the behaviour of 𝐯\mathbf{v} and 𝐯[k]\mathbf{v}^{[k]}. Similar results also hold for 𝐮\mathbf{u} and 𝐮[k]\mathbf{u}^{[k]} via the same arguments.

D.1 Linearization and local law of resolvent

As mentioned in the introduction, in certain cases it would be more convenient to consider the symmetrization 𝐗~\widetilde{\mathbf{X}} of the matrix 𝐗\mathbf{X} (defined as in (5)) when studying its spectral properties. For zz\in\mathbb{C} with Imz>0\text{\rm Im}{\,}z>0, we introduce the resolvent of this symmetrization

𝐑(z):=(𝐈n𝐗𝐗z𝐈p)1.\mathbf{R}(z):=\left(\begin{matrix}-\mathbf{I}_{n}&\mathbf{X}\\ \mathbf{X}^{\top}&-z\mathbf{I}_{p}\end{matrix}\right)^{-1}. (38)

Note that 𝐑(z)\mathbf{R}(z) is not the conventional definition of the resolvent matrix, but we still call it resolvent for convenience. For the ease of notations, we will relabel the indices in 𝐑\mathbf{R} in the following way:

Definition 2 (Index sets).

We define the index sets

1:={1,,n},2:={1,,p},:=1{n+α:α2}.{\mathcal{I}}_{1}:=\left\{1,\dots,n\right\},\ \ {\mathcal{I}}_{2}:=\left\{1,\dots,p\right\},\ \ {\mathcal{I}}:={\mathcal{I}}_{1}\cup\left\{n+\alpha:\alpha\in{\mathcal{I}}_{2}\right\}.

For a general matrix 𝐌||×||\mathbf{M}\in\mathbb{R}^{|{\mathcal{I}}|\times|{\mathcal{I}}|}, we label the indices of the matrix elements in the following way: for a,ba,b\in{\mathcal{I}}, if 1a,bn1\leq a,b\leq n, then typically we use Latin letters i,ji,j to represent them; if n+1a,bn+pn+1\leq a,b\leq n+p, we use the corresponding Greek letters α=an\alpha=a-n and β=bn\beta=b-n to represent them.

The resolvent 𝐑\mathbf{R} is closely related to the eigenvalues and eigenvectors of the covariance matrix. As discussed in [DY18, Equations (3.9) and (3.10)], we have

𝐑ij(z)==1nz𝐮(i)𝐮(j)λz,𝐑αβ(z)==1p𝐯(α)𝐯(β)λz,\mathbf{R}_{ij}(z)=\sum_{\ell=1}^{n}\frac{z\mathbf{u}_{\ell}(i)\mathbf{u}_{\ell}(j)}{\lambda_{\ell}-z},\ \ \mathbf{R}_{\alpha\beta}(z)=\sum_{\ell=1}^{p}\frac{\mathbf{v}_{\ell}(\alpha)\mathbf{v}_{\ell}(\beta)}{\lambda_{\ell}-z}, (39)

and

𝐑iα(z)==1pλ𝐮(i)𝐯(α)λz,𝐑αi(z)==1pλ𝐯(α)𝐮(i)λz.\mathbf{R}_{i\alpha}(z)=\sum_{\ell=1}^{p}\frac{\sqrt{\lambda_{\ell}}\mathbf{u}_{\ell}(i)\mathbf{v}_{\ell}(\alpha)}{\lambda_{\ell}-z},\ \ \mathbf{R}_{\alpha i}(z)=\sum_{\ell=1}^{p}\frac{\sqrt{\lambda_{\ell}}\mathbf{v}_{\ell}(\alpha)\mathbf{u}_{\ell}(i)}{\lambda_{\ell}-z}.
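For concreteness, the following sketch checks the representation (39) numerically by inverting the block matrix in (38) directly and comparing its Latin-Latin and Greek-Greek blocks with the corresponding eigenvector sums; the matrix sizes, the Gaussian entries and the choice of z are assumptions of the sketch only.

import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 40
X = rng.normal(0.0, 1.0 / np.sqrt(n), (n, p))
z = 1.5 + 0.01j

# direct inversion of the block matrix in (38)
M = np.block([[-np.eye(n), X], [X.T, -z * np.eye(p)]])
R = np.linalg.inv(M)

# eigen-decompositions: (lam_n, U) for X X^T (Latin block), (lam_p, V) for X^T X (Greek block)
lam_n, U = np.linalg.eigh(X @ X.T)
lam_p, V = np.linalg.eigh(X.T @ X)

R_latin = (U * (z / (lam_n - z))) @ U.T    # sum_l z u_l(i) u_l(j) / (lam_l - z)
R_greek = (V * (1.0 / (lam_p - z))) @ V.T  # sum_l v_l(a) v_l(b) / (lam_l - z)

print(np.abs(R[:n, :n] - R_latin).max())   # agreement up to numerical error
print(np.abs(R[n:, n:] - R_greek).max())   # agreement up to numerical error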

An important result is the local Marchenko-Pastur law for the resolvent matrix 𝐑\mathbf{R}. This was first proved in [DY18, Lemma 3.11], and we also refer to [HLS19, Proposition 2.13] for a version that can be directly applied. Specifically, the resolvent matrix 𝐑\mathbf{R} has a deterministic limit, defined by

𝐆(z):=((1+m𝖬𝖯(z))1𝐈n00m𝖬𝖯(z)𝐈p),\mathbf{G}(z):=\left(\begin{matrix}-(1+m_{\mathsf{MP}}(z))^{-1}\mathbf{I}_{n}&0\\ 0&m_{\mathsf{MP}}(z)\mathbf{I}_{p}\end{matrix}\right), (40)

where m𝖬𝖯(z)m_{\mathsf{MP}}(z) is the Stieltjes transform of the Marchenko-Pastur law (6), given by

m𝖬𝖯(z):=ρ𝖬𝖯(x)xzdx=1ξz+(zλ)(zλ+)2ξz,m_{\mathsf{MP}}(z):=\int_{\mathbb{R}}\dfrac{\rho_{\mathsf{MP}}(x)}{x-z}{\rm d}x=\dfrac{1-\xi-z+\sqrt{(z-\lambda_{-})(z-\lambda_{+})}}{2\xi z},

where \sqrt{\cdot} denotes the square root on the complex plane whose branch cut is the negative real line. With this choice we always have Imm𝖬𝖯(z)>0\text{\rm Im}{\,}m_{\mathsf{MP}}(z)>0 when Imz>0\text{\rm Im}{\,}z>0.
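A short numerical sketch of this formula is given below. The helper selects the root with positive imaginary part and compares the result with the empirical Stieltjes transform of X^⊤X; the Gaussian entries with variance 1/n are chosen so that the spectrum of X^⊤X has edges λ_± = (1±√ξ)², which is an assumption of the sketch rather than a restatement of the model assumptions.

import numpy as np

def m_mp(z, xi):
    # Stieltjes transform of the Marchenko-Pastur law with ratio xi,
    # selecting the root with positive imaginary part, as in the text
    lam_minus, lam_plus = (1 - np.sqrt(xi)) ** 2, (1 + np.sqrt(xi)) ** 2
    root = np.sqrt((z - lam_minus) * (z - lam_plus) + 0j)
    m = (1 - xi - z + root) / (2 * xi * z)
    return m if m.imag > 0 else (1 - xi - z - root) / (2 * xi * z)

rng = np.random.default_rng(3)
n, p = 2000, 1000
xi = p / n
X = rng.normal(0.0, 1.0 / np.sqrt(n), (n, p))
evals = np.linalg.eigvalsh(X.T @ X)

z = 1.2 + 0.1j
print(m_mp(z, xi))                 # approximately -0.51 + 1.11j for xi = 0.5
print(np.mean(1.0 / (evals - z)))  # empirical Stieltjes transform, approximately equal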

To state the local law, we will focus on the spectral domain

𝒮:={{E+iη:λ2Eλ++1,0<η<3}if 0<ξ<1.{E+iη:110Eλ++1,0<η<3}ifξ=1.{\mathcal{S}}:=\begin{cases}\left\{E+\mathrm{i}\eta:\frac{\lambda_{-}}{2}\leq E\leq\lambda_{+}+1,0<\eta<3\right\}&\mbox{if}\ 0<\xi<1.\\ \left\{E+\mathrm{i}\eta:\frac{1}{10}\leq E\leq\lambda_{+}+1,0<\eta<3\right\}&\mbox{if}\ \xi=1.\end{cases} (41)
Lemma D.1 (Local Marchenko-Pastur law).

For any ε>0\varepsilon>0, the following estimate holds with overwhelming probability uniformly for z𝒮z\in{\mathcal{S}},

maxa,b|𝐑ab(z)𝐆ab(z)|nε(Imm𝖬𝖯(z)nη+1nη).\max_{a,b\in{\mathcal{I}}}\left|{\mathbf{R}_{ab}(z)-\mathbf{G}_{ab}(z)}\right|\leq n^{\varepsilon}\left(\sqrt{\frac{\text{\rm Im}{\,}m_{\mathsf{MP}}(z)}{n\eta}}+\frac{1}{n\eta}\right). (42)

To give a precise characterization of the resolvent, we rely on the following estimates for the Stieltjes transform m𝖬𝖯(z)m_{\mathsf{MP}}(z) of the Marchenko-Pastur law. We refer to e.g. [BKYY16, Lemma 3.6] for more details.

Lemma D.2 (Estimate for m𝖬𝖯(z)m_{\mathsf{MP}}(z)).

For z=E+iηz=E+\mathrm{i}\eta, let κ(z):=min(|Eλ|,|Eλ+|)\kappa(z):=\min(|E-\lambda_{-}|,|E-\lambda_{+}|) denote the distance to the spectral edge. If z𝒮z\in{\mathcal{S}}, we have

|m𝖬𝖯(z)|1,andImm𝖬𝖯(z){κ(z)+ηifE[λ,λ+],ηκ(z)+ηifE[λ,λ+].|m_{\mathsf{MP}}(z)|\asymp 1,\ \ \mbox{and}\ \ \ \text{\rm Im}{\,}m_{\mathsf{MP}}(z)\asymp\begin{cases}\sqrt{\kappa(z)+\eta}&\mbox{if}\ E\in[\lambda_{-},\lambda_{+}],\\ \frac{\eta}{\sqrt{\kappa(z)+\eta}}&\mbox{if}\ E\notin[\lambda_{-},\lambda_{+}].\end{cases} (43)

In the following analysis, we will work with z=E+iηz=E+\mathrm{i}\eta satisfying |Eλ+|n2/3+δ|E-\lambda_{+}|\leq n^{-2/3+\delta} and η=n2/3δ\eta=n^{-2/3-\delta}, where 0<δ<130<\delta<\frac{1}{3} is some parameter. Uniformly in this regime, the local law yields that the following is true with overwhelming probability for all ε>0\varepsilon>0 and some universal constant C0>0C_{0}>0,

supzmaxab|𝐑ab(z)|n13+δ+ε,andsupzmaxa|𝐑aa(z)|C0.\sup_{z}\max_{a\neq b\in{\mathcal{I}}}|\mathbf{R}_{ab}(z)|\leq n^{-\frac{1}{3}+\delta+\varepsilon},\ \ \mbox{and}\ \ \sup_{z}\max_{a\in{\mathcal{I}}}|\mathbf{R}_{aa}(z)|\leq C_{0}. (44)

These estimates will be used repeatedly in the following subsections.

D.2 Stability of the resolvent

In this subsection, we will prove the main technical result for the proof of resampling stability. Specifically, we will show that under moderate resampling, the resolvent matrices are stable. Since the resolvent is closely related to various spectral statistics, this stability lemma will be a key ingredient of our proof.

Lemma D.3.

Assume kn5/3ϵ0k\leq n^{5/3-\epsilon_{0}} for some ϵ0>0\epsilon_{0}>0. There exists δ0>0\delta_{0}>0 such that for all 0<δ<δ00<\delta<\delta_{0}, uniformly for z=E+iηz=E+\mathrm{i}\eta with |Eλ+|n2/3+δ|E-\lambda_{+}|\leq n^{-2/3+\delta} and η=n2/3δ\eta=n^{-2/3-\delta}, there exists c>0c>0 such that the following is true with overwhelming probability

maxα,β|𝐑αβ[k](z)𝐑αβ(z)|1n1+cη,maxi,j|𝐑ij[k](z)𝐑ij(z)|1n1+cη.\max_{\alpha,\beta}\left|{\mathbf{R}^{[k]}_{\alpha\beta}(z)-\mathbf{R}_{\alpha\beta}(z)}\right|\leq\frac{1}{n^{1+c}\eta},\ \ \max_{i,j}\left|{\mathbf{R}^{[k]}_{ij}(z)-\mathbf{R}_{ij}(z)}\right|\leq\frac{1}{n^{1+c}\eta}. (45)
Proof.

Recall that Sk:={(i1,α1),,(ik,αk)}S_{k}:=\{(i_{1},\alpha_{1}),\dots,(i_{k},\alpha_{k})\} is the random subset of matrix indices whose elements are resampled in the matrix 𝐗\mathbf{X}. For 1tk1\leq t\leq k, let 𝐗[t]\mathbf{X}^{[t]} be the matrix obtained from 𝐗\mathbf{X} by resampling the {(is,αs)}1st\{(i_{s},\alpha_{s})\}_{1\leq s\leq t} entries and let t{\mathcal{F}}_{t} be the σ\sigma-algebra generated by the random variables 𝐗\mathbf{X}, SkS_{k} and {𝐗isαs}1st\{\mathbf{X}_{i_{s}\alpha_{s}}^{\prime}\}_{1\leq s\leq t}. For a,ba,b\in{\mathcal{I}}, we define

Tab:={t:{it,αt}{a,b}}.T_{ab}:=\left\{t:\{i_{t},\alpha_{t}\}\cap\{a,b\}\neq\emptyset\right\}.

Let ε>0\varepsilon>0 be an arbitrarily fixed parameter, and let C0C_{0} be the constant in (44). Consider the event tt{\mathcal{E}}_{t}\in{\mathcal{F}}_{t} where for all z=E+iηz=E+\mathrm{i}\eta with |zλ+|n2/3+δ|z-\lambda_{+}|\leq n^{-2/3+\delta} and η=n2/3δ\eta=n^{-2/3-\delta} we have

maxab|𝐑ab[t](z)|n13+δ+ε=:Ψ,andmaxa|𝐑aa[t](z)|C0.\max_{a\neq b}\left|{\mathbf{R}^{[t]}_{ab}(z)}\right|\leq n^{-\frac{1}{3}+\delta+\varepsilon}=:\Psi,\ \ \mbox{and}\ \ \max_{a}\left|{\mathbf{R}^{[t]}_{aa}(z)}\right|\leq C_{0}.

Define 𝐗0[t]\mathbf{X}^{[t]}_{0} as the matrix obtained from 𝐗[t]\mathbf{X}^{[t]} by replacing the (it,αt)(i_{t},\alpha_{t}) entry with 0, and also define its symmetrization 𝐗~0[t]||×||\widetilde{\mathbf{X}}^{[t]}_{0}\in\mathbb{R}^{|{\mathcal{I}}|\times|{\mathcal{I}}|} as in (5). Note that 𝐗~0[t+1]\widetilde{\mathbf{X}}^{[t+1]}_{0} is t{\mathcal{F}}_{t}-measurable. We write

𝐗~[t]=𝐗~0[t+1]+𝐏~[t+1],𝐗~[t+1]=𝐗~0[t+1]+𝐐~[t+1],\widetilde{\mathbf{X}}^{[t]}=\widetilde{\mathbf{X}}^{[t+1]}_{0}+\widetilde{\mathbf{P}}^{[t+1]},\ \ \ \widetilde{\mathbf{X}}^{[t+1]}=\widetilde{\mathbf{X}}^{[t+1]}_{0}+\widetilde{\mathbf{Q}}^{[t+1]},

where 𝐏~[t],𝐐~[t]\widetilde{\mathbf{P}}^{[t]},\widetilde{\mathbf{Q}}^{[t]} are ||×|||{\mathcal{I}}|\times|{\mathcal{I}}| symmetric matrices whose elements are all 0 except at the (it,αt)(i_{t},\alpha_{t}) and (αt,it)(\alpha_{t},i_{t}) entries, satisfying

(𝐏~[t])ab={𝐗itαtif{a,b}={it,αt},0otherwise(𝐐~[t])ab={𝐗itαtif{a,b}={it,αt},0otherwise.(\widetilde{\mathbf{P}}^{[t]})_{ab}=\begin{cases}\mathbf{X}_{i_{t}\alpha_{t}}&\mbox{if}\ \left\{a,b\right\}=\{i_{t},\alpha_{t}\},\\ 0&\mbox{otherwise}\end{cases}\ \ \ (\widetilde{\mathbf{Q}}^{[t]})_{ab}=\begin{cases}\mathbf{X}_{i_{t}\alpha_{t}}^{\prime}&\mbox{if}\ \left\{a,b\right\}=\{i_{t},\alpha_{t}\},\\ 0&\mbox{otherwise}\end{cases}.

Define the resolvents for the matrices 𝐗~[t]\widetilde{\mathbf{X}}^{[t]} and 𝐗~0[t]\widetilde{\mathbf{X}}^{[t]}_{0} as in (38):

𝐑[t]:=(𝐈n𝐗[t](𝐗[t])z𝐈p)1,𝐑0[t]:=(𝐈n𝐗0[t](𝐗0[t])z𝐈p)1.\mathbf{R}^{[t]}:=\left(\begin{matrix}-\mathbf{I}_{n}&\mathbf{X}^{[t]}\\ (\mathbf{X}^{[t]})^{\top}&-z\mathbf{I}_{p}\end{matrix}\right)^{-1},\ \ \mathbf{R}^{[t]}_{0}:=\left(\begin{matrix}-\mathbf{I}_{n}&\mathbf{X}^{[t]}_{0}\\ (\mathbf{X}^{[t]}_{0})^{\top}&-z\mathbf{I}_{p}\end{matrix}\right)^{-1}.

Using first-order resolvent expansion, we obtain

𝐑0[t+1]=𝐑[t]+𝐑[t]𝐏~[t+1]𝐑[t]+(𝐑[t]𝐏~[t+1])2𝐑0[t+1].\mathbf{R}^{[t+1]}_{0}=\mathbf{R}^{[t]}+\mathbf{R}^{[t]}\widetilde{\mathbf{P}}^{[t+1]}\mathbf{R}^{[t]}+\left(\mathbf{R}^{[t]}\widetilde{\mathbf{P}}^{[t+1]}\right)^{2}\mathbf{R}^{[t+1]}_{0}. (46)
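Underlying (46) is the exact algebraic identity (M−P)^{-1} = M^{-1} + M^{-1}PM^{-1} + (M^{-1}P)^2(M−P)^{-1}. The generic check below is only an illustration: the well-conditioned test matrix M stands in for the block matrix inverted in (38), and P for the two-entry perturbation P̃^{[t+1]}.

import numpy as np

rng = np.random.default_rng(8)
d = 50
M = rng.normal(size=(d, d)) / np.sqrt(d) + 3.0 * np.eye(d)   # well-conditioned stand-in
P = np.zeros((d, d))
P[3, 7] = P[7, 3] = 0.1                                      # two-entry symmetric perturbation

R_M = np.linalg.inv(M)        # plays the role of R^[t]
R_MP = np.linalg.inv(M - P)   # plays the role of R_0^[t+1]
expansion = R_M + R_M @ P @ R_M + R_M @ P @ R_M @ P @ R_MP
print(np.abs(R_MP - expansion).max())   # ~ machine precision: the identity is exact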

The triangle inequality yields

|(𝐑0[t+1]𝐑[t])αβ||(𝐑[t]𝐏~[t+1]𝐑[t])αβ|+|(𝐑[t]𝐏~[t+1]𝐑[t]𝐏~[t+1]𝐑0[t+1])αβ|.\left|{\left(\mathbf{R}^{[t+1]}_{0}-\mathbf{R}^{[t]}\right)_{\alpha\beta}}\right|\leq\left|{\left(\mathbf{R}^{[t]}\widetilde{\mathbf{P}}^{[t+1]}\mathbf{R}^{[t]}\right)_{\alpha\beta}}\right|+\left|{\left(\mathbf{R}^{[t]}\widetilde{\mathbf{P}}^{[t+1]}\mathbf{R}^{[t]}\widetilde{\mathbf{P}}^{[t+1]}\mathbf{R}^{[t+1]}_{0}\right)_{\alpha\beta}}\right|.

Note that 𝐏~[t+1]\widetilde{\mathbf{P}}^{[t+1]} has only two non-zero entries,

(𝐑[t]𝐏~[t+1]𝐑[t])αβ=1,2𝐑α1[t]𝐏~12[t+1]𝐑2β[t]=Xit+1αt+1(𝐑αit+1[t]𝐑αt+1β[t]+𝐑ααt+1[t]𝐑it+1β[t])\left(\mathbf{R}^{[t]}\widetilde{\mathbf{P}}^{[t+1]}\mathbf{R}^{[t]}\right)_{\alpha\beta}=\sum_{\ell_{1},\ell_{2}}\mathbf{R}^{[t]}_{\alpha\ell_{1}}\widetilde{\mathbf{P}}^{[t+1]}_{\ell_{1}\ell_{2}}\mathbf{R}^{[t]}_{\ell_{2}\beta}=X_{i_{t+1}\alpha_{t+1}}\left(\mathbf{R}^{[t]}_{\alpha i_{t+1}}\mathbf{R}^{[t]}_{\alpha_{t+1}\beta}+\mathbf{R}^{[t]}_{\alpha\alpha_{t+1}}\mathbf{R}^{[t]}_{i_{t+1}\beta}\right)

Recall that |Xit+1αt+1|n1/2+ε|X_{i_{t+1}\alpha_{t+1}}|\leq n^{-1/2+\varepsilon} with overwhelming probability thanks to the sub-exponential decay assumption (2). Then on the event t{\mathcal{E}}_{t}, we have

|(𝐑[t]𝐏~[t+1]𝐑[t])αβ|2C0Ψn12+ε.\left|{\left(\mathbf{R}^{[t]}\widetilde{\mathbf{P}}^{[t+1]}\mathbf{R}^{[t]}\right)_{\alpha\beta}}\right|\leq 2C_{0}\Psi n^{-\frac{1}{2}+\varepsilon}.

Similarly,

(𝐑[t]𝐏~[t+1]𝐑[t]𝐏~[t+1]𝐑0[t+1])αβ={m1,m2},{m3,m4}={it+1,αt+1}𝐑αm1[t]𝐏~m1m2[t+1]𝐑m2m3[t]𝐏~m3m4[t+1](𝐑0[t+1])m4β.\left(\mathbf{R}^{[t]}\widetilde{\mathbf{P}}^{[t+1]}\mathbf{R}^{[t]}\widetilde{\mathbf{P}}^{[t+1]}\mathbf{R}^{[t+1]}_{0}\right)_{\alpha\beta}=\sum_{\{m_{1},m_{2}\},\{m_{3},m_{4}\}=\{i_{t+1},\alpha_{t+1}\}}\mathbf{R}^{[t]}_{\alpha m_{1}}\widetilde{\mathbf{P}}^{[t+1]}_{m_{1}m_{2}}\mathbf{R}^{[t]}_{m_{2}m_{3}}\widetilde{\mathbf{P}}^{[t+1]}_{m_{3}m_{4}}(\mathbf{R}^{[t+1]}_{0})_{m_{4}\beta}.

We use the trivial bound |𝐑0[t+1]|η1|\mathbf{R}^{[t+1]}_{0}|\leq\eta^{-1} for the last term. Then, on the event t{\mathcal{E}}_{t}, we have

|(𝐑[t]𝐏~[t+1]𝐑[t]𝐏~[t+1]𝐑0[t+1])αβ|2n1+2εη1(Ψ2+C02)Ψ.\left|{\left(\mathbf{R}^{[t]}\widetilde{\mathbf{P}}^{[t+1]}\mathbf{R}^{[t]}\widetilde{\mathbf{P}}^{[t+1]}\mathbf{R}^{[t+1]}_{0}\right)_{\alpha\beta}}\right|\leq 2n^{-1+2\varepsilon}\eta^{-1}(\Psi^{2}+C_{0}^{2})\ll\Psi.

Therefore, we have shown that, on the event t{\mathcal{E}}_{t},

maxαβ|(𝐑0[t+1])αβ|2Ψ,maxα|(𝐑0[t+1])αα|2C0.\max_{\alpha\neq\beta}\left|{(\mathbf{R}^{[t+1]}_{0})_{\alpha\beta}}\right|\leq 2\Psi,\ \ \max_{\alpha}\left|{(\mathbf{R}^{[t+1]}_{0})_{\alpha\alpha}}\right|\leq 2C_{0}. (47)

Similarly, using the first-order resolvent expansion for 𝐑[t+1]\mathbf{R}^{[t+1]} around 𝐑[t]\mathbf{R}^{[t]}, we have

𝐑[t+1]=𝐑[t]+𝐑[t](𝐏~[t+1]𝐐~[t+1])𝐑[t]+(𝐑[t](𝐏~[t+1]𝐐~[t+1]))2𝐑[t+1].\mathbf{R}^{[t+1]}=\mathbf{R}^{[t]}+\mathbf{R}^{[t]}(\widetilde{\mathbf{P}}^{[t+1]}-\widetilde{\mathbf{Q}}^{[t+1]})\mathbf{R}^{[t]}+\left(\mathbf{R}^{[t]}(\widetilde{\mathbf{P}}^{[t+1]}-\widetilde{\mathbf{Q}}^{[t+1]})\right)^{2}\mathbf{R}^{[t+1]}.

By the same arguments as above, on the event t{\mathcal{E}}_{t}, we can derive

maxαβ|𝐑αβ[t+1]|2Ψ,maxα|𝐑αα[t+1]|2C0.\max_{\alpha\neq\beta}\left|{\mathbf{R}^{[t+1]}_{\alpha\beta}}\right|\leq 2\Psi,\ \ \max_{\alpha}\left|{\mathbf{R}^{[t+1]}_{\alpha\alpha}}\right|\leq 2C_{0}.

Next, we use the resolvent identity (or zeroth-order expansion) for 𝐑[t+1]\mathbf{R}^{[t+1]} and 𝐑0[t+1]\mathbf{R}^{[t+1]}_{0}:

𝐑[t+1]=𝐑0[t+1]𝐑0[t+1]𝐐~[t+1]𝐑[t+1].\mathbf{R}^{[t+1]}=\mathbf{R}^{[t+1]}_{0}-\mathbf{R}^{[t+1]}_{0}\widetilde{\mathbf{Q}}^{[t+1]}\mathbf{R}^{[t+1]}.

This leads to

|(𝐑[t+1]𝐑0[t+1])αβ|=|{1,2}={it+1αt+1}(𝐑0[t+1])α1𝐐~12[t+1]𝐑2β[t+1]|\left|{\left(\mathbf{R}^{[t+1]}-\mathbf{R}^{[t+1]}_{0}\right)_{\alpha\beta}}\right|=\left|{\sum_{\{\ell_{1},\ell_{2}\}=\{i_{t+1}\alpha_{t+1}\}}(\mathbf{R}^{[t+1]}_{0})_{\alpha\ell_{1}}\widetilde{\mathbf{Q}}^{[t+1]}_{\ell_{1}\ell_{2}}\mathbf{R}^{[t+1]}_{\ell_{2}\beta}}\right|

Thus, on the event t{\mathcal{E}}_{t}, we conclude

|(𝐑[t+1]𝐑0[t+1])αβ|4n12+ε(Ψ2+C0Ψ𝟙((t+1)Tαβ))=:𝔣αβ[t+1]\left|{\left(\mathbf{R}^{[t+1]}-\mathbf{R}^{[t+1]}_{0}\right)_{\alpha\beta}}\right|\leq 4n^{-\frac{1}{2}+\varepsilon}\left(\Psi^{2}+C_{0}\Psi\mathbbm{1}_{((t+1)\in T_{\alpha\beta})}\right)=:\mathfrak{f}_{\alpha\beta}^{[t+1]} (48)

Meanwhile, the second-order resolvent expansion of 𝐑[t+1]\mathbf{R}^{[t+1]} around 𝐑0[t+1]\mathbf{R}^{[t+1]}_{0} yields

𝐑[t+1]=𝐑0[t+1]𝐑0[t+1]𝐐~[t+1]𝐑0[t+1]+(𝐑0[t+1]𝐐~[t+1])2𝐑0[t+1](𝐑0[t+1]𝐐~[t+1])3𝐑[t+1].\mathbf{R}^{[t+1]}=\mathbf{R}^{[t+1]}_{0}-\mathbf{R}^{[t+1]}_{0}\widetilde{\mathbf{Q}}^{[t+1]}\mathbf{R}^{[t+1]}_{0}+\left(\mathbf{R}^{[t+1]}_{0}\widetilde{\mathbf{Q}}^{[t+1]}\right)^{2}\mathbf{R}^{[t+1]}_{0}-\left(\mathbf{R}^{[t+1]}_{0}\widetilde{\mathbf{Q}}^{[t+1]}\right)^{3}\mathbf{R}^{[t+1]}.

A key observation is that 𝐑0[t+1]\mathbf{R}^{[t+1]}_{0} is t{\mathcal{F}}_{t}-measurable, and 𝔼[𝐐~[t+1]|t]=0\mathbb{E}[\widetilde{\mathbf{Q}}^{[t+1]}|{\mathcal{F}}_{t}]=0. For simplicity of notations, we set

𝔮αβ[t]:=((𝐑0[t]𝐄~(it,αt))2𝐑0[t])αβ\mathfrak{q}_{\alpha\beta}^{[t]}:=\left((\mathbf{R}^{[t]}_{0}\widetilde{\mathbf{E}}^{(i_{t},\alpha_{t})})^{2}\mathbf{R}^{[t]}_{0}\right)_{\alpha\beta}

where 𝐄~(it,αt)||×||\widetilde{\mathbf{E}}^{(i_{t},\alpha_{t})}\in\mathbb{R}^{|{\mathcal{I}}|\times|{\mathcal{I}}|} is the symmetrization of the matrix 𝐄(it,αt)n×p\mathbf{E}^{(i_{t},\alpha_{t})}\in\mathbb{R}^{n\times p} whose elements are all 0 except 𝐄itαt(it,αt)=1\mathbf{E}^{(i_{t},\alpha_{t})}_{i_{t}\alpha_{t}}=1. Then we have

|𝔼[𝐑αβ[t+1]|t](𝐑0[t+1])αβp1𝔮αβ[t+1]|32n32+3ε(Ψ2C02+C04𝟙((t+1)Tαβ))=:𝔤αβ[t+1].\left|{\mathbb{E}\left[\mathbf{R}^{[t+1]}_{\alpha\beta}|{\mathcal{F}}_{t}\right]-(\mathbf{R}^{[t+1]}_{0})_{\alpha\beta}-p^{-1}\mathfrak{q}_{\alpha\beta}^{[t+1]}}\right|\leq 32n^{-\frac{3}{2}+3\varepsilon}\left(\Psi^{2}C_{0}^{2}+C_{0}^{4}\mathbbm{1}_{((t+1)\in T_{\alpha\beta})}\right)=:\mathfrak{g}_{\alpha\beta}^{[t+1]}. (49)

Similarly, using resolvent expansion of 𝐑[t]\mathbf{R}^{[t]} around 𝐑0[t+1]\mathbf{R}^{[t+1]}_{0}, we obtain

𝐑[t]=𝐑0[t+1]𝐑0[t+1]𝐏~[t+1]𝐑0[t+1]+(𝐑0[t+1]𝐏~[t+1])2𝐑0[t+1](𝐑0[t+1]𝐏~[t+1])3𝐑[t].\mathbf{R}^{[t]}=\mathbf{R}^{[t+1]}_{0}-\mathbf{R}^{[t+1]}_{0}\widetilde{\mathbf{P}}^{[t+1]}\mathbf{R}^{[t+1]}_{0}+(\mathbf{R}^{[t+1]}_{0}\widetilde{\mathbf{P}}^{[t+1]})^{2}\mathbf{R}^{[t+1]}_{0}-(\mathbf{R}^{[t+1]}_{0}\widetilde{\mathbf{P}}^{[t+1]})^{3}\mathbf{R}^{[t]}.

By the same arguments as above, on the event t{\mathcal{E}}_{t}, we deduce that

|𝐑αβ[t](𝐑0[t+1])αβ+𝐗it+1αt+1𝔭αβ[t+1]𝐗it+1αt+12𝔮αβ[t+1]|𝔤αβ[t+1]\left|{\mathbf{R}^{[t]}_{\alpha\beta}-(\mathbf{R}^{[t+1]}_{0})_{\alpha\beta}+\mathbf{X}_{i_{t+1}\alpha_{t+1}}\mathfrak{p}_{\alpha\beta}^{[t+1]}-\mathbf{X}_{i_{t+1}\alpha_{t+1}}^{2}\mathfrak{q}_{\alpha\beta}^{[t+1]}}\right|\leq\mathfrak{g}_{\alpha\beta}^{[t+1]} (50)

where

𝔭αβ[t]:=(𝐑0[t]𝐄~(it,αt)𝐑0[t])αβ.\mathfrak{p}_{\alpha\beta}^{[t]}:=\left(\mathbf{R}^{[t]}_{0}\widetilde{\mathbf{E}}^{(i_{t},\alpha_{t})}\mathbf{R}^{[t]}_{0}\right)_{\alpha\beta}. (51)

Combining (49) and (50) yields

|𝔼[𝐑αβ[t+1]|t]𝐑αβ[t]𝐗it+1αt+1𝔭αβ[t+1]+(𝐗it+1αt+12p1)𝔮αβ[t+1]|2𝔤αβ[t+1].\left|{\mathbb{E}\left[\mathbf{R}^{[t+1]}_{\alpha\beta}|{\mathcal{F}}_{t}\right]-\mathbf{R}^{[t]}_{\alpha\beta}-\mathbf{X}_{i_{t+1}\alpha_{t+1}}\mathfrak{p}_{\alpha\beta}^{[t+1]}+(\mathbf{X}_{i_{t+1}\alpha_{t+1}}^{2}-p^{-1})\mathfrak{q}_{\alpha\beta}^{[t+1]}}\right|\leq 2\mathfrak{g}_{\alpha\beta}^{[t+1]}. (52)

By a telescopic summation, we obtain

𝐑αβ[k]𝐑αβ\displaystyle\mathbf{R}^{[k]}_{\alpha\beta}-\mathbf{R}_{\alpha\beta} =t=0k1(𝐑αβ[t+1]𝐑αβ[t])\displaystyle=\sum_{t=0}^{k-1}\left(\mathbf{R}^{[t+1]}_{\alpha\beta}-\mathbf{R}^{[t]}_{\alpha\beta}\right) (53)
=t=0k1(𝐑αβ[t+1]𝔼[𝐑αβ[t+1]|t])+t=0k1𝐗it+1αt+1𝔭αβ[t+1]t=0k1(𝐗it+1αt+12p1)𝔮αβ[t+1]+𝔯αβ\displaystyle=\sum_{t=0}^{k-1}\left(\mathbf{R}^{[t+1]}_{\alpha\beta}-\mathbb{E}\left[\mathbf{R}^{[t+1]}_{\alpha\beta}|{\mathcal{F}}_{t}\right]\right)+\sum_{t=0}^{k-1}\mathbf{X}_{i_{t+1}\alpha_{t+1}}\mathfrak{p}_{\alpha\beta}^{[t+1]}-\sum_{t=0}^{k-1}(\mathbf{X}_{i_{t+1}\alpha_{t+1}}^{2}-p^{-1})\mathfrak{q}_{\alpha\beta}^{[t+1]}+\mathfrak{r}_{\alpha\beta}

where, by (52), the remainder 𝔯αβ\mathfrak{r}_{\alpha\beta} satisfies

|𝔯αβ|2t=0k1𝔤αβ[t+1].|\mathfrak{r}_{\alpha\beta}|\leq 2\sum_{t=0}^{k-1}\mathfrak{g}_{\alpha\beta}^{[t+1]}.

Recalling the expression of 𝔤αβ[t]\mathfrak{g}_{\alpha\beta}^{[t]}, to estimate the remainder we need to control the size of the set TαβT_{\alpha\beta}. Note that 𝔼[|Tαβ|]=2k/p.\mathbb{E}\left[|T_{\alpha\beta}|\right]=2k/p. By a Bernstein-type inequality (see e.g. [Cha07, Proposition 1.1]), for any x>0x>0, we have

(|Tαβ|𝔼[|Tαβ|]+x)exp(x24𝔼[|Tαβ|]+2x)\mathbb{P}\left(|T_{\alpha\beta}|\geq\mathbb{E}\left[|T_{\alpha\beta}|\right]+x\right)\leq\exp\left(-\frac{x^{2}}{4\mathbb{E}\left[|T_{\alpha\beta}|\right]+2x}\right)

Recall that kn5/3ϵ0k\leq n^{5/3-\epsilon_{0}}. The inequality together with a union bound implies that

maxα,β|Tαβ|3max(k,p(logn)2)p=:𝖳\max_{\alpha,\beta}|T_{\alpha\beta}|\leq\frac{3\max(k,p(\log n)^{2})}{p}=:{\mathsf{T}}

with overwhelming probability. We denote this event by 𝒯{\mathcal{T}}. On the event 𝒯{\mathcal{T}}, we have

|𝔯αβ|2kn32+3εΨ2C02+2n32+3εC04𝖳n3ε𝖳Ψ2.|\mathfrak{r}_{\alpha\beta}|\leq 2kn^{-\frac{3}{2}+3\varepsilon}\Psi^{2}C_{0}^{2}+2n^{-\frac{3}{2}+3\varepsilon}C_{0}^{4}{\mathsf{T}}\leq n^{3\varepsilon}\sqrt{{\mathsf{T}}}\Psi^{2}. (54)
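For intuition on the size of T_{αβ} controlled above, note that for a pair of Greek indices T_{αβ} collects the resampled positions whose column index lies in {α, β}, so its maximum over α ≠ β is the sum of the two largest column counts. A quick simulation (with illustrative parameters) compares this maximum with the bound 3max(k, p(log n)²)/p:

import numpy as np

rng = np.random.default_rng(5)
n, p = 500, 400
k = int(n ** 1.5)   # some k below n^{5/3 - eps_0}

flat = rng.choice(n * p, size=k, replace=False)
rows, cols = np.unravel_index(flat, (n, p))

col_counts = np.bincount(cols, minlength=p)
max_T_greek = np.sort(col_counts)[-2:].sum()   # max over alpha != beta of |T_{alpha beta}|

bound = 3 * max(k, p * np.log(n) ** 2) / p
print(max_T_greek, bound)   # the observed maximum is typically well below the stated bound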

For the first term in (53), we set

𝔴αβ[t+1]:=(𝐑αβ[t+1]𝔼[𝐑αβ[t+1]|t])𝟙t.\mathfrak{w}_{\alpha\beta}^{[t+1]}:=\left(\mathbf{R}^{[t+1]}_{\alpha\beta}-\mathbb{E}\left[\mathbf{R}^{[t+1]}_{\alpha\beta}|{\mathcal{F}}_{t}\right]\right)\mathbbm{1}_{{\mathcal{E}}_{t}}.

Note that tt{\mathcal{E}}_{t}\in{\mathcal{F}}_{t}. This implies that 𝔼[𝔴αβ[t+1]|t]=0\mathbb{E}[\mathfrak{w}_{\alpha\beta}^{[t+1]}|{\mathcal{F}}_{t}]=0. Moreover, by (48), on the event t{\mathcal{E}}_{t} we have |𝔴αβ[t+1]|2𝔣αβ[t+1]|\mathfrak{w}_{\alpha\beta}^{[t+1]}|\leq 2\mathfrak{f}_{\alpha\beta}^{[t+1]}. Further, on the event 𝒯{\mathcal{T}},

(t=0k1(𝔣αβ[t+1])2)1/2n12+εΨ2k+n12+εC0Ψ𝖳2nεΨ2𝖳.\left(\sum_{t=0}^{k-1}(\mathfrak{f}_{\alpha\beta}^{[t+1]})^{2}\right)^{1/2}\leq n^{-\frac{1}{2}+\varepsilon}\Psi^{2}\sqrt{k}+n^{-\frac{1}{2}+\varepsilon}C_{0}\Psi\sqrt{{\mathsf{T}}}\leq 2n^{\varepsilon}\Psi^{2}\sqrt{{\mathsf{T}}}.

Using the Azuma-Hoeffding inequality, for any x0x\geq 0, we have

(|t=0k1𝔴αβ[t+1]|2nεΨ2𝖳x)2exp(x22).\mathbb{P}\left(\left|{\sum_{t=0}^{k-1}\mathfrak{w}_{\alpha\beta}^{[t+1]}}\right|\geq 2n^{\varepsilon}\Psi^{2}\sqrt{{\mathsf{T}}}x\right)\leq 2\exp\left(-\frac{x^{2}}{2}\right).

Moreover,

(|t=0k1(𝐑αβ[t+1]𝔼[𝐑αβ[t+1]|t])|2nεΨ2𝖳x)(|t=0k1𝔴αβ[t+1]|2nεΨ2𝖳x)+t=0k1(tc).\mathbb{P}\left(\left|{\sum_{t=0}^{k-1}\left(\mathbf{R}^{[t+1]}_{\alpha\beta}-\mathbb{E}\left[\mathbf{R}^{[t+1]}_{\alpha\beta}|{\mathcal{F}}_{t}\right]\right)}\right|\geq 2n^{\varepsilon}\Psi^{2}\sqrt{{\mathsf{T}}}x\right)\leq\mathbb{P}\left(\left|{\sum_{t=0}^{k-1}\mathfrak{w}_{\alpha\beta}^{[t+1]}}\right|\geq 2n^{\varepsilon}\Psi^{2}\sqrt{{\mathsf{T}}}x\right)+\sum_{t=0}^{k-1}\mathbb{P}({\mathcal{E}}_{t}^{c}).

Recall that t{\mathcal{E}}_{t} holds with overwhelming probability, and consequently t=0k1(tc)nD\sum_{t=0}^{k-1}\mathbb{P}({\mathcal{E}}_{t}^{c})\leq n^{-D} for any D>0D>0. Choosing x=nεx=n^{\varepsilon} implies that with overwhelming probability

|t=0k1(𝐑αβ[t+1]𝔼[𝐑αβ[t+1]|t])|2n2εΨ2𝖳.\left|{\sum_{t=0}^{k-1}\left(\mathbf{R}^{[t+1]}_{\alpha\beta}-\mathbb{E}\left[\mathbf{R}^{[t+1]}_{\alpha\beta}|{\mathcal{F}}_{t}\right]\right)}\right|\leq 2n^{2\varepsilon}\Psi^{2}\sqrt{{\mathsf{T}}}. (55)

We deal with the next two terms in (53) by introducing a backward filtration. Let t{\mathcal{F}}_{t}^{\prime} be the σ\sigma-algebra generated by the random variables 𝐗\mathbf{X}^{\prime}, SkS_{k} and {𝐗iα}\{\mathbf{X}_{i\alpha}\} with (i,α){(i1,α1),,(it,αt)}(i,\alpha)\notin\left\{(i_{1},\alpha_{1}),\dots,(i_{t},\alpha_{t})\right\}. Similarly to the above, we consider the event t{\mathcal{E}}_{t}^{\prime} that for all z=E+iηz=E+\mathrm{i}\eta with |zλ+|n2/3+δ|z-\lambda_{+}|\leq n^{-2/3+\delta} and η=n2/3δ\eta=n^{-2/3-\delta} we have

maxab|𝐑ab[t](z)|Ψ,andmaxa|𝐑aa[t](z)|C0.\max_{a\neq b}\left|{\mathbf{R}^{[t]}_{ab}(z)}\right|\leq\Psi,\ \ \mbox{and}\ \ \max_{a}\left|{\mathbf{R}^{[t]}_{aa}(z)}\right|\leq C_{0}.

Using resolvent expansion, the same arguments for (47) yield that, on the event t{\mathcal{E}}_{t}^{\prime}, we have

maxαβ|(𝐑0[t])αβ|2Ψ,maxα|(𝐑0[t])αα|2C0.\max_{\alpha\neq\beta}\left|{(\mathbf{R}^{[t]}_{0})_{\alpha\beta}}\right|\leq 2\Psi,\ \ \max_{\alpha}\left|{(\mathbf{R}^{[t]}_{0})_{\alpha\alpha}}\right|\leq 2C_{0}.

A key observation is that 𝔭αβ[t]\mathfrak{p}_{\alpha\beta}^{[t]} defined in (51) is t{\mathcal{F}}_{t}^{\prime}-measurable. Also, we have 𝔼[𝐗itαt|t]=0\mathbb{E}[\mathbf{X}_{i_{t}\alpha_{t}}|{\mathcal{F}}_{t}^{\prime}]=0. Consider

𝔭~αβ[t]:=𝐗itαt𝔭αβ[t]𝟙t.\widetilde{\mathfrak{p}}_{\alpha\beta}^{[t]}:=\mathbf{X}_{i_{t}\alpha_{t}}\mathfrak{p}_{\alpha\beta}^{[t]}\mathbbm{1}_{{\mathcal{E}}_{t}^{\prime}}.

Then we have 𝔼[𝔭~αβ[t]|t]=0\mathbb{E}[\widetilde{\mathfrak{p}}_{\alpha\beta}^{[t]}|{\mathcal{F}}_{t}^{\prime}]=0 since we also have tt{\mathcal{E}}_{t}^{\prime}\in{\mathcal{F}}_{t}^{\prime}. Note that

(|t=0k1𝐗it+1αt+1𝔭αβ[t+1]|x)(|t=0k1𝔭~αβ[t+1]|x)+t=0k1((t+1)c),\mathbb{P}\left(\left|{\sum_{t=0}^{k-1}\mathbf{X}_{i_{t+1}\alpha_{t+1}}\mathfrak{p}_{\alpha\beta}^{[t+1]}}\right|\geq x\right)\leq\mathbb{P}\left(\left|{\sum_{t=0}^{k-1}\widetilde{\mathfrak{p}}_{\alpha\beta}^{[t+1]}}\right|\geq x\right)+\sum_{t=0}^{k-1}\mathbb{P}(({\mathcal{E}}_{t+1}^{\prime})^{c}),

The second term is negligible since t{\mathcal{E}}_{t}^{\prime} holds with overwhelming probability. To estimate the first term, we use the Azuma-Hoeffding inequality as before. By arguments similar to those for (48), we deduce

|𝔭~αβ[t]|4n12+ε(Ψ2+C0Ψ𝟙(tTαβ)).\left|{\widetilde{\mathfrak{p}}_{\alpha\beta}^{[t]}}\right|\leq 4n^{-\frac{1}{2}+\varepsilon}\left(\Psi^{2}+C_{0}\Psi\mathbbm{1}_{(t\in T_{\alpha\beta})}\right).

By considering the event 𝒯{\mathcal{T}} and using Azuma-Hoeffding inequality as in (55), we can conclude that with overwhelming probability,

|t=0k1𝔭~αβ[t+1]|n2εΨ2𝖳\left|{\sum_{t=0}^{k-1}\widetilde{\mathfrak{p}}_{\alpha\beta}^{[t+1]}}\right|\leq n^{2\varepsilon}\Psi^{2}\sqrt{{\mathsf{T}}}

As a consequence, with overwhelming probability

|t=0k1𝐗it+1αt+1𝔭αβ[t+1]|n2εΨ2𝖳.\left|{\sum_{t=0}^{k-1}\mathbf{X}_{i_{t+1}\alpha_{t+1}}\mathfrak{p}_{\alpha\beta}^{[t+1]}}\right|\lesssim n^{2\varepsilon}\Psi^{2}\sqrt{{\mathsf{T}}}. (56)

For the third term in (53), by the same arguments, we have

|t=0k1(𝐗it+1αt+12p1)𝔮αβ[t+1]|n2εΨ2𝖳.\left|{\sum_{t=0}^{k-1}(\mathbf{X}_{i_{t+1}\alpha_{t+1}}^{2}-p^{-1})\mathfrak{q}_{\alpha\beta}^{[t+1]}}\right|\lesssim n^{2\varepsilon}\Psi^{2}\sqrt{{\mathsf{T}}}. (57)

Finally, combining (53), (54), (55), (56) and (57), we have shown that

|𝐑αβ[k](z)𝐑αβ(z)|n3εΨ2𝖳.\left|{\mathbf{R}^{[k]}_{\alpha\beta}(z)-\mathbf{R}_{\alpha\beta}(z)}\right|\lesssim n^{3\varepsilon}\Psi^{2}\sqrt{{\mathsf{T}}}.

Recall that η=n2/3δ\eta=n^{-2/3-\delta}, Ψ=O(n13+δ+ε)\Psi=O(n^{-\frac{1}{3}+\delta+\varepsilon}), and 𝖳=O(n23ϵ0){\mathsf{T}}=O(n^{\frac{2}{3}-\epsilon_{0}}). Then we obtain

nη|𝐑αβ[k](z)𝐑αβ(z)|nϵ02+δ+5ε.n\eta\left|{\mathbf{R}^{[k]}_{\alpha\beta}(z)-\mathbf{R}_{\alpha\beta}(z)}\right|\leq n^{-\frac{\epsilon_{0}}{2}+\delta+5\varepsilon}. (58)

Choosing δ+5ε<ϵ02\delta+5\varepsilon<\frac{\epsilon_{0}}{2} yields the desired bound (45) for a fixed zz.

To extend this estimate to a uniform one, we invoke a standard net argument. We divide the interval [n2/3+δ,n2/3+δ][-n^{-2/3+\delta},n^{-2/3+\delta}] into n2n^{2} sub-intervals, and consider z=E+iηz=E+\mathrm{i}\eta with Eλ+E-\lambda_{+} taking values in each sub-interval. Note that

|𝐑αβ(z1)𝐑αβ(z2)||z1z2|min(Im(z1),Im(z2))2.|\mathbf{R}_{\alpha\beta}(z_{1})-\mathbf{R}_{\alpha\beta}(z_{2})|\leq\frac{|z_{1}-z_{2}|}{\min(\text{\rm Im}{\,}(z_{1}),\text{\rm Im}{\,}(z_{2}))^{2}}.

For z1z_{1}, z2z_{2} associated with the same sub-interval, we have

nη|𝐑αβ(z1)𝐑αβ(z2)|nηn2/3+δn2η2n1+2δ,n\eta|\mathbf{R}_{\alpha\beta}(z_{1})-\mathbf{R}_{\alpha\beta}(z_{2})|\leq n\eta\frac{n^{-2/3+\delta}n^{-2}}{\eta^{2}}\leq n^{-1+2\delta},

which is of lower order compared with the error bound in (58). This shows that, up to a small multiplicative factor, the desired error bound (45) holds uniformly in each sub-interval with overwhelming probability. Finally, thanks to the overwhelming probability, a union bound over the n2n^{2} sub-intervals yields the desired uniform estimate (45) for all z=E+iηz=E+\mathrm{i}\eta with |Eλ+|n2/3+δ|E-\lambda_{+}|\leq n^{-2/3+\delta} and η=n2/3δ\eta=n^{-2/3-\delta}.

Using the same arguments, we can prove a similar bound for the 𝐑ij[k]\mathbf{R}^{[k]}_{ij} and 𝐑ij\mathbf{R}_{ij} blocks. Hence, we have shown the desired results. ∎

D.3 Stability of the top eigenvalue

As a consequence of the stability of the resolvent, we also have the stability of the top eigenvalue. This stability of the eigenvalue will play a crucial role in the resolvent approximation of eigenvector statistics in the next subsection.

Lemma D.4.

Assume kn5/3ϵ0k\leq n^{5/3-\epsilon_{0}} for some ϵ0>0\epsilon_{0}>0. Let 0<δ<δ00<\delta<\delta_{0} with δ0\delta_{0} as in Lemma D.3. For any ε>0\varepsilon>0, with overwhelming probability, we have

|λλ[k]|n23δ+ε.\left|{\lambda-\lambda^{[k]}}\right|\leq n^{-\frac{2}{3}-\delta+\varepsilon}.
Proof.

Without loss of generality, we assume that λ>λ[k]\lambda>\lambda^{[k]}. Set η=n2/3δ\eta=n^{-2/3-\delta}. By the spectral representation of the resolvent (39), we have

Im𝐑αα(z)=η=1p|𝐯(α)|2(λE)2+η2η|𝐯(α)|2(λE)2+η2η|𝐯(α)|22(max(|λE|,η))2.\text{\rm Im}{\,}\mathbf{R}_{\alpha\alpha}(z)=\eta\sum_{\ell=1}^{p}\frac{|\mathbf{v}_{\ell}(\alpha)|^{2}}{(\lambda_{\ell}-E)^{2}+\eta^{2}}\geq\frac{\eta|\mathbf{v}(\alpha)|^{2}}{(\lambda-E)^{2}+\eta^{2}}\geq\frac{\eta|\mathbf{v}(\alpha)|^{2}}{2\left(\max(|\lambda-E|,\eta)\right)^{2}}.

By the pigeonhole principle, we know that there exists α\alpha such that |𝐯(α)|p1/2|\mathbf{v}(\alpha)|\geq p^{-1/2}. Choosing this α\alpha and z=λ+iηz=\lambda+\mathrm{i}\eta, we obtain

pη1Im𝐑αα(λ+iη)12η2.p\eta^{-1}\text{\rm Im}{\,}\mathbf{R}_{\alpha\alpha}(\lambda+\mathrm{i}\eta)\geq\frac{1}{2\eta^{2}}. (59)

On the other hand, using the spectral representation of resolvent again for 𝐑[k]\mathbf{R}^{[k]}, we have

pη1Im𝐑ββ[k](z)=m=1pp|𝐯m[k](β)|2(λm[k]λ)2+η2.p\eta^{-1}\text{\rm Im}{\,}\mathbf{R}^{[k]}_{\beta\beta}(z)=\sum_{m=1}^{p}\frac{p|\mathbf{v}^{[k]}_{m}(\beta)|^{2}}{\big{(}\lambda^{[k]}_{m}-\lambda\big{)}^{2}+\eta^{2}}.

Picking ω>0\omega>0, we decompose the summation into two parts

J1=m=1nωp|𝐯m[k](β)|2(λm[k]λ)2+η2,J2=m=nω+1pp|𝐯m[k](β)|2(λm[k]λ)2+η2.J_{1}=\sum_{m=1}^{n^{\omega}}\frac{p|\mathbf{v}^{[k]}_{m}(\beta)|^{2}}{\big{(}\lambda^{[k]}_{m}-\lambda\big{)}^{2}+\eta^{2}},\ \ J_{2}=\sum_{m=n^{\omega}+1}^{p}\frac{p|\mathbf{v}^{[k]}_{m}(\beta)|^{2}}{\big{(}\lambda^{[k]}_{m}-\lambda\big{)}^{2}+\eta^{2}}.

Using delocalization of eigenvectors, for any ε>0\varepsilon>0, with overwhelming probability, we have

J1nω+ε(min1mp|λm[k]λ|)2.J_{1}\lesssim\frac{n^{\omega+\varepsilon}}{(\min_{1\leq m\leq p}|\lambda^{[k]}_{m}-\lambda|)^{2}}. (60)

By the Tracy-Widom limit of the top eigenvalue (Lemma B.2), for any ε>0\varepsilon>0, with overwhelming probability, we have |λλ+|n2/3+ε|\lambda-\lambda_{+}|\leq n^{-2/3+\varepsilon}. Also, as discussed in (17), the rigidity of eigenvalues yields that for all mnωm\geq n^{\omega}, with overwhelming probability,

λλm[k]m2/3p2/3.\lambda-\lambda^{[k]}_{m}\gtrsim m^{2/3}p^{-2/3}.

Then using delocalization again, with overwhelming probability, we have

J2m=nω+1pnε(λm[k]λ)2nε(nω)1/3n4/3.J_{2}\leq\sum_{m=n^{\omega}+1}^{p}\frac{n^{\varepsilon}}{(\lambda^{[k]}_{m}-\lambda)^{2}}\lesssim n^{\varepsilon}(n^{\omega})^{-1/3}n^{4/3}. (61)

Again, since |λ[k]λ|2n2/3+ε|\lambda^{[k]}-\lambda|\leq 2n^{-2/3+\varepsilon}, choosing ω=2ε\omega=2\varepsilon ensures that the bound (61) on J2J_{2} is dominated by the bound (60) on J1J_{1}. Therefore, by (60) and (61), we have shown that with overwhelming probability

pη1Im𝐑ββ[k](λ+iη)n3ε(min1mp|λm[k]λ|)2.p\eta^{-1}\text{\rm Im}{\,}\mathbf{R}^{[k]}_{\beta\beta}(\lambda+\mathrm{i}\eta)\lesssim n^{3\varepsilon}\left(\min_{1\leq m\leq p}|\lambda^{[k]}_{m}-\lambda|\right)^{-2}.

Note that, since λ>λ[k]λm[k]\lambda>\lambda^{[k]}\geq\lambda^{[k]}_{m} for all mm, the minimum is attained by λ[k]\lambda^{[k]}. This shows that

nη1Im𝐑αα[k](λ+iη)n3ε|λ[k]λ|2.n\eta^{-1}\text{\rm Im}{\,}\mathbf{R}^{[k]}_{\alpha\alpha}(\lambda+\mathrm{i}\eta)\lesssim n^{3\varepsilon}|\lambda^{[k]}-\lambda|^{-2}.

Using Lemma D.3 and (59), we have

nη1Im𝐑αα[k](λ+iη)nη1(Im𝐑αα(λ+iη)|Im𝐑αα[k](λ+iη)Im𝐑αα(λ+iη)|)12η21ncη21η2.n\eta^{-1}\text{\rm Im}{\,}\mathbf{R}^{[k]}_{\alpha\alpha}(\lambda+\mathrm{i}\eta)\geq n\eta^{-1}\left(\text{\rm Im}{\,}\mathbf{R}_{\alpha\alpha}(\lambda+\mathrm{i}\eta)-\left|{\text{\rm Im}{\,}\mathbf{R}^{[k]}_{\alpha\alpha}(\lambda+\mathrm{i}\eta)-\text{\rm Im}{\,}\mathbf{R}_{\alpha\alpha}(\lambda+\mathrm{i}\eta)}\right|\right)\geq\frac{1}{2\eta^{2}}-\frac{1}{n^{c}\eta^{2}}\gtrsim\frac{1}{\eta^{2}}.

Therefore, we have shown that, with overwhelming probability,

1η2n3ε1|λλ[k]|2.\frac{1}{\eta^{2}}\lesssim n^{3\varepsilon}\frac{1}{|\lambda-\lambda^{[k]}|^{2}}.

Recall η=n2/3δ\eta=n^{-2/3-\delta}, and we conclude that

|λλ[k]|n2/3δ+3ε,\left|{\lambda-\lambda^{[k]}}\right|\leq n^{-2/3-\delta+3\varepsilon},

which proves the desired result thanks to the arbitrariness of ε>0\varepsilon>0. ∎
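As a rough numerical illustration of this stability (orders of magnitude only; the Gaussian entries, the sizes and the value of k below are assumptions of the sketch), resampling k entries with k far below n^{5/3} typically moves the top eigenvalue by a small fraction of the edge scale n^{-2/3}.

import numpy as np

rng = np.random.default_rng(7)
n, p = 600, 400
k = 20   # far below n^{5/3}
X = rng.normal(0.0, 1.0 / np.sqrt(n), (n, p))

X_k = X.copy()
flat = rng.choice(n * p, size=k, replace=False)
rows, cols = np.unravel_index(flat, (n, p))
X_k[rows, cols] = rng.normal(0.0, 1.0 / np.sqrt(n), size=k)

lam = np.linalg.eigvalsh(X.T @ X)[-1]
lam_k = np.linalg.eigvalsh(X_k.T @ X_k)[-1]
print(abs(lam - lam_k), n ** (-2.0 / 3.0))   # the shift is typically a small fraction of n^{-2/3}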

D.4 Proof of Theorem 2

The final ingredient to prove the resampling stability is the following approximation lemma, which asserts that the product of entries in the eigenvector can be well approximated by the resolvent entries.

Lemma D.5.

Assume that kn5/3ϵ0k\ll n^{5/3-\epsilon_{0}} for some ϵ0>0\epsilon_{0}>0. Let 0<δ<δ00<\delta<\delta_{0} be as in Lemma D.3. Then, for z0=λ+iηz_{0}=\lambda+\mathrm{i}\eta with η=n2/3δ\eta=n^{-2/3-\delta}, there exists c>0c^{\prime}>0 such that with probability 1o(1)1-o(1) we have

maxα,β|ηIm𝐑αβ(z0)𝐯(α)𝐯(β)|n1c,andmaxα,β|ηIm𝐑αβ[k](z0)𝐯[k](α)𝐯[k](β)|n1c.\max_{\alpha,\beta}\left|{\eta\text{\rm Im}{\,}\mathbf{R}_{\alpha\beta}(z_{0})-\mathbf{v}(\alpha)\mathbf{v}(\beta)}\right|\leq n^{-1-c^{\prime}},\ \ \mbox{and}\ \ \max_{\alpha,\beta}\left|{\eta\text{\rm Im}{\,}\mathbf{R}^{[k]}_{\alpha\beta}(z_{0})-\mathbf{v}^{[k]}(\alpha)\mathbf{v}^{[k]}(\beta)}\right|\leq n^{-1-c^{\prime}}.

Similarly, we also have

maxi,j|ηIm𝐑ij(z0)z0𝐮(i)𝐮(j)|n1c,andmaxi,j|ηIm𝐑ij[k](z0)z0𝐮[k](i)𝐮[k](j)|n1c.\max_{i,j}\left|{\eta\text{\rm Im}{\,}\frac{\mathbf{R}_{ij}(z_{0})}{z_{0}}-\mathbf{u}(i)\mathbf{u}(j)}\right|\leq n^{-1-c^{\prime}},\ \ \mbox{and}\ \ \max_{i,j}\left|{\eta\text{\rm Im}{\,}\frac{\mathbf{R}^{[k]}_{ij}(z_{0})}{z_{0}}-\mathbf{u}^{[k]}(i)\mathbf{u}^{[k]}(j)}\right|\leq n^{-1-c^{\prime}}.
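Before turning to the proof, the first estimate can be checked numerically: with z_0 = λ + iη and η = n^{-2/3-δ}, the Greek-Greek block of η Im R(z_0) is entrywise close to the rank-one matrix v v^⊤. The sketch below is illustrative only; the Gaussian entries, the moderate size n and the fixed value of δ are its assumptions, and at such sizes only the order of magnitude is meaningful.

import numpy as np

rng = np.random.default_rng(4)
n, p = 600, 400
delta = 0.25
X = rng.normal(0.0, 1.0 / np.sqrt(n), (n, p))

evals, evecs = np.linalg.eigh(X.T @ X)
lam, v = evals[-1], evecs[:, -1]   # top eigenvalue and eigenvector of X^T X

eta = n ** (-2.0 / 3.0 - delta)
z0 = lam + 1j * eta
M = np.block([[-np.eye(n), X], [X.T, -z0 * np.eye(p)]])
R_greek = np.linalg.inv(M)[n:, n:]

err = np.abs(eta * R_greek.imag - np.outer(v, v)).max()
print(err, 1.0 / p)   # err is typically a small fraction of the entry scale 1/p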
Proof.

For any ε>0\varepsilon>0, we consider a general z=E+iηz=E+\mathrm{i}\eta with |Eλ+|n2/3+ε|E-\lambda_{+}|\leq n^{-2/3+\varepsilon}. From the spectral representation of the resolvent (39), we have

Im𝐑αβ(z)=η=1p𝐯(α)𝐯(β)(λE)2+η2.\text{\rm Im}{\,}\mathbf{R}_{\alpha\beta}(z)=\eta\sum_{\ell=1}^{p}\frac{\mathbf{v}_{\ell}(\alpha)\mathbf{v}_{\ell}(\beta)}{(\lambda_{\ell}-E)^{2}+\eta^{2}}.

Fixing some ω>0\omega>0, we decompose the sum on the right-hand side into three parts

=1p𝐯(α)𝐯(β)(λE)2+η2=𝐯(α)𝐯(β)(λE)2+η2+J1+J2,\sum_{\ell=1}^{p}\frac{\mathbf{v}_{\ell}(\alpha)\mathbf{v}_{\ell}(\beta)}{(\lambda_{\ell}-E)^{2}+\eta^{2}}=\frac{\mathbf{v}(\alpha)\mathbf{v}(\beta)}{(\lambda-E)^{2}+\eta^{2}}+J_{1}+J_{2},

where

J1==2nω𝐯(α)𝐯(β)(λE)2+η2,J2==nω+1p𝐯(α)𝐯(β)(λE)2+η2.J_{1}=\sum_{\ell=2}^{n^{\omega}}\frac{\mathbf{v}_{\ell}(\alpha)\mathbf{v}_{\ell}(\beta)}{(\lambda_{\ell}-E)^{2}+\eta^{2}},\ \ \ J_{2}=\sum_{\ell=n^{\omega}+1}^{p}\frac{\mathbf{v}_{\ell}(\alpha)\mathbf{v}_{\ell}(\beta)}{(\lambda_{\ell}-E)^{2}+\eta^{2}}.

Using the same arguments as in (61), for any ε>0\varepsilon>0, with overwhelming probability we have

|J2|nε(nω)1/3n1/3.|J_{2}|\lesssim n^{\varepsilon}(n^{\omega})^{-1/3}n^{1/3}.

For the term J1J_{1}, we consider the following event

:={λ1λ2c0n2/3}{max1p𝐯n1/2+ε}{|J2|nε(nω)1/3n1/3}.{\mathcal{E}}:=\left\{\lambda_{1}-\lambda_{2}\geq c_{0}n^{-2/3}\right\}\cap\left\{\max_{1\leq\ell\leq p}\|\mathbf{v}_{\ell}\|_{\infty}\leq n^{-1/2+\varepsilon}\right\}\cap\left\{|J_{2}|\lesssim n^{\varepsilon}(n^{\omega})^{-1/3}n^{1/3}\right\}.

For any ε>0\varepsilon>0, we can find an appropriate c0>0c_{0}>0 such that ()>1ε/2\mathbb{P}({\mathcal{E}})>1-\varepsilon/2. Then, for z=E+iηz=E+\mathrm{i}\eta with |λE|c02n2/3|\lambda-E|\leq\frac{c_{0}}{2}n^{-2/3}, on the event {\mathcal{E}}, we have

|J1|nεnωn1/3.|J_{1}|\lesssim n^{\varepsilon}n^{\omega}n^{1/3}.

Let δ>0\delta^{\prime}>0 with δ+δ<δ0\delta^{\prime}+\delta<\delta_{0}. On the event {\mathcal{E}}, for all z=E+iηz=E+\mathrm{i}\eta with |λE|ηnδ|\lambda-E|\leq\eta n^{-\delta^{\prime}} and η=n2/3δ\eta=n^{-2/3-\delta}, we have

|𝐯(α)𝐯(β)η2𝐯(α)𝐯(β)(λE)2+η2|n1+2ε|1η2(λE)2+η2|n1+2ε2δ.\left|{\mathbf{v}(\alpha)\mathbf{v}(\beta)-\frac{\eta^{2}\mathbf{v}(\alpha)\mathbf{v}(\beta)}{(\lambda-E)^{2}+\eta^{2}}}\right|\leq n^{-1+2\varepsilon}\left|{1-\frac{\eta^{2}}{(\lambda-E)^{2}+\eta^{2}}}\right|\leq n^{-1+2\varepsilon-2\delta^{\prime}}.

This yields

|ηIm𝐑αβ(z)𝐯(α)𝐯(β)|\displaystyle\left|{\eta\text{\rm Im}{\,}\mathbf{R}_{\alpha\beta}(z)-\mathbf{v}(\alpha)\mathbf{v}(\beta)}\right| |𝐯(α)𝐯(β)η2𝐯(α)𝐯(β)(λE)2+η2|+η2(|J1|+|J2|)\displaystyle\leq\left|{\mathbf{v}(\alpha)\mathbf{v}(\beta)-\frac{\eta^{2}\mathbf{v}(\alpha)\mathbf{v}(\beta)}{(\lambda-E)^{2}+\eta^{2}}}\right|+\eta^{2}(|J_{1}|+|J_{2}|)
n1+2ε2δ+n1+ε+ω2δ+n1+εω32δ.\displaystyle\leq n^{-1+2\varepsilon-2\delta^{\prime}}+n^{-1+\varepsilon+\omega-2\delta}+n^{-1+\varepsilon-\frac{\omega}{3}-2\delta}.

Choosing ω=ε<min(δ,δ)/2\omega=\varepsilon<\min(\delta,\delta^{\prime})/2, we obtain

maxα,β|ηIm𝐑αβ(z)𝐯(α)𝐯(β)|n1min(δ,δ).\max_{\alpha,\beta}\left|{\eta\text{\rm Im}{\,}\mathbf{R}_{\alpha\beta}(z)-\mathbf{v}(\alpha)\mathbf{v}(\beta)}\right|\leq n^{-1-\min(\delta,\delta^{\prime})}. (62)

Similarly, we can apply the same arguments to 𝐑[k]\mathbf{R}^{[k]}. Consider the event

:={maxα,β|ηIm𝐑αβ[k](z)𝐯[k](α)𝐯[k](β)|n1min(δ,δ)for all|λ[k]E|ηnδ,η=n2/3δ}.{\mathcal{E}}^{\prime}:=\left\{\max_{\alpha,\beta}\left|{\eta\text{\rm Im}{\,}\mathbf{R}^{[k]}_{\alpha\beta}(z)-\mathbf{v}^{[k]}(\alpha)\mathbf{v}^{[k]}(\beta)}\right|\leq n^{-1-\min(\delta,\delta^{\prime})}\ \mbox{for all}\ |\lambda^{[k]}-E|\leq\eta n^{-\delta^{\prime}},\eta=n^{-2/3-\delta}\right\}.

By the same arguments, we know ()>1ε/2\mathbb{P}({\mathcal{E}}^{\prime})>1-\varepsilon/2, and hence ()>1ε\mathbb{P}({\mathcal{E}}\cap{\mathcal{E}}^{\prime})>1-\varepsilon. Finally, since δ+δ<δ0\delta+\delta^{\prime}<\delta_{0}, Lemma D.4 yields that with overwhelming probability |λλ[k]|n2/3δδ=ηnδ|\lambda-\lambda^{[k]}|\leq n^{-2/3-\delta-\delta^{\prime}}=\eta n^{-\delta^{\prime}}. This implies that we may take z=λ+iηz=\lambda+\mathrm{i}\eta in both (62) and {\mathcal{E}}^{\prime}. Thus, the desired result for 𝐯\mathbf{v} and 𝐯[k]\mathbf{v}^{[k]} follows by choosing 0<c<min(δ,δ)0<c^{\prime}<\min(\delta,\delta^{\prime}).

On the other hand, from (39) we also have

Im𝐑ij(z)z=η=1n𝐮(i)𝐮(j)(λE)2+η2.\text{\rm Im}{\,}\frac{\mathbf{R}_{ij}(z)}{z}=\eta\sum_{\ell=1}^{n}\frac{\mathbf{u}_{\ell}(i)\mathbf{u}_{\ell}(j)}{(\lambda_{\ell}-E)^{2}+\eta^{2}}.

Using the same methods as above yields the desired result for 𝐮\mathbf{u} and 𝐮[k]\mathbf{u}^{[k]}. ∎

We now prove the main result, Theorem 2, on the stability of PCA under moderate resampling.

Proof of Theorem 2.

Let z0=λ+iηz_{0}=\lambda+\mathrm{i}\eta be as in Lemma D.5. By Lemmas D.3 and D.5, we know that, with probability 1o(1)1-o(1), for all α,β2\alpha,\beta\in{\mathcal{I}}_{2}, we have

|𝐯(α)𝐯(β)\displaystyle\Big{|}\mathbf{v}(\alpha)\mathbf{v}(\beta) 𝐯[k](α)𝐯[k](β)|\displaystyle-\mathbf{v}^{[k]}(\alpha)\mathbf{v}^{[k]}(\beta)\Big{|}
|𝐯(α)𝐯(β)ηIm𝐑αβ(z0)|+|ηIm𝐑αβ(z0)ηIm𝐑αβ[k](z0)|+|ηIm𝐑αβ[k](z0)𝐯[k](α)𝐯[k](β)|\displaystyle\leq\left|{\mathbf{v}(\alpha)\mathbf{v}(\beta)-\eta\text{\rm Im}{\,}\mathbf{R}_{\alpha\beta}(z_{0})}\right|+\left|{\eta\text{\rm Im}{\,}\mathbf{R}_{\alpha\beta}(z_{0})-\eta\text{\rm Im}{\,}\mathbf{R}^{[k]}_{\alpha\beta}(z_{0})}\right|+\left|{\eta\text{\rm Im}{\,}\mathbf{R}^{[k]}_{\alpha\beta}(z_{0})-\mathbf{v}^{[k]}(\alpha)\mathbf{v}^{[k]}(\beta)}\right|
n1c+n1c+n1c.\displaystyle\leq n^{-1-c}+n^{-1-c^{\prime}}+n^{-1-c}.

Setting c′′:=min(c,c)c^{\prime\prime}:=\min(c,c^{\prime}), we have

maxα,β|𝐯(α)𝐯(β)𝐯[k](α)𝐯[k](β)|n1c′′.\max_{\alpha,\beta}\left|{\mathbf{v}(\alpha)\mathbf{v}(\beta)-\mathbf{v}^{[k]}(\alpha)\mathbf{v}^{[k]}(\beta)}\right|\lesssim n^{-1-c^{\prime\prime}}.

For any fixed ε>0\varepsilon>0, we consider the event

:={maxα,β|𝐯(α)𝐯(β)𝐯[k](α)𝐯[k](β)|n1c′′}{𝐯[k]n1/2+ε}.{\mathcal{E}}:=\left\{\max_{\alpha,\beta}\left|{\mathbf{v}(\alpha)\mathbf{v}(\beta)-\mathbf{v}^{[k]}(\alpha)\mathbf{v}^{[k]}(\beta)}\right|\lesssim n^{-1-c^{\prime\prime}}\right\}\cap\left\{\|\mathbf{v}^{[k]}\|_{\infty}\leq n^{-1/2+\varepsilon}\right\}.

Since delocalization of eigenvectors holds with overwhelming probability, we know that ()=1o(1)\mathbb{P}({\mathcal{E}})=1-o(1).

By the pigeonhole principle, there exists α\alpha such that |𝐯(α)|p1/2|\mathbf{v}(\alpha)|\geq p^{-1/2}. We fix the ±\pm phases of 𝐯\mathbf{v} and 𝐯[k]\mathbf{v}^{[k]} so that 𝐯(α)\mathbf{v}(\alpha) and 𝐯[k](α)\mathbf{v}^{[k]}(\alpha) are non-negative. On the event {\mathcal{E}}, we obtain

|𝐯(α)𝐯[k](α)|=|(𝐯(α))2(𝐯[k](α))2|𝐯(α)+𝐯[k](α)n1/2c′′.\left|{\mathbf{v}(\alpha)-\mathbf{v}^{[k]}(\alpha)}\right|=\frac{\left|{(\mathbf{v}(\alpha))^{2}-(\mathbf{v}^{[k]}(\alpha))^{2}}\right|}{\mathbf{v}(\alpha)+\mathbf{v}^{[k]}(\alpha)}\lesssim n^{-1/2-c^{\prime\prime}}.

Moreover, for any entries 𝐯(β)\mathbf{v}(\beta) and 𝐯[k](β)\mathbf{v}^{[k]}(\beta), on the event {\mathcal{E}}, the triangle inequality gives

|𝐯(β)𝐯[k](β)|\displaystyle\left|{\mathbf{v}(\beta)-\mathbf{v}^{[k]}(\beta)}\right| =|𝐯(α)𝐯(β)𝐯(α)𝐯[k](β)|𝐯(α)\displaystyle=\frac{\left|{\mathbf{v}(\alpha)\mathbf{v}(\beta)-\mathbf{v}(\alpha)\mathbf{v}^{[k]}(\beta)}\right|}{\mathbf{v}(\alpha)}
|𝐯(α)𝐯(β)𝐯[k](α)𝐯[k](β)|𝐯(α)+|𝐯[k](β)|𝐯(α)|𝐯(α)𝐯[k](α)|\displaystyle\leq\frac{\left|{\mathbf{v}(\alpha)\mathbf{v}(\beta)-\mathbf{v}^{[k]}(\alpha)\mathbf{v}^{[k]}(\beta)}\right|}{\mathbf{v}(\alpha)}+\frac{|\mathbf{v}^{[k]}(\beta)|}{\mathbf{v}(\alpha)}|\mathbf{v}(\alpha)-\mathbf{v}^{[k]}(\alpha)|
n1/2c′′+n1/2c′′+ε.\displaystyle\lesssim n^{-1/2-c^{\prime\prime}}+n^{-1/2-c^{\prime\prime}+\varepsilon}.

Choosing ε\varepsilon sufficiently small, this implies the desired result (4).
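In particular, although not needed for the remainder of the proof, we record for the reader's convenience the standard consequence that this entrywise bound makes the two sign-aligned principal components nearly colinear: since both vectors have unit norm and p\lesssim n,

\|\mathbf{v}-\mathbf{v}^{[k]}\|_{2}^{2}\leq p\max_{\beta}\left|{\mathbf{v}(\beta)-\mathbf{v}^{[k]}(\beta)}\right|^{2}\lesssim n^{-2c^{\prime\prime}+2\varepsilon},\qquad\mbox{and hence}\qquad\langle\mathbf{v},\mathbf{v}^{[k]}\rangle=1-\tfrac{1}{2}\|\mathbf{v}-\mathbf{v}^{[k]}\|_{2}^{2}\geq 1-O(n^{-2c^{\prime\prime}+2\varepsilon}).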

For 𝐮\mathbf{u} and 𝐮[k]\mathbf{u}^{[k]}, note that

|𝐮(i)𝐮(j)𝐮[k](i)𝐮[k](j)||𝐮(i)𝐮(j)ηIm𝐑ij(z0)z0|+|ηIm𝐑ij(z0)z0ηIm𝐑ij[k](z0)z0|+|ηIm𝐑ij[k](z0)z0𝐮[k](i)𝐮[k](j)|\left|{\mathbf{u}(i)\mathbf{u}(j)-\mathbf{u}^{[k]}(i)\mathbf{u}^{[k]}(j)}\right|\\ \leq\left|{\mathbf{u}(i)\mathbf{u}(j)-\eta\text{\rm Im}{\,}\frac{\mathbf{R}_{ij}(z_{0})}{z_{0}}}\right|+\left|{\eta\text{\rm Im}{\,}\frac{\mathbf{R}_{ij}(z_{0})}{z_{0}}-\eta\text{\rm Im}{\,}\frac{\mathbf{R}^{[k]}_{ij}(z_{0})}{z_{0}}}\right|+\left|{\eta\text{\rm Im}{\,}\frac{\mathbf{R}^{[k]}_{ij}(z_{0})}{z_{0}}-\mathbf{u}^{[k]}(i)\mathbf{u}^{[k]}(j)}\right|

By Lemma D.3, we have

|Im𝐑ij(z0)z0Im𝐑ij[k](z0)z0||𝐑ij(z0)𝐑ij[k](z0)z0||𝐑ij(z0)𝐑ij[k](z0)|1n1+cη.\left|{\text{\rm Im}{\,}\frac{\mathbf{R}_{ij}(z_{0})}{z_{0}}-\text{\rm Im}{\,}\frac{\mathbf{R}^{[k]}_{ij}(z_{0})}{z_{0}}}\right|\leq\left|{\frac{\mathbf{R}_{ij}(z_{0})-\mathbf{R}^{[k]}_{ij}(z_{0})}{z_{0}}}\right|\lesssim\left|{\mathbf{R}_{ij}(z_{0})-\mathbf{R}^{[k]}_{ij}(z_{0})}\right|\leq\frac{1}{n^{1+c}\eta}.

As a consequence, we have

|𝐮(i)𝐮(j)𝐮[k](i)𝐮[k](j)|n1c′′.\left|{\mathbf{u}(i)\mathbf{u}(j)-\mathbf{u}^{[k]}(i)\mathbf{u}^{[k]}(j)}\right|\lesssim n^{-1-c^{\prime\prime}}.

The desired result for 𝐮\mathbf{u} and 𝐮[k]\mathbf{u}^{[k]} then follows from the same arguments above for 𝐯\mathbf{v} and 𝐯[k]\mathbf{v}^{[k]}. ∎
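As a purely illustrative complement to Theorem 2, and not part of the proof, the following minimal Python sketch resamples kk randomly chosen entries of an n×pn\times p Gaussian data matrix, with kk far below n5/3n^{5/3}, and compares the principal components before and after resampling. The dimensions, the value of kk, and all variable and function names are illustrative choices rather than prescriptions of the paper.

import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 1000              # proportional regime: p/n is a fixed ratio in (0, 1]
k = int(n ** 1.2)              # number of resampled entries, far below n^{5/3}

X = rng.standard_normal((n, p))      # data matrix with i.i.d. standard Gaussian entries

# Resample k randomly chosen entries; repeated positions only weaken the perturbation.
X_k = X.copy()
rows = rng.integers(0, n, size=k)
cols = rng.integers(0, p, size=k)
X_k[rows, cols] = rng.standard_normal(k)

def principal_component(A):
    # Leading right singular vector of A, i.e. the top eigenvector of A^T A.
    _, _, vt = np.linalg.svd(A, full_matrices=False)
    return vt[0]

v = principal_component(X)
v_k = principal_component(X_k)

# In the stability regime k << n^{5/3}, Theorem 2 predicts that the overlap
# (up to the sign ambiguity of eigenvectors) is close to 1 for large n.
print(f"k = {k}, overlap |<v, v^[k]>| = {abs(v @ v_k):.4f}")

Decreasing kk should push the printed overlap toward 1, while taking kk of order n5/3n^{5/3} or larger should make it drop, in line with the sensitivity/stability transition.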

Acknowledgment

The author thanks Yihong Wu and Paul Bourgade for helpful discussions at an early stage of this project, and Yiyun He for helpful discussions on differential privacy.

References

  • [AEK14] Oskari Ajanki, László Erdős, and Torben Krüger. Local semicircle law with imprimitive variance matrix. Electronic Communications in Probability, 19:1–9, 2014.
  • [AGZ10] Greg W Anderson, Alice Guionnet, and Ofer Zeitouni. An introduction to random matrices. Number 118. Cambridge University Press, 2010.
  • [ASX17] Yacine Ait-Sahalia and Dacheng Xiu. Using principal component analysis to estimate a high dimensional factor model with high-frequency data. Journal of Econometrics, 201(2):384–399, 2017.
  • [BBAP05] Jinho Baik, Gérard Ben Arous, and Sandrine Péché. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Annals of Probability, pages 1643–1697, 2005.
  • [BCRT21] Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. Predictive inference with the jackknife+. The Annals of Statistics, 49(1):486–507, 2021.
  • [BDMN05] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: the SuLQ framework. In Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 128–138, 2005.
  • [BE02] Olivier Bousquet and André Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499–526, 2002.
  • [BEK+14] Alex Bloemendal, László Erdős, Antti Knowles, Horng-Tzer Yau, and Jun Yin. Isotropic local laws for sample covariance and generalized Wigner matrices. Electron. J. Probab., 19:no. 33, 53, 2014.
  • [BGN11] Florent Benaych-Georges and Raj Rao Nadakuditi. The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Advances in Mathematics, 227(1):494–521, 2011.
  • [BKS99] Itai Benjamini, Gil Kalai, and Oded Schramm. Noise sensitivity of Boolean functions and applications to percolation. Inst. Hautes Études Sci. Publ. Math., (90):5–43 (2001), 1999.
  • [BKYY16] Alex Bloemendal, Antti Knowles, Horng-Tzer Yau, and Jun Yin. On the principal components of sample covariance matrices. Probab. Theory Related Fields, 164(1-2):459–552, 2016.
  • [BL22] Charles Bordenave and Jaehun Lee. Noise sensitivity for the top eigenvector of a sparse random matrix. Electronic Journal of Probability, 27:1–50, 2022.
  • [BLZ20] Charles Bordenave, Gábor Lugosi, and Nikita Zhivotovskiy. Noise sensitivity of the top eigenvector of a Wigner matrix. Probability Theory and Related Fields, 177(3):1103–1135, 2020.
  • [BPZ15] Zhigang Bao, Guangming Pan, and Wang Zhou. Universality for the largest eigenvalue of sample covariance matrices with general population. Ann. Statist., 43(1):382–421, 2015.
  • [BS06] Jinho Baik and Jack W Silverstein. Eigenvalues of large sample covariance matrices of spiked population models. Journal of multivariate analysis, 97(6):1382–1408, 2006.
  • [CBG16] Romain Couillet and Florent Benaych-Georges. Kernel spectral clustering of large dimensional data. Electronic Journal of Statistics, 10(1):1393–1454, 2016.
  • [Cha05] Sourav Chatterjee. Concentration inequalities with exchangeable pairs. page 105, 2005. Thesis (Ph.D.)–Stanford University.
  • [Cha07] Sourav Chatterjee. Stein’s method for concentration inequalities. Probab. Theory Related Fields, 138(1-2):305–321, 2007.
  • [Cha14] Sourav Chatterjee. Superconcentration and related topics. Springer Monographs in Mathematics. Springer, Cham, 2014.
  • [CLMW11] Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):1–37, 2011.
  • [CMW13] T Tony Cai, Zongming Ma, and Yihong Wu. Sparse PCA: Optimal rates and adaptive estimation. The Annals of Statistics, 41(6):3074, 2013.
  • [CS13] Xiuyuan Cheng and Amit Singer. The spectrum of random inner-product kernel matrices. Random Matrices: Theory and Applications, 2(04):1350010, 2013.
  • [CSS13] Kamalika Chaudhuri, Anand D Sarwate, and Kaushik Sinha. A near-optimal algorithm for differentially-private principal components. Journal of Machine Learning Research, 14, 2013.
  • [DCK19] Osman E Dai, Daniel Cullina, and Negar Kiyavash. Database alignment with Gaussian features. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3225–3233. PMLR, 2019.
  • [DHS21] Zhun Deng, Hangfeng He, and Weijie Su. Toward better generalization bounds with locally elastic stability. In International Conference on Machine Learning, pages 2590–2600. PMLR, 2021.
  • [DR14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
  • [DT11] Momar Dieng and Craig A Tracy. Application of random matrix theory to multivariate statistics. In Random Matrices, Random Processes and Integrable Systems, pages 443–507. Springer, 2011.
  • [DY18] Xiucai Ding and Fan Yang. A necessary and sufficient condition for edge universality at the largest singular values of covariance matrices. The Annals of Applied Probability, 28(3):1679–1738, 2018.
  • [EEPK05] Andre Elisseeff, Theodoros Evgeniou, Massimiliano Pontil, and Leslie Pack Kaelbling. Stability of randomized learning algorithms. Journal of Machine Learning Research, 6(1), 2005.
  • [EK10] Noureddine El Karoui. The spectrum of kernel random matrices. Ann. Statist., 38(1):1–50, 2010.
  • [FMWX20] Zhou Fan, Cheng Mao, Yihong Wu, and Jiaming Xu. Spectral graph matching and regularized quadratic relaxations: Algorithm and theory. In International conference on machine learning, pages 2985–2995. PMLR, 2020.
  • [FMWX22] Zhou Fan, Cheng Mao, Yihong Wu, and Jiaming Xu. Spectral graph matching and regularized quadratic relaxations II. Foundations of Computational Mathematics, pages 1–51, 2022.
  • [FWZ18] Jianqing Fan, Weichen Wang, and Yiqiao Zhong. An \ell_{\infty} eigenvector perturbation bound and its application to robust covariance estimation. Journal of Machine Learning Research, 18(207):1–42, 2018.
  • [GLM22] Luca Ganassali, Marc Lelarge, and Laurent Massoulié. Spectral alignment of correlated Gaussian matrices. Advances in Applied Probability, 54(1):279–310, 2022.
  • [GS14] Christophe Garban and Jeffrey E Steif. Noise sensitivity of Boolean functions and percolation, volume 5. Cambridge University Press, 2014.
  • [HLS19] Jong Yun Hwang, Ji Oon Lee, and Kevin Schnelli. Local law and Tracy–Widom limit for sparse sample covariance matrices. The Annals of Applied Probability, 29(5):3006–3036, 2019.
  • [HPY18] Xiao Han, Guangming Pan, and Qing Yang. A unified matrix model including both CCA and F matrices in multivariate analysis: The largest eigenvalue and its applications. Bernoulli, 24(4B):3447–3468, 2018.
  • [HPZ16] Xiao Han, Guangming Pan, and Bo Zhang. The Tracy–Widom law for the largest eigenvalue of F type matrices. The Annals of Statistics, 44(4):1564–1592, 2016.
  • [HRS16] Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International conference on machine learning, pages 1225–1234. PMLR, 2016.
  • [JN17] Iain M Johnstone and Boaz Nadler. Roy’s largest root test under rank-one alternatives. Biometrika, 104(1):181–193, 2017.
  • [Joh07] Iain M. Johnstone. High dimensional statistical inference and random matrices. In International Congress of Mathematicians. Vol. I, pages 307–333. Eur. Math. Soc., Zürich, 2007.
  • [KB21] Byol Kim and Rina Foygel Barber. Black box tests for algorithmic stability. arXiv preprint arXiv:2111.15546, 2021.
  • [KN02] Samuel Kutin and Partha Niyogi. Almost-everywhere algorithmic stability and generalization error. In Proceedings of the Eighteenth conference on Uncertainty in artificial intelligence, pages 275–282, 2002.
  • [LR10] Michel Ledoux and Brian Rider. Small deviations for beta ensembles. Electronic Journal of Probability, 15:1319–1343, 2010.
  • [LS16] Ji Oon Lee and Kevin Schnelli. Tracy-Widom distribution for the largest eigenvalue of real sample covariance matrices with general population. Ann. Appl. Probab., 26(6):3786–3839, 2016.
  • [MNPR06] Sayan Mukherjee, Partha Niyogi, Tomaso Poggio, and Ryan Rifkin. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 25(1):161–193, 2006.
  • [MWA05] Baback Moghaddam, Yair Weiss, and Shai Avidan. Spectral bounds for sparse PCA: Exact and greedy algorithms. Advances in neural information processing systems, 18, 2005.
  • [Nad08] Boaz Nadler. Finite sample approximation results for principal component analysis: A matrix perturbation approach. The Annals of Statistics, pages 2791–2817, 2008.
  • [NJW01] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems, 14, 2001.
  • [Pau07] Debashis Paul. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, pages 1617–1642, 2007.
  • [PY14] Natesh S. Pillai and Jun Yin. Universality of covariance matrices. Ann. Appl. Probab., 24(3):935–1001, 2014.
  • [Rin08] Markus Ringnér. What is principal component analysis? Nature biotechnology, 26(3):303–304, 2008.
  • [SX21] Kevin Schnelli and Yuanyuan Xu. Convergence rate to the Tracy–Widom laws for the largest eigenvalue of sample covariance matrices. arXiv preprint arXiv:2108.02728, 2021.
  • [Tro12] Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389–434, 2012.
  • [TV12] Terence Tao and Van Vu. Random covariance matrices: universality of local statistics of eigenvalues. Ann. Probab., 40(3):1285–1315, 2012.
  • [VK06] Seema Vyas and Lilani Kumaranayake. Constructing socio-economic status indices: how to use principal components analysis. Health policy and planning, 21(6):459–468, 2006.
  • [VL07] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17:395–416, 2007.
  • [Wan12] Ke Wang. Random covariance matrices: Universality of local statistics of eigenvalues up to the edge. Random Matrices: Theory and Applications, 1(01):1150005, 2012.
  • [Wan19] Haoyu Wang. Quantitative universality for the largest eigenvalue of sample covariance matrices. arXiv preprint arXiv:1912.05473, 2019.
  • [Wan22] Haoyu Wang. Optimal smoothed analysis and quantitative universality for the smallest singular value of random matrices. arXiv preprint arXiv:2211.03975, 2022.
  • [WWXY22] Haoyu Wang, Yihong Wu, Jiaming Xu, and Israel Yolou. Random graph matching in geometric models: the case of complete graphs. In Conference on Learning Theory, pages 3441–3488. PMLR, 2022.
  • [WXS22] Yihong Wu, Jiaming Xu, and Sophie H. Yu. Settling the sharp reconstruction thresholds of random graph matching. IEEE Transactions on Information Theory, 2022.
  • [Yan22] Fan Yang. Sample canonical correlation coefficients of high-dimensional random vectors: Local law and Tracy–Widom limit. Random Matrices: Theory and Applications, 11(01):2250007, 2022.
  • [ZHT06] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of computational and graphical statistics, 15(2):265–286, 2006.