
Resampling Sensitivity of High-Dimensional PCA

Haoyu Wang Department of Mathematics, Yale University, New Haven, CT 06511, USA, [email protected]
Abstract

The study of stability and sensitivity of statistical methods or algorithms with respect to their data is an important problem in machine learning and statistics. The performance of the algorithm under resampling of the data is a fundamental way to measure its stability and is closely related to generalization or privacy of the algorithm. In this paper, we study the resampling sensitivity for the principal component analysis (PCA). Given an n×pn\times p random matrix 𝐗\mathbf{X}, let 𝐗[k]\mathbf{X}^{[k]} be the matrix obtained from 𝐗\mathbf{X} by resampling kk randomly chosen entries of 𝐗\mathbf{X}. Let 𝐯\mathbf{v} and 𝐯[k]\mathbf{v}^{[k]} denote the principal components of 𝐗\mathbf{X} and 𝐗[k]\mathbf{X}^{[k]}. In the proportional growth regime p/nξ(0,1]p/n\to\xi\in(0,1], we establish the sharp threshold for the sensitivity/stability transition of PCA. When kn5/3k\gg n^{5/3}, the principal components 𝐯\mathbf{v} and 𝐯[k]\mathbf{v}^{[k]} are asymptotically orthogonal. On the other hand, when kn5/3k\ll n^{5/3}, the principal components 𝐯\mathbf{v} and 𝐯[k]\mathbf{v}^{[k]} are asymptotically colinear. In words, we show that PCA is sensitive to the input data in the sense that resampling even a negligible portion of the input may completely change the output.

1 Introduction

The study of stability and sensitivity of statistical methods and algorithms with respect to the input data is an important task in machine learning and statistics [BE02, EEPK05, MNPR06, HRS16, DHS21]. The notion of stability for algorithms is also closely related to differential privacy [DR14] and generalization error [KN02]. To measure algorithmic stability, one fundamental question is to study the performance of the algorithm under resampling of its input data [BCRT21, KB21]. Originating from the analysis of Boolean functions, resampling sensitivity (also called noise sensitivity) is an important concept in theoretical computer science, which refers to the phenomenon that resampling a small portion of the random input data may lead to decorrelation of the output. Such a remarkable phenomenon was first studied in the pioneering work of Benjamini, Kalai and Schramm [BKS99], and we refer to the monograph [GS14] for a systematic discussion on this topic.

In this work, we study the resampling sensitivity of principal component analysis (PCA). As one of the most commonly used statistical methods, PCA is widely applied for dimension reduction, feature extraction, etc. [Joh07, DT11]. It is also used in other fields such as economics [VK06], finance [ASX17], and genetics [Rin08]. The performance of PCA under additive or multiplicative independent perturbations of the data matrix has been well studied (see e.g. [BBAP05, BS06, Pau07, BGN11, CLMW11, FWZ18]).

However, how resampling of the data matrix affects the outcome remains unclear. In this paper, we address this problem for the first time. Here, we emphasize that the resampling of the input data need not have any structure; the specific resampling procedure is given in the next subsection. In our main results, we show that PCA is resampling sensitive, in the sense that, above a certain threshold, resampling even a negligible portion of the data may completely change the resulting principal component (i.e. make it orthogonal to the original direction).

Compared with previous work that mainly focused on PCA with additive or multiplicative independent noise, our setting is very different. In our model, if the resampling effect is written as an additive or multiplicative perturbation, then this noise is neither independent of the signal nor endowed with any special structure. In contrast, previous work sometimes imposed low-rank assumptions on the structure of the matrix or the noise, or certain incoherence conditions. In our work, we make almost no assumption on the data other than a sub-exponential decay condition. Moreover, we highlight that our results are universal. In particular, we do not need to know the specific distribution of the data, and we do not require the data to be i.i.d. sampled.

1.1 Model and main results

Let 𝐗=(𝐗ij)\mathbf{X}=(\mathbf{X}_{ij}) be an n×pn\times p data matrix with independent real-valued entries of mean 0 and variance p1p^{-1},

𝐗ij=p1/2xij,𝔼[xij]=0,𝔼[xij2]=1.\mathbf{X}_{ij}=p^{-1/2}x_{ij},\ \ \ \mathbb{E}[x_{ij}]=0,\ \ \ \mathbb{E}[x_{ij}^{2}]=1. (1)

Note that we do not require the i.i.d. condition for the data. Furthermore, we assume the entries xijx_{ij} have a sub-exponential decay, that is, there exists a constant θ>0\theta>0 such that for u>1u>1,

(|xij|>u)θ1exp(uθ).\mathbb{P}(|x_{ij}|>u)\leq\theta^{-1}\exp(-u^{\theta}). (2)

This sub-exponential decay assumption is mainly for convenience, and other conditions such as the finiteness of a sufficiently high moment would be enough.

Motivated by high-dimensional statistics, we will work in the proportional growth regime npn\asymp p. Throughout this paper, to avoid trivial eigenvalues, we will be working in the regime

limnp/n=ξ(0,1)orp/n1.\lim_{n\to\infty}p/n=\xi\in(0,1)\ \ \mbox{or}\ \ p/n\equiv 1.

In the case limp/n=1\lim p/n=1, our assumption p/n1p/n\equiv 1 is due to technical reasons in random matrix theory. Specifically, the proof relies on the delocalization of eigenvectors in the whole spectrum. As one of the major open problems in random matrix theory, delocalization of eigenvectors near the lower spectral edge is not known in the general case with just limp/n=1\lim p/n=1. The strictly square assumption pnp\equiv n can be slightly relaxed to |np|=po(1)|n-p|=p^{o(1)} (see e.g. [Wan22]), but we do not pursue such an extension for simplicity.

The sample covariance matrix corresponding to data matrix 𝐗\mathbf{X} is defined by 𝐇:=𝐗𝐗\mathbf{H}:=\mathbf{X}^{\top}\mathbf{X}. We order the eigenvalues of 𝐇\mathbf{H} as λ1λp\lambda_{1}\geq\cdots\geq\lambda_{p}, and use 𝐯ip\mathbf{v}_{i}\in\mathbb{R}^{p} to denote the unit eigenvector corresponding to the eigenvalue λi\lambda_{i}. If the context is clear, we just use λ:=λ1\lambda:=\lambda_{1} and 𝐯:=𝐯1\mathbf{v}:=\mathbf{v}_{1} to denote the largest eigenvalue and the top eigenvector. We also consider the eigenvalues and eigenvectors of the Gram matrix 𝐇^:=𝐗𝐗\widehat{\mathbf{H}}:=\mathbf{X}\mathbf{X}^{\top}. Note that 𝐇^\widehat{\mathbf{H}} and 𝐇\mathbf{H} have the same non-trivial eigenvalues, and the spectrum of 𝐇^\widehat{\mathbf{H}} is given by {λi}i=1n\{\lambda_{i}\}_{i=1}^{n} with λp+1==λn=0\lambda_{p+1}=\cdots=\lambda_{n}=0. We denote the unit eigenvectors of 𝐇^\widehat{\mathbf{H}} associated with the eigenvalue λi\lambda_{i} by 𝐮in\mathbf{u}_{i}\in\mathbb{R}^{n}.

Let 𝐔=[𝐮1,,𝐮p]n×p\mathbf{U}=[\mathbf{u}_{1},\cdots,\mathbf{u}_{p}]\in\mathbb{R}^{n\times p} and 𝐕=[𝐯1,,𝐯p]p×p\mathbf{V}=[\mathbf{v}_{1},\cdots,\mathbf{v}_{p}]\in\mathbb{R}^{p\times p}. These eigenvectors may be connected by the singular value decomposition of the data matrix 𝐗=𝐔𝚺𝐕\mathbf{X}=\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{\top}, where 𝚺:=𝖽𝗂𝖺𝗀(σ1,,σp)\mathbf{\Sigma}:=\mathsf{diag}(\sigma_{1},\cdots,\sigma_{p}) with σi=λi\sigma_{i}=\sqrt{\lambda_{i}} is the diagonal matrix of singular values. For convenience, we also denote σ:=σ1\sigma:=\sigma_{1}. Therefore, up to the signs of the eigenvectors, we have

𝐗𝐯α=λα𝐮α,𝐗𝐮α=λα𝐯α.\mathbf{X}\mathbf{v}_{\alpha}=\sqrt{\lambda_{\alpha}}\mathbf{u}_{\alpha},\ \ \mathbf{X}^{\top}\mathbf{u}_{\alpha}=\sqrt{\lambda_{\alpha}}\mathbf{v}_{\alpha}.

We now describe the resampling procedure. For a positive integer knpk\leq np, define the random matrix 𝐗[k]\mathbf{X}^{[k]} in the following way. Let Sk={(i1,α1),,(ik,αk)}S_{k}=\{(i_{1},\alpha_{1}),\cdots,(i_{k},\alpha_{k})\} be a set of kk pairs chosen uniformly at random without replacement from the set of all ordered pairs (i,α)(i,\alpha) of indices with 1in1\leq i\leq n and 1αp1\leq\alpha\leq p. We assume that the set SkS_{k} is independent of the entries of 𝐗\mathbf{X}. The entries of 𝐗[k]\mathbf{X}^{[k]} are given by

𝐗i,α[k]={𝐗i,αif (i,α)Sk,𝐗i,αotherwise,\mathbf{X}^{[k]}_{i,\alpha}=\left\{\begin{aligned} &\mathbf{X}_{i,\alpha}^{\prime}&\quad&\mbox{if }(i,\alpha)\in S_{k},\\ &\mathbf{X}_{i,\alpha}&\quad&\mbox{otherwise},\end{aligned}\right.

where (𝐗i,α)1in,1αp(\mathbf{X}_{i,\alpha}^{\prime})_{1\leq i\leq n,1\leq\alpha\leq p} are independent random variables, independent of 𝐗\mathbf{X}, and 𝐗i,α\mathbf{X}_{i,\alpha}^{\prime} has the same distribution as 𝐗i,α\mathbf{X}_{i,\alpha}. In other words, 𝐗[k]\mathbf{X}^{[k]} is obtained from 𝐗\mathbf{X} by resampling kk random entries of the matrix, and therefore 𝐗[k]\mathbf{X}^{[k]} clearly has the same distribution as 𝐗\mathbf{X}. Let 𝐇[k]:=(𝐗[k])𝐗[k]\mathbf{H}^{[k]}:=(\mathbf{X}^{[k]})^{\top}\mathbf{X}^{[k]} be the sample covariance matrix corresponding to the resampled data matrix 𝐗[k]\mathbf{X}^{[k]}. Denote the eigenvalues and the corresponding normalized eigenvectors of 𝐇[k]\mathbf{H}^{[k]} by λ1[k]λp[k]\lambda^{[k]}_{1}\geq\cdots\geq\lambda^{[k]}_{p} and 𝐯1[k],,𝐯p[k]\mathbf{v}^{[k]}_{1},\cdots,\mathbf{v}^{[k]}_{p}. When the context is clear, the top eigenvalue and the top eigenvector are simply denoted by λ[k]\lambda^{[k]} and 𝐯[k]\mathbf{v}^{[k]}. Similarly, denote the eigenvector of the matrix 𝐇^[k]:=𝐗[k](𝐗[k])\widehat{\mathbf{H}}^{[k]}:=\mathbf{X}^{[k]}(\mathbf{X}^{[k]})^{\top} associated with the eigenvalue λi[k]\lambda^{[k]}_{i} by 𝐮i[k]\mathbf{u}^{[k]}_{i}.
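For concreteness, the resampling procedure is easy to implement. The short NumPy sketch below is only illustrative: the function name, the Gaussian choice for the entry distribution, and all parameters are ours and are not part of the model assumptions.

```python
import numpy as np

def resample_entries(X, k, rng):
    """Return X^{[k]}: a copy of X with k uniformly chosen entries redrawn
    independently from the same distribution, independently of X."""
    n, p = X.shape
    Xk = X.copy()
    # choose k distinct positions uniformly at random
    flat = rng.choice(n * p, size=k, replace=False)
    rows, cols = np.unravel_index(flat, (n, p))
    # fresh entries with the same law as the originals (here N(0, 1/p))
    Xk[rows, cols] = rng.standard_normal(k) / np.sqrt(p)
    return Xk

# usage: a 300 x 150 Gaussian data matrix with k = 1000 resampled entries
rng = np.random.default_rng(0)
n, p, k = 300, 150, 1000
X = rng.standard_normal((n, p)) / np.sqrt(p)
Xk = resample_entries(X, k, rng)
```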

With the resampling parameter in two different regimes, we have the following results.

Theorem 1 (Noise sensitivity under excessive resampling).

Let 𝐗\mathbf{X} be a random data matrix satisfying (1) and (2) and 𝐗[k]\mathbf{X}^{[k]} be the resampled matrix defined as above. If kn5/3k\gg n^{5/3}, then the associated principal components are asymptotically orthogonal, i.e.

limn𝔼|𝐯,𝐯[k]|=0,andlimn𝔼|𝐮,𝐮[k]|=0.\lim_{n\to\infty}\mathbb{E}\left|\langle\mathbf{v},\mathbf{v}^{[k]}\rangle\right|=0,\ \ \mbox{and}\ \ \lim_{n\to\infty}\mathbb{E}\left|\langle\mathbf{u},\mathbf{u}^{[k]}\rangle\right|=0. (3)
Theorem 2 (Noise stability under moderate resampling).

Let 𝐗\mathbf{X} be a random data matrix satisfying (1) and (2) and 𝐗[k]\mathbf{X}^{[k]} be the resampled matrix defined as above. For any ϵ0>0\epsilon_{0}>0,

max1kn5/3ϵ0mins{1,1}n𝐯s𝐯[k]0,\max_{1\leq k\leq n^{5/3-\epsilon_{0}}}\min_{s\in\{-1,1\}}\sqrt{n}\|\mathbf{v}-s\mathbf{v}^{[k]}\|_{\infty}\xrightarrow{\mathbb{P}}0, (4)

where \xrightarrow{\mathbb{P}} means convergence in probability. In particular, this implies

limn𝔼[max1kn5/3ϵ0mins{1,1}𝐯s𝐯[k]2]=0.\lim_{n\to\infty}\mathbb{E}\left[\max_{1\leq k\leq n^{5/3-\epsilon_{0}}}\min_{s\in\{-1,1\}}\|\mathbf{v}-s\mathbf{v}^{[k]}\|_{2}\right]=0.

The same result also holds for 𝐮\mathbf{u} and 𝐮[k]\mathbf{u}^{[k]}.

These two theorems together state that the critical threshold for the resampling strength is of order kn5/3k\asymp n^{5/3}. Note that n5/3n^{5/3} is negligible compared with the total number npn2np\asymp n^{2} of entries. We show that, above the threshold n5/3n^{5/3}, resampling even a negligible portion of the data results in a dramatic change of the principal component, in the sense that the new principal component is asymptotically orthogonal to the old one; below the threshold, a relatively mild resampling has almost no effect on the new principal component. In terms of the eigenvector overlaps |𝐯,𝐯[k]||\langle\mathbf{v},\mathbf{v}^{[k]}\rangle| and |𝐮,𝐮[k]||\langle\mathbf{u},\mathbf{u}^{[k]}\rangle|, these quantities exhibit a sharp phase transition from 11 to 0 near the critical threshold kn5/3k\asymp n^{5/3}.

We remark that the phase transition stated in the above theorems is not restricted to the top eigenvectors 𝐯,𝐯[k],𝐮,𝐮[k]\mathbf{v},\mathbf{v}^{[k]},\mathbf{u},\mathbf{u}^{[k]}. With essentially the same arguments, we can prove that for any fixed m>0m>0, the mm-th leading eigenvectors 𝐯m,𝐯m[k]\mathbf{v}_{m},\mathbf{v}^{[k]}_{m} and 𝐮m,𝐮m[k]\mathbf{u}_{m},\mathbf{u}^{[k]}_{m} exhibit the same phase transition at the critical threshold of the same order n5/3n^{5/3}.

1.2 High-Level Proof Scheme

The high-level idea is the “superconcentration implies chaos” phenomenon established by Chatterjee [Cha14], which means that a small perturbation (beyond a certain threshold) of a super-concentrated system (in the sense that it is characterized by some quantity with small variance) leads to a dramatic change of the system. For random matrix models, the super-concentrated quantities are usually the eigenvalues. Resampling sensitivity for random matrices was first studied for Wigner matrices and sparse Erdős-Rényi graphs in [BLZ20, BL22]. For the PCA model, the key difference is that the entries of the sample covariance matrix are correlated. Moreover, resampling a single entry of the data changes Θ(n)\Theta(n) entries of the sample covariance matrix. These two differences make the proofs more technical, and a linearization trick is important to reduce the interdependency of the matrix entries. In addition, we need a technical variance formula related to resampling to compute the variance of the top eigenvalue, as well as tools from random matrix theory, in particular the local Marchenko-Pastur law for the resolvent and the delocalization of eigenvectors.

For the sensitivity of the principal components, we show that the inner products 𝐯,𝐯[k]\langle\mathbf{v},\mathbf{v}^{[k]}\rangle and 𝐮,𝐮[k]\langle\mathbf{u},\mathbf{u}^{[k]}\rangle are closely related to the variance of the top eigenvalue of the sample covariance matrix. Using the variance formula for resampling, we obtain a precise relation between the inner product of the top eigenvectors and the concentration of the top eigenvalue. The perturbation effect of the resampling is studied via the variational representation of the eigenvalue.

For the stability under moderate resampling, the key idea of our proof is to study the stability of the resolvent. On the one hand, the stability of the resolvent implies the stability of the top eigenvalue. On the other hand, the resolvent can be used to approximate certain useful statistics of the top eigenvector. The stability of the resolvent is proved via a Lindeberg exchange strategy. The resampling procedure can be decomposed into a martingale, and the difference between the resolvents can therefore be bounded by martingale concentration inequalities, combined with the local Marchenko-Pastur law for the resolvent as an a priori estimate.

2 Sensitivity under Excessive Resampling

We provide a heuristic argument for deriving the threshold for the sensitivity regime. We consider the derivative of the top eigenvalue as a function of the matrix entries. Then we have the approximation

λ[1]λ𝐯[(𝐗[1])𝐗[1]𝐗𝐗]𝐯\lambda^{[1]}-\lambda\approx\mathbf{v}^{\top}\left[(\mathbf{X}^{[1]})^{\top}\mathbf{X}^{[1]}-\mathbf{X}^{\top}\mathbf{X}\right]\mathbf{v}

Note that the matrix in the parenthesis has only Θ(p)\Theta(p) non-zero entries, and each entry is roughly of size O(p1)O(p^{-1}). Also, the eigenvector 𝐯\mathbf{v} is delocalized in the sense that 𝐯(m)p1/2\mathbf{v}(m)\approx p^{-1/2} for all m=1,,pm=1,\dots,p. A central limit theorem yields that approximately we have

λ[1]λpp1p1=p3/2.\lambda^{[1]}-\lambda\approx\sqrt{p}p^{-1}p^{-1}=p^{-3/2}.

By this heuristic argument and the central limit theorem, we have

λ[k]λkp3/2.\lambda^{[k]}-\lambda\approx\sqrt{k}p^{-3/2}.

Note that from random matrix theory, we know that λ1λ2\lambda_{1}-\lambda_{2} is of order p2/3p^{-2/3}. Therefore, if kp3/2p2/3\sqrt{k}p^{-3/2}\ll p^{-2/3} (this corresponds to kn5/3k\ll n^{5/3}), then the difference between the two top eigenvalues λ\lambda and λ[k]\lambda^{[k]} is much smaller than the gap between the first two eigenvalues λ1\lambda_{1} and λ2\lambda_{2} of the matrix 𝐗𝐗\mathbf{X}^{\top}\mathbf{X}. This implies that the perturbation effect of the resampling is small, and therefore in this case it is plausible to believe that 𝐯[k]\mathbf{v}^{[k]} is just a small perturbation of 𝐯\mathbf{v}. Thus, for the sensitivity regime, we must have kn5/3k\gg n^{5/3}.
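The scaling in this heuristic can be probed numerically. The rough Monte Carlo check below (Gaussian data, modest size, all parameter choices ours) compares the average shift of the top eigenvalue after resampling kk entries with the prediction √k p^{-3/2} and with the spectral gap; since constants are not tracked, only the orders of magnitude are meaningful.

```python
import numpy as np

rng = np.random.default_rng(1)
n = p = 400
X = rng.standard_normal((n, p)) / np.sqrt(p)
evals = np.linalg.eigvalsh(X.T @ X)
lam1, lam2 = evals[-1], evals[-2]

def avg_top_eig_shift(k, trials=20):
    """Average |lambda^{[k]} - lambda| over independent resamplings of k entries."""
    shifts = []
    for _ in range(trials):
        Xk = X.copy()
        flat = rng.choice(n * p, size=k, replace=False)
        r, c = np.unravel_index(flat, (n, p))
        Xk[r, c] = rng.standard_normal(k) / np.sqrt(p)
        shifts.append(abs(np.linalg.eigvalsh(Xk.T @ Xk)[-1] - lam1))
    return np.mean(shifts)

for k in (10, 100, 1000):
    print(f"k={k:5d}  shift={avg_top_eig_shift(k):.2e}  "
          f"sqrt(k)*p^-1.5={np.sqrt(k) * p**-1.5:.2e}  gap={lam1 - lam2:.2e}")
```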

Our proof is essentially trying to make the above heuristics rigorous. To do this, a key observation is that the inner products 𝐯,𝐯[k]\langle\mathbf{v},\mathbf{v}^{[k]}\rangle and 𝐮,𝐮[k]\langle\mathbf{u},\mathbf{u}^{[k]}\rangle can be related to the variance of the leading eigenvalue.

2.1 Connection with Variance of Top Eigenvalue

As mentioned above, the key step in the proof for the sensitivity regime is to establish a connection between the inner products of the top eigenvectors and the variance of the top eigenvalue. Specifically, we will prove

𝔼[|𝐯,𝐯[k]|2]Cn3𝖵𝖺𝗋(σ)k+o(1),\mathbb{E}\left[|\langle\mathbf{v},\mathbf{v}^{[k]}\rangle|^{2}\right]\leq C\frac{n^{3}\mathsf{Var}(\sigma)}{k}+o(1),

where C>0C>0 is some universal constant and σ=λ\sigma=\sqrt{\lambda} is the leading singular value. A similar result is also true for 𝐮\mathbf{u} and 𝐮[k]\mathbf{u}^{[k]}. For more details, we refer to Section C.2.

From random matrix theory, we have 𝖵𝖺𝗋(σ)=O(n4/3)\mathsf{Var}(\sigma)=O(n^{-4/3}). Then, based on this inequality, we derive the threshold kn5/3k\gg n^{5/3} for the sensitivity regime.

3 Stability under Moderate Resampling

To establish the stability of PCA when the resampling strength is mild, we utilize tools from random matrix theory; specifically, the proof relies on the analysis of the resolvent matrix. Also, to simplify the Gram matrix structure of the sample covariance matrix, we use a linearization trick when considering the resolvent. For any zz\in\mathbb{C} with Imz>0\text{\rm Im}{\,}z>0, the resolvent is defined as

𝐑(z):=(𝐈n𝐗𝐗z𝐈p)1.\mathbf{R}(z):=\left(\begin{matrix}-\mathbf{I}_{n}&\mathbf{X}\\ \mathbf{X}^{\top}&-z\mathbf{I}_{p}\end{matrix}\right)^{-1}.

Similarly, we denote the resolvent of 𝐗[k]\mathbf{X}^{[k]} by 𝐑[k](z)\mathbf{R}^{[k]}(z). The key idea for the proofs in the stability regime is that eigenvectors can be approximated by resolvents and the resolvents are stable under moderate resampling.
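By the Schur complement formula (a standard fact not spelled out above), the lower-right p×p block of R(z) coincides with the usual resolvent (X^⊤X − zI_p)^{-1}, which is why eigenvector information about the sample covariance matrix can be read off from R(z). The snippet below is our own illustration and verifies this identity numerically.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 100
X = rng.standard_normal((n, p)) / np.sqrt(p)
z = 3.0 + 1e-2j

# linearized resolvent R(z) = [[-I_n, X], [X^T, -z I_p]]^{-1}
R = np.linalg.inv(np.block([[-np.eye(n), X], [X.T, -z * np.eye(p)]]))

# its lower-right p x p block equals (X^T X - z I_p)^{-1}
R_H = np.linalg.inv(X.T @ X - z * np.eye(p))
print(np.max(np.abs(R[n:, n:] - R_H)))   # ~ machine precision
```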

3.1 Resolvent Approximation

To illustrate the usefulness of the resolvent, we show that the entries of the resolvent can be used to approximate the product of entries in the eigenvector. For some small δ>0\delta>0, let z0=λ+iηz_{0}=\lambda+\mathrm{i}\eta with η=n2/3δ\eta=n^{-2/3-\delta}. In the regime kn5/3ϵ0k\leq n^{5/3-\epsilon_{0}} for some ϵ0>0\epsilon_{0}>0, there exists some small c>0c>0 such that for all α,β=1,,p\alpha,\beta=1,\dots,p, we have

|ηIm𝐑n+α,n+β(z0)𝐯(α)𝐯(β)|n1c,\left|{\eta\text{\rm Im}{\,}\mathbf{R}_{n+\alpha,n+\beta}(z_{0})-\mathbf{v}(\alpha)\mathbf{v}(\beta)}\right|\leq n^{-1-c},

and

|ηIm𝐑n+α,n+β[k](z0)𝐯[k](α)𝐯[k](β)|n1c.\left|{\eta\text{\rm Im}{\,}\mathbf{R}^{[k]}_{n+\alpha,n+\beta}(z_{0})-\mathbf{v}^{[k]}(\alpha)\mathbf{v}^{[k]}(\beta)}\right|\leq n^{-1-c}.

A similar result also holds for 𝐮\mathbf{u} and 𝐮[k]\mathbf{u}^{[k]}. For more details, we refer to Lemma D.5.

3.2 Stability of the Resolvent

Since the eigenvector can be approximated by the resolvent, it suffices to show the stability of the resolvent. Consider the regime kn5/3ϵ0k\leq n^{5/3-\epsilon_{0}} for some ϵ0>0\epsilon_{0}>0. For some small δ>0\delta>0 and all z=E+iηz=E+\mathrm{i}\eta close to the upper spectral edge with η=n2/3δ\eta=n^{-2/3-\delta}, there exists a small constant c>0c>0 such that the following holds for all i,j=1,,ni,j=1,\dots,n and α,β=1,,p\alpha,\beta=1,\dots,p,

|𝐑ij[k](z)𝐑ij(z)|1n1+cη,\left|{\mathbf{R}^{[k]}_{ij}(z)-\mathbf{R}_{ij}(z)}\right|\leq\frac{1}{n^{1+c}\eta},

and

|𝐑n+α,n+β[k](z)𝐑n+α,n+β(z)|1n1+cη.\left|{\mathbf{R}^{[k]}_{n+\alpha,n+\beta}(z)-\mathbf{R}_{n+\alpha,n+\beta}(z)}\right|\leq\frac{1}{n^{1+c}\eta}.

This is the main technical part of the whole argument, and its proof relies on the Lindeberg exchange method and a martingale concentration argument. For more details, we refer to Lemma D.3.

Combining the stability of the resolvents with the resolvent approximation for eigenvectors, we can conclude that 𝐯\mathbf{v} and 𝐯[k]\mathbf{v}^{[k]} must be close (similarly, also for 𝐮\mathbf{u} and 𝐮[k]\mathbf{u}^{[k]}).

3.3 Stability of the Top Eigenvalue

As a byproduct, we derive a stability estimate for the top eigenvalue, which may be of independent interest. For kn5/3ϵ0k\leq n^{5/3-\epsilon_{0}} with some arbitrary ϵ0>0\epsilon_{0}>0 and an arbitrary ε>0\varepsilon>0,

|λ[k]λ|n2/3δ+ε,|\lambda^{[k]}-\lambda|\leq n^{-2/3-\delta+\varepsilon},

where δ>0\delta>0 is some small constant. Since the fluctuation of the top eigenvalue around its typical location is of order O(n2/3+ε)O(n^{-2/3+\varepsilon}), this result shows that the top eigenvalue is indeed non-trivially stable under moderate resampling. For more details, we refer to Lemma D.4.

4 Discussions and Applications

We have shown that PCA is sensitive to the input data, in the sense that resampling kn5/3k\gg n^{5/3} entries, a vanishing fraction of the data, results in decorrelation between the new output and the old output. We further prove that this threshold is sharp: a moderate resampling below this threshold has asymptotically no effect on the outcome.

Moreover, besides demonstrating an exciting phenomenon, our results have broad implications in other related fields. We briefly discuss a few potential extensions and applications that would be worth further exploration.

4.1 Extensions to Broader PCA Models

Sparse PCA

A natural extension is to consider sparse data, which corresponds to sparse PCA, a topic that has received a lot of attention in the past decade (see e.g. [MWA05, ZHT06, CMW13]). However, establishing the sharp phase transition for sparse PCA requires several technical ingredients that are currently missing. In particular, for the sensitivity regime, the eigenvalue gap property (i.e. tail estimates for the eigenvalue gap, see Lemma B.3) is unknown. Also, to establish the sharp threshold for the stability regime, an improved local law for the resolvent is needed (cf. Lemma D.1). Though we expect these missing parts to hold, proving them would be beyond the scope of this paper and we leave them to future work. Assuming the eigenvalue gap property and the improved local law, the phase transition can be proved by the same arguments as in this paper.

PCA with General Population

For practical purposes, it would be significant to consider data with a general population covariance profile (see e.g. [Nad08]). The corresponding matrices were studied in [BPZ15, LS16]. The only missing ingredient we need is the eigenvalue gap property. Under reasonable assumptions for the population matrix so that the eigenvalue gap property is true, our arguments in this paper will yield the same stability-sensitivity transition.

4.2 Extensions to Other Statistical Methods

Within the general PCA framework, one important variant is kernel PCA, which is closely related to the widely used spectral clustering [NJW01, VL07, CBG16]. The corresponding kernel random matrices were studied in [EK10, CS13]. However, these kernel random matrices are far from being well understood. In particular, the study of their eigenvectors has been very limited, and the indispensable delocalization property is still unknown.

It would be interesting to explore whether other statistical methods share the same resampling sensitivity phenomenon. Random matrices associated with canonical correlation analysis (CCA) or multivariate analysis of variance (MANOVA) are well studied [HPZ16, HPY18, Yan22]. In particular, the Tracy-Widom concentration of the top eigenvalue, one of the most important ingredients that we need, has been proved. We anticipate that these models exhibit a stability-sensitivity transition similar to that of PCA.

4.3 Differential Privacy

In our paper, we study the stability of the top eigenvalue and the top eigenvector under resampling in terms of bounding the \ell_{\infty} distance. Such stability estimates can be regarded as the global sensitivity of PCA performed on neighboring datasets. This measurement is closely related to the analysis of differential privacy [DR14]. PCA under differential privacy was previously studied in [BDMN05, CSS13], among others. Our results revisit the problem of designing a private algorithm for computing the principal component. Here we remark that though the statements in Theorem 1 and Theorem 2 are qualitative, a careful examination of the proof can yield some quantitative estimates. Based on the stability estimates in terms of the \ell_{\infty} metric, a simple Laplace mechanism produces a differentially private version of PCA for computing the top eigenvalue or the top eigenvector. However, compared with [CSS13], our results are limited in the sense that their results are non-asymptotic for all sample sizes nn and data dimensions pp, while ours are restricted to the proportional growth regime.

Moreover, previous works on differentially private PCA focused on neighboring datasets that differ by one sample vector. Our result may be seen as a refined notion of privacy, since we can analyze the sensitivity of PCA over two “neighboring” datasets with kk different entries for any kk.

Meanwhile, the largest eigenvalue of the sample covariance matrix plays an important role in hypothesis testing. For example, Roy’s largest root test is used in many problems (see e.g. [JN17]). Our result may provide useful insights for constructing a differentially private test statistic based on the top eigenvalue.

4.4 Database Alignment

Database alignment (or, in some cases, graph matching) refers to the optimization problem in which we are given two datasets and the goal is to find an optimal correspondence between the samples and the features that maximally aligns the data. For datasets 𝐗,𝐘n×p\mathbf{X},\mathbf{Y}\in\mathbb{R}^{n\times p}, we look for permutations πs𝒮n\pi_{\mathrm{s}}\in{\mathcal{S}}_{n} and πf𝒮p\pi_{\mathrm{f}}\in{\mathcal{S}}_{p} to solve the optimization problem

maxπs,πfi=1nα=1p𝐗iα𝐘πs(i)πf(α),\max_{\pi_{\mathrm{s}},\pi_{\mathrm{f}}}\sum_{i=1}^{n}\sum_{\alpha=1}^{p}\mathbf{X}_{i\alpha}\mathbf{Y}_{\pi_{\mathrm{s}}(i)\pi_{\mathrm{f}}(\alpha)},

where 𝒮n{\mathcal{S}}_{n} and 𝒮p{\mathcal{S}}_{p} are the sets of all permutations on [n][n] and [p][p], respectively. This problem is closely related to the Quadratic Assignment Problem (QAP), which is known to be NP-hard to solve or even approximate.
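In code, the alignment objective for a candidate pair of permutations is essentially a one-liner; the helper below is only an illustration with names of our own choosing.

```python
import numpy as np

def alignment_objective(X, Y, perm_s, perm_f):
    """sum_{i,alpha} X[i, alpha] * Y[perm_s[i], perm_f[alpha]]."""
    return float(np.sum(X * Y[np.ix_(perm_s, perm_f)]))

# identity permutations recover the plain inner product <X, Y>
rng = np.random.default_rng(6)
n, p = 4, 3
X, Y = rng.standard_normal((n, p)), rng.standard_normal((n, p))
assert np.isclose(alignment_objective(X, Y, np.arange(n), np.arange(p)),
                  np.sum(X * Y))
```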

The study of the alignment problem for correlated random databases has a long history. Previous work mainly considered matrices that are correlated through some additive perturbation, and more general models were studied under a homogeneous correlation (i.e. the correlation between all corresponding pairs is the same). See for example [DCK19, WXS22] and many other works.

Our resampling procedure may be regarded as an adversarial corruption of the dataset, which is a different kind of correlation compared with previous work. To our knowledge, this is the first time that database alignment with adversarial corruption has been considered. The setting of the problem is as follows: we have two matrices

𝐗n×p,𝐘=𝚷s𝐗[k]𝚷f\mathbf{X}\in\mathbb{R}^{n\times p},\ \ \ \mathbf{Y}=\mathbf{\Pi}_{\mathrm{s}}\mathbf{X}^{[k]}\mathbf{\Pi}_{\mathrm{f}}^{\top}

where 𝐗\mathbf{X} is a random matrix satisfying (1) and (2), and 𝚷s\mathbf{\Pi}_{\mathrm{s}} and 𝚷f\mathbf{\Pi}_{\mathrm{f}} are permutation matrices of order nn and pp chosen uniformly at random. The goal is to recover the permutations 𝚷s\mathbf{\Pi}_{\mathrm{s}} and 𝚷f\mathbf{\Pi}_{\mathrm{f}} based on the observations 𝐗\mathbf{X} and 𝐘\mathbf{Y}. Here, we can think of 𝐘\mathbf{Y} as the unlabeled version of 𝐗\mathbf{X} with adversarial corruption. By considering the covariance matrices, we have

𝐀=𝐗𝐗,𝐁=𝐘𝐘=𝚷s(𝐗[k](𝐗[k]))𝚷s,\mathbf{A}=\mathbf{X}\mathbf{X}^{\top},\ \ \mathbf{B}=\mathbf{Y}\mathbf{Y}^{\top}=\mathbf{\Pi}_{\mathrm{s}}\left(\mathbf{X}^{[k]}(\mathbf{X}^{[k]})^{\top}\right)\mathbf{\Pi}_{\mathrm{s}}^{\top},

and similarly we consider

𝐀^=𝐗𝐗,𝐁^=𝐘𝐘=𝚷f((𝐗[k])𝐗[k])𝚷f.\widehat{\mathbf{A}}=\mathbf{X}^{\top}\mathbf{X},\ \ \widehat{\mathbf{B}}=\mathbf{Y}^{\top}\mathbf{Y}=\mathbf{\Pi}_{\mathrm{f}}\left((\mathbf{X}^{[k]})^{\top}\mathbf{X}^{[k]}\right)\mathbf{\Pi}_{\mathrm{f}}^{\top}.

A natural idea to reconstruct the permutations 𝚷s\mathbf{\Pi}_{\mathrm{s}} (and 𝚷f)\mathbf{\Pi}_{\mathrm{f}}) is to align the top eigenvectors of the matrices 𝐀\mathbf{A} and 𝐁\mathbf{B} (and 𝐀^\widehat{\mathbf{A}} and 𝐁^\widehat{\mathbf{B}}). See Algorithm 1 for details. This spectral method is a natural technique for database alignment and graph matching. We are interested in the resampling strengths under which the PCA-Recovery algorithm can almost perfectly reconstruct the permutations, and the conditions under which this method completely fails.

Algorithm 1 PCA-Recovery
  Input: data matrices 𝐗,𝐘n×p\mathbf{X},\mathbf{Y}\in\mathbb{R}^{n\times p}
  Output: permutation matrices 𝚷sn×n\mathbf{\Pi}_{\mathrm{s}}\in\mathbb{R}^{n\times n}, 𝚷fp×p\mathbf{\Pi}_{\mathrm{f}}\in\mathbb{R}^{p\times p}
  Compute 𝐮\mathbf{u} the unit leading left singular vector of 𝐗\mathbf{X}
  Compute 𝐯\mathbf{v} the unit leading right singular vector of 𝐗\mathbf{X}
  Compute 𝐮\mathbf{u}^{\prime} the unit leading left singular vector of 𝐘\mathbf{Y}
  Compute 𝐯\mathbf{v}^{\prime} the unit leading right singular vector of 𝐘\mathbf{Y}
  
  Compute 𝚷s+\mathbf{\Pi}_{\mathrm{s}}^{+} the permutation aligning 𝐮\mathbf{u} and 𝐮\mathbf{u}^{\prime}
  Compute 𝚷s\mathbf{\Pi}_{\mathrm{s}}^{-} the permutation aligning 𝐮\mathbf{u} and 𝐮-\mathbf{u}^{\prime}
  Compute 𝚷f+\mathbf{\Pi}_{\mathrm{f}}^{+} the permutation aligning 𝐯\mathbf{v} and 𝐯\mathbf{v}^{\prime}
  Compute 𝚷f\mathbf{\Pi}_{\mathrm{f}}^{-} the permutation aligning 𝐯\mathbf{v} and 𝐯-\mathbf{v}^{\prime}
  
  if 𝐀,𝚷s+𝐁(𝚷s+)𝐀,𝚷s𝐁(𝚷s)\langle{\mathbf{A}},\mathbf{\Pi}_{\mathrm{s}}^{+}{\mathbf{B}}(\mathbf{\Pi}_{\mathrm{s}}^{+})^{\top}\rangle\geq\langle{\mathbf{A}},\mathbf{\Pi}_{\mathrm{s}}^{-}{\mathbf{B}}(\mathbf{\Pi}_{\mathrm{s}}^{-})^{\top}\rangle then
     𝚷s𝚷s+\mathbf{\Pi}_{\mathrm{s}}\leftarrow\mathbf{\Pi}_{\mathrm{s}}^{+}
  else
     𝚷s𝚷s\mathbf{\Pi}_{\mathrm{s}}\leftarrow\mathbf{\Pi}_{\mathrm{s}}^{-}
  end if
  
  if 𝐀^,𝚷f+𝐁^(𝚷f+)𝐀^,𝚷f𝐁^(𝚷f)\langle\widehat{\mathbf{A}},\mathbf{\Pi}_{\mathrm{f}}^{+}\widehat{\mathbf{B}}(\mathbf{\Pi}_{\mathrm{f}}^{+})^{\top}\rangle\geq\langle\widehat{\mathbf{A}},\mathbf{\Pi}_{\mathrm{f}}^{-}\widehat{\mathbf{B}}(\mathbf{\Pi}_{\mathrm{f}}^{-})^{\top}\rangle then
     𝚷f𝚷f+\mathbf{\Pi}_{\mathrm{f}}\leftarrow\mathbf{\Pi}_{\mathrm{f}}^{+}
  else
     𝚷f𝚷f\mathbf{\Pi}_{\mathrm{f}}\leftarrow\mathbf{\Pi}_{\mathrm{f}}^{-}
  end if
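A compact NumPy transcription of Algorithm 1 might look as follows. The step “the permutation aligning two vectors” is interpreted here as matching entries by their order statistics (which minimizes the ℓ2 distance over permutations); this interpretation, as well as all variable names, is ours, so the snippet is a sketch rather than the definitive implementation.

```python
import numpy as np

def perm_aligning(a, b):
    """Index array perm with a[i] matched to b[perm[i]] by order statistics,
    i.e. minimizing ||a - b[perm]||_2 over all permutations."""
    perm = np.empty(len(a), dtype=int)
    perm[np.argsort(a)] = np.argsort(b)
    return perm

def pca_recovery(X, Y):
    """Sketch of Algorithm 1 (PCA-Recovery): estimate the sample and feature
    correspondences from the leading singular vectors of X and Y."""
    A, B = X @ X.T, Y @ Y.T              # Gram matrices for the samples
    A_hat, B_hat = X.T @ X, Y.T @ Y      # covariance matrices for the features

    Ux, _, Vtx = np.linalg.svd(X, full_matrices=False)
    Uy, _, Vty = np.linalg.svd(Y, full_matrices=False)
    u, v = Ux[:, 0], Vtx[0]
    u2, v2 = Uy[:, 0], Vty[0]

    def pick(a, b, M, N):
        # resolve the sign ambiguity of the singular vectors by keeping the
        # candidate with the larger QAP-type score <M, N re-indexed by perm>
        cands = [perm_aligning(a, b), perm_aligning(a, -b)]
        scores = [np.sum(M * N[np.ix_(perm, perm)]) for perm in cands]
        return cands[int(np.argmax(scores))]

    perm_s = pick(u, u2, A, B)        # perm_s[i]: row of Y matched to row i of X
    perm_f = pick(v, v2, A_hat, B_hat)
    return perm_s, perm_f
```

The returned index arrays encode the estimated correspondences; converting them into the permutation matrices of the pseudocode is straightforward.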

Spectral methods have been studied and applied in many scenarios (see e.g. [FMWX20, FMWX22] and [GLM22]). In particular, in [GLM22], a similar PCA method was studied to match two symmetric Gaussian matrices correlated via additive Gaussian noise. Their work proved a similar 010-1 transition for the inner product of the top eigenvectors, which leads to an all-or-nothing phenomenon in the alignment problem, i.e. the accuracy of the recovery undergoes a sharp transition from 0 to 11 near some critical threshold. However, the arguments in [GLM22] are not applicable in our case. Their proof heavily depends on the Gaussian assumption on the matrices and the additive structure of the noise. In particular, their proof crucially relies on the orthogonal invariance of the Gaussian noise. In our case, by contrast, the noise enters through the resampling, and there is no way to write the “noise” in an additive form that is independent of the “signal”. Even in the Gaussian case, a rigorous analysis of the PCA-Recovery algorithm seems difficult.

Nevertheless, our results on the sensitivity of the eigenvector inner products suggest that, when kn5/3k\gg n^{5/3}, the two eigenvectors are approximately de-correlated so that they share almost no common information. Consequently, recovery of the data via aligning the principal components would be basically random guessing. Therefore, we conjecture that if kn5/3k\gg n^{5/3}, PCA-Recovery fails to recover the latent permutations in the sense that it can only achieve o(1)o(1) fraction of correct matching with the ground truth. On the other hand, when kn5/3k\ll n^{5/3}, the performance of our algorithm seems mysterious.

In Section 5, we empirically check the performance of the PCA-Recovery algorithm. Numerical simulations suggest that when kn5/3k\gg n^{5/3}, the performance of PCA-Recovery is indeed poor in the sense that the accuracy of the recovery is almost 0. On the other hand, when kn5/3k\ll n^{5/3}, experiments show that we cannot expect a sharp all-or-nothing phenomenon similar to that in [GLM22].

Finally, we remark that PCA-Recovery actually addresses a more difficult task, as it does not need direct observations of 𝐗\mathbf{X} and 𝐘\mathbf{Y}. We can consider a harder problem (in both the statistical and the computational sense), which we call alignment from covariance profile. In this problem, we only have access to the covariances between the samples, and we aim to recover the correspondence between the samples from the two databases. A similar problem with Gaussian data and additive noise was considered in [WWXY22] as a prototype for matching random geometric dot-product graphs. The analysis of such database alignment problems with adversarial corruption will be an interesting direction for future studies.

5 Numerical Experiments

We validate our theoretical prediction by checking the performance of PCA on synthetic data. To highlight the universality of our results, we consider Gaussian data and Bernoulli data. The Gaussian data matrix consists of i.i.d. 𝒩(0,1){\mathcal{N}}(0,1) entries. The Bernoulli data matrix consists of i.i.d. entries taking values in {±1}\{\pm 1\} with equal probability 12\frac{1}{2}. To visualize the stability-sensitivity transition, we focus on the overlap of the leading eigenvectors |𝐯,𝐯[k]||\langle\mathbf{v},\mathbf{v}^{[k]}\rangle| and |𝐮,𝐮[k]||\langle\mathbf{u},\mathbf{u}^{[k]}\rangle| as the observable. Note that, in the stability regime, the asymptotic colinearity (4) implies that |𝐯,𝐯[k]|,|𝐮,𝐮[k]|1|\langle\mathbf{v},\mathbf{v}^{[k]}\rangle|,|\langle\mathbf{u},\mathbf{u}^{[k]}\rangle|\to 1. Therefore, we expect a phase transition from 1 to 0 at the critical threshold kn5/3k\asymp n^{5/3}.

We first focus on rectangular data matrices. For concreteness, we set p/n=0.25p/n=0.25 and n=1000n=1000. As shown in Figure 1, there is a clear phase transition in which the eigenvector overlap varies from 11 to 0. The figure also provides good evidence that the transition happens at the critical threshold kn5/3k\asymp n^{5/3}.
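The experiment behind Figure 1 can be reproduced with a short script. The sketch below uses a smaller matrix size and fewer repetitions than in the figure to keep the runtime modest, and all parameter choices are ours; at such small sizes the transition is smeared out, and larger n is needed to see it sharpen around k ≍ n^{5/3}.

```python
import numpy as np

def overlap_after_resampling(n, p, k, rng):
    """|<v, v^{[k]}>| for one Gaussian data matrix and one resampling of k entries."""
    X = rng.standard_normal((n, p)) / np.sqrt(p)
    Xk = X.copy()
    flat = rng.choice(n * p, size=k, replace=False)
    r, c = np.unravel_index(flat, (n, p))
    Xk[r, c] = rng.standard_normal(k) / np.sqrt(p)
    v = np.linalg.eigh(X.T @ X)[1][:, -1]
    vk = np.linalg.eigh(Xk.T @ Xk)[1][:, -1]
    return abs(v @ vk)

rng = np.random.default_rng(3)
n, p = 400, 100
for expo in (1.2, 1.4, 1.6, 1.8, 2.0):
    k = min(int(n ** expo), n * p)
    vals = [overlap_after_resampling(n, p, k, rng) for _ in range(10)]
    print(f"k = n^{expo:.1f} = {k:6d}:  mean overlap = {np.mean(vals):.3f}")
```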

(a) The inner product |𝐯,𝐯[k]||\langle\mathbf{v},\mathbf{v}^{[k]}\rangle|
(b) The inner product |𝐮,𝐮[k]||\langle\mathbf{u},\mathbf{u}^{[k]}\rangle|
Figure 1: Inner products of the leading eigenvectors for 1000×2501000\times 250 matrices with Gaussian and Bernoulli data. The horizontal axis is the resampling strength, given by logn(4k)\log_{n}(4k). Each experiment is averaged over 50 repetitions.

We also check square 1000×10001000\times 1000 data matrices. As shown in Figure 2, for both Gaussian data and Bernoulli data, we observe the same phase transition for the overlap of the leading eigenvectors. Again, this numerical simulation supports our theoretical prediction that the transition happens at the critical threshold kn5/3k\asymp n^{5/3}.

(a) The inner product |𝐯,𝐯[k]||\langle\mathbf{v},\mathbf{v}^{[k]}\rangle|
(b) The inner product |𝐮,𝐮[k]||\langle\mathbf{u},\mathbf{u}^{[k]}\rangle|
Figure 2: Inner products of the leading eigenvectors for 1000×10001000\times 1000 matrices with Gaussian and Bernoulli data. The horizontal axis is the resampling strength, given by logn(k)\log_{n}(k). Each experiment is averaged over 50 repetitions.

We also check the performance of the PCA-Recovery algorithm for database alignment. Again, we consider both Gaussian data and Bernoulli data. As shown in Figure 3, in both the rectangular case and the square case, the accuracy of the algorithm is almost 0 as long as the resampling strength kk surpasses n5/3n^{5/3}. This is consistent with the theoretical prediction that in this case the top eigenvectors approximately decorrelate. On the other hand, numerical simulations suggest that PCA-Recovery is brittle to resampling; in particular, we cannot expect an all-or-nothing phenomenon similar to that in [GLM22]. Identifying the critical threshold below which PCA-Recovery can achieve almost perfect recovery is an interesting open problem for future research.

(a) Fraction of correct matching for rectangular data
(b) Fraction of correct matching for square data
Figure 3: Performance of the PCA algorithm for database alignment with adversarial corruption.

Appendix A Notations and Organization

We use CC to denote a generic constant, which may differ from one appearance to another. We write ABA\lesssim B if there exists a universal C>0C>0 such that ACBA\leq CB, and ABA\gtrsim B if ACBA\geq CB for some universal C>0C>0. We write ABA\asymp B if ABA\lesssim B and BAB\lesssim A.

For the analysis of the sample covariance matrix, it is useful to apply the linearization trick (see e.g. [Tro12, DY18, FWZ18]). Specifically, we will also analyze the symmetrization of 𝐗\mathbf{X}, which is defined as

𝐗~:=(0𝐗𝐗0)\widetilde{\mathbf{X}}:=\left(\begin{matrix}0&\mathbf{X}\\ \mathbf{X}^{\top}&0\end{matrix}\right) (5)

The spectrum of the symmetrization 𝐗~\widetilde{\mathbf{X}} is given by the singular values {λm}m=1p\{\sqrt{\lambda_{m}}\}_{m=1}^{p}, the symmetrized singular values {λm}m=1p\{-\sqrt{\lambda_{m}}\}_{m=1}^{p}, and the trivial eigenvalue 0 with multiplicity npn-p. Let 𝐰i:=(𝐮i,𝐯i)n+p\mathbf{w}_{i}:=(\mathbf{u}_{i}^{\top},\mathbf{v}_{i}^{\top})^{\top}\in\mathbb{R}^{n+p} be the concatenation of the eigenvectors 𝐮i\mathbf{u}_{i} and 𝐯i\mathbf{v}_{i}. Then 𝐰i\mathbf{w}_{i} is an eigenvector of 𝐗~\widetilde{\mathbf{X}} associated with the eigenvalue λi\sqrt{\lambda_{i}}. Indeed, we have

(0𝐗𝐗0)(𝐮i𝐯i)=(𝐗vi𝐗𝐮i)=(λi𝐮iλi𝐯i).\left(\begin{matrix}0&\mathbf{X}\\ \mathbf{X}^{\top}&0\end{matrix}\right)\left(\begin{matrix}\mathbf{u}_{i}\\ \mathbf{v}_{i}\end{matrix}\right)=\left(\begin{matrix}\mathbf{X}v_{i}\\ \mathbf{X}^{\top}\mathbf{u}_{i}\end{matrix}\right)=\left(\begin{matrix}\sqrt{\lambda_{i}}\mathbf{u}_{i}\\ \sqrt{\lambda_{i}}\mathbf{v}_{i}\end{matrix}\right).
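The displayed eigen-relation is straightforward to confirm numerically; the minimal check below (with Gaussian data and variable names of our own) verifies it for the top singular triplet.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 120, 80
X = rng.standard_normal((n, p)) / np.sqrt(p)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_tilde = np.block([[np.zeros((n, n)), X], [X.T, np.zeros((p, p))]])

# w_1 = (u_1, v_1) satisfies X_tilde w_1 = sqrt(lambda_1) w_1, with sqrt(lambda_1) = s[0]
w = np.concatenate([U[:, 0], Vt[0]])
print(np.max(np.abs(X_tilde @ w - s[0] * w)))   # ~ machine precision
```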

An important probabilistic concept that will be used repeatedly is the notion of overwhelming probability.

Definition 1 (Overwhelming probability).

Let {N}\{{\mathcal{E}}_{N}\} be a sequence of events. We say that N{\mathcal{E}}_{N} holds with overwhelming probability if for any (large) D>0D>0, there exists N0(D)N_{0}(D) such that for all NN0(D)N\geq N_{0}(D) we have

(N)1ND.\mathbb{P}({\mathcal{E}}_{N})\geq 1-N^{-D}.

Organization

The appendix is organized as follows. In Section B, we collect some useful tools for the proof, including a variance formula for resampling and classical results from random matrix theory. In Section C, we prove the sensitivity of PCA under excessive resampling. In Section D, we prove that PCA is stable if resampling of the data is moderate.

Appendix B Preliminaries

B.1 Variance formula and resampling

An essential technique for our proof is a formula for the variance of a function of independent random variables. This formula represents the variance via resampling of the random variables. The idea is originally due to Chatterjee [Cha05], and in this paper we use a slight extension of it as in [BLZ20].

Let X1,,XNX_{1},\cdots,X_{N} be independent random variables taking values in some set 𝒳\mathcal{X}, and let f:𝒳Nf:\mathcal{X}^{N}\to\mathbb{R} be some measurable function. Let X=(X1,,XN)X=\left(X_{1},\cdots,X_{N}\right) and XX^{\prime} be an independent copy of XX. We denote

X(i)=(X1,,Xi1,Xi,Xi+1,,XN),andX[i]=(X1,,Xi,Xi+1,,XN).X^{(i)}=(X_{1},\cdots,X_{i-1},X_{i}^{\prime},X_{i+1},\cdots,X_{N}),\ \ \mbox{and}\ \ X^{[i]}=(X_{1}^{\prime},\cdots,X_{i}^{\prime},X_{i+1},\cdots,X_{N}).

In general, for A[N]A\subset[N], we define XAX^{A} to be the random vector obtained from XX by replacing the components indexed by AA with the corresponding components of XX^{\prime}. By a result of Chatterjee [Cha05], we have the following variance formula

𝖵𝖺𝗋(f(X))=12i=1N𝔼[(f(X)f(X(i)))(f(X[i1])f(X[i]))].\mathsf{Var}\left(f(X)\right)=\dfrac{1}{2}\sum_{i=1}^{N}\mathbb{E}\left[\left(f(X)-f(X^{(i)})\right)\left(f(X^{[i-1]})-f(X^{[i]})\right)\right].

We remark that this variance formula does not depend on the order of the random variables. Therefore, we can consider an arbitrary permutation of [N][N]. Specifically, let π=(π(1),,π(N))\pi=(\pi(1),\cdots,\pi(N)) be a random permutation sampled uniformly from the symmetric group 𝒮N{\mathcal{S}}_{N} and denote π([i]):={π(1),,π(i)}\pi([i]):=\{\pi(1),\cdots,\pi(i)\}. Then we have

𝖵𝖺𝗋(f(X))=12i=1N𝔼[(f(X)f(X(π(i))))(f(Xπ([i1]))f(Xπ([i])))].\mathsf{Var}\left(f(X)\right)=\dfrac{1}{2}\sum_{i=1}^{N}\mathbb{E}\left[\left(f(X)-f(X^{(\pi(i))})\right)\left(f(X^{\pi([i-1])})-f(X^{\pi([i])})\right)\right].

Note that, in the formula above, the expectation is taken with respect to XX, XX^{\prime}, and the random permutation π\pi.
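Chatterjee's identity can be sanity-checked by Monte Carlo on a toy function; the example below uses f(X) = max_i X_i with i.i.d. standard Gaussians and the identity ordering (a valid special case of the permuted formula). The choice of f, the sample sizes, and all names are ours; the two printed numbers should agree up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(5)
N, trials = 8, 50_000
f = lambda x: float(np.max(x))   # any measurable f works; max is a simple choice

acc = 0.0
for _ in range(trials):
    X = rng.standard_normal(N)
    Xp = rng.standard_normal(N)                            # independent copy X'
    for i in range(N):                                     # python i = math index m - 1
        X_m = X.copy()
        X_m[i] = Xp[i]                                     # X^{(m)}: only coordinate m resampled
        X_before = np.concatenate([Xp[:i], X[i:]])         # X^{[m-1]}: coords 1..m-1 resampled
        X_after = np.concatenate([Xp[:i + 1], X[i + 1:]])  # X^{[m]}: coords 1..m resampled
        acc += (f(X) - f(X_m)) * (f(X_before) - f(X_after))
var_formula = 0.5 * acc / trials

# direct Monte Carlo estimate of Var(f(X)) for comparison
direct = np.max(rng.standard_normal((trials, N)), axis=1).var()
print(var_formula, direct)
```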

Let jj be uniformly distributed over [N][N]. Let X(j)π([i1])X^{(j)\circ\pi([i-1])} denote the vector obtained from Xπ([i1])X^{\pi([i-1])} by replacing its jj-th component by another independent copy of the random variable XjX_{j} in the following way: if jj belongs to π([i1])\pi([i-1]), then we replace XjX_{j}^{\prime} by Xj′′X_{j}^{\prime\prime}; if jj is not in π([i1])\pi([i-1]), then we replace XjX_{j} by Xj′′′X_{j}^{\prime\prime\prime}. Here X′′X^{\prime\prime} and X′′′X^{\prime\prime\prime} are independent copies of XX such that (X,X,X′′,X′′′)(X,X^{\prime},X^{\prime\prime},X^{\prime\prime\prime}) are independent. With this notation, we have the following estimate.

Lemma B.1 (Lemma 3 in [BLZ20]).

Assume that jj is chosen uniformly at random from the set [N][N], independently of the other random variables involved. Then for any k[N]k\in[N],

Bk2𝖵𝖺𝗋(f(X))k(N+1N)B_{k}\leq\dfrac{2\mathsf{Var}\left(f(X)\right)}{k}\left(\dfrac{N+1}{N}\right)

where for any i[N]i\in[N],

Bi:=𝔼[(f(X)f(X(j)))(f(Xπ([i1]))f(X(j)π([i1])))]B_{i}:=\mathbb{E}\left[\left(f(X)-f(X^{(j)})\right)\left(f(X^{\pi([i-1])})-f(X^{(j)\circ\pi([i-1])})\right)\right]

and the expectation is taken with respect to the components of the vectors, the random permutation π\pi, and the random variable jj.

B.2 Tools from random matrix theory

In this section we collect some classical results in random matrix theory, which will be indispensable for proving the main theorems. These include concentration of the top eigenvalue, eigenvalue rigidity estimates, and eigenvector delocalization.

To begin with, we first state some basic settings and notations. It is well known that the empirical distribution of the spectrum of the sample covariance matrix converges to the Marchenko-Pastur distribution

ρ𝖬𝖯(x)=12πξ[(xλ)(λ+x)]+x2,\rho_{\mathsf{MP}}(x)=\dfrac{1}{2\pi\xi}\sqrt{\dfrac{[(x-\lambda_{-})(\lambda_{+}-x)]_{+}}{x^{2}}}, (6)

where the endpoints of the spectrum are given by

λ±=(1±ξ)2.\lambda_{\pm}=(1\pm\sqrt{\xi})^{2}. (7)

Define the typical locations of the eigenvalues:

γm:=inf{E>0:Eρ𝖬𝖯(x)dxmp}, 1mp.\gamma_{m}:=\inf\left\{E>0:\int_{-\infty}^{E}\rho_{\mathsf{MP}}(x){\rm d}x\geq\dfrac{m}{p}\right\},\ \ \ 1\leq m\leq p.

A classical result in random matrix theory is the rigidity estimate for the eigenvalues [PY14, BEK+14]. Let m^:=min(m,p+1m)\widehat{m}:=\min(m,p+1-m). For any small ε>0\varepsilon>0 and large D>0D>0, there exists n0(ε,D)n_{0}(\varepsilon,D) such that the following holds for any nn0n\geq n_{0},

(|λmγm|n23+ε(m^)13for all 1mp)>1nD.\mathbb{P}\left(|\lambda_{m}-\gamma_{m}|\leq n^{-\frac{2}{3}+\varepsilon}(\widehat{m})^{-\frac{1}{3}}\ \mbox{for all}\ 1\leq m\leq p\right)>1-n^{-D}. (8)

We remark that the square case ξ1\xi\equiv 1 is actually significantly different, due to the singularity of the Marchenko-Pastur law at the spectral edge x=0x=0. Near this edge, the typical eigenvalue spacing would be of order O(n2)O(n^{-2}). In this case, it would be more convenient to state the rigidity estimates in terms of the singular values of the n×nn\times n data matrix. The following result was proved in [AEK14]. For any small ε>0\varepsilon>0 and large D>0D>0, there exists n0(ε,D)n_{0}(\varepsilon,D) such that the following is true for any nn0n\geq n_{0},

(|λmγm|n23+ε(n+1m)13for all 1mn)>1nD.\mathbb{P}\left(|\sqrt{\lambda_{m}}-\sqrt{\gamma_{m}}|\leq n^{-\frac{2}{3}+\varepsilon}(n+1-m)^{-\frac{1}{3}}\ \mbox{for all}\ 1\leq m\leq n\right)>1-n^{-D}. (9)

The intuition for this case is that the singular values of a square data matrix behave like the eigenvalues of a Wigner matrix. More specifically, the singular values {λm}\{\sqrt{\lambda_{m}}\} and their symmetrization {λm}\{-\sqrt{\lambda_{m}}\} are the eigenvalues of the symmetrized matrix 𝐗~\widetilde{\mathbf{X}} defined in (5), and 𝐗~\widetilde{\mathbf{X}} can be seen as a 2n×2n2n\times 2n Wigner matrix with an imprimitive variance profile (see [AEK14]). This is explained in more detail in [Wan19, Wan22].

Another important result is the Tracy-Widom limit for the top eigenvalue (see e.g. [PY14, DY18, Wan19, SX21]). Specifically,

Lemma B.2.

For any ε>0\varepsilon>0, with overwhelming probability, we have

|λλ+|n2/3+ε,and𝖵𝖺𝗋(λ)n4/3.|\lambda-\lambda_{+}|\leq n^{-2/3+\varepsilon},\ \ \mbox{and}\ \ \mathsf{Var}(\lambda)\lesssim n^{-4/3}.

Moreover, for any δ>0\delta>0, there exists a constant c0>0c_{0}>0 such that

(λ1λ2c0n2/3)1δ.\mathbb{P}\left(\lambda_{1}-\lambda_{2}\geq c_{0}n^{-2/3}\right)\geq 1-\delta.
Proof.

The first result follows from the well-known Tracy-Widom limit for the top eigenvalue. Specifically,

limn(γp2/3(λλ+)s)=F1(s),\lim_{n\to\infty}\mathbb{P}\left(\gamma p^{2/3}(\lambda-\lambda_{+})\leq s\right)=F_{1}(s),

where γ\gamma is a constant depending only on the ratio ξ\xi and F1F_{1} is the type-1 Tracy-Widom distribution (in particular, [Wan19, SX21] provided quantitative rates of convergence). For the variance part, the Gaussian case was proved in [LR10], and the general case follows from universality, i.e. for any fixed mm

limn((p2/3(λλ+)s)1m)=limnG((p2/3(λλ+)s)1m),\lim_{n\to\infty}\mathbb{P}\left(\left(p^{2/3}(\lambda_{\ell}-\lambda_{+})\leq s_{\ell}\right)_{1\leq\ell\leq m}\right)=\lim_{n\to\infty}\mathbb{P}^{G}\left(\left(p^{2/3}(\lambda_{\ell}-\lambda_{+})\leq s_{\ell}\right)_{1\leq\ell\leq m}\right),

where G\mathbb{P}^{G} denotes the probability measure associated with the Gaussian matrix. The spectral gap estimate also follows from universality, namely that the edge spectral statistics of the sample covariance matrix agree with those of the Gaussian Orthogonal Ensemble (GOE), i.e. for any fixed mm

limn((γp2/3(λλ+)s)1m)=limn((p2/3(λGOE2)s)1m)\lim_{n\to\infty}\mathbb{P}\left(\left(\gamma p^{2/3}(\lambda_{\ell}-\lambda_{+})\leq s_{\ell}\right)_{1\leq\ell\leq m}\right)=\lim_{n\to\infty}\mathbb{P}\left(\left(p^{2/3}(\lambda_{\ell}^{GOE}-2)\leq s_{\ell}\right)_{1\leq\ell\leq m}\right)

For GOE, the desired spectral gap estimate was proved in e.g. [AGZ10]. ∎

Moreover, an estimate on the eigenvalue gap near the spectral edge is needed. The following result was proved in [TV12, Wan12].

Lemma B.3.

For any c>0c>0, there exists κ>0\kappa>0 such that for every 1ip1\leq i\leq p, with probability at least 1nκ1-n^{-\kappa}, we have

|λiλi+1|n1c.|\lambda_{i}-\lambda_{i+1}|\geq n^{-1-c}.

The property of eigenvectors is also a key ingredient for our proof. In particular, we extensively rely on the following delocalization property (see e.g. [PY14, BEK+14, DY18, Wan22])

Lemma B.4.

For any ε>0\varepsilon>0, with overwhelming probability, we have

max1mp𝐯m+max1in𝐮in12+ε.\max_{1\leq m\leq p}\|\mathbf{v}_{m}\|_{\infty}+\max_{1\leq i\leq n}\|\mathbf{u}_{i}\|_{\infty}\leq n^{-\frac{1}{2}+\varepsilon}.

Appendix C Proofs for the Sensitivity Regime

C.1 Sensitivity analysis for neighboring data matrices

As a first step, we show that resampling a single entry has little perturbation effect on the top eigenvectors. This will be helpful for controlling the single-entry resampling term in the variance formula (see Lemma B.1).

For any fixed 1in1\leq i\leq n and 1αp1\leq\alpha\leq p, let 𝐗(i,α)\mathbf{X}_{(i,\alpha)} be the matrix obtained from 𝐗\mathbf{X} by replacing the (i,α)(i,\alpha) entry 𝐗iα\mathbf{X}_{i\alpha} with 𝐗iα\mathbf{X}_{i\alpha}^{\prime}. Define the corresponding covariance matrix 𝐇(i,α):=𝐗(i,α)𝐗(i,α)\mathbf{H}_{(i,\alpha)}:=\mathbf{X}_{(i,\alpha)}^{\top}\mathbf{X}_{(i,\alpha)}, and use 𝐯(i,α)\mathbf{v}^{(i,\alpha)} to denote its unit top eigenvector. Similarly, we denote by 𝐮(i,α)\mathbf{u}^{(i,\alpha)} the unit top eigenvector of 𝐇^(i,α):=𝐗(i,α)𝐗(i,α)\widehat{\mathbf{H}}_{(i,\alpha)}:=\mathbf{X}_{(i,\alpha)}\mathbf{X}_{(i,\alpha)}^{\top}.

Lemma C.1.

Let c>0c>0 be small and 0<δ<12c0<\delta<\frac{1}{2}-c. For all 1in1\leq i\leq n and 1αp1\leq\alpha\leq p, on the event {λ1λ2n1c}\{\lambda_{1}-\lambda_{2}\geq n^{-1-c}\}, with overwhelming probability

maxi,αmins{±1}𝐯s𝐯(i,α)n12δ\max_{i,\alpha}\min_{s\in\left\{\pm 1\right\}}\|\mathbf{v}-s\mathbf{v}^{(i,\alpha)}\|_{\infty}\leq n^{-\frac{1}{2}-\delta} (10)

and similarly

maxi,αmins{±1}𝐮s𝐮(i,α)n12δ\max_{i,\alpha}\min_{s\in\left\{\pm 1\right\}}\|\mathbf{u}-s\mathbf{u}^{(i,\alpha)}\|_{\infty}\leq n^{-\frac{1}{2}-\delta}
Proof.

Let λ1(i,α)λp(i,α)\lambda_{1}^{(i,\alpha)}\geq\cdots\geq\lambda_{p}^{(i,\alpha)} denote the eigenvalues of the matrix 𝐇(i,α)\mathbf{H}_{(i,\alpha)}, and let 𝐯β(i,α)\mathbf{v}_{\beta}^{(i,\alpha)} denote the unit eigenvector associated with the eigenvalue λβ(i,α)\lambda_{\beta}^{(i,\alpha)}. Similarly, we define the unit eigenvectors {𝐮j(i,α)}\{\mathbf{u}_{j}^{(i,\alpha)}\} for the matrix 𝐇^(i,α)\widehat{\mathbf{H}}_{(i,\alpha)}. Using the variational characterization of the eigenvalues, we have

λ1𝐯1(i,α),𝐇𝐯1(i,α)=λ1(i,α)+𝐯1(i,α),(𝐇𝐇(i,α))𝐯1(i,α).\lambda_{1}\geq\langle\mathbf{v}_{1}^{(i,\alpha)},\mathbf{H}\mathbf{v}_{1}^{(i,\alpha)}\rangle=\lambda_{1}^{(i,\alpha)}+\langle\mathbf{v}_{1}^{(i,\alpha)},(\mathbf{H}-\mathbf{H}_{(i,\alpha)})\mathbf{v}_{1}^{(i,\alpha)}\rangle.

Since 𝐗\mathbf{X} and 𝐗(i,α)\mathbf{X}_{(i,\alpha)} differ only at the (i,α)(i,\alpha) entry, we have

(𝐇𝐇(i,α))βγ=(𝐗𝐗𝐗(i,α)𝐗(i,α))βγ={(𝐗iα𝐗iα)𝐗iγifβ=α,γα,(𝐗iα𝐗iα)𝐗iβifβα,γ=α,𝐗iα2(𝐗iα)2ifβ=α,γ=α,0otherwise.(\mathbf{H}-\mathbf{H}_{(i,\alpha)})_{\beta\gamma}=(\mathbf{X}^{\top}\mathbf{X}-\mathbf{X}_{(i,\alpha)}^{\top}\mathbf{X}_{(i,\alpha)})_{\beta\gamma}=\begin{cases}(\mathbf{X}_{i\alpha}-\mathbf{X}_{i\alpha}^{\prime})\mathbf{X}_{i\gamma}\ \ &\mbox{if}\ \beta=\alpha,\gamma\neq\alpha,\\ (\mathbf{X}_{i\alpha}-\mathbf{X}_{i\alpha}^{\prime})\mathbf{X}_{i\beta}\ \ &\mbox{if}\ \beta\neq\alpha,\gamma=\alpha,\\ \mathbf{X}_{i\alpha}^{2}-(\mathbf{X}_{i\alpha}^{\prime})^{2}\ \ &\mbox{if}\ \beta=\alpha,\gamma=\alpha,\\ 0\ \ &\mbox{otherwise}.\end{cases}

Thus, setting Δiα:=𝐗iα𝐗iα\Delta_{i\alpha}:=\mathbf{X}_{i\alpha}-\mathbf{X}_{i\alpha}^{\prime}, we have

𝐯1(i,α),(𝐇𝐇(i,α))𝐯1(i,α)\displaystyle\langle\mathbf{v}_{1}^{(i,\alpha)},(\mathbf{H}-\mathbf{H}_{(i,\alpha)})\mathbf{v}_{1}^{(i,\alpha)}\rangle (11)
=2𝐯1(i,α)(α)Δiα(β=1p(𝐗(i,α))iβ𝐯1(i,α)(β)𝐗iα𝐯1(i,α)(α))+(𝐯1(i,α)(α))2(𝐗iα2(𝐗iα)2)\displaystyle\quad=2\mathbf{v}_{1}^{(i,\alpha)}(\alpha)\Delta_{i\alpha}\left(\sum_{\beta=1}^{p}(\mathbf{X}_{(i,\alpha)})_{i\beta}\mathbf{v}_{1}^{(i,\alpha)}(\beta)-\mathbf{X}_{i\alpha}^{\prime}\mathbf{v}_{1}^{(i,\alpha)}(\alpha)\right)+\left(\mathbf{v}_{1}^{(i,\alpha)}(\alpha)\right)^{2}(\mathbf{X}_{i\alpha}^{2}-(\mathbf{X}_{i\alpha}^{\prime})^{2})
=2𝐯1(i,α)(α)Δiα(𝐗(i,α)𝐯1(i,α))(i)+(𝐯1(i,α)(α))2(𝐗iα2(𝐗iα)22(𝐗iα𝐗iα)𝐗iα)\displaystyle\quad=2\mathbf{v}_{1}^{(i,\alpha)}(\alpha)\Delta_{i\alpha}\left(\mathbf{X}_{(i,\alpha)}\mathbf{v}_{1}^{(i,\alpha)}\right)(i)+\left(\mathbf{v}_{1}^{(i,\alpha)}(\alpha)\right)^{2}\left(\mathbf{X}_{i\alpha}^{2}-(\mathbf{X}_{i\alpha}^{\prime})^{2}-2(\mathbf{X}_{i\alpha}-\mathbf{X}_{i\alpha}^{\prime})\mathbf{X}_{i\alpha}^{\prime}\right)
=2λ1(i,α)𝐯1(i,α)(α)𝐮1(i,α)(i)Δiα+(𝐯1(i,α)(α)2)Δiα2.\displaystyle\quad=2\sqrt{\lambda_{1}^{(i,\alpha)}}\mathbf{v}_{1}^{(i,\alpha)}(\alpha)\mathbf{u}_{1}^{(i,\alpha)}(i)\Delta_{i\alpha}+\left(\mathbf{v}_{1}^{(i,\alpha)}(\alpha)^{2}\right)\Delta_{i\alpha}^{2}.

This gives us

λ1λ1(i,α)2λ1(i,α)(|𝐗iα|+|𝐗iα|)𝐯1(i,α)𝐮1(i,α)(|𝐗iα|+|𝐗iα|)2𝐯1(i,α)2.\lambda_{1}\geq\lambda_{1}^{(i,\alpha)}-2\sqrt{\lambda_{1}^{(i,\alpha)}}\left(|\mathbf{X}_{i\alpha}|+|\mathbf{X}_{i\alpha}^{\prime}|\right)\|\mathbf{v}_{1}^{(i,\alpha)}\|_{\infty}\|\mathbf{u}_{1}^{(i,\alpha)}\|_{\infty}-\left(|\mathbf{X}_{i\alpha}|+|\mathbf{X}_{i\alpha}^{\prime}|\right)^{2}\|\mathbf{v}_{1}^{(i,\alpha)}\|_{\infty}^{2}. (12)

Similarly,

λ1(i,α)λ12λ1(|𝐗iα|+|𝐗iα|)𝐯1𝐮1(|𝐗iα|+|𝐗iα|)2𝐯12.\lambda_{1}^{(i,\alpha)}\geq\lambda_{1}-2\sqrt{\lambda_{1}}\left(|\mathbf{X}_{i\alpha}|+|\mathbf{X}_{i\alpha}^{\prime}|\right)\|\mathbf{v}_{1}\|_{\infty}\|\mathbf{u}_{1}\|_{\infty}-\left(|\mathbf{X}_{i\alpha}|+|\mathbf{X}_{i\alpha}^{\prime}|\right)^{2}\|\mathbf{v}_{1}\|_{\infty}^{2}. (13)

From the sub-exponential decay assumption (2), we know |𝐗iα|,|𝐗iα|n1/2+ε|\mathbf{X}_{i\alpha}|,|\mathbf{X}_{i\alpha}^{\prime}|\leq n^{-1/2+\varepsilon} with overwhelming probability for any ε>0\varepsilon>0. Moreover, by the delocalization of eigenvectors, with overwhelming probability, we have

max(𝐯1,𝐮1,𝐯1(i,α),𝐮1(i,α))n12+ε\max\left(\|\mathbf{v}_{1}\|_{\infty},\|\mathbf{u}_{1}\|_{\infty},\|\mathbf{v}_{1}^{(i,\alpha)}\|_{\infty},\|\mathbf{u}_{1}^{(i,\alpha)}\|_{\infty}\right)\leq n^{-\frac{1}{2}+\varepsilon}

Moreover, by the rigidity estimates (8), with overwhelming probability we have

|λ1λ+|n23+ε,|λ1(i,α)λ+|n23+ε|\lambda_{1}-\lambda_{+}|\leq n^{-\frac{2}{3}+\varepsilon},\ \ |\lambda_{1}^{(i,\alpha)}-\lambda_{+}|\leq n^{-\frac{2}{3}+\varepsilon}

Therefore, combining with a union bound, the above two inequalities (12) and (13) together yield

max1in,1αp|λ1λ1(i,α)|n3/2+3ε\max_{1\leq i\leq n,1\leq\alpha\leq p}|\lambda_{1}-\lambda_{1}^{(i,\alpha)}|\leq n^{-3/2+3\varepsilon} (14)

with overwhelming probability.

We write 𝐯1(i,α)\mathbf{v}_{1}^{(i,\alpha)} in the orthonormal basis of eigenvectors {𝐯β}\left\{\mathbf{v}_{\beta}\right\}:

𝐯1(i,α)=β=1paβ𝐯β.\mathbf{v}_{1}^{(i,\alpha)}=\sum_{\beta=1}^{p}a_{\beta}\mathbf{v}_{\beta}.

Using this representation and the spectral theorem,

β=1pλβaβ𝐯β=𝐇𝐯1(i,α)=(𝐇𝐇(i,α))𝐯1(i,α)+(λ1(i,α)λ1)𝐯1(i,α)+λ1𝐯1(i,α).\sum_{\beta=1}^{p}\lambda_{\beta}a_{\beta}\mathbf{v}_{\beta}=\mathbf{H}\mathbf{v}_{1}^{(i,\alpha)}=\left(\mathbf{H}-\mathbf{H}_{(i,\alpha)}\right)\mathbf{v}_{1}^{(i,\alpha)}+\left(\lambda_{1}^{(i,\alpha)}-\lambda_{1}\right)\mathbf{v}_{1}^{(i,\alpha)}+\lambda_{1}\mathbf{v}_{1}^{(i,\alpha)}.

As a consequence,

λ1𝐯1(i,α)=β=1pλβ𝐚β𝐯β+(𝐇(i,α)𝐇)𝐯1(i,α)+(λ1λ1(i,α))𝐯1(i,α).\lambda_{1}\mathbf{v}_{1}^{(i,\alpha)}=\sum_{\beta=1}^{p}\lambda_{\beta}\mathbf{a}_{\beta}\mathbf{v}_{\beta}+\left(\mathbf{H}_{(i,\alpha)}-\mathbf{H}\right)\mathbf{v}_{1}^{(i,\alpha)}+\left(\lambda_{1}-\lambda_{1}^{(i,\alpha)}\right)\mathbf{v}_{1}^{(i,\alpha)}.

For β1\beta\neq 1, taking inner product with 𝐯β\mathbf{v}_{\beta} yields

λ1aβ=𝐯β,λ1𝐯1(i,α)=λβaβ+𝐯β,(𝐇(i,α)𝐇)𝐯1(i,α)+(λ1λ1(i,α))aβ,\lambda_{1}a_{\beta}=\langle\mathbf{v}_{\beta},\lambda_{1}\mathbf{v}_{1}^{(i,\alpha)}\rangle=\lambda_{\beta}a_{\beta}+\langle\mathbf{v}_{\beta},(\mathbf{H}_{(i,\alpha)}-\mathbf{H})\mathbf{v}_{1}^{(i,\alpha)}\rangle+(\lambda_{1}-\lambda_{1}^{(i,\alpha)})a_{\beta},

which implies

((λ1λβ)+(λ1(i,α)λ1))aβ=𝐯β,(𝐇(i,α)𝐇)𝐯1(i,α).\left((\lambda_{1}-\lambda_{\beta})+(\lambda_{1}^{(i,\alpha)}-\lambda_{1})\right)a_{\beta}=\langle\mathbf{v}_{\beta},(\mathbf{H}_{(i,\alpha)}-\mathbf{H})\mathbf{v}_{1}^{(i,\alpha)}\rangle. (15)

By a similar computation as in (11), we have

|𝐯β,(𝐇(i,α)𝐇)𝐯1(i,α)|\displaystyle\left|{\langle\mathbf{v}_{\beta},(\mathbf{H}_{(i,\alpha)}-\mathbf{H})\mathbf{v}_{1}^{(i,\alpha)}\rangle}\right| =|Δiα(λ1(i,α)𝐯β(α)𝐮1(i,α)(i)+λβ𝐯1(i,α)(α)𝐮β(i))|\displaystyle=\left|{\Delta_{i\alpha}\left(\sqrt{\lambda_{1}^{(i,\alpha)}}\mathbf{v}_{\beta}(\alpha)\mathbf{u}_{1}^{(i,\alpha)}(i)+\sqrt{\lambda_{\beta}}\mathbf{v}_{1}^{(i,\alpha)}(\alpha)\mathbf{u}_{\beta}(i)\right)}\right| (16)
(|𝐗iα|+|𝐗iα|)(𝐯β𝐮1(i,α)+𝐯1(i,α)𝐮β)\displaystyle\lesssim\left(|\mathbf{X}_{i\alpha}|+|\mathbf{X}_{i\alpha}^{\prime}|\right)\left(\|\mathbf{v}_{\beta}\|_{\infty}\|\mathbf{u}_{1}^{(i,\alpha)}\|_{\infty}+\|\mathbf{v}_{1}^{(i,\alpha)}\|_{\infty}\|\mathbf{u}_{\beta}\|_{\infty}\right)
n32+3ε\displaystyle\leq n^{-\frac{3}{2}+3\varepsilon}

with overwhelming probability, where the second step follows from rigidity of eigenvalues and the last step follows from the sub-exponential decay assumption and delocalization of eigenvectors.

Consider the event :={λ1λ2n1c}{\mathcal{E}}:=\{\lambda_{1}-\lambda_{2}\geq n^{-1-c}\}. Fix some ω>0\omega>0 small. By rigidity of eigenvalues (8), on the event {\mathcal{E}}, with overwhelming probability

λ1λβ{n1cif 2βnω,β2/3n2/3ifnω<βp.\lambda_{1}-\lambda_{\beta}\gtrsim\begin{cases}n^{-1-c}&\mbox{if}\ \ 2\leq\beta\leq n^{\omega},\\ \beta^{2/3}n^{-2/3}&\mbox{if}\ \ n^{\omega}<\beta\leq p.\end{cases} (17)

On the event {\mathcal{E}}, using (14), (16) and (17), with overwhelming probability we have

|aβ|{n12+c+3εif 2βnω,β23n56+3εifnω<βp.|a_{\beta}|\lesssim\begin{cases}n^{-\frac{1}{2}+c+3\varepsilon}&\mbox{if}\ \ 2\leq\beta\leq n^{\omega},\\ \beta^{-\frac{2}{3}}n^{-\frac{5}{6}+3\varepsilon}&\mbox{if}\ \ n^{\omega}<\beta\leq p.\end{cases} (18)

Choose s=a1/|a1|s=a_{1}/|a_{1}|. Note that 1|a1|β=2p|aβ|1-|a_{1}|\leq\sum_{\beta=2}^{p}|a_{\beta}|. Thanks to the delocalization of eigenvectors, with overwhelming probability, we have

s𝐯1𝐯1(i,α)=(sa1)𝐯1β=2paβ𝐯β(1|a1|)𝐯1+β=2p|aβ|𝐯βn12+εβ=2p|aβ|.\|s\mathbf{v}_{1}-\mathbf{v}_{1}^{(i,\alpha)}\|_{\infty}=\bigg{\|}(s-a_{1})\mathbf{v}_{1}-\sum_{\beta=2}^{p}a_{\beta}\mathbf{v}_{\beta}\bigg{\|}_{\infty}\leq(1-|a_{1}|)\|\mathbf{v}_{1}\|_{\infty}+\sum_{\beta=2}^{p}|a_{\beta}|\|\mathbf{v}_{\beta}\|_{\infty}\lesssim n^{-\frac{1}{2}+\varepsilon}\sum_{\beta=2}^{p}|a_{\beta}|.

Thus, on the event {\mathcal{E}}, it follows from (18) that

s𝐯1𝐯1(i,α)\displaystyle\|s\mathbf{v}_{1}-\mathbf{v}_{1}^{(i,\alpha)}\|_{\infty} n12+ε(n12+3ε+c+ω+n56+3εβ=nωpβ23)\displaystyle\lesssim n^{-\frac{1}{2}+\varepsilon}\bigg{(}n^{-\frac{1}{2}+3\varepsilon+c+\omega}+n^{-\frac{5}{6}+3\varepsilon}\sum_{\beta=n^{\omega}}^{p}\beta^{-\frac{2}{3}}\bigg{)}
n1+4ε+c+ω+n1+4ε.\displaystyle\lesssim n^{-1+4\varepsilon+c+\omega}+n^{-1+4\varepsilon}.

Choosing ε\varepsilon and ω\omega small enough so that 4ε+c+ω<124\varepsilon+c+\omega<\frac{1}{2}, we conclude that (10) is true.

A similar bound for 𝐮\mathbf{u} can be shown by the same arguments for 𝐇^=𝐗𝐗\widehat{\mathbf{H}}=\mathbf{X}\mathbf{X}^{\top}. Hence, we have shown the desired results. ∎

C.2 Proof of Theorem 1

Now we are ready to prove the resampling sensitivity.

Let 𝐗′′n×p\mathbf{X}^{\prime\prime}\in\mathbb{R}^{n\times p} be a copy of 𝐗\mathbf{X} that is independent of 𝐗\mathbf{X} and 𝐗\mathbf{X}^{\prime}. For an arbitrary index (i,α)(i,\alpha) with 1in1\leq i\leq n and 1αp1\leq\alpha\leq p, we introduce another random matrix 𝐘(i,α)\mathbf{Y}_{(i,\alpha)} obtained from 𝐗\mathbf{X} by replacing the (i,α)(i,\alpha) entry 𝐗iα\mathbf{X}_{i\alpha} by 𝐗iα′′\mathbf{X}_{i\alpha}^{\prime\prime}. Similarly, we denote by 𝐘(i,α)[k]\mathbf{Y}^{[k]}_{(i,\alpha)} the matrix obtained via the analogous procedure from 𝐗[k]\mathbf{X}^{[k]}. For the matrix 𝐗[k]\mathbf{X}^{[k]}, the resampling is carried out in the following way: if (i,α)Sk(i,\alpha)\in S_{k}, then replace 𝐗iα[k]\mathbf{X}^{[k]}_{i\alpha} with 𝐗iα′′\mathbf{X}_{i\alpha}^{\prime\prime}; if (i,α)Sk(i,\alpha)\notin S_{k}, then replace 𝐗iα[k]\mathbf{X}^{[k]}_{i\alpha} with 𝐗iα′′′\mathbf{X}_{i\alpha}^{\prime\prime\prime}, where 𝐗′′′\mathbf{X}^{\prime\prime\prime} is another independent copy of 𝐗\mathbf{X}, 𝐗\mathbf{X}^{\prime} and 𝐗′′\mathbf{X}^{\prime\prime}.

In the following analysis, we choose an index (s,θ)(s,\theta) uniformly at random from the set of all pairs {(i,α):1in,1αp}\left\{(i,\alpha):1\leq i\leq n,1\leq\alpha\leq p\right\}. Let μ\mu be the top singular value of 𝐘(s,θ)\mathbf{Y}_{(s,\theta)} and use 𝐟n\mathbf{f}\in\mathbb{R}^{n} and 𝐠p\mathbf{g}\in\mathbb{R}^{p} to denote the normalized top left and right singular vectors of 𝐘(s,θ)\mathbf{Y}_{(s,\theta)}. Similarly, we define μ[k]\mu^{[k]}, 𝐟[k]\mathbf{f}^{[k]} and 𝐠[k]\mathbf{g}^{[k]} for 𝐘(s,θ)[k]\mathbf{Y}^{[k]}_{(s,\theta)}. We also denote by 𝐡\mathbf{h} and 𝐡[k]\mathbf{h}^{[k]} the concatenation of 𝐟,𝐠\mathbf{f},\mathbf{g} and 𝐟[k],𝐠[k]\mathbf{f}^{[k]},\mathbf{g}^{[k]}, respectively. Finally, let 𝐗~[k]\widetilde{\mathbf{X}}^{[k]}, 𝐘~\widetilde{\mathbf{Y}} and 𝐘~[k]\widetilde{\mathbf{Y}}^{[k]} be the symmetrization (5) of 𝐗[k]\mathbf{X}^{[k]}, 𝐘\mathbf{Y} and 𝐘[k]\mathbf{Y}^{[k]}, respectively. When the context is clear, we will omit the index (s,θ)(s,\theta) for the convenience of notations.
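The coupling just described can be made concrete in a few lines. The sketch below is purely illustrative: the Gaussian entries, their normalization, the matrix sizes and the helper top_triplet are assumptions of the sketch, not part of the proof.

import numpy as np

rng = np.random.default_rng(0)
n, p, k = 300, 200, 50

# Four independent copies of the data matrix (illustrative Gaussian entries).
X, Xp, Xpp, Xppp = [rng.normal(0.0, 1.0 / np.sqrt(n), (n, p)) for _ in range(4)]

# S_k: k entries chosen uniformly at random without replacement.
flat = rng.choice(n * p, size=k, replace=False)
rows, cols = np.unravel_index(flat, (n, p))
S_k = set(zip(rows.tolist(), cols.tolist()))

# X^[k]: resample the entries in S_k by the corresponding entries of X'.
X_k = X.copy()
X_k[rows, cols] = Xp[rows, cols]

# Uniform index (s, theta); Y replaces X_{s,theta} by X''_{s,theta}, while Y^[k]
# replaces X^[k]_{s,theta} by X'' (if (s, theta) is in S_k) or by X''' (otherwise).
s, theta = int(rng.integers(n)), int(rng.integers(p))
Y, Y_k = X.copy(), X_k.copy()
Y[s, theta] = Xpp[s, theta]
Y_k[s, theta] = Xpp[s, theta] if (s, theta) in S_k else Xppp[s, theta]

def top_triplet(M):
    # top singular value and the corresponding unit left/right singular vectors
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return S[0], U[:, 0], Vt[0]

sigma, u, v = top_triplet(X)
mu, f, g = top_triplet(Y)
sigma_k, u_k, v_k = top_triplet(X_k)
print(sigma - mu)                  # a single resampled entry barely moves sigma
print(abs(u @ u_k), abs(v @ v_k))  # overlaps of the principal components of X and X^[k]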

Step 1.

By Lemma B.1, we have

2𝖵𝖺𝗋(σ)knp+1np𝔼[(σμ)(σ[k]μ[k])].\frac{2\mathsf{Var}(\sigma)}{k}\cdot\frac{np+1}{np}\geq\mathbb{E}\left[(\sigma-\mu)\left(\sigma^{[k]}-\mu^{[k]}\right)\right]. (19)

Using the variational characterization of the top singular value

𝐟,𝐗𝐠σ=𝐮,𝐗𝐯,𝐮,𝐘𝐯μ=𝐟,𝐘𝐠.\langle\mathbf{f},\mathbf{X}\mathbf{g}\rangle\leq\sigma=\langle\mathbf{u},\mathbf{X}\mathbf{v}\rangle,\ \ \ \langle\mathbf{u},\mathbf{Y}\mathbf{v}\rangle\leq\mu=\langle\mathbf{f},\mathbf{Y}\mathbf{g}\rangle.

This implies

𝐟,(𝐗𝐘)𝐠σμ𝐮,(𝐗𝐘)𝐯.\langle\mathbf{f},(\mathbf{X}-\mathbf{Y})\mathbf{g}\rangle\leq\sigma-\mu\leq\langle\mathbf{u},(\mathbf{X}-\mathbf{Y})\mathbf{v}\rangle. (20)

Applying the same arguments to 𝐗[k]\mathbf{X}^{[k]} and 𝐘[k]\mathbf{Y}^{[k]}, we have

𝐟[k],(𝐗[k]𝐘[k])𝐠[k]σ[k]μ[k]𝐮[k],(𝐗[k]𝐘[k])𝐯[k].\left\langle\mathbf{f}^{[k]},\left(\mathbf{X}^{[k]}-\mathbf{Y}^{[k]}\right)\mathbf{g}^{[k]}\right\rangle\leq\sigma^{[k]}-\mu^{[k]}\leq\left\langle\mathbf{u}^{[k]},\left(\mathbf{X}^{[k]}-\mathbf{Y}^{[k]}\right)\mathbf{v}^{[k]}\right\rangle.

Since the matrices 𝐗\mathbf{X} and 𝐘\mathbf{Y} (and likewise 𝐗[k]\mathbf{X}^{[k]} and 𝐘[k]\mathbf{Y}^{[k]}) differ only at the (s,θ)(s,\theta) entry, for any vectors 𝐚n\mathbf{a}\in\mathbb{R}^{n} and 𝐛p\mathbf{b}\in\mathbb{R}^{p} we have

𝐚,(𝐗𝐘)𝐛=Δsθ𝐚(s)𝐛(θ),𝐚,(𝐗[k]𝐘[k])𝐛=Δsθ[k]𝐚(s)𝐛(θ),\langle\mathbf{a},(\mathbf{X}-\mathbf{Y})\mathbf{b}\rangle=\Delta_{s\theta}\,\mathbf{a}(s)\mathbf{b}(\theta),\ \ \left\langle\mathbf{a},\left(\mathbf{X}^{[k]}-\mathbf{Y}^{[k]}\right)\mathbf{b}\right\rangle=\Delta^{[k]}_{s\theta}\,\mathbf{a}(s)\mathbf{b}(\theta),

where

Δsθ:=𝐗sθ𝐗sθ′′,andΔsθ[k]:={𝐗sθ𝐗sθ′′if(s,θ)Sk,𝐗sθ𝐗sθ′′′if(s,θ)Sk.\Delta_{s\theta}:=\mathbf{X}_{s\theta}-\mathbf{X}_{s\theta}^{\prime\prime},\ \ \mbox{and}\ \ \Delta^{[k]}_{s\theta}:=\begin{cases}\mathbf{X}_{s\theta}^{\prime}-\mathbf{X}_{s\theta}^{\prime\prime}&\mbox{if}\ (s,\theta)\in S_{k},\\ \mathbf{X}_{s\theta}-\mathbf{X}_{s\theta}^{\prime\prime\prime}&\mbox{if}\ (s,\theta)\notin S_{k}.\end{cases}

Therefore,

Δsθ𝐟(s)𝐠(θ)σμΔsθ𝐮(s)𝐯(θ),Δsθ[k]𝐟[k](s)𝐠[k](θ)σ[k]μ[k]Δsθ[k]𝐮[k](s)𝐯[k](θ).\Delta_{s\theta}\,\mathbf{f}(s)\mathbf{g}(\theta)\leq\sigma-\mu\leq\Delta_{s\theta}\mathbf{u}(s)\mathbf{v}(\theta),\ \ \Delta^{[k]}_{s\theta}\,\mathbf{f}^{[k]}(s)\mathbf{g}^{[k]}(\theta)\leq\sigma^{[k]}-\mu^{[k]}\leq\Delta^{[k]}_{s\theta}\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta).

Consider

T1:=(Δsθ𝐮(s)𝐯(θ))(Δsθ[k]𝐮[k](s)𝐯[k](θ)),T2:=(Δsθ𝐟(s)𝐠(θ))(Δsθ[k]𝐟[k](s)𝐠[k](θ)),T_{1}:=\left(\Delta_{s\theta}\mathbf{u}(s)\mathbf{v}(\theta)\right)\left(\Delta^{[k]}_{s\theta}\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta)\right),\ \ T_{2}:=\left(\Delta_{s\theta}\,\mathbf{f}(s)\mathbf{g}(\theta)\right)\left(\Delta^{[k]}_{s\theta}\,\mathbf{f}^{[k]}(s)\mathbf{g}^{[k]}(\theta)\right),
T3:=(Δsθ𝐮(s)𝐯(θ))(Δsθ[k]𝐟[k](s)𝐠[k](θ)),T4:=(Δsθ𝐟(s)𝐠(θ))(Δsθ[k]𝐮[k](s)𝐯[k](θ)).T_{3}:=\left(\Delta_{s\theta}\mathbf{u}(s)\mathbf{v}(\theta)\right)\left(\Delta^{[k]}_{s\theta}\,\mathbf{f}^{[k]}(s)\mathbf{g}^{[k]}(\theta)\right),\ \ T_{4}:=\left(\Delta_{s\theta}\,\mathbf{f}(s)\mathbf{g}(\theta)\right)\left(\Delta^{[k]}_{s\theta}\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta)\right).

Then we have

min(T1,T2,T3,T4)(σμ)(σ[k]μ[k])max(T1,T2,T3,T4).\min(T_{1},T_{2},T_{3},T_{4})\leq(\sigma-\mu)\left(\sigma^{[k]}-\mu^{[k]}\right)\leq\max(T_{1},T_{2},T_{3},T_{4}). (21)

To estimate (21), we introduce the following events

1:={max(𝐯,𝐮,𝐟,𝐠,𝐯[k],𝐮[k],𝐟[k],𝐠[k])n12+ε},{\mathcal{E}}_{1}:=\left\{\max\left(\|\mathbf{v}\|_{\infty},\|\mathbf{u}\|_{\infty},\|\mathbf{f}\|_{\infty},\|\mathbf{g}\|_{\infty},\|\mathbf{v}^{[k]}\|_{\infty},\|\mathbf{u}^{[k]}\|_{\infty},\|\mathbf{f}^{[k]}\|_{\infty},\|\mathbf{g}^{[k]}\|_{\infty}\right)\leq n^{-\frac{1}{2}+\varepsilon}\right\}, (22)

and

2:={max(𝐯𝐠,𝐮𝐟,𝐯[k]𝐠[k],𝐮[k]𝐟[k])n12δ}.{\mathcal{E}}_{2}:=\left\{\max\left(\|\mathbf{v}-\mathbf{g}\|_{\infty},\|\mathbf{u}-\mathbf{f}\|_{\infty},\|\mathbf{v}^{[k]}-\mathbf{g}^{[k]}\|_{\infty},\|\mathbf{u}^{[k]}-\mathbf{f}^{[k]}\|_{\infty}\right)\leq n^{-\frac{1}{2}-\delta}\right\}. (23)

Define the event :=12{\mathcal{E}}:={\mathcal{E}}_{1}\cap{\mathcal{E}}_{2}. On the event {\mathcal{E}}, for all

J{𝐮(s)𝐯(θ)𝐮[k](s)𝐯[k](θ),𝐮(s)𝐯(θ)𝐟[k](s)𝐠[k](θ),𝐟(s)𝐠(θ)𝐮[k](s)𝐯[k](θ),𝐟(s)𝐠(θ)𝐟[k](s)𝐠[k](θ)}J\in\left\{\mathbf{u}(s)\mathbf{v}(\theta)\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta),\mathbf{u}(s)\mathbf{v}(\theta)\mathbf{f}^{[k]}(s)\mathbf{g}^{[k]}(\theta),\mathbf{f}(s)\mathbf{g}(\theta)\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta),\mathbf{f}(s)\mathbf{g}(\theta)\mathbf{f}^{[k]}(s)\mathbf{g}^{[k]}(\theta)\right\}

we have

|J𝐮(s)𝐯(θ)𝐮[k](s)𝐯[k](θ)|=O(n2δ+3ε).\left|{J-\mathbf{u}(s)\mathbf{v}(\theta)\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta)}\right|=O\left(n^{-2-\delta+3\varepsilon}\right). (24)

Let T:=min(T1,T2,T3,T4)T:=\min(T_{1},T_{2},T_{3},T_{4}). On the event {\mathcal{E}}, using (24) we have

T(ΔsθΔsθ[k])𝐮(s)𝐯(θ)𝐮[k](s)𝐯[k](θ)O(|ΔsθΔsθ[k]|n2δ+3ε).T\geq\left(\Delta_{s\theta}\Delta^{[k]}_{s\theta}\right)\mathbf{u}(s)\mathbf{v}(\theta)\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta)-O\left(\left|{\Delta_{s\theta}\Delta^{[k]}_{s\theta}}\right|n^{-2-\delta+3\varepsilon}\right). (25)

Step 2.

Next we claim that the contribution of TT when {\mathcal{E}} does not hold is negligible. Specifically, we have

𝔼[T𝟙c]=o(n3).\mathbb{E}\left[T\mathbbm{1}_{{\mathcal{E}}^{c}}\right]=o(n^{-3}). (26)

Since all four terms are estimated in the same way, it suffices to show that

𝔼[ΔsθΔsθ[k]𝐮(s)𝐯(θ)𝐮[k](s)𝐯[k](θ)𝟙c]=o(n3).\mathbb{E}\left[\Delta_{s\theta}\,\Delta^{[k]}_{s\theta}\mathbf{u}(s)\mathbf{v}(\theta)\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta)\mathbbm{1}_{{\mathcal{E}}^{c}}\right]=o(n^{-3}). (27)

To see this, using 𝟙c=𝟙1\+𝟙1c\mathbbm{1}_{{\mathcal{E}}^{c}}=\mathbbm{1}_{{\mathcal{E}}_{1}\backslash{\mathcal{E}}}+\mathbbm{1}_{{\mathcal{E}}_{1}^{c}}, we decompose the expectation into two parts

𝔼[ΔsθΔsθ[k]𝐮(s)𝐯(θ)𝐮[k](s)𝐯[k](θ)𝟙c]=I1+I2,\mathbb{E}\left[\Delta_{s\theta}\,\Delta^{[k]}_{s\theta}\mathbf{u}(s)\mathbf{v}(\theta)\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta)\mathbbm{1}_{{\mathcal{E}}^{c}}\right]=I_{1}+I_{2},

where

I1:=𝔼[ΔsθΔsθ[k]𝐮(s)𝐯(θ)𝐮[k](s)𝐯[k](θ)𝟙1\],I2:=𝔼[ΔsθΔsθ[k]𝐮(s)𝐯(θ)𝐮[k](s)𝐯[k](θ)𝟙1c].I_{1}:=\mathbb{E}\left[\Delta_{s\theta}\,\Delta^{[k]}_{s\theta}\mathbf{u}(s)\mathbf{v}(\theta)\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta)\mathbbm{1}_{{\mathcal{E}}_{1}\backslash{\mathcal{E}}}\right],\ \ I_{2}:=\mathbb{E}\left[\Delta_{s\theta}\,\Delta^{[k]}_{s\theta}\mathbf{u}(s)\mathbf{v}(\theta)\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta)\mathbbm{1}_{{\mathcal{E}}_{1}^{c}}\right].

For the first term I1I_{1}, by delocalization and the relation 1\=1\2{\mathcal{E}}_{1}\backslash{\mathcal{E}}={\mathcal{E}}_{1}\backslash{\mathcal{E}}_{2}, we have

|I1|n2+4ε𝔼[|ΔsθΔsθ[k]|𝟙1\2]n2+4ε𝔼[|ΔsθΔsθ[k]|𝟙2c].|I_{1}|\leq n^{-2+4\varepsilon}\mathbb{E}\left[\left|{\Delta_{s\theta}\,\Delta^{[k]}_{s\theta}}\right|\mathbbm{1}_{{\mathcal{E}}_{1}\backslash{\mathcal{E}}_{2}}\right]\leq n^{-2+4\varepsilon}\mathbb{E}\left[\left|{\Delta_{s\theta}\,\Delta^{[k]}_{s\theta}}\right|\mathbbm{1}_{{\mathcal{E}}_{2}^{c}}\right]. (28)

Note that the random variable ΔsθΔsθ[k]\Delta_{s\theta}\,\Delta^{[k]}_{s\theta} and the event 2{\mathcal{E}}_{2} are dependent. To decouple these variables, we introduce a new event. Consider the event 3:=𝒜{\mathcal{E}}_{3}:={\mathcal{A}}\cup{\mathcal{B}}, where

𝒜:={min(σ1σ2,σ1[k]σ2[k])n1c},:={min(μ1μ2,μ1[k]μ2[k])n1c}{\mathcal{A}}:=\left\{\min\left(\sigma_{1}-\sigma_{2},\sigma^{[k]}_{1}-\sigma^{[k]}_{2}\right)\geq n^{-1-c}\right\},\ \ {\mathcal{B}}:=\left\{\min\left(\mu_{1}-\mu_{2},\mu^{[k]}_{1}-\mu^{[k]}_{2}\right)\geq n^{-1-c}\right\}

Then,

𝔼[|ΔsθΔsθ[k]|𝟙3c]\displaystyle\mathbb{E}\left[\left|{\Delta_{s\theta}\,\Delta^{[k]}_{s\theta}}\right|\mathbbm{1}_{{\mathcal{E}}_{3}^{c}}\right] 𝔼[(Δsθ2+(Δsθ[k])2)𝟙3c]\displaystyle\lesssim\mathbb{E}\left[\left(\Delta_{s\theta}^{2}+(\Delta^{[k]}_{s\theta})^{2}\right)\mathbbm{1}_{{\mathcal{E}}_{3}^{c}}\right]
𝔼[(𝐗sθ2+(𝐗sθ)2+(𝐗sθ′′)2+(𝐗sθ′′′)2)𝟙3c]\displaystyle\lesssim\mathbb{E}\left[\left(\mathbf{X}_{s\theta}^{2}+(\mathbf{X}_{s\theta}^{\prime})^{2}+(\mathbf{X}_{s\theta}^{\prime\prime})^{2}+(\mathbf{X}_{s\theta}^{\prime\prime\prime})^{2}\right)\mathbbm{1}_{{\mathcal{E}}_{3}^{c}}\right]
𝔼[(𝐗sθ2+(𝐗sθ)2)𝟙c]+𝔼[((𝐗sθ′′)2+(𝐗sθ′′′)2)𝟙𝒜c].\displaystyle\lesssim\mathbb{E}\left[\left(\mathbf{X}_{s\theta}^{2}+(\mathbf{X}_{s\theta}^{\prime})^{2}\right)\mathbbm{1}_{{\mathcal{B}}^{c}}\right]+\mathbb{E}\left[\left((\mathbf{X}_{s\theta}^{\prime\prime})^{2}+(\mathbf{X}_{s\theta}^{\prime\prime\prime})^{2}\right)\mathbbm{1}_{{\mathcal{A}}^{c}}\right].

Observe that the random variables 𝐗sθ\mathbf{X}_{s\theta} and 𝐗sθ\mathbf{X}_{s\theta}^{\prime} are independent of the event {\mathcal{B}}, and the random variables 𝐗sθ′′\mathbf{X}_{s\theta}^{\prime\prime} and 𝐗sθ′′′\mathbf{X}_{s\theta}^{\prime\prime\prime} are independent of 𝒜{\mathcal{A}}. Therefore, by Lemma B.3,

𝔼[(𝐗sθ2+(𝐗sθ)2)𝟙c]=O(n1κ),𝔼[((𝐗sθ′′)2+(𝐗sθ′′′)2)𝟙𝒜c]=O(n1κ).\mathbb{E}\left[\left(\mathbf{X}_{s\theta}^{2}+(\mathbf{X}_{s\theta}^{\prime})^{2}\right)\mathbbm{1}_{{\mathcal{B}}^{c}}\right]=O(n^{-1-\kappa}),\ \ \ \mathbb{E}\left[\left((\mathbf{X}_{s\theta}^{\prime\prime})^{2}+(\mathbf{X}_{s\theta}^{\prime\prime\prime})^{2}\right)\mathbbm{1}_{{\mathcal{A}}^{c}}\right]=O(n^{-1-\kappa}).

By Lemma C.1, we have (3\2)=O(ND)\mathbb{P}({\mathcal{E}}_{3}\backslash{\mathcal{E}}_{2})=O(N^{-D}) for any fixed large D>0D>0. Using the Cauchy-Schwarz inequality, we have

𝔼[|ΔsθΔsθ[k]|𝟙2c]\displaystyle\mathbb{E}\left[\left|{\Delta_{s\theta}\,\Delta^{[k]}_{s\theta}}\right|\mathbbm{1}_{{\mathcal{E}}_{2}^{c}}\right] =𝔼[|ΔsθΔsθ[k]|𝟙3c]+𝔼[|ΔsθΔsθ[k]|𝟙3\2]\displaystyle=\mathbb{E}\left[\left|{\Delta_{s\theta}\,\Delta^{[k]}_{s\theta}}\right|\mathbbm{1}_{{\mathcal{E}}_{3}^{c}}\right]+\mathbb{E}\left[\left|{\Delta_{s\theta}\,\Delta^{[k]}_{s\theta}}\right|\mathbbm{1}_{{\mathcal{E}}_{3}\backslash{\mathcal{E}}_{2}}\right]
=O(n1κ)+𝔼[|ΔsθΔsθ[k]|2](3\2)\displaystyle=O(n^{-1-\kappa})+\sqrt{\mathbb{E}\left[\left|{\Delta_{s\theta}\,\Delta^{[k]}_{s\theta}}\right|^{2}\right]}\sqrt{\mathbb{P}({\mathcal{E}}_{3}\backslash{\mathcal{E}}_{2})}
=O(n1κ)+O(ND)\displaystyle=O(n^{-1-\kappa})+O(N^{-D})
=O(n1κ).\displaystyle=O(n^{-1-\kappa}).

Choosing 4ε<κ4\varepsilon<\kappa, we deduce from (28) that

|I1|O(n2+4ε1κ)=o(n3).|I_{1}|\leq O(n^{-2+4\varepsilon-1-\kappa})=o(n^{-3}).

For the term I2I_{2}, note that 𝐮\mathbf{u}, 𝐯\mathbf{v}, 𝐮[k]\mathbf{u}^{[k]} and 𝐯[k]\mathbf{v}^{[k]} are unit vectors. We have that

max(𝐮,𝐯,𝐮[k],𝐯[k])1.\max(\|\mathbf{u}\|_{\infty},\|\mathbf{v}\|_{\infty},\|\mathbf{u}^{[k]}\|_{\infty},\|\mathbf{v}^{[k]}\|_{\infty})\leq 1.

Recall that 1{\mathcal{E}}_{1} holds with overwhelming probability. By the Cauchy-Schwarz inequality, for any large D>0D>0, we have

|I2|𝔼[|ΔsθΔsθ[k]|𝟙1c]𝔼[|ΔsθΔsθ[k]|2](1c)=O(ND).|I_{2}|\leq\mathbb{E}\left[\left|{\Delta_{s\theta}\,\Delta^{[k]}_{s\theta}}\right|\mathbbm{1}_{{\mathcal{E}}_{1}^{c}}\right]\leq\sqrt{\mathbb{E}\left[\left|{\Delta_{s\theta}\,\Delta^{[k]}_{s\theta}}\right|^{2}\right]}\sqrt{\mathbb{P}({\mathcal{E}}_{1}^{c})}=O(N^{-D}).

Hence we have shown the desired claim (27).

Step 3.

Combining (21), (25) and (26), we obtain

𝔼[(σμ)(σ[k]μ[k])]𝔼[ΔsθΔsθ[k]𝐮(s)𝐯(θ)𝐮[k](s)𝐯[k](θ)]+o(n3).\mathbb{E}\left[(\sigma-\mu)\left(\sigma^{[k]}-\mu^{[k]}\right)\right]\geq\mathbb{E}\left[\Delta_{s\theta}\Delta^{[k]}_{s\theta}\,\mathbf{u}(s)\mathbf{v}(\theta)\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta)\right]+o(n^{-3}).

Since np+1np2\frac{np+1}{np}\leq 2, by (19) we have

𝔼[ΔsθΔsθ[k]𝐮(s)𝐯(θ)𝐮[k](s)𝐯[k](θ)]4𝖵𝖺𝗋(σ)k+o(n3).\mathbb{E}\left[\Delta_{s\theta}\Delta^{[k]}_{s\theta}\,\mathbf{u}(s)\mathbf{v}(\theta)\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta)\right]\leq\frac{4\mathsf{Var}(\sigma)}{k}+o(n^{-3}). (29)

Since the random index (s,θ)(s,\theta) is uniformly sampled, we have

𝔼[ΔsθΔsθ[k]𝐮(s)𝐯(θ)𝐮[k](s)𝐯[k](θ)]=1np𝔼[1in,1αpΔiαΔiα[k]𝐮(i)𝐯(α)𝐮[k](i)𝐯[k](α)].\mathbb{E}\left[\Delta_{s\theta}\Delta^{[k]}_{s\theta}\,\mathbf{u}(s)\mathbf{v}(\theta)\mathbf{u}^{[k]}(s)\mathbf{v}^{[k]}(\theta)\right]=\frac{1}{np}\mathbb{E}\left[\sum_{1\leq i\leq n,1\leq\alpha\leq p}\Delta_{i\alpha}\Delta^{[k]}_{i\alpha}\,\mathbf{u}(i)\mathbf{v}(\alpha)\mathbf{u}^{[k]}(i)\mathbf{v}^{[k]}(\alpha)\right]. (30)

Note that

ΔiαΔiα[k]={(𝐗iα𝐗iα′′)(𝐗iα𝐗iα′′)if(i,α)Sk,(𝐗iα𝐗iα′′)(𝐗iα𝐗iα′′′)if(i,α)Sk.\Delta_{i\alpha}\Delta^{[k]}_{i\alpha}=\begin{cases}(\mathbf{X}_{i\alpha}-\mathbf{X}_{i\alpha}^{\prime\prime})(\mathbf{X}_{i\alpha}^{\prime}-\mathbf{X}_{i\alpha}^{\prime\prime})&\mbox{if}\ (i,\alpha)\in S_{k},\\ (\mathbf{X}_{i\alpha}-\mathbf{X}_{i\alpha}^{\prime\prime})(\mathbf{X}_{i\alpha}-\mathbf{X}_{i\alpha}^{\prime\prime\prime})&\mbox{if}\ (i,\alpha)\notin S_{k}.\end{cases}

In either case, we have 𝔼[ΔiαΔiα[k]]=p1\mathbb{E}[\Delta_{i\alpha}\Delta^{[k]}_{i\alpha}]=p^{-1}. Therefore,

1in,1αp𝔼[ΔiαΔiα[k]|Sk]𝐮(i)𝐯(α)𝐮[k](i)𝐯[k](α)=1p𝐯,𝐯[k]𝐮,𝐮[k].\sum_{1\leq i\leq n,1\leq\alpha\leq p}\mathbb{E}\left[\Delta_{i\alpha}\Delta^{[k]}_{i\alpha}\left|\right.S_{k}\right]\mathbf{u}(i)\mathbf{v}(\alpha)\mathbf{u}^{[k]}(i)\mathbf{v}^{[k]}(\alpha)=\frac{1}{p}\langle\mathbf{v},\mathbf{v}^{[k]}\rangle\langle\mathbf{u},\mathbf{u}^{[k]}\rangle.

Consequently,

𝔼[1in,1αp𝔼[ΔiαΔiα[k]|Sk]𝐮(i)𝐯(α)𝐮[k](i)𝐯[k](α)]=1p𝔼[𝐯,𝐯[k]𝐮,𝐮[k]].\mathbb{E}\left[\sum_{1\leq i\leq n,1\leq\alpha\leq p}\mathbb{E}\left[\Delta_{i\alpha}\Delta^{[k]}_{i\alpha}\left.\right|S_{k}\right]\mathbf{u}(i)\mathbf{v}(\alpha)\mathbf{u}^{[k]}(i)\mathbf{v}^{[k]}(\alpha)\right]=\frac{1}{p}\mathbb{E}\left[\langle\mathbf{v},\mathbf{v}^{[k]}\rangle\langle\mathbf{u},\mathbf{u}^{[k]}\rangle\right]. (31)

Moreover, we claim that

𝔼[1in,1αp(ΔiαΔiα[k]𝔼[ΔiαΔiα[k]|Sk])𝐮(i)𝐯(α)𝐮[k](i)𝐯[k](α)]=o(n1).\mathbb{E}\left[\sum_{1\leq i\leq n,1\leq\alpha\leq p}\left(\Delta_{i\alpha}\Delta^{[k]}_{i\alpha}-\mathbb{E}\left[\Delta_{i\alpha}\Delta^{[k]}_{i\alpha}\left.\right|S_{k}\right]\right)\mathbf{u}(i)\mathbf{v}(\alpha)\mathbf{u}^{[k]}(i)\mathbf{v}^{[k]}(\alpha)\right]=o(n^{-1}). (32)

For the ease of notations, we set Ξiα:=ΔiαΔiα[k]𝔼[ΔiαΔiα[k]|Sk]\Xi_{i\alpha}:=\Delta_{i\alpha}\Delta^{[k]}_{i\alpha}-\mathbb{E}[\Delta_{i\alpha}\Delta^{[k]}_{i\alpha}|S_{k}]. It suffices to show that for all pairs (i,α)(i,\alpha) we have

𝔼[Ξiα𝐮(i)𝐯(α)𝐮[k](i)𝐯[k](α)]=o(n3).\mathbb{E}\left[\Xi_{i\alpha}\mathbf{u}(i)\mathbf{v}(\alpha)\mathbf{u}^{[k]}(i)\mathbf{v}^{[k]}(\alpha)\right]=o(n^{-3}). (33)

To see this, we introduce another copy of 𝐗\mathbf{X}, denoted by 𝐗′′′′\mathbf{X}^{\prime\prime\prime\prime}, which is independent of all previous random variables (𝐗,𝐗,𝐗′′,𝐗′′′)(\mathbf{X},\mathbf{X}^{\prime},\mathbf{X}^{\prime\prime},\mathbf{X}^{\prime\prime\prime}). For an arbitrarily fixed index (i,α)(i,\alpha), we define matrices 𝐗^(i,α)\widehat{\mathbf{X}}_{(i,\alpha)} and 𝐗^(i,α)[k]\widehat{\mathbf{X}}^{[k]}_{(i,\alpha)} by resampling the (i,α)(i,\alpha) entry of 𝐗\mathbf{X} and 𝐗[k]\mathbf{X}^{[k]} with 𝐗iα′′′′\mathbf{X}_{i\alpha}^{\prime\prime\prime\prime}. Let 𝐮^\widehat{\mathbf{u}}, 𝐯^\widehat{\mathbf{v}} be the left and right top singular vector of 𝐗^\widehat{\mathbf{X}}, and similarly 𝐮^[k]\widehat{\mathbf{u}}^{[k]}, 𝐯^[k]\widehat{\mathbf{v}}^{[k]} for 𝐗^[k]\widehat{\mathbf{X}}^{[k]}. Define

ψiα:=𝐮(i)𝐯(α)𝐮[k](i)𝐯[k](α),ψ^iα:=𝐮^(i)𝐯^(α)𝐮^[k](i)𝐯^[k](α).\psi_{i\alpha}:=\mathbf{u}(i)\mathbf{v}(\alpha)\mathbf{u}^{[k]}(i)\mathbf{v}^{[k]}(\alpha),\ \ \widehat{\psi}_{i\alpha}:=\widehat{\mathbf{u}}(i)\widehat{\mathbf{v}}(\alpha)\widehat{\mathbf{u}}^{[k]}(i)\widehat{\mathbf{v}}^{[k]}(\alpha).

A crucial observation is that Ξiα\Xi_{i\alpha} and ψ^iα\widehat{\psi}_{i\alpha} are independent. This is because, conditioned on SkS_{k}, the matrices 𝐗^\widehat{\mathbf{X}} and 𝐗^[k]\widehat{\mathbf{X}}^{[k]} are independent of (𝐗iα,𝐗iα,𝐗iα′′,𝐗iα′′′)(\mathbf{X}_{i\alpha},\mathbf{X}_{i\alpha}^{\prime},\mathbf{X}_{i\alpha}^{\prime\prime},\mathbf{X}_{i\alpha}^{\prime\prime\prime}). Such a conditional independence is also true for the singular vectors, and hence also holds for ψ^iα\widehat{\psi}_{i\alpha}. On the other hand, by definition, the variable Ξiα\Xi_{i\alpha} only depends on (𝐗iα,𝐗iα,𝐗iα′′,𝐗iα′′′)(\mathbf{X}_{i\alpha},\mathbf{X}_{i\alpha}^{\prime},\mathbf{X}_{i\alpha}^{\prime\prime},\mathbf{X}_{i\alpha}^{\prime\prime\prime}). Therefore,

𝔼[Ξiαψ^iα]=𝔼[𝔼[Ξiα|Sk]𝔼[ψ^iα|Sk]]=0\mathbb{E}\left[\Xi_{i\alpha}\widehat{\psi}_{i\alpha}\right]=\mathbb{E}\left[\mathbb{E}[\Xi_{i\alpha}|S_{k}]\,\mathbb{E}[\widehat{\psi}_{i\alpha}|S_{k}]\right]=0

Thus, we reduce (33) to showing

𝔼[Ξiα(ψiαψ^iα)]=o(n3).\mathbb{E}\left[\Xi_{i\alpha}\left(\psi_{i\alpha}-\widehat{\psi}_{i\alpha}\right)\right]=o(n^{-3}). (34)

The proof of (34) is similar to the previous arguments. Consider the events

^1:={max(𝐯,𝐮,𝐮^,𝐯^,𝐯[k],𝐮[k],𝐮^[k],𝐯^[k])n12+ε},\widehat{{\mathcal{E}}}_{1}:=\left\{\max\left(\|\mathbf{v}\|_{\infty},\|\mathbf{u}\|_{\infty},\|\widehat{\mathbf{u}}\|_{\infty},\|\widehat{\mathbf{v}}\|_{\infty},\|\mathbf{v}^{[k]}\|_{\infty},\|\mathbf{u}^{[k]}\|_{\infty},\|\widehat{\mathbf{u}}^{[k]}\|_{\infty},\|\widehat{\mathbf{v}}^{[k]}\|_{\infty}\right)\leq n^{-\frac{1}{2}+\varepsilon}\right\},
^2:={max(𝐯𝐯^,𝐮𝐮^,𝐯[k]𝐯^[k],𝐮[k]𝐮^[k])n12δ}.\widehat{{\mathcal{E}}}_{2}:=\left\{\max\left(\|\mathbf{v}-\widehat{\mathbf{v}}\|_{\infty},\|\mathbf{u}-\widehat{\mathbf{u}}\|_{\infty},\|\mathbf{v}^{[k]}-\widehat{\mathbf{v}}^{[k]}\|_{\infty},\|\mathbf{u}^{[k]}-\widehat{\mathbf{u}}^{[k]}\|_{\infty}\right)\leq n^{-\frac{1}{2}-\delta}\right\}.

On the event ^:=^1^2\widehat{{\mathcal{E}}}:=\widehat{{\mathcal{E}}}_{1}\cap\widehat{{\mathcal{E}}}_{2}, we have |ψiαψ^iα|=O(n2δ+3ε)|\psi_{i\alpha}-\widehat{\psi}_{i\alpha}|=O(n^{-2-\delta+3\varepsilon}). Note that 𝔼[|Ξiα|]=O(n1)\mathbb{E}[|\Xi_{i\alpha}|]=O(n^{-1}) since 𝔼[|ΔiαΔiα[k]|]=O(n1)\mathbb{E}[|\Delta_{i\alpha}\Delta^{[k]}_{i\alpha}|]=O(n^{-1}). As a consequence,

𝔼[|Ξiα(ψiαψ^iα)|𝟙^]=O(n3δ+3ε)=o(n3).\mathbb{E}\left[\left|{\Xi_{i\alpha}(\psi_{i\alpha}-\widehat{\psi}_{i\alpha})}\right|\mathbbm{1}_{\widehat{{\mathcal{E}}}}\right]=O(n^{-3-\delta+3\varepsilon})=o(n^{-3}). (35)

Using the same argument as in (27), we have

𝔼[|Ξiα(ψiαψ^iα)|𝟙^c]N2+4ε𝔼[|Ξiα|𝟙^c]=O(N2+4εN1κ)=o(n3),\mathbb{E}\left[\left|{\Xi_{i\alpha}(\psi_{i\alpha}-\widehat{\psi}_{i\alpha})}\right|\mathbbm{1}_{\widehat{{\mathcal{E}}}^{c}}\right]\lesssim N^{-2+4\varepsilon}\mathbb{E}\left[|\Xi_{i\alpha}|\mathbbm{1}_{\widehat{{\mathcal{E}}}^{c}}\right]=O(N^{-2+4\varepsilon}N^{-1-\kappa})=o(n^{-3}), (36)

where κ\kappa is the constant in the gap property (Lemma B.3). Thus, by (35) and (36), we have shown the desired claim (34).

Based on (29) and (30), combining (31) and (32) yields

1np2𝔼[𝐯,𝐯[k]𝐮,𝐮[k]]4𝖵𝖺𝗋(σ)k+o(1n3)+o(1n2p)\frac{1}{np^{2}}\mathbb{E}\left[\langle\mathbf{v},\mathbf{v}^{[k]}\rangle\langle\mathbf{u},\mathbf{u}^{[k]}\rangle\right]\leq\frac{4\mathsf{Var}(\sigma)}{k}+o\left(\frac{1}{n^{3}}\right)+o\left(\frac{1}{n^{2}p}\right)

By Lemma B.2 and the assumption kn5/3k\gg n^{5/3}, we have

𝔼[𝐯,𝐯[k]𝐮,𝐮[k]]np2kO(n4/3)+o(1)=o(1).\mathbb{E}\left[\langle\mathbf{v},\mathbf{v}^{[k]}\rangle\langle\mathbf{u},\mathbf{u}^{[k]}\rangle\right]\leq\frac{np^{2}}{k}O(n^{-4/3})+o(1)=o(1).

This implies

𝔼[|𝐯,𝐯[k]𝐮,𝐮[k]|]0.\mathbb{E}\left[|\langle\mathbf{v},\mathbf{v}^{[k]}\rangle\langle\mathbf{u},\mathbf{u}^{[k]}\rangle|\right]\to 0. (37)

Step 4.

Consider the symmetrization matrix 𝐗~\widetilde{\mathbf{X}} defined in (5). The variational representation of the top eigenvalue yields

σ=𝐰,𝐗~𝐰𝐰22,σ[k]=𝐰[k],𝐗~[k]𝐰[k]𝐰[k]22with𝐰22=𝐰[k]22=2.\sigma=\frac{\langle\mathbf{w},\widetilde{\mathbf{X}}\mathbf{w}\rangle}{\|\mathbf{w}\|_{2}^{2}},\ \ \sigma^{[k]}=\frac{\langle\mathbf{w}^{[k]},\widetilde{\mathbf{X}}^{[k]}\mathbf{w}^{[k]}\rangle}{\|\mathbf{w}^{[k]}\|_{2}^{2}}\ \ \mbox{with}\ \|\mathbf{w}\|_{2}^{2}=\|\mathbf{w}^{[k]}\|_{2}^{2}=2.
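As a sanity check of this variational step, the following sketch verifies numerically that the top eigenvalue of the symmetrization equals the top singular value σ of 𝐗, and that with the normalization ‖w‖² = 2 the corresponding eigenvector w is the concatenation of u and v, so that ⟨w, w^{[k]}⟩ = ⟨u, u^{[k]}⟩ + ⟨v, v^{[k]}⟩. It assumes, as the identities used here suggest, that the symmetrization (5) is the block matrix with X and X^⊤ in the off-diagonal blocks; the Gaussian entries and sizes are illustrative.

import numpy as np

rng = np.random.default_rng(1)
n, p = 300, 200
X = rng.normal(0.0, 1.0 / np.sqrt(n), (n, p))

# assumed form of the symmetrization (5): off-diagonal blocks X and X^T
X_tilde = np.block([[np.zeros((n, n)), X], [X.T, np.zeros((p, p))]])
eigvals, eigvecs = np.linalg.eigh(X_tilde)
top_eig, top_vec = eigvals[-1], eigvecs[:, -1]

U, S, Vt = np.linalg.svd(X, full_matrices=False)
sigma, u, v = S[0], U[:, 0], Vt[0]

print(abs(top_eig - sigma))   # top eigenvalue of the symmetrization = top singular value of X
# with ||w||^2 = 2, w is the concatenation of u and v (up to a global sign)
w = np.sqrt(2.0) * top_vec * np.sign(top_vec[:n] @ u)
print(np.linalg.norm(w[:n] - u), np.linalg.norm(w[n:] - v))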

Using the same arguments as in Step 1-3, we can conclude that

𝔼[|𝐰,𝐰[k]|2]=𝔼[|𝐯,𝐯[k]+𝐮,𝐮[k]|2]0\mathbb{E}\left[|\langle\mathbf{w},\mathbf{w}^{[k]}\rangle|^{2}\right]=\mathbb{E}\left[|\langle\mathbf{v},\mathbf{v}^{[k]}\rangle+\langle\mathbf{u},\mathbf{u}^{[k]}\rangle|^{2}\right]\to 0

Combined with (37), this gives us

𝔼[|𝐯,𝐯[k]|2+|𝐮,𝐮[k]|2]0,\mathbb{E}\left[|\langle\mathbf{v},\mathbf{v}^{[k]}\rangle|^{2}+|\langle\mathbf{u},\mathbf{u}^{[k]}\rangle|^{2}\right]\to 0,

which proves the desired results.

Appendix D Proofs for the Stability Regime

Throughout the whole section, we will focus on the behaviour of 𝐯\mathbf{v} and 𝐯[k]\mathbf{v}^{[k]}. Similar results also hold for 𝐮\mathbf{u} and 𝐮[k]\mathbf{u}^{[k]} via the same arguments.

D.1 Linearization and local law of resolvent

As mentioned in the introduction, in certain cases it would be more convenient to consider the symmetrization 𝐗~\widetilde{\mathbf{X}} of the matrix 𝐗\mathbf{X} (defined as in (5)) when studying its spectral properties. For zz\in\mathbb{C} with Imz>0\text{\rm Im}{\,}z>0, we introduce the resolvent of this symmetrization

𝐑(z):=(𝐈n𝐗𝐗z𝐈p)1.\mathbf{R}(z):=\left(\begin{matrix}-\mathbf{I}_{n}&\mathbf{X}\\ \mathbf{X}^{\top}&-z\mathbf{I}_{p}\end{matrix}\right)^{-1}. (38)

Note that 𝐑(z)\mathbf{R}(z) is not the conventional definition of the resolvent matrix, but we still call it resolvent for convenience. For the ease of notations, we will relabel the indices in 𝐑\mathbf{R} in the following way:

Definition 2 (Index sets).

We define the index sets

1:={1,,n},2:={1,,p},:=1{n+α:α2}.{\mathcal{I}}_{1}:=\left\{1,\dots,n\right\},\ \ {\mathcal{I}}_{2}:=\left\{1,\dots,p\right\},\ \ {\mathcal{I}}:={\mathcal{I}}_{1}\cup\left\{n+\alpha:\alpha\in{\mathcal{I}}_{2}\right\}.

For a general matrix 𝐌||×||\mathbf{M}\in\mathbb{R}^{|{\mathcal{I}}|\times|{\mathcal{I}}|}, we label the indices of the matrix elements in the following way: for a,ba,b\in{\mathcal{I}}, if 1a,bn1\leq a,b\leq n, then typically we use Latin letters i,ji,j to represent them; if n+1a,bn+pn+1\leq a,b\leq n+p, we use the corresponding Greek letters α=an\alpha=a-n and β=bn\beta=b-n to represent them.

The resolvent 𝐑\mathbf{R} is closely related to the eigenvalues and eigenvectors of the covariance matrix. As discussed in [DY18, Equations (3.9) and (3.10)], we have

𝐑ij(z)==1nz𝐮(i)𝐮(j)λz,𝐑αβ(z)==1p𝐯(α)𝐯(β)λz,\mathbf{R}_{ij}(z)=\sum_{\ell=1}^{n}\frac{z\mathbf{u}_{\ell}(i)\mathbf{u}_{\ell}(j)}{\lambda_{\ell}-z},\ \ \mathbf{R}_{\alpha\beta}(z)=\sum_{\ell=1}^{p}\frac{\mathbf{v}_{\ell}(\alpha)\mathbf{v}_{\ell}(\beta)}{\lambda_{\ell}-z}, (39)

and

𝐑iα(z)==1pλ𝐮(i)𝐯(α)λz,𝐑αi(z)==1pλ𝐯(α)𝐮(i)λz.\mathbf{R}_{i\alpha}(z)=\sum_{\ell=1}^{p}\frac{\sqrt{\lambda_{\ell}}\mathbf{u}_{\ell}(i)\mathbf{v}_{\ell}(\alpha)}{\lambda_{\ell}-z},\ \ \mathbf{R}_{\alpha i}(z)=\sum_{\ell=1}^{p}\frac{\sqrt{\lambda_{\ell}}\mathbf{v}_{\ell}(\alpha)\mathbf{u}_{\ell}(i)}{\lambda_{\ell}-z}.
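For concreteness, the following sketch checks the representation (39) numerically by inverting the block matrix in (38) directly and comparing its Latin-Latin and Greek-Greek blocks with the corresponding eigenvector sums; the matrix sizes, the Gaussian entries and the choice of z are assumptions of the sketch only.

import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 40
X = rng.normal(0.0, 1.0 / np.sqrt(n), (n, p))
z = 1.5 + 0.01j

# direct inversion of the block matrix in (38)
M = np.block([[-np.eye(n), X], [X.T, -z * np.eye(p)]])
R = np.linalg.inv(M)

# eigen-decompositions: (lam_n, U) for X X^T (Latin block), (lam_p, V) for X^T X (Greek block)
lam_n, U = np.linalg.eigh(X @ X.T)
lam_p, V = np.linalg.eigh(X.T @ X)

R_latin = (U * (z / (lam_n - z))) @ U.T    # sum_l z u_l(i) u_l(j) / (lam_l - z)
R_greek = (V * (1.0 / (lam_p - z))) @ V.T  # sum_l v_l(a) v_l(b) / (lam_l - z)

print(np.abs(R[:n, :n] - R_latin).max())   # agreement up to numerical error
print(np.abs(R[n:, n:] - R_greek).max())   # agreement up to numerical error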

An important result is the local Marchenko-Pastur law for the resolvent matrix 𝐑\mathbf{R}. This was first proved in [DY18, Lemma 3.11], and we also refer to [HLS19, Proposition 2.13] for a version that can be directly applied. Specifically, the resolvent matrix 𝐑\mathbf{R} has a deterministic limit, defined by

𝐆(z):=((1+m𝖬𝖯(z))1𝐈n00m𝖬𝖯(z)𝐈p),\mathbf{G}(z):=\left(\begin{matrix}-(1+m_{\mathsf{MP}}(z))^{-1}\mathbf{I}_{n}&0\\ 0&m_{\mathsf{MP}}(z)\mathbf{I}_{p}\end{matrix}\right), (40)

where m𝖬𝖯(z)m_{\mathsf{MP}}(z) is the Stieltjes transform of the Marchenko-Pastur law (6), given by

m𝖬𝖯(z):=ρ𝖬𝖯(x)xzdx=1ξz+(zλ)(zλ+)2ξz,m_{\mathsf{MP}}(z):=\int_{\mathbb{R}}\dfrac{\rho_{\mathsf{MP}}(x)}{x-z}{\rm d}x=\dfrac{1-\xi-z+\sqrt{(z-\lambda_{-})(z-\lambda_{+})}}{2\xi z},

where \sqrt{\cdot} denotes the square root on the complex plane whose branch cut is the negative real line. With this choice we always have Imm𝖬𝖯(z)>0\text{\rm Im}{\,}m_{\mathsf{MP}}(z)>0 when Imz>0\text{\rm Im}{\,}z>0.
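A short numerical sketch of this formula is given below. The helper selects the root with positive imaginary part and compares the result with the empirical Stieltjes transform of X^⊤X; the Gaussian entries with variance 1/n are chosen so that the spectrum of X^⊤X has edges λ_± = (1±√ξ)², which is an assumption of the sketch rather than a restatement of the model assumptions.

import numpy as np

def m_mp(z, xi):
    # Stieltjes transform of the Marchenko-Pastur law with ratio xi,
    # selecting the root with positive imaginary part, as in the text
    lam_minus, lam_plus = (1 - np.sqrt(xi)) ** 2, (1 + np.sqrt(xi)) ** 2
    root = np.sqrt((z - lam_minus) * (z - lam_plus) + 0j)
    m = (1 - xi - z + root) / (2 * xi * z)
    return m if m.imag > 0 else (1 - xi - z - root) / (2 * xi * z)

rng = np.random.default_rng(3)
n, p = 2000, 1000
xi = p / n
X = rng.normal(0.0, 1.0 / np.sqrt(n), (n, p))
evals = np.linalg.eigvalsh(X.T @ X)

z = 1.2 + 0.1j
print(m_mp(z, xi))                 # approximately -0.51 + 1.11j for xi = 0.5
print(np.mean(1.0 / (evals - z)))  # empirical Stieltjes transform, approximately equal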

To state the local law, we will focus on the spectral domain

𝒮:={{E+iη:λ2Eλ++1,0<η<3}if 0<ξ<1.{E+iη:110Eλ++1,0<η<3}ifξ=1.{\mathcal{S}}:=\begin{cases}\left\{E+\mathrm{i}\eta:\frac{\lambda_{-}}{2}\leq E\leq\lambda_{+}+1,0<\eta<3\right\}&\mbox{if}\ 0<\xi<1.\\ \left\{E+\mathrm{i}\eta:\frac{1}{10}\leq E\leq\lambda_{+}+1,0<\eta<3\right\}&\mbox{if}\ \xi=1.\end{cases} (41)
Lemma D.1 (Local Marchenko-Pastur law).

For any ε>0\varepsilon>0, the following estimate holds with overwhelming probability uniformly for z𝒮z\in{\mathcal{S}},

maxa,b|𝐑ab(z)𝐆ab(z)|nε(Imm𝖬𝖯(z)nη+1nη).\max_{a,b\in{\mathcal{I}}}\left|{\mathbf{R}_{ab}(z)-\mathbf{G}_{ab}(z)}\right|\leq n^{\varepsilon}\left(\sqrt{\frac{\text{\rm Im}{\,}m_{\mathsf{MP}}(z)}{n\eta}}+\frac{1}{n\eta}\right). (42)

To give a precise characterization of the resolvent, we rely on the following estimates for the Stieltjes transform m𝖬𝖯(z)m_{\mathsf{MP}}(z) of the Marchenko-Pastur law. We refer to e.g. [BKYY16, Lemma 3.6] for more details.

Lemma D.2 (Estimate for m𝖬𝖯(z)m_{\mathsf{MP}}(z)).

For z=E+iηz=E+\mathrm{i}\eta, let κ(z):=min(|Eλ|,|Eλ+|)\kappa(z):=\min(|E-\lambda_{-}|,|E-\lambda_{+}|) denote the distance to the spectral edge. If z𝒮z\in{\mathcal{S}}, we have

|m𝖬𝖯(z)|1,andImm𝖬𝖯(z){κ(z)+ηifE[λ,λ+],ηκ(z)+ηifE[λ,λ+].|m_{\mathsf{MP}}(z)|\asymp 1,\ \ \mbox{and}\ \ \ \text{\rm Im}{\,}m_{\mathsf{MP}}(z)\asymp\begin{cases}\sqrt{\kappa(z)+\eta}&\mbox{if}\ E\in[\lambda_{-},\lambda_{+}],\\ \frac{\eta}{\sqrt{\kappa(z)+\eta}}&\mbox{if}\ E\notin[\lambda_{-},\lambda_{+}].\end{cases} (43)

In the following analysis, we will work with z=E+iηz=E+\mathrm{i}\eta satisfying |Eλ+|n2/3+δ|E-\lambda_{+}|\leq n^{-2/3+\delta} and η=n2/3δ\eta=n^{-2/3-\delta}, where 0<δ<130<\delta<\frac{1}{3} is some parameter. Uniformly in this regime, the local law yields that the following is true with overwhelming probability for all ε>0\varepsilon>0 and some universal constant C0>0C_{0}>0,

supzmaxab|𝐑ab(z)|n13+δ+ε,andsupzmaxa|𝐑aa(z)|C0.\sup_{z}\max_{a\neq b\in{\mathcal{I}}}|\mathbf{R}_{ab}(z)|\leq n^{-\frac{1}{3}+\delta+\varepsilon},\ \ \mbox{and}\ \ \sup_{z}\max_{a\in{\mathcal{I}}}|\mathbf{R}_{aa}(z)|\leq C_{0}. (44)

These estimates will be used repeatedly in the following subsections.

D.2 Stability of the resolvent

In this subsection, we will prove the main technical result for the proof of resampling stability. Specifically, we will show that under moderate resampling, the resolvent matrices are stable. Since the resolvent is closely related to various spectral statistics, this stability lemma will be a key ingredient of our proof.

Lemma D.3.

Assume kn5/3ϵ0k\leq n^{5/3-\epsilon_{0}} for some ϵ0>0\epsilon_{0}>0. There exists δ0>0\delta_{0}>0 such that for all 0<δ<δ00<\delta<\delta_{0}, uniformly for z=E+iηz=E+\mathrm{i}\eta with |Eλ+|n2/3+δ|E-\lambda_{+}|\leq n^{-2/3+\delta} and η=n2/3δ\eta=n^{-2/3-\delta}, there exists c>0c>0 such that the following is true with overwhelming probability

maxα,β|𝐑αβ[k](z)𝐑αβ(z)|1n1+cη,maxi,j|𝐑ij[k](z)𝐑ij(z)|1n1+cη.\max_{\alpha,\beta}\left|{\mathbf{R}^{[k]}_{\alpha\beta}(z)-\mathbf{R}_{\alpha\beta}(z)}\right|\leq\frac{1}{n^{1+c}\eta},\ \ \max_{i,j}\left|{\mathbf{R}^{[k]}_{ij}(z)-\mathbf{R}_{ij}(z)}\right|\leq\frac{1}{n^{1+c}\eta}. (45)
Proof.

Recall that Sk:={(i1,α1),,(ik,αk)}S_{k}:=\{(i_{1},\alpha_{1}),\dots,(i_{k},\alpha_{k})\} is the random subset of matrix indices whose elements are resampled in the matrix 𝐗\mathbf{X}. For 1tk1\leq t\leq k, let 𝐗[t]\mathbf{X}^{[t]} be the matrix obtained from 𝐗\mathbf{X} by resampling the {(is,αs)}1st\{(i_{s},\alpha_{s})\}_{1\leq s\leq t} entries and let t{\mathcal{F}}_{t} be the σ\sigma-algebra generated by the random variables 𝐗\mathbf{X}, SkS_{k} and {𝐗isαs}1st\{\mathbf{X}_{i_{s}\alpha_{s}}^{\prime}\}_{1\leq s\leq t}. For a,ba,b\in{\mathcal{I}}, we define

Tab:={t:{it,αt}{a,b}}.T_{ab}:=\left\{t:\{i_{t},\alpha_{t}\}\cap\{a,b\}\neq\emptyset\right\}.

Let ε>0\varepsilon>0 be an arbitrarily fixed parameter, and let C0C_{0} be the constant in (44). Consider the event tt{\mathcal{E}}_{t}\in{\mathcal{F}}_{t} where for all z=E+iηz=E+\mathrm{i}\eta with |zλ+|n2/3+δ|z-\lambda_{+}|\leq n^{-2/3+\delta} and η=n2/3δ\eta=n^{-2/3-\delta} we have

maxab|𝐑ab[t](z)|n13+δ+ε=:Ψ,andmaxa|𝐑aa[t](z)|C0.\max_{a\neq b}\left|{\mathbf{R}^{[t]}_{ab}(z)}\right|\leq n^{-\frac{1}{3}+\delta+\varepsilon}=:\Psi,\ \ \mbox{and}\ \ \max_{a}\left|{\mathbf{R}^{[t]}_{aa}(z)}\right|\leq C_{0}.

Define 𝐗0[t]\mathbf{X}^{[t]}_{0} as the matrix obtained from 𝐗[t]\mathbf{X}^{[t]} by replacing the (it,αt)(i_{t},\alpha_{t}) entry with 0, and also define its symmetrization 𝐗~0[t]||×||\widetilde{\mathbf{X}}^{[t]}_{0}\in\mathbb{R}^{|{\mathcal{I}}|\times|{\mathcal{I}}|} as in (5). Note that 𝐗~0[t+1]\widetilde{\mathbf{X}}^{[t+1]}_{0} is t{\mathcal{F}}_{t}-measurable. We write

𝐗~[t]=𝐗~0[t+1]+𝐏~[t+1],𝐗~[t+1]=𝐗~0[t+1]+𝐐~[t+1],\widetilde{\mathbf{X}}^{[t]}=\widetilde{\mathbf{X}}^{[t+1]}_{0}+\widetilde{\mathbf{P}}^{[t+1]},\ \ \ \widetilde{\mathbf{X}}^{[t+1]}=\widetilde{\mathbf{X}}^{[t+1]}_{0}+\widetilde{\mathbf{Q}}^{[t+1]},

where 𝐏~[t],𝐐~[t]\widetilde{\mathbf{P}}^{[t]},\widetilde{\mathbf{Q}}^{[t]} are ||×|||{\mathcal{I}}|\times|{\mathcal{I}}| symmetric matrices whose elements are all 0 except at the (it,αt)(i_{t},\alpha_{t}) and (αt,it)(\alpha_{t},i_{t}) entries, satisfying

(𝐏~[t])ab={𝐗itαtif{a,b}={it,αt},0otherwise(𝐐~[t])ab={𝐗itαtif{a,b}={it,αt},0otherwise.(\widetilde{\mathbf{P}}^{[t]})_{ab}=\begin{cases}\mathbf{X}_{i_{t}\alpha_{t}}&\mbox{if}\ \left\{a,b\right\}=\{i_{t},\alpha_{t}\},\\ 0&\mbox{otherwise}\end{cases}\ \ \ (\widetilde{\mathbf{Q}}^{[t]})_{ab}=\begin{cases}\mathbf{X}_{i_{t}\alpha_{t}}^{\prime}&\mbox{if}\ \left\{a,b\right\}=\{i_{t},\alpha_{t}\},\\ 0&\mbox{otherwise}\end{cases}.

Define the resolvents for the matrices 𝐗~[t]\widetilde{\mathbf{X}}^{[t]} and 𝐗~0[t]\widetilde{\mathbf{X}}^{[t]}_{0} as in (38):

𝐑[t]:=(𝐈n𝐗[t](𝐗[t])z𝐈p)1,𝐑0[t]:=(𝐈n𝐗0[t](𝐗0[t])z𝐈p)1.\mathbf{R}^{[t]}:=\left(\begin{matrix}-\mathbf{I}_{n}&\mathbf{X}^{[t]}\\ (\mathbf{X}^{[t]})^{\top}&-z\mathbf{I}_{p}\end{matrix}\right)^{-1},\ \ \mathbf{R}^{[t]}_{0}:=\left(\begin{matrix}-\mathbf{I}_{n}&\mathbf{X}^{[t]}_{0}\\ (\mathbf{X}^{[t]}_{0})^{\top}&-z\mathbf{I}_{p}\end{matrix}\right)^{-1}.

Using first-order resolvent expansion, we obtain

𝐑0[t+1]=𝐑[t]+𝐑[t]𝐏~[t+1]𝐑[t]+(𝐑[t]𝐏~[t+1])2𝐑0[t+1].\mathbf{R}^{[t+1]}_{0}=\mathbf{R}^{[t]}+\mathbf{R}^{[t]}\widetilde{\mathbf{P}}^{[t+1]}\mathbf{R}^{[t]}+\left(\mathbf{R}^{[t]}\widetilde{\mathbf{P}}^{[t+1]}\right)^{2}\mathbf{R}^{[t+1]}_{0}. (46)
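Underlying (46) is the exact algebraic identity (M−P)^{-1} = M^{-1} + M^{-1}PM^{-1} + (M^{-1}P)^2(M−P)^{-1}. The generic check below is only an illustration: the well-conditioned test matrix M stands in for the block matrix inverted in (38), and P for the two-entry perturbation P̃^{[t+1]}.

import numpy as np

rng = np.random.default_rng(8)
d = 50
M = rng.normal(size=(d, d)) / np.sqrt(d) + 3.0 * np.eye(d)   # well-conditioned stand-in
P = np.zeros((d, d))
P[3, 7] = P[7, 3] = 0.1                                      # two-entry symmetric perturbation

R_M = np.linalg.inv(M)        # plays the role of R^[t]
R_MP = np.linalg.inv(M - P)   # plays the role of R_0^[t+1]
expansion = R_M + R_M @ P @ R_M + R_M @ P @ R_M @ P @ R_MP
print(np.abs(R_MP - expansion).max())   # ~ machine precision: the identity is exact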

The triangle inequality yields

|(𝐑0[t+1]𝐑[t])αβ||(𝐑[t]𝐏~[t+1]𝐑[t])αβ|+|(𝐑[t]𝐏~[t+1]𝐑[t]𝐏~[t+1]𝐑0[t+1])αβ|.\left|{\left(\mathbf{R}^{[t+1]}_{0}-\mathbf{R}^{[t]}\right)_{\alpha\beta}}\right|\leq\left|{\left(\mathbf{R}^{[t]}\widetilde{\mathbf{P}}^{[t+1]}\mathbf{R}^{[t]}\right)_{\alpha\beta}}\right|+\left|{\left(\mathbf{R}^{[t]}\widetilde{\mathbf{P}}^{[t+1]}\mathbf{R}^{[t]}\widetilde{\mathbf{P}}^{[t+1]}\mathbf{R}^{[t+1]}_{0}\right)_{\alpha\beta}}\right|.

Note that 𝐏~[t+1]\widetilde{\mathbf{P}}^{[t+1]} has only two non-zero entries,

(𝐑[t]𝐏~[t+1]𝐑[t])αβ=1,2𝐑α1[t]𝐏~12[t+1]𝐑2β[t]=Xit+1αt+1(𝐑αit+1[t]𝐑αt+1β[t]+𝐑ααt+1[t]𝐑it+1β[t])\left(\mathbf{R}^{[t]}\widetilde{\mathbf{P}}^{[t+1]}\mathbf{R}^{[t]}\right)_{\alpha\beta}=\sum_{\ell_{1},\ell_{2}}\mathbf{R}^{[t]}_{\alpha\ell_{1}}\widetilde{\mathbf{P}}^{[t+1]}_{\ell_{1}\ell_{2}}\mathbf{R}^{[t]}_{\ell_{2}\beta}=X_{i_{t+1}\alpha_{t+1}}\left(\mathbf{R}^{[t]}_{\alpha i_{t+1}}\mathbf{R}^{[t]}_{\alpha_{t+1}\beta}+\mathbf{R}^{[t]}_{\alpha\alpha_{t+1}}\mathbf{R}^{[t]}_{i_{t+1}\beta}\right)

Recall that |Xit+1αt+1|n1/2+ε|X_{i_{t+1}\alpha_{t+1}}|\leq n^{-1/2+\varepsilon} with overwhelming probability thanks to the sub-exponential decay assumption (2). Then on the event t{\mathcal{E}}_{t}, we have

|(𝐑[t]𝐏~[t+1]𝐑[t])αβ|2C0Ψn12+ε.\left|{\left(\mathbf{R}^{[t]}\widetilde{\mathbf{P}}^{[t+1]}\mathbf{R}^{[t]}\right)_{\alpha\beta}}\right|\leq 2C_{0}\Psi n^{-\frac{1}{2}+\varepsilon}.

Similarly,

(𝐑[t]𝐏~[t+1]𝐑[t]𝐏~[t+1]𝐑0[t+1])αβ={m1,m2},{m3,m4}={it+1,αt+1}𝐑αm1[t]𝐏~m1m2[t+1]𝐑m2m3[t]𝐏~m3m4[t+1](𝐑0[t+1])m4β.\left(\mathbf{R}^{[t]}\widetilde{\mathbf{P}}^{[t+1]}\mathbf{R}^{[t]}\widetilde{\mathbf{P}}^{[t+1]}\mathbf{R}^{[t+1]}_{0}\right)_{\alpha\beta}=\sum_{\{m_{1},m_{2}\},\{m_{3},m_{4}\}=\{i_{t+1},\alpha_{t+1}\}}\mathbf{R}^{[t]}_{\alpha m_{1}}\widetilde{\mathbf{P}}^{[t+1]}_{m_{1}m_{2}}\mathbf{R}^{[t]}_{m_{2}m_{3}}\widetilde{\mathbf{P}}^{[t+1]}_{m_{3}m_{4}}(\mathbf{R}^{[t+1]}_{0})_{m_{4}\beta}.

We use the trivial bound |𝐑0[t+1]|η1|\mathbf{R}^{[t+1]}_{0}|\leq\eta^{-1} for the last term. Then, on the event t{\mathcal{E}}_{t}, we have

|(𝐑[t]𝐏~[t+1]𝐑[t]𝐏~[t+1]𝐑0[t+1])αβ|2n1+2εη1(Ψ2+C02)Ψ.\left|{\left(\mathbf{R}^{[t]}\widetilde{\mathbf{P}}^{[t+1]}\mathbf{R}^{[t]}\widetilde{\mathbf{P}}^{[t+1]}\mathbf{R}^{[t+1]}_{0}\right)_{\alpha\beta}}\right|\leq 2n^{-1+2\varepsilon}\eta^{-1}(\Psi^{2}+C_{0}^{2})\ll\Psi.

Therefore, we have shown that, on the event t{\mathcal{E}}_{t},

maxαβ|(𝐑0[t+1])αβ|2Ψ,maxα|(𝐑0[t+1])αα|2C0.\max_{\alpha\neq\beta}\left|{(\mathbf{R}^{[t+1]}_{0})_{\alpha\beta}}\right|\leq 2\Psi,\ \ \max_{\alpha}\left|{(\mathbf{R}^{[t+1]}_{0})_{\alpha\alpha}}\right|\leq 2C_{0}. (47)

Similarly, using the first-order resolvent expansion for 𝐑[t+1]\mathbf{R}^{[t+1]} around 𝐑[t]\mathbf{R}^{[t]}, we have

𝐑[t+1]=𝐑[t]+𝐑[t](𝐏~[t+1]𝐐~[t+1])𝐑[t]+(𝐑[t](𝐏~[t+1]𝐐~[t+1]))2𝐑[t+1].\mathbf{R}^{[t+1]}=\mathbf{R}^{[t]}+\mathbf{R}^{[t]}(\widetilde{\mathbf{P}}^{[t+1]}-\widetilde{\mathbf{Q}}^{[t+1]})\mathbf{R}^{[t]}+\left(\mathbf{R}^{[t]}(\widetilde{\mathbf{P}}^{[t+1]}-\widetilde{\mathbf{Q}}^{[t+1]})\right)^{2}\mathbf{R}^{[t+1]}.

By the same arguments as above, on the event t{\mathcal{E}}_{t}, we can derive

maxαβ|𝐑αβ[t+1]|2Ψ,maxα|𝐑αα[t+1]|2C0.\max_{\alpha\neq\beta}\left|{\mathbf{R}^{[t+1]}_{\alpha\beta}}\right|\leq 2\Psi,\ \ \max_{\alpha}\left|{\mathbf{R}^{[t+1]}_{\alpha\alpha}}\right|\leq 2C_{0}.

Next, we use the resolvent identity (or zeroth-order expansion) for 𝐑[t+1]\mathbf{R}^{[t+1]} and 𝐑0[t+1]\mathbf{R}^{[t+1]}_{0}:

𝐑[t+1]=𝐑0[t+1]𝐑0[t+1]𝐐~[t+1]𝐑[t+1].\mathbf{R}^{[t+1]}=\mathbf{R}^{[t+1]}_{0}-\mathbf{R}^{[t+1]}_{0}\widetilde{\mathbf{Q}}^{[t+1]}\mathbf{R}^{[t+1]}.

This leads to

|(𝐑[t+1]𝐑0[t+1])αβ|=|{1,2}={it+1αt+1}(𝐑0[t+1])α1𝐐~12[t+1]𝐑2β[t+1]|\left|{\left(\mathbf{R}^{[t+1]}-\mathbf{R}^{[t+1]}_{0}\right)_{\alpha\beta}}\right|=\left|{\sum_{\{\ell_{1},\ell_{2}\}=\{i_{t+1}\alpha_{t+1}\}}(\mathbf{R}^{[t+1]}_{0})_{\alpha\ell_{1}}\widetilde{\mathbf{Q}}^{[t+1]}_{\ell_{1}\ell_{2}}\mathbf{R}^{[t+1]}_{\ell_{2}\beta}}\right|

Thus, on the event t{\mathcal{E}}_{t}, we conclude

|(𝐑[t+1]𝐑0[t+1])αβ|4n12+ε(Ψ2+C0Ψ𝟙((t+1)Tαβ))=:𝔣αβ[t+1]\left|{\left(\mathbf{R}^{[t+1]}-\mathbf{R}^{[t+1]}_{0}\right)_{\alpha\beta}}\right|\leq 4n^{-\frac{1}{2}+\varepsilon}\left(\Psi^{2}+C_{0}\Psi\mathbbm{1}_{((t+1)\in T_{\alpha\beta})}\right)=:\mathfrak{f}_{\alpha\beta}^{[t+1]} (48)

Meanwhile, the second-order resolvent expansion of 𝐑[t+1]\mathbf{R}^{[t+1]} around 𝐑0[t+1]\mathbf{R}^{[t+1]}_{0} yields

𝐑[t+1]=𝐑0[t+1]𝐑0[t+1]𝐐~[t+1]𝐑0[t+1]+(𝐑0[t+1]𝐐~[t+1])2𝐑0[t+1](𝐑0[t+1]𝐐~[t+1])3𝐑[t+1].\mathbf{R}^{[t+1]}=\mathbf{R}^{[t+1]}_{0}-\mathbf{R}^{[t+1]}_{0}\widetilde{\mathbf{Q}}^{[t+1]}\mathbf{R}^{[t+1]}_{0}+\left(\mathbf{R}^{[t+1]}_{0}\widetilde{\mathbf{Q}}^{[t+1]}\right)^{2}\mathbf{R}^{[t+1]}_{0}-\left(\mathbf{R}^{[t+1]}_{0}\widetilde{\mathbf{Q}}^{[t+1]}\right)^{3}\mathbf{R}^{[t+1]}.

A key observation is that 𝐑0[t+1]\mathbf{R}^{[t+1]}_{0} is t{\mathcal{F}}_{t}-measurable, and 𝔼[𝐐~[t+1]|t]=0\mathbb{E}[\widetilde{\mathbf{Q}}^{[t+1]}|{\mathcal{F}}_{t}]=0. For simplicity of notations, we set

𝔮αβ[t]:=((𝐑0[t]𝐄~(it,αt))2𝐑0[t])αβ\mathfrak{q}_{\alpha\beta}^{[t]}:=\left((\mathbf{R}^{[t]}_{0}\widetilde{\mathbf{E}}^{(i_{t},\alpha_{t})})^{2}\mathbf{R}^{[t]}_{0}\right)_{\alpha\beta}

where 𝐄~(it,αt)||×||\widetilde{\mathbf{E}}^{(i_{t},\alpha_{t})}\in\mathbb{R}^{|{\mathcal{I}}|\times|{\mathcal{I}}|} is the symmetrization of the matrix 𝐄(it,αt)n×p\mathbf{E}^{(i_{t},\alpha_{t})}\in\mathbb{R}^{n\times p} whose elements are all 0 except 𝐄itαt(it,αt)=1\mathbf{E}^{(i_{t},\alpha_{t})}_{i_{t}\alpha_{t}}=1. Then we have

|𝔼[𝐑αβ[t+1]|t](𝐑0[t+1])αβp1𝔮αβ[t+1]|32n32+3ε(Ψ2C02+C04𝟙((t+1)Tαβ))=:𝔤αβ[t+1].\left|{\mathbb{E}\left[\mathbf{R}^{[t+1]}_{\alpha\beta}|{\mathcal{F}}_{t}\right]-(\mathbf{R}^{[t+1]}_{0})_{\alpha\beta}-p^{-1}\mathfrak{q}_{\alpha\beta}^{[t+1]}}\right|\leq 32n^{-\frac{3}{2}+3\varepsilon}\left(\Psi^{2}C_{0}^{2}+C_{0}^{4}\mathbbm{1}_{((t+1)\in T_{\alpha\beta})}\right)=:\mathfrak{g}_{\alpha\beta}^{[t+1]}. (49)

Similarly, using resolvent expansion of 𝐑[t]\mathbf{R}^{[t]} around 𝐑0[t+1]\mathbf{R}^{[t+1]}_{0}, we obtain

𝐑[t]=𝐑0[t+1]𝐑0[t+1]𝐏~[t+1]𝐑0[t+1]+(𝐑0[t+1]𝐏~[t+1])2𝐑0[t+1](𝐑0[t+1]𝐏~[t+1])3𝐑[t].\mathbf{R}^{[t]}=\mathbf{R}^{[t+1]}_{0}-\mathbf{R}^{[t+1]}_{0}\widetilde{\mathbf{P}}^{[t+1]}\mathbf{R}^{[t+1]}_{0}+(\mathbf{R}^{[t+1]}_{0}\widetilde{\mathbf{P}}^{[t+1]})^{2}\mathbf{R}^{[t+1]}_{0}-(\mathbf{R}^{[t+1]}_{0}\widetilde{\mathbf{P}}^{[t+1]})^{3}\mathbf{R}^{[t]}.

By the same arguments as above, on the event t{\mathcal{E}}_{t}, we deduce that

|𝐑αβ[t](𝐑0[t+1])αβ+𝐗it+1αt+1𝔭αβ[t+1]𝐗it+1αt+12𝔮αβ[t+1]|𝔤αβ[t+1]\left|{\mathbf{R}^{[t]}_{\alpha\beta}-(\mathbf{R}^{[t+1]}_{0})_{\alpha\beta}+\mathbf{X}_{i_{t+1}\alpha_{t+1}}\mathfrak{p}_{\alpha\beta}^{[t+1]}-\mathbf{X}_{i_{t+1}\alpha_{t+1}}^{2}\mathfrak{q}_{\alpha\beta}^{[t+1]}}\right|\leq\mathfrak{g}_{\alpha\beta}^{[t+1]} (50)

where

𝔭αβ[t]:=(𝐑0[t]𝐄~(it,αt)𝐑0[t])αβ.\mathfrak{p}_{\alpha\beta}^{[t]}:=\left(\mathbf{R}^{[t]}_{0}\widetilde{\mathbf{E}}^{(i_{t},\alpha_{t})}\mathbf{R}^{[t]}_{0}\right)_{\alpha\beta}. (51)

Combining (49) and (50) yields

|𝔼[𝐑αβ[t+1]|t]𝐑αβ[t]𝐗it+1αt+1𝔭αβ[t+1]+(𝐗it+1αt+12p1)𝔮αβ[t+1]|2𝔤αβ[t+1].\left|{\mathbb{E}\left[\mathbf{R}^{[t+1]}_{\alpha\beta}|{\mathcal{F}}_{t}\right]-\mathbf{R}^{[t]}_{\alpha\beta}-\mathbf{X}_{i_{t+1}\alpha_{t+1}}\mathfrak{p}_{\alpha\beta}^{[t+1]}+(\mathbf{X}_{i_{t+1}\alpha_{t+1}}^{2}-p^{-1})\mathfrak{q}_{\alpha\beta}^{[t+1]}}\right|\leq 2\mathfrak{g}_{\alpha\beta}^{[t+1]}. (52)

By a telescopic summation, we obtain

𝐑αβ[k]𝐑αβ\displaystyle\mathbf{R}^{[k]}_{\alpha\beta}-\mathbf{R}_{\alpha\beta} =t=0k1(𝐑αβ[t+1]𝐑αβ[t])\displaystyle=\sum_{t=0}^{k-1}\left(\mathbf{R}^{[t+1]}_{\alpha\beta}-\mathbf{R}^{[t]}_{\alpha\beta}\right) (53)
=t=0k1(𝐑αβ[t+1]𝔼[𝐑αβ[t+1]|t])+t=0k1𝐗it+1αt+1𝔭αβ[t+1]t=0k1(𝐗it+1αt+12p1)𝔮αβ[t+1]+𝔯αβ\displaystyle=\sum_{t=0}^{k-1}\left(\mathbf{R}^{[t+1]}_{\alpha\beta}-\mathbb{E}\left[\mathbf{R}^{[t+1]}_{\alpha\beta}|{\mathcal{F}}_{t}\right]\right)+\sum_{t=0}^{k-1}\mathbf{X}_{i_{t+1}\alpha_{t+1}}\mathfrak{p}_{\alpha\beta}^{[t+1]}-\sum_{t=0}^{k-1}(\mathbf{X}_{i_{t+1}\alpha_{t+1}}^{2}-p^{-1})\mathfrak{q}_{\alpha\beta}^{[t+1]}+\mathfrak{r}_{\alpha\beta}

where, by (52), the remainder 𝔯αβ\mathfrak{r}_{\alpha\beta} satisfies

|𝔯αβ|2t=0k1𝔤αβ[t+1].|\mathfrak{r}_{\alpha\beta}|\leq 2\sum_{t=0}^{k-1}\mathfrak{g}_{\alpha\beta}^{[t+1]}.

Recalling the expression of 𝔤αβ[t]\mathfrak{g}_{\alpha\beta}^{[t]}, to estimate the remainder we need to control the size of the set TαβT_{\alpha\beta}. Note that 𝔼[|Tαβ|]=2k/p.\mathbb{E}\left[|T_{\alpha\beta}|\right]=2k/p. By a Bernstein-type inequality (see e.g. [Cha07, Proposition 1.1]), for any x>0x>0, we have

(|Tαβ|𝔼[|Tαβ|]+x)exp(x24𝔼[|Tαβ|]+2x)\mathbb{P}\left(|T_{\alpha\beta}|\geq\mathbb{E}\left[|T_{\alpha\beta}|\right]+x\right)\leq\exp\left(-\frac{x^{2}}{4\mathbb{E}\left[|T_{\alpha\beta}|\right]+2x}\right)

Recall that kn5/3ϵ0k\leq n^{5/3-\epsilon_{0}}. The inequality together with a union bound implies that

maxα,β|Tαβ|3max(k,p(logn)2)p=:𝖳\max_{\alpha,\beta}|T_{\alpha\beta}|\leq\frac{3\max(k,p(\log n)^{2})}{p}=:{\mathsf{T}}

with overwhelming probability. We denote this event by 𝒯{\mathcal{T}}. On the event 𝒯{\mathcal{T}}, we have

|𝔯αβ|2kn32+3εΨ2C02+2n32+3εC04𝖳n3ε𝖳Ψ2.|\mathfrak{r}_{\alpha\beta}|\leq 2kn^{-\frac{3}{2}+3\varepsilon}\Psi^{2}C_{0}^{2}+2n^{-\frac{3}{2}+3\varepsilon}C_{0}^{4}{\mathsf{T}}\leq n^{3\varepsilon}\sqrt{{\mathsf{T}}}\Psi^{2}. (54)
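For intuition on the size of T_{αβ} controlled above, note that for a pair of Greek indices T_{αβ} collects the resampled positions whose column index lies in {α, β}, so its maximum over α ≠ β is the sum of the two largest column counts. A quick simulation (with illustrative parameters) compares this maximum with the bound 3max(k, p(log n)²)/p:

import numpy as np

rng = np.random.default_rng(5)
n, p = 500, 400
k = int(n ** 1.5)   # some k below n^{5/3 - eps_0}

flat = rng.choice(n * p, size=k, replace=False)
rows, cols = np.unravel_index(flat, (n, p))

col_counts = np.bincount(cols, minlength=p)
max_T_greek = np.sort(col_counts)[-2:].sum()   # max over alpha != beta of |T_{alpha beta}|

bound = 3 * max(k, p * np.log(n) ** 2) / p
print(max_T_greek, bound)   # the observed maximum is typically well below the stated bound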

For the first term in (53), we set

𝔴αβ[t+1]:=(𝐑αβ[t+1]𝔼[𝐑αβ[t+1]|t])𝟙t.\mathfrak{w}_{\alpha\beta}^{[t+1]}:=\left(\mathbf{R}^{[t+1]}_{\alpha\beta}-\mathbb{E}\left[\mathbf{R}^{[t+1]}_{\alpha\beta}|{\mathcal{F}}_{t}\right]\right)\mathbbm{1}_{{\mathcal{E}}_{t}}.

Note that tt{\mathcal{E}}_{t}\in{\mathcal{F}}_{t}. This implies that 𝔼[𝔴αβ[t+1]|t]=0\mathbb{E}[\mathfrak{w}_{\alpha\beta}^{[t+1]}|{\mathcal{F}}_{t}]=0. Moreover, by (48), on the event t{\mathcal{E}}_{t} we have |𝔴αβ[t+1]|2𝔣αβ[t+1]|\mathfrak{w}_{\alpha\beta}^{[t+1]}|\leq 2\mathfrak{f}_{\alpha\beta}^{[t+1]}. Further, on the event 𝒯{\mathcal{T}},

(t=0k1(𝔣αβ[t+1])2)1/2n12+εΨ2k+n12+εC0Ψ𝖳2nεΨ2𝖳.\left(\sum_{t=0}^{k-1}(\mathfrak{f}_{\alpha\beta}^{[t+1]})^{2}\right)^{1/2}\leq n^{-\frac{1}{2}+\varepsilon}\Psi^{2}\sqrt{k}+n^{-\frac{1}{2}+\varepsilon}C_{0}\Psi\sqrt{{\mathsf{T}}}\leq 2n^{\varepsilon}\Psi^{2}\sqrt{{\mathsf{T}}}.

Using the Azuma-Hoeffding inequality, for any x0x\geq 0, we have

(|t=0k1𝔴αβ[t+1]|2nεΨ2𝖳x)2exp(x22).\mathbb{P}\left(\left|{\sum_{t=0}^{k-1}\mathfrak{w}_{\alpha\beta}^{[t+1]}}\right|\geq 2n^{\varepsilon}\Psi^{2}\sqrt{{\mathsf{T}}}x\right)\leq 2\exp\left(-\frac{x^{2}}{2}\right).

Moreover,

(|t=0k1(𝐑αβ[t+1]𝔼[𝐑αβ[t+1]|t])|2nεΨ2𝖳x)(|t=0k1𝔴αβ[t+1]|2nεΨ2𝖳x)+t=0k1(tc).\mathbb{P}\left(\left|{\sum_{t=0}^{k-1}\left(\mathbf{R}^{[t+1]}_{\alpha\beta}-\mathbb{E}\left[\mathbf{R}^{[t+1]}_{\alpha\beta}|{\mathcal{F}}_{t}\right]\right)}\right|\geq 2n^{\varepsilon}\Psi^{2}\sqrt{{\mathsf{T}}}x\right)\leq\mathbb{P}\left(\left|{\sum_{t=0}^{k-1}\mathfrak{w}_{\alpha\beta}^{[t+1]}}\right|\geq 2n^{\varepsilon}\Psi^{2}\sqrt{{\mathsf{T}}}x\right)+\sum_{t=0}^{k-1}\mathbb{P}({\mathcal{E}}_{t}^{c}).

Recall that t{\mathcal{E}}_{t} holds with overwhelming probability, and consequently t=0k1(tc)nD\sum_{t=0}^{k-1}\mathbb{P}({\mathcal{E}}_{t}^{c})\leq n^{-D} for any D>0D>0. Choosing x=nεx=n^{\varepsilon} implies that with overwhelming probability

|t=0k1(𝐑αβ[t+1]𝔼[𝐑αβ[t+1]|t])|2n2εΨ2𝖳.\left|{\sum_{t=0}^{k-1}\left(\mathbf{R}^{[t+1]}_{\alpha\beta}-\mathbb{E}\left[\mathbf{R}^{[t+1]}_{\alpha\beta}|{\mathcal{F}}_{t}\right]\right)}\right|\leq 2n^{2\varepsilon}\Psi^{2}\sqrt{{\mathsf{T}}}. (55)

We deal with the next two terms in (53) by introducing a backward filtration. Let t{\mathcal{F}}_{t}^{\prime} be the σ\sigma-algebra generated by the random variables 𝐗\mathbf{X}^{\prime}, SkS_{k} and {𝐗iα}\{\mathbf{X}_{i\alpha}\} with (i,α){(i1,α1),,(it,αt)}(i,\alpha)\notin\left\{(i_{1},\alpha_{1}),\dots,(i_{t},\alpha_{t})\right\}. Similarly to the above, we consider the event t{\mathcal{E}}_{t}^{\prime} that for all z=E+iηz=E+\mathrm{i}\eta with |zλ+|n2/3+δ|z-\lambda_{+}|\leq n^{-2/3+\delta} and η=n2/3δ\eta=n^{-2/3-\delta} we have

maxab|𝐑ab[t](z)|Ψ,andmaxa|𝐑aa[t](z)|C0.\max_{a\neq b}\left|{\mathbf{R}^{[t]}_{ab}(z)}\right|\leq\Psi,\ \ \mbox{and}\ \ \max_{a}\left|{\mathbf{R}^{[t]}_{aa}(z)}\right|\leq C_{0}.

Using resolvent expansion, the same arguments for (47) yield that, on the event t{\mathcal{E}}_{t}^{\prime}, we have

maxαβ|(𝐑0[t])αβ|2Ψ,maxα|(𝐑0[t])αα|2C0.\max_{\alpha\neq\beta}\left|{(\mathbf{R}^{[t]}_{0})_{\alpha\beta}}\right|\leq 2\Psi,\ \ \max_{\alpha}\left|{(\mathbf{R}^{[t]}_{0})_{\alpha\alpha}}\right|\leq 2C_{0}.

A key observation is that 𝔭αβ[t]\mathfrak{p}_{\alpha\beta}^{[t]} defined in (51) is t{\mathcal{F}}_{t}^{\prime}-measurable. Also, we have 𝔼[𝐗itαt|t]=0\mathbb{E}[\mathbf{X}_{i_{t}\alpha_{t}}|{\mathcal{F}}_{t}^{\prime}]=0. Consider

𝔭~αβ[t]:=𝐗itαt𝔭αβ[t]𝟙t.\widetilde{\mathfrak{p}}_{\alpha\beta}^{[t]}:=\mathbf{X}_{i_{t}\alpha_{t}}\mathfrak{p}_{\alpha\beta}^{[t]}\mathbbm{1}_{{\mathcal{E}}_{t}^{\prime}}.

Then we have 𝔼[𝔭~αβ[t]|t]=0\mathbb{E}[\widetilde{\mathfrak{p}}_{\alpha\beta}^{[t]}|{\mathcal{F}}_{t}^{\prime}]=0 since we also have tt{\mathcal{E}}_{t}^{\prime}\in{\mathcal{F}}_{t}^{\prime}. Note that

(|t=0k1𝐗it+1αt+1𝔭αβ[t+1]|x)(|t=0k1𝔭~αβ[t+1]|x)+t=0k1((t+1)c),\mathbb{P}\left(\left|{\sum_{t=0}^{k-1}\mathbf{X}_{i_{t+1}\alpha_{t+1}}\mathfrak{p}_{\alpha\beta}^{[t+1]}}\right|\geq x\right)\leq\mathbb{P}\left(\left|{\sum_{t=0}^{k-1}\widetilde{\mathfrak{p}}_{\alpha\beta}^{[t+1]}}\right|\geq x\right)+\sum_{t=0}^{k-1}\mathbb{P}(({\mathcal{E}}_{t+1}^{\prime})^{c}),

The second term is negligible since t{\mathcal{E}}_{t}^{\prime} holds with overwhelming probability. To estimate the first term, we use the Azuma-Hoeffding inequality as before. By arguments similar to those for (48), we deduce

|𝔭~αβ[t]|4n12+ε(Ψ2+C0Ψ𝟙(tTαβ)).\left|{\widetilde{\mathfrak{p}}_{\alpha\beta}^{[t]}}\right|\leq 4n^{-\frac{1}{2}+\varepsilon}\left(\Psi^{2}+C_{0}\Psi\mathbbm{1}_{(t\in T_{\alpha\beta})}\right).

By considering the event 𝒯{\mathcal{T}} and using Azuma-Hoeffding inequality as in (55), we can conclude that with overwhelming probability,

|t=0k1𝔭~αβ[t+1]|n2εΨ2𝖳\left|{\sum_{t=0}^{k-1}\widetilde{\mathfrak{p}}_{\alpha\beta}^{[t+1]}}\right|\leq n^{2\varepsilon}\Psi^{2}\sqrt{{\mathsf{T}}}

As a consequence, with overwhelming probability

|t=0k1𝐗it+1αt+1𝔭αβ[t+1]|n2εΨ2𝖳.\left|{\sum_{t=0}^{k-1}\mathbf{X}_{i_{t+1}\alpha_{t+1}}\mathfrak{p}_{\alpha\beta}^{[t+1]}}\right|\lesssim n^{2\varepsilon}\Psi^{2}\sqrt{{\mathsf{T}}}. (56)

For the third term in (53), by the same arguments, we have

|t=0k1(𝐗it+1αt+12p1)𝔮αβ[t+1]|n2εΨ2𝖳.\left|{\sum_{t=0}^{k-1}(\mathbf{X}_{i_{t+1}\alpha_{t+1}}^{2}-p^{-1})\mathfrak{q}_{\alpha\beta}^{[t+1]}}\right|\lesssim n^{2\varepsilon}\Psi^{2}\sqrt{{\mathsf{T}}}. (57)

Finally, combining (53), (54), (55), (56) and (57), we have shown that

|𝐑αβ[k](z)𝐑αβ(z)|n3εΨ2𝖳.\left|{\mathbf{R}^{[k]}_{\alpha\beta}(z)-\mathbf{R}_{\alpha\beta}(z)}\right|\lesssim n^{3\varepsilon}\Psi^{2}\sqrt{{\mathsf{T}}}.

Recall that η=n2/3δ\eta=n^{-2/3-\delta}, Ψ=O(n13+δ+ε)\Psi=O(n^{-\frac{1}{3}+\delta+\varepsilon}), and 𝖳=O(n23ϵ0){\mathsf{T}}=O(n^{\frac{2}{3}-\epsilon_{0}}). Then we obtain

nη|𝐑αβ[k](z)𝐑αβ(z)|nϵ02+δ+5ε.n\eta\left|{\mathbf{R}^{[k]}_{\alpha\beta}(z)-\mathbf{R}_{\alpha\beta}(z)}\right|\leq n^{-\frac{\epsilon_{0}}{2}+\delta+5\varepsilon}. (58)

Choosing δ+5ε<ϵ02\delta+5\varepsilon<\frac{\epsilon_{0}}{2} yields the desired bound (45) for a fixed zz.

To extend this estimate to a uniform one, we invoke a standard net argument. We divide the interval [n2/3+δ,n2/3+δ][-n^{-2/3+\delta},n^{-2/3+\delta}] into n2n^{2} sub-intervals, and consider z=E+iηz=E+\mathrm{i}\eta with Eλ+E-\lambda_{+} taking values in each sub-interval. Note that

|𝐑αβ(z1)𝐑αβ(z2)||z1z2|min(Im(z1),Im(z2))2.|\mathbf{R}_{\alpha\beta}(z_{1})-\mathbf{R}_{\alpha\beta}(z_{2})|\leq\frac{|z_{1}-z_{2}|}{\min(\text{\rm Im}{\,}(z_{1}),\text{\rm Im}{\,}(z_{2}))^{2}}.

For z1z_{1}, z2z_{2} associated with the same sub-interval, we have

nη|𝐑αβ(z1)𝐑αβ(z2)|nηn2/3+δn2η2n1+2δ,n\eta|\mathbf{R}_{\alpha\beta}(z_{1})-\mathbf{R}_{\alpha\beta}(z_{2})|\leq n\eta\frac{n^{-2/3+\delta}n^{-2}}{\eta^{2}}\leq n^{-1+2\delta},

which is of lower order compared with the error bound in (58). This shows that, up to a small multiplicative factor, the desired error bound (45) holds uniformly in each sub-interval with overwhelming probability. Finally, thanks to the overwhelming probability, a union bound over the n2n^{2} sub-intervals yields the desired uniform estimate (45) for all z=E+iηz=E+\mathrm{i}\eta with |Eλ+|n2/3+δ|E-\lambda_{+}|\leq n^{-2/3+\delta} and η=n2/3δ\eta=n^{-2/3-\delta}.

Using the same arguments, we can prove a similar bound for the 𝐑ij[k]\mathbf{R}^{[k]}_{ij} and 𝐑ij\mathbf{R}_{ij} blocks. Hence, we have shown the desired results. ∎

D.3 Stability of the top eigenvalue

As a consequence of the stability of the resolvent, we also have the stability of the top eigenvalue. This stability of the eigenvalue will play a crucial role in the resolvent approximation of eigenvector statistics in the next subsection.

Lemma D.4.

Assume kn5/3ϵ0k\leq n^{5/3-\epsilon_{0}} for some ϵ0>0\epsilon_{0}>0. Let 0<δ<δ00<\delta<\delta_{0} with δ0\delta_{0} as in Lemma D.3. For any ε>0\varepsilon>0, with overwhelming probability, we have

|λλ[k]|n23δ+ε.\left|{\lambda-\lambda^{[k]}}\right|\leq n^{-\frac{2}{3}-\delta+\varepsilon}.
Proof.

Without loss of generality, we assume that λ>λ[k]\lambda>\lambda^{[k]}. Set η=n2/3δ\eta=n^{-2/3-\delta}. By the spectral representation of the resolvent (39), we have

Im𝐑αα(z)=η=1p|𝐯(α)|2(λE)2+η2η|𝐯(α)|2(λE)2+η2η|𝐯(α)|22(max(|λE|,η))2.\text{\rm Im}{\,}\mathbf{R}_{\alpha\alpha}(z)=\eta\sum_{\ell=1}^{p}\frac{|\mathbf{v}_{\ell}(\alpha)|^{2}}{(\lambda_{\ell}-E)^{2}+\eta^{2}}\geq\frac{\eta|\mathbf{v}(\alpha)|^{2}}{(\lambda-E)^{2}+\eta^{2}}\geq\frac{\eta|\mathbf{v}(\alpha)|^{2}}{2\left(\max(|\lambda-E|,\eta)\right)^{2}}.

By the pigeonhole principle, we know that there exists α\alpha such that |𝐯(α)|p1/2|\mathbf{v}(\alpha)|\geq p^{-1/2}. Choosing this α\alpha and z=λ+iηz=\lambda+\mathrm{i}\eta, we obtain

pη1Im𝐑αα(λ+iη)12η2.p\eta^{-1}\text{\rm Im}{\,}\mathbf{R}_{\alpha\alpha}(\lambda+\mathrm{i}\eta)\geq\frac{1}{2\eta^{2}}. (59)

On the other hand, using the spectral representation of resolvent again for 𝐑[k]\mathbf{R}^{[k]}, we have

pη1Im𝐑ββ[k](z)=m=1pp|𝐯m[k](β)|2(λm[k]λ)2+η2.p\eta^{-1}\text{\rm Im}{\,}\mathbf{R}^{[k]}_{\beta\beta}(z)=\sum_{m=1}^{p}\frac{p|\mathbf{v}^{[k]}_{m}(\beta)|^{2}}{\big{(}\lambda^{[k]}_{m}-\lambda\big{)}^{2}+\eta^{2}}.

Picking ω>0\omega>0, we decompose the summation into two parts

J1=m=1nωp|𝐯m[k](β)|2(λm[k]λ)2+η2,J2=m=nω+1pp|𝐯m[k](β)|2(λm[k]λ)2+η2.J_{1}=\sum_{m=1}^{n^{\omega}}\frac{p|\mathbf{v}^{[k]}_{m}(\beta)|^{2}}{\big{(}\lambda^{[k]}_{m}-\lambda\big{)}^{2}+\eta^{2}},\ \ J_{2}=\sum_{m=n^{\omega}+1}^{p}\frac{p|\mathbf{v}^{[k]}_{m}(\beta)|^{2}}{\big{(}\lambda^{[k]}_{m}-\lambda\big{)}^{2}+\eta^{2}}.

Using delocalization of eigenvectors, for any ε>0\varepsilon>0, with overwhelming probability, we have

J1nω+ε(min1mp|λm[k]λ|)2.J_{1}\lesssim\frac{n^{\omega+\varepsilon}}{(\min_{1\leq m\leq p}|\lambda^{[k]}_{m}-\lambda|)^{2}}. (60)

By the Tracy-Widom limit of the top eigenvalue (Lemma B.2), for any ε>0\varepsilon>0, with overwhelming probability, we have |λλ+|n2/3+ε|\lambda-\lambda_{+}|\leq n^{-2/3+\varepsilon}. Also, as discussed in (17), the rigidity of eigenvalues yields that for all mnωm\geq n^{\omega}, with overwhelming probability,

λλm[k]m2/3p2/3.\lambda-\lambda^{[k]}_{m}\gtrsim m^{2/3}p^{-2/3}.

Then using delocalization again, with overwhelming probability, we have

J2m=nω+1pnε(λm[k]λ)2nε(nω)1/3n4/3.J_{2}\leq\sum_{m=n^{\omega}+1}^{p}\frac{n^{\varepsilon}}{(\lambda^{[k]}_{m}-\lambda)^{2}}\lesssim n^{\varepsilon}(n^{\omega})^{-1/3}n^{4/3}. (61)

Again, since |λ[k]λ|2n2/3+ε|\lambda^{[k]}-\lambda|\leq 2n^{-2/3+\varepsilon}, choosing ω=2ε\omega=2\varepsilon ensures that the bound (61) on J2J_{2} is dominated by the bound (60) on J1J_{1}. Therefore, by (60) and (61), we have shown that with overwhelming probability

pη1Im𝐑ββ[k](λ+iη)n3ε(min1mp|λm[k]λ|)2.p\eta^{-1}\text{\rm Im}{\,}\mathbf{R}^{[k]}_{\beta\beta}(\lambda+\mathrm{i}\eta)\lesssim n^{3\varepsilon}\left(\min_{1\leq m\leq p}|\lambda^{[k]}_{m}-\lambda|\right)^{-2}.

Note that, since λ>λ[k]λm[k]\lambda>\lambda^{[k]}\geq\lambda^{[k]}_{m} for all mm, the minimum is attained by λ[k]\lambda^{[k]}. This shows that

nη1Im𝐑αα[k](λ+iη)n3ε|λ[k]λ|2.n\eta^{-1}\text{\rm Im}{\,}\mathbf{R}^{[k]}_{\alpha\alpha}(\lambda+\mathrm{i}\eta)\lesssim n^{3\varepsilon}|\lambda^{[k]}-\lambda|^{-2}.

Using Lemma D.3 and (59), we have

nη1Im𝐑αα[k](λ+iη)nη1(Im𝐑αα(λ+iη)|Im𝐑αα[k](λ+iη)Im𝐑αα(λ+iη)|)12η21ncη21η2.n\eta^{-1}\text{\rm Im}{\,}\mathbf{R}^{[k]}_{\alpha\alpha}(\lambda+\mathrm{i}\eta)\geq n\eta^{-1}\left(\text{\rm Im}{\,}\mathbf{R}_{\alpha\alpha}(\lambda+\mathrm{i}\eta)-\left|{\text{\rm Im}{\,}\mathbf{R}^{[k]}_{\alpha\alpha}(\lambda+\mathrm{i}\eta)-\text{\rm Im}{\,}\mathbf{R}_{\alpha\alpha}(\lambda+\mathrm{i}\eta)}\right|\right)\geq\frac{1}{2\eta^{2}}-\frac{1}{n^{c}\eta^{2}}\gtrsim\frac{1}{\eta^{2}}.

Therefore, we have shown that, with overwhelming probability,

1η2n3ε1|λλ[k]|2.\frac{1}{\eta^{2}}\lesssim n^{3\varepsilon}\frac{1}{|\lambda-\lambda^{[k]}|^{2}}.

Recall η=n2/3δ\eta=n^{-2/3-\delta}, and we conclude that

|λλ[k]|n2/3δ+3ε,\left|{\lambda-\lambda^{[k]}}\right|\leq n^{-2/3-\delta+3\varepsilon},

which proves the desired result thanks to the arbitrariness of ε>0\varepsilon>0. ∎
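As a rough numerical illustration of this stability (orders of magnitude only; the Gaussian entries, the sizes and the value of k below are assumptions of the sketch), resampling k entries with k far below n^{5/3} typically moves the top eigenvalue by a small fraction of the edge scale n^{-2/3}.

import numpy as np

rng = np.random.default_rng(7)
n, p = 600, 400
k = 20   # far below n^{5/3}
X = rng.normal(0.0, 1.0 / np.sqrt(n), (n, p))

X_k = X.copy()
flat = rng.choice(n * p, size=k, replace=False)
rows, cols = np.unravel_index(flat, (n, p))
X_k[rows, cols] = rng.normal(0.0, 1.0 / np.sqrt(n), size=k)

lam = np.linalg.eigvalsh(X.T @ X)[-1]
lam_k = np.linalg.eigvalsh(X_k.T @ X_k)[-1]
print(abs(lam - lam_k), n ** (-2.0 / 3.0))   # the shift is typically a small fraction of n^{-2/3}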

D.4 Proof of Theorem 2

The final ingredient to prove the resampling stability is the following approximation lemma, which asserts that the product of entries in the eigenvector can be well approximated by the resolvent entries.

Lemma D.5.

Assume that kn5/3ϵ0k\ll n^{5/3-\epsilon_{0}} for some ϵ0>0\epsilon_{0}>0. Let 0<δ<δ00<\delta<\delta_{0} be as in Lemma D.3. Then, for z0=λ+iηz_{0}=\lambda+\mathrm{i}\eta with η=n2/3δ\eta=n^{-2/3-\delta}, there exists c>0c^{\prime}>0 such that with probability 1o(1)1-o(1) we have

maxα,β|ηIm𝐑αβ(z0)𝐯(α)𝐯(β)|n1c,andmaxα,β|ηIm𝐑αβ[k](z0)𝐯[k](α)𝐯[k](β)|n1c.\max_{\alpha,\beta}\left|{\eta\text{\rm Im}{\,}\mathbf{R}_{\alpha\beta}(z_{0})-\mathbf{v}(\alpha)\mathbf{v}(\beta)}\right|\leq n^{-1-c^{\prime}},\ \ \mbox{and}\ \ \max_{\alpha,\beta}\left|{\eta\text{\rm Im}{\,}\mathbf{R}^{[k]}_{\alpha\beta}(z_{0})-\mathbf{v}^{[k]}(\alpha)\mathbf{v}^{[k]}(\beta)}\right|\leq n^{-1-c^{\prime}}.

Similarly, we also have

maxi,j|ηIm𝐑ij(z0)z0𝐮(i)𝐮(j)|n1c,andmaxi,j|ηIm𝐑ij[k](z0)z0𝐮[k](i)𝐮[k](j)|n1c.\max_{i,j}\left|{\eta\text{\rm Im}{\,}\frac{\mathbf{R}_{ij}(z_{0})}{z_{0}}-\mathbf{u}(i)\mathbf{u}(j)}\right|\leq n^{-1-c^{\prime}},\ \ \mbox{and}\ \ \max_{i,j}\left|{\eta\text{\rm Im}{\,}\frac{\mathbf{R}^{[k]}_{ij}(z_{0})}{z_{0}}-\mathbf{u}^{[k]}(i)\mathbf{u}^{[k]}(j)}\right|\leq n^{-1-c^{\prime}}.
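Before turning to the proof, the first estimate can be checked numerically: with z_0 = λ + iη and η = n^{-2/3-δ}, the Greek-Greek block of η Im R(z_0) is entrywise close to the rank-one matrix v v^⊤. The sketch below is illustrative only; the Gaussian entries, the moderate size n and the fixed value of δ are its assumptions, and at such sizes only the order of magnitude is meaningful.

import numpy as np

rng = np.random.default_rng(4)
n, p = 600, 400
delta = 0.25
X = rng.normal(0.0, 1.0 / np.sqrt(n), (n, p))

evals, evecs = np.linalg.eigh(X.T @ X)
lam, v = evals[-1], evecs[:, -1]   # top eigenvalue and eigenvector of X^T X

eta = n ** (-2.0 / 3.0 - delta)
z0 = lam + 1j * eta
M = np.block([[-np.eye(n), X], [X.T, -z0 * np.eye(p)]])
R_greek = np.linalg.inv(M)[n:, n:]

err = np.abs(eta * R_greek.imag - np.outer(v, v)).max()
print(err, 1.0 / p)   # err is typically a small fraction of the entry scale 1/p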
Proof.

For any ε>0\varepsilon>0, we consider a general z=E+iηz=E+\mathrm{i}\eta with |Eλ+|n2/3+ε|E-\lambda_{+}|\leq n^{-2/3+\varepsilon}. From the spectral representation of the resolvent (39), we have

Im𝐑αβ(z)=η=1p𝐯(α)𝐯(β)(λE)2+η2.\text{\rm Im}{\,}\mathbf{R}_{\alpha\beta}(z)=\eta\sum_{\ell=1}^{p}\frac{\mathbf{v}_{\ell}(\alpha)\mathbf{v}_{\ell}(\beta)}{(\lambda_{\ell}-E)^{2}+\eta^{2}}.

Fixing some ω>0\omega>0, we decompose the sum on the right-hand side into three parts

=1p𝐯(α)𝐯(β)(λE)2+η2=𝐯(α)𝐯(β)(λE)2+η2+J1+J2,\sum_{\ell=1}^{p}\frac{\mathbf{v}_{\ell}(\alpha)\mathbf{v}_{\ell}(\beta)}{(\lambda_{\ell}-E)^{2}+\eta^{2}}=\frac{\mathbf{v}(\alpha)\mathbf{v}(\beta)}{(\lambda-E)^{2}+\eta^{2}}+J_{1}+J_{2},

where

J1==2nω𝐯(α)𝐯(β)(λE)2+η2,J2==nω+1p𝐯(α)𝐯(β)(λE)2+η2.J_{1}=\sum_{\ell=2}^{n^{\omega}}\frac{\mathbf{v}_{\ell}(\alpha)\mathbf{v}_{\ell}(\beta)}{(\lambda_{\ell}-E)^{2}+\eta^{2}},\ \ \ J_{2}=\sum_{\ell=n^{\omega}+1}^{p}\frac{\mathbf{v}_{\ell}(\alpha)\mathbf{v}_{\ell}(\beta)}{(\lambda_{\ell}-E)^{2}+\eta^{2}}.

Using the same arguments as in (61), for any ε>0\varepsilon>0, with overwhelming probability we have

|J2|nε(nω)1/3n1/3.|J_{2}|\lesssim n^{\varepsilon}(n^{\omega})^{-1/3}n^{1/3}.

For the term J1J_{1}, we consider the following event

:={λ1λ2c0n2/3}{max1p𝐯n1/2+ε}{|J2|nε(nω)1/3n1/3}.{\mathcal{E}}:=\left\{\lambda_{1}-\lambda_{2}\geq c_{0}n^{-2/3}\right\}\cap\left\{\max_{1\leq\ell\leq p}\|\mathbf{v}_{\ell}\|_{\infty}\leq n^{-1/2+\varepsilon}\right\}\cap\left\{|J_{2}|\lesssim n^{\varepsilon}(n^{\omega})^{-1/3}n^{1/3}\right\}.

For any ε>0\varepsilon>0, we can find an appropriate c0>0c_{0}>0 such that ()>1ε/2\mathbb{P}({\mathcal{E}})>1-\varepsilon/2. Then, for z=E+iηz=E+\mathrm{i}\eta with |λE|c02n2/3|\lambda-E|\leq\frac{c_{0}}{2}n^{-2/3}, on the event {\mathcal{E}}, we have

|J1|nεnωn1/3.|J_{1}|\lesssim n^{\varepsilon}n^{\omega}n^{1/3}.

Let δ>0\delta^{\prime}>0 with δ+δ<δ0\delta^{\prime}+\delta<\delta_{0}. On the event {\mathcal{E}}, for all z=E+iηz=E+\mathrm{i}\eta with |λE|ηnδ|\lambda-E|\leq\eta n^{-\delta^{\prime}} and η=n2/3δ\eta=n^{-2/3-\delta}, we have

|𝐯(α)𝐯(β)η2𝐯(α)𝐯(β)(λE)2+η2|n1+2ε|1η2(λE)2+η2|n1+2ε2δ.\left|{\mathbf{v}(\alpha)\mathbf{v}(\beta)-\frac{\eta^{2}\mathbf{v}(\alpha)\mathbf{v}(\beta)}{(\lambda-E)^{2}+\eta^{2}}}\right|\leq n^{-1+2\varepsilon}\left|{1-\frac{\eta^{2}}{(\lambda-E)^{2}+\eta^{2}}}\right|\leq n^{-1+2\varepsilon-2\delta^{\prime}}.

This yields

|ηIm𝐑αβ(z)𝐯(α)𝐯(β)|\displaystyle\left|{\eta\text{\rm Im}{\,}\mathbf{R}_{\alpha\beta}(z)-\mathbf{v}(\alpha)\mathbf{v}(\beta)}\right| |𝐯(α)𝐯(β)η2𝐯(α)𝐯(β)(λE)2+η2|+η2(|J1|+|J2|)\displaystyle\leq\left|{\mathbf{v}(\alpha)\mathbf{v}(\beta)-\frac{\eta^{2}\mathbf{v}(\alpha)\mathbf{v}(\beta)}{(\lambda-E)^{2}+\eta^{2}}}\right|+\eta^{2}(|J_{1}|+|J_{2}|)
n1+2ε2δ+n1+ε+ω2δ+n1+εω32δ.\displaystyle\leq n^{-1+2\varepsilon-2\delta^{\prime}}+n^{-1+\varepsilon+\omega-2\delta}+n^{-1+\varepsilon-\frac{\omega}{3}-2\delta}.

Choosing ω=ε<min(δ,δ)/2\omega=\varepsilon<\min(\delta,\delta^{\prime})/2, we obtain

maxα,β|ηIm𝐑αβ(z)𝐯(α)𝐯(β)|n1min(δ,δ).\max_{\alpha,\beta}\left|{\eta\text{\rm Im}{\,}\mathbf{R}_{\alpha\beta}(z)-\mathbf{v}(\alpha)\mathbf{v}(\beta)}\right|\leq n^{-1-\min(\delta,\delta^{\prime})}. (62)

Similarly, we can apply the same arguments to 𝐑[k]\mathbf{R}^{[k]}. Consider the event

:={maxα,β|ηIm𝐑αβ[k](z)𝐯[k](α)𝐯[k](β)|n1min(δ,δ)for all|λ[k]E|ηnδ,η=n2/3δ}.{\mathcal{E}}^{\prime}:=\left\{\max_{\alpha,\beta}\left|{\eta\text{\rm Im}{\,}\mathbf{R}^{[k]}_{\alpha\beta}(z)-\mathbf{v}^{[k]}(\alpha)\mathbf{v}^{[k]}(\beta)}\right|\leq n^{-1-\min(\delta,\delta^{\prime})}\ \mbox{for all}\ |\lambda^{[k]}-E|\leq\eta n^{-\delta^{\prime}},\eta=n^{-2/3-\delta}\right\}.

By the same arguments, we know ()>1ε/2\mathbb{P}({\mathcal{E}}^{\prime})>1-\varepsilon/2, and hence ()>1ε\mathbb{P}({\mathcal{E}}\cap{\mathcal{E}}^{\prime})>1-\varepsilon. Finally, since δ+δ<δ0\delta+\delta^{\prime}<\delta_{0}, Lemma D.4 yields that with overwhelming probability |λλ[k]|n2/3δδ=ηnδ|\lambda-\lambda^{[k]}|\leq n^{-2/3-\delta-\delta^{\prime}}=\eta n^{-\delta^{\prime}}. This implies that we may take z=λ+iηz=\lambda+\mathrm{i}\eta in both (62) and {\mathcal{E}}^{\prime}. Thus, the desired result for 𝐯\mathbf{v} and 𝐯[k]\mathbf{v}^{[k]} follows by choosing 0<c<min(δ,δ)0<c^{\prime}<\min(\delta,\delta^{\prime}).

On the other hand, from (39) we also have

Im𝐑ij(z)z=η=1n𝐮(i)𝐮(j)(λE)2+η2.\text{\rm Im}{\,}\frac{\mathbf{R}_{ij}(z)}{z}=\eta\sum_{\ell=1}^{n}\frac{\mathbf{u}_{\ell}(i)\mathbf{u}_{\ell}(j)}{(\lambda_{\ell}-E)^{2}+\eta^{2}}.

Using the same methods as above yields the desired result for 𝐮\mathbf{u} and 𝐮[k]\mathbf{u}^{[k]}. ∎

We now prove the main result, Theorem 2, on the stability of PCA under moderate resampling.

Proof of Theorem 2.

Let z0=λ+iηz_{0}=\lambda+\mathrm{i}\eta be as in Lemma D.5. By Lemmas D.3 and D.5, we know that, with probability 1o(1)1-o(1), for all α,β2\alpha,\beta\in{\mathcal{I}}_{2}, we have

|𝐯(α)𝐯(β)\displaystyle\Big{|}\mathbf{v}(\alpha)\mathbf{v}(\beta) 𝐯[k](α)𝐯[k](β)|\displaystyle-\mathbf{v}^{[k]}(\alpha)\mathbf{v}^{[k]}(\beta)\Big{|}
|𝐯(α)𝐯(β)ηIm𝐑αβ(z0)|+|ηIm𝐑αβ(z0)ηIm𝐑αβ[k](z0)|+|ηIm𝐑αβ[k](z0)𝐯[k](α)𝐯[k](β)|\displaystyle\leq\left|{\mathbf{v}(\alpha)\mathbf{v}(\beta)-\eta\text{\rm Im}{\,}\mathbf{R}_{\alpha\beta}(z_{0})}\right|+\left|{\eta\text{\rm Im}{\,}\mathbf{R}_{\alpha\beta}(z_{0})-\eta\text{\rm Im}{\,}\mathbf{R}^{[k]}_{\alpha\beta}(z_{0})}\right|+\left|{\eta\text{\rm Im}{\,}\mathbf{R}^{[k]}_{\alpha\beta}(z_{0})-\mathbf{v}^{[k]}(\alpha)\mathbf{v}^{[k]}(\beta)}\right|
n1c+n1c+n1c.\displaystyle\leq n^{-1-c}+n^{-1-c^{\prime}}+n^{-1-c}.

Setting c′′:=min(c,c)c^{\prime\prime}:=\min(c,c^{\prime}), we have

maxα,β|𝐯(α)𝐯(β)𝐯[k](α)𝐯[k](β)|n1c′′.\max_{\alpha,\beta}\left|{\mathbf{v}(\alpha)\mathbf{v}(\beta)-\mathbf{v}^{[k]}(\alpha)\mathbf{v}^{[k]}(\beta)}\right|\lesssim n^{-1-c^{\prime\prime}}.

For any fixed ε>0\varepsilon>0, we consider the event

:={maxα,β|𝐯(α)𝐯(β)𝐯[k](α)𝐯[k](β)|n1c′′}{𝐯[k]n1/2+ε}.{\mathcal{E}}:=\left\{\max_{\alpha,\beta}\left|{\mathbf{v}(\alpha)\mathbf{v}(\beta)-\mathbf{v}^{[k]}(\alpha)\mathbf{v}^{[k]}(\beta)}\right|\lesssim n^{-1-c^{\prime\prime}}\right\}\cap\left\{\|\mathbf{v}^{[k]}\|_{\infty}\leq n^{-1/2+\varepsilon}\right\}.

Since delocalization of eigenvectors holds with overwhelming probability, we know that ()=1o(1)\mathbb{P}({\mathcal{E}})=1-o(1).

By the pigeonhole principle, there exists α\alpha such that |𝐯(α)|p1/2|\mathbf{v}(\alpha)|\geq p^{-1/2}. We fix the ±\pm phases of 𝐯\mathbf{v} and 𝐯[k]\mathbf{v}^{[k]} so that 𝐯(α)\mathbf{v}(\alpha) and 𝐯[k](α)\mathbf{v}^{[k]}(\alpha) are non-negative. On the event {\mathcal{E}}, we obtain

|𝐯(α)𝐯[k](α)|=|(𝐯(α))2(𝐯[k](α))2|𝐯(α)+𝐯[k](α)n1/2c′′.\left|{\mathbf{v}(\alpha)-\mathbf{v}^{[k]}(\alpha)}\right|=\frac{\left|{(\mathbf{v}(\alpha))^{2}-(\mathbf{v}^{[k]}(\alpha))^{2}}\right|}{\mathbf{v}(\alpha)+\mathbf{v}^{[k]}(\alpha)}\lesssim n^{-1/2-c^{\prime\prime}}.

Moreover, for any entries 𝐯(β)\mathbf{v}(\beta) and 𝐯[k](β)\mathbf{v}^{[k]}(\beta), on the event {\mathcal{E}}, the triangle inequality gives

|𝐯(β)𝐯[k](β)|\displaystyle\left|{\mathbf{v}(\beta)-\mathbf{v}^{[k]}(\beta)}\right| =|𝐯(α)𝐯(β)𝐯(α)𝐯[k](β)|𝐯(α)\displaystyle=\frac{\left|{\mathbf{v}(\alpha)\mathbf{v}(\beta)-\mathbf{v}(\alpha)\mathbf{v}^{[k]}(\beta)}\right|}{\mathbf{v}(\alpha)}
|𝐯(α)𝐯(β)𝐯[k](α)𝐯[k](β)|𝐯(α)+|𝐯[k](β)|𝐯(α)|𝐯(α)𝐯[k](α)|\displaystyle\leq\frac{\left|{\mathbf{v}(\alpha)\mathbf{v}(\beta)-\mathbf{v}^{[k]}(\alpha)\mathbf{v}^{[k]}(\beta)}\right|}{\mathbf{v}(\alpha)}+\frac{|\mathbf{v}^{[k]}(\beta)|}{\mathbf{v}(\alpha)}|\mathbf{v}(\alpha)-\mathbf{v}^{[k]}(\alpha)|
n1/2c′′+n1/2c′′+ε.\displaystyle\lesssim n^{-1/2-c^{\prime\prime}}+n^{-1/2-c^{\prime\prime}+\varepsilon}.

Choosing ε\varepsilon sufficiently small, this implies the desired result (4).
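In particular, although not needed for the remainder of the proof, we record for the reader's convenience the standard consequence that this entrywise bound makes the two sign-aligned principal components nearly colinear: since both vectors have unit norm and p\lesssim n,

\|\mathbf{v}-\mathbf{v}^{[k]}\|_{2}^{2}\leq p\max_{\beta}\left|{\mathbf{v}(\beta)-\mathbf{v}^{[k]}(\beta)}\right|^{2}\lesssim n^{-2c^{\prime\prime}+2\varepsilon},\qquad\mbox{and hence}\qquad\langle\mathbf{v},\mathbf{v}^{[k]}\rangle=1-\tfrac{1}{2}\|\mathbf{v}-\mathbf{v}^{[k]}\|_{2}^{2}\geq 1-O(n^{-2c^{\prime\prime}+2\varepsilon}).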

For 𝐮\mathbf{u} and 𝐮[k]\mathbf{u}^{[k]}, note that

|𝐮(i)𝐮(j)𝐮[k](i)𝐮[k](j)||𝐮(i)𝐮(j)ηIm𝐑ij(z0)z0|+|ηIm𝐑ij(z0)z0ηIm𝐑ij[k](z0)z0|+|ηIm𝐑ij[k](z0)z0𝐮[k](i)𝐮[k](j)|\left|{\mathbf{u}(i)\mathbf{u}(j)-\mathbf{u}^{[k]}(i)\mathbf{u}^{[k]}(j)}\right|\\ \leq\left|{\mathbf{u}(i)\mathbf{u}(j)-\eta\text{\rm Im}{\,}\frac{\mathbf{R}_{ij}(z_{0})}{z_{0}}}\right|+\left|{\eta\text{\rm Im}{\,}\frac{\mathbf{R}_{ij}(z_{0})}{z_{0}}-\eta\text{\rm Im}{\,}\frac{\mathbf{R}^{[k]}_{ij}(z_{0})}{z_{0}}}\right|+\left|{\eta\text{\rm Im}{\,}\frac{\mathbf{R}^{[k]}_{ij}(z_{0})}{z_{0}}-\mathbf{u}^{[k]}(i)\mathbf{u}^{[k]}(j)}\right|

By Lemma D.3, we have

|Im𝐑ij(z0)z0Im𝐑ij[k](z0)z0||𝐑ij(z0)𝐑ij[k](z0)z0||𝐑ij(z0)𝐑ij[k](z0)|1n1+cη.\left|{\text{\rm Im}{\,}\frac{\mathbf{R}_{ij}(z_{0})}{z_{0}}-\text{\rm Im}{\,}\frac{\mathbf{R}^{[k]}_{ij}(z_{0})}{z_{0}}}\right|\leq\left|{\frac{\mathbf{R}_{ij}(z_{0})-\mathbf{R}^{[k]}_{ij}(z_{0})}{z_{0}}}\right|\lesssim\left|{\mathbf{R}_{ij}(z_{0})-\mathbf{R}^{[k]}_{ij}(z_{0})}\right|\leq\frac{1}{n^{1+c}\eta}.

As a consequence, we have

|𝐮(i)𝐮(j)𝐮[k](i)𝐮[k](j)|n1c′′.\left|{\mathbf{u}(i)\mathbf{u}(j)-\mathbf{u}^{[k]}(i)\mathbf{u}^{[k]}(j)}\right|\lesssim n^{-1-c^{\prime\prime}}.

The desired result for 𝐮\mathbf{u} and 𝐮[k]\mathbf{u}^{[k]} then follows from the same arguments above for 𝐯\mathbf{v} and 𝐯[k]\mathbf{v}^{[k]}. ∎
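As a purely illustrative complement to Theorem 2, and not part of the proof, the following minimal Python sketch resamples kk randomly chosen entries of an n×pn\times p Gaussian data matrix, with kk far below n5/3n^{5/3}, and compares the principal components before and after resampling. The dimensions, the value of kk, and all variable and function names are illustrative choices rather than prescriptions of the paper.

import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 1000              # proportional regime: p/n is a fixed ratio in (0, 1]
k = int(n ** 1.2)              # number of resampled entries, far below n^{5/3}

X = rng.standard_normal((n, p))      # data matrix with i.i.d. standard Gaussian entries

# Resample k randomly chosen entries; repeated positions only weaken the perturbation.
X_k = X.copy()
rows = rng.integers(0, n, size=k)
cols = rng.integers(0, p, size=k)
X_k[rows, cols] = rng.standard_normal(k)

def principal_component(A):
    # Leading right singular vector of A, i.e. the top eigenvector of A^T A.
    _, _, vt = np.linalg.svd(A, full_matrices=False)
    return vt[0]

v = principal_component(X)
v_k = principal_component(X_k)

# In the stability regime k << n^{5/3}, Theorem 2 predicts that the overlap
# (up to the sign ambiguity of eigenvectors) is close to 1 for large n.
print(f"k = {k}, overlap |<v, v^[k]>| = {abs(v @ v_k):.4f}")

Decreasing kk should push the printed overlap toward 1, while taking kk of order n5/3n^{5/3} or larger should make it drop, in line with the sensitivity/stability transition.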

Acknowledgment

The author thanks Yihong Wu and Paul Bourgade for helpful discussions at an early stage of this project, and Yiyun He for helpful discussions on differential privacy.

References

  • [AEK14] Oskari Ajanki, László Erdős, and Torben Krüger. Local semicircle law with imprimitive variance matrix. Electronic Communications in Probability, 19:1–9, 2014.
  • [AGZ10] Greg W Anderson, Alice Guionnet, and Ofer Zeitouni. An introduction to random matrices. Number 118. Cambridge University Press, 2010.
  • [ASX17] Yacine Ait-Sahalia and Dacheng Xiu. Using principal component analysis to estimate a high dimensional factor model with high-frequency data. Journal of Econometrics, 201(2):384–399, 2017.
  • [BBAP05] Jinho Baik, Gérard Ben Arous, and Sandrine Péché. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Annals of Probability, pages 1643–1697, 2005.
  • [BCRT21] Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. Predictive inference with the jackknife+. The Annals of Statistics, 49(1):486–507, 2021.
  • [BDMN05] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: the SuLQ framework. In Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 128–138, 2005.
  • [BE02] Olivier Bousquet and André Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499–526, 2002.
  • [BEK+14] Alex Bloemendal, László Erdős, Antti Knowles, Horng-Tzer Yau, and Jun Yin. Isotropic local laws for sample covariance and generalized Wigner matrices. Electron. J. Probab., 19:no. 33, 53, 2014.
  • [BGN11] Florent Benaych-Georges and Raj Rao Nadakuditi. The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Advances in Mathematics, 227(1):494–521, 2011.
  • [BKS99] Itai Benjamini, Gil Kalai, and Oded Schramm. Noise sensitivity of Boolean functions and applications to percolation. Inst. Hautes Études Sci. Publ. Math., (90):5–43 (2001), 1999.
  • [BKYY16] Alex Bloemendal, Antti Knowles, Horng-Tzer Yau, and Jun Yin. On the principal components of sample covariance matrices. Probab. Theory Related Fields, 164(1-2):459–552, 2016.
  • [BL22] Charles Bordenave and Jaehun Lee. Noise sensitivity for the top eigenvector of a sparse random matrix. Electronic Journal of Probability, 27:1–50, 2022.
  • [BLZ20] Charles Bordenave, Gábor Lugosi, and Nikita Zhivotovskiy. Noise sensitivity of the top eigenvector of a Wigner matrix. Probability Theory and Related Fields, 177(3):1103–1135, 2020.
  • [BPZ15] Zhigang Bao, Guangming Pan, and Wang Zhou. Universality for the largest eigenvalue of sample covariance matrices with general population. Ann. Statist., 43(1):382–421, 2015.
  • [BS06] Jinho Baik and Jack W Silverstein. Eigenvalues of large sample covariance matrices of spiked population models. Journal of multivariate analysis, 97(6):1382–1408, 2006.
  • [CBG16] Romain Couillet and Florent Benaych-Georges. Kernel spectral clustering of large dimensional data. Electronic Journal of Statistics, 10(1):1393–1454, 2016.
  • [Cha05] Sourav Chatterjee. Concentration inequalities with exchangeable pairs. page 105, 2005. Thesis (Ph.D.)–Stanford University.
  • [Cha07] Sourav Chatterjee. Stein’s method for concentration inequalities. Probab. Theory Related Fields, 138(1-2):305–321, 2007.
  • [Cha14] Sourav Chatterjee. Superconcentration and related topics. Springer Monographs in Mathematics. Springer, Cham, 2014.
  • [CLMW11] Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):1–37, 2011.
  • [CMW13] T Tony Cai, Zongming Ma, and Yihong Wu. Sparse PCA: Optimal rates and adaptive estimation. The Annals of Statistics, 41(6):3074, 2013.
  • [CS13] Xiuyuan Cheng and Amit Singer. The spectrum of random inner-product kernel matrices. Random Matrices: Theory and Applications, 2(04):1350010, 2013.
  • [CSS13] Kamalika Chaudhuri, Anand D Sarwate, and Kaushik Sinha. A near-optimal algorithm for differentially-private principal components. Journal of Machine Learning Research, 14, 2013.
  • [DCK19] Osman E Dai, Daniel Cullina, and Negar Kiyavash. Database alignment with Gaussian features. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3225–3233. PMLR, 2019.
  • [DHS21] Zhun Deng, Hangfeng He, and Weijie Su. Toward better generalization bounds with locally elastic stability. In International Conference on Machine Learning, pages 2590–2600. PMLR, 2021.
  • [DR14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
  • [DT11] Momar Dieng and Craig A Tracy. Application of random matrix theory to multivariate statistics. In Random Matrices, Random Processes and Integrable Systems, pages 443–507. Springer, 2011.
  • [DY18] Xiucai Ding and Fan Yang. A necessary and sufficient condition for edge universality at the largest singular values of covariance matrices. The Annals of Applied Probability, 28(3):1679–1738, 2018.
  • [EEPK05] Andre Elisseeff, Theodoros Evgeniou, Massimiliano Pontil, and Leslie Pack Kaelbling. Stability of randomized learning algorithms. Journal of Machine Learning Research, 6(1), 2005.
  • [EK10] Noureddine El Karoui. The spectrum of kernel random matrices. Ann. Statist., 38(1):1–50, 2010.
  • [FMWX20] Zhou Fan, Cheng Mao, Yihong Wu, and Jiaming Xu. Spectral graph matching and regularized quadratic relaxations: Algorithm and theory. In International conference on machine learning, pages 2985–2995. PMLR, 2020.
  • [FMWX22] Zhou Fan, Cheng Mao, Yihong Wu, and Jiaming Xu. Spectral graph matching and regularized quadratic relaxations II. Foundations of Computational Mathematics, pages 1–51, 2022.
  • [FWZ18] Jianqing Fan, Weichen Wang, and Yiqiao Zhong. An \ell_{\infty} eigenvector perturbation bound and its application to robust covariance estimation. Journal of Machine Learning Research, 18(207):1–42, 2018.
  • [GLM22] Luca Ganassali, Marc Lelarge, and Laurent Massoulié. Spectral alignment of correlated Gaussian matrices. Advances in Applied Probability, 54(1):279–310, 2022.
  • [GS14] Christophe Garban and Jeffrey E Steif. Noise sensitivity of Boolean functions and percolation, volume 5. Cambridge University Press, 2014.
  • [HLS19] Jong Yun Hwang, Ji Oon Lee, and Kevin Schnelli. Local law and Tracy–Widom limit for sparse sample covariance matrices. The Annals of Applied Probability, 29(5):3006–3036, 2019.
  • [HPY18] Xiao Han, Guangming Pan, and Qing Yang. A unified matrix model including both CCA and F matrices in multivariate analysis: The largest eigenvalue and its applications. Bernoulli, 24(4B):3447–3468, 2018.
  • [HPZ16] Xiao Han, Guangming Pan, and Bo Zhang. The Tracy–Widom law for the largest eigenvalue of F type matrices. The Annals of Statistics, 44(4):1564–1592, 2016.
  • [HRS16] Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International conference on machine learning, pages 1225–1234. PMLR, 2016.
  • [JN17] Iain M Johnstone and Boaz Nadler. Roy’s largest root test under rank-one alternatives. Biometrika, 104(1):181–193, 2017.
  • [Joh07] Iain M. Johnstone. High dimensional statistical inference and random matrices. In International Congress of Mathematicians. Vol. I, pages 307–333. Eur. Math. Soc., Zürich, 2007.
  • [KB21] Byol Kim and Rina Foygel Barber. Black box tests for algorithmic stability. arXiv preprint arXiv:2111.15546, 2021.
  • [KN02] Samuel Kutin and Partha Niyogi. Almost-everywhere algorithmic stability and generalization error. In Proceedings of the Eighteenth conference on Uncertainty in artificial intelligence, pages 275–282, 2002.
  • [LR10] Michel Ledoux and Brian Rider. Small deviations for beta ensembles. Electronic Journal of Probability, 15:1319–1343, 2010.
  • [LS16] Ji Oon Lee and Kevin Schnelli. Tracy-Widom distribution for the largest eigenvalue of real sample covariance matrices with general population. Ann. Appl. Probab., 26(6):3786–3839, 2016.
  • [MNPR06] Sayan Mukherjee, Partha Niyogi, Tomaso Poggio, and Ryan Rifkin. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 25(1):161–193, 2006.
  • [MWA05] Baback Moghaddam, Yair Weiss, and Shai Avidan. Spectral bounds for sparse PCA: Exact and greedy algorithms. Advances in neural information processing systems, 18, 2005.
  • [Nad08] Boaz Nadler. Finite sample approximation results for principal component analysis: A matrix perturbation approach. The Annals of Statistics, pages 2791–2817, 2008.
  • [NJW01] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems, 14, 2001.
  • [Pau07] Debashis Paul. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, pages 1617–1642, 2007.
  • [PY14] Natesh S. Pillai and Jun Yin. Universality of covariance matrices. Ann. Appl. Probab., 24(3):935–1001, 2014.
  • [Rin08] Markus Ringnér. What is principal component analysis? Nature biotechnology, 26(3):303–304, 2008.
  • [SX21] Kevin Schnelli and Yuanyuan Xu. Convergence rate to the Tracy–Widom laws for the largest eigenvalue of sample covariance matrices. arXiv preprint arXiv:2108.02728, 2021.
  • [Tro12] Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389–434, 2012.
  • [TV12] Terence Tao and Van Vu. Random covariance matrices: universality of local statistics of eigenvalues. Ann. Probab., 40(3):1285–1315, 2012.
  • [VK06] Seema Vyas and Lilani Kumaranayake. Constructing socio-economic status indices: how to use principal components analysis. Health policy and planning, 21(6):459–468, 2006.
  • [VL07] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17:395–416, 2007.
  • [Wan12] Ke Wang. Random covariance matrices: Universality of local statistics of eigenvalues up to the edge. Random Matrices: Theory and Applications, 1(01):1150005, 2012.
  • [Wan19] Haoyu Wang. Quantitative universality for the largest eigenvalue of sample covariance matrices. arXiv preprint arXiv:1912.05473, 2019.
  • [Wan22] Haoyu Wang. Optimal smoothed analysis and quantitative universality for the smallest singular value of random matrices. arXiv preprint arXiv:2211.03975, 2022.
  • [WWXY22] Haoyu Wang, Yihong Wu, Jiaming Xu, and Israel Yolou. Random graph matching in geometric models: the case of complete graphs. In Conference on Learning Theory, pages 3441–3488. PMLR, 2022.
  • [WXS22] Yihong Wu, Jiaming Xu, and Sophie H. Yu. Settling the sharp reconstruction thresholds of random graph matching. IEEE Transactions on Information Theory, 2022.
  • [Yan22] Fan Yang. Sample canonical correlation coefficients of high-dimensional random vectors: Local law and Tracy–Widom limit. Random Matrices: Theory and Applications, 11(01):2250007, 2022.
  • [ZHT06] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of computational and graphical statistics, 15(2):265–286, 2006.