
Generalization error of min-norm interpolators in transfer learning

Yanke Song (Department of Statistics, Harvard University), Sohom Bhattacharya (Department of Statistics, University of Florida), Pragya Sur (Department of Statistics, Harvard University)
Abstract

This paper establishes the generalization error of pooled min-\ell_{2}-norm interpolation in transfer learning where data from diverse distributions are available. Min-norm interpolators emerge naturally as implicit regularized limits of modern machine learning algorithms. Previous work characterized their out-of-distribution risk when samples from the test distribution are unavailable during training. However, in many applications, a limited amount of test data may be available during training, yet properties of min-norm interpolation in this setting are not well understood. We address this gap by characterizing the bias and variance of pooled min-\ell_{2}-norm interpolation under covariate and model shifts. The pooled interpolator captures both early fusion and a form of intermediate fusion. Our results have several implications: under model shift, for low signal-to-noise ratio (SNR), adding data always hurts. For higher SNR, transfer learning helps as long as the shift-to-signal ratio (SSR) lies below a threshold that we characterize explicitly. By consistently estimating these ratios, we provide a data-driven method to determine: (i) when the pooled interpolator outperforms the target-based interpolator, and (ii) the optimal number of target samples that minimizes the generalization error. Under covariate shift, if the source sample size is small relative to the dimension, heterogeneity between domains improves the risk, and vice versa. We establish a novel anisotropic local law to achieve these characterizations, which may be of independent interest in random matrix theory. We supplement our theoretical characterizations with comprehensive simulations that demonstrate the finite-sample efficacy of our results.

† Co-senior authors.
‡ Emails: [email protected]; [email protected]; [email protected]

1 Introduction

Recent advancements in deep learning methodology have uncovered a surprising phenomenon that defies conventional statistical wisdom: overfitting can yield remarkably effective generalization [5, 9, 10, 11]. In the overparametrized regime, complex models that achieve zero-training error, i.e. models that interpolate the training data, have gained popularity [25, 29, 37]. However, as researchers grapple with increasingly large and heterogeneous datasets, the imperative to effectively integrate diverse sources of information has become crucial. Transfer learning [53] has emerged as a promising approach to address this challenge, allowing one to leverage knowledge from related datasets to boost performance on target problems. Yet, in the context of overparametrization, a crucial question arises: how should one leverage transfer learning in the presence of overfitting? Specifically, should diverse datasets be aggregated to construct a unified interpolator, or should they be handled individually? Intuition suggests that for closely related datasets, a pooled interpolator may yield superior performance compared to those derived from individual datasets. Conversely, for unrelated datasets, the opposite might be true. This paper investigates the interplay between interpolation risk and task relatedness, focusing particularly on overparametrized linear models and min-norm interpolation.

To achieve this, we explore two common ways datasets can differ: covariate shift, where the covariate distribution varies across datasets, and model shift, where the conditional distribution of the outcome given the covariates differs across datasets. Covariate shift has been extensively studied through concepts such as the likelihood ratio [40] and the transfer exponent [28, 34, 47], among others. Similarly, a substantial body of literature has studied model shift problems [6, 35, 36, 41, 52, 47].

However, the aforementioned literature has primarily considered statistical approaches where explicit penalties are added to loss functions to aid learning in high dimensions, or non-parametric approaches. In modern machine learning, prediction rules and classifiers based on implicit regularization have become extremely popular. Implicit regularization is the phenomenon by which modern ML algorithms, under suitable initialization and step sizes, converge to special prediction rules that generalize well under overparametrization. In this context, min-norm interpolators have emerged as an important class of predictors that arise commonly as implicit regularized limits of these algorithms [5, 18, 26, 27, 38, 44, 45, 51, 58, 56, 60]. However, properties of these interpolators under transfer learning are less well understood.

A recent work [42] identified conditions for benign overfitting, i.e., data distribution choices under which min-norm interpolators are consistent under covariate shift. Another related work [47] characterized the prediction risk of ridge regression and min-\ell_{2}-norm interpolators, while [54] provided an asymptotic analysis of high-dimensional random feature regression under covariate shift. However, these works consider an out-of-distribution (OOD) setting, where no observation from the test distribution is utilized during training. Transfer learning is often required in settings where, alongside enormous source data, limited test data is indeed available, cf. [59]. Data from the test distribution can enhance transfer performance, but min-norm interpolators in this context are not well understood. This paper addresses this crucial gap in the literature and characterizes the precise prediction performance of a pooled min-\ell_{2}-norm interpolator computed using both source and test data under overparametrization. Our contributions are summarized as follows:

  1. (i)

    We consider a min-\ell_{2}-norm interpolator constructed by combining the source and target data (henceforth referred to as the pooled min-\ell_{2}-norm interpolator). We show this estimator can be interpreted both as an early and an intermediate fusion estimator (see (2.10) for details). We characterize the finite-sample bias and variance of the pooled min-\ell_{2}-norm interpolator in the overparametrized regime under both model shift and covariate shift. Our results show that the generalization error of this interpolator exhibits the well-known double-descent phenomenon observed in the machine learning literature [9, 10].

  2. (ii)

    In the presence of model shift, we introduce a data-driven approach to estimate the threshold that governs when the pooled interpolator outperforms the target-based interpolator, and to obtain the optimal sample size from the target distribution that minimizes the risk of the pooled min-\ell_{2}-norm interpolator. Both the threshold and the optimal sample size are functions of the signal-to-noise ratio (SNR) and the shift-to-signal ratio (SSR), a measure of similarity between the datasets defined in Section 3.3. We provide consistent estimators for both the SNR and the SSR.

  3. (iii)

    Our findings on covariate shift reveal a surprising dichotomy under stylized settings: increased degrees of covariate shift can in some cases enhance the performance of the pooled min-\ell_{2}-norm interpolator (while the opposite happens in other regimes). This dichotomy is entirely determined by the dimensionalities of the observed data (see Proposition 4.2). We hypothesize that this phenomenon extends to more general settings.

  4. (iv)

    On the technical front, a significant mathematical contribution of our work is the derivation of an anisotropic law applicable to a broad class of random matrices (see Theorem D.1). This complements previous anisotropic local laws [33, 57] and may be of independent interest both in random matrix theory as well as in the study of other high-dimensional transfer learning problems, particularly those involving multiple covariance matrices.

  5. (v)

    We interpret the pooled min-\ell_{2}-norm interpolator as a ridgeless limit of a pooled ridge estimator (see (2.9)). As part of our results, we characterize the exact bias and variance of the pooled ridge estimator under the aforementioned settings. This contributes independently to the expanding body of work on the generalization error of high-dimensional ridge regression [16, 29]. Our work complements the out-of-distribution error characterized by recent works [42, 47], where target data is not used in constructing the estimator.

  6. (vi)

    Finally, we complement our theoretical findings with extensive simulations that demonstrate the finite-sample efficacy of our theory. Our results exhibit remarkable finite-sample performance under both model shift (Figure 1) and covariate shift (Figure 2), as well as across varying shift levels, signal-to-noise ratios, and dimensionality ratios.

The remainder of the paper is structured as follows: In Section 2, we lay out our framework and model assumptions. Sections 3 and 4 provide detailed analyses of the generalization errors under model shift and covariate shift, respectively. A proof outline is provided in Section 5, followed by discussions and future directions in Section 6. Auxiliary results concerning pooled ridge regression are included in Appendix A, with detailed proofs of our main results deferred to the remaining appendices.

2 Setup

In this section, we present our formal mathematical setup.

2.1 Data Model

We consider the canonical transfer learning setting where random samples are observed from two different populations. Formally, we observe two datasets (\bm{y}^{(1)},\bm{X}^{(1)}) and (\bm{y}^{(2)},\bm{X}^{(2)})—one serves as the source data and the other as the target. Typically, the former has a larger sample size than the latter. (Some of our results extend to the general case of multiple source datasets; see Section 3.4.) We assume that the observed data come from linear models given by

\bm{y}^{(k)}=\bm{X}^{(k)}\bm{\beta}^{(k)}+\bm{\epsilon}^{(k)},\qquad k=1,2, (2.1)

where k=1 and k=2 correspond to the source and the target, respectively. Here, \mathbf{y}^{(k)}\in\mathbb{R}^{n_{k}} denotes the response vector, with n_{k} representing the number of samples observed in population k, \mathbf{X}^{(k)}\in\mathbb{R}^{n_{k}\times p} denotes the observed covariate matrix from population k with p the number of features, \bm{\beta}^{(k)}\in\mathbb{R}^{p} are signal vectors independent of \bm{X}^{(k)}, and \bm{\epsilon}^{(k)}\in\mathbb{R}^{n_{k}} are noise vectors independent of both \mathbf{X}^{(k)} and \bm{\beta}^{(k)}. We denote by n=n_{1}+n_{2} the total number of samples.

Throughout the paper, we make the following assumptions on 𝑿(k)\bm{X}^{(k)} and ϵ(k)\bm{\epsilon}^{(k)} which are standard in the random matrix literature [55, 2]. First, for the covariates 𝑿(k)\bm{X}^{(k)}, we assume

\bm{X}^{(k)}=\bm{Z}^{(k)}(\bm{\Sigma}^{(k)})^{1/2}, (2.2)

where \bm{Z}^{(k)}\in\mathbb{R}^{n_{k}\times p} are random matrices whose entries Z^{(k)}_{ij} are independent random variables with zero mean and unit variance. We further assume that there exists a constant \tau>0 such that the \varphi-th moment of each entry is bounded for some \varphi>4:

\mathbb{E}\left[\left|Z_{ij}^{(k)}\right|^{\varphi}\right]\leq\tau^{-1}. (2.3)

The eigenvalues of \bm{\Sigma}^{(k)}, denoted \lambda^{(k)}_{1},\ldots,\lambda^{(k)}_{p}, are bounded:

\tau\leq\lambda_{p}^{(k)}\leq\cdots\leq\lambda_{1}^{(k)}\leq\tau^{-1}. (2.4)

Second, the noise vectors \bm{\epsilon}^{(k)} are assumed to have i.i.d. entries with mean 0, variance \sigma^{2}, and bounded moments of any order. That is, for any \phi>0, there exists a constant C_{\phi} such that

\mathbb{E}\left[\left|\epsilon_{i}^{(k)}\right|^{\phi}\right]\leq C_{\phi}. (2.5)

Lastly, our analysis pertains to the regime where the sample size is smaller than the feature dimension, with the ratio between the two remaining bounded. Denoting \gamma_{1}:=p/n_{1},\gamma_{2}:=p/n_{2},\gamma:=p/n, we assume

1+\tau\leq\gamma_{1},\gamma_{2},\gamma\leq\tau^{-1}. (2.6)

As displayed, we are particularly interested in the overparameterized regime where p>n, partly because it is more prevalent in modern machine learning applications. Moreover, the underparameterized regime, i.e., p<n, is better understood in the literature, cf. [57] (they assumed n_{2}>p, but their results remain valid for the general case n>p).

For convenience we summarize all our assumptions below:

Assumption 1.

Let τ>0\tau>0 be a small constant. 𝐗(1),𝐗(2),ϵ(1),ϵ(2)\bm{X}^{(1)},\bm{X}^{(2)},\bm{\epsilon}^{(1)},\bm{\epsilon}^{(2)} are mutually independent. Moreover, for k=1,2k=1,2, the following holds:

  1. (i)

    \bm{X}^{(k)} satisfies (2.2), where \bm{Z}^{(k)} are random matrices having i.i.d. entries with zero mean, unit variance, and bounded moments as in (2.3), and \bm{\Sigma}^{(k)} are deterministic positive definite matrices satisfying (2.4).

  2. (ii)

    \bm{\epsilon}^{(k)} are random vectors having i.i.d. entries with zero mean, variance \sigma^{2}, and bounded moments as in (2.5).

  3. (iii)

    The dimension ratios \gamma_{1}:=p/n_{1},\gamma_{2}:=p/n_{2},\gamma:=p/n satisfy (2.6).

We remark that we do not require any additional assumptions on the signals \bm{\beta}^{(k)}, except for their independence of \bm{X}^{(k)}. In fact, our results later reveal that the convergence of the risks of interest holds conditionally on \bm{\beta}^{(k)}, and the limits depend on \bm{\beta}^{(k)} only through their norms.

2.2 Estimator

Data fusion, the integration of information from diverse datasets, has been broadly categorized into three approaches in the literature: early, intermediate, and late fusion [3, 49]. Early fusion combines information from multiple sources at the input level. Late fusion learns independently from each dataset and combines information at the end. Intermediate fusion achieves a middle ground and attempts to jointly learn information from both datasets in a more sophisticated manner than simply concatenating them. In this work, we consider a form of min-norm interpolation that captures both early fusion and a class of intermediate fusion strategies. To achieve early fusion using min-norm interpolation, one would naturally consider the pooled min-\ell_{2}-norm interpolator, defined as

\hat{\bm{\beta}}=\operatorname*{arg\,min}\left\{\|\bm{b}\|_{2}\ \ \text{s.t.}\ \bm{y}^{(k)}=\bm{X}^{(k)}\bm{b},\ k=1,2\right\}. (2.7)

This interpolator admits the following analytical expression:

\hat{\bm{\beta}}=\left(\bm{X}^{(1)\top}\bm{X}^{(1)}+\bm{X}^{(2)\top}\bm{X}^{(2)}\right)^{\dagger}\left(\bm{X}^{(1)\top}\bm{y}^{(1)}+\bm{X}^{(2)\top}\bm{y}^{(2)}\right), (2.8)

where for any matrix \bm{A}, we denote its pseudoinverse by \bm{A}^{\dagger}. This interpolator is alternatively called the “ridgeless” estimator, since it is the limit of the ridge estimator as the penalty vanishes (as noticed earlier by [29] in the context of a single training dataset):

\hat{\bm{\beta}} =\lim_{\lambda\rightarrow 0}\left(\bm{X}^{(1)\top}\bm{X}^{(1)}+\bm{X}^{(2)\top}\bm{X}^{(2)}+n\lambda\bm{I}\right)^{-1}\left(\bm{X}^{(1)\top}\bm{y}^{(1)}+\bm{X}^{(2)\top}\bm{y}^{(2)}\right) (2.9)
=\lim_{\lambda\rightarrow 0}\operatorname*{arg\,min}_{\bm{b}}\left\{\frac{1}{n}\|\bm{y}^{(1)}-\bm{X}^{(1)}\bm{b}\|_{2}^{2}+\frac{1}{n}\|\bm{y}^{(2)}-\bm{X}^{(2)}\bm{b}\|_{2}^{2}+\lambda\|\bm{b}\|_{2}^{2}\right\}.

Coincidentally, although the pooled min-\ell_{2}-norm interpolator is motivated via early fusion, it also captures a form of intermediate fusion, as described below. If we consider a weighted ridge loss with weights w_{1},w_{2} on the source and target data respectively, then the ridgeless limit of this loss recovers the pooled min-\ell_{2}-norm interpolator. This can be seen as follows: for any two weighting coefficients w_{1},w_{2}>0,

𝜷^\displaystyle\hat{\bm{\beta}} =argmin{𝒃2s.t.wk𝒚(k)=wk𝑿(k)𝒃,k=1,2}\displaystyle=\operatorname*{arg\,min}\left\{\|\bm{b}\|_{2}\ \ \text{s.t.}\ w_{k}\bm{y}^{(k)}=w_{k}\bm{X}^{(k)}\bm{b},\ k=1,2\right\} (2.10)
=(w1𝑿(1)𝑿(1)+w2𝑿(2)𝑿(2))(w1𝑿(1)𝒚(1)+w2𝑿(2)𝒚(2))\displaystyle=\left(w_{1}\bm{X}^{(1)\top}\bm{X}^{(1)}+w_{2}\bm{X}^{(2)\top}\bm{X}^{(2)}\right)^{\dagger}\left(w_{1}\bm{X}^{(1)\top}\bm{y}^{(1)}+w_{2}\bm{X}^{(2)\top}\bm{y}^{(2)}\right)
=limλ0(w1𝑿(1)𝑿(1)+w2𝑿(2)𝑿(2)+nλ𝑰)1(w1𝑿(1)𝒚(1)+w2𝑿(2)𝒚(2))\displaystyle=\lim_{\lambda\rightarrow 0}\left(w_{1}\bm{X}^{(1)\top}\bm{X}^{(1)}+w_{2}\bm{X}^{(2)\top}\bm{X}^{(2)}+n\lambda\bm{I}\right)^{-1}\left(w_{1}\bm{X}^{(1)\top}\bm{y}^{(1)}+w_{2}\bm{X}^{(2)\top}\bm{y}^{(2)}\right)
=limλ0argmin𝒃{w1n𝒚(1)𝑿(1)𝒃22+w2n𝒚(2)𝑿(2)𝒃22+λ𝒃22},\displaystyle=\lim_{\lambda\rightarrow 0}\operatorname*{arg\,min}_{\bm{b}}\left\{\frac{w_{1}}{n}\|\bm{y}^{(1)}-\bm{X}^{(1)}\bm{b}\|_{2}^{2}+\frac{w_{2}}{n}\|\bm{y}^{(2)}-\bm{X}^{(2)}\bm{b}\|_{2}^{2}+\lambda\|\bm{b}\|_{2}^{2}\right\},

indicating that the interpolator above is the unique common solution of the weighted min-\ell_{2}-norm interpolation problem and the weighted ridgeless estimator for any weights w_{1},w_{2} in the intermediate fusion loss formulation.

Furthermore, the pooled min-2\ell_{2}-norm interpolator (2.7) appears naturally as the limit of the gradient descent based linear estimator in the following way: Initialize 𝜷0=0\bm{\beta}_{0}=0 and run gradient descent with iterates given by

𝜷t+1=𝜷t+ηk=12𝑿(k)(𝒚(k)𝑿(k)𝜷t),t=0,1,,\bm{\beta}_{t+1}=\bm{\beta}_{t}+\eta\sum_{k=1}^{2}\bm{X}^{(k)\top}(\bm{y}^{(k)}-\bm{X}^{(k)}\bm{\beta}_{t}),\qquad t=0,1,\ldots,

with 0<\eta\leq 1/\lambda_{\max}\left(\bm{X}^{(1)\top}\bm{X}^{(1)}+\bm{X}^{(2)\top}\bm{X}^{(2)}\right); then \lim_{t\rightarrow\infty}\bm{\beta}_{t}=\hat{\bm{\beta}}.
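As an illustration, the following minimal numpy sketch (with arbitrary synthetic dimensions and a step size chosen below the bound above; none of these choices come from the paper) checks that the gradient descent iterates converge to the closed-form pooled interpolator (2.8):

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, p = 40, 20, 100                      # overparametrized: p > n1 + n2
X1, X2 = rng.standard_normal((n1, p)), rng.standard_normal((n2, p))
beta1 = rng.standard_normal(p) / np.sqrt(p)
beta2 = beta1 + 0.1 * rng.standard_normal(p) / np.sqrt(p)
y1 = X1 @ beta1 + 0.5 * rng.standard_normal(n1)
y2 = X2 @ beta2 + 0.5 * rng.standard_normal(n2)

# Pooled min-l2-norm interpolator via the pseudoinverse formula (2.8).
G = X1.T @ X1 + X2.T @ X2
b = X1.T @ y1 + X2.T @ y2
beta_pinv = np.linalg.pinv(G) @ b

# Gradient descent from zero, step size below 1/lambda_max of the pooled Gram matrix.
eta = 0.9 / np.linalg.eigvalsh(G).max()
beta_t = np.zeros(p)
for _ in range(20000):
    beta_t += eta * (X1.T @ (y1 - X1 @ beta_t) + X2.T @ (y2 - X2 @ beta_t))

print(np.linalg.norm(beta_t - beta_pinv))                           # -> close to 0
print(np.allclose(X1 @ beta_t, y1), np.allclose(X2 @ beta_t, y2))   # interpolation holds
```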

In recent years, min-norm interpolated solutions and their statistical properties have been extensively studied in numerous contexts [5, 8, 10, 11, 15, 37, 39, 38]. In fact, it is conjectured that this min-norm regularization is responsible for the superior statistical behavior of estimators in the overparametrized regime. In light of these results and the formulations above, we view our work as a first step toward understanding the behavior of this important family of estimators in transfer learning, through the lens of early and intermediate fusion.

We conclude this section by highlighting that computing such an interpolator requires only the summary statistics \mathbf{X}^{(k)\top}\mathbf{X}^{(k)} and \mathbf{X}^{(k)\top}\mathbf{y}^{(k)} for k=1,2, rather than the entire datasets. This characteristic is particularly beneficial in fields such as medical or genomic research, where individual-level data may be highly sensitive and not permissible to share across institutions.
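For concreteness, a minimal sketch of such a summary-statistics-based computation is given below; the function name and interface are our own illustrative choices, assuming each institution can share its Gram matrix and cross-product vector:

```python
import numpy as np

def pooled_interpolator(summaries):
    """Pooled min-l2-norm interpolator computed from per-site summaries.

    `summaries` is a list of (X.T @ X, X.T @ y) pairs, one per institution;
    no individual-level data is required.
    """
    G = sum(g for g, _ in summaries)   # aggregate Gram matrices
    b = sum(v for _, v in summaries)   # aggregate cross-products
    return np.linalg.pinv(G) @ b       # formula (2.8)

# Hypothetical usage: each site k computes its own summaries locally, e.g.
# beta_hat = pooled_interpolator([(X1.T @ X1, X1.T @ y1), (X2.T @ X2, X2.T @ y2)])
```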

2.3 Risk

To assess the performance of 𝜷^\hat{\bm{\beta}}, we focus on its prediction risk at a new (unseen) test point (𝒚0,𝒙0)(\bm{y}_{0},\bm{x}_{0}) drawn from the target distribution, i.e., k=2k=2. This out-of-sample prediction risk (hereafter simply referred to as risk) is defined as

R(𝜷^;𝜷(2)):=𝔼[(𝒙0𝜷^𝒙0𝜷(2))2|𝑿]=𝔼[𝜷^𝜷(2)𝚺(2)2|𝑿],R(\hat{\bm{\beta}};\bm{\beta}^{(2)}):=\mathbb{E}\left[(\bm{x}^{\top}_{0}\hat{\bm{\beta}}-\bm{x}^{\top}_{0}\bm{\beta}^{(2)})^{2}\hskip 2.15277pt|\hskip 2.15277pt\bm{X}\right]=\mathbb{E}\left[\|\hat{\bm{\beta}}-\bm{\beta}^{(2)}\|_{\bm{\Sigma}^{(2)}}^{2}\hskip 2.15277pt|\hskip 2.15277pt\bm{X}\right], (2.11)

where 𝒙𝚺2:=𝒙𝚺𝒙\|\bm{x}\|_{\bm{\Sigma}}^{2}:=\bm{x}^{\top}\bm{\Sigma}\bm{x} and 𝑿:=(𝑿(1),𝑿(2))\bm{X}:=(\bm{X}^{(1)\top},\bm{X}^{(2)\top})^{\top}. Note that this risk has a σ2\sigma^{2} difference from the mean-squared prediction error for the new data point, which does not affect the relative performance and is therefore omitted. Also, this risk definition takes expectation over both the randomness in the new test point (𝒚0,𝒙0)(\bm{y}_{0},\bm{x}_{0}) and the randomness in the training noises (ϵ(1),ϵ(2))(\bm{\epsilon}^{(1)},\bm{\epsilon}^{(2)}). The risk is conditional on the covariate 𝑿\bm{X}. Despite this dependence, we will use the notation R(𝜷^,𝜷(2))R(\hat{\bm{\beta}},\bm{\beta}^{(2)}) for conciseness since the context is clear. The risk admits a bias-variance decomposition:

R(𝜷^;𝜷(2))=𝔼(𝜷^|𝑿)𝜷(2)𝚺(2)2B(𝜷^;𝜷(2))+Tr[Cov(𝜷^|𝑿)𝚺(2)]V(𝜷^;𝜷(2)).R(\hat{\bm{\beta}};\bm{\beta}^{(2)})=\underbrace{\|\mathbb{E}(\hat{\bm{\beta}}|\bm{X})-\bm{\beta}^{(2)}\|_{\bm{\Sigma}^{(2)}}^{2}}_{B(\hat{\bm{\beta}};\bm{\beta}^{(2)})}+\underbrace{\textnormal{Tr}[\textnormal{Cov}(\hat{\bm{\beta}}|\bm{X})\bm{\Sigma}^{(2)}]}_{V(\hat{\bm{\beta}};\bm{\beta}^{(2)})}. (2.12)

Plugging in (2.9), we immediately obtain the following result, which we will use crucially throughout the rest of the paper:

Lemma 2.1.

Under Assumption 1, the min-norm interpolator (2.7) has variance

V(𝜷^;𝜷(2))=σ2nTr(𝚺^𝚺(2))V(\hat{\bm{\beta}};\bm{\beta}^{(2)})=\frac{\sigma^{2}}{n}\textnormal{Tr}(\hat{\bm{\Sigma}}^{\dagger}\bm{\Sigma}^{(2)}) (2.13)

and bias

B(𝜷^;𝜷(2))=\displaystyle B(\hat{\bm{\beta}};\bm{\beta}^{(2)})= 𝜷(2)(𝚺^𝚺^𝑰)𝚺(2)(𝚺^𝚺^𝑰)𝜷(2)B1(𝜷^;𝜷(2))+𝜷~(𝑿(1)𝑿(1)n)𝚺^𝚺(2)𝚺^(𝑿(1)𝑿(1)n)𝜷~B2(𝜷^;𝜷(2))\displaystyle\underbrace{\bm{\beta}^{(2)^{\top}}(\hat{\bm{\Sigma}}^{\dagger}\hat{\bm{\Sigma}}-\bm{I})\bm{\Sigma}^{(2)}(\hat{\bm{\Sigma}}^{\dagger}\hat{\bm{\Sigma}}-\bm{I})\bm{\beta}^{(2)}}_{B_{1}(\hat{\bm{\beta}};\bm{\beta}^{(2)})}+\underbrace{\tilde{\bm{\beta}}^{\top}(\frac{\bm{X}^{(1)^{\top}}\bm{X}^{(1)}}{n})\hat{\bm{\Sigma}}^{\dagger}\bm{\Sigma}^{(2)}\hat{\bm{\Sigma}}^{\dagger}(\frac{\bm{X}^{(1)^{\top}}\bm{X}^{(1)}}{n})\tilde{\bm{\beta}}}_{B_{2}(\hat{\bm{\beta}};\bm{\beta}^{(2)})} (2.14)
+2𝜷(2)(𝚺^𝚺^𝑰)𝚺(2)𝚺^(𝑿(1)𝑿(1)n)𝜷~B3(𝜷^;𝜷(2)),\displaystyle+\underbrace{2\bm{\beta}^{(2)^{\top}}(\hat{\bm{\Sigma}}^{\dagger}\hat{\bm{\Sigma}}-\bm{I})\bm{\Sigma}^{(2)}\hat{\bm{\Sigma}}^{\dagger}(\frac{\bm{X}^{(1)^{\top}}\bm{X}^{(1)}}{n})\tilde{\bm{\beta}}}_{B_{3}(\hat{\bm{\beta}};\bm{\beta}^{(2)})},

where 𝚺^=𝐗𝐗/n\hat{\bm{\Sigma}}=\bm{X}^{\top}\bm{X}/n is the (uncentered) sample covariance matrix obtained on appending the source and target samples, and 𝛃~:=𝛃(1)𝛃(2)\tilde{\bm{\beta}}:=\bm{\beta}^{(1)}-\bm{\beta}^{(2)} is the signal shift vector.

Proof.

The proof is straightforward: plug in the respective definitions and use the identity (\bm{X}^{\top}\bm{X})^{\dagger}\bm{X}^{\top}\bm{X}(\bm{X}^{\top}\bm{X})^{\dagger}=(\bm{X}^{\top}\bm{X})^{\dagger}. ∎
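As a quick numerical sanity check of the variance formula (2.13), one can hold the designs fixed, redraw the noise many times, and compare the empirical variance of the interpolator with the trace formula; a sketch with arbitrary dimensions and \bm{\Sigma}^{(2)}=\bm{I} (all choices ours, for illustration only) follows:

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, p, sigma = 40, 20, 100, 0.7
X1, X2 = rng.standard_normal((n1, p)), rng.standard_normal((n2, p))
X = np.vstack([X1, X2])
n = n1 + n2
beta1 = beta2 = rng.standard_normal(p) / np.sqrt(p)   # model shift is irrelevant for the variance

V_theory = sigma**2 / n * np.trace(np.linalg.pinv(X.T @ X / n))   # (2.13) with Sigma^(2) = I

P = np.linalg.pinv(X.T @ X)
draws = []
for _ in range(300):
    y1 = X1 @ beta1 + sigma * rng.standard_normal(n1)
    y2 = X2 @ beta2 + sigma * rng.standard_normal(n2)
    draws.append(P @ (X1.T @ y1 + X2.T @ y2))         # pooled interpolator (2.8)
draws = np.array(draws)
V_empirical = draws.var(axis=0, ddof=1).sum()         # Tr[Cov(beta_hat | X)] since Sigma^(2) = I
print(V_theory, V_empirical)                          # the two should be close
```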

In the next two sections, we lay out our main results, which provide precise risk characterizations of the pooled min-\ell_{2}-norm interpolator. The bias and variance terms, as decomposed in Lemma 2.1, will be presented separately.

Throughout the rest of the paper, we write f(p)=O(g(p)) if there exists a constant C such that |f(p)|\leq Cg(p) for large enough p. Moreover, we say an event \mathcal{E} happens with high probability (w.h.p.) if \mathbb{P}(\mathcal{E})\rightarrow 1 as p\rightarrow\infty.

3 Model Shift

We first investigate the effect of model shift. More precisely, we assume the covariates \bm{X}^{(1)} and \bm{X}^{(2)} come from the same distribution, but the underlying signal changes from \bm{\beta}^{(1)} to \bm{\beta}^{(2)}. We denote the shift in the signal parameter by \tilde{\bm{\beta}}:=\bm{\beta}^{(1)}-\bm{\beta}^{(2)}. We will first derive the precise risk of the pooled min-\ell_{2}-norm interpolator and then compare it with the min-\ell_{2}-norm interpolator that uses solely the target data.

3.1 Risk under isotropic design

Recall the formula for the risk R(\hat{\bm{\beta}},\bm{\beta}^{(2)}) in (2.12) and its decomposition in Lemma 2.1. In the sequel, we make the further isotropic assumption that \bm{\Sigma}^{(1)}=\bm{\Sigma}^{(2)}=\bm{I} and leave generalizations to future work. We will split the risk into bias and variance following Lemma 2.1 and characterize their behavior separately. Note that \bm{\Sigma}^{(1)}=\bm{\Sigma}^{(2)}=\bm{I} yields B_{3}(\hat{\bm{\beta}};\bm{\beta}^{(2)})=0. Therefore, our contribution focuses on characterizing the asymptotic behavior of B_{1}(\hat{\bm{\beta}};\bm{\beta}^{(2)}),B_{2}(\hat{\bm{\beta}};\bm{\beta}^{(2)}) and V(\hat{\bm{\beta}};\bm{\beta}^{(2)}) as defined in (2.13) and (2.14). Our main result in this regard is given below.

Theorem 3.1 (Risk under model shift).

Suppose Assumption 1 holds, and further assume that \bm{\Sigma}^{(1)}=\bm{\Sigma}^{(2)}=\bm{I} and that (\bm{Z}^{(1)},\bm{Z}^{(2)}) are both Gaussian random matrices. Then, with high probability over the randomness of (\bm{Z}^{(1)},\bm{Z}^{(2)}), we have

V(𝜷^;𝜷(2))\displaystyle V(\hat{\bm{\beta}};\bm{\beta}^{(2)}) =σ2npn+O(σ2p1/7),\displaystyle=\sigma^{2}\frac{n}{p-n}+O(\sigma^{2}p^{-1/7}), (3.1)
B1(𝜷^;𝜷(2))\displaystyle B_{1}(\hat{\bm{\beta}};\bm{\beta}^{(2)}) =pnp𝜷(2)22+O(p1/7𝜷(2)22),\displaystyle=\frac{p-n}{p}\|\bm{\beta}^{(2)}\|_{2}^{2}+O(p^{-1/7}\|\bm{\beta}^{(2)}\|_{2}^{2}), (3.2)
B2(𝜷^;𝜷(2))\displaystyle B_{2}(\hat{\bm{\beta}};\bm{\beta}^{(2)}) =n1(pn1)p(pn)𝜷~22+O(p1/7𝜷~22).\displaystyle=\frac{n_{1}(p-n_{1})}{p(p-n)}\|\tilde{\bm{\beta}}\|_{2}^{2}+O(p^{-1/7}\|\tilde{\bm{\beta}}\|_{2}^{2}). (3.3)

We remark that Theorem 3.1 holds true for finite values of nn and pp. It characterizes the bias and variance of the pooled min-2\ell_{2}-norm interpolator for the overparametrized regime p>np>n. In fact, it shows that the bias can be decomposed into two terms (3.2) and (3.3), which depend only on the 2\ell_{2}-norm of 𝜷(2)\bm{\beta}^{(2)} and 𝜷~\tilde{\bm{\beta}} respectively. Therefore, the generalization error is independent of the angle between 𝜷(1)\bm{\beta}^{(1)} and 𝜷(2)\bm{\beta}^{(2)}. The variance V(𝜷^;𝜷(2))V(\hat{\bm{\beta}};\bm{\beta}^{(2)}) is decreasing in p/np/n, but the bias B(𝜷^;𝜷(2))B(\hat{\bm{\beta}};\bm{\beta}^{(2)}) is not necessarily monotone. We provide extensive numerical examples of R(𝜷^;𝜷(2))R(\hat{\bm{\beta}};\bm{\beta}^{(2)}) in Section 3.5 showing that the risk function still exhibits the well-known double-descent phenomena. The main challenge in establishing Theorem 3.1 lies in the analysis of B2(𝜷^;𝜷(2))B_{2}(\hat{\bm{\beta}};\bm{\beta}^{(2)}). To this end, we develop a novel local law that involves extensive use of free probability. More details and explanations are provided in Section 5. The theorem allows us to answer the following questions:

  1. (i)

    When does the pooled min-2\ell_{2}-norm interpolator outperform the target-based min-2\ell_{2}-norm interpolator? (Section 3.2)

  2. (ii)

    What is the optimal target sample size that minimizes the generalization error of pooled min-2\ell_{2}-norm interpolator? (Section 3.3)

3.2 Performance Comparison

In this section, we compare the risk of 𝜷^\hat{\bm{\beta}}, the pooled min-2\ell_{2}-norm interpolator, with that of the interpolator using only the target dataset, defined as follows:

𝜷^(2)=argmin𝒃2s.t.𝒚(2)=𝑿(2)𝒃.\hat{\bm{\beta}}^{(2)}=\operatorname*{arg\,min}\|\bm{b}\|_{2}\ \ \text{s.t.}\ \bm{y}^{(2)}=\bm{X}^{(2)}\bm{b}. (3.4)

The risk of β^(2)\hat{\beta}^{(2)} has been previously studied:

Proposition 3.2 (Corollary of Theorem 1 in [29]).

Consider the estimator 𝛃^(2)\hat{\bm{\beta}}^{(2)} in (3.4). Under the same setting as Theorem 3.1 and assuming that 𝛃(2)22\|\bm{\beta}^{(2)}\|_{2}^{2} is bounded, we have almost surely,

R(𝜷^(2);𝜷(2))=pn2p𝜷(2)22+n2pn2σ2+o(1)R(\hat{\bm{\beta}}^{(2)};\bm{\beta}^{(2)})=\frac{p-n_{2}}{p}\|\bm{\beta}^{(2)}\|_{2}^{2}+\frac{n_{2}}{p-n_{2}}\sigma^{2}+o(1) (3.5)

The above proposition, along with our results on model shift (Theorem 3.1), allows us to characterize when the pooled min-\ell_{2}-norm interpolator yields lower risk than the target-only min-\ell_{2}-norm interpolator, i.e., when data pooling leads to positive or negative transfer:

Proposition 3.3 (Impact of model shift).

Assume the same setting as Theorem 3.1, suppose that \|\bm{\beta}^{(2)}\|_{2}^{2},\|\tilde{\bm{\beta}}\|_{2}^{2} are bounded, and define the signal-to-noise ratio \rm{SNR}:=\frac{\|\bm{\beta}^{(2)}\|_{2}^{2}}{\sigma^{2}} and the shift-to-signal ratio \rm{SSR}:=\frac{\|\tilde{\bm{\beta}}\|_{2}^{2}}{\|\bm{\beta}^{(2)}\|_{2}^{2}}. Then the following statements hold with high probability.

  1. (i)

    If SNRp2(pn)(pn2)\rm{SNR}\leq\frac{p^{2}}{(p-n)(p-n_{2})}, then

    R(𝜷^(2);𝜷(2))R(𝜷^;𝜷(2))+o(1).R(\hat{\bm{\beta}}^{(2)};\bm{\beta}^{(2)})\leq R(\hat{\bm{\beta}};\bm{\beta}^{(2)})+o(1). (3.6)
  2. (ii)

    If SNR>p2(pn)(pn2)\rm{SNR}>\frac{p^{2}}{(p-n)(p-n_{2})}, then there exists

    ρ:=pnpn1p2(pn1)(pn2)1SNR\rho:=\frac{p-n}{p-n_{1}}-\frac{p^{2}}{(p-n_{1})(p-n_{2})}\cdot\frac{1}{\rm{SNR}} (3.7)

    such that when SSRρ\rm{SSR}\geq\rho, then (3.6) holds; when SSR<ρ\rm{SSR}<\rho, then

    R(𝜷^;𝜷(2))R(𝜷^(2);𝜷(2))+o(1).R(\hat{\bm{\beta}};\bm{\beta}^{(2)})\leq R(\hat{\bm{\beta}}^{(2)};\bm{\beta}^{(2)})+o(1). (3.8)
Proof.

From Theorem 3.1 and Proposition 3.2, we know with high probability,

R(𝜷^;𝜷(2))R(𝜷^(2);𝜷(2))=n1p𝜷(2)22+n1p(pn)(pn2)σ2+n1(pn1)p(pn)𝜷~22+o(1).R(\hat{\bm{\beta}};\bm{\beta}^{(2)})-R(\hat{\bm{\beta}}^{(2)};\bm{\beta}^{(2)})=-\frac{n_{1}}{p}\|\bm{\beta}^{(2)}\|_{2}^{2}+\frac{n_{1}p}{(p-n)(p-n_{2})}\sigma^{2}+\frac{n_{1}(p-n_{1})}{p(p-n)}\|\tilde{\bm{\beta}}\|_{2}^{2}+o(1).

The rest follows with simple algebra. ∎

The implications of Proposition 3.3 are as follows:

  1. (i)

    When the SNR is small, i.e., \rm{SNR}\leq\frac{1}{(1-1/\gamma)(1-1/\gamma_{2})}, adding more data always hurts the pooled min-\ell_{2}-norm interpolator. Note that, even if the source and target populations have exactly the same model, pooling the data induces higher risk for the interpolator.

  2. (ii)

    If the SNR is large, whether pooling the data helps depends on the level of model shift. Namely, if the SSR exceeds \rho defined in (3.7), the models are far apart, resulting in negative transfer; otherwise, pooling the datasets helps. In particular, if \rm{SSR}\geq\frac{1-1/\gamma}{1-1/\gamma_{1}}, then data pooling hurts for any value of \rm{SNR}.

  3. (iii)

    Since the \rm{SNR} and \rm{SSR} can be consistently estimated from the available samples (see Section 3.3), Proposition 3.3 provides a data-driven method for deciding whether to use the pooled interpolator or the target-based interpolator.

In sum, in the overparametrized regime, positive transfer happens when the model shift is low and the SNR is sufficiently high. Figure 1 illustrates our findings in a simulation setting; a sketch of the resulting decision rule appears below.
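The comparison in Proposition 3.3 can be packaged as a simple decision rule; the helper below is our own illustrative implementation (the function and argument names are not from the paper):

```python
def pooling_helps(snr, ssr, n1, n2, p):
    """Return True if, per Proposition 3.3, the pooled min-l2-norm interpolator
    is predicted to have lower risk than the target-only interpolator."""
    n = n1 + n2
    assert p > n, "the comparison applies in the overparametrized regime p > n"
    threshold = p**2 / ((p - n) * (p - n2))
    if snr <= threshold:          # low SNR: adding source data always hurts
        return False
    rho = (p - n) / (p - n1) - p**2 / ((p - n1) * (p - n2)) / snr
    return ssr < rho              # pooling helps only if the model shift is small enough

# Example: pooling_helps(snr=5.0, ssr=0.2, n1=200, n2=100, p=600) -> True
```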

3.3 Choice of interpolator

Our findings in the previous section highlight that, to achieve the least possible risk for the pooled min-\ell_{2}-norm interpolator, one should neither always pool both datasets nor always use all samples from either dataset. In fact, if the SNR and SSR were known, it would be possible to identify subsets of the source and target datasets, of sample sizes \tilde{n}_{1},\tilde{n}_{2} respectively, such that the risk of the pooled min-\ell_{2}-norm interpolator based on the combined data of size \tilde{n}_{1}+\tilde{n}_{2} is minimized. To this end, the purpose of this section is two-fold: first, to provide an estimation strategy for the SNR and SSR, and second, to invoke these estimators and provide a guideline for choosing appropriate sample sizes.

For SNR and SSR estimation, we employ a simplified version of the recently introduced HEDE method [50]. Let 𝜷^d(k)\hat{\bm{\beta}}_{d}^{(k)} denote the following debiased Lasso estimator [32, 12]:

𝜷^d(k)=𝜷^L(k)+𝑿(k)(𝒚(k)𝑿(k)𝜷^L(k))nk𝜷^L(k)0\hat{\bm{\beta}}_{d}^{(k)}=\hat{\bm{\beta}}_{\rm{L}}^{(k)}+\frac{\bm{X}^{(k)\top}(\bm{y}^{(k)}-\bm{X}^{(k)}\hat{\bm{\beta}}_{\rm{L}}^{(k)})}{n_{k}-\|\hat{\bm{\beta}}_{\rm{L}}^{(k)}\|_{0}}

that is based on the Lasso estimator for some fixed λL\lambda_{\rm{L}}:

𝜷^L(k):=argmin𝒃12n𝒚(k)𝑿(k)𝒃22+λLnk𝒃1,\hat{\bm{\beta}}_{\rm{L}}^{(k)}:=\operatorname*{arg\,min}_{\bm{b}}\frac{1}{2n}\|\bm{y}^{(k)}-\bm{X}^{(k)}\bm{b}\|_{2}^{2}+\frac{\lambda_{\textrm{L}}}{\sqrt{n_{k}}}\|\bm{b}\|_{1},

and let

τ^k2=𝒚(k)𝑿(k)𝜷^L(k)22(nk𝜷^L(k)0)2\hat{\tau}_{k}^{2}=\frac{\|\bm{y}^{(k)}-\bm{X}^{(k)}\hat{\bm{\beta}}_{\textrm{L}}^{(k)}\|_{2}^{2}}{(n_{k}-\|\hat{\bm{\beta}}_{\textrm{L}}^{(k)}\|_{0})^{2}}

be the corresponding estimated variance for k=1,2k=1,2. Then by Theorem 4.1 and Proposition 4.2 in [50], we know

𝜷^d(k)22pτ^k2𝜷(k)22,k=1,2,\displaystyle\|\hat{\bm{\beta}}_{d}^{(k)}\|_{2}^{2}-p\hat{\tau}^{2}_{k}\rightarrow\|\bm{\beta}^{(k)}\|_{2}^{2},\ \ k=1,2,
Var^(𝒚(k))(𝜷^d(k)22pτ^k2)σ2,k=1,2,\displaystyle\widehat{\textnormal{Var}}(\bm{y}^{(k)})-(\|\hat{\bm{\beta}}_{d}^{(k)}\|_{2}^{2}-p\hat{\tau}^{2}_{k})\rightarrow\sigma^{2},\ \ k=1,2,
𝜷^1d𝜷^2d22p(τ^12+τ^22)𝜷(1)𝜷(2)22.\displaystyle\|\hat{\bm{\beta}}^{d}_{1}-\hat{\bm{\beta}}^{d}_{2}\|_{2}^{2}-p(\hat{\tau}^{2}_{1}+\hat{\tau}^{2}_{2})\rightarrow\|\bm{\beta}^{(1)}-\bm{\beta}^{(2)}\|_{2}^{2}.

where the last line invokes independence between the datasets. Hence, we obtain the following consistent estimators for SNR and SSR:

\widehat{\text{SNR}}:=\frac{\|\hat{\bm{\beta}}_{d}^{(2)}\|_{2}^{2}-p\hat{\tau}^{2}_{2}}{\widehat{\textnormal{Var}}(\bm{y}^{(2)})-(\|\hat{\bm{\beta}}_{d}^{(2)}\|_{2}^{2}-p\hat{\tau}^{2}_{2})}\rightarrow\rm{SNR},
\widehat{\text{SSR}}:=\frac{\|\hat{\bm{\beta}}^{d}_{1}-\hat{\bm{\beta}}^{d}_{2}\|_{2}^{2}-p(\hat{\tau}^{2}_{1}+\hat{\tau}^{2}_{2})}{\|\hat{\bm{\beta}}_{d}^{(2)}\|_{2}^{2}-p\hat{\tau}^{2}_{2}}\rightarrow\rm{SSR}. (3.9)

Given the estimators of the SNR and SSR, denoted by \widehat{\text{SNR}} and \widehat{\text{SSR}} respectively, an important question arises: what is the optimal size of the target dataset for which the pooled min-\ell_{2}-norm interpolator achieves the least mean-squared prediction error R(\hat{\bm{\beta}};\bm{\beta}^{(2)})? This is answered by the following corollary:

Corollary 3.4.

Assume the conditions of Theorem 3.1 hold. Denote by \hat{\bm{\beta}}_{t} the pooled min-\ell_{2}-norm interpolator based on n_{1} samples from the source and n_{2} samples from the target, where n_{2}/p=t. Then we have

argmint:t>0,n1+n2<pR(𝜷^t;𝜷(2))=(pn1p2SNR+n1SSR)+,n2/p=t,\operatorname*{arg\,min}_{t:t>0,n_{1}+n_{2}<p}R(\hat{\bm{\beta}}_{t};\bm{\beta}^{(2)})=\left(p-n_{1}-\sqrt{\frac{p^{2}}{\rm{SNR}}+n_{1}\rm{SSR}}\right)_{+},\quad n_{2}/p=t, (3.10)

where x+=xx_{+}=x if x>0x>0 and =0=0 otherwise.

Proof.

Note that, by Theorem 3.1, we have

R(𝜷^;𝜷(2))=σ2npn+pnp𝜷(2)22+n1(pn1)p(pn)𝜷~22+o(1).R(\hat{\bm{\beta}};\bm{\beta}^{(2)})=\sigma^{2}\frac{n}{p-n}+\frac{p-n}{p}\|\bm{\beta}^{(2)}\|_{2}^{2}+\frac{n_{1}(p-n_{1})}{p(p-n)}\|\tilde{\bm{\beta}}\|_{2}^{2}+o(1).

Plugging in n_{2}=pt and differentiating finishes the proof. ∎

This corollary validates our intuition for the problem: if the SNR is low or the SSR is large, a large amount of target data is not necessary for achieving good generalization. In fact, given the estimators \widehat{\rm{SNR}}, \widehat{\rm{SSR}}, a data-driven estimate of the optimal target data size is given by

n^2,opt=(pn1p2SNR^+n1SSR^)+\hat{n}_{2,\text{opt}}=\left(p-n_{1}-\sqrt{\frac{p^{2}}{\widehat{\rm{SNR}}}+n_{1}\widehat{\rm{SSR}}}\right)_{+}
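A possible implementation of this pipeline is sketched below. The Lasso is fit with scikit-learn, whose penalty parameterization differs from \lambda_{\rm{L}} above (we simply treat alpha as a fixed tuning choice), all function names are our own, and the last helper is the plug-in rule of Corollary 3.4:

```python
import numpy as np
from sklearn.linear_model import Lasso

def debiased_lasso(X, y, alpha=0.1):
    """Debiased Lasso estimate and the variance proxy tau^2 used in (3.9).

    `alpha` plays the role of lambda_L up to scikit-learn's parameterization;
    it is treated here as an arbitrary fixed tuning choice.
    """
    n = X.shape[0]
    beta_l = Lasso(alpha=alpha, fit_intercept=False).fit(X, y).coef_
    df = n - np.count_nonzero(beta_l)               # n_k - ||beta_L||_0
    resid = y - X @ beta_l
    beta_d = beta_l + X.T @ resid / df
    tau2 = np.sum(resid**2) / df**2
    return beta_d, tau2

def estimate_snr_ssr(X1, y1, X2, y2, alpha=0.1):
    """Plug-in estimates of SNR and SSR following (3.9)."""
    p = X1.shape[1]
    bd1, t1 = debiased_lasso(X1, y1, alpha)
    bd2, t2 = debiased_lasso(X2, y2, alpha)
    signal2 = np.sum(bd2**2) - p * t2               # estimates ||beta^(2)||_2^2
    noise2 = np.var(y2) - signal2                   # estimates sigma^2
    shift = np.sum((bd1 - bd2)**2) - p * (t1 + t2)  # estimates ||beta^(1)-beta^(2)||_2^2
    return signal2 / noise2, shift / signal2

def optimal_target_size(snr, ssr, n1, p):
    """Estimated optimal target sample size from Corollary 3.4."""
    return max(0.0, p - n1 - np.sqrt(p**2 / snr + n1 * ssr))
```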

3.4 Extension to multiple source distributions

Our results for model shift, i.e., Theorem 3.1, can be extended to the general framework of multiple source datasets and one target dataset. Denote the target by T and the K sources by 1,\ldots,K, and further denote the signal shift vectors by \tilde{\bm{\beta}}^{(k)}:=\bm{\beta}^{(k)}-\bm{\beta}^{(T)}. Assume we are still in the overparametrized regime p>n_{1}+\cdots+n_{K}+n_{T}, and consider the pooled min-\ell_{2}-norm interpolator defined as

𝜷^=argmin𝒃2s.t.𝒚(T)=𝑿(T)𝒃,𝒚(k)=𝑿(k)𝒃,k=1,,K.\hat{\bm{\beta}}=\operatorname*{arg\,min}\|\bm{b}\|_{2}\ \ \text{s.t.}\ \bm{y}^{(T)}=\bm{X}^{(T)}\bm{b},\ \bm{y}^{(k)}=\bm{X}^{(k)}\bm{b},\ k=1,...,K. (3.11)

The following proposition characterizes the bias and variance of the pooled min-\ell_{2}-norm interpolator under this setting:

Proposition 3.5.

Suppose Assumption 1 holds (adjusted for multiple source datasets k=1,\ldots,K and a target dataset T), and further assume that \bm{\Sigma}^{(1)}=\cdots=\bm{\Sigma}^{(T)}=\bm{I} and that (\bm{Z}^{(1)},\ldots,\bm{Z}^{(K)},\bm{Z}^{(T)}) are Gaussian random matrices. Then, with high probability over the randomness of (\bm{Z}^{(1)},\ldots,\bm{Z}^{(K)},\bm{Z}^{(T)}), we have

V(𝜷^;𝜷(T))\displaystyle V(\hat{\bm{\beta}};\bm{\beta}^{(T)}) =σ2npn+O(σ2p1/7),\displaystyle=\sigma^{2}\frac{n}{p-n}+O(\sigma^{2}p^{-1/7}), (3.12)
B1(𝜷^;𝜷(T))\displaystyle B_{1}(\hat{\bm{\beta}};\bm{\beta}^{(T)}) =pnp𝜷(T)22+O(p1/7𝜷(T)22),\displaystyle=\frac{p-n}{p}\|\bm{\beta}^{(T)}\|_{2}^{2}+O(p^{-1/7}\|\bm{\beta}^{(T)}\|_{2}^{2}), (3.13)
B2(𝜷^;𝜷(T))\displaystyle B_{2}(\hat{\bm{\beta}};\bm{\beta}^{(T)}) =k=1Knk(pnk)p(pnknT)𝜷~(k)22+O(p1/7𝜷~(k)22).\displaystyle=\sum_{k=1}^{K}\frac{n_{k}(p-n_{k})}{p(p-n_{k}-n_{T})}\|\tilde{\bm{\beta}}^{(k)}\|_{2}^{2}+O(p^{-1/7}\|\tilde{\bm{\beta}}^{(k)}\|_{2}^{2}). (3.14)

The proof of Proposition 3.5 is similar to that of Theorem 3.1 and is therefore omitted.

3.5 Numerical examples under model shift

In this Section, we demonstrate the applicability of our results for moderate sample sizes, as illustrated in Figure 1. Specifically, we explore the generalization error of the pooled min-2\ell_{2}-norm interpolator in a scenario where both the source and target data follow isotropic Gaussian distributions. The signals are generated as follows: βi(2)𝒩(0,σβ2/p)\beta_{i}^{(2)}\sim\mathcal{N}(0,\sigma_{\beta}^{2}/p) and βi(1)=βi(2)+𝒩(0,σs2/p)\beta_{i}^{(1)}=\beta_{i}^{(2)}+\mathcal{N}(0,\sigma_{s}^{2}/p) (and thereafter considered fixed). The signal shift is thus given by 𝜷~𝒩(0,σs2/p)\tilde{\bm{\beta}}\sim\mathcal{N}(0,\sigma_{s}^{2}/p). We display the generalization error across varied combinations of source and target sample sizes (denoted by n1n_{1} and n2n_{2} respectively), number of covariates (pp), signal-to-noise ratios (SNRs), and shift-to-signal ratios (SSRs). For completeness, we also present plots of the generalization error for the underparameterized regime, where p<np<n, a regime previously explored in the literature [57].

Figure 1 shows the empirical generalization errors of the pooled min-2\ell_{2}-norm interpolator closely match the theoretical predictions from Theorem 3.1 in moderate dimensions. Additionally, we present the generalization error of the target-only interpolator, as established by [29]. The risk of the target-only interpolator matches with that of our pooled interpolator when n1=0n_{1}=0, validating our results.

Considering the OLS estimator's generalization error for p<n, the concatenated error curve reveals a double descent phenomenon. For smaller sample sizes from the source distribution, the risk of the pooled min-\ell_{2}-norm interpolator is a decreasing function of n_{1}, rendering the pooled interpolator superior to the target-only interpolator. However, if the SSR value is too high, the first stage of descent is missing, rendering the pooled min-\ell_{2}-norm interpolator inferior to its target-only counterpart. Figures 1 (b) and (c) display this dependence on the SNR and SSR values. Namely, higher signal strength and smaller model shift magnitudes lead to a wider range of n_{1} for which the pooled interpolator is better. Figure 1 (a) further demonstrates the intricate dependence of this dichotomy on n_{1},n_{2} and p, indicating that the pooled interpolator performs particularly well when n_{1},n_{2} are much smaller than p, which is commonplace in most modern datasets. These findings reinforce and complement the conclusions from Proposition 3.3, providing a more complete picture of the positive/negative effects of transfer learning for the pooled min-\ell_{2}-norm interpolator under overparametrization.

(a) Fixing SNR=5, SSR=0.2. (b) Fixing n_{2}=100, p=600, SNR=5. (c) Fixing n_{2}=100, p=600, SSR=0.2.
Figure 1: Generalization error of the pooled min-\ell_{2}-norm interpolator under model shift. Solid lines: theoretically predicted values. + marks: empirical values. Dotted horizontal line: risk of the min-\ell_{2}-norm interpolator using only the target data. Design choices: \beta_{i}^{(2)}\sim\mathcal{N}(0,\sigma_{\beta}^{2}/p) where \sigma_{\beta}^{2}=\textnormal{SNR}, \beta_{i}^{(1)}=\beta_{i}^{(2)}+\mathcal{N}(0,\sigma_{s}^{2}/p) where \sigma_{s}^{2}=\sigma_{\beta}^{2}\cdot\textnormal{SSR}, \bm{\Sigma}^{(1)}=\bm{\Sigma}^{(2)}=\bm{I}, and \bm{X}^{(k)} has i.i.d. rows from \mathcal{N}(0,\bm{\Sigma}^{(k)}). The signals and design matrices are then held fixed. We generate 50 random \epsilon_{i}\sim\mathcal{N}(0,1) and report the average empirical risks.

4 Covariate Shift

In this section, we characterize the bias and variance of the pooled min-2\ell_{2}-norm interpolator under covariate shift. We assume the underlying signal 𝜷(1)=𝜷(2)=𝜷\bm{\beta}^{(1)}=\bm{\beta}^{(2)}=\bm{\beta} stays the same for both datasets, whereas the population covariance matrix changes from 𝚺(1)\bm{\Sigma}^{(1)} to 𝚺(2)\bm{\Sigma}^{(2)}. Under this condition, B2(𝜷^;𝜷(2))=B3(𝜷^;𝜷(2))=0B_{2}(\hat{\bm{\beta}};\bm{\beta}^{(2)})=B_{3}(\hat{\bm{\beta}};\bm{\beta}^{(2)})=0 in (2.14). We then compare the remaining bias term B1(𝜷^,𝜷(2))B_{1}(\hat{\bm{\beta}},\bm{\beta}^{(2)}) and the variance term V(𝜷^;𝜷(2))V(\hat{\bm{\beta}};\bm{\beta}^{(2)}) of the pooled min-2\ell_{2}-norm interpolator with the target-based interpolator and investigate conditions that determine achievability of positive/negative transfer.

4.1 Risk under simultaneously diagonalizable covariances

We assume that the source and target covariance matrices (𝚺(1),𝚺(2))(\bm{\Sigma}^{(1)},\bm{\Sigma}^{(2)}) are simultaneously diagonalizable, as defined below:

Definition 4.1 (Simultaneously diagonalizable).

For two symmetric matrices \bm{\Sigma}^{(1)},\bm{\Sigma}^{(2)}\in\mathbb{R}^{p\times p}, we say they are simultaneously diagonalizable if there exist an orthogonal matrix \bm{V} and two diagonal matrices \bm{\Lambda}^{(1)},\bm{\Lambda}^{(2)} such that \bm{\Sigma}^{(1)}=\bm{V}\bm{\Lambda}^{(1)}\bm{V}^{\top} and \bm{\Sigma}^{(2)}=\bm{V}\bm{\Lambda}^{(2)}\bm{V}^{\top}. Under this condition, the generalization error depends on the joint empirical distribution of the eigenvalues as well as the geometry between \bm{V} and \bm{\beta}^{(2)}. We encode these via the following probability distributions:

H^p(λ(1),λ(2))\displaystyle\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}) :=1pi=1p𝟏{(λ(1),λ(2))=(λi(1),λi(2))},\displaystyle:=\frac{1}{p}\sum_{i=1}^{p}\bm{1}_{\{(\lambda^{(1)},\lambda^{(2)})=(\lambda_{i}^{(1)},\lambda_{i}^{(2)})\}}, (4.1)
G^p(λ(1),λ(2))\displaystyle\hat{G}_{p}(\lambda^{(1)},\lambda^{(2)}) :=1𝜷(2)22i=1p𝜷(2),𝒗i2𝟏{(λ(1),λ(2))=(λi(1),λi(2))},\displaystyle:=\frac{1}{\|\bm{\beta}^{(2)}\|_{2}^{2}}\sum_{i=1}^{p}\langle\bm{\beta}^{(2)},\bm{v}_{i}\rangle^{2}\bm{1}_{\{(\lambda^{(1)},\lambda^{(2)})=(\lambda_{i}^{(1)},\lambda_{i}^{(2)})\}},

where \lambda_{i}^{(k)}:=\Lambda_{ii}^{(k)} for k=1,2, and \bm{v}_{i} is the i-th common eigenvector (the i-th column of \bm{V}).

The simultaneous diagonalizability assumption has previously been used in the transfer learning literature (cf. [42]). It ensures that the difference in the covariate distributions of the source and the target is solely captured by the eigenvalues λi(1),λi(2)\lambda_{i}^{(1)},\lambda_{i}^{(2)}. Characterizing the generalization error when covariance matrices lack a common orthonormal basis appears to be fairly technical, and we defer this to future work.

Note that in the typical overparameterized setup with a single dataset, the risk of the min-2\ell_{2}-norm interpolator is expressed in terms of the empirical distribution of eigenvalues and the angle between the signal and the eigenvector of the covariance matrix (equation (9) in [29]). Consequently, it is natural to expect that the generalization error of our estimator will depend on H^p\hat{H}_{p} and G^p\hat{G}_{p} defined by (4.1), which indeed proves to be the case. The following presents our main result concerning covariate shift.

Theorem 4.1 (Risk under covariate shift).

Suppose Assumption 1 holds, and further assume that \bm{\Sigma}^{(1)},\bm{\Sigma}^{(2)} are simultaneously diagonalizable with joint spectral distributions defined in (4.1). Then, with high probability over the randomness of (\bm{Z}^{(1)},\bm{Z}^{(2)}), we have

V(𝜷^;𝜷(2))=σ2γλ(2)(a~3λ(1)+a~4λ(2))(a~1λ(1)+a~2λ(2)+1)2𝑑H^p(λ(1),λ(2))+O(p1/12)\displaystyle V(\hat{\bm{\beta}};\bm{\beta}^{(2)})=-\sigma^{2}\gamma\int\frac{\lambda^{(2)}(\tilde{a}_{3}\lambda^{(1)}+\tilde{a}_{4}\lambda^{(2)})}{(\tilde{a}_{1}\lambda^{(1)}+\tilde{a}_{2}\lambda^{(2)}+1)^{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})+O(p^{-1/12}) (4.2)
B(𝜷^;𝜷(2))=𝜷(2)22b~3λ(1)+(b~4+1)λ(2)(b~1λ(1)+b~2λ(2)+1)2𝑑G^p(λ(1),λ(2))+O(p1/12)\displaystyle B(\hat{\bm{\beta}};\bm{\beta}^{(2)})=\|\bm{\beta}^{(2)}\|_{2}^{2}\cdot\int\frac{\tilde{b}_{3}\lambda^{(1)}+(\tilde{b}_{4}+1)\lambda^{(2)}}{(\tilde{b}_{1}\lambda^{(1)}+\tilde{b}_{2}\lambda^{(2)}+1)^{2}}d\hat{G}_{p}(\lambda^{(1)},\lambda^{(2)})+O(p^{-1/12}) (4.3)

where (a~1,a~2,a~3,a~4)(\tilde{a}_{1},\tilde{a}_{2},\tilde{a}_{3},\tilde{a}_{4}) is the unique solution, with a~1,a~2\tilde{a}_{1},\tilde{a}_{2} positive, to the following system of equations:

0\displaystyle 0 =1γa~1λ(1)+a~2λ(2)a~1λ(1)+a~2λ(2)+1𝑑H^p(λ(1),λ(2)),\displaystyle=1-\gamma\int\frac{\tilde{a}_{1}\lambda^{(1)}+\tilde{a}_{2}\lambda^{(2)}}{\tilde{a}_{1}\lambda^{(1)}+\tilde{a}_{2}\lambda^{(2)}+1}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}), (4.4)
0\displaystyle 0 =n1nγa~1λ(1)a~1λ(1)+a~2λ(2)+1𝑑H^p(λ(1),λ(2)),\displaystyle=\frac{n_{1}}{n}-\gamma\int\frac{\tilde{a}_{1}\lambda^{(1)}}{\tilde{a}_{1}\lambda^{(1)}+\tilde{a}_{2}\lambda^{(2)}+1}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}),
a~1+a~2\displaystyle\tilde{a}_{1}+\tilde{a}_{2} =γa~3λ(1)+a~4λ(2)(a~1λ(1)+a~2λ(2)+1)2𝑑H^p(λ(1),λ(2)),\displaystyle=-\gamma\int\frac{\tilde{a}_{3}\lambda^{(1)}+\tilde{a}_{4}\lambda^{(2)}}{(\tilde{a}_{1}\lambda^{(1)}+\tilde{a}_{2}\lambda^{(2)}+1)^{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}),
a~1\displaystyle\tilde{a}_{1} =γa~3λ(1)+λ(1)λ(2)(a~3a~2a~4a~1)(a~1λ(1)+a~2λ(2)+1)2𝑑H^p(λ(1),λ(2)),\displaystyle=-\gamma\int\frac{\tilde{a}_{3}\lambda^{(1)}+\lambda^{(1)}\lambda^{(2)}(\tilde{a}_{3}\tilde{a}_{2}-\tilde{a}_{4}\tilde{a}_{1})}{(\tilde{a}_{1}\lambda^{(1)}+\tilde{a}_{2}\lambda^{(2)}+1)^{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}),

and (b~1,b~2,b~3,b~4)(\tilde{b}_{1},\tilde{b}_{2},\tilde{b}_{3},\tilde{b}_{4}) is the unique solution, with b~1,b~2\tilde{b}_{1},\tilde{b}_{2} positive, to the following system of equations:

0\displaystyle 0 =1γb~1λ(1)+b~2λ(2)b~1λ(1)+b~2λ(2)+1𝑑H^p(λ(1),λ(2)),\displaystyle=1-\gamma\int\frac{\tilde{b}_{1}\lambda^{(1)}+\tilde{b}_{2}\lambda^{(2)}}{\tilde{b}_{1}\lambda^{(1)}+\tilde{b}_{2}\lambda^{(2)}+1}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}), (4.5)
0\displaystyle 0 =n1nγb~1λ(1)b~1λ(1)+b~2λ(2)+1𝑑H^p(λ(1),λ(2)),\displaystyle=\frac{n_{1}}{n}-\gamma\int\frac{\tilde{b}_{1}\lambda^{(1)}}{\tilde{b}_{1}\lambda^{(1)}+\tilde{b}_{2}\lambda^{(2)}+1}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}),
0\displaystyle 0 =λ(1)(b~3b~1λ(2))+λ(2)(b~4b~2λ(2))(b~1λ(1)+b~2λ(2)+1)2𝑑H^p(λ(1),λ(2)),\displaystyle=\int\frac{\lambda^{(1)}(\tilde{b}_{3}-\tilde{b}_{1}\lambda^{(2)})+\lambda^{(2)}(\tilde{b}_{4}-\tilde{b}_{2}\lambda^{(2)})}{(\tilde{b}_{1}\lambda^{(1)}+\tilde{b}_{2}\lambda^{(2)}+1)^{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}),
0\displaystyle 0 =λ(1)(b~3b~1λ(2))+λ(1)λ(2)(b~3b~2b~4b~1)(b~1λ(1)+b~2λ(2)+1)2𝑑H^p(λ(1),λ(2)).\displaystyle=\int\frac{\lambda^{(1)}(\tilde{b}_{3}-\tilde{b}_{1}\lambda^{(2)})+\lambda^{(1)}\lambda^{(2)}(\tilde{b}_{3}\tilde{b}_{2}-\tilde{b}_{4}\tilde{b}_{1})}{(\tilde{b}_{1}\lambda^{(1)}+\tilde{b}_{2}\lambda^{(2)}+1)^{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}).

Note that although (a~1,a~2)=(b~1,b~2)(\tilde{a}_{1},\tilde{a}_{2})=(\tilde{b}_{1},\tilde{b}_{2}) in the displayed equations above, we intentionally used different notations for clarity, as they are derived through distinct routes in the proof.
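Since (4.4) and (4.5) do not admit closed-form solutions in general, they can be solved numerically once \hat{H}_{p} is represented by a finite set of atoms. The sketch below is our own implementation (the initialization is heuristic and not guaranteed to locate the admissible root in every configuration); it solves (4.4) with scipy and evaluates the variance formula (4.2). As a consistency check, an isotropic input recovers the model-shift variance \sigma^{2}n/(p-n) from Theorem 3.1.

```python
import numpy as np
from scipy.optimize import fsolve

def variance_covariate_shift(lam1, lam2, weights, n1, n2, p, sigma2):
    """Numerically evaluate (4.2): solve the system (4.4) for (a1, a2, a3, a4)
    given a discrete joint spectral distribution H_p = {(lam1_i, lam2_i, w_i)}."""
    n = n1 + n2
    gamma = p / n

    def avg(values):                                  # integral against H_p
        return np.sum(weights * values)

    def residual(a):
        a1, a2, a3, a4 = a
        d = a1 * lam1 + a2 * lam2 + 1
        return [
            1 - gamma * avg((a1 * lam1 + a2 * lam2) / d),
            n1 / n - gamma * avg(a1 * lam1 / d),
            a1 + a2 + gamma * avg((a3 * lam1 + a4 * lam2) / d**2),
            a1 + gamma * avg((a3 * lam1 + lam1 * lam2 * (a3 * a2 - a4 * a1)) / d**2),
        ]

    a1, a2, a3, a4 = fsolve(residual, x0=[1.0, 1.0, -1.0, -1.0])
    d = a1 * lam1 + a2 * lam2 + 1
    return -sigma2 * gamma * avg(lam2 * (a3 * lam1 + a4 * lam2) / d**2)

# Isotropic check: lambda^(1) = lambda^(2) = 1 should reduce to sigma^2 * n / (p - n).
lam1 = lam2 = np.ones(600)
w = np.full(600, 1 / 600)
print(variance_covariate_shift(lam1, lam2, w, n1=200, n2=100, p=600, sigma2=1.0))
print(1.0 * 300 / (600 - 300))
```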

4.2 Performance Comparison Example: When does heterogeneity help?

In this section, we illustrate the impact of covariate shift on the performance of our pooled min-\ell_{2}-norm interpolator. Since the closed-form expressions for the bias and variance are complicated (see (4.2), (4.3)), we exhibit the impact through stylized examples. Specifically, we consider the setting where the target covariance is \bm{\Sigma}^{(2)}=\mathbf{I} and the source covariance \bm{\Sigma}^{(1)} has pairs of reciprocal eigenvalues: \bm{\Sigma}^{(1)}:=\bm{M}\in\mathcal{S}, where \mathcal{S} denotes the collection of all diagonal matrices with pairs of reciprocal eigenvalues. To elaborate, suppose that p is an even integer. In this case, \bm{\Sigma}^{(1)} is diagonal, with \lambda_{p+1-i}^{(1)}=1/\lambda_{i}^{(1)} for i=1,\ldots,p/2. Note that \mathbf{I}\in\mathcal{S}. This setting leads to a simplification of the bias and the variance, and also allows us to study the effects of covariate shift.

We will analyze the risk of the pooled min-2\ell_{2}-norm interpolator R(𝜷^;𝜷(2))R(\hat{\bm{\beta}};\bm{\beta}^{(2)}), defined by (2.11), under this covariate shift model. To emphasize that the difference in the source and target feature covariances is captured by the matrix 𝑴\bm{M}, we denote this risk as R^(𝑴)\hat{R}(\bm{M}) through the rest of the section.

Our first result studies the following question: under what conditions does heterogeneity reduce the risk for the pooled estimator? To this end, we compare R^(𝑴),𝑴𝒮\hat{R}(\bm{M}),\bm{M}\in\mathcal{S}, with R^(𝐈)\hat{R}(\mathbf{I}), i.e. the risk under covariate shift with the one without covariate shift.

Proposition 4.2 (Dependence on sample size).

Under the conditions of Theorem 4.1, and additionally assuming 𝚺(2)=𝐈\bm{\Sigma}^{(2)}=\mathbf{I}, the following holds with high probability:

  1. (i)

    When n1<min{p/2,pn2}n_{1}<\min\{p/2,p-n_{2}\}, R^(𝑴)R^(𝐈)+o(1)\hat{R}(\bm{M})\leq\hat{R}(\mathbf{I})+o(1) for any 𝑴𝒮\bm{M}\in\mathcal{S}.

  2. (ii)

    When p/2n1<pn2p/2\leq n_{1}<p-n_{2}, R^(𝑴)R^(𝐈)+o(1)\hat{R}(\bm{M})\geq\hat{R}(\mathbf{I})+o(1) for any 𝑴𝒮\bm{M}\in\mathcal{S}.

Proposition 4.2 is proved in Section D.4. Its implications are as follows: in the overparametrized regime, heterogeneity leads to lower risk for the pooled min-\ell_{2}-norm interpolator if and only if the sample size of the source data is small relative to the dimension, specifically, smaller than p/2. Note that since we consider the overparametrized regime, we always have p>n_{1}+n_{2}; in other words, n_{1}<p-n_{2}. Thus, Proposition 4.2 gives an exhaustive characterization of when heterogeneity helps in terms of the relations between p,n_{1},n_{2}. More generally, an additional question may be posed: how does the degree of heterogeneity affect the risk of the pooled min-\ell_{2}-norm interpolator? Investigating this is hard even for the structured class of covariance matrices \mathcal{S}. Therefore, we consider the case when \bm{\Sigma}^{(1)} has only two eigenvalues: \lambda_{p+1-i}^{(1)}=1/\lambda_{i}^{(1)}=\kappa for all 1\leq i\leq p/2 and some \kappa>1. We denote this covariance matrix by \bm{M}(\kappa). Note that \bm{M}(1)=\bm{I}, and \kappa here denotes the degree of heterogeneity. In this special setting, we can establish the following.

Proposition 4.3 (Dependence on the degree of heterogeneity).

Under the conditions of Theorem 4.1, and additionally assuming 𝚺(2)=𝐈\bm{\Sigma}^{(2)}=\mathbf{I}, the following holds with high probability:

  1. (i)

    When n1<min{p/2,pn2}n_{1}<\min\{p/2,p-n_{2}\}, R^(𝑴(κ1))R^(𝑴(κ2))+o(1)\hat{R}(\bm{M}(\kappa_{1}))\leq\hat{R}(\bm{M}(\kappa_{2}))+o(1) for any κ1>κ2>1\kappa_{1}>\kappa_{2}>1.

  2. (ii)

    When p/2<n1<pn2p/2<n_{1}<p-n_{2}, R^(𝑴(κ1))R^(𝑴(κ2))+o(1)\hat{R}(\bm{M}(\kappa_{1}))\geq\hat{R}(\bm{M}(\kappa_{2}))+o(1) for any κ1>κ2>1\kappa_{1}>\kappa_{2}>1.

  3. (iii)

    If n1=min{p/2,pn2}n_{1}=\min\{p/2,p-n_{2}\}, then R^(𝑴(κ))\hat{R}(\bm{M}(\kappa)) does not depend on κ1\kappa\geq 1.

The key takeaways from Proposition 4.3 can be summarized as follows:

  1. (i)

    If the size of source dataset is small, then the risk is a decreasing function of κ\kappa, i.e., more discrepancy between source and target dataset helps achieve lower risk for the pooled min-2\ell_{2}-norm interpolator.

  2. (ii)

    When the size of source dataset is large, then larger heterogeneity hurts the pooled min-2\ell_{2}-norm interpolator.
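For the two-eigenvalue design \bm{M}(\kappa), the joint spectral distribution \hat{H}_{p} reduces to two atoms of equal mass, so the numerical solver sketched after Theorem 4.1 can be reused directly; a hypothetical usage (the helper name is from that sketch, not from the paper):

```python
import numpy as np

# H_p for Sigma^(1) = M(kappa), Sigma^(2) = I: two atoms of equal mass.
kappa = 4.0
lam1 = np.array([kappa, 1.0 / kappa])
lam2 = np.array([1.0, 1.0])
w = np.array([0.5, 0.5])

# e.g. variance_covariate_shift(lam1, lam2, w, n1=200, n2=100, p=600, sigma2=1.0)
```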

Lastly, similar to Section 3.2, we compare the pooled min-\ell_{2}-norm interpolator with the target-based interpolator (3.4). Note that the risk of the latter under similar settings has been studied in Theorem 2 of [29], which we cite below:

Proposition 4.4.

Consider the estimator (3.4). Let \hat{H}_{p}^{(2)}(\lambda^{(2)}),\hat{G}_{p}^{(2)}(\lambda^{(2)}) be the marginal distributions of \hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}),\hat{G}_{p}(\lambda^{(1)},\lambda^{(2)}) in (4.1) with respect to \lambda^{(2)}. Assume the same setting as Theorem 4.1, and further that the entries of \bm{Z}^{(2)} have bounded \phi-th moments for all \phi\geq 2. Then, with high probability,

R(𝜷^(2);𝜷(2))=\displaystyle R(\hat{\bm{\beta}}^{(2)};\bm{\beta}^{(2)})= 𝜷(2)22{1+γc0(λ(2))2(1+c0γλ(2))2𝑑H^p(2)(λ(2))λ(2)(1+c0γλ(2))2𝑑H^p(2)(λ(2))}λ(2)(1+c0γλ(2))2𝑑G^p(2)(λ(2))\displaystyle\|\bm{\beta}^{(2)}\|_{2}^{2}\left\{1+\gamma c_{0}\frac{\int\frac{(\lambda^{(2)})^{2}}{(1+c_{0}\gamma\lambda^{(2)})^{2}}d\hat{H}_{p}^{(2)}(\lambda^{(2)})}{\int\frac{\lambda^{(2)}}{(1+c_{0}\gamma\lambda^{(2)})^{2}}d\hat{H}_{p}^{(2)}(\lambda^{(2)})}\right\}\cdot\int\frac{\lambda^{(2)}}{(1+c_{0}\gamma\lambda^{(2)})^{2}}d\hat{G}_{p}^{(2)}(\lambda^{(2)}) (4.6)
+σ2γc0(λ(2))2(1+c0γλ(2))2𝑑H^p(2)(λ(2))λ(2)(1+c0γλ(2))2𝑑H^p(2)(λ(2))+O(n1/7),\displaystyle+\sigma^{2}\gamma c_{0}\frac{\int\frac{(\lambda^{(2)})^{2}}{(1+c_{0}\gamma\lambda^{(2)})^{2}}d\hat{H}_{p}^{(2)}(\lambda^{(2)})}{\int\frac{\lambda^{(2)}}{(1+c_{0}\gamma\lambda^{(2)})^{2}}d\hat{H}_{p}^{(2)}(\lambda^{(2)})}+O(n^{-1/7}),

where c0c_{0} is the unique non-negative solution of

11γ=λ(2)1+c0γλ(2)𝑑H^p(2)(λ(2)).1-\frac{1}{\gamma}=\int\frac{\lambda^{(2)}}{1+c_{0}\gamma\lambda^{(2)}}d\hat{H}_{p}^{(2)}(\lambda^{(2)}).

We remark that using proof techniques similar to those of Theorem 4.1, the additional bounded moments condition in Proposition 4.4 can be relaxed to (2.3); we omit the details. Unfortunately, although the precise risks of both interpolators are available from Theorem 4.1 and Proposition 4.4, the complexity of the expressions prevents a simple comparison even for \bm{\Sigma}^{(2)}=\bm{I}. Thus we make the comparison by solving the expressions numerically in the next section.
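One way to carry out that numerical comparison is to solve the scalar equation for c_{0} with a root finder and plug the solution into (4.6). The sketch below is our own illustrative implementation for a discrete spectrum of \bm{\Sigma}^{(2)}; in the example call we take the aspect ratio to be p/n_{2}, the one relevant to the target-only fit, and check the isotropic case against Proposition 3.2:

```python
import numpy as np
from scipy.optimize import brentq

def target_only_risk(lam2, weights, g_weights, gamma, beta_norm2, sigma2):
    """Evaluate (4.6) for a discrete spectrum of Sigma^(2).

    `weights` and `g_weights` are the masses of H_p^(2) and G_p^(2) on the
    atoms `lam2`; `gamma` is the aspect ratio of the data used by the
    target-only interpolator (we pass p/n_2 below, since beta_hat^(2)
    depends on the target sample only).
    """
    def root_eq(c0):
        return np.sum(weights * lam2 / (1 + c0 * gamma * lam2)) - (1 - 1 / gamma)

    # Bracket [0, 1e6] assumes the equation changes sign there, i.e. a
    # non-negative root exists as stated in the proposition.
    c0 = brentq(root_eq, 0.0, 1e6)
    h2 = np.sum(weights * lam2**2 / (1 + c0 * gamma * lam2)**2)
    h1 = np.sum(weights * lam2 / (1 + c0 * gamma * lam2)**2)
    g1 = np.sum(g_weights * lam2 / (1 + c0 * gamma * lam2)**2)
    return beta_norm2 * (1 + gamma * c0 * h2 / h1) * g1 + sigma2 * gamma * c0 * h2 / h1

# Isotropic check: with Sigma^(2) = I this should match Proposition 3.2,
# i.e. (1 - n2/p) * ||beta||^2 + n2 / (p - n2) * sigma^2.
lam2, w = np.array([1.0]), np.array([1.0])
print(target_only_risk(lam2, w, w, gamma=6.0, beta_norm2=5.0, sigma2=1.0))
print((1 - 1 / 6.0) * 5.0 + 1.0 / (6.0 - 1.0))
```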

4.3 Numerical examples under covariate shift

Similar to Section 3.5, we illustrate the applicability of our results in finite samples and compare the performance of the pooled min-\ell_{2}-norm interpolator and its target-based counterpart. We pick the special setting of Proposition 4.3, where the target distribution has covariance \bm{I}, while the source distribution has a covariance matrix with two distinct eigenvalues \kappa and 1/\kappa, each with multiplicity p/2. This choice of covariance matrices ensures that both have equal determinants. In Figure 2, we illustrate how the generalization error varies with n_{1},n_{2},p, the SNR, and the covariate shift magnitude \kappa. Again, we include the generalization error of the OLS estimator in the underparameterized regime p<n for completeness.

The key insights from the plots are summarized as follows. Figure 2 (a) illustrates the generalization error of the pooled min-\ell_{2}-norm interpolator, showcasing the double descent phenomenon across varying p,n_{1},n_{2}. The generalization error of the target-based interpolator is represented by dotted horizontal lines. We observe that in the overparameterized regime, if n_{1} is small, the pooled estimator consistently outperforms the target-based estimator. Figure 2 (b) presents our findings from Proposition 4.3. For small values of n_{1}, the generalization error decreases with the degree of heterogeneity, i.e., the value of \kappa. However, for n_{1}>\min\{p/2,p-n_{2}\}, higher values of \kappa result in higher generalization error. A notable scenario occurs at n_{1}=\min\{p/2,p-n_{2}\}, where the generalization error becomes independent of \kappa, causing all curves to intersect at a single point. Finally, Figure 2 (c) demonstrates the dependence of the generalization error on the SNR. Higher SNR values lead to significantly higher error when n_{1} is small, but the curves tend to converge at the interpolation threshold, i.e., when the total sample size matches the number of covariates.

(a) Fixing \kappa=4, \textnormal{SNR}=5.
(b) Fixing n_{2}=100, p=600, \textnormal{SNR}=5.
(c) Fixing n_{2}=100, p=600, \kappa=4.
Figure 2: Generalization error of the pooled min-\ell_{2}-norm interpolator under covariate shift. Solid lines: theoretically predicted values. + marks: empirical values. Dotted horizontal line: risk of the min-\ell_{2}-norm interpolator using only the target data. Design choices: \beta_{i}^{(1)}=\beta_{i}^{(2)}\sim\mathcal{N}(0,\sigma_{\beta}^{2}) where \sigma_{\beta}^{2}=\textnormal{SNR}, \bm{\Sigma}^{(2)}=\bm{I} and \bm{\Sigma}^{(1)} has two distinct eigenvalues \kappa and 1/\kappa, and \bm{X}^{(k)} has i.i.d. rows from \mathcal{N}(0,\bm{\Sigma}^{(k)}). The signals and design matrices are then held fixed. We generate 50 random \epsilon_{i}\sim\mathcal{N}(0,1) and report the average empirical risks.

5 Proof Outline

In this section, we outline the key ideas behind the proofs of our main results in Sections 3 and 4, highlighting crucial auxiliary results along the way. All convergence statements will be presented in an approximate sense, with comprehensive proof details provided in the Appendix.

To circumvent the difficulty of dealing with the pseudo-inverse formula in (2.8), we utilize the connection between the pooled min-2\ell_{2}-norm interpolators and ridgeless least squares interpolation. This says that to study the interpolator, it suffices to study the following ridge estimator

𝜷^λ=(𝑿(1)𝑿(1)+𝑿(2)𝑿(2)+nλ𝑰)1(𝑿(1)𝒚(1)+𝑿(2)𝒚(2)),\hat{\bm{\beta}}_{\lambda}=\left(\bm{X}^{(1)\top}\bm{X}^{(1)}+\bm{X}^{(2)\top}\bm{X}^{(2)}+n\lambda\bm{I}\right)^{-1}\left(\bm{X}^{(1)\top}\bm{y}^{(1)}+\bm{X}^{(2)\top}\bm{y}^{(2)}\right), (5.1)

and then study its \lambda\rightarrow 0 limit. Owing to the decomposition (2.12), we analyze the bias and variance terms separately. With this in mind, our proof consists of the following four steps:

  1. (i)

    For a fixed value of λ>0\lambda>0, we analyze the risk of the pooled ridge estimator defined by (5.1), i.e., we show there exists \mathcal{R} such that R(𝜷^λ;𝜷(2))p(𝜷^λ;𝜷(2)),R(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)})\overset{p\rightarrow\infty}{\longrightarrow}\mathcal{R}(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)}), where RR is defined as in (2.11). [see Theorem A.2 and Theorem A.3 for the proof under model shift and covariate shift respectively].

  2. (ii)

    Show that with high probability, the difference between the (random) risks of the min 2\ell_{2}-norm interpolator and the ridge estimator vanishes as λ0+\lambda\rightarrow 0^{+}, i.e., R(𝜷^λ;𝜷(2))R(𝜷^;𝜷(2))λ0+0R(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)})-R(\hat{\bm{\beta}};\bm{\beta}^{(2)})\overset{\lambda\rightarrow 0^{+}}{\longrightarrow}0.

  3. (iii)

    Analyze the limit of the (deterministic) risk of 𝜷^λ\hat{\bm{\beta}}_{\lambda},i.e., (𝜷^λ;𝜷(2))λ0+(𝜷^;𝜷(2))\mathcal{R}(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)})\overset{\lambda\rightarrow 0^{+}}{\longrightarrow}\mathcal{R}(\hat{\bm{\beta}};\bm{\beta}^{(2)}).

  4. (iv)

    Find a suitable relation between λ\lambda and pp such that the aforementioned three convergence relationships are simultaneously satisfied.

The most non-trivial part of our results is to prove the convergence in (i) and obtain the precise expression for \mathcal{R}(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)}). For (ii), one can rewrite the difference R(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)})-R(\hat{\bm{\beta}};\bm{\beta}^{(2)}) in terms of the eigenvalues of symmetric random matrices such as \bm{X}^{(1)\top}\bm{X}^{(1)}+\bm{X}^{(2)\top}\bm{X}^{(2)}. Next, using high-probability bounds on the largest and smallest eigenvalues (see Lemma B.3 and Corollary B.4), we establish the high-probability convergence stated in (ii) above. Part (iii) involves only deterministic quantities and therefore requires only classical analysis techniques. Finally, step (iv) reduces to an optimization problem over \lambda.
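Step (ii) is easy to visualize numerically: for a fixed pooled design, the pooled ridge estimator (5.1) approaches the pooled min-\ell_{2}-norm interpolator as \lambda\rightarrow 0^{+}. A minimal sketch with Gaussian isotropic designs and illustrative dimensions follows.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, p, sigma = 150, 100, 500, 1.0
n = n1 + n2

X = np.vstack([rng.normal(size=(n1, p)), rng.normal(size=(n2, p))])  # pooled design
beta = rng.normal(size=p)
y = X @ beta + sigma * rng.normal(size=n)

beta_minnorm = np.linalg.pinv(X) @ y             # pooled min-l2-norm interpolator

for lam in [1e-1, 1e-3, 1e-5]:
    # pooled ridge estimator (5.1)
    beta_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)
    print(f"lambda = {lam:.0e}, gap = {np.linalg.norm(beta_ridge - beta_minnorm):.3e}")
# the gap shrinks as lambda -> 0+, in line with step (ii)
```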

The rest of this section is devoted to outlining the proof for part (i). These steps for model shift and covariate shift are detailed in Appendix C.2 and Appendix D.3, respectively.

5.1 Model Shift

For model shift, after decomposing R(𝜷^λ;𝜷(2))R(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)}) into bias and variance terms, the one term that requires novel treatment is (see Appendix C.1)

B2(𝜷^λ;𝜷(2))=𝜷~𝑸(1)(𝑸(1)+α𝑸(2)+s𝑰)2𝑸(1)𝜷~,B_{2}(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)})=\tilde{\bm{\beta}}^{\top}\bm{Q}^{(1)}\left(\bm{Q}^{(1)}+\alpha\bm{Q}^{(2)}+s\bm{I}\right)^{-2}\bm{Q}^{(1)}\tilde{\bm{\beta}},

where α,s\alpha,s are model-dependent constants, and 𝑸(k)=1nk𝒁(k)𝒁(k)\bm{Q}^{(k)}=\frac{1}{n_{k}}\bm{Z}^{(k)\top}\bm{Z}^{(k)} for k=1,2k=1,2. Established results (see Lemma C.1) guarantee that

B2(𝜷^λ;𝜷(2))\displaystyle B_{2}(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)}) 𝜷~221pTr[(𝑸(1)+α𝑸(2)+s𝑰)2(𝑸(1))2]\displaystyle\approx\|\tilde{\bm{\beta}}\|_{2}^{2}\cdot\frac{1}{p}\textnormal{Tr}\left[\left(\bm{Q}^{(1)}+\alpha\bm{Q}^{(2)}+s\bm{I}\right)^{-2}(\bm{Q}^{(1)})^{2}\right]
=𝜷~22ddt1pTr[((𝑸(1)+t(𝑸(1))2+α𝑸(2)+s𝑰)1)]|t=0.\displaystyle=-\|\tilde{\bm{\beta}}\|_{2}^{2}\cdot\frac{d}{dt}\frac{1}{p}\textnormal{Tr}\left[\left(\left(\bm{Q}^{(1)}+t(\bm{Q}^{(1)})^{2}+\alpha\bm{Q}^{(2)}+s\bm{I}\right)^{-1}\right)\right]\bigg{|}_{t=0}.

We know \bm{Q}^{(1)},\bm{Q}^{(2)} are Wishart matrices whose spectral distributions converge to the Marčenko–Pastur law, which also gives us control over the spectral distributions of \bm{Q}^{(1)}+t(\bm{Q}^{(1)})^{2} and \alpha\bm{Q}^{(2)}+s\bm{I}. Moreover, since these two matrices are rotationally invariant and asymptotically free, we can effectively handle the limiting spectral distribution of their sum via free addition. Further algebra and convergence analysis yield the limit of B_{2}(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)}).
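The derivative representation of B_{2} above is a purely algebraic identity and can be sanity-checked numerically before any asymptotics enter: the trace on the first line equals minus the t-derivative of the trace of the resolvent at t=0. A small sketch, with arbitrary dimensions and constants not tied to our theoretical setting, follows.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n1, n2, s = 500, 150, 200, 0.2
alpha = n2 / n1

Z1 = rng.normal(size=(n1, p)); Q1 = Z1.T @ Z1 / n1
Z2 = rng.normal(size=(n2, p)); Q2 = Z2.T @ Z2 / n2
M = Q1 + alpha * Q2 + s * np.eye(p)

Minv = np.linalg.inv(M)
direct = np.trace(Minv @ Minv @ Q1 @ Q1) / p     # (1/p) Tr[M^{-2} (Q1)^2]

def h(t):                                        # h_s(t) as in the display above
    return np.trace(np.linalg.inv(Q1 + t * Q1 @ Q1 + alpha * Q2 + s * np.eye(p))) / p

eps = 1e-5
finite_diff = -(h(eps) - h(-eps)) / (2 * eps)    # -d/dt h_s(t) at t = 0
print(direct, finite_diff)                       # the two values agree up to O(eps^2)
```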

The outline above also illustrates the necessity of certain assumptions in Theorem 3.1. Due to our use of free addition techniques, we require both the Gaussianity assumption on (\bm{Z}^{(1)},\bm{Z}^{(2)}) for rotational invariance and the isotropic assumption \bm{\Sigma}=\bm{I} for free independence. Further elaboration can be found in Appendix C. We hypothesize that (3.3) remains applicable for non-Gaussian matrices (\bm{Z}^{(1)},\bm{Z}^{(2)}), and that a comprehensive formula exists for general \bm{\Sigma}. One promising avenue to explore this is through the lens of linear pencil techniques [30], which we defer to future investigations.

5.2 Covariate Shift

For covariate shift, B_{2}(\hat{\bm{\beta}}_{\lambda},\bm{\beta}^{(2)})=0 since \tilde{\bm{\beta}}=0, eliminating the need to handle quadratic terms via free addition. One can therefore relax the Gaussianity assumption on (\bm{Z}^{(1)},\bm{Z}^{(2)}) to the one in Theorem 4.1. However, new challenges arise with anisotropic covariances (\bm{\Sigma}^{(1)},\bm{\Sigma}^{(2)}). For conciseness, we illustrate the representative variance term, which is formulated as (see (D.2) in Appendix D)

V(𝜷^λ;𝜷(2))=\displaystyle V(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)})= λ{σ2λnTr(𝚺(2)(𝚺^+λ𝑰)1)}\displaystyle\frac{\partial}{\partial\lambda}\left\{\frac{\sigma^{2}\lambda}{n}\textnormal{Tr}\left(\bm{\Sigma}^{(2)}(\hat{\bm{\Sigma}}+\lambda\bm{I})^{-1}\right)\right\} (5.2)
=\displaystyle= λ{σ2λnTr((𝚲(2))1/2(𝑾+λ𝑰)1(𝚲(2))1/2)},\displaystyle\frac{\partial}{\partial\lambda}\left\{\frac{\sigma^{2}\lambda}{n}\textnormal{Tr}\left((\bm{\Lambda}^{(2)})^{1/2}\left(\bm{W}+\lambda\bm{I}\right)^{-1}(\bm{\Lambda}^{(2)})^{1/2}\right)\right\},

where

𝑾:=(𝚲(1))1/2𝑽𝚺^𝒁(1)𝑽(𝚲(1))1/2+(𝚲(2))1/2𝑽𝚺^𝒁(2)𝑽(𝚲(2))1/2.\bm{W}:=(\bm{\Lambda}^{(1)})^{1/2}\bm{V}^{\top}\hat{\bm{\Sigma}}_{\bm{Z}}^{(1)}\bm{V}(\bm{\Lambda}^{(1)})^{1/2}+(\bm{\Lambda}^{(2)})^{1/2}\bm{V}^{\top}\hat{\bm{\Sigma}}_{\bm{Z}}^{(2)}\bm{V}(\bm{\Lambda}^{(2)})^{1/2}.

This simplification highlights the necessity of the jointly diagonalizable assumption for (𝚺(1),𝚺(2))(\bm{\Sigma}^{(1)},\bm{\Sigma}^{(2)}). The core challenge, therefore, reduces to characterizing the behavior of (𝑾+λ𝑰)1(\bm{W}+\lambda\bm{I})^{-1} under the setting of Theorem 4.1. In other words, we aim to characterize the resolvent of 𝑾\bm{W}, a sum of two sample-covariance-like matrices with anisotropic covariances. To this end, we present Theorem D.1 (omitted here for the sake of space), a novel type of anisotropic local law [33], which might be of independent interest. Convergence of the trace term in (5.2) follows as a corollary of our anisotropic local law, and further analysis w.r.t. the derivative yields the value of V(𝜷^λ;𝜷(2))V(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)}).
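Before the local law enters, the two expressions for the variance term, the \lambda-derivative form in (5.2) and the closed form in (A.1), agree as a deterministic identity, which can be checked numerically. A small sketch with arbitrary (diagonal, hence simultaneously diagonalizable) covariances follows.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n1, n2, sigma, lam = 400, 120, 80, 1.0, 0.2
n = n1 + n2

# simultaneously diagonalizable covariances (here both already diagonal)
d1 = rng.uniform(0.5, 2.0, size=p)               # eigenvalues of Sigma^(1)
d2 = rng.uniform(0.5, 2.0, size=p)               # eigenvalues of Sigma^(2)
L2 = np.diag(d2)

X1 = rng.normal(size=(n1, p)) * np.sqrt(d1)      # rows ~ N(0, Sigma^(1))
X2 = rng.normal(size=(n2, p)) * np.sqrt(d2)      # rows ~ N(0, Sigma^(2))
Shat = (X1.T @ X1 + X2.T @ X2) / n               # pooled sample covariance

def trace_functional(l):                         # the quantity differentiated in (5.2)
    return sigma**2 * l / n * np.trace(L2 @ np.linalg.inv(Shat + l * np.eye(p)))

eps = 1e-5
V_derivative = (trace_functional(lam + eps) - trace_functional(lam - eps)) / (2 * eps)
R = np.linalg.inv(Shat + lam * np.eye(p))
V_closed = sigma**2 / n * np.trace(R @ R @ Shat @ L2)   # the closed form (A.1)
print(V_derivative, V_closed)                    # agree up to O(eps^2)
```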

6 Discussion and Future directions

In this paper, we characterize the generalization error of the pooled min-2\ell_{2}-norm interpolator under overparametrization and distribution shift. Our study underscores how the degree of model shift or covariate shift influences the risk associated with the interpolator. Mathematically, we derive a novel anisotropic local law, which may be of independent interest in random matrix theory. We conclude by identifying several avenues for future research and discussing some limitations of our current findings:

  1. (i)

    Expanding to Multiple Datasets: In Section 3.4, we established the risk of the pooled min-2\ell_{2}-norm interpolator with multiple source data under model shift. Extending our results under covariate shift to the case of multiple source data is equally important. We anticipate that this would be feasible by extending our approach to incorporate more general anisotropic local laws. However, we defer this study to future work in the interest of space.

  2. (ii)

    Generalizing Assumptions and Combined Shift: For our model shift results in Section 3, we imposed an isotropic Gaussian assumption on the covariate distribution. Extending this to cover anisotropic cases would be important. This would require significant new ideas for handling products of random matrices that are not necessarily asymptotically free. We expect such extensions may also enable understanding the pooled min-\ell_{2}-norm interpolator under simultaneous model and covariate shift, which is common in real-life transfer learning settings and is partially explored in [57] for the underparametrized regime.

  3. (iii)

    Comparison with Other Estimators: Our pooled min-\ell_{2}-norm interpolator covers early and intermediate min-\ell_{2}-norm interpolation under a single umbrella thanks to the alternative representation (2.10). However, a more complete picture of interpolation under transfer learning and over-parametrization necessitates comprehensive comparisons with the following types of estimators: 1) late-fused min-\ell_{2}-norm interpolators, which combine min-\ell_{2}-norm interpolators obtained individually from different datasets; 2) other types of interpolators, including min-\ell_{1}-norm ones [38]; and 3) other estimators in over-parametrized regimes, including penalized ones, beyond the ridge estimators covered in Appendix A.

  4. (iv)

    Extension to Nonlinear Models: Recent literature has explored the connection between linear models and more complex models such as neural networks [1, 20, 31, 43]. To be precise, consider i.i.d. data (y_{i},z_{i}), i\leq n, with y_{i}\in\mathbb{R} and z_{i}\in\mathbb{R}^{d}, and consider learning a neural network with weights \bm{\theta}\in\mathbb{R}^{p}, f(\cdot;\bm{\theta}):\mathbb{R}^{d}\mapsto\mathbb{R}, z\mapsto f(z,\bm{\theta}), where the form of f need not be related to the distribution of (y_{i},z_{i}). In some settings, the number of parameters p is so large that training effectively changes \bm{\theta} only by a small amount relative to a random initialization \bm{\theta}_{0}\in\mathbb{R}^{p}. Then, given \bm{\theta}_{0} such that f(z,\bm{\theta}_{0})\approx 0, the neural network can be approximated by z\mapsto\nabla_{\bm{\theta}}f(z,\bm{\theta}_{0})^{\top}\bm{\beta}, where we have reparametrized \bm{\theta}=\bm{\theta}_{0}+\bm{\beta}. Such models, despite being nonlinear in the z_{i}'s, are linear in \bm{\beta}. Moving beyond high-dimensional linear regression, one might ask about the statistical consequences of such linearization in the context of transfer learning. We leave such directions for future research.

Acknowledgements

P.S. would like to acknowledge partial funding from NSF DMS-2113426.

References

  • Allen-Zhu et al. [2019] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International conference on machine learning, pages 242–252. PMLR, 2019.
  • Bai and Silverstein [2010] Zhidong Bai and Jack W Silverstein. Spectral analysis of large dimensional random matrices, volume 20. Springer, 2010.
  • Baltrušaitis et al. [2018] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence, 41(2):423–443, 2018.
  • Bao et al. [2020] Zhigang Bao, László Erdős, and Kevin Schnelli. Spectral rigidity for addition of random matrices at the regular edge. Journal of Functional Analysis, 279(7):108639, 2020.
  • Bartlett et al. [2020] Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020.
  • Bastani [2021] Hamsa Bastani. Predicting with proxies: Transfer learning in high dimension. Management Science, 67(5):2964–2984, 2021.
  • Belinschi and Bercovici [2007] Serban T Belinschi and Hari Bercovici. A new approach to subordination results in free probability. Journal d’Analyse Mathématique, 101(1):357–365, 2007.
  • Belkin et al. [2018] Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning, pages 541–549. PMLR, 2018.
  • Belkin et al. [2019a] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019a.
  • Belkin et al. [2019b] Mikhail Belkin, Alexander Rakhlin, and Alexandre B Tsybakov. Does data interpolation contradict statistical optimality? In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1611–1619. PMLR, 2019b.
  • Belkin et al. [2020] Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features. SIAM Journal on Mathematics of Data Science, 2(4):1167–1180, 2020.
  • Bellec and Zhang [2022] Pierre C Bellec and Cun-Hui Zhang. De-biasing the lasso with degrees-of-freedom adjustment. Bernoulli, 28(2):713–743, 2022.
  • Bloemendal et al. [2014] Alex Bloemendal, László Erdos, Antti Knowles, Horng-Tzer Yau, and Jun Yin. Isotropic local laws for sample covariance and generalized wigner matrices. Electron. J. Probab, 19(33):1–53, 2014.
  • Brailovskaya and van Handel [2022] Tatiana Brailovskaya and Ramon van Handel. Universality and sharp matrix concentration inequalities. arXiv preprint arXiv:2201.05142, 2022.
  • Bunea et al. [2020] Florentina Bunea, Seth Strimas-Mackey, and Marten Wegkamp. Interpolation under latent factor regression models. arXiv preprint arXiv:2002.02525, 2(7), 2020.
  • Cheng et al. [2022] Jiahui Cheng, Minshuo Chen, Hao Liu, Tuo Zhao, and Wenjing Liao. High dimensional binary classification under label shift: Phase transition and regularization. arXiv preprint arXiv:2212.00700, 2022.
  • Chistyakov and Götze [2011] Gennadii P Chistyakov and Friedrich Götze. The arithmetic of distributions in free probability theory. Central European Journal of Mathematics, 9:997–1050, 2011.
  • Deng et al. [2022] Zeyu Deng, Abla Kammoun, and Christos Thrampoulidis. A model of double descent for high-dimensional binary linear classification. Information and Inference: A Journal of the IMA, 11(2):435–495, 2022.
  • Ding and Yang [2018] Xiucai Ding and Fan Yang. A necessary and sufficient condition for edge universality at the largest singular values of covariance matrices. The Annals of Applied Probability, 28(3):1679–1738, 2018.
  • Du et al. [2018] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
  • Erdős et al. [2013a] László Erdős, Antti Knowles, and Horng-Tzer Yau. Averaging fluctuations in resolvents of random band matrices. Annales Henri Poincaré, 14(8):1837–1926, 2013a.
  • Erdős et al. [2013b] László Erdős, Antti Knowles, Horng-Tzer Yau, and Jun Yin. Delocalization and diffusion profile for random band matrices. Communications in Mathematical Physics, 323:367–416, 2013b.
  • Erdős et al. [2013c] László Erdős, Antti Knowles, Horng-Tzer Yau, and Jun Yin. The local semicircle law for a general class of random matrices. 2013c.
  • Erdős et al. [2013d] László Erdős, Antti Knowles, Horng-Tzer Yau, and Jun Yin. Spectral statistics of erdős—rényi graphs i: Local semicircle law. The Annals of Probability, pages 2279–2375, 2013d.
  • Ghorbani et al. [2021] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized two-layers neural networks in high dimension. The Annals of Statistics, 49(2):1029–1054, 2021.
  • Gunasekar et al. [2018a] Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. In International Conference on Machine Learning, pages 1832–1841. PMLR, 2018a.
  • Gunasekar et al. [2018b] Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. Advances in neural information processing systems, 31, 2018b.
  • Hanneke and Kpotufe [2019] Steve Hanneke and Samory Kpotufe. On the value of target data in transfer learning. Advances in Neural Information Processing Systems, 32, 2019.
  • Hastie et al. [2022] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. Annals of statistics, 50(2):949, 2022.
  • Helton et al. [2018] J William Helton, Tobias Mai, and Roland Speicher. Applications of realizations (aka linearizations) to free probability. Journal of Functional Analysis, 274(1):1–79, 2018.
  • Jacot et al. [2018] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
  • Javanmard and Montanari [2014] Adel Javanmard and Andrea Montanari. Hypothesis testing in high-dimensional regression under the gaussian random design model: Asymptotic theory. IEEE Transactions on Information Theory, 60(10):6522–6554, 2014.
  • Knowles and Yin [2017] Antti Knowles and Jun Yin. Anisotropic local laws for random matrices. Probability Theory and Related Fields, 169:257–352, 2017.
  • Kpotufe and Martinet [2021] Samory Kpotufe and Guillame Martinet. Marginal singularity and the benefits of labels in covariate-shift. The Annals of Statistics, 49(6):3299–3323, 2021.
  • Li et al. [2022] Sai Li, T Tony Cai, and Hongzhe Li. Transfer learning for high-dimensional linear regression: Prediction, estimation and minimax optimality. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(1):149–173, 2022.
  • Li et al. [2023] Sai Li, T Tony Cai, and Hongzhe Li. Transfer learning in large-scale gaussian graphical models with false discovery rate control. Journal of the American Statistical Association, 118(543):2171–2183, 2023.
  • Liang and Rakhlin [2020] Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel “ridgeless” regression can generalize. The Annals of Statistics, 48(3), 2020.
  • Liang and Sur [2022] Tengyuan Liang and Pragya Sur. A precise high-dimensional asymptotic theory for boosting and minimum-1\ell_{1}-norm interpolated classifiers. The Annals of Statistics, 50(3), 2022.
  • Liang and Tran-Bach [2022] Tengyuan Liang and Hai Tran-Bach. Mehler’s formula, branching process, and compositional kernels of deep neural networks. Journal of the American Statistical Association, 117(539):1324–1337, 2022.
  • Ma et al. [2023] Cong Ma, Reese Pathak, and Martin J Wainwright. Optimally tackling covariate shift in rkhs-based nonparametric regression. The Annals of Statistics, 51(2):738–761, 2023.
  • Maity et al. [2022] Subha Maity, Yuekai Sun, and Moulinath Banerjee. Minimax optimal approaches to the label shift problem in non-parametric settings. The Journal of Machine Learning Research, 23(1):15698–15742, 2022.
  • Mallinar et al. [2024] Neil Mallinar, Austin Zane, Spencer Frei, and Bin Yu. Minimum-norm interpolation under covariate shift. arXiv preprint arXiv:2404.00522, 2024.
  • Mei et al. [2019] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. In Conference on Learning Theory, pages 2388–2464. PMLR, 2019.
  • Montanari et al. [2019] Andrea Montanari, Feng Ruan, Youngtak Sohn, and Jun Yan. The generalization error of max-margin linear classifiers: Benign overfitting and high dimensional asymptotics in the overparametrized regime. arXiv preprint arXiv:1911.01544, 2019.
  • Muthukumar et al. [2020] Vidya Muthukumar, Kailas Vodrahalli, Vignesh Subramanian, and Anant Sahai. Harmless interpolation of noisy data in regression. IEEE Journal on Selected Areas in Information Theory, 1(1):67–83, 2020.
  • Parmaksiz and Handel [2024] Emin Parmaksiz and Ramon Van Handel. In preparation, 2024.
  • Patil et al. [2024] Pratik Patil, Jin-Hong Du, and Ryan J Tibshirani. Optimal ridge regularization for out-of-distribution prediction. arXiv preprint arXiv:2404.01233, 2024.
  • Pillai and Yin [2014] Natesh S Pillai and Jun Yin. Universality of covariance matrices. The Annals of Applied Probability, 24(3):935–1001, 2014.
  • Sidheekh et al. [2024] Sahil Sidheekh, Pranuthi Tenali, Saurabh Mathur, Erik Blasch, Kristian Kersting, and Sriraam Natarajan. Credibility-aware multi-modal fusion using probabilistic circuits. arXiv preprint arXiv:2403.03281, 2024.
  • Song et al. [2024] Yanke Song, Xihong Lin, and Pragya Sur. HEDE: Heritability estimation in high dimensions by ensembling debiased estimators. arXiv preprint arXiv:2406.11184, 2024.
  • Soudry et al. [2018] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 19(70):1–57, 2018.
  • Tian and Feng [2022] Ye Tian and Yang Feng. Transfer learning under high-dimensional generalized linear models. Journal of the American Statistical Association, pages 1–14, 2022.
  • Torrey and Shavlik [2010] Lisa Torrey and Jude Shavlik. Transfer learning. In Handbook of research on machine learning applications and trends: algorithms, methods, and techniques, pages 242–264. IGI global, 2010.
  • Tripuraneni et al. [2021] Nilesh Tripuraneni, Ben Adlam, and Jeffrey Pennington. Covariate shift in high-dimensional random feature regression. arXiv preprint arXiv:2111.08234, 2021.
  • Tulino et al. [2004] Antonia M Tulino, Sergio Verdú, et al. Random matrix theory and wireless communications. Foundations and Trends® in Communications and Information Theory, 1(1):1–182, 2004.
  • Wang et al. [2022] Guillaume Wang, Konstantin Donhauser, and Fanny Yang. Tight bounds for minimum 1\ell_{1}-norm interpolation of noisy data. In International Conference on Artificial Intelligence and Statistics, pages 10572–10602. PMLR, 2022.
  • Yang et al. [2020] Fan Yang, Hongyang R Zhang, Sen Wu, Weijie J Su, and Christopher Ré. Analysis of information transfer from heterogeneous sources via precise high-dimensional asymptotics. arXiv preprint arXiv:2010.11750, 2020.
  • Zhang and Yu [2005] Tong Zhang and Bin Yu. Boosting with early stopping: Convergence and consistency. The Annals of Statistics, 33(4):1538, 2005.
  • Zhao et al. [2022] Zhangchen Zhao, Lars G Fritsche, Jennifer A Smith, Bhramar Mukherjee, and Seunggeun Lee. The construction of cross-population polygenic risk scores using transfer learning. The American Journal of Human Genetics, 109(11):1998–2008, 2022.
  • Zhou et al. [2022] Lijia Zhou, Frederic Koehler, Pragya Sur, Danica J Sutherland, and Nati Srebro. A non-asymptotic moreau envelope theory for high-dimensional generalized linear models. Advances in Neural Information Processing Systems, 35:21286–21299, 2022.

Appendix A Ridge Regression

As mentioned in Section 5, the risk of the pooled min-\ell_{2}-norm interpolator is closely related to that of the pooled ridge estimator in (5.1) with a fixed penalty \lambda>0. In this section, we obtain the exact generalization error of \hat{\bm{\beta}}_{\lambda} under the assumptions of Section 3 and Section 4. These serve as crucial intermediate results in our proofs of the interpolator risk.

Similar to Lemma 2.1, we first state the bias-variance decomposition of the prediction risk:

Lemma A.1.

Under Assumption 1, the ridge estimator (5.1) has variance

V(𝜷^λ;𝜷(2))=σ2nTr((𝚺^+λ𝑰)2𝚺^𝚺(2))V(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)})=\frac{\sigma^{2}}{n}\textnormal{Tr}((\hat{\bm{\Sigma}}+\lambda\bm{I})^{-2}\hat{\bm{\Sigma}}\bm{\Sigma}^{(2)}) (A.1)

and bias

B(𝜷^λ;𝜷(2))=\displaystyle B(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)})= λ2𝜷(2)(𝚺^+λ𝑰)1𝚺(2)(𝚺^+λ𝑰)1𝜷(2)B1(𝜷^λ;𝜷(2))\displaystyle\underbrace{\lambda^{2}\bm{\beta}^{(2)^{\top}}(\hat{\bm{\Sigma}}+\lambda\bm{I})^{-1}\bm{\Sigma}^{(2)}(\hat{\bm{\Sigma}}+\lambda\bm{I})^{-1}\bm{\beta}^{(2)}}_{B_{1}(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)})} (A.2)
+𝜷~(𝑿(1)𝑿(1)n)(𝚺^+λ𝑰)1𝚺(2)(𝚺^+λ𝑰)1(𝑿(1)𝑿(1)n)𝜷~B2(𝜷^λ;𝜷(2))\displaystyle+\underbrace{\tilde{\bm{\beta}}^{\top}(\frac{\bm{X}^{(1)^{\top}}\bm{X}^{(1)}}{n})(\hat{\bm{\Sigma}}+\lambda\bm{I})^{-1}\bm{\Sigma}^{(2)}(\hat{\bm{\Sigma}}+\lambda\bm{I})^{-1}(\frac{\bm{X}^{(1)^{\top}}\bm{X}^{(1)}}{n})\tilde{\bm{\beta}}}_{B_{2}(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)})}
2λ𝜷(2)(𝚺^+λ𝑰)1𝚺(2)(𝚺^+λ𝑰)1(𝑿(1)𝑿(1)n)𝜷~B3(𝜷^λ;𝜷(2)).\displaystyle\underbrace{-2\lambda\bm{\beta}^{(2)^{\top}}(\hat{\bm{\Sigma}}+\lambda\bm{I})^{-1}\bm{\Sigma}^{(2)}(\hat{\bm{\Sigma}}+\lambda\bm{I})^{-1}(\frac{\bm{X}^{(1)^{\top}}\bm{X}^{(1)}}{n})\tilde{\bm{\beta}}}_{B_{3}(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)})}.

In the subsections below, we present exact risk formulae under model shift and covariate shift, respectively.

A.1 Model Shift

We first define deterministic quantities that depend only on \gamma_{1},\gamma_{2},\gamma and \lambda, then present our main results stating that these deterministic quantities are the limits of the bias and variance components of the prediction risk.

Definition A.1 (Risk limit, ridge).

Define m(-\lambda)=m(-\lambda;\gamma) as the unique non-negative solution of

m(-\lambda)=\frac{1}{1-\gamma+\lambda+\gamma\lambda m(-\lambda)}. (A.3)

Next, define f1=f1(λ;γ1,γ2)f_{1}=f_{1}(\lambda;\gamma_{1},\gamma_{2}) as the unique positive solution of

sγ1f12+(1+αγ1+s)f11=0s\gamma_{1}f_{1}^{2}+(1+\alpha-\gamma_{1}+s)f_{1}-1=0 (A.4)

with α=γ1/γ2\alpha=\gamma_{1}/\gamma_{2} and s=λ(1+α)s=\lambda(1+\alpha), and define

f2\displaystyle f_{2} :=f12(1+γ1f1)2(1+γ1f1)2γ1f12,\displaystyle:=\frac{f_{1}^{2}(1+\gamma_{1}f_{1})^{2}}{(1+\gamma_{1}f_{1})^{2}-\gamma_{1}f_{1}^{2}}, (A.5)
f3\displaystyle f_{3} :=α1+γ1f1+s.\displaystyle:=\frac{\alpha}{1+\gamma_{1}f_{1}}+s. (A.6)

We then define the desired deterministic quantities:

𝒱(m)(λ;γ)\displaystyle\mathcal{V}^{(m)}(\lambda;\gamma) :=σ2γm(λ)2(1γ+γλ2m(λ)),\displaystyle:=\sigma^{2}\gamma m(-\lambda)^{2}(1-\gamma+\gamma\lambda^{2}m^{\prime}(-\lambda)), (A.7)
1(m)(λ;γ)\displaystyle\mathcal{B}_{1}^{(m)}(\lambda;\gamma) :=λ2𝜷(2)22m(λ)2(1+γm(λ))1+γλm(λ)2,\displaystyle:=\lambda^{2}\|\bm{\beta}^{(2)}\|_{2}^{2}\cdot\frac{m(-\lambda)^{2}(1+\gamma m(-\lambda))}{1+\gamma\lambda m(-\lambda)^{2}}, (A.8)
2(m)(λ;γ1,γ2)\displaystyle\mathcal{B}_{2}^{(m)}(\lambda;\gamma_{1},\gamma_{2}) :=𝜷~2212f1f3+f2f321γ1f2(f3s)1+γ1f1,\displaystyle:=\|\tilde{\bm{\beta}}\|_{2}^{2}\cdot\frac{1-2f_{1}f_{3}+f_{2}f_{3}^{2}}{1-\frac{\gamma_{1}f_{2}(f_{3}-s)}{1+\gamma_{1}f_{1}}}, (A.9)
3(m)(λ;γ1,γ2)\displaystyle\mathcal{B}_{3}^{(m)}(\lambda;\gamma_{1},\gamma_{2}) :=(𝜷~𝜷(2))f1f3f21γ1f2(f3s)1+γ1f1,\displaystyle:=-(\tilde{\bm{\beta}}^{\top}\bm{\beta}^{(2)})\cdot\frac{f_{1}-f_{3}f_{2}}{1-\frac{\gamma_{1}f_{2}(f_{3}-s)}{1+\gamma_{1}f_{1}}}, (A.10)

where the superscript (m)(m) denotes model shift.
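The quantities in Definition A.1 are elementary to evaluate numerically; the following sketch computes f_{1}, f_{2}, f_{3} from (A.4)-(A.6) and the limits \mathcal{B}_{2}^{(m)}, \mathcal{B}_{3}^{(m)} from (A.9)-(A.10). The aspect ratios, \lambda, and signal quantities passed in the example call are placeholders.

```python
import numpy as np

def model_shift_bias_limits(lam, gamma1, gamma2, beta_tilde_sq, cross):
    """Evaluate B_2^(m) and B_3^(m) of Definition A.1, where
    beta_tilde_sq = ||beta_tilde||_2^2 and cross = beta_tilde^T beta^(2)."""
    alpha = gamma1 / gamma2
    s = lam * (1.0 + alpha)

    # f1: unique positive root of s*gamma1*f^2 + (1+alpha-gamma1+s)*f - 1 = 0, cf. (A.4)
    a, b = s * gamma1, 1.0 + alpha - gamma1 + s
    f1 = (-b + np.sqrt(b * b + 4.0 * a)) / (2.0 * a)

    f2 = f1**2 * (1 + gamma1 * f1)**2 / ((1 + gamma1 * f1)**2 - gamma1 * f1**2)  # (A.5)
    f3 = alpha / (1 + gamma1 * f1) + s                                           # (A.6)

    denom = 1.0 - gamma1 * f2 * (f3 - s) / (1.0 + gamma1 * f1)
    B2 = beta_tilde_sq * (1.0 - 2.0 * f1 * f3 + f2 * f3**2) / denom              # (A.9)
    B3 = -cross * (f1 - f3 * f2) / denom                                         # (A.10)
    return B2, B3

# placeholder values for gamma_1 = p/n_1, gamma_2 = p/n_2 and the signal overlap
print(model_shift_bias_limits(lam=0.1, gamma1=4.0, gamma2=6.0,
                              beta_tilde_sq=1.0, cross=0.3))
```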

Theorem A.2 (Risk limit under model shift, ridge).

Suppose Assumption 1 holds, \bm{\Sigma}^{(1)}=\bm{\Sigma}^{(2)}=\bm{I}, and (\bm{Z}^{(1)},\bm{Z}^{(2)}) are both Gaussian random matrices. Further, let s:=\lambda(1+\alpha)>p^{-2/3+\epsilon}, where \epsilon>0 is a small constant. Then, for any small constant c>0, with high probability over the randomness of (\bm{Z}^{(1)},\bm{Z}^{(2)}), we have

V(𝜷^λ;𝜷(2))\displaystyle V(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)}) =𝒱(m)(λ;γ)+O(σ2λ2p1/2+c),\displaystyle=\mathcal{V}^{(m)}(\lambda;\gamma)+O(\sigma^{2}\lambda^{-2}p^{-1/2+c}), (A.11)
B1(𝜷^λ;𝜷(2))\displaystyle B_{1}(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)}) =1(m)(λ;γ)+O(λ1p1/2+c𝜷(2)22),\displaystyle=\mathcal{B}_{1}^{(m)}(\lambda;\gamma)+O(\lambda^{-1}p^{-1/2+c}\|\bm{\beta}^{(2)}\|_{2}^{2}), (A.12)
B2(𝜷^λ;𝜷(2))\displaystyle B_{2}(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)}) =2(m)(λ;γ1,γ2)+O(λ2p1/2+c𝜷~22),\displaystyle=\mathcal{B}_{2}^{(m)}(\lambda;\gamma_{1},\gamma_{2})+O(\lambda^{-2}p^{-1/2+c}\|\tilde{\bm{\beta}}\|_{2}^{2}), (A.13)
B3(𝜷^λ;𝜷(2))\displaystyle B_{3}(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)}) =3(m)(λ;γ1,γ2)+O(λ2p1/2+c|𝜷~𝜷(2)|).\displaystyle=\mathcal{B}_{3}^{(m)}(\lambda;\gamma_{1},\gamma_{2})+O(\lambda^{-2}p^{-1/2+c}|\tilde{\bm{\beta}}^{\top}\bm{\beta}^{(2)}|). (A.14)

In Theorem A.2, both (A.11) and (A.12) are immediate corollaries of [29, Theorem 5] by plugging in 𝚺=𝑰\bm{\Sigma}=\bm{I}. We defer the proof of (A.13) and (A.14) to Appendix C.1.

A.2 Covariate Shift

Theorem A.3.

Suppose Assumption 1 holds, and further assume that \bm{\Sigma}^{(1)},\bm{\Sigma}^{(2)} are simultaneously diagonalizable with joint spectral distribution defined in (4.1). Let \lambda>p^{-1/7+\epsilon}, where \epsilon>0 is a small constant. Then, for any small constant c>0, with high probability over the randomness of (\bm{Z}^{(1)},\bm{Z}^{(2)}), we have

V(𝜷^λ;𝜷(2))\displaystyle V(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)}) =σ2γλ(1)λ(2)(a1a3λ)+(λ(2))2(a2a4λ)(a1λ(1)+a2λ(2)+λ)2𝑑H^p(λ(1),λ(2))+O(σ2λ9/2p1/2+c),\displaystyle=\sigma^{2}\gamma\int\frac{\lambda^{(1)}\lambda^{(2)}(a_{1}-a_{3}\lambda)+(\lambda^{(2)})^{2}(a_{2}-a_{4}\lambda)}{(a_{1}\lambda^{(1)}+a_{2}\lambda^{(2)}+\lambda)^{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})+O(\sigma^{2}\lambda^{-9/2}p^{-1/2+c}), (A.15)
B(𝜷^λ;𝜷(2))\displaystyle B(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)}) =𝜷(2)22b3λλ(1)+(b4+λ)λλ(2)(b1λ(1)+b2λ(2)+λ)2𝑑G^p(λ(1),λ(2))+O(λ3/2p1/4+c𝜷(2)22),\displaystyle=\|\bm{\beta}^{(2)}\|_{2}^{2}\cdot\int\frac{b_{3}\lambda\lambda^{(1)}+(b_{4}+\lambda)\lambda\lambda^{(2)}}{(b_{1}\lambda^{(1)}+b_{2}\lambda^{(2)}+\lambda)^{2}}d\hat{G}_{p}(\lambda^{(1)},\lambda^{(2)})+O(\lambda^{-3/2}p^{-1/4+c}\|\bm{\beta}^{(2)}\|_{2}^{2}), (A.16)

where (a1,a2,a3,a4)(a_{1},a_{2},a_{3},a_{4}) is the unique solution, with a1,a2a_{1},a_{2} positive, to the following system of equations:

a1+a2\displaystyle a_{1}+a_{2} =1γa1λ(1)+a2λ(2)a1λ(1)+a2λ(2)+λ𝑑H^p(λ(1),λ(2)),\displaystyle=1-\gamma\int\frac{a_{1}\lambda^{(1)}+a_{2}\lambda^{(2)}}{a_{1}\lambda^{(1)}+a_{2}\lambda^{(2)}+\lambda}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}), (A.17)
a1\displaystyle a_{1} =n1nγa1λ(1)a1λ(1)+a2λ(2)+λ𝑑H^p(λ(1),λ(2)),\displaystyle=\frac{n_{1}}{n}-\gamma\int\frac{a_{1}\lambda^{(1)}}{a_{1}\lambda^{(1)}+a_{2}\lambda^{(2)}+\lambda}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}),
a3+a4\displaystyle a_{3}+a_{4} =γλ(1)(a3λa1)+λ(2)(a4λa2)(a1λ(1)+a2λ(2)+λ)2𝑑H^p(λ(1),λ(2)),\displaystyle=-\gamma\int\frac{\lambda^{(1)}(a_{3}\lambda-a_{1})+\lambda^{(2)}(a_{4}\lambda-a_{2})}{(a_{1}\lambda^{(1)}+a_{2}\lambda^{(2)}+\lambda)^{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}),
a3\displaystyle a_{3} =γλ(1)(a3λa1)+λ(1)λ(2)(a3a2a4a1)(a1λ(1)+a2λ(2)+λ)2𝑑H^p(λ(1),λ(2)),\displaystyle=-\gamma\int\frac{\lambda^{(1)}(a_{3}\lambda-a_{1})+\lambda^{(1)}\lambda^{(2)}(a_{3}a_{2}-a_{4}a_{1})}{(a_{1}\lambda^{(1)}+a_{2}\lambda^{(2)}+\lambda)^{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}),

and (b1,b2,b3,b4)(b_{1},b_{2},b_{3},b_{4}) is the unique solution, with b1,b2b_{1},b_{2} positive, to the following system of equations:

b1+b2\displaystyle b_{1}+b_{2} =1γb1λ(1)+b2λ(2)b1λ(1)+b2λ(2)+λ𝑑H^p(λ(1),λ(2)),\displaystyle=1-\gamma\int\frac{b_{1}\lambda^{(1)}+b_{2}\lambda^{(2)}}{b_{1}\lambda^{(1)}+b_{2}\lambda^{(2)}+\lambda}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}), (A.18)
b1\displaystyle b_{1} =n1nγb1λ(1)b1λ(1)+b2λ(2)+λ𝑑H^p(λ(1),λ(2)),\displaystyle=\frac{n_{1}}{n}-\gamma\int\frac{b_{1}\lambda^{(1)}}{b_{1}\lambda^{(1)}+b_{2}\lambda^{(2)}+\lambda}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}),
b3+b4\displaystyle b_{3}+b_{4} =γλ(1)λ(b3b1λ(2))+λ(2)λ(b4b2λ(2))(b1λ(1)+b2λ(2)+λ)2𝑑H^p(λ(1),λ(2)),\displaystyle=-\gamma\int\frac{\lambda^{(1)}\lambda(b_{3}-b_{1}\lambda^{(2)})+\lambda^{(2)}\lambda(b_{4}-b_{2}\lambda^{(2)})}{(b_{1}\lambda^{(1)}+b_{2}\lambda^{(2)}+\lambda)^{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}),
b3\displaystyle b_{3} =γλ(1)λ(b3b1λ(2))+λ(1)λ(2)(b3b2b4b1)(b1λ(1)+b2λ(2)+λ)2𝑑H^p(λ(1),λ(2)).\displaystyle=-\gamma\int\frac{\lambda^{(1)}\lambda(b_{3}-b_{1}\lambda^{(2)})+\lambda^{(1)}\lambda^{(2)}(b_{3}b_{2}-b_{4}b_{1})}{(b_{1}\lambda^{(1)}+b_{2}\lambda^{(2)}+\lambda)^{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}).

Again, notice that (a1,a2)=(b1,b2)(a_{1},a_{2})=(b_{1},b_{2}) above.
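The systems (A.17)-(A.18) are finite-dimensional once the joint spectral distribution is discretized. As an illustration, the sketch below solves the first two equations of (A.17) for (a_{1},a_{2}) with a generic root finder; the discretization into equal-weight eigenvalue pairs and the example parameters are our own choices. Given (a_{1},a_{2}), the remaining equations are linear in (a_{3},a_{4}) and can be handled analogously.

```python
import numpy as np
from scipy.optimize import fsolve

def solve_a1_a2(l1, l2, gamma, n1_frac, lam):
    """Solve the first two equations of (A.17) for (a1, a2), with the joint
    spectral distribution H_p discretized as equal-weight pairs (l1[i], l2[i])."""
    l1, l2 = np.asarray(l1, float), np.asarray(l2, float)

    def equations(a):
        a1, a2 = a
        D = a1 * l1 + a2 * l2 + lam
        eq1 = (a1 + a2) - (1.0 - gamma * np.mean((a1 * l1 + a2 * l2) / D))
        eq2 = a1 - (n1_frac - gamma * np.mean(a1 * l1 / D))
        return [eq1, eq2]

    return fsolve(equations, x0=[lam, lam])   # small positive initial guess

# Example: Sigma^(2) = I, Sigma^(1) with eigenvalues kappa and 1/kappa (Prop. 4.3 setting)
p, n1, n2, kappa, lam = 600, 300, 100, 4.0, 0.1
l1 = np.concatenate([np.full(p // 2, kappa), np.full(p // 2, 1.0 / kappa)])
l2 = np.ones(p)
print(solve_a1_a2(l1, l2, gamma=p / (n1 + n2), n1_frac=n1 / (n1 + n2), lam=lam))
```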

Appendix B Basic Tools

We first collect the notations that we will use throughout the proof. Since p,n_{1},n_{2} have comparable orders, we use p as the fundamental large parameter. All constants depend only on parameters introduced in Assumption 1. For any matrix \bm{X}, \lambda_{\min}(\bm{X}) denotes its smallest eigenvalue, \lambda_{\max}(\bm{X}):=\|\bm{X}\|_{\textnormal{op}} denotes its largest eigenvalue (or equivalently the operator norm), \lambda_{i}(\bm{X}) denotes its i-th largest eigenvalue, and \bm{X}^{\dagger} denotes its pseudoinverse. We say an event \mathcal{E} happens with high probability (w.h.p.) if \mathbb{P}(\mathcal{E})\rightarrow 1 as p\rightarrow\infty. We write f(p)=O(g(p)) if there exists a constant C such that |f(p)|\leq Cg(p) for large enough p. For f(p),g(p)\geq 0, we also write f(p)\lesssim g(p) and g(p)=\Omega(f(p)) if f(p)=O(g(p)), and f(p)=\Theta(g(p)) if f(p)=O(g(p)) and g(p)=O(f(p)). Whenever we talk about constants, we refer to quantities that do not depend on \lambda,n_{1},n_{2},p, or random quantities such as \bm{X}^{(k)},\bm{y}^{(k)},\bm{\epsilon}^{(k)},k=1,2. They might depend on other constants such as \tau in Assumption 1. We often use c,C to denote constants whose specific value might change from instance to instance. We use check symbols (e.g. \check{x},\check{y}) to denote temporary notation (whose definition might also change) introduced for simplicity of derivation. We further make the following definitions:

Definition B.1 (Overwhelming Probability).

We say an event \mathcal{E} holds with overwhelming probability (w.o.p.) if for any constant D>0, \mathbb{P}(\mathcal{E})\geq 1-p^{-D} for large enough p. Moreover, we say \mathcal{E} holds with overwhelming probability in another event \Omega if for any constant D>0, \mathbb{P}(\Omega\setminus\mathcal{E})\leq p^{-D} for large enough p.

Definition B.2 (Stochastic domination, first introduced in [21]).

Let \xi^{(p)} and \zeta^{(p)} be two p-dependent random variables. We say that \xi is stochastically dominated by \zeta, denoted by \xi\prec\zeta or \xi=O_{\prec}(\zeta), if for any small constant c>0 and large constant D>0,

(|ξ|>pc|ζ|)pD\mathbb{P}(|\xi|>p^{c}|\zeta|)\leq p^{-D}

for large enough pp. That is, |ξ|pc|ζ||\xi|\leq p^{c}|\zeta| w.o.p. for any c>0c>0. If ξ(u)\xi(u) and ζ(u)\zeta(u) are functions of uu supported in 𝒰\mathcal{U}, then we say ξ(u)\xi(u) is stochastically dominated by ζ(u)\zeta(u) uniformly in 𝒰\mathcal{U} if

(u𝒰{|ξ(u)|>pc|ζ(u)|})pD.\mathbb{P}(\bigcup_{u\in\mathcal{U}}\{|\xi(u)|>p^{c}|\zeta(u)|\})\leq p^{-D}.

Note that since (logp)C1(\log p)^{C}\prec 1 for any constant C>0C>0 and we are including pcp^{c}, we can safely ignore log terms. Also, for a random variable ξ\xi with finite moments up to any order, we have |ξ|1|\xi|\prec 1, since

(|ξ|pc)pkc𝔼|ξ|kpD\mathbb{P}(|\xi|\geq p^{c})\leq p^{-kc}\mathbb{E}|\xi|^{k}\leq p^{-D}

by Markov’s inequality, when k>D/ck>D/c.

The following lemma collects several algebraic properties of stochastic domination:

Lemma B.1 (Lemma 3.2 of [13]).

Let ξ\xi and ζ\zeta be two families of nonnegative random variables depending on some parameters u𝒰u\in\mathcal{U} and v𝒱v\in\mathcal{V}. Let C>0C>0 be an arbitrary constant.

  1. (i)

    Sum. Suppose ξ(u,v)ζ(u,v)\xi(u,v)\prec\zeta(u,v) uniformly in u𝒰u\in\mathcal{U} and v𝒱v\in\mathcal{V}. If |𝒱|pC|\mathcal{V}|\leq p^{C}, then v𝒱ξ(u,v)v𝒱ζ(u,v)\sum_{v\in\mathcal{V}}\xi(u,v)\prec\sum_{v\in\mathcal{V}}\zeta(u,v) uniformly in uu.

  2. (ii)

    Product. If ξ1(u)ζ1(u)\xi_{1}(u)\prec\zeta_{1}(u) and ξ2(u)ζ2(u)\xi_{2}(u)\prec\zeta_{2}(u), then ξ1(u)ξ2(u)ζ1(u)ζ2(u)\xi_{1}(u)\xi_{2}(u)\prec\zeta_{1}(u)\zeta_{2}(u) uniformly in u𝒰u\in\mathcal{U}.

  3. (iii)

    Expectation. Suppose that Ψ(u)pC\Psi(u)\geq p^{-C} is a family of deterministic parameters, and ξ(u)\xi(u) satisfies 𝔼ξ(u)2pC\mathbb{E}\xi(u)^{2}\leq p^{C}. If ξ(u)Ψ(u)\xi(u)\prec\Psi(u) uniformly in uu, then we also have 𝔼ξ(u)Ψ(u)\mathbb{E}\xi(u)\prec\Psi(u) uniformly in uu.

Definition B.3 (Bounded Support).

Let Q>0Q>0 be a (pp-dependent) deterministic parameter. We say a random matrix 𝐙n×p\bm{Z}\in\mathbb{R}^{n\times p} satisfies the bounded support condition with QQ (or 𝐙\bm{Z} has bounded support QQ) if

max1in,1jp|Zij|Q.\max_{1\leq i\leq n,1\leq j\leq p}|Z_{ij}|\prec Q. (B.1)

As shown above, if the entries of \bm{Z} have finite moments up to any order, then \bm{Z} has bounded support Q=1. More generally, if every entry of \bm{Z} has a finite \varphi-th moment as in (2.3) and n\leq p, then using Markov’s inequality and a union bound we have

(max1in,1jp|Zij|(logp)p2/φ)\displaystyle\mathbb{P}\left(\max_{1\leq i\leq n,1\leq j\leq p}|Z_{ij}|\geq(\log p)p^{2/\varphi}\right) i=1nj=1p(|Zij|(logp)p2/φ)\displaystyle\leq\sum_{i=1}^{n}\sum_{j=1}^{p}\mathbb{P}\left(|Z_{ij}|\geq(\log p)p^{2/\varphi}\right) (B.2)
\displaystyle\lesssim\sum_{i=1}^{n}\sum_{j=1}^{p}\left[(\log p)p^{2/\varphi}\right]^{-\varphi}\leq(\log p)^{-\varphi}.

In other words, 𝒁\bm{Z} has bounded support Q=p2/φQ=p^{2/\varphi} with high probability.

The following lemma provides concentration bounds for linear and quadratic forms of random variables with bounded support.

Lemma B.2 (Lemma 3.8 of [24] and Theorem B.1 of [22]).

Let (xi),(yi)(x_{i}),(y_{i}) be families of centered independent random variables, and (ai),(Bij)(a_{i}),(B_{ij}) be families of deterministic complex numbers. Suppose the entries xix_{i} and yjy_{j} have variance at most 11, and satisfy the bounded support condition (B.1) for a deterministic parameter Q1Q\geq 1. Then we have the following estimates:

|i=1naixi|Qmax1in|ai|+(i=1n|ai|2)1/2,\displaystyle\left|\sum_{i=1}^{n}a_{i}x_{i}\right|\prec Q\max_{1\leq i\leq n}|a_{i}|+\left(\sum_{i=1}^{n}|a_{i}|^{2}\right)^{1/2}, (B.3)
|i,j=1nxiBijyj|Q2Bd+Qn1/2Bo+(1i,jn|Bij|2)1/2,\displaystyle\left|\sum_{i,j=1}^{n}x_{i}B_{ij}y_{j}\right|\prec Q^{2}B_{d}+Qn^{1/2}B_{o}+\left(\sum_{1\leq i,j\leq n}|B_{ij}|^{2}\right)^{1/2}, (B.4)
|i=1n(|xi|2𝔼|xi|2)Bii|Qn1/2Bd,\displaystyle\left|\sum_{i=1}^{n}(|x_{i}|^{2}-\mathbb{E}|x_{i}|^{2})B_{ii}\right|\prec Qn^{1/2}B_{d}, (B.5)
|1ijnxiBijxj|Qn1/2Bo+(1ijn|Bij|2)1/2,\displaystyle\left|\sum_{1\leq i\neq j\leq n}x_{i}B_{ij}x_{j}\right|\prec Qn^{1/2}B_{o}+\left(\sum_{1\leq i\neq j\leq n}|B_{ij}|^{2}\right)^{1/2}, (B.6)

where we denote Bd:=maxi|Bii|B_{d}:=\max_{i}|B_{ii}| and Bo:=maxij|Bij|B_{o}:=\max_{i\neq j}|B_{ij}|. Moreover, if xix_{i} and yjy_{j} have finite moments up to any order, then we have the following stronger estimates:

|i=1naixi|(i=1n|ai|2)1/2,\displaystyle\left|\sum_{i=1}^{n}a_{i}x_{i}\right|\prec\left(\sum_{i=1}^{n}|a_{i}|^{2}\right)^{1/2}, (B.7)
|i,j=1nxiBijyj|(1i,jn|Bij|2)1/2,\displaystyle\left|\sum_{i,j=1}^{n}x_{i}B_{ij}y_{j}\right|\prec\left(\sum_{1\leq i,j\leq n}|B_{ij}|^{2}\right)^{1/2}, (B.8)
|i=1n(|xi|2𝔼|xi|2)Bii|(i=1n|Bii|2)1/2,\displaystyle\left|\sum_{i=1}^{n}(|x_{i}|^{2}-\mathbb{E}|x_{i}|^{2})B_{ii}\right|\prec\left(\sum_{i=1}^{n}|B_{ii}|^{2}\right)^{1/2}, (B.9)
|1ijnxiBijxj|(1ijn|Bij|2)1/2.\displaystyle\left|\sum_{1\leq i\neq j\leq n}x_{i}B_{ij}x_{j}\right|\prec\left(\sum_{1\leq i\neq j\leq n}|B_{ij}|^{2}\right)^{1/2}. (B.10)

The following lemma deals with the empirical spectral distribution of \bm{Z}^{\top}\bm{Z} for n<p. Since \bm{Z}^{\top}\bm{Z} is rank-deficient, its empirical spectral distribution has a point mass at 0. However, its nonzero eigenvalues coincide with the eigenvalues of \bm{Z}\bm{Z}^{\top}. Therefore, previous results for \bm{Z}\bm{Z}^{\top} directly control the positive part of the empirical spectral distribution of \bm{Z}^{\top}\bm{Z}.

Lemma B.3.

Suppose 1+τp/nτ11+\tau\leq p/n\leq\tau^{-1}, and 𝐙n×p\bm{Z}\in\mathbb{R}^{n\times p} is a random matrix satisfying the same assumptions as 𝐙(2)\bm{Z}^{(2)} in Assumption 1 as well as the bounded support condition (B.1) for a deterministic parameter QQ such that 1Qp1/2cQ1\leq Q\leq p^{1/2-c_{Q}} for a constant cQ>0c_{Q}>0. Then, we have that

(pn)2O(pQ)λn(𝒁𝒁)λ1(𝒁𝒁)(p+n)2+O(pQ)(\sqrt{p}-\sqrt{n})^{2}-O_{\prec}(\sqrt{p}\cdot Q)\leq\lambda_{n}(\bm{Z}^{\top}\bm{Z})\leq\lambda_{1}(\bm{Z}^{\top}\bm{Z})\leq(\sqrt{p}+\sqrt{n})^{2}+O_{\prec}(\sqrt{p}\cdot Q) (B.11)
Proof.

First observe that λ1(𝒁𝒁)=λ1(𝒁𝒁)\lambda_{1}(\bm{Z}^{\top}\bm{Z})=\lambda_{1}(\bm{Z}\bm{Z}^{\top}), and λn(𝒁𝒁)=λn(𝒁𝒁)\lambda_{n}(\bm{Z}^{\top}\bm{Z})=\lambda_{n}(\bm{Z}\bm{Z}^{\top}). With this fact in mind, the case when QQ is of order 11 follows from [13, Theorem 2.10] and the case for general QQ follows from [19, Lemma 3.11]. ∎
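Lemma B.3 is easy to visualize in the simplest case Q\asymp 1: for a matrix with, say, standard Gaussian entries, the extreme nonzero eigenvalues of \bm{Z}^{\top}\bm{Z} concentrate near (\sqrt{p}\pm\sqrt{n})^{2}. A small numerical sketch with illustrative sizes follows.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 400, 1000                          # p/n bounded away from 1, as in Lemma B.3
Z = rng.normal(size=(n, p))               # entries with all moments finite (Q of order 1)

# nonzero eigenvalues of Z^T Z coincide with the eigenvalues of Z Z^T
evals = np.linalg.eigvalsh(Z @ Z.T)
print("largest :", evals[-1], "vs (sqrt(p)+sqrt(n))^2 =", (np.sqrt(p) + np.sqrt(n)) ** 2)
print("smallest:", evals[0], "vs (sqrt(p)-sqrt(n))^2 =", (np.sqrt(p) - np.sqrt(n)) ** 2)
```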

Using a standard cut-off argument, the above results can be extended to random matrices with the weaker bounded moment assumptions:

Corollary B.4.

Suppose 1+\tau\leq p/n\leq\tau^{-1}, and \bm{Z}\in\mathbb{R}^{n\times p} is a random matrix satisfying the same assumptions as \bm{Z}^{(2)} in Assumption 1. Then (B.11) holds on a high-probability event with Q=p^{2/\varphi}, where \varphi is the constant in (2.3).

Proof.

The proof follows [57, Corollary A.7] verbatim, upon observing that \lambda_{1}(\bm{Z}^{\top}\bm{Z})=\lambda_{1}(\bm{Z}\bm{Z}^{\top}) and \lambda_{n}(\bm{Z}^{\top}\bm{Z})=\lambda_{n}(\bm{Z}\bm{Z}^{\top}). ∎

We further cite two additional corollaries:

Corollary B.5 (Corollary A.8 of [57]).

Suppose 𝐙n×p\bm{Z}\in\mathbb{R}^{n\times p} is a random matrix satisfying the bounded moment condition (2.3). Then, for Q=n2/φQ=n^{2/\varphi}, there exists a high probability event on which the following estimate holds for any deterministic vector 𝐯p\bm{v}\in\mathbb{R}^{p}:

|𝒁𝒗22n𝒗22|n1/2Q𝒗22\left|\|\bm{Z}\bm{v}\|_{2}^{2}-n\|\bm{v}\|_{2}^{2}\right|\prec n^{1/2}Q\|\bm{v}\|_{2}^{2} (B.12)
Corollary B.6 (Corollary A.9 of [57]).

Suppose \bm{\epsilon}^{(1)},...,\bm{\epsilon}^{(t)}\in\mathbb{R}^{n} are random vectors satisfying Assumption 1(2). Then for any deterministic vector \bm{v}\in\mathbb{R}^{n}, we have

|𝒗ϵ(i)|σ𝒗2,i=1,,t,|\bm{v}^{\top}\bm{\epsilon}^{(i)}|\prec\sigma\|\bm{v}\|_{2},\ i=1,...,t, (B.13)

and for any deterministic matrix 𝐁n×n\bm{B}\in\mathbb{R}^{n\times n},

|ϵ(i)𝑩ϵ(j)δijσ2Tr(𝑩)|σ2𝑩F,i,j=1,,t.\left|\bm{\epsilon}^{(i)\top}\bm{B}\bm{\epsilon}^{(j)}-\delta_{ij}\cdot\sigma^{2}\textnormal{Tr}(\bm{B})\right|\prec\sigma^{2}\|\bm{B}\|_{F},\ i,j=1,...,t. (B.14)

Appendix C Proof for Model Shift

Throughout this section, we introduce the simplified notations

α:=n2/n1=γ1/γ2,s=λ(1+α),𝑸(1)=1n1𝒁(1)𝒁(1),𝑸(2)=1n2𝒁(2)𝒁(2).\alpha:=n_{2}/n_{1}=\gamma_{1}/\gamma_{2},s=\lambda(1+\alpha),\bm{Q}^{(1)}=\frac{1}{n_{1}}\bm{Z}^{(1)\top}\bm{Z}^{(1)},\bm{Q}^{(2)}=\frac{1}{n_{2}}\bm{Z}^{(2)\top}\bm{Z}^{(2)}. (C.1)

Similar to Appendix D, we first prove the ridge case (Theorem A.2) and subsequently prove the min-norm interpolator case (Theorem 3.1).

C.1 Proof of Theorem A.2

We only prove (A.13) and (A.14) here, as (A.11) and (A.12) are direct corollaries of [29, Theorem 5]. We begin with the proof of (A.13); the proof of (A.14) is similar. The argument mostly follows [57, Section D.1], and we indicate the necessary modifications here.

Since 𝚺(1)=𝚺(2)=𝑰\bm{\Sigma}^{(1)}=\bm{\Sigma}^{(2)}=\bm{I}, (A.2) becomes

B2(𝜷^λ;𝜷(2))\displaystyle B_{2}(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)}) =𝜷~(𝒁(1)𝒁(1)n)(𝚺^+λ𝑰)2(𝒁(1)𝒁(1)n)𝜷~\displaystyle=\tilde{\bm{\beta}}^{\top}(\frac{\bm{Z}^{(1)^{\top}}\bm{Z}^{(1)}}{n})(\hat{\bm{\Sigma}}+\lambda\bm{I})^{-2}(\frac{\bm{Z}^{(1)^{\top}}\bm{Z}^{(1)}}{n})\tilde{\bm{\beta}}
=𝜷~𝑸(1)(𝑸(1)+α𝑸(2)+s𝑰)2𝑸(1)𝜷~,\displaystyle=\tilde{\bm{\beta}}^{\top}\bm{Q}^{(1)}\left(\bm{Q}^{(1)}+\alpha\bm{Q}^{(2)}+s\bm{I}\right)^{-2}\bm{Q}^{(1)}\tilde{\bm{\beta}},

where 𝑸(1),𝑸(2)\bm{Q}^{(1)},\bm{Q}^{(2)} are defined by (C.1). Note that 𝑸(1),𝑸(2)\bm{Q}^{(1)},\bm{Q}^{(2)} are both Wishart matrices, hence

\bm{H}:=\bm{Q}^{(1)}\left(\bm{Q}^{(1)}+\alpha\bm{Q}^{(2)}+s\bm{I}\right)^{-2}\bm{Q}^{(1)} (C.2)

is rotationally invariant. Using this, we can simplify B_{2}(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)}) below:

Lemma C.1.

In the setting of Theorem A.2, we have

B2(𝜷^λ;𝜷(2))=[1+O(p1/2)]𝜷~22pTr[(𝑸(1)+α𝑸(2)+s𝑰)2(𝑸(1))2]B_{2}(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)})=\left[1+O_{\prec}(p^{-1/2})\right]\frac{\|\tilde{\bm{\beta}}\|_{2}^{2}}{p}\cdot\textnormal{Tr}\left[\left(\bm{Q}^{(1)}+\alpha\bm{Q}^{(2)}+s\bm{I}\right)^{-2}(\bm{Q}^{(1)})^{2}\right] (C.3)
Proof.

Since the matrix 𝑯\bm{H} defined by (C.2) is rotationally invariant, we have

𝜷~𝑯𝜷~=𝑑𝜷~22𝒈𝒈2𝑯𝒈𝒈2,\tilde{\bm{\beta}}^{\top}\bm{H}\tilde{\bm{\beta}}\overset{d}{=}\|\tilde{\bm{\beta}}\|_{2}^{2}\frac{\bm{g}^{\top}}{\|\bm{g}\|_{2}}\bm{H}\frac{\bm{g}}{\|\bm{g}\|_{2}}, (C.4)

where =𝑑\overset{d}{=} means equal in distribution, and 𝒈p\bm{g}\in\mathbb{R}^{p} is a random vector independent of 𝑯\bm{H} with i.i.d. Gaussian entries of mean zero and variance one. We know 𝑯\bm{H} has rank n1n_{1} with probability 11, and from Lemma B.3 we know λ1(𝑯)λn1(𝑯)\lambda_{1}(\bm{H})\prec\lambda_{n_{1}}(\bm{H}). Therefore,

𝑯Fp1/2𝑯opp1/2n11Tr(𝑯)p1/2Tr(𝑯).\|\bm{H}\|_{F}\leq p^{1/2}\|\bm{H}\|_{\textnormal{op}}\prec p^{1/2}n_{1}^{-1}\textnormal{Tr}(\bm{H})\prec p^{-1/2}\textnormal{Tr}(\bm{H}).

Using (B.14), we further have

\|\bm{g}\|_{2}^{2}=p+O_{\prec}(p^{1/2}),\ |\bm{g}^{\top}\bm{H}\bm{g}-\textnormal{Tr}(\bm{H})|\prec\|\bm{H}\|_{F}\prec p^{-1/2}\textnormal{Tr}(\bm{H}),

which, when plugged into (C.4), yields (C.3). ∎

With Lemma C.1, we can write

B2(𝜷^λ;𝜷(2))=[1+O(p1/2)]𝜷~22dhs(t)dt|t=0,B_{2}(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)})=\left[1+O_{\prec}(p^{-1/2})\right]\|\tilde{\bm{\beta}}\|_{2}^{2}\cdot-\frac{\textnormal{d}h_{s}(t)}{\textnormal{d}t}\bigg{|}_{t=0}, (C.5)

where

hs(t):=1pTr((𝑸(1)+t(𝑸(1))2+α𝑸(2)+s𝑰)1).h_{s}(t):=\frac{1}{p}\textnormal{Tr}\left(\left(\bm{Q}^{(1)}+t(\bm{Q}^{(1)})^{2}+\alpha\bm{Q}^{(2)}+s\bm{I}\right)^{-1}\right). (C.6)

For a p×pp\times p symmetric matrix 𝑨\bm{A}, let μ𝑨:=p1i=1pδλi(𝑨)\mu_{\bm{A}}:=p^{-1}\sum_{i=1}^{p}\delta_{\lambda_{i}(\bm{A})} denote the empirical spectral distribution (ESD) of 𝑨\bm{A}. Next, for any probability measure μ\mu supported on [0,)[0,\infty), denote its Stieltjes transform as

mμ(z):=0dμ(x)xz,forzsupp(μ),m_{\mu}(z):=\int_{0}^{\infty}\frac{\textnormal{d}\mu(x)}{x-z},\ \ \text{for}\ \ z\in\mathbb{C}\setminus\text{supp}(\mu), (C.7)

which, in the case of μ𝑨\mu_{\bm{A}}, writes

mμ𝑨(z):=1pi=1p1λi(𝑨)z=1pTr[(𝑨z𝑰)1].m_{\mu_{\bm{A}}}(z):=\frac{1}{p}\sum_{i=1}^{p}\frac{1}{\lambda_{i}(\bm{A})-z}=\frac{1}{p}\textnormal{Tr}[(\bm{A}-z\bm{I})^{-1}].

It is well-known that the ESDs \mu_{\bm{Q}^{(1)}},\mu_{\bm{Q}^{(2)}} of \bm{Q}^{(1)},\bm{Q}^{(2)} converge to Marčenko–Pastur distributions \tilde{\mu}^{(1)},\tilde{\mu}^{(2)}, whose Stieltjes transforms satisfy the self-consistent equations [2, Lemma 3.11]

zγkmμ~(k)2(1γkz)mμ~(k)+1=0,k=1,2.z\gamma_{k}m_{\tilde{\mu}^{(k)}}^{2}-(1-\gamma_{k}-z)m_{\tilde{\mu}^{(k)}}+1=0,\ \ k=1,2.

Reformulating, we equivalently have gk(mμ~(k)(z))=zg_{k}(m_{\tilde{\mu}^{(k)}}(z))=z, where gkg_{k} is defined as

gk(m):=11+γkm1m,k=1,2.g_{k}(m):=\frac{1}{1+\gamma_{k}m}-\frac{1}{m},\ \ k=1,2. (C.8)
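The self-consistent equation above can be checked directly against a simulated Wishart matrix: the empirical Stieltjes transform \frac{1}{p}\textnormal{Tr}[(\bm{Q}+s\bm{I})^{-1}] approximately satisfies z\gamma m^{2}-(1-\gamma-z)m+1=0 at z=-s for s>0. A short sketch with illustrative sizes follows.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, s = 300, 600, 0.5
gamma = p / n

Z = rng.normal(size=(n, p))
Q = Z.T @ Z / n                                       # as in (C.1)
m_emp = np.mean(1.0 / (np.linalg.eigvalsh(Q) + s))    # (1/p) Tr[(Q + sI)^{-1}] = m_{mu_Q}(-s)

z = -s
residual = z * gamma * m_emp**2 - (1.0 - gamma - z) * m_emp + 1.0
print("m(-s) =", m_emp, " self-consistency residual =", residual)   # residual is small
```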

The convergence rate of μ𝑸(1),μ𝑸(2)\mu_{\bm{Q}^{(1)}},\mu_{\bm{Q}^{(2)}} has been recently obtained in [48, Theorem 3.3]:

dK(μ𝑸(k),μ~(k))p1,k=1,2,d_{K}\left(\mu_{\bm{Q}^{(k)}},\tilde{\mu}^{(k)}\right)\prec p^{-1},\ \ k=1,2, (C.9)

where dKd_{K} denotes the Kolmogorov distance between two probability measures:

dK(μ𝑸(k),μ~(k)):=supx|μ𝑸(k)((,x])μ~(k)((,x])|.d_{K}\left(\mu_{\bm{Q}^{(k)}},\tilde{\mu}^{(k)}\right):=\sup_{x\in\mathbb{R}}\left|\mu_{\bm{Q}^{(k)}}((-\infty,x])-\tilde{\mu}^{(k)}((-\infty,x])\right|.

For fixed \alpha,t\geq 0, we know the ESDs of \alpha\bm{Q}^{(2)} and \bm{Q}^{(1)}+t(\bm{Q}^{(1)})^{2} converge weakly to two measures \mu^{(2)} and \mu_{t}^{(1)} (we suppress the dependence on \alpha since the context is clear), whose Stieltjes transforms are given by

mμ(2)(z)=1αmμ~(2)(zα),mμt(1)(z)=dμ(1)(x)x+tx2z.m_{\mu^{(2)}}(z)=\frac{1}{\alpha}m_{\tilde{\mu}^{(2)}}\left(\frac{z}{\alpha}\right),\ \ m_{\mu_{t}^{(1)}}(z)=\int\frac{\textnormal{d}\mu^{(1)}(x)}{x+tx^{2}-z}. (C.10)

Since the eigenmatrices of \alpha\bm{Q}^{(2)} and \bm{Q}^{(1)}+t(\bm{Q}^{(1)})^{2} are independent Haar-distributed orthogonal matrices, the ESD of \bm{Q}^{(1)}+t(\bm{Q}^{(1)})^{2}+\alpha\bm{Q}^{(2)} converges weakly to \mu_{t}^{(1)}\boxplus\mu^{(2)}, where \boxplus denotes the free addition of two probability measures. In particular, we have the following almost-sharp convergence rate:

Lemma C.2.

In the setting of Theorem A.2, suppose 0\leq t\leq C for a constant C>0 and s>p^{-2/3+\epsilon} for a small constant \epsilon>0. Then we have

|1pTr((𝑸(1)+t(𝑸(1))2+α𝑸(2)+s𝑰)1)mμt(1)μ(2)(s)|1ps.\left|\frac{1}{p}\textnormal{Tr}\left(\left(\bm{Q}^{(1)}+t(\bm{Q}^{(1)})^{2}+\alpha\bm{Q}^{(2)}+s\bm{I}\right)^{-1}\right)-m_{\mu_{t}^{(1)}\boxplus\mu^{(2)}}(-s)\right|\prec\frac{1}{ps}. (C.11)
Proof.

Since x+tx2x+tx^{2} and αx\alpha x are both strictly increasing functions of xx, (C.9) directly yields

dK(μ𝑸(1)+t(𝑸(1))2,μt(1))p1,dK(μα𝑸(2),μ(2))p1.d_{K}\left(\mu_{\bm{Q}^{(1)}+t(\bm{Q}^{(1)})^{2}},\mu_{t}^{(1)}\right)\prec p^{-1},\ \ d_{K}\left(\mu_{\alpha\bm{Q}^{(2)}},\mu^{(2)}\right)\prec p^{-1}.

Then (C.11) follows as a consequence of [4, Theorem 2.5] upon observing that s>p^{-2/3+\epsilon}. ∎

Lemma C.3.

We have

\frac{\textnormal{d}}{\textnormal{d}t}m_{\mu_{t}^{(1)}\boxplus\mu^{(2)}}(-s)\bigg{|}_{t=0}=-\frac{1-2f_{1}f_{3}+f_{2}f_{3}^{2}}{1-\frac{\gamma_{1}f_{2}(f_{3}-s)}{1+\gamma_{1}f_{1}}} (C.12)

where f_{1},f_{2},f_{3} are defined in (A.4), (A.5), (A.6), respectively.

Proof.

We calculate the Stieltjes transform of the free addition \mu_{t}^{(1)}\boxplus\mu^{(2)} using the following lemma:

Lemma C.4 (Theorem 4.1 of [7] and Theorem 2.1 of [17]).

Given two probability measures \mu_{1},\mu_{2} on \mathbb{R}, there exist unique analytic functions \omega_{1},\omega_{2}:\mathbb{C}^{+}\rightarrow\mathbb{C}^{+}, where \mathbb{C}^{+}:=\{z\in\mathbb{C}:\textnormal{Im}(z)>0\} is the upper half complex plane, such that the following equations hold: for any z\in\mathbb{C}^{+},

mμ1(ω2(z))=mμ2(ω1(z)),ω1(z)+ω2(z)z=1mμ1(ω2(z)).m_{\mu_{1}}(\omega_{2}(z))=m_{\mu_{2}}(\omega_{1}(z)),\ \ \omega_{1}(z)+\omega_{2}(z)-z=-\frac{1}{m_{\mu_{1}}(\omega_{2}(z))}. (C.13)

Moreover, mμ1(ω2(z))m_{\mu_{1}}(\omega_{2}(z)) is the Stieltjes transform of μ1μ2\mu_{1}\boxplus\mu_{2}, that is,

mμ1μ2(z)=mμ1(ω2(z)).m_{\mu_{1}\boxplus\mu_{2}}(z)=m_{\mu_{1}}(\omega_{2}(z)). (C.14)
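Lemma C.4 also admits a direct numerical illustration: replacing m_{\mu_{1}}, m_{\mu_{2}} by the empirical Stieltjes transforms of two independent rotationally invariant summands and iterating (C.13) at a real point z=-s below the spectra approximately recovers the empirical Stieltjes transform of the sum, as predicted by (C.14). The rough sketch below uses a plain damped fixed-point iteration with illustrative sizes; it is not a robust solver.

```python
import numpy as np

rng = np.random.default_rng(6)
p, n1, n2, s = 500, 200, 300, 0.5
alpha = n2 / n1

Z1 = rng.normal(size=(n1, p)); A = Z1.T @ Z1 / n1            # Q^(1)
Z2 = rng.normal(size=(n2, p)); B = alpha * (Z2.T @ Z2 / n2)  # alpha Q^(2)

eA, eB = np.linalg.eigvalsh(A), np.linalg.eigvalsh(B)
mA = lambda w: np.mean(1.0 / (eA - w))    # empirical Stieltjes transforms, used as
mB = lambda w: np.mean(1.0 / (eB - w))    # stand-ins for m_{mu_1} and m_{mu_2}

z = -s
w1 = w2 = z                               # initialize the subordination functions
for _ in range(500):                      # damped iteration of (C.13)
    w2 = 0.5 * w2 + 0.5 * (z - w1 - 1.0 / mB(w1))
    w1 = 0.5 * w1 + 0.5 * (z - w2 - 1.0 / mA(w2))

m_free = mA(w2)                                           # prediction via (C.14)
m_sum = np.mean(1.0 / (np.linalg.eigvalsh(A + B) - z))    # empirical transform of A + B
print(m_free, m_sum)                                      # close for large p
```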

Now we plug in \mu_{1}=\mu_{t}^{(1)} and \mu_{2}=\mu^{(2)}, and let z\rightarrow-s:

mμt(1)(ω2(s,t))=mμ(2)(ω1(s,t)),ω1(s,t)+ω2(s,t)+s=1mμ1(ω2(s,t)).m_{\mu_{t}^{(1)}}(\omega_{2}(-s,t))=m_{\mu^{(2)}}(\omega_{1}(-s,t)),\ \ \omega_{1}(-s,t)+\omega_{2}(-s,t)+s=-\frac{1}{m_{\mu_{1}}(\omega_{2}(-s,t))}. (C.15)

For simplicity, we sometimes write ω1,ω2\omega_{1},\omega_{2} for short when the context is clear. Using the definition of mμ(2)m_{\mu^{(2)}} in (C.10), we know

αg2(αmμ(2)(z))=z,\alpha g_{2}(\alpha m_{\mu^{(2)}}(z))=z, (C.16)

where g2g_{2} is defined in (C.8). Applying (C.16) to the first equation of (C.15), we get

ω1=α1+γ2αmμt(1)(ω2)1mμt(1)(ω2).\omega_{1}=\frac{\alpha}{1+\gamma_{2}\alpha m_{\mu_{t}^{(1)}}(\omega_{2})}-\frac{1}{m_{\mu_{t}^{(1)}}(\omega_{2})}.

Plugging this equation into the second equation of (C.15), we get

α1+γ2αmμt(1)(ω2)+ω2+s=0α+(ω2+s)[1+αγ2mμt(1)(ω2)]=0.\frac{\alpha}{1+\gamma_{2}\alpha m_{\mu_{t}^{(1)}}(\omega_{2})}+\omega_{2}+s=0\Leftrightarrow\alpha+(\omega_{2}+s)\left[1+\alpha\gamma_{2}m_{\mu_{t}^{(1)}}(\omega_{2})\right]=0. (C.17)

Now we define the following quantities at t=0t=0:

f1(s):=mμt(1)(ω2(s,0))=mμ0(1)μ(2)(s),\displaystyle f_{1}(s):=m_{\mu_{t}^{(1)}}(\omega_{2}(-s,0))=m_{\mu_{0}^{(1)}\boxplus\mu^{(2)}}(-s),
f2(s):=dmμt(1)(z)dz|z=ω2(s,0)=dμ(1)(x)[xω2(s,0)]2,f3(s):=ω2(s,0).\displaystyle f_{2}(s):=\frac{\textnormal{d}m_{\mu_{t}^{(1)}}(z)}{\textnormal{d}z}\bigg{|}_{z=\omega_{2}(-s,0)}=\int\frac{\textnormal{d}\mu^{(1)}(x)}{[x-\omega_{2}(-s,0)]^{2}},\ f_{3}(s):=-\omega_{2}(-s,0).

First, from (C.17) we obtain (A.6). Further, using the relation g_{1}(m_{\mu_{0}^{(1)}}(z))=z at z=\omega_{2}(-s,0), i.e., g_{1}(f_{1}(s))=\omega_{2}(-s,0), we get

\omega_{2}=\frac{1}{1+\gamma_{1}f_{1}}-\frac{1}{f_{1}},

which, when plugged into (C.17), yields (A.4). It is straightforward to check that (A.4) has a unique positive solution. Lastly, calculating the derivative of m_{\mu_{0}^{(1)}} using its inverse function, we obtain f_{2}(s)=[g_{1}^{\prime}(f_{1})]^{-1}, which gives (A.5).

As the final step, we calculate tmμt(1)(ω2(s,t))|t=0\partial_{t}m_{\mu_{t}^{(1)}}(\omega_{2}(-s,t))|_{t=0}. Taking the derivative of (C.17) w.r.t. tt at t=0t=0, we have

tω2(s,0)[1+αγ2f1(s)]+αγ2(sf3(s))tmμt(1)(ω2(s,t))|t=0=0.\partial_{t}\omega_{2}(-s,0)\cdot[1+\alpha\gamma_{2}f_{1}(s)]+\alpha\gamma_{2}(s-f_{3}(s))\cdot\partial_{t}m_{\mu_{t}^{(1)}}(\omega_{2}(-s,t))\big{|}_{t=0}=0. (C.18)

The second equation of (C.10) gives

tmμt(1)(ω2(s,t))=tdμ(1)(x)x+tx2ω2(s,t)=[x2tω2(s,t)]dμ(1)(x)[x+tx2ω2(s,t)]2.\partial_{t}m_{\mu_{t}^{(1)}}(\omega_{2}(-s,t))=\partial_{t}\int\frac{\textnormal{d}\mu^{(1)}(x)}{x+tx^{2}-\omega_{2}(-s,t)}=-\int\frac{[x^{2}-\partial_{t}\omega_{2}(-s,t)]\textnormal{d}\mu^{(1)}(x)}{[x+tx^{2}-\omega_{2}(-s,t)]^{2}}.

Taking t=0t=0 in the above equation, we have

tmμt(1)(ω2(s,t))|t=0\displaystyle\partial_{t}m_{\mu_{t}^{(1)}}(\omega_{2}(-s,t))\bigg{|}_{t=0}
=\displaystyle= tω2(s,0)f2(s)[(xω2(s,0))2+2ω2(s,0)(xω2(s,0))+ω2(s,0)2]dμ(1)(x)[xω2(s,0)]2\displaystyle\partial_{t}\omega_{2}(-s,0)\cdot f_{2}(s)-\int\frac{[(x-\omega_{2}(-s,0))^{2}+2\omega_{2}(-s,0)(x-\omega_{2}(-s,0))+\omega_{2}(-s,0)^{2}]\textnormal{d}\mu^{(1)}(x)}{[x-\omega_{2}(-s,0)]^{2}}
=\displaystyle= tω2(s,0)f2(s)1+2f1(s)f3(s)f2(s)f3(s)2,\displaystyle\partial_{t}\omega_{2}(-s,0)\cdot f_{2}(s)-1+2f_{1}(s)f_{3}(s)-f_{2}(s)f_{3}(s)^{2},

Combining the above equation with equation (C.18) yields (C.12). ∎

Now we are ready to prove (A.13) in Theorem A.2.

Proof of (A.13).

Recall the definition of hs(t)h_{s}(t) in (C.6). From Lemma C.2, we have

|1pTr((𝑸(1)+t(𝑸(1))2+α𝑸(2)+s𝑰)1)mμt(1)μ(2)(s)|1ps,\displaystyle\left|\frac{1}{p}\textnormal{Tr}\left(\left(\bm{Q}^{(1)}+t(\bm{Q}^{(1)})^{2}+\alpha\bm{Q}^{(2)}+s\bm{I}\right)^{-1}\right)-m_{\mu_{t}^{(1)}\boxplus\mu^{(2)}}(-s)\right|\prec\frac{1}{ps},
|1pTr((𝑸(1)+α𝑸(2)+s𝑰)1)mμ0(1)μ(2)(s)|1ps.\displaystyle\left|\frac{1}{p}\textnormal{Tr}\left(\left(\bm{Q}^{(1)}+\alpha\bm{Q}^{(2)}+s\bm{I}\right)^{-1}\right)-m_{\mu_{0}^{(1)}\boxplus\mu^{(2)}}(-s)\right|\prec\frac{1}{ps}.

Further, denoting 𝑨:=𝑸(1)+t(𝑸(1))2+α𝑸(2)+s𝑰,𝑩:=𝑸(1)+α𝑸(2)+s𝑰\bm{A}:=\bm{Q}^{(1)}+t(\bm{Q}^{(1)})^{2}+\alpha\bm{Q}^{(2)}+s\bm{I},\bm{B}:=\bm{Q}^{(1)}+\alpha\bm{Q}^{(2)}+s\bm{I}, we have

hs(t)hs(0)t=\displaystyle\frac{h_{s}(t)-h_{s}(0)}{t}= 1ptTr[𝑨1𝑩1]=1ptTr[𝑩1(𝑩𝑨)𝑨1]=1pTr[𝑨1𝑩1(𝑸(1))2],\displaystyle\frac{1}{pt}\textnormal{Tr}\left[\bm{A}^{-1}-\bm{B}^{-1}\right]=\frac{1}{pt}\textnormal{Tr}\left[\bm{B}^{-1}(\bm{B}-\bm{A})\bm{A}^{-1}\right]=-\frac{1}{p}\textnormal{Tr}\left[\bm{A}^{-1}\bm{B}^{-1}(\bm{Q}^{(1)})^{2}\right],

which yields

|dhs(t)dt|t=0hs(t)hs(0)t|=1pTr[(𝑩2𝑨1𝑩1)(𝑸(1))2]\displaystyle\left|\frac{\textnormal{d}h_{s}(t)}{\textnormal{d}t}\bigg{|}_{t=0}-\frac{h_{s}(t)-h_{s}(0)}{t}\right|=\frac{1}{p}\textnormal{Tr}\left[\left(\bm{B}^{-2}-\bm{A}^{-1}\bm{B}^{-1}\right)(\bm{Q}^{(1)})^{2}\right]
=\displaystyle= 1pTr[t𝑩1(𝑸(1))2𝑨1𝑩1(𝑸(1))2]ts3w.o.p.\displaystyle\frac{1}{p}\textnormal{Tr}\left[t\bm{B}^{-1}(\bm{Q}^{(1)})^{2}\bm{A}^{-1}\bm{B}^{-1}(\bm{Q}^{(1)})^{2}\right]\lesssim\frac{t}{s^{3}}\ \ \textnormal{w.o.p.}

Similarly,

|dmμt(1)μ(2)(0)dt|t=0mμt(1)μ(2)(0)mμ0(1)μ(2)(0)t|ts3w.o.p.\left|\frac{\textnormal{d}m_{\mu_{t}^{(1)}\boxplus\mu^{(2)}}(0)}{\textnormal{d}t}\bigg{|}_{t=0}-\frac{m_{\mu_{t}^{(1)}\boxplus\mu^{(2)}}(0)-m_{\mu_{0}^{(1)}\boxplus\mu^{(2)}}(0)}{t}\right|\lesssim\frac{t}{s^{3}}\ \ \textnormal{w.o.p.}

Combining the three displays above and fixing t=sp1/2t=sp^{-1/2}, we obtain that

|dhs(t)dt|t=0dmμt(1)μ(2)(0)dt|t=0|\displaystyle\left|\frac{\textnormal{d}h_{s}(t)}{\textnormal{d}t}\bigg{|}_{t=0}-\frac{\textnormal{d}m_{\mu_{t}^{(1)}\boxplus\mu^{(2)}}(0)}{\textnormal{d}t}\bigg{|}_{t=0}\right|\prec ts3+|hs(0)mμ0(1)μ(2)(0)|+|hs(t)mμt(1)μ(2)(0)|t\displaystyle\frac{t}{s^{3}}+\frac{|h_{s}(0)-m_{\mu_{0}^{(1)}\boxplus\mu^{(2)}}(0)|+|h_{s}(t)-m_{\mu_{t}^{(1)}\boxplus\mu^{(2)}}(0)|}{t}
\displaystyle\prec s2p1/2+c\displaystyle s^{-2}p^{-1/2+c}

for any small c>0c>0. Plugging this into (C.5) and using (C.12), we obtain (A.13).

(A.14) can be proved using a similar procedure upon observing the critical fact that

B3(𝜷^λ;𝜷(2))\displaystyle B_{3}(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)}) =[1+O(p1/2)]2𝜷~𝜷(2)pTr[(𝑸(1)+α𝑸(2)+s𝑰)2(𝑸(1))2]\displaystyle=-\left[1+O_{\prec}(p^{-1/2})\right]\frac{2\tilde{\bm{\beta}}^{\top}\bm{\beta}^{(2)}}{p}\cdot\textnormal{Tr}\left[\left(\bm{Q}^{(1)}+\alpha\bm{Q}^{(2)}+s\bm{I}\right)^{-2}(\bm{Q}^{(1)})^{2}\right]
=[1+O(p1/2)](2𝜷~𝜷(2))ddt[1pTr((𝑸(1)+t𝑸(1)+α𝑸(2)+s𝑰)1)]|t=0,\displaystyle=-\left[1+O_{\prec}(p^{-1/2})\right]\cdot(2\tilde{\bm{\beta}}^{\top}\bm{\beta}^{(2)})\cdot\frac{d}{dt}\left[\frac{1}{p}\textnormal{Tr}\left(\left(\bm{Q}^{(1)}+t\bm{Q}^{(1)}+\alpha\bm{Q}^{(2)}+s\bm{I}\right)^{-1}\right)\right]\bigg{|}_{t=0},

and we omit the details. ∎

C.2 Proof of Theorem 3.1

We only prove (3.3) as (3.2) and (3.1) are direct corollaries of [29, Theorem 2]. First we show that B2(𝜷^;𝜷(2))B_{2}(\hat{\bm{\beta}};\bm{\beta}^{(2)}) and B2(𝜷^λ;𝜷(2))B_{2}(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)}) are close. Recall their definitions in (2.14) and (A.2), and note that 𝚺(1)=𝚺(2)=𝑰\bm{\Sigma}^{(1)}=\bm{\Sigma}^{(2)}=\bm{I}. Recall the definitions of 𝑸(1),𝑸(2)\bm{Q}^{(1)},\bm{Q}^{(2)} from (C.1). Denoting 𝑸:=𝑸(1)+α𝑸(2)\bm{Q}:=\bm{Q}^{(1)}+\alpha\bm{Q}^{(2)}, we have

|B2(𝜷^λ;𝜷(2))B2(𝜷^;𝜷(2))|=|𝑸𝑸(1)𝜷~22(𝑸+s𝑰)1𝑸(1)𝜷~22|\displaystyle\left|B_{2}(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)})-B_{2}(\hat{\bm{\beta}};\bm{\beta}^{(2)})\right|=\left|\|\bm{Q}^{\dagger}\bm{Q}^{(1)}\tilde{\bm{\beta}}\|_{2}^{2}-\|(\bm{Q}+s\bm{I})^{-1}\bm{Q}^{(1)}\tilde{\bm{\beta}}\|_{2}^{2}\right|
\displaystyle\leq |𝑸𝑸(1)𝜷~2+(𝑸+s𝑰)1𝑸(1)𝜷~2||𝑸𝑸(1)𝜷~2(𝑸+s𝑰)1𝑸(1)𝜷~2|.\displaystyle\left|\|\bm{Q}^{\dagger}\bm{Q}^{(1)}\tilde{\bm{\beta}}\|_{2}+\|(\bm{Q}+s\bm{I})^{-1}\bm{Q}^{(1)}\tilde{\bm{\beta}}\|_{2}\right|\cdot\left|\|\bm{Q}^{\dagger}\bm{Q}^{(1)}\tilde{\bm{\beta}}\|_{2}-\|(\bm{Q}+s\bm{I})^{-1}\bm{Q}^{(1)}\tilde{\bm{\beta}}\|_{2}\right|.

By Lemma B.3, we know 𝑸𝑸(1)𝜷~2𝜷~2\|\bm{Q}^{\dagger}\bm{Q}^{(1)}\tilde{\bm{\beta}}\|_{2}\lesssim\|\tilde{\bm{\beta}}\|_{2}. Moreover, writing 𝒁:=(𝒁(1),𝒁(2))\bm{Z}:=(\bm{Z}^{(1)\top},\bm{Z}^{(2)\top})^{\top} with singular value decomposition 𝒁=1n1𝑼𝚲1/2𝑽\bm{Z}=\frac{1}{\sqrt{n_{1}}}\bm{U}\bm{\Lambda}^{1/2}\bm{V}^{\top}, we have

|𝑸𝑸(1)𝜷~2(𝑸+s𝑰)1𝑸(1)𝜷~2|\displaystyle\left|\|\bm{Q}^{\dagger}\bm{Q}^{(1)}\tilde{\bm{\beta}}\|_{2}-\|(\bm{Q}+s\bm{I})^{-1}\bm{Q}^{(1)}\tilde{\bm{\beta}}\|_{2}\right|
\displaystyle\leq (𝑸(𝑸+s𝑰)1)𝑸(1)𝜷~2\displaystyle\left\|\left(\bm{Q}^{\dagger}-(\bm{Q}+s\bm{I})^{-1}\right)\bm{Q}^{(1)}\tilde{\bm{\beta}}\right\|_{2}
\displaystyle\leq (𝑸(𝑸+s𝑰)1)𝒁(1)n1op𝒁(1)n1op𝜷~2\displaystyle\left\|\left(\bm{Q}^{\dagger}-(\bm{Q}+s\bm{I})^{-1}\right)\frac{\bm{Z}^{(1)\top}}{\sqrt{n_{1}}}\right\|_{\textnormal{op}}\cdot\left\|\frac{\bm{Z}^{(1)}}{\sqrt{n_{1}}}\right\|_{\textnormal{op}}\cdot\|\tilde{\bm{\beta}}\|_{2}
\displaystyle\lesssim (𝑸(𝑸+s𝑰)1)𝒁n1op𝜷~2\displaystyle\left\|\left(\bm{Q}^{\dagger}-(\bm{Q}+s\bm{I})^{-1}\right)\frac{\bm{Z}^{\top}}{\sqrt{n_{1}}}\right\|_{\textnormal{op}}\cdot\|\tilde{\bm{\beta}}\|_{2}
\displaystyle\leq (𝚲(𝚲+s𝑰)1)𝚲1/2op𝜷~2\displaystyle\left\|\left(\bm{\Lambda}^{\dagger}-(\bm{\Lambda}+s\bm{I})^{-1}\right)\bm{\Lambda}^{1/2}\right\|_{\textnormal{op}}\cdot\|\tilde{\bm{\beta}}\|_{2}
=\displaystyle= sΛnn1/2(Λnn+s)𝜷~2\displaystyle\frac{s}{\Lambda_{nn}^{1/2}(\Lambda_{nn}+s)}\cdot\|\tilde{\bm{\beta}}\|_{2}
\displaystyle\lesssim s𝜷~2w.o.p.,\displaystyle s\|\tilde{\bm{\beta}}\|_{2}\ \ \textnormal{w.o.p.,}

where the third inequality follows since 𝒁(1)\bm{Z}^{(1)} is a sub-matrix of 𝒁\bm{Z}, and the last line follows by Lemma B.3. This also yields that (𝑸+s𝑰)1𝑸(1)𝜷~2𝜷~2\|(\bm{Q}+s\bm{I})^{-1}\bm{Q}^{(1)}\tilde{\bm{\beta}}\|_{2}\lesssim\|\tilde{\bm{\beta}}\|_{2}, and, consequently,

|B2(𝜷^λ;𝜷(2))B2(𝜷^;𝜷(2))|spc𝜷~22\left|B_{2}(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)})-B_{2}(\hat{\bm{\beta}};\bm{\beta}^{(2)})\right|\prec sp^{c}\|\tilde{\bm{\beta}}\|_{2}^{2} (C.19)

for any small c>0c>0. Next we show that 2(m)(λ;γ1,γ2)\mathcal{B}_{2}^{(m)}(\lambda;\gamma_{1},\gamma_{2}) in (A.9) is close to n1(pn1)p(pn)𝜷~22\frac{n_{1}(p-n_{1})}{p(p-n)}\|\tilde{\bm{\beta}}\|_{2}^{2} in (3.3). Consider equation (A.4) for f1(s)f_{1}(s) (where we make the dependence on ss explicit). Define f~1(s):=sf1(s)\tilde{f}_{1}(s):=sf_{1}(s) for s>0s>0; then, after rearranging, f~1(s)\tilde{f}_{1}(s) is the positive solution to the following equation:

γ1f~12+(sγ1+α+1)f~1s=0.\gamma_{1}\tilde{f}_{1}^{2}+(s-\gamma_{1}+\alpha+1)\tilde{f}_{1}-s=0. (C.20)

If we extend the definition to f~1(0):=11/γ11/γ2=pnp\tilde{f}_{1}(0):=1-1/\gamma_{1}-1/\gamma_{2}=\frac{p-n}{p}, then a quick check confirms that f~1(0)\tilde{f}_{1}(0) is the unique positive solution to the equation above at s=0s=0, and that f~1(s)\tilde{f}_{1}(s) is a continuous function of ss for s0s\geq 0. Calculating the derivative of (C.20) w.r.t. ss, we get

f~1(s)=1f~1(s)2γ1f~1(s)+sγ1+α+1.\tilde{f}_{1}^{\prime}(s)=\frac{1-\tilde{f}_{1}(s)}{2\gamma_{1}\tilde{f}_{1}(s)+s-\gamma_{1}+\alpha+1}.

Denoting F(x):=γ1x2+(sγ1+α+1)xsF(x):=\gamma_{1}x^{2}+(s-\gamma_{1}+\alpha+1)x-s, we know F(f~1(s))=0F(\tilde{f}_{1}(s))=0, F(0)=s<0F(0)=-s<0 and F(1)=α+1>0F(1)=\alpha+1>0. This yields f~1(s)(0,1)\tilde{f}_{1}(s)\in(0,1). Moreover, 2γ1f~1(s)+sγ1+α+12γ1f~1(0)γ1+α+1=γ1f~1(0)2\gamma_{1}\tilde{f}_{1}(s)+s-\gamma_{1}+\alpha+1\rightarrow 2\gamma_{1}\tilde{f}_{1}(0)-\gamma_{1}+\alpha+1=\gamma_{1}\tilde{f}_{1}(0) as s0s\rightarrow 0. By continuity, there exist constants c1,c2>0c_{1},c_{2}>0 such that f~1(s)c1\tilde{f}_{1}(s)\geq c_{1} whenever 0<s<c20<s<c_{2}. Consequently, f~1(s)\tilde{f}_{1}^{\prime}(s) is positive and bounded above whenever 0<s<c20<s<c_{2}. Now fix s(0,c2)s\in(0,c_{2}); by the mean value theorem, |f~1(s)f~1(0)|=s|f~1(s)|s|\tilde{f}_{1}(s)-\tilde{f}_{1}(0)|=s\cdot|\tilde{f}_{1}^{\prime}(s^{*})|\lesssim s for some s(0,s)s^{*}\in(0,s). In other words, there exists a constant cc such that for all s(0,c)s\in(0,c),

|f~1(s)f~1(0)|:=|sf1(s)pnp|=O(s).\left|\tilde{f}_{1}(s)-\tilde{f}_{1}(0)\right|:=\left|sf_{1}(s)-\frac{p-n}{p}\right|=O(s). (C.21)
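The behavior in (C.20)–(C.21) is easy to check numerically. The following minimal sketch (Python) computes the positive root of (C.20) in closed form and verifies the O(s)O(s) convergence to (pn)/p(p-n)/p; the value α=γ1/γ2\alpha=\gamma_{1}/\gamma_{2} (equivalently n2/n1n_{2}/n_{1}) used here is an assumption, chosen because it is the value consistent with the stated limit f~1(0)=11/γ11/γ2\tilde{f}_{1}(0)=1-1/\gamma_{1}-1/\gamma_{2}.

```python
import numpy as np

def f1_tilde(s, gamma1, gamma2):
    """Positive root of (C.20): gamma1*x^2 + (s - gamma1 + alpha + 1)*x - s = 0.
    Assumes alpha = gamma1/gamma2 (= n2/n1), consistent with
    f1_tilde(0) = 1 - 1/gamma1 - 1/gamma2 = (p - n)/p."""
    alpha = gamma1 / gamma2
    b = s - gamma1 + alpha + 1
    return (-b + np.sqrt(b**2 + 4 * gamma1 * s)) / (2 * gamma1)

gamma1, gamma2 = 4.0, 8.0              # p/n1 and p/n2, overparametrized (p > n)
limit = 1 - 1 / gamma1 - 1 / gamma2    # (p - n)/p
for s in [1e-1, 1e-2, 1e-3, 1e-4]:
    print(s, f1_tilde(s, gamma1, gamma2) - limit)   # gap shrinks at rate O(s), cf. (C.21)
```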

Next, recall f3(s)f_{3}(s) in (A.6). We define f~3(s):=f3(s)/s\tilde{f}_{3}(s):=f_{3}(s)/s for s>0s>0 and f~3(0)=1γ2f~1(0)+1\tilde{f}_{3}(0)=\frac{1}{\gamma_{2}\tilde{f}_{1}(0)}+1. Plugging in the respective definitions, we have

|f~3(s)f~3(0)|:=|f3(s)s(1γ2f~1(0)+1)|=[s+γ1(f~1(s)f~1(0))]γ2f~1(0)(s+γ1f~1(s))=O(s)\left|\tilde{f}_{3}(s)-\tilde{f}_{3}(0)\right|:=\left|\frac{f_{3}(s)}{s}-\left(\frac{1}{\gamma_{2}\tilde{f}_{1}(0)}+1\right)\right|=\frac{[s+\gamma_{1}(\tilde{f}_{1}(s)-\tilde{f}_{1}(0))]}{\gamma_{2}\tilde{f}_{1}(0)(s+\gamma_{1}\tilde{f}_{1}(s))}=O(s) (C.22)

for ss small enough, because of (C.21) and the fact that the denominator is bounded below. Similarly, recall f2(s)f_{2}(s) in (A.5). We define f~2(s):=s2f2(s)\tilde{f}_{2}(s):=s^{2}f_{2}(s) for s>0s>0 and f~2(0)=γ1γ11f~1(0)2\tilde{f}_{2}(0)=\frac{\gamma_{1}}{\gamma_{1}-1}\tilde{f}_{1}(0)^{2}. Plugging in respective definitions, we have

|f~2(s)f~2(0)|:=|s2f2(s)γ1γ11f~1(0)2|\displaystyle\left|\tilde{f}_{2}(s)-\tilde{f}_{2}(0)\right|:=\left|s^{2}f_{2}(s)-\frac{\gamma_{1}}{\gamma_{1}-1}\tilde{f}_{1}(0)^{2}\right| (C.23)
=\displaystyle= |γ1(f~1(s)2f~1(0)2)(s+γ1f~1(s))2+f~1(s)2[γ12f~1(0)2(s+γ1f~1(s))2](γ11)[(s+γ1f~1(s))2γ1f~1(s)2]|\displaystyle\left|\frac{\gamma_{1}\left(\tilde{f}_{1}(s)^{2}-\tilde{f}_{1}(0)^{2}\right)\left(s+\gamma_{1}\tilde{f}_{1}(s)\right)^{2}+\tilde{f}_{1}(s)^{2}\left[\gamma_{1}^{2}\tilde{f}_{1}(0)^{2}-\left(s+\gamma_{1}\tilde{f}_{1}(s)\right)^{2}\right]}{(\gamma_{1}-1)\left[\left(s+\gamma_{1}\tilde{f}_{1}(s)\right)^{2}-\gamma_{1}\tilde{f}_{1}(s)^{2}\right]}\right|
=\displaystyle= O(s)\displaystyle O(s)

for ss small enough, since f~1(s)2f~1(0)2=O(s)\tilde{f}_{1}(s)^{2}-\tilde{f}_{1}(0)^{2}=O(s) and γ12f~1(0)2(s+γ1f~1(s))2=O(s)\gamma_{1}^{2}\tilde{f}_{1}(0)^{2}-\left(s+\gamma_{1}\tilde{f}_{1}(s)\right)^{2}=O(s) by (C.21), and other relevant terms are all bounded.

Plugging (C.21), (C.22), (C.23) into (A.9) and applying similar arguments (details omitted), we finally have

2(m)(λ;γ1,γ2):=\displaystyle\mathcal{B}_{2}^{(m)}(\lambda;\gamma_{1},\gamma_{2}):= 𝜷~2212f~1(s)f~3(s)+f~2(s)f~3(s)21γ1f~2(s)(f~3(s)1)s+γ1f~1(s)\displaystyle\|\tilde{\bm{\beta}}\|_{2}^{2}\cdot\frac{1-2\tilde{f}_{1}(s)\tilde{f}_{3}(s)+\tilde{f}_{2}(s)\tilde{f}_{3}(s)^{2}}{1-\frac{\gamma_{1}\tilde{f}_{2}(s)(\tilde{f}_{3}(s)-1)}{s+\gamma_{1}\tilde{f}_{1}(s)}} (C.24)
=\displaystyle= 𝜷~22n1(pn1)p(pn)+O(s𝜷~22)\displaystyle\|\tilde{\bm{\beta}}\|_{2}^{2}\cdot\frac{n_{1}(p-n_{1})}{p(p-n)}+O(s\|\tilde{\bm{\beta}}\|_{2}^{2})

Finally, combining (A.13) with (C.19) and (C.24) and setting s=λ(1+α)=p1/7s=\lambda(1+\alpha)=p^{-1/7}, we obtain (3.3). Hence the proof is complete.
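As a numerical sanity check of the limit appearing in (C.24) (and hence of the corresponding term in (3.3)), the following minimal Monte Carlo sketch (Python) compares 𝑸𝑸(1)𝜷~22\|\bm{Q}^{\dagger}\bm{Q}^{(1)}\tilde{\bm{\beta}}\|_{2}^{2} with n1(pn1)/(p(pn))n_{1}(p-n_{1})/(p(p-n)) for isotropic Gaussian designs. The dimensions and the unit vector standing in for 𝜷~\tilde{\bm{\beta}} are illustrative; we also assume, consistent with the SVD normalization used above, that 𝑸(1)+α𝑸(2)\bm{Q}^{(1)}+\alpha\bm{Q}^{(2)} and 𝑸(1)\bm{Q}^{(1)} are proportional to 𝒁𝒁\bm{Z}^{\top}\bm{Z} and 𝒁(1)𝒁(1)\bm{Z}^{(1)\top}\bm{Z}^{(1)} with a common constant, so that the product 𝑸𝑸(1)\bm{Q}^{\dagger}\bm{Q}^{(1)} is unaffected by that constant.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n1, n2 = 600, 200, 150
n = n1 + n2
beta = rng.standard_normal(p)
beta /= np.linalg.norm(beta)       # plays the role of a (unit-norm) beta_tilde

vals = []
for _ in range(20):
    Z1 = rng.standard_normal((n1, p))
    Z2 = rng.standard_normal((n2, p))
    Z = np.vstack([Z1, Z2])
    # Q^dagger Q^(1) beta_tilde = (Z^T Z)^+ (Z1^T Z1) beta_tilde; any common
    # scaling of Q^(1) and Q^(2) cancels in this product.
    v = np.linalg.pinv(Z.T @ Z) @ (Z1.T @ (Z1 @ beta))
    vals.append(np.sum(v**2))

print(np.mean(vals), n1 * (p - n1) / (p * (p - n)))   # the two should be close
```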

Appendix D Proof for Covariate Shift

We first establish the generalization error for ridge regression under covariate shift (Theorem A.3), and then obtain the generalization error for the min-norm interpolator by letting λ0\lambda\rightarrow 0. Recall that the expressions for the bias and variance were given in (A.1) and (A.2). Note that 𝜷~=𝟎\tilde{\bm{\beta}}=\bm{0} here. Define

𝑾:=(𝚲(1))1/2𝑽𝚺^𝒁(1)𝑽(𝚲(1))1/2+(𝚲(2))1/2𝑽𝚺^𝒁(2)𝑽(𝚲(2))1/2\bm{W}:=(\bm{\Lambda}^{(1)})^{1/2}\bm{V}^{\top}\hat{\bm{\Sigma}}_{\bm{Z}}^{(1)}\bm{V}(\bm{\Lambda}^{(1)})^{1/2}+(\bm{\Lambda}^{(2)})^{1/2}\bm{V}^{\top}\hat{\bm{\Sigma}}_{\bm{Z}}^{(2)}\bm{V}(\bm{\Lambda}^{(2)})^{1/2} (D.1)

Rewriting the variance as a derivative in λ\lambda (an antiderivative trick), we have

V(𝜷^λ;𝜷(2))\displaystyle V(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)}) (D.2)
=\displaystyle= λ{σ2λnTr(𝚺(2)(𝚺^+λ𝑰)1)}\displaystyle\frac{\partial}{\partial\lambda}\left\{\frac{\sigma^{2}\lambda}{n}\textnormal{Tr}\left(\bm{\Sigma}^{(2)}(\hat{\bm{\Sigma}}+\lambda\bm{I})^{-1}\right)\right\}
=\displaystyle= λ{σ2λnTr(𝚺(2)((𝚺(1))1/2𝚺^𝒁(1)(𝚺(1))1/2+(𝚺(2))1/2𝚺^𝒁(2)(𝚺(2))1/2+λ𝑰)1)}\displaystyle\frac{\partial}{\partial\lambda}\left\{\frac{\sigma^{2}\lambda}{n}\textnormal{Tr}\left(\bm{\Sigma}^{(2)}\left((\bm{\Sigma}^{(1)})^{1/2}\hat{\bm{\Sigma}}_{\bm{Z}}^{(1)}(\bm{\Sigma}^{(1)})^{1/2}+(\bm{\Sigma}^{(2)})^{1/2}\hat{\bm{\Sigma}}_{\bm{Z}}^{(2)}(\bm{\Sigma}^{(2)})^{1/2}+\lambda\bm{I}\right)^{-1}\right)\right\}
=\displaystyle= λ{σ2λnTr((𝚲(2))1/2(𝑾+λ𝑰)1(𝚲(2))1/2)},\displaystyle\frac{\partial}{\partial\lambda}\left\{\frac{\sigma^{2}\lambda}{n}\textnormal{Tr}\left((\bm{\Lambda}^{(2)})^{1/2}\left(\bm{W}+\lambda\bm{I}\right)^{-1}(\bm{\Lambda}^{(2)})^{1/2}\right)\right\},

and

B(𝜷^λ,𝜷(2))\displaystyle B(\hat{\bm{\beta}}_{\lambda},\bm{\beta}^{(2)}) (D.3)
=\displaystyle= η{λ𝜷(2)(𝚺^+λ𝑰+λη𝚺(2))1𝜷(2)}|η=0\displaystyle-\frac{\partial}{\partial\eta}\left\{\lambda\bm{\beta}^{(2)\top}(\hat{\bm{\Sigma}}+\lambda\bm{I}+\lambda\eta\bm{\Sigma}^{(2)})^{-1}\bm{\beta}^{(2)}\right\}\bigg{|}_{\eta=0}
=\displaystyle= η{λ𝜷η(2)((𝚲η(1))1/2𝑽𝚺^𝒁(1)𝑽(𝚲η(1))1/2+(𝚲η(2))1/2𝑽𝚺^𝒁(2)𝑽(𝚲η(2))1/2+λ𝑰)1𝜷η(2)}|η=0,\displaystyle-\frac{\partial}{\partial\eta}\left\{\lambda\bm{\beta}_{\eta}^{(2)\top}\left((\bm{\Lambda}_{\eta}^{(1)})^{1/2}\bm{V}^{\top}\hat{\bm{\Sigma}}_{\bm{Z}}^{(1)}\bm{V}(\bm{\Lambda}_{\eta}^{(1)})^{1/2}+(\bm{\Lambda}_{\eta}^{(2)})^{1/2}\bm{V}^{\top}\hat{\bm{\Sigma}}_{\bm{Z}}^{(2)}\bm{V}(\bm{\Lambda}_{\eta}^{(2)})^{1/2}+\lambda\bm{I}\right)^{-1}\bm{\beta}_{\eta}^{(2)}\right\}\bigg{|}_{\eta=0},

where 𝚲η(k):=𝚲(k)(𝑰+η𝚲(2))1\bm{\Lambda}_{\eta}^{(k)}:=\bm{\Lambda}^{(k)}(\bm{I}+\eta\bm{\Lambda}^{(2)})^{-1} for k=1,2,𝜷η(2):=(𝑰+η𝚲(2))1/2𝑽𝜷(2)k=1,2,\bm{\beta}_{\eta}^{(2)}:=(\bm{I}+\eta\bm{\Lambda}^{(2)})^{-1/2}\bm{V}^{\top}\bm{\beta}^{(2)}, and 𝚺^𝒁(k):=1n𝒁(k)𝒁(k)\hat{\bm{\Sigma}}_{\bm{Z}}^{(k)}:=\frac{1}{n}\bm{Z}^{(k)\top}\bm{Z}^{(k)} where 𝒁(k)\bm{Z}^{(k)} satisfies Assumption 1(1). Here 𝑽,𝚲(1),𝚲(2)\bm{V},\bm{\Lambda}^{(1)},\bm{\Lambda}^{(2)} are defined in Definition 4.1.

D.1 Prototype Problem

Both the bias and variance terms above, (D.2) and (D.3), involve characterizing the asymptotic behavior of (𝑾+λ𝑰)1(\bm{W}+\lambda\bm{I})^{-1} (i.e., the resolvent of 𝑾\bm{W} at λ-\lambda), where 𝑾\bm{W} is defined by (D.1).

A powerful tool to handle this type of problems is known as the anisotropic local law. [33] first introduced this technique and effectively characterized the behavior of ((𝚲(1))1/2𝑽𝚺^𝒁(1)𝑽(𝚲(1))1/2+z𝑰)1\left((\bm{\Lambda}^{(1)})^{1/2}\bm{V}^{\top}\hat{\bm{\Sigma}}_{\bm{Z}}^{(1)}\bm{V}(\bm{\Lambda}^{(1)})^{1/2}+z\bm{I}\right)^{-1} for a wide range of zz\in\mathbb{C}, whereas [57] generalized the result to ((𝚲(1))1/2𝑽𝚺^𝒁(1)𝑽(𝚲(1))1/2+𝚺^𝒁(2)+z𝑰)1\left((\bm{\Lambda}^{(1)})^{1/2}\bm{V}^{\top}\hat{\bm{\Sigma}}_{\bm{Z}}^{(1)}\bm{V}(\bm{\Lambda}^{(1)})^{1/2}+\hat{\bm{\Sigma}}_{\bm{Z}}^{(2)}+z\bm{I}\right)^{-1} at zz around 0.

In this section, we state and prove a more general local law, which is one of the major technical contributions of this manuscript. This local law will serve as the basis for studying the limits of (D.2) and (D.3). The following subsections follow [57, Section C.2] closely.

D.1.1 Resolvent and Local Law

We can write 𝑾=𝑭𝑭\bm{W}=\bm{F}\bm{F}^{\top} where 𝑭p×n\bm{F}\in\mathbb{R}^{p\times n} is given by

𝑭:=n1/2[(𝚲(1))1/2𝑽𝒁(1),(𝚲(2))1/2𝑽𝒁(2)].\bm{F}:=n^{-1/2}[(\bm{\Lambda}^{(1)})^{1/2}\bm{V}^{\top}\bm{Z}^{(1)\top},(\bm{\Lambda}^{(2)})^{1/2}\bm{V}^{\top}\bm{Z}^{(2)\top}]. (D.4)
Definition D.1 (Self-adjoint linearization and resolvent).

We define the following (p+n)×(p+n)(p+n)\times(p+n) symmetric block matrix

𝑯:=(𝟎𝑭𝑭𝟎)\bm{H}:=\begin{pmatrix}\bm{0}&\bm{F}\\ \bm{F}^{\top}&\bm{0}\end{pmatrix} (D.5)

and its resolvent as

𝑮(z):=[𝑯(z𝑰p𝟎𝟎𝑰n)]1,z\bm{G}(z):=\left[\bm{H}-\begin{pmatrix}z\bm{I}_{p}&\bm{0}\\ \bm{0}&\bm{I}_{n}\end{pmatrix}\right]^{-1},z\in\mathbb{C} (D.6)

as long as the inverse exists. We further define the following (weighted) partial traces

m01(z):=1pi0λi(1)Gii(z),\displaystyle m_{01}(z):=\frac{1}{p}\sum_{i\in\mathcal{I}_{0}}\lambda_{i}^{(1)}G_{ii}(z), m02(z):=1pi0λi(2)Gii(z),\displaystyle m_{02}(z):=\frac{1}{p}\sum_{i\in\mathcal{I}_{0}}\lambda_{i}^{(2)}G_{ii}(z), (D.7)
m1(z):=1n1μ1Gμμ(z),\displaystyle m_{1}(z):=\frac{1}{n_{1}}\sum_{\mu\in\mathcal{I}_{1}}G_{\mu\mu}(z), m2(z):=1n2ν2Gνν(z),\displaystyle m_{2}(z):=\frac{1}{n_{2}}\sum_{\nu\in\mathcal{I}_{2}}G_{\nu\nu}(z),

where i,i=0,1,2\mathcal{I}_{i},i=0,1,2, are index sets defined as

0:=[[1,p]],1:=[[p+1,p+n1]],2:=[[p+n1+1,p+n1+n2]].\mathcal{I}_{0}:=[\![1,p]\!],\ \ \mathcal{I}_{1}:=[\![p+1,p+n_{1}]\!],\ \ \mathcal{I}_{2}:=[\![p+n_{1}+1,p+n_{1}+n_{2}]\!].

We will consistently use i,j0i,j\in\mathcal{I}_{0} and μ,ν12\mu,\nu\in\mathcal{I}_{1}\cup\mathcal{I}_{2}. Correspondingly, the indices are labeled as

𝒁(1)=[Zμi(1):i0,μ1],𝒁(2)=[Zνi(2):i0,ν2].\bm{Z}^{(1)}=\left[Z_{\mu i}^{(1)}:i\in\mathcal{I}_{0},\mu\in\mathcal{I}_{1}\right],\ \ \bm{Z}^{(2)}=\left[Z_{\nu i}^{(2)}:i\in\mathcal{I}_{0},\nu\in\mathcal{I}_{2}\right].

We further define the set of all indices as :=012\mathcal{I}:=\mathcal{I}_{0}\cup\mathcal{I}_{1}\cup\mathcal{I}_{2}, and label indices as 𝔞,𝔟,𝔠\mathfrak{a},\mathfrak{b},\mathfrak{c}, etc.

Using the Schur complement formula, (D.6) becomes

𝑮(z)=((𝑭𝑭z𝑰)1(𝑭𝑭z𝑰)1𝑭𝑭(𝑭𝑭z𝑰)1z(𝑭𝑭z𝑰)1)\bm{G}(z)=\begin{pmatrix}(\bm{F}\bm{F}^{\top}-z\bm{I})^{-1}&(\bm{F}\bm{F}^{\top}-z\bm{I})^{-1}\bm{F}\\ \bm{F}^{\top}(\bm{F}\bm{F}^{\top}-z\bm{I})^{-1}&z(\bm{F}^{\top}\bm{F}-z\bm{I})^{-1}\end{pmatrix} (D.8)

Notice that the upper-left block of 𝑮\bm{G} is directly related to the resolvent of 𝑾=𝑭𝑭\bm{W}=\bm{F}\bm{F}^{\top} that we are interested in. However, rather than studying (𝑭𝑭z𝑰)1(\bm{F}\bm{F}^{\top}-z\bm{I})^{-1} directly, we work with 𝑯\bm{H}, which depends linearly on 𝒁(1),𝒁(2)\bm{Z}^{(1)},\bm{Z}^{(2)} and is therefore easier to analyze.
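As a quick concrete check of the block identity (D.8), the following minimal sketch (Python; the dimensions and the test point zz are arbitrary illustrative choices) builds the linearization (D.5)–(D.6) for a random 𝑭\bm{F} and verifies that its four blocks agree with the Schur complement formula.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, z = 6, 4, -0.3 + 0.1j       # small, arbitrary sizes and test point
F = rng.standard_normal((p, n))

# Linearization (D.5) and its resolvent (D.6)
H = np.block([[np.zeros((p, p)), F], [F.T, np.zeros((n, n))]])
shift = np.block([[z * np.eye(p), np.zeros((p, n))],
                  [np.zeros((n, p)), np.eye(n)]])
G = np.linalg.inv(H - shift)

# The four blocks predicted by the Schur complement formula (D.8)
R = np.linalg.inv(F @ F.T - z * np.eye(p))
assert np.allclose(G[:p, :p], R)
assert np.allclose(G[:p, p:], R @ F)
assert np.allclose(G[p:, :p], F.T @ R)
assert np.allclose(G[p:, p:], z * np.linalg.inv(F.T @ F - z * np.eye(n)))
print("block identity (D.8) verified")
```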

We now define the limit of 𝑮\bm{G} as

𝔊(z):=([a1(z)𝚲(1)+a2(z)𝚲(2)z]1000nn1a1(z)𝑰n1000nn2a2(z)𝑰n2)\mathfrak{G}(z):=\begin{pmatrix}[a_{1}(z)\bm{\Lambda}^{(1)}+a_{2}(z)\bm{\Lambda}^{(2)}-z]^{-1}&0&0\\ 0&-\frac{n}{n_{1}}a_{1}(z)\bm{I}_{n_{1}}&0\\ 0&0&-\frac{n}{n_{2}}a_{2}(z)\bm{I}_{n_{2}}\end{pmatrix} (D.9)

where a1(z)a_{1}(z) and a2(z)a_{2}(z) are the unique solution to the following system of equations (we omit the dependence on zz for conciseness):

a1+a2\displaystyle a_{1}+a_{2} =1γa1λ(1)+a2λ(2)a1λ(1)+a2λ(2)z𝑑H^p(λ(1),λ(2)),\displaystyle=1-\gamma\int\frac{a_{1}\lambda^{(1)}+a_{2}\lambda^{(2)}}{a_{1}\lambda^{(1)}+a_{2}\lambda^{(2)}-z}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}), (D.10)
a1\displaystyle a_{1} =n1nγa1λ(1)a1λ(1)+a2λ(2)z𝑑H^p(λ(1),λ(2)),\displaystyle=\frac{n_{1}}{n}-\gamma\int\frac{a_{1}\lambda^{(1)}}{a_{1}\lambda^{(1)}+a_{2}\lambda^{(2)}-z}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}),

such that Im(a1)0\textnormal{Im}(a_{1})\leq 0 and Im(a2)0\textnormal{Im}(a_{2})\leq 0 whenever Im(z)>0\textnormal{Im}(z)>0. The existence and uniqueness of the solution to the above system will be proved in Lemma D.2.
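To make the deterministic equivalent concrete, here is a minimal numerical sketch (Python). The Gaussian entries, diagonal covariances, dimensions, and ridge level are illustrative assumptions; it solves the z=λz=-\lambda version of (D.10) (stated as (D.15) below) by a rearranged, damped fixed-point iteration, whose convergence is assumed rather than proved here, and compares the resulting deterministic equivalent from (D.9) with the empirical resolvent of 𝑾\bm{W} from (D.1).

```python
import numpy as np

rng = np.random.default_rng(1)
p, n1, n2 = 500, 200, 100          # illustrative dimensions with p > n = n1 + n2
n = n1 + n2
gamma, r1, r2 = p / n, n1 / n, n2 / n
lam = 0.5                          # ridge level; z = -lam

# Simultaneously diagonalizable covariances: (l1[i], l2[i]) plays (lambda^(1), lambda^(2)).
l1 = rng.uniform(0.5, 2.0, p)
l2 = rng.uniform(0.5, 2.0, p)

# Solve the z = -lam system by a rearranged, damped fixed-point iteration
# (equivalent to (D.15); convergence of this scheme is assumed, not proved).
a1, a2 = r1, r2
for _ in range(500):
    denom = a1 * l1 + a2 * l2 + lam
    a1_new = r1 / (1 + gamma * np.mean(l1 / denom))
    a2_new = r2 / (1 + gamma * np.mean(l2 / denom))
    a1, a2 = 0.5 * (a1 + a1_new), 0.5 * (a2 + a2_new)

# Compare the upper-left block of the deterministic equivalent (D.9) with the
# empirical resolvent of W from (D.1), taking V = I for Gaussian designs (cf. (D.34)).
Z1 = rng.standard_normal((n1, p))
Z2 = rng.standard_normal((n2, p))
W = (np.sqrt(l1)[:, None] * (Z1.T @ Z1 / n) * np.sqrt(l1)[None, :]
     + np.sqrt(l2)[:, None] * (Z2.T @ Z2 / n) * np.sqrt(l2)[None, :])
emp = np.trace(np.linalg.inv(W + lam * np.eye(p))) / p
det = np.mean(1.0 / (a1 * l1 + a2 * l2 + lam))
print(emp, det)   # the two normalized traces should agree up to small fluctuations
```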

We now state our main result, Theorem D.1, which characterizes the convergence (rate) of 𝑮(z)\bm{G}(z) to 𝔊(z)\mathfrak{G}(z) at z=λz=-\lambda. This theorem is a more general member of the family of anisotropic local laws [33]. Specifically, it extends [57, Theorem C.4] to allow a general 𝚺(2)\bm{\Sigma}^{(2)} and to provide control at z=λ0z=-\lambda\neq 0. For a fixed small constant τ>0\tau>0, we define a domain of zz as

𝑫𝑫(τ):={z=E+iη+:|z+λ|(logp)1λ}.\bm{D}\equiv\bm{D}(\tau):=\{z=E+i\eta\in\mathbb{C}_{+}:|z+\lambda|\leq(\log p)^{-1}\lambda\}. (D.11)
Theorem D.1.

Suppose Assumption 1 holds, and further that 𝚺(1),𝚺(2)\bm{\Sigma}^{(1)},\bm{\Sigma}^{(2)} are simultaneously diagonalizable with joint spectral distribution defined in (4.1). Suppose also that 𝐙(1)\bm{Z}^{(1)} and 𝐙(2)\bm{Z}^{(2)} satisfy the bounded support condition (B.1) with Q=p2/φQ=p^{2/\varphi}. Then the following local laws hold for λ>p1/7+ϵ\lambda>p^{-1/7+\epsilon} with a small constant ϵ\epsilon:

  1. (i)

    Averaged local law: We have

    supz𝑫|p1i0λi(k)[Gii(z)𝔊ii(z)]|p1λ7Q,k=1,2.\displaystyle\sup_{z\in\bm{D}}\left|p^{-1}\sum_{i\in\mathcal{I}_{0}}\lambda_{i}^{(k)}[G_{ii}(z)-\mathfrak{G}_{ii}(z)]\right|\prec p^{-1}\lambda^{-7}Q,\ \ k=1,2. (D.12)
  2. (ii)

    Anisotropic local law: For any deterministic unit vectors 𝒖,𝒗p+n\bm{u},\bm{v}\in\mathbb{R}^{p+n}, we have

    supz𝑫|𝒖[𝑮(z)𝔊(z)]𝒗|p1/2λ3Q.\sup_{z\in\bm{D}}\left|\bm{u}^{\top}[\bm{G}(z)-\mathfrak{G}(z)]\bm{v}\right|\prec p^{-1/2}\lambda^{-3}Q. (D.13)

The rest of this section is devoted to the proof of Theorem D.1. We focus on the case λ(p1/7+ϵ,1)\lambda\in(p^{-1/7+\epsilon},1), since the case λ1\lambda\geq 1 follows trivially by dropping all negative powers of λ\lambda in the proof below.

D.1.2 Self-consistent equations

In this subsection, we show that the self-consistent equations (D.10) have a unique solution for any z𝑫z\in\bm{D}. For simplicity, we further define

r1:=n1/n,r2:=n2/n.r_{1}:=n_{1}/n,r_{2}:=n_{2}/n. (D.14)

We first show that (D.10) has a unique solution at z=λz=-\lambda, where (D.10) reduces to the following system (equivalent to the first two equations of (A.17)):

f(a1,a2)\displaystyle f(a_{1},a_{2}) :=a1r1+γa1λ(1)a1λ(1)+a2λ(2)+λ𝑑H^p(λ(1),λ(2))=0,\displaystyle:=a_{1}-r_{1}+\gamma\int\frac{a_{1}\lambda^{(1)}}{a_{1}\lambda^{(1)}+a_{2}\lambda^{(2)}+\lambda}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})=0, (D.15)
g(a1,a2)\displaystyle g(a_{1},a_{2}) :=a2r2+γa2λ(2)a1λ(1)+a2λ(2)+λ𝑑H^p(λ(1),λ(2))=0.\displaystyle:=a_{2}-r_{2}+\gamma\int\frac{a_{2}\lambda^{(2)}}{a_{1}\lambda^{(1)}+a_{2}\lambda^{(2)}+\lambda}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})=0.

Focusing on ff, consider any fixed a2>0a_{2}>0. We know

f(Cλ,a2)>r1+γCλλ(1)Cλλ(1)+a2λ(2)+λ𝑑H^p(λ(1),λ(2))>0,f(C\lambda,a_{2})>-r_{1}+\gamma\int\frac{C\lambda\lambda^{(1)}}{C\lambda\lambda^{(1)}+a_{2}\lambda^{(2)}+\lambda}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})>0,

if CC is large enough,

f(cmin{λ,r1},a2)<cmin{λ,r1}r1+γcλλ(1)λ𝑑H^p(λ(1),λ(2))<0,f(c\min\{\lambda,r_{1}\},a_{2})<c\min\{\lambda,r_{1}\}-r_{1}+\gamma\int\frac{c\lambda\lambda^{(1)}}{\lambda}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})<0,

if cc is small enough, and

a1f(a1,a2)=1+γλ(1)(a2λ(2)+λ)(a1λ(1)+a2λ(2)+λ)2𝑑H^p(λ(1),λ(2))>0.\frac{\partial}{\partial a_{1}}f(a_{1},a_{2})=1+\gamma\int\frac{\lambda^{(1)}(a_{2}\lambda^{(2)}+\lambda)}{(a_{1}\lambda^{(1)}+a_{2}\lambda^{(2)}+\lambda)^{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})>0.

for any a1>0a_{1}>0. Together with continuity, this shows that, given a2a_{2}, there exists a unique a1(cmin{λ,r1},Cλ)a_{1}\in(c\min\{\lambda,r_{1}\},C\lambda) such that f(a1,a2)=0f(a_{1},a_{2})=0. Define this root as

a1:=χ(a2;f).a_{1}:=\chi(a_{2};f).

By the implicit function theorem, the function χ\chi is continuously differentiable, and

χ(a2;f)+γχ(a2;f)λ(1)(a2λ(2)+λ)χ(a2;f)λ(1)λ(2)(a1λ(1)+a2λ(2)+λ)2𝑑H^p(λ(1),λ(2))=0,\chi^{\prime}(a_{2};f)+\gamma\int\frac{\chi^{\prime}(a_{2};f)\lambda^{(1)}(a_{2}\lambda^{(2)}+\lambda)-\chi(a_{2};f)\lambda^{(1)}\lambda^{(2)}}{(a_{1}\lambda^{(1)}+a_{2}\lambda^{(2)}+\lambda)^{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})=0,

which, after straightforward rearranging, yields χ(a2;f)>0\chi^{\prime}(a_{2};f)>0.

Now focus on g(χ(a2;f),a2)g(\chi(a_{2};f),a_{2}). Since χ(a2;f)(cmin{λ,r1},Cλ)\chi(a_{2};f)\in(c\min\{\lambda,r_{1}\},C\lambda) for any a2>0a_{2}>0, similar to our derivation above, we have

g(χ(cmin{λ,r2};f),cmin{λ,r2})<0,g(χ(Cλ;f),Cλ)>0g(\chi(c\min\{\lambda,r_{2}\};f),c\min\{\lambda,r_{2}\})<0,g(\chi(C\lambda;f),C\lambda)>0

for cc small enough and CC large enough, and dda2g(χ(a2;f),a2)>0\frac{\textnormal{d}}{\textnormal{d}a_{2}}g(\chi(a_{2};f),a_{2})>0. Therefore there exists a unique a2(cmin{λ,r2},Cλ)a_{2}\in(c\min\{\lambda,r_{2}\},C\lambda) such that g(χ(a2;f),a2)=0g(\chi(a_{2};f),a_{2})=0. Hence, there exists a unique pair of positive (a1,a2)(a_{1},a_{2}) satisfying (D.15), such that

(a1,a2)(cmin{λ,r1},Cλ)×(cmin{λ,r2},Cλ)(a_{1},a_{2})\in(c\min\{\lambda,r_{1}\},C\lambda)\times(c\min\{\lambda,r_{2}\},C\lambda) (D.16)

Next, we prove the existence and uniqueness of the solution to (D.10) for a general z𝑫z\in\bm{D}, where 𝑫\bm{D} is defined by (D.11), as stated in the following lemma:

Lemma D.2.

There exist constants c,c1,C>0c,c_{1},C>0 such that the following statements hold. There exists a unique solution (a1(z),a2(z))(a_{1}(z),a_{2}(z)) of (D.10) under the conditions

|z+λ|cλ,|a1(λ)a1(z)|+|a2(λ)a2(z)|c1λ.|z+\lambda|\leq c\lambda,|a_{1}(-\lambda)-a_{1}(z)|+|a_{2}(-\lambda)-a_{2}(z)|\leq c_{1}\lambda. (D.17)

Moreover, the solution satisfies

|a1(λ)a1(z)|+|a2(λ)a2(z)|C|z+λ|.|a_{1}(-\lambda)-a_{1}(z)|+|a_{2}(-\lambda)-a_{2}(z)|\leq C|z+\lambda|. (D.18)
Proof.

The proof is based on the contraction principle. Define the more general form of (D.15) as

fz(a1,a2)\displaystyle f_{z}(a_{1},a_{2}) :=a1r1+γa1λ(1)a1λ(1)+a2λ(2)z𝑑H^p(λ(1),λ(2))=0,\displaystyle:=a_{1}-r_{1}+\gamma\int\frac{a_{1}\lambda^{(1)}}{a_{1}\lambda^{(1)}+a_{2}\lambda^{(2)}-z}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})=0, (D.19)
gz(a1,a2)\displaystyle g_{z}(a_{1},a_{2}) :=a2r2+γa2λ(2)a1λ(1)+a2λ(2)z𝑑H^p(λ(1),λ(2))=0.\displaystyle:=a_{2}-r_{2}+\gamma\int\frac{a_{2}\lambda^{(2)}}{a_{1}\lambda^{(1)}+a_{2}\lambda^{(2)}-z}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})=0.

Define the vector

𝑭z(a1,a2):=[fz(a1,a2),gz(a1,a2)].\bm{F}_{z}(a_{1},a_{2}):=[f_{z}(a_{1},a_{2}),g_{z}(a_{1},a_{2})].

Denote the (potential) solution to 𝑭=𝟎\bm{F}=\bm{0} as 𝒂(z):=[a1(z),a2(z)]\bm{a}(z):=[a_{1}(z),a_{2}(z)]. By definition, we know 𝑭z(𝒂(z))=0\bm{F}_{z}(\bm{a}(z))=0. Therefore, defining 𝜹(z):=𝒂(z)𝒂(λ)\bm{\delta}(z):=\bm{a}(z)-\bm{a}(-\lambda) we have the following identity:

𝜹(z)\displaystyle\bm{\delta}(z)\equiv 𝑱z(𝒂(λ))1(𝑭z(𝒂(λ))𝑭λ(𝒂(λ)))\displaystyle-\bm{J}_{z}(\bm{a}(-\lambda))^{-1}\left(\bm{F}_{z}(\bm{a}(-\lambda))-\bm{F}_{-\lambda}(\bm{a}(-\lambda))\right)
𝑱z(𝒂(λ))1(𝑭z(𝒂(λ)+𝜹(z))𝑭z(𝒂(λ))𝑱z(𝒂(λ))𝜹(z)),\displaystyle-\bm{J}_{z}(\bm{a}(-\lambda))^{-1}\left(\bm{F}_{z}(\bm{a}(-\lambda)+\bm{\delta}(z))-\bm{F}_{z}(\bm{a}(-\lambda))-\bm{J}_{z}(\bm{a}(-\lambda))\bm{\delta}(z)\right),

where 𝑱z(𝒂)\bm{J}_{z}(\bm{a}) is the Jacobian of 𝑭z\bm{F}_{z} at 𝒂\bm{a}:

𝑱z(a1,a2)\displaystyle\bm{J}_{z}(a_{1},a_{2}) :=(1fz(a1,a2)2fz(a1,a2)1gz(a1,a2)2gz(a1,a2))\displaystyle:=\begin{pmatrix}\partial_{1}f_{z}(a_{1},a_{2})&\partial_{2}f_{z}(a_{1},a_{2})\\ \partial_{1}g_{z}(a_{1},a_{2})&\partial_{2}g_{z}(a_{1},a_{2})\end{pmatrix} (D.20)
=(1+γλ(1)(a2λ(2)z)(a1λ(1)+a2λ(2)z)2𝑑H^p(λ(1),λ(2))γa1λ(1)λ(2)(a1λ(1)+a2λ(2)z)2𝑑H^p(λ(1),λ(2))γa2λ(1)λ(2)(a1λ(1)+a2λ(2)z)2𝑑H^p(λ(1),λ(2))1+γλ(2)(a1λ(1)z)(a1λ(1)+a2λ(2)z)2𝑑H^p(λ(1),λ(2)))\displaystyle=\begin{pmatrix}1+\gamma\int\frac{\lambda^{(1)}(a_{2}\lambda^{(2)}-z)}{(a_{1}\lambda^{(1)}+a_{2}\lambda^{(2)}-z)^{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})&-\gamma\int\frac{a_{1}\lambda^{(1)}\lambda^{(2)}}{(a_{1}\lambda^{(1)}+a_{2}\lambda^{(2)}-z)^{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})\\ -\gamma\int\frac{a_{2}\lambda^{(1)}\lambda^{(2)}}{(a_{1}\lambda^{(1)}+a_{2}\lambda^{(2)}-z)^{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})&1+\gamma\int\frac{\lambda^{(2)}(a_{1}\lambda^{(1)}-z)}{(a_{1}\lambda^{(1)}+a_{2}\lambda^{(2)}-z)^{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})\\ \end{pmatrix}
:=(1+xˇ2(z)+yˇ1(z)xˇ1(z)xˇ2(z)1+xˇ1(z)+yˇ2(z)),\displaystyle:=\begin{pmatrix}1+\check{x}_{2}(z)+\check{y}_{1}(z)&-\check{x}_{1}(z)\\ -\check{x}_{2}(z)&1+\check{x}_{1}(z)+\check{y}_{2}(z)\\ \end{pmatrix},

where for k=1,2k=1,2, k\partial_{k} denotes the partial derivative with respect to the kk-th argument, and

xˇk(z):=γakλ(1)λ(2)(a1λ(1)+a2λ(2)z)2𝑑H^p(λ(1),λ(2)),\displaystyle\check{x}_{k}(z):=\gamma\int\frac{a_{k}\lambda^{(1)}\lambda^{(2)}}{(a_{1}\lambda^{(1)}+a_{2}\lambda^{(2)}-z)^{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}),
yˇk(z):=γλ(k)z(a1λ(1)+a2λ(2)z)2𝑑H^p(λ(1),λ(2)).\displaystyle\check{y}_{k}(z):=-\gamma\int\frac{\lambda^{(k)}z}{(a_{1}\lambda^{(1)}+a_{2}\lambda^{(2)}-z)^{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}).

Notice that we omit the dependence on λ-\lambda for simplicity and without confusion, as only 𝒂(λ)\bm{a}(-\lambda) appears from now on. Motivated by the identity above, we define the iterative sequence 𝜹(k)(z)\bm{\delta}^{(k)}(z) such that 𝜹(0)(z)=𝟎\bm{\delta}^{(0)}(z)=\bm{0} and

𝜹(k+1)(z):=𝒉z(𝜹(k)(z))\displaystyle\bm{\delta}^{(k+1)}(z):=\bm{h}_{z}(\bm{\delta}^{(k)}(z)) (D.21)
=\displaystyle= 𝑱z(𝒂(λ))1(𝑭z(𝒂(λ))𝑭λ(𝒂(λ)))\displaystyle-\bm{J}_{z}(\bm{a}(-\lambda))^{-1}\left(\bm{F}_{z}(\bm{a}(-\lambda))-\bm{F}_{-\lambda}(\bm{a}(-\lambda))\right)
𝑱z(𝒂(λ))1(𝑭z(𝒂(λ)+𝜹(k)(z))𝑭z(𝒂(λ))𝑱z(𝒂(λ))𝜹(k)(z)).\displaystyle-\bm{J}_{z}(\bm{a}(-\lambda))^{-1}\left(\bm{F}_{z}(\bm{a}(-\lambda)+\bm{\delta}^{(k)}(z))-\bm{F}_{z}(\bm{a}(-\lambda))-\bm{J}_{z}(\bm{a}(-\lambda))\bm{\delta}^{(k)}(z)\right).

We want to show that 𝒉z\bm{h}_{z} is a contractive self-mapping. Let zˇ:=z+λ\check{z}:=z+\lambda. We first deal with the first summand of (D.21). Since |zˇ|cλ|\check{z}|\leq c\lambda, we have

|zˇxˇ1(z)|\displaystyle\left|\frac{\partial}{\partial\check{z}}\check{x}_{1}(z)\right|
=\displaystyle= |2γa1(z)λ(1)λ(2)(a1(z)λ(1)+a2(z)λ(2)+λzˇ)3𝑑H^p(λ(1),λ(2))|\displaystyle\left|2\gamma\int\frac{a_{1}(z)\lambda^{(1)}\lambda^{(2)}}{(a_{1}(z)\lambda^{(1)}+a_{2}(z)\lambda^{(2)}+\lambda-\check{z})^{3}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})\right|
\displaystyle\leq 2γ|a1(z)λ(1)λ(2)(a1(z)λ(1)+a2(z)λ(2)+λzˇ)3|𝑑H^p(λ(1),λ(2))\displaystyle 2\gamma\int\left|\frac{a_{1}(z)\lambda^{(1)}\lambda^{(2)}}{(a_{1}(z)\lambda^{(1)}+a_{2}(z)\lambda^{(2)}+\lambda-\check{z})^{3}}\right|d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})
=\displaystyle= O(λ2)\displaystyle O(\lambda^{-2})

if cc is small enough. Further, a corollary of (D.16) yields

xˇ1(λ)\displaystyle\check{x}_{1}(-\lambda) :=γa1λ(1)λ(2)(a1λ(1)+a2λ(2)+λ)2𝑑H^p(λ(1),λ(2))=Θ(λ1),\displaystyle:=\gamma\int\frac{a_{1}\lambda^{(1)}\lambda^{(2)}}{(a_{1}\lambda^{(1)}+a_{2}\lambda^{(2)}+\lambda)^{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})=\Theta(\lambda^{-1}),

which, together with the equation above and the fact that |zˇ|cλ|\check{z}|\leq c\lambda, yields

|xˇ1(λ)xˇ1(z)|=O(cλ1)|\check{x}_{1}(-\lambda)-\check{x}_{1}(z)|=O(c\lambda^{-1})

Using similar derivations for xˇ2,yˇ1,yˇ2\check{x}_{2},\check{y}_{1},\check{y}_{2}, we obtain that for k=1,2k=1,2,

xˇk(λ)=Θ(λ1),yˇk(λ)=Θ(λ1),\displaystyle\check{x}_{k}(-\lambda)=\Theta(\lambda^{-1}),\quad\check{y}_{k}(-\lambda)=\Theta(\lambda^{-1}), (D.22)
|xˇk(λ)xˇk(z)|=O(cλ1),|yˇk(λ)yˇk(z)|=O(cλ1).\displaystyle|\check{x}_{k}(-\lambda)-\check{x}_{k}(z)|=O(c\lambda^{-1}),\ \ |\check{y}_{k}(-\lambda)-\check{y}_{k}(z)|=O(c\lambda^{-1}).

Let λˇ±\check{\lambda}_{\pm} denote the two eigenvalues of 𝑱\bm{J}. Since 𝑱\bm{J} is a 2×22\times 2 matrix, it is easy to check that

λˇ±(𝑱z(𝒂(λ)))\displaystyle\check{\lambda}_{\pm}(\bm{J}_{z}(\bm{a}(-\lambda))) :=2+xˇ1(z)+xˇ2(z)+yˇ1(z)+yˇ2(z)2\displaystyle:=\frac{2+\check{x}_{1}(z)+\check{x}_{2}(z)+\check{y}_{1}(z)+\check{y}_{2}(z)}{2}
±(xˇ2(z)+yˇ1(z)xˇ1(z)yˇ2(z))2+4xˇ1(z)xˇ2(z)2.\displaystyle\pm\frac{\sqrt{(\check{x}_{2}(z)+\check{y}_{1}(z)-\check{x}_{1}(z)-\check{y}_{2}(z))^{2}+4\check{x}_{1}(z)\check{x}_{2}(z)}}{2}.

At z=λz=-\lambda, we know

(xˇ1(z)+xˇ2(z)+yˇ1(z)+yˇ2(z))2(xˇ2(z)+yˇ1(z)xˇ1(z)yˇ2(z))24xˇ1(z)xˇ2(z)\displaystyle(\check{x}_{1}(z)+\check{x}_{2}(z)+\check{y}_{1}(z)+\check{y}_{2}(z))^{2}-(\check{x}_{2}(z)+\check{y}_{1}(z)-\check{x}_{1}(z)-\check{y}_{2}(z))^{2}-4\check{x}_{1}(z)\check{x}_{2}(z)
=4xˇ1(λ)yˇ1(λ)+4xˇ2(λ)yˇ2(λ)+4yˇ1(λ)yˇ2(λ)=Θ(λ2),\displaystyle=4\check{x}_{1}(-\lambda)\check{y}_{1}(-\lambda)+4\check{x}_{2}(-\lambda)\check{y}_{2}(-\lambda)+4\check{y}_{1}(-\lambda)\check{y}_{2}(-\lambda)=\Theta(\lambda^{-2}),

and xˇ1(z)+xˇ2(z)+yˇ1(z)+yˇ2(z)=Θ(λ1)\check{x}_{1}(z)+\check{x}_{2}(z)+\check{y}_{1}(z)+\check{y}_{2}(z)=\Theta(\lambda^{-1}). Therefore we obtain

λˇ±(𝑱z(𝒂(λ)))=Θ(λ1).\check{\lambda}_{\pm}(\bm{J}_{z}(\bm{a}(-\lambda)))=\Theta(\lambda^{-1}).

Now consider a general |zˇ|cλ|\check{z}|\leq c\lambda. By (D.22), it is straightforward to derive that

|λˇ±(𝑱z(𝒂(λ)))λˇ±(𝑱λ(𝒂(λ)))|=O(cλ1).|\check{\lambda}_{\pm}(\bm{J}_{z}(\bm{a}(-\lambda)))-\check{\lambda}_{\pm}(\bm{J}_{-\lambda}(\bm{a}(-\lambda)))|=O(c\lambda^{-1}).

Therefore we have

|λˇ±(𝑱z(𝒂(λ)))||λˇ±(𝑱λ(𝒂(λ)))|O(cλ1)=Θ(λ1)O(cλ1)=Θ(λ1),|\check{\lambda}_{\pm}(\bm{J}_{z}(\bm{a}(-\lambda)))|\geq|\check{\lambda}_{\pm}(\bm{J}_{-\lambda}(\bm{a}(-\lambda)))|-O(c\lambda^{-1})=\Theta(\lambda^{-1})-O(c\lambda^{-1})=\Theta(\lambda^{-1}),

if cc is small enough, which yields

𝑱z(𝒂(λ))1op=O(λ).\|\bm{J}_{z}(\bm{a}(-\lambda))^{-1}\|_{\textnormal{op}}=O(\lambda). (D.23)

Using similar derivations as (D.22), we obtain that for |zˇ|cλ|\check{z}|\leq c\lambda,

zˇfz(𝒂(λ))=O(λ1),zˇgz(𝒂(λ))=O(λ1),\displaystyle\frac{\partial}{\partial\check{z}}f_{z}(\bm{a}(-\lambda))=O(\lambda^{-1}),\qquad\frac{\partial}{\partial\check{z}}g_{z}(\bm{a}(-\lambda))=O(\lambda^{-1}), (D.24)
|fλ(𝒂(λ))fz(𝒂(λ))|=O(λ1|zˇ|),|gλ(𝒂(λ))gz(𝒂(λ))|=O(λ1|zˇ|),\displaystyle|f_{-\lambda}(\bm{a}(-\lambda))-f_{z}(\bm{a}(-\lambda))|=O(\lambda^{-1}|\check{z}|),|g_{-\lambda}(\bm{a}(-\lambda))-g_{z}(\bm{a}(-\lambda))|=O(\lambda^{-1}|\check{z}|),

which yields

𝑭z(𝒂(λ))𝑭λ(𝒂(λ))2=O(λ1|zˇ|)\|\bm{F}_{z}(\bm{a}(-\lambda))-\bm{F}_{-\lambda}(\bm{a}(-\lambda))\|_{2}=O(\lambda^{-1}|\check{z}|) (D.25)

Now we deal with the second term of 𝒉z\bm{h}_{z} in (D.21). Using (D.20) and derivations similar to (D.22), we know there exists a constant c1c_{1} such that for |zˇ|cλ|\check{z}|\leq c\lambda, |δ1|2+|δ2|2c1λ\sqrt{|\delta_{1}|^{2}+|\delta_{2}|^{2}}\leq c_{1}\lambda and kˇ1,kˇ2,kˇ3{1,2}\check{k}_{1},\check{k}_{2},\check{k}_{3}\in\{1,2\},

kˇ1kˇ2fz(a1,a2)=Θ(λ2),kˇ1kˇ2gz(a1,a2)=Θ(λ2),\displaystyle\partial_{\check{k}_{1}\check{k}_{2}}f_{z}(a_{1},a_{2})=\Theta(\lambda^{-2}),\quad\partial_{\check{k}_{1}\check{k}_{2}}g_{z}(a_{1},a_{2})=\Theta(\lambda^{-2}), (D.26)
kˇ1kˇ2kˇ3fz(a1+δ1,a2+δ2)=O(λ3),kˇ1kˇ2kˇ3gz(a1+δ1,a2+δ2)=O(λ3),\displaystyle\partial_{\check{k}_{1}\check{k}_{2}\check{k}_{3}}f_{z}(a_{1}+\delta_{1},a_{2}+\delta_{2})=O(\lambda^{-3}),\quad\partial_{\check{k}_{1}\check{k}_{2}\check{k}_{3}}g_{z}(a_{1}+\delta_{1},a_{2}+\delta_{2})=O(\lambda^{-3}),

As a corollary, for 𝜹1,𝜹2\bm{\delta}_{1},\bm{\delta}_{2} satisfying 𝜹12c1λ,𝜹22c1λ\|\bm{\delta}_{1}\|_{2}\leq c_{1}\lambda,\|\bm{\delta}_{2}\|_{2}\leq c_{1}\lambda, the difference of the second term of 𝒉z\bm{h}_{z} reads

𝑱z(𝒂(λ))1(𝑭z(𝒂(λ)+𝜹1)𝑭z(𝒂(λ)+𝜹2))(𝜹1𝜹2)2\displaystyle\left\|\bm{J}_{z}(\bm{a}(-\lambda))^{-1}\left(\bm{F}_{z}(\bm{a}(-\lambda)+\bm{\delta}_{1})-\bm{F}_{z}(\bm{a}(-\lambda)+\bm{\delta}_{2})\right)-(\bm{\delta}_{1}-\bm{\delta}_{2})\right\|_{2} (D.27)
\displaystyle\leq 𝑱z(𝒂(λ))1op𝑭z(𝒂(λ)+𝜹1)𝑭z(𝒂(λ)+𝜹2)𝑱z(𝒂(λ))(𝜹1𝜹2)2\|\bm{J}_{z}(\bm{a}(-\lambda))^{-1}\|_{\textnormal{op}}\cdot\|\bm{F}_{z}(\bm{a}(-\lambda)+\bm{\delta}_{1})-\bm{F}_{z}(\bm{a}(-\lambda)+\bm{\delta}_{2})-\bm{J}_{z}(\bm{a}(-\lambda))(\bm{\delta}_{1}-\bm{\delta}_{2})\|_{2}
=\displaystyle= O(λ)𝜹1𝜹22O(c1λ1)\displaystyle O(\lambda)\cdot\|\bm{\delta}_{1}-\bm{\delta}_{2}\|_{2}\cdot O(c_{1}\lambda^{-1})
=\displaystyle= O(c1𝜹1𝜹22).\displaystyle O(c_{1}\|\bm{\delta}_{1}-\bm{\delta}_{2}\|_{2}).

Consequently, for 𝜹2c1λ\|\bm{\delta}\|_{2}\leq c_{1}\lambda, plugging in 𝜹1=𝜹\bm{\delta}_{1}=\bm{\delta} and 𝜹2=𝟎\bm{\delta}_{2}=\bm{0} to the equation above and combining (D.23), (D.25), we obtain that

𝒉z(𝜹)2O(|zˇ|)+O(c1𝜹2)=O(cλ)+O(c12λ)c1λ\|\bm{h}_{z}(\bm{\delta})\|_{2}\leq O(|\check{z}|)+O(c_{1}\|\bm{\delta}\|_{2})=O(c\lambda)+O(c_{1}^{2}\lambda)\leq c_{1}\lambda

if we first choose c1c_{1} small enough such that O(c12λ)12c1λO(c_{1}^{2}\lambda)\leq\frac{1}{2}c_{1}\lambda and then choose cc small enough such that O(c)12c1O(c)\leq\frac{1}{2}c_{1}.

Therefore 𝒉z:Bc1Bc1\bm{h}_{z}:B_{c_{1}}\rightarrow B_{c_{1}} is a self-mapping on the ball Bc1:={𝜹2:𝜹2c1λ}B_{c_{1}}:=\{\bm{\delta}\in\mathbb{C}^{2}:\|\bm{\delta}\|_{2}\leq c_{1}\lambda\}. Moreover, from (D.21) and (D.27), we know

𝜹(k+1)𝜹(k)2O(c1𝜹(k)𝜹(k1)2)12𝜹(k)𝜹(k1)2\|\bm{\delta}^{(k+1)}-\bm{\delta}^{(k)}\|_{2}\leq O(c_{1}\|\bm{\delta}^{(k)}-\bm{\delta}^{(k-1)}\|_{2})\leq\frac{1}{2}\|\bm{\delta}^{(k)}-\bm{\delta}^{(k-1)}\|_{2}

for the c1c_{1} chosen. This means that 𝒉z\bm{h}_{z} restricted to Bc1B_{c_{1}} is a contraction. By the contraction mapping theorem, 𝜹:=limk𝜹(k)\bm{\delta}^{*}:=\lim_{k\rightarrow\infty}\bm{\delta}^{(k)} exists and 𝒂(λ)+𝜹\bm{a}(-\lambda)+\bm{\delta}^{*} is the unique solution to the equation 𝑭z(𝒂(z))=0\bm{F}_{z}(\bm{a}(z))=0 subject to the condition 𝜹2c1λ\|\bm{\delta}\|_{2}\leq c_{1}\lambda. (D.17) is then obtained by noticing the final fact that for all δˇ1,δˇ2\check{\delta}_{1},\check{\delta}_{2},

|δˇ1|+|δˇ2|2(|δˇ1|2+|δˇ2|2)|\check{\delta}_{1}|+|\check{\delta}_{2}|\leq\sqrt{2(|\check{\delta}_{1}|^{2}+|\check{\delta}_{2}|^{2})} (D.28)

We are now left with proving (D.18). Using (D.23), (D.25) and 𝜹(0)=𝟎\bm{\delta}^{(0)}=\bm{0}, we can obtain from (D.21) that 𝜹(1)2O(|zˇ|)\|\bm{\delta}^{(1)}\|_{2}\leq O(|\check{z}|), which means there exists C1C_{1} such that 𝜹(1)𝜹(0)2C1|zˇ|\|\bm{\delta}^{(1)}-\bm{\delta}^{(0)}\|_{2}\leq C_{1}|\check{z}|. Then by the contraction property, we have

𝜹k=0𝜹(k+1)𝜹(k)22C1|zˇ|,\|\bm{\delta}^{*}\|\leq\sum_{k=0}^{\infty}\|\bm{\delta}^{(k+1)}-\bm{\delta}^{(k)}\|_{2}\leq 2C_{1}|\check{z}|,

which, together with (D.28), yields (D.18). ∎

As a by-product, we obtain the following stability result which is useful for proving Theorem D.1. It roughly says that if we replace the 0’s on the right hand sides of (D.19) with some small errors, the resulting solutions will still be close to a1(z),a2(z)a_{1}(z),a_{2}(z). We further make the following rescaling, which will prove handy in later derivations:

m1c(z):=r11a1(z),m2c(z):=r21a2(z),m_{1c}(z):=-r_{1}^{-1}a_{1}(z),\ \ m_{2c}(z):=-r_{2}^{-1}a_{2}(z), (D.29)

where r1,r2r_{1},r_{2} are defined by (D.14). Combining (D.29) with (D.16) yields

cλm1c(λ)Cλ,cλm2c(λ)Cλc\lambda\leq-m_{1c}(-\lambda)\leq C\lambda,\ \ c\lambda\leq-m_{2c}(-\lambda)\leq C\lambda (D.30)

for some (potentially different from before) constants c,C>0c,C>0.

Lemma D.3.

There exist constants c,c1,C>0c,c_{1},C>0 such that the following statements hold. Suppose |z+λ|cλ|z+\lambda|\leq c\lambda, and m1(z),m2(z)m_{1}(z),m_{2}(z) are analytic functions of zz such that

|m1(z)m1c(λ)|+|m2(z)m2c(λ)|c1λ.|m_{1}(z)-m_{1c}(-\lambda)|+|m_{2}(z)-m_{2c}(-\lambda)|\leq c_{1}\lambda. (D.31)

Moreover, assume that (m1,m2)(m_{1},m_{2}) satisfies the system of equations

1m1+1γλ(1)r1m1λ(1)+r2m2λ(2)+z𝑑H^p(λ(1),λ(2))\displaystyle\frac{1}{m_{1}}+1-\gamma\int\frac{\lambda^{(1)}}{r_{1}m_{1}\lambda^{(1)}+r_{2}m_{2}\lambda^{(2)}+z}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}) =1,\displaystyle=\mathcal{E}_{1}, (D.32)
1m2+1γλ(2)r1m1λ(1)+r2m2λ(2)+z𝑑H^p(λ(1),λ(2))\displaystyle\frac{1}{m_{2}}+1-\gamma\int\frac{\lambda^{(2)}}{r_{1}m_{1}\lambda^{(1)}+r_{2}m_{2}\lambda^{(2)}+z}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}) =2,\displaystyle=\mathcal{E}_{2},

for some (deterministic or random) errors such that |1|+|2|(logn)1/2|\mathcal{E}_{1}|+|\mathcal{E}_{2}|\leq(\log n)^{-1/2}. Then we have

|m1(z)m1c(z)|+|m2(z)m2c(z)|C(|1|+|2|)λ.|m_{1}(z)-m_{1c}(z)|+|m_{2}(z)-m_{2c}(z)|\leq C(|\mathcal{E}_{1}|+|\mathcal{E}_{2}|)\lambda.
Proof.

We connect with Lemma D.2. Notice that a1ϵ(z):=r1m1(z),a2ϵ(z):=r2m2(z)a_{1\epsilon}(z):=-r_{1}m_{1}(z),a_{2\epsilon}(z):=-r_{2}m_{2}(z) satisfy the following:

fz(a1ϵ,a2ϵ)\displaystyle f_{z}(a_{1\epsilon},a_{2\epsilon}) :=a1ϵr1+γa1ϵλ(1)a1ϵλ(1)+a2ϵλ(2)z𝑑H^p(λ(1),λ(2))=ˇ1,\displaystyle:=a_{1\epsilon}-r_{1}+\gamma\int\frac{a_{1\epsilon}\lambda^{(1)}}{a_{1\epsilon}\lambda^{(1)}+a_{2\epsilon}\lambda^{(2)}-z}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})=\check{\mathcal{E}}_{1}, (D.33)
gz(a1ϵ,a2ϵ)\displaystyle g_{z}(a_{1\epsilon},a_{2\epsilon}) :=a2ϵr2+γa2ϵλ(2)a1ϵλ(1)+a2ϵλ(2)z𝑑H^p(λ(1),λ(2))=ˇ2.\displaystyle:=a_{2\epsilon}-r_{2}+\gamma\int\frac{a_{2\epsilon}\lambda^{(2)}}{a_{1\epsilon}\lambda^{(1)}+a_{2\epsilon}\lambda^{(2)}-z}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})=\check{\mathcal{E}}_{2}.

with |ˇ1|+|ˇ2||1|+|2||\check{\mathcal{E}}_{1}|+|\check{\mathcal{E}}_{2}|\lesssim|{\mathcal{E}}_{1}|+|{\mathcal{E}}_{2}|. We then subtract (D.19) from (D.33) and apply the contraction principle to 𝜹(z):=𝒂ϵ(z)𝒂(z)\bm{\delta}(z):=\bm{a}_{\epsilon}(z)-\bm{a}(z). The rest of the proof is exactly the same as that of Lemma D.2 and we omit the details. ∎

D.1.3 Multivariate Gaussian Case

In this section we prove Theorem D.1 in the idealized setting where the entries of 𝒁(1)\bm{Z}^{(1)} and 𝒁(2)\bm{Z}^{(2)} are i.i.d. Gaussian. By the rotational invariance of the multivariate Gaussian distribution, we have

𝒁(1)𝑽(𝚲(1))1/2,𝒁(2)𝑽(𝚲(2))1/2=𝑑𝒁(1)(𝚲(1))1/2,𝒁(2)(𝚲(2))1/2.\bm{Z}^{(1)}\bm{V}(\bm{\Lambda}^{(1)})^{1/2},\bm{Z}^{(2)}\bm{V}(\bm{\Lambda}^{(2)})^{1/2}\overset{d}{=}\bm{Z}^{(1)}(\bm{\Lambda}^{(1)})^{1/2},\bm{Z}^{(2)}(\bm{\Lambda}^{(2)})^{1/2}. (D.34)

Notice that since the entries of 𝒁(1),𝒁(2)\bm{Z}^{(1)},\bm{Z}^{(2)} are i.i.d. Gaussian, they have bounded support Q=1Q=1 by the remark after Definition B.3. Hence we can use the resolvent method (cf. [13]) to prove the following proposition.

Proposition D.4.

Under the setting of Theorem D.1, assume further that the entries of 𝐙(1)\bm{Z}^{(1)} and 𝐙(2)\bm{Z}^{(2)} are i.i.d. Gaussian random variables. Then the estimates (D.12) and (D.13) hold with Q=1Q=1.

Proposition D.4 is a corollary of the following entrywise local law.

Lemma D.5 (Entrywise Local Law).

Recall the definition of 𝐆\bm{G} in (D.8) and 𝔊\mathfrak{G} in (D.9). Under the setting of Proposition D.4, the averaged local laws (D.12) and the following entrywise local law hold with Q=1Q=1:

supz𝑫max𝔞,𝔟|𝑮𝔞𝔟(z)𝔊𝔞𝔟(z)|p1/2λ3.\sup_{z\in\bm{D}}\max_{\mathfrak{a},\mathfrak{b}\in\mathcal{I}}\left|\bm{G}_{\mathfrak{a}\mathfrak{b}}(z)-\mathfrak{G}_{\mathfrak{a}\mathfrak{b}}(z)\right|\prec p^{-1/2}\lambda^{-3}. (D.35)
Proof of Proposition D.4.

With estimate (D.35), we can use the polynomialization method in [13, Section 5] to prove (D.13) with Q=1Q=1. The proof is exactly the same with minor notation differences. We omit the details. ∎

In the remainder of this subsection we prove Lemma D.5, where the resolvent in Definition D.1 becomes

𝑮(z)=(z𝑰pn1/2(𝚲(1))1/2𝒁(1)n1/2(𝚲(2))1/2𝒁(2)n1/2𝒁(1)(𝚲(1))1/2𝑰𝟎n1/2𝒁(2)(𝚲(2))1/2𝟎𝑰)1.\bm{G}(z)=\begin{pmatrix}-z\bm{I}_{p}&n^{-1/2}(\bm{\Lambda}^{(1)})^{1/2}\bm{Z}^{(1)\top}&n^{-1/2}(\bm{\Lambda}^{(2)})^{1/2}\bm{Z}^{(2)\top}\\ n^{-1/2}\bm{Z}^{(1)}(\bm{\Lambda}^{(1)})^{1/2}&-\bm{I}&\bm{0}\\ n^{-1/2}\bm{Z}^{(2)}(\bm{\Lambda}^{(2)})^{1/2}&\bm{0}&-\bm{I}\end{pmatrix}^{-1}. (D.36)

Below we introduce resolvent minors to deal with the inverse using Schur’s complement formula.

Definition D.2 (Resolvent minors).

Given a (p+n)×(p+n)(p+n)\times(p+n) matrix 𝐀\bm{A} and 𝔠\mathfrak{c}\in\mathcal{I}, the minor of 𝐀\bm{A} after removing the 𝔠\mathfrak{c}-th row and column is a (p+n1)×(p+n1)(p+n-1)\times(p+n-1) matrix denoted by 𝐀(𝔠):=[𝐀𝔞𝔟:𝔞,𝔟{𝔠}]\bm{A}^{(\mathfrak{c})}:=[\bm{A}_{\mathfrak{a}\mathfrak{b}}:\mathfrak{a},\mathfrak{b}\in\mathcal{I}\setminus\{\mathfrak{c}\}]. We keep the names of indices when defining 𝐀(𝔠)\bm{A}^{(\mathfrak{c})}, i.e. 𝐀𝔞𝔟(𝔠)=𝐀𝔞𝔟\bm{A}_{\mathfrak{a}\mathfrak{b}}^{(\mathfrak{c})}=\bm{A}_{\mathfrak{a}\mathfrak{b}} for 𝔞,𝔟𝔠\mathfrak{a},\mathfrak{b}\neq\mathfrak{c}. Correspondingly, we define the resolvent minor of 𝐆(z)\bm{G}(z) by

𝑮(𝔠)(z):=[(z𝑰pn1/2(𝚲(1))1/2𝒁(1)n1/2(𝚲(2))1/2𝒁(2)n1/2𝒁(1)(𝚲(1))1/2𝑰𝟎n1/2𝒁(2)(𝚲(2))1/2𝟎𝑰)(𝔠)]1.\bm{G}^{(\mathfrak{c})}(z):=\left[\begin{pmatrix}-z\bm{I}_{p}&n^{-1/2}(\bm{\Lambda}^{(1)})^{1/2}\bm{Z}^{(1)\top}&n^{-1/2}(\bm{\Lambda}^{(2)})^{1/2}\bm{Z}^{(2)\top}\\ n^{-1/2}\bm{Z}^{(1)}(\bm{\Lambda}^{(1)})^{1/2}&-\bm{I}&\bm{0}\\ n^{-1/2}\bm{Z}^{(2)}(\bm{\Lambda}^{(2)})^{1/2}&\bm{0}&-\bm{I}\end{pmatrix}^{(\mathfrak{c})}\right]^{-1}.

We further define the partial traces m01(𝔠)(z),m02(𝔠)(z),m1(𝔠)(z),m2(𝔠)(z)m_{01}^{(\mathfrak{c})}(z),m_{02}^{(\mathfrak{c})}(z),m_{1}^{(\mathfrak{c})}(z),m_{2}^{(\mathfrak{c})}(z) by replacing 𝐆(z)\bm{G}(z) with 𝐆(𝔠)(z)\bm{G}^{(\mathfrak{c})}(z) in (D.7). For convenience, we adopt the notation 𝐆𝔞𝔟(𝔠)=0\bm{G}_{\mathfrak{a}\mathfrak{b}}^{(\mathfrak{c})}=0 if 𝔞=𝔠\mathfrak{a}=\mathfrak{c} or 𝔟=𝔠\mathfrak{b}=\mathfrak{c}.

The following formulae are identical to those in [57, Lemma C.10]. In fact they can all be proved using Schur’s complement formula.

Lemma D.6.

We have the following resolvent identities.

  1. (i)

    For i0i\in\mathcal{I}_{0} and μ12\mu\in\mathcal{I}_{1}\cup\mathcal{I}_{2}, we have

    1Gii=z(𝑭𝑮(i)𝑭)ii,1Gμμ=1(𝑭𝑮(μ)𝑭)μμ.\frac{1}{G_{ii}}=-z-\left(\bm{F}\bm{G}^{(i)}\bm{F}^{\top}\right)_{ii},\ \ \frac{1}{G_{\mu\mu}}=-1-\left(\bm{F}^{\top}\bm{G}^{(\mu)}\bm{F}\right)_{\mu\mu}. (D.37)
  2. (ii)

    For i0,μ12,𝔞{i}i\in\mathcal{I}_{0},\mu\in\mathcal{I}_{1}\cup\mathcal{I}_{2},\mathfrak{a}\in\mathcal{I}\setminus\{i\}, and 𝔟{μ}\mathfrak{b}\in\mathcal{I}\setminus\{\mu\}, we have

    Gi𝔞=Gii(𝑭𝑮(i))i𝔞,Gμ𝔟=Gμμ(𝑭𝑮(μ))μ𝔟.G_{i\mathfrak{a}}=-G_{ii}\left(\bm{F}\bm{G}^{(i)}\right)_{i\mathfrak{a}},\ \ G_{\mu\mathfrak{b}}=-G_{\mu\mu}\left(\bm{F}^{\top}\bm{G}^{(\mu)}\right)_{\mu\mathfrak{b}}. (D.38)
  3. (iii)

    For 𝔠\mathfrak{c}\in\mathcal{I} and 𝔞,𝔟{𝔠}\mathfrak{a},\mathfrak{b}\in\mathcal{I}\setminus\{\mathfrak{c}\}, we have

    G𝔞𝔟(𝔠)=G𝔞𝔟G𝔞𝔠G𝔠𝔟G𝔠𝔠G_{\mathfrak{a}\mathfrak{b}}^{(\mathfrak{c})}=G_{\mathfrak{a}\mathfrak{b}}-\frac{G_{\mathfrak{a}\mathfrak{c}}G_{\mathfrak{c}\mathfrak{b}}}{G_{\mathfrak{c}\mathfrak{c}}} (D.39)

The next lemma provides an a priori estimate on the resolvent 𝑮(z)\bm{G}(z) for z𝑫z\in\bm{D}.

Lemma D.7.

Under the setting of Theorem D.1, there exists a constant CC such that with overwhelming probability the following estimates hold uniformly in z,z𝐃z,z^{\prime}\in\bm{D}:

𝑮(z)opCλ1,\|\bm{G}(z)\|_{\textnormal{op}}\leq C\lambda^{-1}, (D.40)

and

𝑮(z)𝑮(z)opCλ1|zz|.\|\bm{G}(z)-\bm{G}(z^{\prime})\|_{\textnormal{op}}\leq C\lambda^{-1}|z-z^{\prime}|. (D.41)
Proof.

Let

𝑭=k=1nμˇk𝝃ˇk𝜻ˇk,μˇ1μˇ2μˇn0=μˇn+1==μˇp\bm{F}=\sum_{k=1}^{n}\sqrt{\check{\mu}_{k}}\check{\bm{\xi}}_{k}\check{\bm{\zeta}}_{k}^{\top},\ \ \check{\mu}_{1}\geq\check{\mu}_{2}\geq...\geq\check{\mu}_{n}\geq 0=\check{\mu}_{n+1}=...=\check{\mu}_{p} (D.42)

be a singular value decomposition of 𝑭\bm{F}, where {𝝃ˇk}k=1p\{\check{\bm{\xi}}_{k}\}_{k=1}^{p} are the left singular vectors and {𝜻ˇk}k=1n\{\check{\bm{\zeta}}_{k}\}_{k=1}^{n} are the right singular vectors. Then using (D.8), we know for i,j0,μ,ν12i,j\in\mathcal{I}_{0},\mu,\nu\in\mathcal{I}_{1}\cup\mathcal{I}_{2},

Gij=k=1pξˇk(i)ξˇk(j)μˇkz,Gμν=zk=1nζˇk(μ)ζˇk(ν)μˇkz,Giμ=Gμi=k=1nμˇk𝝃ˇk(i)𝜻ˇk(μ)μˇkzG_{ij}=\sum_{k=1}^{p}\frac{\check{\xi}_{k}(i)\check{\xi}_{k}^{\top}(j)}{\check{\mu}_{k}-z},\ \ G_{\mu\nu}=z\sum_{k=1}^{n}\frac{\check{\zeta}_{k}(\mu)\check{\zeta}_{k}^{\top}(\nu)}{\check{\mu}_{k}-z},\ \ G_{i\mu}=G_{\mu i}=\sum_{k=1}^{n}\frac{\sqrt{\check{\mu}_{k}}\check{\bm{\xi}}_{k}(i)\check{\bm{\zeta}}_{k}^{\top}(\mu)}{\check{\mu}_{k}-z} (D.43)

Since z𝑫z\in\bm{D}, we know

infz𝑫min1kp|μˇkz|Cλ\inf_{z\in\bm{D}}\min_{1\leq k\leq p}|\check{\mu}_{k}-z|\geq C\lambda

for some CC. Combining this fact with (D.43), we readily conclude (D.40) and (D.41). ∎

Now we are ready to prove Lemma D.5.

Proof of Lemma D.5.

In the setting of Lemma D.5, we have (D.34) and (D.36). Similar to [57], we divide our proof into four steps:

Step 1: Large deviation estimates. In this step, we provide large deviation estimates on the off-diagonal entries of 𝑮\bm{G}. We introduce

𝒵𝔞:=(1𝔼𝔞)[(G𝔞𝔞)1],𝔞\mathcal{Z}_{\mathfrak{a}}:=(1-\mathbb{E}_{\mathfrak{a}})[(G_{\mathfrak{a}\mathfrak{a}})^{-1}],\ \ \mathfrak{a}\in\mathcal{I}

, where 𝔼𝔞:=𝔼[|𝑯(𝔞)]\mathbb{E}_{\mathfrak{a}}:=\mathbb{E}[\cdot|\bm{H}^{(\mathfrak{a})}] denotes the partial expectation over the entries in the 𝔞\mathfrak{a}-th row and column of 𝑯\bm{H}. Using (D.37), we have for i0i\in\mathcal{I}_{0},

𝒵i=\displaystyle\mathcal{Z}_{i}= λi(1)nμ,ν1Gμν(i)(δμνZμi(1)Zνi(1))+λi(2)nμ,ν2Gμν(i)(δμνZμi(2)Zνi(2))\displaystyle\frac{\lambda_{i}^{(1)}}{n}\sum_{\mu,\nu\in\mathcal{I}_{1}}G_{\mu\nu}^{(i)}\left(\delta_{\mu\nu}-Z_{\mu i}^{(1)}Z_{\nu i}^{(1)}\right)+\frac{\lambda_{i}^{(2)}}{n}\sum_{\mu,\nu\in\mathcal{I}_{2}}G_{\mu\nu}^{(i)}\left(\delta_{\mu\nu}-Z_{\mu i}^{(2)}Z_{\nu i}^{(2)}\right) (D.44)
2λi(1)λi(2)nμ1,ν2Gμν(i)Zμi(1)Zνi(2),\displaystyle-\frac{2\sqrt{\lambda_{i}^{(1)}\lambda_{i}^{(2)}}}{n}\sum_{\mu\in\mathcal{I}_{1},\nu\in\mathcal{I}_{2}}G_{\mu\nu}^{(i)}Z_{\mu i}^{(1)}Z_{\nu i}^{(2)},

and for μ1\mu\in\mathcal{I}_{1} and ν2\nu\in\mathcal{I}_{2},

𝒵μ=1ni,j0λi(1)λj(1)Gij(μ)(δijZμi(1)Zμj(1)),𝒵ν=1ni,j0λi(2)λj(2)Gij(ν)(δijZνi(2)Zνj(2)).\mathcal{Z}_{\mu}=\frac{1}{n}\sum_{i,j\in\mathcal{I}_{0}}\sqrt{\lambda_{i}^{(1)}\lambda_{j}^{(1)}}G_{ij}^{(\mu)}\left(\delta_{ij}-Z_{\mu i}^{(1)}Z_{\mu j}^{(1)}\right),\ \ \mathcal{Z}_{\nu}=\frac{1}{n}\sum_{i,j\in\mathcal{I}_{0}}\sqrt{\lambda_{i}^{(2)}\lambda_{j}^{(2)}}G_{ij}^{(\nu)}\left(\delta_{ij}-Z_{\nu i}^{(2)}Z_{\nu j}^{(2)}\right). (D.45)

We further introduce the random error

Λˇo:=max𝔞𝔟|G𝔞𝔞1G𝔞𝔟|,\check{\Lambda}_{o}:=\max_{\mathfrak{a}\neq\mathfrak{b}}\left|G_{\mathfrak{a}\mathfrak{a}}^{-1}G_{\mathfrak{a}\mathfrak{b}}\right|, (D.46)

which controls the size of the largest off-diagonal entries.

Lemma D.8.

In the setting of Proposition D.4, the following holds uniformly in z𝐃z\in\bm{D}:

Λˇo+max𝔞|𝒵𝔞|p1/2λ1\check{\Lambda}_{o}+\max_{\mathfrak{a}\in\mathcal{I}}|\mathcal{Z}_{\mathfrak{a}}|\prec p^{-1/2}\lambda^{-1} (D.47)
Proof.

Notice that for 𝔞\mathfrak{a}\in\mathcal{I}, 𝑯(𝔞)\bm{H}^{(\mathfrak{a})} and 𝑮(𝔞)\bm{G}^{(\mathfrak{a})} also satisfy the assumptions of Lemma D.7. Thus the estimates (D.40) and (D.41) also hold for 𝑮(𝔞)\bm{G}^{(\mathfrak{a})} with overwhelming probability. For any i0i\in\mathcal{I}_{0}, since 𝑮(i)\bm{G}^{(i)} is independent of the entries in the ii-th row and column of 𝑯\bm{H}, we can combine (B.8), (B.9) and (B.10) with (D.44) to get

|𝒵i|\displaystyle|\mathcal{Z}_{i}|\lesssim 1nk=12|μ,νkGμν(i)(δμνZμi(k)Zνi(k))|+1n|μ1,ν2Gμν(i)Zμi(1)Zνi(2)|\frac{1}{n}\sum_{k=1}^{2}\left|\sum_{\mu,\nu\in\mathcal{I}_{k}}G_{\mu\nu}^{(i)}\left(\delta_{\mu\nu}-Z_{\mu i}^{(k)}Z_{\nu i}^{(k)}\right)\right|+\frac{1}{n}\left|\sum_{\mu\in\mathcal{I}_{1},\nu\in\mathcal{I}_{2}}G_{\mu\nu}^{(i)}Z_{\mu i}^{(1)}Z_{\nu i}^{(2)}\right|
\displaystyle\prec 1n(μ,ν12|Gμν(i)|2)1/2p1/2λ1,\displaystyle\frac{1}{n}\left(\sum_{\mu,\nu\in\mathcal{I}_{1}\cup\mathcal{I}_{2}}|G_{\mu\nu}^{(i)}|^{2}\right)^{1/2}\prec p^{-1/2}\lambda^{-1},

where in the last step we used (D.40) to get that for μ12\mu\in\mathcal{I}_{1}\cup\mathcal{I}_{2}, w.o.p.,

ν12|Gμν(i)|2𝔞|Gμ𝔞(i)|2=(G(i)G(i))μμ=O(λ2).\sum_{\nu\in\mathcal{I}_{1}\cup\mathcal{I}_{2}}|G_{\mu\nu}^{(i)}|^{2}\leq\sum_{\mathfrak{a}\in\mathcal{I}}|G_{\mu\mathfrak{a}}^{(i)}|^{2}=\left(G^{(i)}G^{(i)*}\right)_{\mu\mu}=O(\lambda^{-2}). (D.48)

The same bound for |𝒵μ||\mathcal{Z}_{\mu}| and |𝒵ν||\mathcal{Z}_{\nu}| can be obtained by combining (B.8), (B.9) and (B.10) with (D.45). This gives max𝔞|𝒵𝔞|p1/2λ1\max_{\mathfrak{a}\in\mathcal{I}}|\mathcal{Z}_{\mathfrak{a}}|\prec p^{-1/2}\lambda^{-1}.

Next we consider Λˇo\check{\Lambda}_{o}. For i0i\in\mathcal{I}_{0} and 𝔞{i}\mathfrak{a}\in\mathcal{I}\setminus\{i\}, using (D.38), (B.7) and (D.40), we obtain that

|Gii1Gi𝔞|\displaystyle|G_{ii}^{-1}G_{i\mathfrak{a}}|\lesssim n1/2|μ1Zμi(1)Gμ𝔞(i)|+n1/2|μ2Zμi(2)Gμ𝔞(i)|n^{-1/2}\left|\sum_{\mu\in\mathcal{I}_{1}}Z_{\mu i}^{(1)}G_{\mu\mathfrak{a}}^{(i)}\right|+n^{-1/2}\left|\sum_{\mu\in\mathcal{I}_{2}}Z_{\mu i}^{(2)}G_{\mu\mathfrak{a}}^{(i)}\right|
\displaystyle\prec n1/2(μ12|Gμ𝔞(i)|2)1/2p1/2λ1n^{-1/2}\left(\sum_{\mu\in\mathcal{I}_{1}\cup\mathcal{I}_{2}}|G_{\mu\mathfrak{a}}^{(i)}|^{2}\right)^{1/2}\prec p^{-1/2}\lambda^{-1}

The exact same bound for |Gμμ1Gμ𝔟||G_{\mu\mu}^{-1}G_{\mu\mathfrak{b}}| with μ12\mu\in\mathcal{I}_{1}\cup\mathcal{I}_{2} and 𝔟{μ}\mathfrak{b}\in\mathcal{I}\setminus\{\mu\} can be obtained with a similar argument. Thus Λˇop1/2λ1\check{\Lambda}_{o}\prec p^{-1/2}\lambda^{-1}. ∎

Combining (D.47) with (D.40), we immediately obtain (D.35) for off-diagonals 𝔞𝔟\mathfrak{a}\neq\mathfrak{b}.

Step 2: Self-consistent equations. In this step we show that (m1(z),m2(z))(m_{1}(z),m_{2}(z)), defined in (D.7), satisfies the system of approximate self-consistent equations (D.32). By (D.18) and (D.29), we know the following estimates hold for z𝑫z\in\bm{D}:

|m1c(z)m1c(λ)|(logp)1λ,|m2c(z)m2c(λ)|(logp)1λ.|m_{1c}(z)-m_{1c}(-\lambda)|\lesssim(\log p)^{-1}\lambda,\ \ |m_{2c}(z)-m_{2c}(-\lambda)|\lesssim(\log p)^{-1}\lambda.

Combining with (D.30), we know that uniformly in z𝑫z\in\bm{D},

|m1c(z)||m2c(z)|λ,|z+λi(1)r1m1c(z)+λi(2)r2m2c(z)|λ.|m_{1c}(z)|\sim|m_{2c}(z)|\sim\lambda,\ \ |z+\lambda_{i}^{(1)}r_{1}m_{1c}(z)+\lambda_{i}^{(2)}r_{2}m_{2c}(z)|\sim\lambda. (D.49)

Further, combining (D.19) with (D.29), we get that uniformly in z𝑫z\in\bm{D}, for k=1,2k=1,2

|1+γm0kc(z)|=|mkc1(z)|λ1,|1+\gamma m_{0kc}(z)|=|m_{kc}^{-1}(z)|\sim\lambda^{-1}, (D.50)

where we denote

m0kc(z):=λ(k)z+λ(1)r1m1c(z)+λ(2)r2m2c(z)𝑑H^p(λ(1),λ(2)).m_{0kc}(z):=-\int\frac{\lambda^{(k)}}{z+\lambda^{(1)}r_{1}m_{1c}(z)+\lambda^{(2)}r_{2}m_{2c}(z)}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}). (D.51)

We will later see in fact that m0kc(z)m_{0kc}(z) are the asymptotic limits of m0k(z)m_{0k}(z) defined in (D.7). Applying (D.49) to (D.9) and using (D.29), we get that

|𝔊ii(z)|λ1,|𝔊μμ(z)|λ,uniformly for z𝑫 and i0,μ12.|\mathfrak{G}_{ii}(z)|\sim\lambda^{-1},\ \ |\mathfrak{G}_{\mu\mu}(z)|\sim\lambda,\ \ \text{uniformly for $z\in\bm{D}$ and $i\in\mathcal{I}_{0},\mu\in\mathcal{I}_{1}\cup\mathcal{I}_{2}$}. (D.52)

We now define the critical zz-dependent event:

Ξ(z):={|m1(z)m1c(λ)|+|m2(z)m2c(λ)|(logp)1/2λ}.\Xi(z):=\left\{|m_{1}(z)-m_{1c}(-\lambda)|+|m_{2}(z)-m_{2c}(-\lambda)|\leq(\log p)^{-1/2}\lambda\right\}. (D.53)

(D.49) yields that on Ξ(z)\Xi(z),

|m1(z)||m2(z)|λ,|z+λi(1)r1m1(z)+λi(2)r2m2(z)|λ.|m_{1}(z)|\sim|m_{2}(z)|\sim\lambda,\ \ |z+\lambda_{i}^{(1)}r_{1}m_{1}(z)+\lambda_{i}^{(2)}r_{2}m_{2}(z)|\sim\lambda. (D.54)

We now propose the following key lemma, which introduces the approximate self-consistent equations for (m1(z),m2(z))(m_{1}(z),m_{2}(z)) on Ξ(z)\Xi(z).

Lemma D.9.

Under the setting of Lemma D.5, the following estimates hold uniformly in z𝐃z\in\bm{D} for k=1,2k=1,2:

𝟏(Ξ)\displaystyle\bm{1}(\Xi) |1mk+1γλ(k)z+λ(1)r1m1+λ(2)r2m2𝑑H^p(λ(1),λ(2))|\displaystyle\left|\frac{1}{m_{k}}+1-\gamma\int\frac{\lambda^{(k)}}{z+\lambda^{(1)}r_{1}m_{1}+\lambda^{(2)}r_{2}m_{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})\right| (D.55)
p1λ5+p1/2λ4Θ+|𝒵ˇ0k|+|𝒵ˇk|,\displaystyle\prec p^{-1}\lambda^{-5}+p^{-1/2}\lambda^{-4}\Theta+|\check{\mathcal{Z}}_{0k}|+|\check{\mathcal{Z}}_{k}|,

where we denote

Θ:=|m1(z)m1c(z)|+|m2(z)m2c(z)|\Theta:=|m_{1}(z)-m_{1c}(z)|+|m_{2}(z)-m_{2c}(z)| (D.56)

and

𝒵ˇ0k\displaystyle\check{\mathcal{Z}}_{0k} :=1pi0λi(k)𝒵i(z+λi(1)r1m1c+λi(2)r2m2c)2,\displaystyle:=\frac{1}{p}\sum_{i\in\mathcal{I}_{0}}\frac{\lambda_{i}^{(k)}\mathcal{Z}_{i}}{(z+\lambda_{i}^{(1)}r_{1}m_{1c}+\lambda_{i}^{(2)}r_{2}m_{2c})^{2}}, (D.57)
𝒵ˇ1\displaystyle\check{\mathcal{Z}}_{1} :=1n1μ1𝒵μ,𝒵ˇ2:=1n2ν2𝒵ν.\displaystyle:=\frac{1}{n_{1}}\sum_{\mu\in\mathcal{I}_{1}}\mathcal{Z}_{\mu},\ \ \check{\mathcal{Z}}_{2}:=\frac{1}{n_{2}}\sum_{\nu\in\mathcal{I}_{2}}\mathcal{Z}_{\nu}.
Proof.

Using (D.37),(D.44) and (D.45), we have

1Gii\displaystyle\frac{1}{G_{ii}} =zλi(1)r1m1λi(2)r2m2+i,for i0,\displaystyle=-z-\lambda_{i}^{(1)}r_{1}m_{1}-\lambda_{i}^{(2)}r_{2}m_{2}+\mathcal{E}_{i},\ \ \text{for $i\in\mathcal{I}_{0}$}, (D.58)
1Gμμ\displaystyle\frac{1}{G_{\mu\mu}} =1γm01+μ,for μ1,\displaystyle=-1-\gamma m_{01}+\mathcal{E}_{\mu},\ \ \text{for $\mu\in\mathcal{I}_{1}$}, (D.59)
1Gνν\displaystyle\frac{1}{G_{\nu\nu}} =1γm02+ν,for ν2,\displaystyle=-1-\gamma m_{02}+\mathcal{E}_{\nu},\ \ \text{for $\nu\in\mathcal{I}_{2}$}, (D.60)

where we denote (recall (D.7) and Definition D.2)

i:=𝒵i+λi(1)r1(m1m1(i))+λi(2)r2(m2m2(i)),\mathcal{E}_{i}:=\mathcal{Z}_{i}+\lambda_{i}^{(1)}r_{1}(m_{1}-m_{1}^{(i)})+\lambda_{i}^{(2)}r_{2}(m_{2}-m_{2}^{(i)}),

and

μ:=𝒵μ+γ(m01m01(μ)),ν:=𝒵ν+γ(m02m02(ν)).\mathcal{E}_{\mu}:=\mathcal{Z}_{\mu}+\gamma(m_{01}-m_{01}^{(\mu)}),\ \ \mathcal{E}_{\nu}:=\mathcal{Z}_{\nu}+\gamma(m_{02}-m_{02}^{(\nu)}).

Using equations (D.39), (D.46) and (D.47), we know

|m1m1(i)|1n1μ1|GμiGiμGii||Λˇo|2|Gii|p1λ3.|m_{1}-m_{1}^{(i)}|\leq\frac{1}{n_{1}}\sum_{\mu\in\mathcal{I}_{1}}\left|\frac{G_{\mu i}G_{i\mu}}{G_{ii}}\right|\leq|\check{\Lambda}_{o}|^{2}|G_{ii}|\prec p^{-1}\lambda^{-3}. (D.61)

Similarly, we also have

|m2m2(i)|p1λ3,|m01m01(μ)|p1λ3,|m02m02(ν)|p1λ3,|m_{2}-m_{2}^{(i)}|\prec p^{-1}\lambda^{-3},\ \ |m_{01}-m_{01}^{(\mu)}|\prec p^{-1}\lambda^{-3},\ \ |m_{02}-m_{02}^{(\nu)}|\prec p^{-1}\lambda^{-3}, (D.62)

for i0,μ1,ν2i\in\mathcal{I}_{0},\mu\in\mathcal{I}_{1},\nu\in\mathcal{I}_{2}. Combining (D.61) and (D.62) with (D.47), we have

maxi0|i|+maxμ12|μ|p1/2λ1.\max_{i\in\mathcal{I}_{0}}|\mathcal{E}_{i}|+\max_{\mu\in\mathcal{I}_{1}\cup\mathcal{I}_{2}}|\mathcal{E}_{\mu}|\prec p^{-1/2}\lambda^{-1}. (D.63)

Now from equation (D.58), we know that on Ξ\Xi,

\displaystyle G_{ii} =-\frac{1}{z+\lambda_{i}^{(1)}r_{1}m_{1}+\lambda_{i}^{(2)}r_{2}m_{2}}-\frac{\mathcal{E}_{i}}{(z+\lambda_{i}^{(1)}r_{1}m_{1}+\lambda_{i}^{(2)}r_{2}m_{2})^{2}}+O_{\prec}(p^{-1}\lambda^{-5}) (D.64)
\displaystyle=-\frac{1}{z+\lambda_{i}^{(1)}r_{1}m_{1}+\lambda_{i}^{(2)}r_{2}m_{2}}-\frac{\mathcal{Z}_{i}}{(z+\lambda_{i}^{(1)}r_{1}m_{1c}+\lambda_{i}^{(2)}r_{2}m_{2c})^{2}}+O_{\prec}(p^{-1}\lambda^{-5}+p^{-1/2}\lambda^{-4}\Theta)

where in the first step we use (D.63) and (D.54) on Ξ\Xi, and in the second step we use (D.56), (D.61) and (D.47). Plugging (D.64) into the definitions of m0km_{0k} in (D.7) and using (D.57), we get that on Ξ\Xi, for k=1,2k=1,2,

m0k=λ(k)z+λ(1)r1m1+λ(2)r2m2𝑑H^p(λ(1),λ(2))𝒵ˇ0k+O(p1λ5+p1/2λ4Θ).\displaystyle m_{0k}=-\int\frac{\lambda^{(k)}}{z+\lambda^{(1)}r_{1}m_{1}+\lambda^{(2)}r_{2}m_{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})-\check{\mathcal{Z}}_{0k}+O_{\prec}(p^{-1}\lambda^{-5}+p^{-1/2}\lambda^{-4}\Theta). (D.65)

Comparing the equations above with (D.51) and applying (D.57) and (D.47), we get

|m01(z)m01c(z)|+|m02(z)m02c(z)|(logp)1/2λ1,w.o.p.onΞ.|m_{01}(z)-m_{01c}(z)|+|m_{02}(z)-m_{02c}(z)|\lesssim(\log p)^{-1/2}\lambda^{-1},\ \ \text{w.o.p.}\ \ \text{on}\ \ \Xi. (D.66)

Together with equation (D.50), we have

|1+\gamma m_{01}(z)|\sim\lambda^{-1},\ \ |1+\gamma m_{02}(z)|\sim\lambda^{-1}\ \ \text{w.o.p.}\ \ \text{on}\ \ \Xi. (D.67)

With a similar argument, from equations (D.59) and (D.60), we know that on \Xi,

Gμμ=11+γm0k𝒵μ(1+γm0k)2+O(p1λ1),μk,k=1,2,G_{\mu\mu}=-\frac{1}{1+\gamma m_{0k}}-\frac{\mathcal{Z}_{\mu}}{(1+\gamma m_{0k})^{2}}+O_{\prec}(p^{-1}\lambda^{-1}),\ \ \mu\in\mathcal{I}_{k},\ \ k=1,2, (D.68)

where we used (D.62), (D.63) and (D.67). Taking the average of (D.68) over \mu\in\mathcal{I}_{k}, we get that on \Xi,

mk=11+γm0k𝒵ˇk(1+γm0k)2+O(p1λ1),k=1,2,m_{k}=-\frac{1}{1+\gamma m_{0k}}-\frac{\check{\mathcal{Z}}_{k}}{(1+\gamma m_{0k})^{2}}+O_{\prec}(p^{-1}\lambda^{-1}),\ \ k=1,2, (D.69)

which implies that on Ξ\Xi,

\left|\frac{1}{m_{k}}+1+\gamma m_{0k}\right|\prec p^{-1}\lambda^{-3}+|\check{\mathcal{Z}}_{k}|, (D.70)

where we used (D.54) and (D.67). Finally, plugging (D.65) into (D.70), we get (D.55). ∎

Step 3: Entrywise Local Law. In this step, we show that the event Ξ(z)\Xi(z) in (D.53) actually holds with high probability, and then we apply Lemma D.9 to conclude the entrywise local law (D.35). We first claim that it suffices to show

|m1(λ)m1c(λ)|+|m2(λ)m2c(λ)|p1/2λ2.|m_{1}(-\lambda)-m_{1c}(-\lambda)|+|m_{2}(-\lambda)-m_{2c}(-\lambda)|\prec p^{-1/2}\lambda^{-2}. (D.71)

In fact, by (D.43) and similar arguments as in the proof of Lemma D.7, we know that uniformly for z𝑫z\in\bm{D}, w.o.p.,

|m1(z)m1(λ)|+|m2(z)m2(λ)||z+λ|(logp)1λ.|m_{1}(z)-m_{1}(-\lambda)|+|m_{2}(z)-m_{2}(-\lambda)|\lesssim|z+\lambda|\leq(\log p)^{-1}\lambda.

Thus, if (D.71) holds, we can obtain with triangle inequality that

supz𝑫(|m1(z)m1c(λ)|+|m2(z)m2c(λ)|)(logp)1λw.o.p.,\sup_{z\in\bm{D}}(|m_{1}(z)-m_{1c}(-\lambda)|+|m_{2}(z)-m_{2c}(-\lambda)|)\lesssim(\log p)^{-1}\lambda\ \ \text{w.o.p.,} (D.72)

which confirms that Ξ\Xi holds w.o.p., and also verifies the condition (D.31) of Lemma D.3. Now applying Lemma D.3 to (D.55), we get

Θ(z)\displaystyle\Theta(z) =|m1(z)m1c(z)|+|m2(z)m2c(z)|\displaystyle=|m_{1}(z)-m_{1c}(z)|+|m_{2}(z)-m_{2c}(z)|
p1λ4+p1/2λ3Θ+λ(|𝒵ˇ1|+|𝒵ˇ2|+|𝒵ˇ01|+|𝒵ˇ02|),\displaystyle\prec p^{-1}\lambda^{-4}+p^{-1/2}\lambda^{-3}\Theta+\lambda(|\check{\mathcal{Z}}_{1}|+|\check{\mathcal{Z}}_{2}|+|\check{\mathcal{Z}}_{01}|+|\check{\mathcal{Z}}_{02}|),

which implies

Θ(z)p1λ4+λ(|𝒵ˇ1|+|𝒵ˇ2|+|𝒵ˇ01|+|𝒵ˇ02|)p1/2λ2\Theta(z)\prec p^{-1}\lambda^{-4}+\lambda(|\check{\mathcal{Z}}_{1}|+|\check{\mathcal{Z}}_{2}|+|\check{\mathcal{Z}}_{01}|+|\check{\mathcal{Z}}_{02}|)\prec p^{-1/2}\lambda^{-2} (D.73)

uniformly for z𝑫z\in\bm{D}, where we used (D.47). On the other hand, with equation (D.68) and (D.69), we know

maxμ1|Gμμ(z)m1(z)|+maxν2|Gνν(z)m2(z)|p1λ1,\max_{\mu\in\mathcal{I}_{1}}|G_{\mu\mu}(z)-m_{1}(z)|+\max_{\nu\in\mathcal{I}_{2}}|G_{\nu\nu}(z)-m_{2}(z)|\prec p^{-1}\lambda^{-1},

which, combined with (D.73), yields

maxμ1|Gμμ(z)m1c(z)|+maxν2|Gνν(z)m2c(z)|p1/2λ2.\max_{\mu\in\mathcal{I}_{1}}|G_{\mu\mu}(z)-m_{1c}(z)|+\max_{\nu\in\mathcal{I}_{2}}|G_{\nu\nu}(z)-m_{2c}(z)|\prec p^{-1/2}\lambda^{-2}. (D.74)

Next, plugging (D.73) into (D.64) and recalling (D.9) and (D.29), we get that

maxi0|Gii(z)𝔊ii(z)|p1/2λ3.\max_{i\in\mathcal{I}_{0}}|G_{ii}(z)-\mathfrak{G}_{ii}(z)|\prec p^{-1/2}\lambda^{-3}.

Together with (D.74), we have the diagonal estimate

max𝔞|G𝔞𝔞(z)𝔊𝔞𝔞(z)|p1/2λ3,\max_{\mathfrak{a}\in\mathcal{I}}|G_{\mathfrak{a}\mathfrak{a}}(z)-\mathfrak{G}_{\mathfrak{a}\mathfrak{a}}(z)|\prec p^{-1/2}\lambda^{-3}, (D.75)

which, combined with the previously established off-diagonal estimate, yields the entrywise local law (D.35).

Thus we are left with proving (D.71). Using (D.43) and (D.40), we know m0k(λ)λ1m_{0k}(-\lambda)\sim\lambda^{-1} for k=1,2k=1,2. Therefore we have

1+γm0k(λ)λ1,k=1,2.1+\gamma m_{0k}(-\lambda)\sim\lambda^{-1},\ \ k=1,2. (D.76)

Combining these estimates with (D.59), (D.60) and (D.63), we conclude that (D.69) holds at z=-\lambda without requiring \Xi(-\lambda). This further gives that w.o.p.,

|\lambda_{i}^{(1)}r_{1}m_{1}(-\lambda)+\lambda_{i}^{(2)}r_{2}m_{2}(-\lambda)|=\left|\frac{\lambda_{i}^{(1)}r_{1}}{1+\gamma m_{01}(-\lambda)}+\frac{\lambda_{i}^{(2)}r_{2}}{1+\gamma m_{02}(-\lambda)}+O_{\prec}(p^{-1/2}\lambda^{-1})\right|\sim\lambda.

Combining the above estimate with (D.58) and (D.63), we obtain that (D.65) also holds at z=-\lambda without requiring \Xi(-\lambda). Finally, plugging (D.65) into (D.70), we conclude that (D.55) holds at z=-\lambda.

Also, combining (D.69) at z=-\lambda with (D.76), we know there exist constants c,C>0 such that

cλa1ϵ(λ):=r1m1(λ)Cλ,cλa2ϵ(λ):=r2m2(λ)Cλ,w.o.p.c\lambda\leq a_{1\epsilon}(-\lambda):=-r_{1}m_{1}(-\lambda)\leq C\lambda,\ \ c\lambda\leq a_{2\epsilon}(-\lambda):=-r_{2}m_{2}(-\lambda)\leq C\lambda,\ \ \text{w.o.p.} (D.77)

Consequently, we know (a1(λ),a2(λ))(a_{1}(-\lambda),a_{2}(-\lambda)) satisfies (D.15) and (a1ϵ(λ),a2ϵ(λ))(a_{1\epsilon}(-\lambda),a_{2\epsilon}(-\lambda)) satisfies a similar system with right-hand-sides replaced by O(p1/2λ3)O_{\prec}(p^{-1/2}\lambda^{-3}). Denote 𝑭(𝒂)=[f(a1,a2),g(a1,a2)]\bm{F}(\bm{a})=[f(a_{1},a_{2}),g(a_{1},a_{2})] for 𝒂=(a1,a2)\bm{a}=(a_{1},a_{2}) and f,gf,g defined in (D.15), and further denote 𝑱\bm{J} to be the Jacobian of 𝑭\bm{F}. Similar to the derivation of (D.23), because of (D.77) and (D.16), we know 𝑱1op=O(λ)\|\bm{J}^{-1}\|_{\textnormal{op}}=O(\lambda) at both (a1(λ),a2(λ))(a_{1}(-\lambda),a_{2}(-\lambda)) and (a1ϵ(λ),a2ϵ(λ))(a_{1\epsilon}(-\lambda),a_{2\epsilon}(-\lambda)). Hence we obtain that

|m1(λ)m1c(λ)|p1/2λ2,|m2(λ)m2c(λ)|p1/2λ2,|m_{1}(-\lambda)-m_{1c}(-\lambda)|\prec p^{-1/2}\lambda^{-2},\ \ |m_{2}(-\lambda)-m_{2c}(-\lambda)|\prec p^{-1/2}\lambda^{-2},

thereby yielding (D.71).
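For illustration only, the following Python sketch mimics this perturbation argument numerically: it solves a system of the above type, perturbs its right-hand side by a small \epsilon, and checks that the root moves by roughly \|\bm{J}^{-1}\epsilon\|. The dimensions, the toy spectrum, the value of \lambda, and the fixed-point form of the system (the one suggested by (D.84), not necessarily the verbatim statement of (D.15)) are our own illustrative assumptions.

```python
import numpy as np

# Stability sketch: F(a) = 0 with F_k(a) = a_k (1 + gamma * I_k(a)) - r_k, where
# I_k(a) = gamma-free integral int lambda^(k)/(a1 l1 + a2 l2 + lam) dH_p (gamma applied below).
# All numerical choices are illustrative, not taken from the paper.
rng = np.random.default_rng(1)
p, n1, n2, lam = 800, 250, 250, 0.1
n = n1 + n2
gamma, r = p / n, np.array([n1 / n, n2 / n])
spec = rng.uniform(0.5, 2.0, size=(p, 2))      # rows: (lambda_j^(1), lambda_j^(2))

def integrals(a):
    # gamma * int lambda^(k) / (a_1 lambda^(1) + a_2 lambda^(2) + lam) dH_p, k = 1, 2
    return gamma * (spec / (spec @ a + lam)[:, None]).mean(axis=0)

def F(a, rhs=np.zeros(2)):
    return a * (1.0 + integrals(a)) - r - rhs

def solve(rhs=np.zeros(2), iters=5000, damp=0.5):
    # damped fixed-point iteration for a_k (1 + gamma I_k(a)) = r_k + rhs_k
    a = np.full(2, lam)
    for _ in range(iters):
        a = (1 - damp) * a + damp * (r + rhs) / (1.0 + integrals(a))
    return a

a_star = solve()
h = 1e-6
J = np.column_stack([(F(a_star + h * e) - F(a_star)) / h for e in np.eye(2)])
eps = 1e-4 * np.ones(2)
a_eps = solve(rhs=eps)
print("actual displacement  :", np.linalg.norm(a_eps - a_star))
print("first-order J^-1 eps :", np.linalg.norm(np.linalg.solve(J, eps)))
```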

Step 4: Averaged Local Law. Now we prove the averaged local laws (D.12). We have the following fluctuation averaging lemma:

Lemma D.10 (Fluctuation averaging).

Under the setting of Proposition D.4, suppose that the entrywise local law holds uniformly in z\in\bm{D}. Then we have

|𝒵ˇ1|+|𝒵ˇ2|+|𝒵ˇ01|+|𝒵ˇ02|p1λ6|\check{\mathcal{Z}}_{1}|+|\check{\mathcal{Z}}_{2}|+|\check{\mathcal{Z}}_{01}|+|\check{\mathcal{Z}}_{02}|\prec p^{-1}\lambda^{-6} (D.78)

uniformly in z𝐃z\in\bm{D}.

Proof.

The proof exactly follows [23, Theorem 4.7]. ∎

Now plugging in (D.78) into (D.73), we get

|m1(z)m1c(z)|+|m2(z)m2c(z)|p1λ5.|m_{1}(z)-m_{1c}(z)|+|m_{2}(z)-m_{2c}(z)|\prec p^{-1}\lambda^{-5}. (D.79)

Then, subtracting (D.51) from (D.65) and applying (D.78) and (D.79), we get

|m0k(z)m0kc(z)|p1λ7,|m_{0k}(z)-m_{0kc}(z)|\prec p^{-1}\lambda^{-7},

which is exactly the averaged local law. ∎

D.1.4 Non-Gaussian extension

Now, with Proposition D.4 in hand, we can extend from Gaussian random matrices to generally distributed random matrices using the exact same proof as [57, Sections C.2.4 and C.2.5]; since this step no longer involves tracking powers of \lambda, we omit the details here.

D.2 Proof of Theorem A.3

First, using a standard truncation argument (c.f. [57, proof of Theorem 3.1]), we know that Theorem D.1 holds for Q=p2/φQ=p^{2/\varphi} with high probability if (𝒁(1),𝒁(2))(\bm{Z}^{(1)},\bm{Z}^{(2)}) now satisfy the bounded moment condition (2.3).

We first prove the variance term (A.15). Define the auxiliary variable

Vˇ(λ)\displaystyle\check{V}(\lambda) :=σ2λnTr((𝚲(2))1/2((𝚲(1))1/2𝑽𝚺^𝒁(1)𝑽(𝚲(1))1/2+(𝚲(2))1/2𝑽𝚺^𝒁(2)𝑽(𝚲(2))1/2+λ𝑰)1(𝚲(2))1/2)\displaystyle:=\frac{\sigma^{2}\lambda}{n}\textnormal{Tr}\left((\bm{\Lambda}^{(2)})^{1/2}\left((\bm{\Lambda}^{(1)})^{1/2}\bm{V}^{\top}\hat{\bm{\Sigma}}_{\bm{Z}}^{(1)}\bm{V}(\bm{\Lambda}^{(1)})^{1/2}+(\bm{\Lambda}^{(2)})^{1/2}\bm{V}^{\top}\hat{\bm{\Sigma}}_{\bm{Z}}^{(2)}\bm{V}(\bm{\Lambda}^{(2)})^{1/2}+\lambda\bm{I}\right)^{-1}(\bm{\Lambda}^{(2)})^{1/2}\right) (D.80)
=σ2λni0λi(2)Gii(λ).\displaystyle=\frac{\sigma^{2}\lambda}{n}\sum_{i\in\mathcal{I}_{0}}\lambda_{i}^{(2)}G_{ii}(-\lambda).

Since (D.19) reduces to the first two equations in (A.17) at z=-\lambda, by (D.12) we have, with high probability,

Vˇ(λ)\displaystyle\check{V}(\lambda) =σ2λni0𝔊ii(λ)+O(p1λ6)\displaystyle=\frac{\sigma^{2}\lambda}{n}\sum_{i\in\mathcal{I}_{0}}\mathfrak{G}_{ii}(-\lambda)+O(p^{-1}\lambda^{-6})
=σ2γλλ(2)a1λ(1)+a2λ(2)+λ𝑑H^p(λ(1),λ(2))+O(p1λ6)\displaystyle=\sigma^{2}\gamma\int\frac{\lambda\lambda^{(2)}}{a_{1}\lambda^{(1)}+a_{2}\lambda^{(2)}+\lambda}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})+O(p^{-1}\lambda^{-6})
:=V¯(λ)+O(p1λ6)\displaystyle:=\overline{V}(\lambda)+O(p^{-1}\lambda^{-6})

where (a1,a2)(a_{1},a_{2}) is defined in (A.17).
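For illustration, the deterministic quantity \overline{V}(\lambda) can be computed by a simple fixed-point iteration and compared against a Monte Carlo evaluation of \check{V}(\lambda), which in the simultaneously diagonalizable setting equals (\sigma^{2}\lambda/n)\textnormal{Tr}(\bm{\Sigma}^{(2)}(\hat{\bm{\Sigma}}+\lambda\bm{I})^{-1}). The sketch below uses toy dimensions and spectra of our own choosing, and writes the first two equations of (A.17) in the fixed-point form suggested by (D.84).

```python
import numpy as np

rng = np.random.default_rng(0)
p, n1, n2, lam, sigma2 = 900, 300, 300, 0.2, 1.0
n = n1 + n2
gamma, r = p / n, np.array([n1 / n, n2 / n])
spec = np.column_stack([rng.uniform(0.5, 2.0, p), rng.uniform(0.5, 2.0, p)])

def solve_a(iters=5000, damp=0.5):
    # a_k = r_k / (1 + gamma * int lambda^(k)/(a1 l1 + a2 l2 + lam) dH_p), k = 1, 2
    a = np.full(2, lam)
    for _ in range(iters):
        integ = gamma * (spec / (spec @ a + lam)[:, None]).mean(axis=0)
        a = (1 - damp) * a + damp * r / (1.0 + integ)
    return a

a1, a2 = solve_a()
# deterministic \bar{V}(lambda) = sigma^2 gamma int lam * lambda^(2)/(a1 l1 + a2 l2 + lam) dH_p
V_bar = sigma2 * lam / n * np.sum(spec[:, 1] / (a1 * spec[:, 0] + a2 * spec[:, 1] + lam))

# Monte Carlo counterpart with commuting (here diagonal) covariances Sigma^(k) = diag(spec[:, k])
X1 = rng.standard_normal((n1, p)) * np.sqrt(spec[:, 0])
X2 = rng.standard_normal((n2, p)) * np.sqrt(spec[:, 1])
Sigma_hat = (X1.T @ X1 + X2.T @ X2) / n
V_check = sigma2 * lam / n * np.trace(
    np.diag(spec[:, 1]) @ np.linalg.inv(Sigma_hat + lam * np.eye(p)))
print("deterministic V_bar  :", V_bar)
print("Monte Carlo  V_check :", V_check)
```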

Fix t>0t>0. Let 𝑨:=𝚺^+λ𝑰,𝑩:=𝚺^+(λ+tλ)𝑰\bm{A}:=\hat{\bm{\Sigma}}+\lambda\bm{I},\bm{B}:=\hat{\bm{\Sigma}}+(\lambda+t\lambda)\bm{I}. Then from (D.2),

|V(𝜷^λ;𝜷(2))1tλ(Vˇ(λ+tλ)Vˇ(λ))|\displaystyle\left|V(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)})-\frac{1}{t\lambda}(\check{V}(\lambda+t\lambda)-\check{V}(\lambda))\right|
=\displaystyle= |σ2nTr(𝚺(2)𝚺^(𝑨21tλ(𝑨1𝑩1)))|\displaystyle\left|\frac{\sigma^{2}}{n}\textnormal{Tr}\left(\bm{\Sigma}^{(2)}\hat{\bm{\Sigma}}\left(\bm{A}^{-2}-\frac{1}{t\lambda}(\bm{A}^{-1}-\bm{B}^{-1})\right)\right)\right|
\displaystyle\lesssim 𝚺(2)𝚺^(𝑨21tλ(𝑩1(𝑩𝑨)𝑨1))op\displaystyle\left\|\bm{\Sigma}^{(2)}\hat{\bm{\Sigma}}\left(\bm{A}^{-2}-\frac{1}{t\lambda}(\bm{B}^{-1}(\bm{B}-\bm{A})\bm{A}^{-1})\right)\right\|_{\textnormal{op}}
\displaystyle\lesssim 𝚺^(𝑨2𝑩1𝑨1)op\displaystyle\left\|\hat{\bm{\Sigma}}\left(\bm{A}^{-2}-\bm{B}^{-1}\bm{A}^{-1}\right)\right\|_{\textnormal{op}}
=\displaystyle= tλ𝚺^𝑩1𝑨2op\displaystyle t\lambda\|\hat{\bm{\Sigma}}\bm{B}^{-1}\bm{A}^{-2}\|_{\textnormal{op}}
=\displaystyle= O(tλ2)\displaystyle O(t\lambda^{-2})

with high probability, where in the last line we used Corollary B.4.

Similarly (but without randomness), we have

\left|\sigma^{2}\gamma\int\frac{\lambda^{(1)}\lambda^{(2)}(a_{1}-a_{3}\lambda)+(\lambda^{(2)})^{2}(a_{2}-a_{4}\lambda)}{(a_{1}\lambda^{(1)}+a_{2}\lambda^{(2)}+\lambda)^{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})-\frac{1}{t\lambda}(\overline{V}(\lambda+t\lambda)-\overline{V}(\lambda))\right|=O(t\lambda^{-2}),

where (a3,a4):=λ(a1,a2)(a_{3},a_{4}):=\frac{\partial}{\partial\lambda}(a_{1},a_{2}) satisfies the last two equations in (A.17). Combining the previous two displays and setting t=p1/2λ5/2t=p^{-1/2}\lambda^{-5/2}, we get

|V(𝜷^λ;𝜷(2))σ2γλ(1)λ(2)(a1a3λ)+(λ(2))2(a2a4λ)(a1λ(1)+a2λ(2)+λ)2𝑑H^p(λ(1),λ(2))|\displaystyle\left|V(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)})-\sigma^{2}\gamma\int\frac{\lambda^{(1)}\lambda^{(2)}(a_{1}-a_{3}\lambda)+(\lambda^{(2)})^{2}(a_{2}-a_{4}\lambda)}{(a_{1}\lambda^{(1)}+a_{2}\lambda^{(2)}+\lambda)^{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})\right|
\displaystyle\leq 1tλ(|Vˇ(λ)V¯(λ)|+|Vˇ(λ+tλ)V¯(λ+tλ)|)+O(tλ2)\displaystyle\frac{1}{t\lambda}\left(|\check{V}(\lambda)-\overline{V}(\lambda)|+|\check{V}(\lambda+t\lambda)-\overline{V}(\lambda+t\lambda)|\right)+O(t\lambda^{-2})
=\displaystyle= O(t1p1λ7)+O(tλ2)\displaystyle O(t^{-1}p^{-1}\lambda^{-7})+O(t\lambda^{-2})
=\displaystyle= O(p1/2λ9/2)p1/2+cλ9/2\displaystyle O(p^{-1/2}\lambda^{-9/2})\prec p^{-1/2+c}\lambda^{-9/2}

for any small constant c>0c>0, thereby obtaining (A.15).
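The finite-difference step above is easy to check numerically. The following sketch (with illustrative sizes, spectra and choice of t of our own) verifies on a simulated instance that the divided difference of \check{V}(\lambda)=(\sigma^{2}\lambda/n)\textnormal{Tr}(\bm{\Sigma}^{(2)}(\hat{\bm{\Sigma}}+\lambda\bm{I})^{-1}) over [\lambda,\lambda+t\lambda] recovers V(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)})=(\sigma^{2}/n)\textnormal{Tr}(\bm{\Sigma}^{(2)}\hat{\bm{\Sigma}}(\hat{\bm{\Sigma}}+\lambda\bm{I})^{-2}) up to a small error.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n1, n2, lam, t, sigma2 = 400, 150, 150, 0.3, 1e-3, 1.0
n = n1 + n2
l1, l2 = rng.uniform(0.5, 2.0, p), rng.uniform(0.5, 2.0, p)
X1 = rng.standard_normal((n1, p)) * np.sqrt(l1)
X2 = rng.standard_normal((n2, p)) * np.sqrt(l2)
S_hat, S2 = (X1.T @ X1 + X2.T @ X2) / n, np.diag(l2)

def V_check(lmbda):
    # (sigma^2 lmbda / n) Tr( Sigma^(2) (S_hat + lmbda I)^{-1} )
    return sigma2 * lmbda / n * np.trace(S2 @ np.linalg.inv(S_hat + lmbda * np.eye(p)))

A_inv = np.linalg.inv(S_hat + lam * np.eye(p))
V_exact = sigma2 / n * np.trace(S2 @ S_hat @ A_inv @ A_inv)
V_fd = (V_check(lam + t * lam) - V_check(lam)) / (t * lam)
print("exact variance term     :", V_exact)
print("finite-difference proxy :", V_fd)
```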

Similarly, for the bias term, define the auxiliary variables

Bˇ(η)\displaystyle\check{B}(\eta) =λ𝜷η(2)((𝚲η(1))1/2𝑽𝚺^𝒁(1)𝑽(𝚲η(1))1/2+(𝚲η(2))1/2𝑽𝚺^𝒁(2)𝑽(𝚲η(2))1/2+λ𝑰)1𝜷η(2),\displaystyle=\lambda\bm{\beta}_{\eta}^{(2)\top}\left((\bm{\Lambda}_{\eta}^{(1)})^{1/2}\bm{V}^{\top}\hat{\bm{\Sigma}}_{\bm{Z}}^{(1)}\bm{V}(\bm{\Lambda}_{\eta}^{(1)})^{1/2}+(\bm{\Lambda}_{\eta}^{(2)})^{1/2}\bm{V}^{\top}\hat{\bm{\Sigma}}_{\bm{Z}}^{(2)}\bm{V}(\bm{\Lambda}_{\eta}^{(2)})^{1/2}+\lambda\bm{I}\right)^{-1}\bm{\beta}_{\eta}^{(2)}, (D.81)
B¯(η)\displaystyle\overline{B}(\eta) =𝜷(2)22λb1λ(1)+b2λ(2)+λ(1+ηλ(2))𝑑G^p(λ(1),λ(2)),\displaystyle=\|\bm{\beta}^{(2)}\|_{2}^{2}\cdot\int\frac{\lambda}{b_{1}\lambda^{(1)}+b_{2}\lambda^{(2)}+\lambda(1+\eta\lambda^{(2)})}d\hat{G}_{p}(\lambda^{(1)},\lambda^{(2)}),

where (b_{1},b_{2}) is defined in (A.18). Then by (D.13), with high probability,

\check{B}(\eta)=\overline{B}(\eta)+O(p^{-1/2}\lambda^{-3}).

Let η>0\eta>0 and denote 𝑨:=𝚺^+λ𝑰,𝑩:=𝚺^+λ𝑰+λη𝚺(2)\bm{A}:=\hat{\bm{\Sigma}}+\lambda\bm{I},\bm{B}:=\hat{\bm{\Sigma}}+\lambda\bm{I}+\lambda\eta\bm{\Sigma}^{(2)}. Then from (D.3),

|B(𝜷^λ;𝜷(2))1η(Bˇ(0)Bˇ(η))|\displaystyle\left|B(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)})-\frac{1}{\eta}(\check{B}(0)-\check{B}(\eta))\right|
=\displaystyle= |λ𝜷(2)(λ𝑨1𝚺(2)𝑨11η(𝑨1𝑩1))𝜷(2)|\displaystyle\left|\lambda\bm{\beta}^{(2)\top}\left(\lambda\bm{A}^{-1}\bm{\Sigma}^{(2)}\bm{A}^{-1}-\frac{1}{\eta}(\bm{A}^{-1}-\bm{B}^{-1})\right)\bm{\beta}^{(2)}\right|
\displaystyle\lesssim λ(λ𝑨1𝚺(2)𝑨11η(𝑩1(𝑩𝑨)𝑨1))op\displaystyle\lambda\left\|\left(\lambda\bm{A}^{-1}\bm{\Sigma}^{(2)}\bm{A}^{-1}-\frac{1}{\eta}(\bm{B}^{-1}(\bm{B}-\bm{A})\bm{A}^{-1})\right)\right\|_{\textnormal{op}}
=\displaystyle= λ2(𝑨1𝑩1)𝚺(2)𝑨1op\displaystyle\lambda^{2}\|(\bm{A}^{-1}-\bm{B}^{-1})\bm{\Sigma}^{(2)}\bm{A}^{-1}\|_{\textnormal{op}}
=\displaystyle= λ3η𝑩1𝚺(2)𝑨1𝚺(2)𝑨1op\displaystyle\lambda^{3}\eta\|\bm{B}^{-1}\bm{\Sigma}^{(2)}\bm{A}^{-1}\bm{\Sigma}^{(2)}\bm{A}^{-1}\|_{\textnormal{op}}
=\displaystyle= O(η).\displaystyle O(\eta).

with high probability. Similarly, we have

\left|\|\bm{\beta}^{(2)}\|_{2}^{2}\cdot\int\frac{b_{3}\lambda\lambda^{(1)}+(b_{4}+\lambda)\lambda\lambda^{(2)}}{(b_{1}\lambda^{(1)}+b_{2}\lambda^{(2)}+\lambda)^{2}}d\hat{G}_{p}(\lambda^{(1)},\lambda^{(2)})-\frac{1}{\eta}(\overline{B}(0)-\overline{B}(\eta))\right|=O(\eta),

where (b_{3},b_{4}):=\frac{\partial}{\partial\eta}(b_{1},b_{2}) satisfies the last two equations in (A.18). Combining the previous two displays and setting \eta=p^{-1/4}\lambda^{-3/2}, we get

|B(𝜷^λ;𝜷(2))𝜷(2)22b3λλ(1)+(b4+λ)λλ(2)(b1λ(1)+b2λ(2)+λ)2𝑑G^p(λ(1),λ(2))|\displaystyle\left|B(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)})-\|\bm{\beta}^{(2)}\|_{2}^{2}\cdot\int\frac{b_{3}\lambda\lambda^{(1)}+(b_{4}+\lambda)\lambda\lambda^{(2)}}{(b_{1}\lambda^{(1)}+b_{2}\lambda^{(2)}+\lambda)^{2}}d\hat{G}_{p}(\lambda^{(1)},\lambda^{(2)})\right|
\displaystyle\leq 1η(|Bˇ(0)B¯(0)|+|Bˇ(η)B¯(η)|)+O(η)\displaystyle\frac{1}{\eta}\left(|\check{B}(0)-\overline{B}(0)|+|\check{B}(\eta)-\overline{B}(\eta)|\right)+O(\eta)
=\displaystyle= O(η1p1/2λ3)+O(η)\displaystyle O(\eta^{-1}p^{-1/2}\lambda^{-3})+O(\eta)
=\displaystyle= O(p1/4λ3/2)p1/4+cλ3/2\displaystyle O(p^{-1/4}\lambda^{-3/2})\prec p^{-1/4+c}\lambda^{-3/2}

for any small constant c>0c>0, thereby obtaining (A.16).
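The \eta-differentiation trick for the bias admits the same kind of numerical check. In the sketch below, \check{B}(\eta) is a simplified, unrotated stand-in for (D.81) (an assumption on our part, with illustrative sizes and spectra); its divided difference over [0,\eta] recovers B_{1}(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)})=\lambda^{2}\bm{\beta}^{(2)\top}(\hat{\bm{\Sigma}}+\lambda\bm{I})^{-1}\bm{\Sigma}^{(2)}(\hat{\bm{\Sigma}}+\lambda\bm{I})^{-1}\bm{\beta}^{(2)} up to O(\eta).

```python
import numpy as np

rng = np.random.default_rng(3)
p, n1, n2, lam, eta = 400, 150, 150, 0.3, 1e-3
n = n1 + n2
l1, l2 = rng.uniform(0.5, 2.0, p), rng.uniform(0.5, 2.0, p)
X1 = rng.standard_normal((n1, p)) * np.sqrt(l1)
X2 = rng.standard_normal((n2, p)) * np.sqrt(l2)
S_hat, S2 = (X1.T @ X1 + X2.T @ X2) / n, np.diag(l2)
beta = rng.standard_normal(p) / np.sqrt(p)          # ||beta||_2 of order 1

def B_check(e):
    # lam * beta^T (S_hat + lam I + lam * e * Sigma^(2))^{-1} beta
    return lam * beta @ np.linalg.solve(S_hat + lam * np.eye(p) + lam * e * S2, beta)

A_inv = np.linalg.inv(S_hat + lam * np.eye(p))
B_exact = lam ** 2 * beta @ A_inv @ S2 @ A_inv @ beta
B_fd = (B_check(0.0) - B_check(eta)) / eta
print("exact bias term         :", B_exact)
print("finite-difference proxy :", B_fd)
```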

D.3 Proof of Theorem 4.1

The proof involves the following three components.

  1. (i)

    First, we prove that the limiting risk of the ridge estimator is close to the limiting risk of the interpolator.

    We first consider the limits of the variances, which are defined respectively as

    𝒱(d)(𝜷^)\displaystyle\mathcal{V}^{(d)}(\hat{\bm{\beta}}) :=σ2γλ(2)(a~3λ(1)+a~4λ(2))(a~1λ(1)+a~2λ(2)+1)2𝑑H^p(λ(1),λ(2))\displaystyle:=-\sigma^{2}\gamma\int\frac{\lambda^{(2)}(\tilde{a}_{3}\lambda^{(1)}+\tilde{a}_{4}\lambda^{(2)})}{(\tilde{a}_{1}\lambda^{(1)}+\tilde{a}_{2}\lambda^{(2)}+1)^{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}) (D.82)
    𝒱(d)(𝜷^λ)\displaystyle\mathcal{V}^{(d)}(\hat{\bm{\beta}}_{\lambda}) :=σ2γλ(1)λ(2)(a1a3λ)+(λ(2))2(a2a4λ)(a1λ(1)+a2λ(2)+λ)2𝑑H^p(λ(1),λ(2))\displaystyle:=\sigma^{2}\gamma\int\frac{\lambda^{(1)}\lambda^{(2)}(a_{1}-a_{3}\lambda)+(\lambda^{(2)})^{2}(a_{2}-a_{4}\lambda)}{(a_{1}\lambda^{(1)}+a_{2}\lambda^{(2)}+\lambda)^{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})

    To this end, we first show that

    aˇk:=ak/λ=a~k+O(λ),k=1,2.\check{a}_{k}:=a_{k}/\lambda=\tilde{a}_{k}+O(\lambda),\ \ k=1,2. (D.83)

    With the definition \check{a}_{k}:=a_{k}/\lambda from (D.83), the first two equations of (A.17) become

    fˇ(aˇ1,aˇ2)\displaystyle\check{f}(\check{a}_{1},\check{a}_{2}) :=r1γaˇ1λ(1)aˇ1λ(1)+aˇ2λ(2)+1𝑑H^p(λ(1),λ(2))=aˇ1λ,\displaystyle:=r_{1}-\gamma\int\frac{\check{a}_{1}\lambda^{(1)}}{\check{a}_{1}\lambda^{(1)}+\check{a}_{2}\lambda^{(2)}+1}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})=\check{a}_{1}\lambda, (D.84)
    \displaystyle\check{g}(\check{a}_{1},\check{a}_{2}) :=r_{2}-\gamma\int\frac{\check{a}_{2}\lambda^{(2)}}{\check{a}_{1}\lambda^{(1)}+\check{a}_{2}\lambda^{(2)}+1}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})=\check{a}_{2}\lambda.

    Let \check{\bm{F}}(\check{\bm{a}}):=(\check{f}(\check{\bm{a}}),\check{g}(\check{\bm{a}})) where \check{\bm{a}}:=(\check{a}_{1},\check{a}_{2}). From the definition in (D.83) and (D.16), there exist constants c,C>0 such that c\leq\check{a}_{1},\check{a}_{2}\leq C. This yields \check{\bm{F}}(\check{\bm{a}})=(\check{a}_{1}\lambda,\check{a}_{2}\lambda)\sim\lambda. Also, similar to how we derived (D.16), we can show that there exist constants c,C>0 such that

    ca~1,a~2Cc\leq\tilde{a}_{1},\tilde{a}_{2}\leq C (D.85)

    On the other hand, similar to how we obtained (D.20) and (D.23), letting \check{\bm{J}} be the Jacobian of \check{\bm{F}}, we can show that \|\check{\bm{J}}^{-1}(\check{\bm{a}})\|_{\textnormal{op}}=O(1) for all \check{\bm{a}}\in[c,C]^{2}. Together with the fact that \check{\bm{F}}(\tilde{\bm{a}})=0 from (4.4), we obtain (D.83); a numerical sketch of this \lambda\to 0 limit is given after the proof.

    Now, taking the derivative of (D.84) w.r.t. \lambda, we get

    aˇ1+aˇ1λ\displaystyle\check{a}_{1}+\check{a}_{1}^{\prime}\lambda =γaˇ1λ(1)+λ(1)λ(2)(aˇ1aˇ2aˇ2aˇ1)(aˇ1λ(1)+aˇ2λ(2)+1)2𝑑H^p(λ(1),λ(2)),\displaystyle=-\gamma\int\frac{\check{a}_{1}^{\prime}\lambda^{(1)}+\lambda^{(1)}\lambda^{(2)}(\check{a}_{1}^{\prime}\check{a}_{2}-\check{a}_{2}^{\prime}\check{a}_{1})}{(\check{a}_{1}\lambda^{(1)}+\check{a}_{2}\lambda^{(2)}+1)^{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}), (D.86)
    aˇ2+aˇ2λ\displaystyle\check{a}_{2}+\check{a}_{2}^{\prime}\lambda =γaˇ2λ(2)+λ(1)λ(2)(aˇ2aˇ1aˇ1aˇ2)(aˇ1λ(1)+aˇ2λ(2)+1)2𝑑H^p(λ(1),λ(2)),\displaystyle=-\gamma\int\frac{\check{a}_{2}^{\prime}\lambda^{(2)}+\lambda^{(1)}\lambda^{(2)}(\check{a}_{2}^{\prime}\check{a}_{1}-\check{a}_{1}^{\prime}\check{a}_{2})}{(\check{a}_{1}\lambda^{(1)}+\check{a}_{2}\lambda^{(2)}+1)^{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}),

    where (\check{a}_{1}^{\prime},\check{a}_{2}^{\prime}) denote the respective derivatives of (\check{a}_{1},\check{a}_{2}) w.r.t. \lambda. Since c\leq\check{a}_{1},\check{a}_{2}\leq C, we know (\check{a}_{1}^{\prime},\check{a}_{2}^{\prime})=O(\lambda^{-1}), which means the RHS of (D.86) is O(1). Thus the LHS is also O(1), which yields (\check{a}_{1}^{\prime},\check{a}_{2}^{\prime})=O(1) and hence (\check{a}_{1}^{\prime}\lambda,\check{a}_{2}^{\prime}\lambda)=O(\lambda). Now combining (D.86) and the last two equations of (4.4), a similar analysis yields

    aˇ1=a~3+O(λ),aˇ2=a~4+O(λ)\check{a}_{1}^{\prime}=\tilde{a}_{3}+O(\lambda),\ \ \check{a}_{2}^{\prime}=\tilde{a}_{4}+O(\lambda) (D.87)

    Finally, from the second equation of (D.82) and the fact that ak=aˇkλak+2:=ak=aˇk+aˇkλa_{k}=\check{a}_{k}\lambda\Rightarrow a_{k+2}:=a_{k}^{\prime}=\check{a}_{k}+\check{a}_{k}^{\prime}\lambda, we know

    𝒱(d)(𝜷^λ):=σ2γλ(2)(aˇ1λ(1)+aˇ2λ(2))(aˇ1λ(1)+aˇ2λ(2)+1)2𝑑H^p(λ(1),λ(2)),\mathcal{V}^{(d)}(\hat{\bm{\beta}}_{\lambda}):=-\sigma^{2}\gamma\int\frac{\lambda^{(2)}(\check{a}_{1}^{\prime}\lambda^{(1)}+\check{a}_{2}^{\prime}\lambda^{(2)})}{(\check{a}_{1}\lambda^{(1)}+\check{a}_{2}\lambda^{(2)}+1)^{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)}),

    which, when combined with (D.83) and (D.87), yields

    𝒱(d)(𝜷^λ)=𝒱(d)(𝜷^)+O(λ).\mathcal{V}^{(d)}(\hat{\bm{\beta}}_{\lambda})=\mathcal{V}^{(d)}(\hat{\bm{\beta}})+O(\lambda). (D.88)

    Using similar analysis we can also show that the bias limits, defined respectively as

    (d)(𝜷^)\displaystyle\mathcal{B}^{(d)}(\hat{\bm{\beta}}) :=𝜷(2)22b~3λ(1)+(b~4+1)λ(2)(b~1λ(1)+b~2λ(2)+1)2𝑑G^p(λ(1),λ(2))\displaystyle:=\|\bm{\beta}^{(2)}\|_{2}^{2}\cdot\int\frac{\tilde{b}_{3}\lambda^{(1)}+(\tilde{b}_{4}+1)\lambda^{(2)}}{(\tilde{b}_{1}\lambda^{(1)}+\tilde{b}_{2}\lambda^{(2)}+1)^{2}}d\hat{G}_{p}(\lambda^{(1)},\lambda^{(2)}) (D.89)
    (d)(𝜷^λ)\displaystyle\mathcal{B}^{(d)}(\hat{\bm{\beta}}_{\lambda}) :=𝜷(2)22b3λλ(1)+(b4+λ)λλ(2)(b1λ(1)+b2λ(2)+λ)2𝑑G^p(λ(1),λ(2)),\displaystyle:=\|\bm{\beta}^{(2)}\|_{2}^{2}\cdot\int\frac{b_{3}\lambda\lambda^{(1)}+(b_{4}+\lambda)\lambda\lambda^{(2)}}{(b_{1}\lambda^{(1)}+b_{2}\lambda^{(2)}+\lambda)^{2}}d\hat{G}_{p}(\lambda^{(1)},\lambda^{(2)}),

    are close:

    (d)(𝜷^λ)=(d)(𝜷^)+O(λ).\mathcal{B}^{(d)}(\hat{\bm{\beta}}_{\lambda})=\mathcal{B}^{(d)}(\hat{\bm{\beta}})+O(\lambda). (D.90)
  2. (ii)

    Next, we show that the ridge risk is close to the interpolator risk with high probability.

    We begin by showing V(𝜷^;𝜷(2))V(\hat{\bm{\beta}};\bm{\beta}^{(2)}) and V(𝜷^λ;𝜷(2))V(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)}) are close, where the quantities are defined in (2.13) and (A.1) respectively. To this end, denote the singular value decomposition of 𝚺^\hat{\bm{\Sigma}} by 𝚺^=𝑼𝑫𝑼\hat{\bm{\Sigma}}=\bm{U}\bm{D}\bm{U}^{\top}. Then the quantities (2.13) and (A.1) can be rewritten as

    V(𝜷^;𝜷(2))\displaystyle V(\hat{\bm{\beta}};\bm{\beta}^{(2)}) =σ2nTr(𝑼𝑫1𝟏𝑫>0𝑼𝚺(2))\displaystyle=\frac{\sigma^{2}}{n}\textnormal{Tr}(\bm{U}\bm{D}^{-1}\bm{1}_{\bm{D}>0}\bm{U}^{\top}\bm{\Sigma}^{(2)})
    V(𝜷^λ;𝜷(2))\displaystyle V(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)}) =σ2nTr(𝑼(𝑫+λ𝑰)2𝑫𝑼𝚺(2)),\displaystyle=\frac{\sigma^{2}}{n}\textnormal{Tr}(\bm{U}(\bm{D}+\lambda\bm{I})^{-2}\bm{D}\bm{U}^{\top}\bm{\Sigma}^{(2)}),

    where \bm{1}_{\bm{D}>0} is the diagonal matrix whose (i,i)-th entry equals 1 if D_{ii}>0 and 0 otherwise. Therefore, we have,

    |V(𝜷^;𝜷(2))V(𝜷^λ;𝜷(2))|\displaystyle|V(\hat{\bm{\beta}};\bm{\beta}^{(2)})-V(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)})| (D.91)
    \displaystyle\leq σ2n𝑼𝚺(2)𝑼opTr(𝑫1𝟏𝑫>0(𝑫+λ𝑰)2𝑫)\displaystyle\frac{\sigma^{2}}{n}\|\bm{U}^{\top}\bm{\Sigma}^{(2)}\bm{U}\|_{\textnormal{op}}\textnormal{Tr}(\bm{D}^{-1}\bm{1}_{\bm{D}>0}-(\bm{D}+\lambda\bm{I})^{-2}\bm{D})
    \displaystyle\lesssim σ2nTr(𝑫1𝟏𝑫>0(𝑫+λ𝑰)2𝑫)\displaystyle\frac{\sigma^{2}}{n}\textnormal{Tr}(\bm{D}^{-1}\bm{1}_{\bm{D}>0}-(\bm{D}+\lambda\bm{I})^{-2}\bm{D})
    \displaystyle\lesssim λσ2λmin2(Σ^)=O(λ),\displaystyle\frac{\lambda\sigma^{2}}{\lambda^{2}_{\min}(\hat{\Sigma})}=O(\lambda),

    where \lambda_{\min} denotes the minimum nonzero eigenvalue. Here, the second inequality follows from the fact that \|\bm{\Sigma}^{(2)}\|_{\textnormal{op}}\lesssim 1, the third inequality holds because x^{-1}-(x+\lambda)^{-2}x\leq 2\lambda/x^{2} for x>0, and the final bound O(\lambda) follows from the lemma below; a quick Monte Carlo illustration of this lemma is included after the proof of the theorem.

    Lemma D.11.

    Consider the same setup as Theorem 4.1. Recall \hat{\bm{\Sigma}}=\frac{1}{n}(\bm{X}^{(1)\top}\bm{X}^{(1)}+\bm{X}^{(2)\top}\bm{X}^{(2)}). Let \lambda_{\min}(\hat{\bm{\Sigma}}) denote the smallest nonzero eigenvalue of \hat{\bm{\Sigma}}. Then there exists a positive constant c such that, with high probability,

    λmin2(𝚺^)c\lambda_{\min}^{2}(\hat{\bm{\Sigma}})\geq c (D.92)
    Proof.

    We know λmin(𝚺^)=λmin(1n𝑿𝑿)\lambda_{\min}(\hat{\bm{\Sigma}})=\lambda_{\min}\left(\frac{1}{n}\bm{X}\bm{X}^{\top}\right) where

    𝑿=[(𝚺(1))1/2𝒁(1),(𝚺(2))1/2𝒁(2)]\bm{X}^{\top}=[(\bm{\Sigma}^{(1)})^{1/2}\bm{Z}^{(1)\top},(\bm{\Sigma}^{(2)})^{1/2}\bm{Z}^{(2)\top}]

    and \bm{Z}^{(1)},\bm{Z}^{(2)} satisfy (2.3). From [14, Theorem 3.12], we know that with high probability,

    λmin(1n𝑿𝑿)λmin(1n𝑿G𝑿G)o(1),\lambda_{\min}\left(\frac{1}{n}\bm{X}\bm{X}^{\top}\right)\geq\lambda_{\min}\left(\frac{1}{n}\bm{X}_{G}\bm{X}_{G}^{\top}\right)-o(1), (D.93)

    where 𝑿G=[(𝚺(1))1/2𝒁G(1),(𝚺(2))1/2𝒁G(2)]\bm{X}_{G}^{\top}=[(\bm{\Sigma}^{(1)})^{1/2}\bm{Z}_{G}^{(1)\top},(\bm{\Sigma}^{(2)})^{1/2}\bm{Z}_{G}^{(2)\top}] and 𝒁G(1),𝒁G(2)\bm{Z}_{G}^{(1)},\bm{Z}_{G}^{(2)} have i.i.d. standard Gaussian entries instead. Since we assumed 𝚺(1)\bm{\Sigma}^{(1)} and 𝚺(2)\bm{\Sigma}^{(2)} are simultaneously diagonalizable, we further have, by rotational invariance,

    λmin(1n𝑿G𝑿G)=λmin(1n𝒁ˇ𝒁ˇ)\lambda_{\min}\left(\frac{1}{n}\bm{X}_{G}\bm{X}_{G}^{\top}\right)=\lambda_{\min}\left(\frac{1}{n}\check{\bm{Z}}\check{\bm{Z}}^{\top}\right)

    where \check{\bm{Z}}^{\top}=[(\bm{\Lambda}^{(1)})^{1/2}\bm{Z}_{G}^{(1)\top},(\bm{\Lambda}^{(2)})^{1/2}\bm{Z}_{G}^{(2)\top}] has (blockwise) diagonal covariances. From [46], we obtain that with high probability, \lambda_{\min}^{2}\left(\frac{1}{n}\check{\bm{Z}}\check{\bm{Z}}^{\top}\right)=\check{\lambda}+o(1) where \check{\lambda} is the solution to the following variational problem:

    \displaystyle\check{\lambda} :=\sup_{x_{1},...,x_{n}<0}\min_{1\leq i\leq n}\left\{\frac{1}{x_{i}}+\sum_{j=1}^{p}\frac{\frac{1}{n}\big(\lambda_{j}^{(1)}\bm{1}(i\leq n_{1})+\lambda_{j}^{(2)}\bm{1}(i>n_{1})\big)}{1-\sum_{k=1}^{n_{1}}\frac{1}{n}\lambda_{j}^{(1)}x_{k}-\sum_{k=n_{1}+1}^{n}\frac{1}{n}\lambda_{j}^{(2)}x_{k}}\right\} (D.94)
    \displaystyle=\sup_{\alpha_{1},\alpha_{2}>0}\min_{k=1,2}\left\{-\frac{r_{k}}{\alpha_{k}}+\gamma\int\frac{\lambda^{(k)}}{\alpha_{1}\lambda^{(1)}+\alpha_{2}\lambda^{(2)}+1}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})\right\},

    where recall that r_{k}=n_{k}/n, k=1,2 and d\hat{H}_{p} is defined by (4.1). The second equality above follows from the fact that the first line is maximized at x_{1}=...=x_{n_{1}}, x_{n_{1}+1}=...=x_{n}, together with the change of variables \alpha_{1}:=-r_{1}x_{1}, \alpha_{2}:=-r_{2}x_{n_{1}+1}.

    Recall the definition of (\tilde{a}_{1},\tilde{a}_{2}) in (4.4). Setting \alpha_{k}=2\tilde{a}_{k} for k=1,2, we have

    rkαk+γλ(k)α1λ(1)+α2λ(2)+1𝑑H^p(λ(1),λ(2))\displaystyle-\frac{r_{k}}{\alpha_{k}}+\gamma\int\frac{\lambda^{(k)}}{\alpha_{1}\lambda^{(1)}+\alpha_{2}\lambda^{(2)}+1}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})
    =\displaystyle= 1αk(rkγa~kλ(k)a~1λ(1)+a~2λ(2)+12𝑑H^p(λ(1),λ(2)))\displaystyle-\frac{1}{\alpha_{k}}\left(r_{k}-\gamma\int\frac{\tilde{a}_{k}\lambda^{(k)}}{\tilde{a}_{1}\lambda^{(1)}+\tilde{a}_{2}\lambda^{(2)}+\frac{1}{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})\right)
    \displaystyle= \frac{\gamma}{2\tilde{a}_{k}}\int\frac{\frac{1}{2}\tilde{a}_{k}\lambda^{(k)}}{(\tilde{a}_{1}\lambda^{(1)}+\tilde{a}_{2}\lambda^{(2)}+\frac{1}{2})(\tilde{a}_{1}\lambda^{(1)}+\tilde{a}_{2}\lambda^{(2)}+1)}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})
    \displaystyle\geq c\displaystyle c

    for some constant c>0c>0, where the last line follows from (D.85). Now combining with (D.94) and (D.93), we obtain (D.92), thereby finishing the proof. ∎

    Now we turn to show that B(\hat{\bm{\beta}};\bm{\beta}^{(2)}) and B(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)}) are close, where the quantities are defined via (2.14) and (A.2) respectively. Since \tilde{\bm{\beta}}=0, we only need to compare B_{1}(\hat{\bm{\beta}};\bm{\beta}^{(2)}) and B_{1}(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)}). Note that,

    B1(𝜷^;𝜷(2))\displaystyle B_{1}(\hat{\bm{\beta}};\bm{\beta}^{(2)}) =𝜷(2)(𝚺^𝚺^𝑰)𝚺(2)(𝚺^𝚺^𝑰)𝜷(2)\displaystyle=\bm{\beta}^{(2)^{\top}}(\hat{\bm{\Sigma}}^{\dagger}\hat{\bm{\Sigma}}-\bm{I})\bm{\Sigma}^{(2)}(\hat{\bm{\Sigma}}^{\dagger}\hat{\bm{\Sigma}}-\bm{I})\bm{\beta}^{(2)}
    =𝜷(2)𝑼𝟏𝑫=0𝑼𝚺(2)𝑼𝟏𝑫=0𝑼𝜷(2)\displaystyle=\bm{\beta}^{(2)^{\top}}\bm{U}\bm{1}_{\bm{D}=0}\bm{U}^{\top}\bm{\Sigma}^{(2)}\bm{U}\bm{1}_{\bm{D}=0}\bm{U}^{\top}\bm{\beta}^{(2)}
    :=𝜷(2)𝑼𝟏𝑫=0𝑨𝟏𝑫=0𝑼𝜷(2)=𝑨1/2𝟏𝑫=0𝑼𝜷(2)22,\displaystyle:=\bm{\beta}^{(2)^{\top}}\bm{U}\bm{1}_{\bm{D}=0}\bm{A}\bm{1}_{\bm{D}=0}\bm{U}^{\top}\bm{\beta}^{(2)}=\|\bm{A}^{1/2}\bm{1}_{\bm{D}=0}\bm{U}^{\top}\bm{\beta}^{(2)}\|^{2}_{2},

    where we defined 𝑨=𝑼𝚺(2)𝑼\bm{A}=\bm{U}^{\top}\bm{\Sigma}^{(2)}\bm{U}. Also,

    B1(𝜷^λ;𝜷(2))\displaystyle B_{1}(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)}) =λ2𝜷(2)(𝚺^+λ𝑰)1𝚺(2)(𝚺^+λ𝑰)1𝜷(2)\displaystyle=\lambda^{2}\bm{\beta}^{(2)^{\top}}(\hat{\bm{\Sigma}}+\lambda\bm{I})^{-1}\bm{\Sigma}^{(2)}(\hat{\bm{\Sigma}}+\lambda\bm{I})^{-1}\bm{\beta}^{(2)}
    =λ2𝜷(2)𝑼(𝑫+λ𝑰)1𝑨(𝑫+λ𝑰)1𝑼𝜷(2)\displaystyle=\lambda^{2}\bm{\beta}^{(2)^{\top}}\bm{U}(\bm{D}+\lambda\bm{I})^{-1}\bm{A}(\bm{D}+\lambda\bm{I})^{-1}\bm{U}^{\top}\bm{\beta}^{(2)}
    =𝑨1/2λ(𝑫+λ𝑰)1𝑼𝜷(2)22.\displaystyle=\|\bm{A}^{1/2}\lambda(\bm{D}+\lambda\bm{I})^{-1}\bm{U}^{\top}\bm{\beta}^{(2)}\|^{2}_{2}.

    Therefore, we have

    |B11/2(𝜷^;𝜷(2))B11/2(𝜷^λ;𝜷(2))|\displaystyle|B^{1/2}_{1}(\hat{\bm{\beta}};\bm{\beta}^{(2)})-B^{1/2}_{1}(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)})| 𝑨1/2(𝟏𝑫=0λ(𝑫+λ𝑰)1)𝑼𝜷(2)2\displaystyle\leq\|\bm{A}^{1/2}(\bm{1}_{\bm{D}=0}-\lambda(\bm{D}+\lambda\bm{I})^{-1})\bm{U}^{\top}\bm{\beta}^{(2)}\|_{2}
    \displaystyle\lesssim\|\bm{A}\|^{1/2}_{\textnormal{op}}\|\lambda(\bm{D}+\lambda\bm{I})^{-1}\bm{1}_{\bm{D}>0}\|_{\textnormal{op}}
    λλmin(Σ^)=O(λ),\displaystyle\lesssim\frac{\lambda}{\lambda_{\min}(\hat{\Sigma})}=O(\lambda),

    where the second inequality holds as \|\bm{A}\|_{\textnormal{op}}=\|\bm{\Sigma}^{(2)}\|_{\textnormal{op}}\lesssim 1 and the last inequality follows from Lemma D.11. Moreover, since B_{1}(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)}),B_{1}(\hat{\bm{\beta}};\bm{\beta}^{(2)})\lesssim 1, we have

    |B1(𝜷^;𝜷(2))B1(𝜷^λ;𝜷(2))|=O(λ).|B_{1}(\hat{\bm{\beta}};\bm{\beta}^{(2)})-B_{1}(\hat{\bm{\beta}}_{\lambda};\bm{\beta}^{(2)})|=O(\lambda). (D.95)
  3. (iii)

    Now we combine the previous displays to obtain the desired results. Combining (A.15), (D.88) and (D.91), we get

    V(𝜷^;𝜷(2))=σ2γλ(2)(a~3λ(1)+a~4λ(2))(a~1λ(1)+a~2λ(2)+1)2𝑑H^p(λ(1),λ(2))+O(λ9/2p1/2+c)+O(λ).V(\hat{\bm{\beta}};\bm{\beta}^{(2)})=-\sigma^{2}\gamma\int\frac{\lambda^{(2)}(\tilde{a}_{3}\lambda^{(1)}+\tilde{a}_{4}\lambda^{(2)})}{(\tilde{a}_{1}\lambda^{(1)}+\tilde{a}_{2}\lambda^{(2)}+1)^{2}}d\hat{H}_{p}(\lambda^{(1)},\lambda^{(2)})+O(\lambda^{-9/2}p^{-1/2+c})+O(\lambda).

    Combining (A.16), (D.90) and (D.95), we get

    B(𝜷^;𝜷(2))=𝜷(2)22b~3λ(1)+(b~4+1)λ(2)(b~1λ(1)+b~2λ(2)+1)2𝑑G^p(λ(1),λ(2))+O(λ3/2p1/4+c)+O(λ).B(\hat{\bm{\beta}};\bm{\beta}^{(2)})=\|\bm{\beta}^{(2)}\|_{2}^{2}\cdot\int\frac{\tilde{b}_{3}\lambda^{(1)}+(\tilde{b}_{4}+1)\lambda^{(2)}}{(\tilde{b}_{1}\lambda^{(1)}+\tilde{b}_{2}\lambda^{(2)}+1)^{2}}d\hat{G}_{p}(\lambda^{(1)},\lambda^{(2)})+O(\lambda^{-3/2}p^{-1/4+c})+O(\lambda).

    Taking \lambda=p^{-1/12} and choosing c small enough completes the proof.
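Two small numerical sketches complement the proof above; all dimensions and spectra are illustrative toy choices, and the systems are written in the fixed-point forms suggested by (D.84) and (4.4) rather than quoted verbatim. The first sketch illustrates part (i): the rescaled solutions \check{a}_{k}=a_{k}/\lambda of the \lambda>0 system approach the solution (\tilde{a}_{1},\tilde{a}_{2}) of (4.4) as \lambda\downarrow 0.

```python
import numpy as np

rng = np.random.default_rng(5)
p, n1, n2 = 900, 250, 250            # overparametrized: p > n
n = n1 + n2
gamma, r = p / n, np.array([n1 / n, n2 / n])
spec = np.column_stack([rng.uniform(0.5, 2.0, p), rng.uniform(0.5, 2.0, p)])

def solve_a(lam, iters=20000, damp=0.5):
    # lambda > 0 system: a_k = r_k / (1 + gamma * int l^(k)/(a1 l1 + a2 l2 + lam) dH_p)
    a = np.full(2, max(lam, 1e-3))
    for _ in range(iters):
        integ = gamma * (spec / (spec @ a + lam)[:, None]).mean(axis=0)
        a = (1 - damp) * a + damp * r / (1.0 + integ)
    return a

def solve_a_tilde(iters=20000, damp=0.5):
    # lambda = 0 system (4.4): r_k = gamma * int a_k l^(k)/(a1 l1 + a2 l2 + 1) dH_p
    a = np.ones(2)
    for _ in range(iters):
        integ = gamma * (spec / (spec @ a + 1.0)[:, None]).mean(axis=0)
        a = (1 - damp) * a + damp * r / integ
    return a

a_tilde = solve_a_tilde()
for lam in [0.1, 0.03, 0.01]:
    print("lambda =", lam, " a/lambda =", solve_a(lam) / lam, " vs a_tilde =", a_tilde)
```

The second sketch is a quick Monte Carlo illustration of Lemma D.11: for n_{1}+n_{2}<p, the smallest nonzero eigenvalue of \hat{\bm{\Sigma}}, equivalently the smallest eigenvalue of \frac{1}{n}\bm{X}\bm{X}^{\top}, stays bounded away from zero across repetitions.

```python
import numpy as np

rng = np.random.default_rng(4)
p, n1, n2, reps = 800, 200, 200, 20
n = n1 + n2
l1, l2 = rng.uniform(0.5, 2.0, p), rng.uniform(0.5, 2.0, p)

lams = []
for _ in range(reps):
    X1 = rng.standard_normal((n1, p)) * np.sqrt(l1)
    X2 = rng.standard_normal((n2, p)) * np.sqrt(l2)
    X = np.vstack([X1, X2])
    # smallest eigenvalue of the n x n Gram matrix = smallest nonzero eigenvalue of hat{Sigma}
    lams.append(np.linalg.eigvalsh(X @ X.T / n).min())
print("min over repetitions:", min(lams))   # stays of order 1, consistent with (D.92)
```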

D.4 Proof of Proposition 4.2

Assume throughout the example that n_{1}+n_{2}<p. Here, we want to compare \hat{R}(\bm{I}) with \hat{R}(\bm{M}), \bm{M}\in\mathcal{S}. To this end, plugging \bm{\Sigma}^{(1)}=\bm{\Sigma}^{(2)}=\bm{I} into Theorem 4.1 yields

R^(𝑰)=σ2npn+𝜷(2)22pnp+o(1),\hat{R}(\bm{I})=\sigma^{2}\cdot\frac{n}{p-n}+\|\bm{\beta}^{(2)}\|_{2}^{2}\cdot\frac{p-n}{p}+o(1), (D.96)

with high probability, where the first term is the variance and the second term is the bias.

Now consider any 𝚺(1)=𝑴𝒮\bm{\Sigma}^{(1)}=\bm{M}\in\mathcal{S}. Plugging in 𝚺(2)=𝑰\bm{\Sigma}^{(2)}=\bm{I} and combining (4.3) and the third equation of (4.5), we obtain that the bias is also 𝜷(2)22pnp\|\bm{\beta}^{(2)}\|_{2}^{2}\cdot\frac{p-n}{p}. Therefore we are left with comparing the variance.

Plugging in \bm{\Sigma}^{(2)}=\bm{I} and combining (4.2) and the third equation of (4.4), we obtain that the variance is \sigma^{2}\cdot(\tilde{a}_{1}+\tilde{a}_{2}). Moreover, the first and second equations of (4.4) yield \tilde{a}_{2}=\frac{n_{2}}{p-n}. Therefore, using (D.96), the proof is complete upon showing \tilde{a}_{1}\leq\frac{n_{1}}{p-n} if and only if n_{1}\leq p/2.

Since a~2=n2pn\tilde{a}_{2}=\frac{n_{2}}{p-n}, the second equation of (4.4) yields

1=i=1p/2[1a~1(pn)λi(1)+pn1+1a~1(pn)/λi(1)+pn1].1=\sum_{i=1}^{p/2}\left[\frac{1}{\tilde{a}_{1}(p-n)\lambda_{i}^{(1)}+p-n_{1}}+\frac{1}{\tilde{a}_{1}(p-n)/\lambda_{i}^{(1)}+p-n_{1}}\right]. (D.97)

We know the right hand side is a decreasing function of a~1\tilde{a}_{1}. Hence a~1n1pn\tilde{a}_{1}\leq\frac{n_{1}}{p-n} if and only if we have

1i=1p/2[1n1λi(1)+pn1+1n1/λi(1)+pn1]:=ψ(n1).1\geq\sum_{i=1}^{p/2}\left[\frac{1}{n_{1}\lambda_{i}^{(1)}+p-n_{1}}+\frac{1}{n_{1}/\lambda_{i}^{(1)}+p-n_{1}}\right]:=\psi(n_{1}).

Note by straightforward calculation, ψ(0)=ψ(p/2)=1\psi(0)=\psi(p/2)=1, ψ(p/2)>0\psi^{\prime}(p/2)>0, and ψ′′(x)>0\psi^{\prime\prime}(x)>0 for all 0xp0\leq x\leq p. Therefore, we have a~1n1pn\tilde{a}_{1}\leq\frac{n_{1}}{p-n} if and only if n1p/2n_{1}\leq p/2.
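The properties of \psi used above are straightforward to confirm numerically; the sketch below (with an illustrative reciprocal-pair spectrum of our own choosing) checks that \psi(0)=\psi(p/2)=1, that \psi is convex on a grid, and that \psi(n_{1})\leq 1 exactly when n_{1}\leq p/2.

```python
import numpy as np

rng = np.random.default_rng(6)
p = 200
lam1 = rng.uniform(0.5, 2.0, p // 2)   # p/2 eigenvalues lambda_i^(1); reciprocals fill the rest

def psi(x):
    return np.sum(1.0 / (x * lam1 + p - x) + 1.0 / (x / lam1 + p - x))

print("psi(0), psi(p/2):", psi(0.0), psi(p / 2))          # both equal 1
grid = np.linspace(0, p * 0.9, 50)
vals = np.array([psi(x) for x in grid])
print("psi <= 1 iff x <= p/2:", np.all((vals <= 1 + 1e-12) == (grid <= p / 2)))
second_diff = vals[2:] - 2 * vals[1:-1] + vals[:-2]
print("convex on the grid   :", np.all(second_diff > -1e-12))
```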

D.5 Proof of Proposition 4.3

Here, we want to study the effect of \kappa on the quantity \hat{R}(\bm{\Sigma}(\kappa)).

To analyze the quantity \hat{R}(\bm{\Sigma}(\kappa)), first note that, by the proof of Proposition 4.2, the bias stays identical for any \kappa and the variance is \sigma^{2}(\tilde{a}_{1}+\tilde{a}_{2}), where \tilde{a}_{1},\tilde{a}_{2} are defined by (4.4). Moreover, again by the proof of Proposition 4.2, one has \tilde{a}_{2}=\frac{n_{2}}{p-n}. Hence, we are left with analyzing \tilde{a}_{1}, which is given by (D.97) and simplifies here to

2p=[1a~1(pn)κ+pn1+1a~1(pn)/κ+pn1]\displaystyle\frac{2}{p}=\left[\frac{1}{\tilde{a}_{1}(p-n)\kappa+p-n_{1}}+\frac{1}{\tilde{a}_{1}(p-n)/\kappa+p-n_{1}}\right]
\displaystyle\implies a~122(pn)2p+2(pn1)2p+a~12(pn)(pn1)p(κ+1κ)=a~1(pn)(κ+1κ)+2(pn1)\displaystyle\tilde{a}^{2}_{1}\frac{2(p-n)^{2}}{p}+\frac{2(p-n_{1})^{2}}{p}+\tilde{a}_{1}\frac{2(p-n)(p-n_{1})}{p}(\kappa+\frac{1}{\kappa})=\tilde{a}_{1}(p-n)(\kappa+\frac{1}{\kappa})+2(p-n_{1})
\displaystyle\implies 2a~12(pn)2+a~1(pn)(p2n1)(κ+1κ)2n1(pn1)=0\displaystyle 2\tilde{a}_{1}^{2}(p-n)^{2}+\tilde{a}_{1}(p-n)(p-2n_{1})(\kappa+\frac{1}{\kappa})-2n_{1}(p-n_{1})=0

If n1<p/2n_{1}<p/2, the linear term in the above quadratic equation is positive, implying a~1\tilde{a}_{1} is smaller for larger values of (κ+1κ)(\kappa+\frac{1}{\kappa}). Hence, a~1\tilde{a}_{1} is a decreasing function in κ\kappa, implying R^(𝚺(κ))\hat{R}(\bm{\Sigma}(\kappa)) decreases as κ\kappa increases.

Similarly, if n_{1}>p/2, the linear term in the above quadratic equation is negative, and hence \hat{R}(\bm{\Sigma}(\kappa)) increases as \kappa increases.

Finally, if n_{1}=p/2, the linear term vanishes, so \hat{R}(\bm{\Sigma}(\kappa)) does not depend on \kappa.
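To see the three regimes concretely, the following sketch (toy sizes only) solves the quadratic displayed above for its positive root \tilde{a}_{1} at several values of \kappa\geq 1; since the variance equals \sigma^{2}(\tilde{a}_{1}+\tilde{a}_{2}) with \tilde{a}_{2} fixed, the printed roots directly reflect the behavior of \hat{R}(\bm{\Sigma}(\kappa)).

```python
import numpy as np

# Positive root of 2 a^2 (p-n)^2 + a (p-n)(p-2 n1)(kappa + 1/kappa) - 2 n1 (p-n1) = 0.
def a1_root(p, n1, n2, kappa):
    n = n1 + n2
    s = kappa + 1.0 / kappa
    A, B, C = 2.0 * (p - n) ** 2, (p - n) * (p - 2 * n1) * s, -2.0 * n1 * (p - n1)
    return (-B + np.sqrt(B ** 2 - 4 * A * C)) / (2 * A)

p, n2 = 1000, 200
for n1 in [300, 500, 700]:          # n1 < p/2, n1 = p/2, n1 > p/2
    roots = [a1_root(p, n1, n2, k) for k in [1.0, 2.0, 5.0, 10.0]]
    print(n1, [round(x, 4) for x in roots])
# n1 = 300: the root decreases in kappa; n1 = 700: it increases; n1 = 500: it is constant.
```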