Generalization error of min-norm interpolators in transfer learning
Abstract
This paper establishes the generalization error of pooled min--norm interpolation in transfer learning where data from diverse distributions are available. Min-norm interpolators emerge naturally as implicitly regularized limits of modern machine learning algorithms. Previous work characterized their out-of-distribution risk when samples from the test distribution are unavailable during training. However, in many applications, a limited amount of test data may be available during training, yet properties of min-norm interpolation in this setting are not well understood. We address this gap by characterizing the bias and variance of pooled min--norm interpolation under covariate and model shifts. The pooled interpolator captures both early fusion and a form of intermediate fusion. Our results have several implications: under model shift, for low signal-to-noise ratio (SNR), adding data always hurts. For higher SNR, transfer learning helps as long as the shift-to-signal ratio (SSR) lies below a threshold that we characterize explicitly. By consistently estimating these ratios, we provide a data-driven method to determine: (i) when the pooled interpolator outperforms the target-based interpolator, and (ii) the optimal number of target samples that minimizes the generalization error. Under covariate shift, if the source sample size is small relative to the dimension, heterogeneity between domains improves the risk, and vice versa. We establish a novel anisotropic local law to achieve these characterizations, which may be of independent interest in random matrix theory. We supplement our theoretical characterizations with comprehensive simulations that demonstrate the finite-sample efficacy of our results.
1 Introduction
Recent advancements in deep learning methodology have uncovered a surprising phenomenon that defies conventional statistical wisdom: overfitting can yield remarkably effective generalization [5, 9, 10, 11]. In the overparametrized regime, complex models that achieve zero training error, i.e., models that interpolate the training data, have gained popularity [25, 29, 37]. However, as researchers grapple with increasingly large and heterogeneous datasets, the imperative to effectively integrate diverse sources of information has become crucial. Transfer learning [53] has emerged as a promising approach to address this challenge, allowing one to leverage knowledge from related datasets to boost performance on target problems. Yet, in the context of overparametrization, a crucial question arises: how should one leverage transfer learning in the presence of overfitting? Specifically, should diverse datasets be aggregated to construct a unified interpolator, or should they be handled individually? Intuition suggests that for closely related datasets, a pooled interpolator may yield superior performance compared to those derived from individual datasets. Conversely, for unrelated datasets, the opposite might be true. This paper investigates the interplay between interpolation risk and task relatedness, focusing particularly on overparametrized linear models and min-norm interpolation.
To achieve this, we explore two common ways datasets can differ: covariate shift, where the covariate distribution varies across different datasets, and model shift, where the conditional distribution of the outcome given the covariates differ across datasets. Covariate shift has been extensively studied through concepts such as likelihood ratio [40] and transfer exponent [28, 34, 47], among others. Similarly, a substantial body of literature has studied model shift problems [6, 35, 36, 41, 52, 47].
However, the aforementioned literature has primarily considered statistical approaches where explicit penalties are added to loss functions to aid learning in high dimensions, or non-parametric approaches. In modern machine learning, prediction rules and classifiers based on implicit regularization have become extremely popular. Implicit regularization is the phenomenon by which modern ML algorithms, under suitable initialization and step sizes, converge to special prediction rules that generalize well under overparametrization. In this context, min-norm interpolators have emerged as an important class of predictors that arise commonly as implicitly regularized limits of these algorithms [5, 18, 26, 27, 38, 44, 45, 51, 58, 56, 60]. However, properties of these interpolators under transfer learning are less well-studied.
A recent work [42] identified conditions for benign overfitting, i.e., data distribution choices under which min-norm interpolators are consistent under covariate shift. Another related work [47] characterized the prediction risk of ridge regression and min--norm interpolators, while [54] provided an asymptotic analysis of high-dimensional random feature regression under covariate shift. However, these works consider an out-of-distribution (OOD) setting, where no observation from the test distribution is utilized during training. Transfer learning is often required in settings where, along with abundant source data, a limited amount of test data is available, cf. [59]. Data from the test distribution can enhance transfer performance. However, min-norm interpolators in this context are not well understood. This paper addresses this crucial gap in the literature and characterizes the precise prediction performance of a pooled min--norm interpolator computed using both source and test data under overparametrization. Our contributions are summarized as follows:
-
(i)
We consider a min--norm interpolator constructed by combining the source and target data (henceforth referred to as the pooled min--norm interpolator). We show this estimator can be interpreted both as an early and an intermediate fusion estimator (see (2.10) for details). We characterize the finite-sample bias and variance of the pooled min--norm interpolator in the overparametrized regime under both model shift and covariate shift. Our results show that the generalization error of this interpolator exhibits the well-known double-descent phenomenon observed in the machine learning literature [9, 10].
-
(ii)
In the presence of model shift, we introduce a data-driven approach to estimate the threshold that governs when the pooled interpolator outperforms the target-based interpolator and also to obtain the optimal sample size from the target distribution that minimizes the risk of the pooled min--norm interpolator. Both the threshold and the optimal sample size are functions of the signal-to-noise ratio (SNR) and the shift-to-signal ratio (SSR), a measure of similarity between the datasets defined in Section 3.3. We provide consistent estimators for both SNR and SSR.
-
(iii)
Our findings on covariate shift reveal a surprising dichotomy under stylized settings: increased degrees of covariate shift can in some cases enhance the performance of the pooled min--norm interpolator (while the opposite happens in other regimes). This property is entirely dependent on the dimensionalities of the observed data (see Proposition 4.2). We hypothesize that such phenomena persist in more general settings.
-
(iv)
On the technical front, a significant mathematical contribution of our work is the derivation of an anisotropic law applicable to a broad class of random matrices (see Theorem D.1). This complements previous anisotropic local laws [33, 57] and may be of independent interest both in random matrix theory as well as in the study of other high-dimensional transfer learning problems, particularly those involving multiple covariance matrices.
-
(v)
We interpret the pooled min--norm interpolator as a ridgeless limit of a pooled ridge estimator (see (2.9)). As part of our results, we characterize the exact bias and variance of the pooled ridge estimator under aforementioned settings. This contributes independently to the expanding body of work on the generalization error of high-dimensional ridge regression [16, 29]. Our work complements the out-of-distribution error characterized by recent works [42, 47] where target data is not used in constructing the estimator.
-
(vi)
Finally, we complement our theoretical findings with extensive simulations that demonstrate the finite-sample efficacy of our theory. Our results exhibit remarkable finite-sample performance under both model shift (Figure 1) and covariate shift (Figure 2), as well as across varying shift levels, signal-to-noise ratios, and dimensionality ratios.
The remainder of the paper is structured as follows: In Section 2, we lay out our framework and model assumptions. Sections 3 and 4 provide detailed analyses of the generalization errors for model shift and covariate shift, respectively. A proof outline is provided in Section 5, followed by discussions and future directions in Section 6. Auxiliary results concerning pooled ridge regression are included in Appendix A, with detailed proofs of our main results deferred to the remaining appendices.
2 Setup
In this section, we present our formal mathematical setup.
2.1 Data Model
We consider the canonical transfer learning setting where random samples are observed from two different populations. Formally, we observe two datasets $(X_s, y_s)$ and $(X_t, y_t)$—one serves as the source data and the other as the target. Typically, the former has a larger sample size than the latter. (Some of our results extend to the general case of multiple source datasets; see Section 3.4.) We assume that the observed data come from linear models given by
(2.1) $y_k \;=\; X_k \beta_k + \varepsilon_k, \qquad k \in \{s, t\},$
where $k = s$ and $k = t$ correspond to the source and target respectively. Here, $y_k \in \mathbb{R}^{n_k}$ denotes the response vector, with $n_k$ representing the number of samples observed in population $k$, $X_k \in \mathbb{R}^{n_k \times p}$ denotes the observed covariate matrix from population $k$ with $p$ the number of features, $\beta_k \in \mathbb{R}^{p}$ are signal vectors independent of $X_k$, and $\varepsilon_k \in \mathbb{R}^{n_k}$ are noise vectors independent of both $X_k$ and $\beta_k$. We denote $n = n_s + n_t$ as the total number of samples.
Throughout the paper, we make the following assumptions on and which are standard in the random matrix literature [55, 2]. First, for the covariates , we assume
(2.2) $X_k \;=\; Z_k \,\Sigma_k^{1/2}, \qquad k \in \{s, t\},$
where $Z_k \in \mathbb{R}^{n_k \times p}$, $k \in \{s, t\}$, are random matrices whose entries are independent random variables with zero mean and unit variance. We further assume that there exists a constant by which the -th moment of each entry is bounded for some :
(2.3) |
The eigenvalues of , denoted as are bounded:
(2.4) |
Second, the noises are assumed to have i.i.d. entries with mean , variance , and bounded moments up to any order. That is, for any , there exists a constant such that
(2.5) |
Lastly, our analysis pertains to the regime where the sample size is smaller than the feature dimension, and the ratio between the sample size and the feature dimension remains bounded. Denoting , we assume
(2.6) |
As displayed, we are particularly interested in the overparameterized regime where , partly because it is more prevalent in modern machine learning applications. Moreover, the underparameterized regime, i.e. , is better understood in the literature, cf. [57] (they assumed , but their results remain valid for the general case ).
For convenience we summarize all our assumptions below:
Assumption 1.
Let be a small constant. are mutually independent. Moreover, for , the following holds:
We remark that we do not require any additional assumptions on the signals , except for their independence of . In fact, our results later reveal that the convergence of the risks of interest is conditional on , and depends on the signals only through their norms.
2.2 Estimator
Data fusion, the integration of information from diverse datasets, has been broadly categorized into three approaches in the literature: early, intermediate and late fusion [3, 49]. Early fusion combines information from multiple sources at the input level. Late fusion learns independently from each dataset and combines information at the end. Intermediate fusion achieves a middle ground and attempts to jointly learn information from both datasets in a more sophisticated manner than simply concatenating them. In this work, we consider a form of min-norm interpolation that captures both early fusion and a class of intermediate fusion strategies. To achieve early fusion using min-norm interpolation, one would naturally consider the pooled min--norm interpolator, defined as
(2.7) $\hat{\beta} \;=\; \operatorname*{arg\,min}_{b \in \mathbb{R}^p} \Big\{ \|b\|_2 \;:\; y_s = X_s b,\; y_t = X_t b \Big\}.$
This interpolator admits the following analytical expression:
(2.8) $\hat{\beta} \;=\; \big( X_s^\top X_s + X_t^\top X_t \big)^{+} \big( X_s^\top y_s + X_t^\top y_t \big),$
where for any matrix $A$, we denote its pseudoinverse by $A^{+}$. This interpolator is alternatively known as the “ridgeless” estimator, since it is the limit of the ridge estimator as the penalty term vanishes (noticed earlier by [29] in the context of a single training dataset):
(2.9) $\hat{\beta} \;=\; \lim_{\lambda \to 0^+} \hat{\beta}_\lambda, \qquad \hat{\beta}_\lambda \;:=\; \operatorname*{arg\,min}_{b \in \mathbb{R}^p} \Big\{ \|y_s - X_s b\|_2^2 + \|y_t - X_t b\|_2^2 + \lambda \|b\|_2^2 \Big\}.$
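The equivalence between the pseudoinverse formula (2.8) and the ridgeless limit (2.9) can be checked numerically. The following minimal sketch (purely illustrative; the dimensions and random data are ours, not the paper's) computes the pooled min--norm interpolator and compares it with pooled ridge estimators as the penalty shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_t, p = 60, 20, 200                      # overparametrized: n_s + n_t < p
X = rng.standard_normal((n_s + n_t, p))        # stacked source and target covariates
y = rng.standard_normal(n_s + n_t)             # stacked responses

beta_mn = np.linalg.pinv(X.T @ X) @ X.T @ y    # pooled min-norm interpolator, as in (2.8)
print(np.allclose(X @ beta_mn, y))             # it interpolates the pooled data

for lam in [1e-1, 1e-3, 1e-6]:                 # ridge estimators with shrinking penalty
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    print(lam, np.linalg.norm(beta_ridge - beta_mn))   # -> 0, matching the ridgeless limit (2.9)
```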
Coincidentally, although the pooled min--norm interpolator is motivated via early fusion, it can also capture a form of intermediate fusion, as described below. If we consider a weighted ridge loss with weights on the source and target data respectively, then the ridgeless limit of this loss recovers the pooled min--norm interpolator. This can be seen as follows: for any two weighting coefficients $w_s, w_t > 0$,
(2.10) $\hat{\beta} \;=\; \lim_{\lambda \to 0^+} \operatorname*{arg\,min}_{b \in \mathbb{R}^p} \Big\{ w_s \|y_s - X_s b\|_2^2 + w_t \|y_t - X_t b\|_2^2 + \lambda \|b\|_2^2 \Big\},$
indicating that the interpolator above is simultaneously the min--norm interpolator and the ridgeless estimator when the two datasets are weighted by $(w_s, w_t)$ in the intermediate fusion loss formulation.
Furthermore, the pooled min--norm interpolator (2.7) appears naturally as the limit of the gradient descent based linear estimator in the following way: Initialize and run gradient descent with iterates given by
with , then .
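The gradient-descent characterization can likewise be verified directly. The sketch below (illustrative parameters; zero initialization and a constant step size below the inverse squared operator norm are assumed) runs gradient descent on the pooled least-squares objective and recovers the pooled min--norm interpolator, since the iterates never leave the row space of the stacked design.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 80, 200
X, y = rng.standard_normal((n, p)), rng.standard_normal(n)   # stacked (pooled) data

beta_mn = X.T @ np.linalg.solve(X @ X.T, y)    # pooled min-norm interpolator
beta = np.zeros(p)                             # zero initialization
eta = 1.0 / np.linalg.norm(X, 2) ** 2          # constant step size below 1 / ||X||_op^2
for _ in range(20000):
    beta -= eta * X.T @ (X @ beta - y)         # gradient of 0.5 * ||y - X beta||^2

print(np.linalg.norm(beta - beta_mn))          # small: the iterates stay in the row space of X
```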
In recent years, min-norm interpolated solutions and their statistical properties have been extensively studied in numerous contexts [5, 8, 10, 11, 15, 37, 39, 38]. In fact, it is conjectured that this min-norm regularization is responsible for the superior statistical behavior of estimators in the overparametrized regime. In light of these results and the aforementioned formulations, we view our work as a first step toward understanding the behavior of this important family of estimators in transfer learning, through the lens of early and intermediate fusion.
We conclude this section by highlighting that computing such an interpolator requires only the summary statistics $X_k^\top X_k$ and $X_k^\top y_k$ for $k \in \{s, t\}$, rather than the entire datasets. This characteristic is particularly beneficial in fields such as medical or genomics research, where individual-level data may be highly sensitive and not permissible for sharing across institutions.
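A small sketch of this point (hypothetical variable names; not the paper's code): the pooled interpolator computed purely from per-dataset Gram matrices and cross-products coincides with the one computed from the concatenated raw data.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 150
X_s, y_s = rng.standard_normal((50, p)), rng.standard_normal(50)
X_t, y_t = rng.standard_normal((20, p)), rng.standard_normal(20)

gram_s, xty_s = X_s.T @ X_s, X_s.T @ y_s       # all the source site needs to share
gram_t, xty_t = X_t.T @ X_t, X_t.T @ y_t       # all the target site needs to share

beta_summary = np.linalg.pinv(gram_s + gram_t) @ (xty_s + xty_t)

# identical to the interpolator computed from the concatenated raw data
X, y = np.vstack([X_s, X_t]), np.concatenate([y_s, y_t])
print(np.allclose(beta_summary, np.linalg.pinv(X.T @ X) @ X.T @ y))
```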
2.3 Risk
To assess the performance of , we focus on its prediction risk at a new (unseen) test point drawn from the target distribution, i.e., . This out-of-sample prediction risk (hereafter simply referred to as risk) is defined as
(2.11) $R(\hat{\beta}; X_s, X_t) \;=\; \mathbb{E}\big[ (x_0^\top \hat{\beta} - x_0^\top \beta_t)^2 \,\big|\, X_s, X_t \big],$
where $x_0$ is independent of the training data and $\mathbb{E}[x_0 x_0^\top] = \Sigma_t$. Note that this risk differs from the mean-squared prediction error at the new data point only by the additive noise variance, which does not affect the relative performance and is therefore omitted. Also, this risk definition takes expectation over both the randomness in the new test point and the randomness in the training noises. The risk is conditional on the covariates. Despite this dependence, we will simply write $R(\hat{\beta})$ for conciseness since the context is clear. The risk admits a bias-variance decomposition:
(2.12) $R(\hat{\beta}) \;=\; \underbrace{\big\| \mathbb{E}[\hat{\beta} \mid X_s, X_t] - \beta_t \big\|_{\Sigma_t}^2}_{=:\, B(\hat{\beta})} \;+\; \underbrace{\operatorname{tr}\big( \operatorname{Cov}(\hat{\beta} \mid X_s, X_t)\, \Sigma_t \big)}_{=:\, V(\hat{\beta})}.$
Plugging in (2.9), we immediately obtain the following result, which we will use crucially throughout the sequel:
Lemma 2.1.
and bias
(2.14) | ||||
where is the (uncentered) sample covariance matrix obtained on appending the source and target samples, and is the signal shift vector.
Proof.
The proof is straightforward via plugging in respective definitions, and observing the fact that . ∎
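A Monte Carlo sanity check of Lemma 2.1 (a sketch under the simplifying assumptions of isotropic Gaussian covariates, a common noise level, and $\Sigma_t = I$, with illustrative sample sizes): conditioning on the covariates and redrawing only the noise recovers the bias and variance terms whose sum is the risk in (2.12).

```python
import numpy as np

rng = np.random.default_rng(3)
n_s, n_t, p, sigma = 60, 20, 150, 1.0
beta_t = rng.standard_normal(p) / np.sqrt(p)
beta_s = beta_t + 0.3 * rng.standard_normal(p) / np.sqrt(p)   # shifted source signal
X_s, X_t = rng.standard_normal((n_s, p)), rng.standard_normal((n_t, p))
X = np.vstack([X_s, X_t])
A = np.linalg.pinv(X.T @ X) @ X.T              # linear map y -> beta_hat, fixed given the covariates

ests = []
for _ in range(2000):                          # redraw only the training noise
    y = np.concatenate([X_s @ beta_s, X_t @ beta_t]) + sigma * rng.standard_normal(n_s + n_t)
    ests.append(A @ y)
ests = np.asarray(ests)

bias2 = np.sum((ests.mean(axis=0) - beta_t) ** 2)   # squared bias term (Sigma_t = I)
var = ests.var(axis=0).sum()                        # variance term
print(bias2, var, bias2 + var)                      # Monte Carlo version of the decomposition (2.12)
```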
In the next two sections, we lay out our main results, which involve precise risk characterizations of the pooled min--norm interpolator. The bias and variance terms, as decomposed in Lemma 2.1, will be presented separately.
Throughout the rest of the paper, we denote if there exists a constant such that for large enough . Moreover, we say an event happens with high probability (w.h.p.) if as .
3 Model Shift
We first investigate the effect of model shift. More precisely, we assume the covariates and come from the same distribution, but the underlying signal changes from to . We denote the shift in the signal parameter by . We first derive the precise risk of the pooled min--norm interpolator and then compare it with the min--norm interpolator that uses solely the target data.
3.1 Risk under isotropic design
Recall the formula for the risk in (2.12) and its decomposition in Lemma 2.1. In the sequel, we make the further isotropic assumption that and leave generalizations to future work. We split the risk into bias and variance following Lemma 2.1, and characterize their behavior separately. Note that yields . Therefore, our contribution focuses on characterizing the asymptotic behavior of and as defined in (2.13) and (2.14). Our main result in this regard is given below.
Theorem 3.1 (Risk under model shift).
Suppose Assumption 1 holds, and further assume that and are both Gaussian random matrices. Then, with high probability over the randomness of , we have
(3.1) | ||||
(3.2) | ||||
(3.3) |
We remark that Theorem 3.1 holds true for finite values of and . It characterizes the bias and variance of the pooled min--norm interpolator for the overparametrized regime . In fact, it shows that the bias can be decomposed into two terms (3.2) and (3.3), which depend only on the -norm of and respectively. Therefore, the generalization error is independent of the angle between and . The variance is decreasing in , but the bias is not necessarily monotone. We provide extensive numerical examples in Section 3.5 showing that the risk function still exhibits the well-known double-descent phenomenon. The main challenge in establishing Theorem 3.1 lies in the analysis of . To this end, we develop a novel local law whose proof involves extensive use of free probability. More details and explanations are provided in Section 5. The theorem allows us to answer the questions addressed in the following subsections.
3.2 Performance Comparison
In this section, we compare the risk of , the pooled min--norm interpolator, with that of the interpolator using only the target dataset, defined as follows:
(3.4) |
The risk of has been previously studied:
Proposition 3.2 (Corollary of Theorem 1 in [29]).
The above proposition, along with our results on model shift (Theorem 3.1), allows us to characterize when the pooled min--norm interpolator yields lower risk than the target-only min--norm interpolator, i.e., when data pooling leads to positive or negative transfer:
Proposition 3.3 (Impact of model shift).
Under the same setting as Theorem 3.1, and assuming that are bounded, define the signal-to-noise ratio and the shift-to-signal ratio . Then the following statements hold with high probability.
-
(i)
If , then
(3.6) - (ii)
Proof.
The implications of Proposition 3.3 are as follows:
-
(i)
When the SNR is small, i.e. , adding more data always hurts the pooled min--norm interpolator. Note that, even if the source and target populations have the exact same model settings, pooling the data induces higher risk for the interpolator.
-
(ii)
If the SNR is large, whether pooling the data helps depends on the level of model shift. Namely, if the SSR exceeds the threshold defined by (3.7), the models are far apart, resulting in negative transfer. Otherwise, pooling the datasets helps. In particular, if , then for any values of , data pooling harms.
- (iii)
In sum, in the over-parametrized regime, positive transfer happens when the model shift is low and the SNR is sufficiently high. Figure 1 illustrates our findings in a simulation setting.
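This dichotomy can also be probed empirically. The sketch below (illustrative parameters, isotropic Gaussian designs, common noise level; it does not use the closed-form expressions of Theorem 3.1) compares the empirical risks of the pooled and target-only min--norm interpolators as the shift magnitude, and hence the SSR, grows.

```python
import numpy as np

rng = np.random.default_rng(4)
n_s, n_t, p, sigma, reps = 150, 50, 400, 0.5, 100
beta_t = rng.standard_normal(p)
beta_t /= np.linalg.norm(beta_t)                       # ||beta_t|| = 1, so SNR = 1 / sigma^2

def emp_risk(shift, pooled):
    """Empirical risk when the source signal is beta_t + delta with ||delta|| = shift."""
    delta = rng.standard_normal(p)
    delta *= shift / np.linalg.norm(delta)
    errs = []
    for _ in range(reps):
        X_s, X_t = rng.standard_normal((n_s, p)), rng.standard_normal((n_t, p))
        y_s = X_s @ (beta_t + delta) + sigma * rng.standard_normal(n_s)
        y_t = X_t @ beta_t + sigma * rng.standard_normal(n_t)
        X, y = (np.vstack([X_s, X_t]), np.concatenate([y_s, y_t])) if pooled else (X_t, y_t)
        b = X.T @ np.linalg.solve(X @ X.T, y)          # min-norm interpolator
        errs.append(np.sum((b - beta_t) ** 2))         # risk, since Sigma_t = I
    return np.mean(errs)

for shift in [0.0, 0.5, 1.0, 2.0]:                     # SSR = shift^2 here
    print(shift, emp_risk(shift, pooled=True), emp_risk(shift, pooled=False))
```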
3.3 Choice of interpolator
Our findings in the previous section highlight that, to achieve the least possible risk for the pooled min--norm interpolator, one should neither always pool both datasets nor always use all samples from either dataset. In fact, if the SNR and SSR were known, it would be possible to identify subsets of the source and target datasets, having sample sizes respectively, such that the risk of the pooled min--norm interpolator based on the total data of size is minimized. To this end, the purpose of this section is two-fold: first, to provide an estimation strategy for the SNR and SSR, and second, to invoke these estimators and provide a guideline for choosing the appropriate sample sizes.
For SNR and SSR estimation, we employ a simplified version of the recently introduced HEDE method [50]. Let denote the following debiased Lasso estimator [32, 12]:
that is based on the Lasso estimator for some fixed :
and let
be the corresponding estimated variance for . Then by Theorem 4.1 and Proposition 4.2 in [50], we know
where the last line invokes independence between the datasets. Hence, we obtain the following consistent estimators for SNR and SSR:
(3.9) |
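As a purely illustrative alternative (not the HEDE/debiased-Lasso construction of (3.9) used in the paper), the sketch below estimates the same target quantities with a simple method-of-moments approach that is valid for isotropic Gaussian designs; the estimator names and constants are ours, and the point estimates are noisy in moderate dimensions.

```python
import numpy as np

def norms_and_noise(X, y):
    """Moment-based estimates of (||beta||^2, sigma^2) for y = X beta + eps with isotropic
    Gaussian rows, using E||y||^2 / n = ||beta||^2 + sigma^2 and
    E||X^T y||^2 / n = (n + p + 1) ||beta||^2 + p sigma^2."""
    n, p = X.shape
    a = y @ y / n
    b = np.sum((X.T @ y) ** 2) / n
    beta_sq = (b - p * a) / (n + 1)
    return beta_sq, a - beta_sq

rng = np.random.default_rng(5)
n_s, n_t, p, sigma = 600, 300, 1000, 1.0
beta_t = 2.0 * rng.standard_normal(p) / np.sqrt(p)             # ||beta_t||^2 close to 4
delta = rng.standard_normal(p) / np.sqrt(p)                    # ||delta||^2 close to 1
X_s, X_t = rng.standard_normal((n_s, p)), rng.standard_normal((n_t, p))
y_s = X_s @ (beta_t + delta) + sigma * rng.standard_normal(n_s)
y_t = X_t @ beta_t + sigma * rng.standard_normal(n_t)

bs_sq, _ = norms_and_noise(X_s, y_s)
bt_sq, sig_sq = norms_and_noise(X_t, y_t)
cross = (X_s.T @ y_s) @ (X_t.T @ y_t) / (n_s * n_t)            # estimates <beta_s, beta_t>
snr_hat = bt_sq / sig_sq                                       # estimates ||beta_t||^2 / sigma^2
ssr_hat = (bs_sq + bt_sq - 2 * cross) / bt_sq                  # estimates ||delta||^2 / ||beta_t||^2
print(snr_hat, ssr_hat)                                        # noisy estimates of SNR ~ 4, SSR ~ 0.25
```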
Given the estimators of SNR and SSR, denoted by and respectively, an important question arises: what is the optimal size of the target dataset for which the pooled min--norm interpolator achieves the least mean-squared prediction error? This is answered by the following result:
Corollary 3.4.
Assume the conditions of Theorem 3.1 hold. Denote by the pooled min--norm interpolator based on samples from the source and samples from the target, where . Then we have
(3.10) |
where if and otherwise.
Proof.
This result validates our intuition for the problem: if the SNR is low or the SSR is large, a large amount of target data is not necessary for achieving good generalization. In fact, given the estimators , , a data-driven estimate of the optimal target data size is given by
3.4 Extension to multiple source distributions
Our results for model shift, i.e., Theorem 3.1, can be extended to the general framework of multiple source datasets and one target dataset. Denote the target as and the sources as , and further denote the signal shift vectors as . We assume we are still in the overparametrized regime , and consider the pooled min--norm interpolator
(3.11) |
Then the following result characterizes the bias and variance of the pooled min--norm interpolator under this setting:
Proposition 3.5.
Suppose Assumption 1 holds (adjusted for multiple source datasets and a target dataset ), and further assume that and are Gaussian random matrices. Then, with high probability over the randomness of , we have
(3.12) | ||||
(3.13) | ||||
(3.14) |
3.5 Numerical examples under model shift
In this Section, we demonstrate the applicability of our results for moderate sample sizes, as illustrated in Figure 1. Specifically, we explore the generalization error of the pooled min--norm interpolator in a scenario where both the source and target data follow isotropic Gaussian distributions. The signals are generated as follows: and (and thereafter considered fixed). The signal shift is thus given by . We display the generalization error across varied combinations of source and target sample sizes (denoted by and respectively), number of covariates (), signal-to-noise ratios (SNRs), and shift-to-signal ratios (SSRs). For completeness, we also present plots of the generalization error for the underparameterized regime, where , a regime previously explored in the literature [57].
Figure 1 shows that the empirical generalization errors of the pooled min--norm interpolator closely match the theoretical predictions from Theorem 3.1 in moderate dimensions. Additionally, we present the generalization error of the target-only interpolator, as established by [29]. The risk of the target-only interpolator matches that of our pooled interpolator when , validating our results.
Considering the OLS estimator’s generalization error for , the concatenated error curve reveals a double descent phenomenon. For smaller sample sizes from the source distribution, the risk of the pooled min--norm interpolator is a decreasing function of , rendering the pooled interpolator superior to the target-only interpolator. However, if the SSR value is too high, the first stage of descent is missing, rendering the pooled min--norm interpolator inferior to its target-only counterpart. Figures 1 (b) and (c) display this dependence on the SNR and SSR values. Namely, higher signal strength and smaller model shift magnitudes lead to a wider range of for which the pooled interpolator is better. Figure 1 (a) further demonstrates the intricate dependence of this dichotomy on and , indicating that the pooled interpolator performs particularly well when are much smaller than , which is commonplace in most modern datasets. These findings reinforce and complement the conclusions from Proposition 3.3, providing a more complete picture regarding the positive and negative effects of transfer learning for the pooled min--norm interpolator under overparametrization.
[Figure 1: Empirical versus theoretical generalization error of the pooled min--norm interpolator under model shift; panel (a) varies the sample sizes and dimension, panel (b) varies the SNR, and panel (c) varies the SSR.]
4 Covariate Shift
In this section, we characterize the bias and variance of the pooled min--norm interpolator under covariate shift. We assume the underlying signal stays the same for both datasets, whereas the population covariance matrix changes from to . Under this condition, in (2.14). We then compare the remaining bias term and the variance term of the pooled min--norm interpolator with the target-based interpolator and investigate conditions that determine achievability of positive/negative transfer.
4.1 Risk under simultaneously diagonalizable covariances
We assume that the source and target covariance matrices are simultaneously diagonalizable, as defined below:
Definition 4.1 (Simultaneously diagonalizable).
For two symmetric matrices , we say they are simultaneously diagonalizable if there exists an orthogonal matrix and two diagonal matrices such that and . Under this condition, the generalization error depends on the joint empirical distribution as well as the geometry between and . We encode these via the following probability distributions:
(4.1) | ||||
where, for , and is the -th eigenvector of (the -th column of ).
The simultaneous diagonalizability assumption has previously been used in the transfer learning literature (cf. [42]). It ensures that the difference in the covariate distributions of the source and the target is solely captured by the eigenvalues . Characterizing the generalization error when covariance matrices lack a common orthonormal basis appears to be fairly technical, and we defer this to future work.
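For concreteness, the sketch below builds a pair of simultaneously diagonalizable covariances and the empirical quantities underlying (4.1); the variable names and the uniform spectra are illustrative, and the exact weighting in (4.1) may differ from this simplified encoding.

```python
import numpy as np

rng = np.random.default_rng(6)
p = 200
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))       # shared orthonormal eigenbasis
lam_s = rng.uniform(0.5, 2.0, size=p)                  # source spectrum
lam_t = rng.uniform(0.5, 2.0, size=p)                  # target spectrum
Sigma_s = Q @ np.diag(lam_s) @ Q.T
Sigma_t = Q @ np.diag(lam_t) @ Q.T
print(np.allclose(Sigma_s @ Sigma_t, Sigma_t @ Sigma_s))   # simultaneously diagonalizable => commute

beta = rng.standard_normal(p) / np.sqrt(p)
alignment = (Q.T @ beta) ** 2          # signal energy along each shared eigenvector
# The joint spectral information relevant to (4.1) is carried by the triples
# (lam_s[i], lam_t[i], alignment[i]), i = 1, ..., p.
```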
Note that in the typical overparameterized setup with a single dataset, the risk of the min--norm interpolator is expressed in terms of the empirical distribution of eigenvalues and the angle between the signal and the eigenvector of the covariance matrix (equation (9) in [29]). Consequently, it is natural to expect that the generalization error of our estimator will depend on and defined by (4.1), which indeed proves to be the case. The following presents our main result concerning covariate shift.
Theorem 4.1 (Risk under covariate shift).
Suppose Assumption 1 holds, and further assume that are simultaneously diagonalizable with joint spectral distributions defined in (4.1). Then, with high probability over the randomness of , we have
(4.2) | |||
(4.3) |
where is the unique solution, with positive, to the following system of equations:
(4.4) | ||||
and is the unique solution, with positive, to the following system of equations:
(4.5) | ||||
Although the two systems of equations above are closely related, we intentionally use different notations for clarity, as they are derived through distinct routes in the proof.
4.2 Performance Comparison Example: When does heterogeneity help?
In this section, we illustrate the impact of covariate shift on the performance of our pooled min--norm interpolator. Since the closed-form expressions for the bias and variance of are complicated (see (4.2), (4.3)), we exhibit the impact through stylized examples. Specifically, we consider the setting where the target covariance and the source covariance has pairs of reciprocal eigenvalues: , where denotes the collection of all diagonal matrices with pairs of reciprocal eigenvalues. To elaborate, suppose that is an even integer. In this case, is diagonal, with for . Note that . This setting leads to a simplification in the bias and the variance, and allows us to study the effects of covariate shift.
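A tiny sketch of this covariance class (the helper name is ours): eigenvalues come in reciprocal pairs, so the determinant is fixed at one while the heterogeneity parameter varies.

```python
import numpy as np

def reciprocal_cov(p, theta):
    """Diagonal source covariance with p/2 eigenvalues theta and p/2 eigenvalues 1/theta."""
    return np.diag(np.concatenate([np.full(p // 2, theta), np.full(p // 2, 1.0 / theta)]))

p = 6
for theta in [1.0, 2.0, 5.0]:
    print(theta, np.linalg.det(reciprocal_cov(p, theta)))   # determinant 1 for every theta; theta = 1 gives the identity
```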
We will analyze the risk of the pooled min--norm interpolator , defined by (2.11), under this covariate shift model. To emphasize that the difference in the source and target feature covariances is captured by the matrix , we denote this risk as through the rest of the section.
Our first result studies the following question: under what conditions does heterogeneity reduce the risk for the pooled estimator? To this end, we compare , with , i.e. the risk under covariate shift with the one without covariate shift.
Proposition 4.2 (Dependence on sample size).
Under the conditions of Theorem 4.1, and additionally assuming , the following holds with high probability:
-
(i)
When , for any .
-
(ii)
When , for any .
Proposition 4.2 is proved in Section D.4. Its implications are as follows: in the overparametrized regime, heterogeneity leads to lower risk for the pooled min--norm interpolator if and only if the sample size from the source data is small relative to the dimension, specifically, smaller than . Note that since we consider the overparametrized regime, we always have ; in other words, . Thus, Proposition 4.2 gives an exhaustive characterization of when heterogeneity helps in terms of the relations between . More generally, an additional question may be posed: how does the degree of heterogeneity affect the risk of the pooled min--norm interpolator? Investigating this is challenging even for the structured class of covariance matrices . Therefore, we consider the case when has only two eigenvalues: for some . We denote this covariance matrix as . Note that and here denotes the degree of heterogeneity. In this special setting, we can establish the following.
Proposition 4.3 (Dependence on the degree of heterogeneity).
Under the conditions of Theorem 4.1, and additionally assuming , the following holds with high probability:
-
(i)
When , for any .
-
(ii)
When , for any .
-
(iii)
If , then does not depend on .
The key takeaways from Proposition 4.3 can be summarized as follows:
-
(i)
If the size of the source dataset is small, then the risk is a decreasing function of , i.e., more discrepancy between the source and target datasets helps achieve lower risk for the pooled min--norm interpolator.
-
(ii)
When the size of the source dataset is large, larger heterogeneity hurts the pooled min--norm interpolator.
Lastly, similar to Section 3.2, we compare the pooled min--norm interpolator with the target-based interpolator (3.4). Note that the risk of the latter under similar settings has been studied in Theorem 2 of [29], which we cite below:
Proposition 4.4.
We remark that, using proof techniques similar to those of Theorem 4.1, the additional bounded moments condition in Proposition 4.4 can be relaxed to (2.3); we omit the details. Unfortunately, although the precise risks of both interpolators are available from Theorem 4.1 and Proposition 4.4, one cannot obtain a simple comparison even for due to the complexity of the expressions. Thus we make the comparison by solving the expressions numerically in the next section.
4.3 Numerical examples under covariate shift
Similar to Section 3.5, we illustrate the applicability of our results in finite samples and compare the performance of the pooled min--norm interpolator and its target-based counterpart. We pick the special setting of Proposition 4.3 where the target distribution has covariance , while the source distribution has a covariance matrix with two distinct eigenvalues and , each having multiplicity . The choice of covariance matrices ensures that both have equal determinants. In Figure 2, we illustrate how the generalization error varies with , the SNR, and the covariate shift magnitude . Again, we include the generalization error of the OLS estimator in the underparameterized regime for completeness.
The key insights from the plots are summarized as follows. Figure 2 (a) illustrates the generalization error of the pooled min--norm interpolator, showcasing the double descent phenomenon across varying . The generalization error of the target-based interpolator is represented by dotted horizontal lines. We observe that in the overparameterized regime, if is small, the pooled estimator consistently outperforms the target-based estimator. Figure 2 (b) presents our findings from Proposition 4.3. For small values of , the generalization error decreases with the degree of heterogeneity, i.e., the value of . However, for , higher values of result in higher generalization error. A notable scenario occurs at , where the generalization error becomes independent of , causing all curves to intersect at a single point. Finally, Figure 2 (c) demonstrates the dependence of the generalization error on the SNR. Higher SNR values lead to significantly higher error when is small, but the curves tend to converge at the interpolation threshold, i.e., when the total sample size matches the number of covariates.
[Figure 2: Generalization error of the pooled min--norm interpolator under covariate shift; panel (a) shows double descent across sample sizes (with the target-based interpolator as dotted horizontal lines), panel (b) varies the degree of heterogeneity, and panel (c) varies the SNR.]
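A sketch of this experiment (illustrative sample sizes and noise level, not the exact configuration of Figure 2): empirical risks of the pooled and target-only min--norm interpolators under the two-eigenvalue source covariance, as the heterogeneity parameter varies.

```python
import numpy as np

rng = np.random.default_rng(7)
n_s, n_t, p, sigma, reps = 100, 50, 300, 0.5, 100
beta = rng.standard_normal(p) / np.sqrt(p)              # shared signal: no model shift

def min_norm(X, y):
    return X.T @ np.linalg.solve(X @ X.T, y)

def emp_risks(theta):
    scale = np.sqrt(np.concatenate([np.full(p // 2, theta), np.full(p // 2, 1.0 / theta)]))
    pool, tgt = [], []
    for _ in range(reps):
        X_s = rng.standard_normal((n_s, p)) * scale     # rows ~ N(0, Sigma_s(theta))
        X_t = rng.standard_normal((n_t, p))             # rows ~ N(0, I) = Sigma_t
        y_s = X_s @ beta + sigma * rng.standard_normal(n_s)
        y_t = X_t @ beta + sigma * rng.standard_normal(n_t)
        b_pool = min_norm(np.vstack([X_s, X_t]), np.concatenate([y_s, y_t]))
        pool.append(np.sum((b_pool - beta) ** 2))       # Sigma_t = I, so this is the risk
        tgt.append(np.sum((min_norm(X_t, y_t) - beta) ** 2))
    return np.mean(pool), np.mean(tgt)

for theta in [1.0, 2.0, 4.0]:
    print(theta, emp_risks(theta))
```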
5 Proof Outline
In this section, we outline the key ideas behind the proofs of our main results in Sections 3 and 4, highlighting crucial auxiliary results along the way. All convergence statements will be presented in an approximate sense, with comprehensive proof details provided in the Appendix.
To circumvent the difficulty of dealing with the pseudo-inverse formula in (2.8), we utilize the connection between the pooled min--norm interpolators and ridgeless least squares interpolation. This says that to study the interpolator, it suffices to study the following ridge estimator
(5.1) |
and then study its limit. Owing to the decomposition (2.12), we analyze the bias and the variance terms separately. With this in mind, our proof consists of the following four steps:
- (i)
-
(ii)
Show that, with high probability, the difference between the (random) risks of the min-norm interpolator and the ridge estimator vanishes as , i.e., .
-
(iii)
Analyze the limit of the (deterministic) risk of , i.e., .
-
(iv)
Find a suitable relation between and such that the aforementioned three convergence relationships are simultaneously satisfied.
The most non-trivial part of our analysis is to prove the convergence in (i) and obtain the precise expression for . For (ii), one can rewrite the difference in terms of the eigenvalues of certain symmetric random matrices such as . Next, using high-probability bounds on the largest and smallest eigenvalues (see Lemma B.3 and Corollary B.4), we establish the high-probability convergence in (ii) above. Part (iii) involves only deterministic quantities and therefore requires only classical analysis techniques. Finally, step (iv) reduces to optimization problems over .
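Step (ii) can be visualized with a small numerical sketch (a single stacked design, a common noise level, and $\Sigma_t = I$ are assumed; dimensions are illustrative): the exact conditional risk of the pooled ridge estimator approaches that of the pooled min--norm interpolator as the penalty vanishes.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, sigma = 80, 200, 1.0
X = rng.standard_normal((n, p))                # pooled (stacked) design, n < p
beta = rng.standard_normal(p) / np.sqrt(p)

def risk(A):
    """Exact conditional risk of a linear estimator beta_hat = A y when Sigma_t = I."""
    bias = A @ X @ beta - beta                 # E[beta_hat | X] - beta
    return np.sum(bias ** 2) + sigma ** 2 * np.sum(A ** 2)

A_mn = np.linalg.pinv(X.T @ X) @ X.T           # min-norm interpolator
for lam in [1.0, 1e-2, 1e-4, 1e-6]:
    A_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    print(lam, abs(risk(A_ridge) - risk(A_mn)))    # the gap shrinks with the penalty
```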
The rest of this section is devoted to outlining the proof for part (i). These steps for model shift and covariate shift are detailed in Appendix C.2 and Appendix D.3, respectively.
5.1 Model Shift
For model shift, after decomposing into bias and variance terms, the one term that requires novel treatment is (see Appendix C.1)
where are model-dependent constants, and for . Established results (see Lemma C.1) guarantee that
We know are Wishart matrices whose spectral distribution converges to the Marčenko-Pastur Law, from which we also have control over the spectral distribution of and . Moreover, since the previous two quantities are both rotationally invariant and asymptotically free, we can effectively handle the limiting spectral distribution of their sum with free addition. Further algebra and convergence analysis yield the limit of .
The outline above also illustrates the necessity of certain assumptions in Theorem 3.1. Because we use free addition techniques, we require both the Gaussianity assumption on for rotational invariance and the isotropic assumption for free independence. Further elaboration can be found in Appendix C. We hypothesize that (3.3) remains applicable for non-Gaussian matrices , and that a comprehensive formula exists for general . One promising avenue to explore this is through the lens of linear pencil techniques [30], which we defer to future investigations.
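The Marchenko–Pastur input to this argument is easy to check numerically. The sketch below (illustrative dimensions) compares the empirical distribution of the nonzero eigenvalues of a Wishart-type matrix with the numerically integrated Marchenko–Pastur CDF.

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 1000, 2500                          # aspect ratio gamma = n / p < 1
Z = rng.standard_normal((n, p))
evals = np.linalg.eigvalsh(Z @ Z.T / p)    # the n nonzero eigenvalues of the sample covariance

gamma = n / p
a, b = (1 - np.sqrt(gamma)) ** 2, (1 + np.sqrt(gamma)) ** 2
grid = np.linspace(a, b, 2001)
dens = np.sqrt(np.clip((b - grid) * (grid - a), 0.0, None)) / (2 * np.pi * gamma * grid)
mp_cdf = np.cumsum(dens) * (grid[1] - grid[0])         # Marchenko-Pastur CDF on [a, b]

for x in [0.5, 1.0, 1.5, 2.0]:
    print(x, np.mean(evals <= x), np.interp(x, grid, mp_cdf))   # close agreement
```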
5.2 Covariate Shift
For covariate shift, the signal shift vanishes, eliminating the need to handle quadratic terms via free addition. One can therefore relax the Gaussianity assumption to the one in Theorem 4.1. However, new challenges arise from the anisotropic covariances . For conciseness, we illustrate the representative variance term, which can be written as (see (D.2) in Section 4)
(5.2) | ||||
where
This simplification highlights the necessity of the simultaneously diagonalizable assumption for . The core challenge, therefore, reduces to characterizing the behavior of under the setting of Theorem 4.1. In other words, we aim to characterize the resolvent of , a sum of two sample-covariance-like matrices with anisotropic covariances. To this end, we present Theorem D.1 (omitted here for the sake of space), a novel type of anisotropic local law [33], which might be of independent interest. Convergence of the trace term in (5.2) follows as a corollary of our anisotropic local law, and further analysis with respect to the derivative yields the value of .
6 Discussion and Future directions
In this paper, we characterize the generalization error of the pooled min--norm interpolator under overparametrization and distribution shift. Our study underscores how the degree of model shift or covariate shift influences the risk associated with the interpolator. Mathematically, we derive a novel anisotropic local law, which may be of independent interest in random matrix theory. We conclude by identifying several avenues for future research and discussing some limitations of our current findings:
-
(i)
Expanding to Multiple Datasets: In Section 3.4, we established the risk of the pooled min--norm interpolator with multiple source data under model shift. Extending our results under covariate shift to the case of multiple source data is equally important. We anticipate that this would be feasible by extending our approach to incorporate more general anisotropic local laws. However, we defer this study to future work in the interest of space.
-
(ii)
Generalizing Assumptions and Combined Shift: For our model shift results in Section 3, we imposed an isotropic Gaussian assumption on the covariate distribution. Extending this to cover anisotropic cases would be important. This would require significant new ideas for handling products of random matrices that are not necessarily asymptotically free. We expect such extensions may also enable understanding the pooled min--norm interpolator under simultaneous model and covariate shift, which is common in real-life transfer learning settings and is partially explored in [57] in the underparametrized regime.
-
(iii)
Comparison with Other Estimators: Our pooled min--norm interpolator covers early and intermediate min--norm interpolation under a single umbrella thanks to the alternative representation (2.10). However, a more complete picture of interpolation under transfer learning and overparametrization necessitates comprehensive comparisons with the following types of estimators: 1) late-fused min--norm interpolators, which combine min--norm interpolators obtained individually from different datasets; 2) other types of interpolators, including min--norm ones [38]; 3) other estimators in overparametrized regimes, including penalized ones, beyond the ridge estimators covered in Appendix A.
-
(iv)
Extension to Nonlinear Models: Recent literature has explored the connection between linear models and more complex models such as neural networks [1, 20, 31, 43]. To be precise, for i.i.d. data , , , . Consider learning a neural network with weights , , , where the form of need not be related to the distribution of . In some settings, the number of parameters is so large that training effectively changes the weights only by a small amount relative to a random initialization . Then, given such that , one obtains that the neural network can be approximated by , where we have reparametrized . Such models, despite being nonlinear in s, are linear in . Moving beyond high-dimensional linear regression, one might ask about the statistical consequences of such linearization in the context of transfer learning. We leave such directions for future research.
Acknowledgements
P.S. would like to acknowledge partial funding from NSF DMS-.
References
- Allen-Zhu et al. [2019] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International conference on machine learning, pages 242–252. PMLR, 2019.
- Bai and Silverstein [2010] Zhidong Bai and Jack W Silverstein. Spectral analysis of large dimensional random matrices, volume 20. Springer, 2010.
- Baltrušaitis et al. [2018] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence, 41(2):423–443, 2018.
- Bao et al. [2020] Zhigang Bao, László Erdős, and Kevin Schnelli. Spectral rigidity for addition of random matrices at the regular edge. Journal of Functional Analysis, 279(7):108639, 2020.
- Bartlett et al. [2020] Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020.
- Bastani [2021] Hamsa Bastani. Predicting with proxies: Transfer learning in high dimension. Management Science, 67(5):2964–2984, 2021.
- Belinschi and Bercovici [2007] Serban T Belinschi and Hari Bercovici. A new approach to subordination results in free probability. Journal d’Analyse Mathématique, 101(1):357–365, 2007.
- Belkin et al. [2018] Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning, pages 541–549. PMLR, 2018.
- Belkin et al. [2019a] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019a.
- Belkin et al. [2019b] Mikhail Belkin, Alexander Rakhlin, and Alexandre B Tsybakov. Does data interpolation contradict statistical optimality? In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1611–1619. PMLR, 2019b.
- Belkin et al. [2020] Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features. SIAM Journal on Mathematics of Data Science, 2(4):1167–1180, 2020.
- Bellec and Zhang [2022] Pierre C Bellec and Cun-Hui Zhang. De-biasing the lasso with degrees-of-freedom adjustment. Bernoulli, 28(2):713–743, 2022.
- Bloemendal et al. [2014] Alex Bloemendal, László Erdos, Antti Knowles, Horng-Tzer Yau, and Jun Yin. Isotropic local laws for sample covariance and generalized wigner matrices. Electron. J. Probab, 19(33):1–53, 2014.
- Brailovskaya and van Handel [2022] Tatiana Brailovskaya and Ramon van Handel. Universality and sharp matrix concentration inequalities. arXiv preprint arXiv:2201.05142, 2022.
- Bunea et al. [2020] Florentina Bunea, Seth Strimas-Mackey, and Marten Wegkamp. Interpolation under latent factor regression models. arXiv preprint arXiv:2002.02525, 2(7), 2020.
- Cheng et al. [2022] Jiahui Cheng, Minshuo Chen, Hao Liu, Tuo Zhao, and Wenjing Liao. High dimensional binary classification under label shift: Phase transition and regularization. arXiv preprint arXiv:2212.00700, 2022.
- Chistyakov and Götze [2011] Gennadii P Chistyakov and Friedrich Götze. The arithmetic of distributions in free probability theory. Central European Journal of Mathematics, 9:997–1050, 2011.
- Deng et al. [2022] Zeyu Deng, Abla Kammoun, and Christos Thrampoulidis. A model of double descent for high-dimensional binary linear classification. Information and Inference: A Journal of the IMA, 11(2):435–495, 2022.
- Ding and Yang [2018] Xiucai Ding and Fan Yang. A necessary and sufficient condition for edge universality at the largest singular values of covariance matrices. The Annals of Applied Probability, 28(3):1679–1738, 2018.
- Du et al. [2018] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
- Erdős et al. [2013a] László Erdős, Antti Knowles, and Horng-Tzer Yau. Averaging fluctuations in resolvents of random band matrices. Annales Henri Poincaré, 14(8):1837–1926, 2013a.
- Erdős et al. [2013b] László Erdős, Antti Knowles, Horng-Tzer Yau, and Jun Yin. Delocalization and diffusion profile for random band matrices. Communications in Mathematical Physics, 323:367–416, 2013b.
- Erdős et al. [2013c] László Erdős, Antti Knowles, Horng-Tzer Yau, and Jun Yin. The local semicircle law for a general class of random matrices. 2013c.
- Erdős et al. [2013d] László Erdős, Antti Knowles, Horng-Tzer Yau, and Jun Yin. Spectral statistics of Erdős–Rényi graphs I: Local semicircle law. The Annals of Probability, pages 2279–2375, 2013d.
- Ghorbani et al. [2021] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized two-layers neural networks in high dimension. The Annals of Statistics, 49(2):1029–1054, 2021.
- Gunasekar et al. [2018a] Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. In International Conference on Machine Learning, pages 1832–1841. PMLR, 2018a.
- Gunasekar et al. [2018b] Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. Advances in neural information processing systems, 31, 2018b.
- Hanneke and Kpotufe [2019] Steve Hanneke and Samory Kpotufe. On the value of target data in transfer learning. Advances in Neural Information Processing Systems, 32, 2019.
- Hastie et al. [2022] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. Annals of statistics, 50(2):949, 2022.
- Helton et al. [2018] J William Helton, Tobias Mai, and Roland Speicher. Applications of realizations (aka linearizations) to free probability. Journal of Functional Analysis, 274(1):1–79, 2018.
- Jacot et al. [2018] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
- Javanmard and Montanari [2014] Adel Javanmard and Andrea Montanari. Hypothesis testing in high-dimensional regression under the gaussian random design model: Asymptotic theory. IEEE Transactions on Information Theory, 60(10):6522–6554, 2014.
- Knowles and Yin [2017] Antti Knowles and Jun Yin. Anisotropic local laws for random matrices. Probability Theory and Related Fields, 169:257–352, 2017.
- Kpotufe and Martinet [2021] Samory Kpotufe and Guillame Martinet. Marginal singularity and the benefits of labels in covariate-shift. The Annals of Statistics, 49(6):3299–3323, 2021.
- Li et al. [2022] Sai Li, T Tony Cai, and Hongzhe Li. Transfer learning for high-dimensional linear regression: Prediction, estimation and minimax optimality. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(1):149–173, 2022.
- Li et al. [2023] Sai Li, T Tony Cai, and Hongzhe Li. Transfer learning in large-scale gaussian graphical models with false discovery rate control. Journal of the American Statistical Association, 118(543):2171–2183, 2023.
- Liang and Rakhlin [2020] Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel “ridgeless” regression can generalize. The Annals of Statistics, 48(3), 2020.
- Liang and Sur [2022] Tengyuan Liang and Pragya Sur. A precise high-dimensional asymptotic theory for boosting and minimum--norm interpolated classifiers. The Annals of Statistics, 50(3), 2022.
- Liang and Tran-Bach [2022] Tengyuan Liang and Hai Tran-Bach. Mehler’s formula, branching process, and compositional kernels of deep neural networks. Journal of the American Statistical Association, 117(539):1324–1337, 2022.
- Ma et al. [2023] Cong Ma, Reese Pathak, and Martin J Wainwright. Optimally tackling covariate shift in rkhs-based nonparametric regression. The Annals of Statistics, 51(2):738–761, 2023.
- Maity et al. [2022] Subha Maity, Yuekai Sun, and Moulinath Banerjee. Minimax optimal approaches to the label shift problem in non-parametric settings. The Journal of Machine Learning Research, 23(1):15698–15742, 2022.
- Mallinar et al. [2024] Neil Mallinar, Austin Zane, Spencer Frei, and Bin Yu. Minimum-norm interpolation under covariate shift. arXiv preprint arXiv:2404.00522, 2024.
- Mei et al. [2019] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. In Conference on Learning Theory, pages 2388–2464. PMLR, 2019.
- Montanari et al. [2019] Andrea Montanari, Feng Ruan, Youngtak Sohn, and Jun Yan. The generalization error of max-margin linear classifiers: Benign overfitting and high dimensional asymptotics in the overparametrized regime. arXiv preprint arXiv:1911.01544, 2019.
- Muthukumar et al. [2020] Vidya Muthukumar, Kailas Vodrahalli, Vignesh Subramanian, and Anant Sahai. Harmless interpolation of noisy data in regression. IEEE Journal on Selected Areas in Information Theory, 1(1):67–83, 2020.
- Parmaksiz and Handel [2024] Emin Parmaksiz and Ramon Van Handel. In preparation, 2024.
- Patil et al. [2024] Pratik Patil, Jin-Hong Du, and Ryan J Tibshirani. Optimal ridge regularization for out-of-distribution prediction. arXiv preprint arXiv:2404.01233, 2024.
- Pillai and Yin [2014] Natesh S Pillai and Jun Yin. Universality of covariance matrices. The Annals of Applied Probability, 24(3):935–1001, 2014.
- Sidheekh et al. [2024] Sahil Sidheekh, Pranuthi Tenali, Saurabh Mathur, Erik Blasch, Kristian Kersting, and Sriraam Natarajan. Credibility-aware multi-modal fusion using probabilistic circuits. arXiv preprint arXiv:2403.03281, 2024.
- Song et al. [2024] Yanke Song, Xihong Lin, and Pragya Sur. HEDE: Heritability estimation in high dimensions by ensembling debiased estimators. arXiv preprint arXiv:2406.11184, 2024.
- Soudry et al. [2018] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 19(70):1–57, 2018.
- Tian and Feng [2022] Ye Tian and Yang Feng. Transfer learning under high-dimensional generalized linear models. Journal of the American Statistical Association, pages 1–14, 2022.
- Torrey and Shavlik [2010] Lisa Torrey and Jude Shavlik. Transfer learning. In Handbook of research on machine learning applications and trends: algorithms, methods, and techniques, pages 242–264. IGI global, 2010.
- Tripuraneni et al. [2021] Nilesh Tripuraneni, Ben Adlam, and Jeffrey Pennington. Covariate shift in high-dimensional random feature regression. arXiv preprint arXiv:2111.08234, 2021.
- Tulino et al. [2004] Antonia M Tulino, Sergio Verdú, et al. Random matrix theory and wireless communications. Foundations and Trends® in Communications and Information Theory, 1(1):1–182, 2004.
- Wang et al. [2022] Guillaume Wang, Konstantin Donhauser, and Fanny Yang. Tight bounds for minimum -norm interpolation of noisy data. In International Conference on Artificial Intelligence and Statistics, pages 10572–10602. PMLR, 2022.
- Yang et al. [2020] Fan Yang, Hongyang R Zhang, Sen Wu, Weijie J Su, and Christopher Ré. Analysis of information transfer from heterogeneous sources via precise high-dimensional asymptotics. arXiv preprint arXiv:2010.11750, 2020.
- Zhang and Yu [2005] Tong Zhang and Bin Yu. Boosting with early stopping: Convergence and consistency. The Annals of Statistics, 33(4):1538, 2005.
- Zhao et al. [2022] Zhangchen Zhao, Lars G Fritsche, Jennifer A Smith, Bhramar Mukherjee, and Seunggeun Lee. The construction of cross-population polygenic risk scores using transfer learning. The American Journal of Human Genetics, 109(11):1998–2008, 2022.
- Zhou et al. [2022] Lijia Zhou, Frederic Koehler, Pragya Sur, Danica J Sutherland, and Nati Srebro. A non-asymptotic moreau envelope theory for high-dimensional generalized linear models. Advances in Neural Information Processing Systems, 35:21286–21299, 2022.
Appendix A Ridge Regression
As mentioned in Section 5, the risk of the pooled min--norm interpolator is closely related to that of the pooled ridge estimator in (5.1) with a fixed penalty . In this section, we obtain the exact generalization error of under the assumptions of Sections 3 and 4. These results are crucial intermediate steps in our proofs of the interpolator risk.
Similar to Lemma 2.1, we first state the bias-variance decomposition of the prediction risk:
In the subsections below, we present exact risk formulae under model shift and covariate shift, respectively.
A.1 Model Shift
We first define deterministic quantities that depend only on and , and then present our main results stating that these deterministic quantities are the limits of the bias and variance components of the prediction risk.
Definition A.1 (Risk limit, ridge).
Define as the unique solution of
(A.3) |
Next, define as the unique positive solution of
(A.4) |
with and , and define
(A.5) | ||||
(A.6) |
We then define the desired deterministic quantities:
(A.7) | ||||
(A.8) | ||||
(A.9) | ||||
(A.10) |
where the superscript denotes model shift.
Theorem A.2 (Risk limit under model shift, ridge).
Suppose Assumption 1 holds, and and are both Gaussian random matrices. Further let where is a small constant. Then, for any small constant , with high probability over the randomness of , we have
(A.11) | ||||
(A.12) | ||||
(A.13) | ||||
(A.14) |
A.2 Covariate Shift
Theorem A.3.
Suppose Assumption 1 holds, and further assume that are simultaneously diagonalizable with joint spectral distributions defined in (4.1). Let where is a small constant. Then, for any small constant , with high probability over the randomness of , we have
(A.15) | ||||
(A.16) |
where is the unique solution, with positive, to the following system of equations:
(A.17) | ||||
and is the unique solution, with positive, to the following system of equations:
(A.18) | ||||
Again, notice that above.
Appendix B Basic Tools
We first collect the notations that we will use throughout the proof. Since has comparable orders, we use as the fundamental large parameter. All constants depend only on the parameters introduced in Assumption 1. For any matrix , denotes its smallest eigenvalue, denotes its largest eigenvalue (or equivalently the operator norm), denotes its -th largest eigenvalue, and denotes its pseudoinverse. We say an event happens with high probability (w.h.p.) if as . We write if there exists a constant such that for large enough . For , we also write and if , and if and . Whenever we talk about constants, we refer to quantities that do not depend on or random quantities such as . They might depend on other constants such as in Assumption 1. We often use to denote constants whose specific value might change from case to case. We use check symbols (e.g. ) to denote temporary notations (whose definitions might also change) for derivation simplicity. We further make the following definitions:
Definition B.1 (Overwhelming Probability).
We say an event holds with overwhelming probability (w.o.p.) if for any constant , for large enough . Moreover, we say holds with overwhelming probability in another event if for any constant , for large enough .
Definition B.2 (Stochastic domination, first introduced in [21]).
Let and be two -dependent random variables. We say that is stochastically dominated by , denoted by or , if for any small constant and large constant ,
for large enough . That is, w.o.p. for any . If and are functions of supported in , then we say is stochastically dominated by uniformly in if
Note that since for any constant and we are including , we can safely ignore log terms. Also, for a random variable $X$ with finite moments up to any order, we have $X \prec 1$, since
$\mathbb{P}\big( |X| > n^{\epsilon} \big) \;\le\; n^{-\epsilon q}\, \mathbb{E}|X|^{q} \;\le\; C_q\, n^{-\epsilon q} \;\le\; n^{-D}$
by Markov's inequality, when $q$ is chosen large enough depending on $\epsilon$ and $D$.
The following lemma collects several algebraic properties of stochastic domination:
Lemma B.1 (Lemma 3.2 of [13]).
Let and be two families of nonnegative random variables depending on some parameters and . Let be an arbitrary constant.
-
(i)
Sum. Suppose uniformly in and . If , then uniformly in .
-
(ii)
Product. If and , then uniformly in .
-
(iii)
Expectation. Suppose that is a family of deterministic parameters, and satisfies . If uniformly in , then we also have uniformly in .
Definition B.3 (Bounded Support).
Let be a (-dependent) deterministic parameter. We say a random matrix satisfies the bounded support condition with (or has bounded support ) if
(B.1) |
As shown above, if the entries of have finite moments up to any order, then has bounded support . More generally, if every entry of has a finite -th moment as in (2.3) and , then using Markov’s inequality and a union bound we have
(B.2) | ||||
In other words, has bounded support with high probability.
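The Markov-plus-union-bound step referred to above can be displayed generically for an $n \times p$ random matrix $X$ with an entrywise moment bound of order $\phi$ (the notation here is only meant to exhibit the mechanism, not the paper's specific constants):
\[
\mathbb{P}\Bigl(\max_{i,j}\,|X_{ij}| \ge N^{\epsilon} q\Bigr)
\;\le\; \sum_{i,j}\mathbb{P}\bigl(|X_{ij}| \ge N^{\epsilon} q\bigr)
\;\le\; n\,p\,\bigl(N^{\epsilon} q\bigr)^{-\phi}\,\max_{i,j}\mathbb{E}|X_{ij}|^{\phi},
\]
and the right-hand side is $o(1)$ once $\phi$ is large enough relative to the choice of the support parameter $q$.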
The following lemma provides concentration bounds for linear and quadratic forms of random variables with bounded support.
Lemma B.2 (Lemma 3.8 of [24] and Theorem B.1 of [22]).
Let be families of centered independent random variables, and be families of deterministic complex numbers. Suppose the entries and have variance at most , and satisfy the bounded support condition (B.1) for a deterministic parameter . Then we have the following estimates:
(B.3) | |||
(B.4) | |||
(B.5) | |||
(B.6) |
where we denote and . Moreover, if and have finite moments up to any order, then we have the following stronger estimates:
(B.7) | |||
(B.8) | |||
(B.9) | |||
(B.10) |
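As a quick numerical sanity check of the scales appearing in Lemma B.2 (not used anywhere in the proofs), one can compare the typical size of a linear form and a centered quadratic form against $(\sum_i |b_i|^2)^{1/2}$ and $(\sum_{i,j} |a_{ij}|^2)^{1/2}$ for i.i.d. standard Gaussian entries; all sizes below are arbitrary.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
n = 2000
A = rng.normal(size=(n, n)) / n        # fixed coefficients a_{ij}
b = rng.normal(size=n) / n             # fixed coefficients b_i

lin, quad = [], []
for _ in range(200):
    x = rng.normal(size=n)             # centered, unit-variance entries
    lin.append(abs(b @ x))
    quad.append(abs(x @ A @ x - np.trace(A)))  # centered quadratic form
print("linear form:   ", np.mean(lin), "vs", np.sqrt(np.sum(b ** 2)))
print("quadratic form:", np.mean(quad), "vs", np.sqrt(np.sum(A ** 2)))
\end{verbatim}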
The following lemma deals with the empirical spectral distribution of for . Since is rank-deficient, its empirical spectral distribution has a peak at . However, its nonzero eigenvalues are the same as the eigenvalues of . Therefore, previous results for directly control the positive empirical spectral distribution of .
Lemma B.3.
Proof.
Using a standard cut-off argument, the above results can be extended to random matrices with the weaker bounded moment assumptions:
Corollary B.4.
Proof.
The proof follows [57, Corollary A.7] verbatim upon observing that , and . ∎
We further cite two additional corollaries:
Appendix C Proof for Model Shift
Throughout this section, we introduce the simplified notations
(C.1) |
Similar to Section D, we first prove the ridge case (Theorem A.2) and subsequently prove the min-norm interpolator case (Theorem 3.1).
C.1 Proof of Theorem A.2
We only prove (A.13) and (A.14) here, as (A.11) and (A.12) are direct corollaries of [29, Theorem 5]. We begin with the proof of (A.13); the proof of (A.14) is similar. The argument mostly follows [57, Section D.1], and we present the necessary modifications here.
Since , (A.2) becomes
where are defined by (C.1). Note that are both Wishart matrices, hence
(C.2) |
is rotationally invariant. Using this, we can simplify below:
Lemma C.1.
In the setting of Theorem A.2, we have
(C.3) |
Proof.
Since the matrix defined by (C.2) is rotationally invariant, we have
(C.4) |
where means equal in distribution, and is a random vector independent of with i.i.d. Gaussian entries of mean zero and variance one. We know has rank with probability , and from Lemma B.3 we know . Therefore,
Using (B.14), we further have
With Lemma C.1, we can write
(C.5) |
where
(C.6) |
For a symmetric matrix , let denote the empirical spectral distribution (ESD) of . Next, for any probability measure supported on , denote its Stieltjes transform as
(C.7) |
which, in the case of , reads
It is well known that the ESDs of converge to the Marchenko-Pastur distribution , whose Stieltjes transforms satisfy the self-consistent equations [2, Lemma 3.11]
Reformulating, we equivalently have , where is defined as
(C.8) |
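The Marchenko-Pastur approximation invoked here can also be visualized numerically. The sketch below (with arbitrary sizes, using the standard density formula for aspect ratio $\gamma = p/n \le 1$) compares the empirical spectral distribution of a sample covariance matrix with the Marchenko-Pastur density on its bulk support.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
n, p = 4000, 1600                       # hypothetical sizes, gamma = p/n = 0.4
gamma = p / n
X = rng.normal(size=(n, p))
evals = np.linalg.eigvalsh(X.T @ X / n)

a, b = (1 - np.sqrt(gamma)) ** 2, (1 + np.sqrt(gamma)) ** 2
grid = np.linspace(a + 1e-3, b - 1e-3, 200)
mp_density = np.sqrt((b - grid) * (grid - a)) / (2 * np.pi * gamma * grid)

hist, edges = np.histogram(evals, bins=60, range=(a, b), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print("max |ESD - MP| on the grid:",
      np.max(np.abs(hist - np.interp(centers, grid, mp_density))))
\end{verbatim}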
The convergence rate of was recently obtained in [48, Theorem 3.3]:
(C.9) |
where denotes the Kolmogorov distance between two probability measures:
For fixed , we know the ESDs of and converge weakly to two measures and (we suppress the dependence on since the context is clear), whose Stieltjes transforms are given by
(C.10) |
Since the eigenmatrices of and are independent Haar-distributed orthogonal matrices, the ESD of converges weakly to , where denotes the free addition of two probability measures. In particular, we have the following almost-sharp convergence rate:
Lemma C.2.
In the setting of Theorem A.2, suppose for a constant , and for a small constant , we have
(C.11) |
Proof.
Proof.
We calculate the Stieltjes transform of the free addition using the following lemma:
Lemma C.4 (Theorem 4.1 of [7] and Theorem 2.1 of [17]).
Given two probability measures on , there exist unique analytic functions , where is the complex upper half-plane, such that the following equations hold: for any ,
(C.13) |
Moreover, is the Stieltjes transform of , that is,
(C.14) |
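The role of the independent Haar eigenbases can also be checked empirically: adding an explicit Haar rotation to one of two independent Wishart summands barely changes the spectrum of the sum, consistent with the free-addition description above. A minimal numerical illustration (not part of the argument), with arbitrary sizes:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
p, n1, n2 = 300, 500, 800

def wishart(n):
    X = rng.normal(size=(n, p))
    return X.T @ X / n

W1, W2 = wishart(n1), wishart(n2)
Z = rng.normal(size=(p, p))
Q, R = np.linalg.qr(Z)
Q = Q * np.sign(np.diag(R))            # Haar-distributed orthogonal matrix

e_plain = np.sort(np.linalg.eigvalsh(W1 + W2))
e_haar = np.sort(np.linalg.eigvalsh(W1 + Q @ W2 @ Q.T))
print("max gap between sorted spectra:", np.max(np.abs(e_plain - e_haar)))
\end{verbatim}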
Now we plug in and when :
(C.15) |
For simplicity, we sometimes write for short when the context is clear. Using the definition of in (C.10), we know
(C.16) |
where is defined in (C.8). Applying (C.16) to the first equation of (C.15), we get
Plugging this equation into the second equation of (C.15), we get
(C.17) |
Now we define the following quantities at :
First, from (C.17) we obtain (A.6). Further, using the fact that and the first equation in (C.15), we get
which, when plugged into (C.17), yields (A.4). It is straightforward to check that it has a unique positive solution. Lastly, calculating the derivative of using its inverse function, we obtain that , which gives (A.5).
Proof of (A.13).
Recall the definition of in (C.6). From Lemma C.2, we have
Further, denoting , we have
which yields
Similarly,
Combining the three displays above and fixing , we obtain that
for any small . Plugging this into (C.5) and using (C.12), we obtain (A.13).
(A.14) can be proved using a similar procedure upon observing the critical fact that
and we omit the details. ∎
C.2 Proof of Theorem 3.1
We only prove (3.3) as (3.2) and (3.1) are direct corollaries of [29, Theorem 2]. First we show that and are close. Recall their definitions in (2.14) and (A.2), and note that . Recall the definitions of from (C.1). Denoting , we have
By Lemma B.3, we know . Moreover, denoting and letting its singular value decomposition be , we have
where the third inequality follows since is a sub-matrix of , and the last line follows by Lemma B.3. This also yields that , and, consequently,
(C.19) |
for any small . Next we show that in (A.9) is close to in (3.3). Consider equation (A.4) for (where we make the dependence on explicit). Define for ; then, after reformulating, is equivalently the solution to the following equation:
(C.20) |
If we extend the definition to , then a quick check confirms that is the unique positive solution to the equation above at , and that is a continuous function of for . Calculating the derivative of (C.20) w.r.t. , we get
Denote ; then we know , and . This yields . Moreover, as . By continuity, there exist constants such that whenever . Consequently, is positive and bounded above whenever . Now fixing , we have for some . In other words, there exists a constant such that for all ,
(C.21) |
Next, recall in (A.6). We define for and . Plugging in the respective definitions, we have
(C.22) |
for small enough, because of (C.21) and the fact that the denominator is bounded below. Similarly, recall in (A.5). We define for and . Plugging in the respective definitions, we have
(C.23) | ||||
for small enough, since and by (C.21), and other relevant terms are all bounded.
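Both this section and Appendix D pass from the ridge estimator to the min-norm interpolator by sending the penalty to zero. This reduction is easy to verify numerically in the overparametrized regime (a minimal sketch with arbitrary sizes; the dual form of the ridge solution is used for stability):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(6)
n, p = 150, 400                                   # overparametrized: p > n
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

beta_mn = np.linalg.pinv(X) @ y                   # min-norm interpolator
for lam in [1e-1, 1e-3, 1e-5, 1e-7]:
    beta_ridge = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)
    print(lam, np.linalg.norm(beta_ridge - beta_mn))
\end{verbatim}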
Appendix D Proof for Covariate Shift
We first prove the generalization error for ridge regression under covariate shift (Theorem A.3), and then obtain the generalization error for the min-norm interpolator by letting . Recall that the expressions for the bias and variance were given in (A.1) and (A.2). Note that here. Define
(D.1) |
Taking the antiderivative, we have
(D.2) | ||||
and
(D.3) | ||||
where for , and where satisfies Assumption 1(1). Here are defined in Definition 4.1.
D.1 Prototype Problem
Both the bias and variance terms above, (D.2) and (D.3), involve characterizing the asymptotic behavior of (i.e., the resolvent of at ), where is defined by (D.1).
A powerful tool for handling this type of problem is the anisotropic local law. [33] first introduced this technique and effectively characterized the behavior of for a wide range of , and [57] generalized the result to at around .
In this section, we state and prove a more general local law, which is one of the major technical contributions of this manuscript. This local law will serve as the basis for studying the limits of (D.2) and (D.3). The following subsections follow [57, Section C.2] closely.
D.1.1 Resolvent and Local Law
We can write where is given by
(D.4) |
Definition D.1 (Self-adjoint linearization and resolvent).
We define the following symmetric block matrix
(D.5) |
and its resolvent as
(D.6) |
as long as the inverse exists. We further define the following (weighted) partial traces
(D.7) | ||||
where , are index sets defined as
We will consistently use and . Correspondingly, the indices are labeled as
We further define the set of all indices as , and label indices as , etc.
Using the Schur complement formula, (D.6) becomes
(D.8) |
Notice that the upper-left block of is directly related to the resolvent of that we are interested in. However, rather than studying directly, we work with , as it is a linear function of and is therefore easier to handle.
We now define the limit of as
(D.9) |
where and are the unique solution to the following system of equations (we omit the dependence on for conciseness):
(D.10) | ||||
such that and whenever . The existence and uniqueness of the solution to the above system will be proved in Lemma D.2.
We now state our main result, Theorem D.1, which characterizes the convergence (rate) of to at . This theorem is a further extension in the family of anisotropic local laws [33]. Specifically, it extends [57, Theorem C.4] to allow a general control at . For a fixed small constant , we define a domain of as
(D.11) |
Theorem D.1.
Suppose Assumption 1 holds, and further assume that are simultaneously diagonalizable with joint spectral distribution defined in (4.1). Suppose also that and satisfy the bounded support condition (B.1) with . Then the following local laws hold for with a small constant :
-
(i)
Averaged local law: We have
(D.12) -
(ii)
Anisotropic local law: For any deterministic unit vectors , we have
(D.13)
The rest of this section is devoted to the proof of Theorem D.1. We focus on the case of , since the case of follows trivially by dropping all negative powers of in the proof below.
D.1.2 Self-consistent equations
In this subsection, we show that the self-consistent equations (D.10) have a unique solution for any . For simplicity, we further define
(D.14) |
We first show that (D.10) has a unique solution at , where (D.10) reduces to the following system (equivalent to the first two equations of (A.17)):
(D.15) | ||||
Focusing on , consider any fixed for a small enough constant . We know
if is large enough,
if is small enough, and
for any . Together with the obvious continuity, we know, given , there exists a unique such that . Define the root as
By the implicit function theorem, we know the function is continuous, and
which, after straightforward rearranging, yields .
Now focus on . Since for any , similar to our derivation above, we have
for small enough and large enough, and . Therefore there exists a unique such that . Hence, there exists a unique pair of positive satisfying (D.15), such that
(D.16) |
Next, we prove the existence and uniqueness of the solution to (D.10) for a general , where is defined by (D.11), as presented in the following lemma:
Lemma D.2.
There exist constants such that the following statements hold. There exists a unique solution of (D.10) under the conditions
(D.17) |
Moreover, the solution satisfies
(D.18) |
Proof.
The proof is based on the contraction principle. Define the more general form of (D.15) as
(D.19) | ||||
Define the vector
Denote the (potential) solution to as . By definition, we know . Therefore, defining we have the following identity:
where is the Jacobian of at :
(D.20) | ||||
where for , denotes the partial derivative with respect to the -th argument, and
Notice that we omit the dependence on for simplicity, without risk of confusion, as only appear from now on. Motivated by this identity, we define the iterative sequence such that and
(D.21) | ||||
We want to show that is a contractive self-mapping. Let . We first deal with the first summand of (D.21). Since , we have
if is small enough. Further, a corollary of (D.16) yields
which, together with the equation above and the fact that , yields
Using similar derivations for , we obtain that for ,
(D.22) | ||||
Let denote the two eigenvalues of . Since is a matrix, it is easy to check that
At , we know
and . Therefore we obtain
Now consider a general . By (D.22), it is straightforward to derive that
Therefore we have
if is small enough, which yields
(D.23) |
Using similar derivations as (D.22), we obtain that for ,
(D.24) | ||||
which yields
(D.25) |
Now we deal with the second term of (D.21). Using (D.20) and derivations similar to (D.22), we know there exists a constant such that for , and
(D.26) | ||||
As a corollary, for satisfying , the difference of the second term of reads
(D.27) | ||||
Consequently, for , plugging in and to the equation above and combining (D.23), (D.25), we obtain that
if we first choose small enough such that and then choose small enough such that .
Therefore is a self-mapping on the ball . Moreover, from (D.21) and (D.27), we know
for the chosen. This means that restricted to is a contraction. By the contraction mapping theorem, exists and is the unique solution to the equation subject to the condition . (D.17) is then obtained by noticing that for all ,
(D.28) |
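The contraction argument above can be mirrored by a toy fixed-point iteration. The two-dimensional map below is a hypothetical stand-in for the iteration (D.21) (it is not the map used in the proof); it is included only to illustrate the geometric convergence to a unique fixed point that the contraction principle guarantees.
\begin{verbatim}
import numpy as np

def Phi(v, z=0.1 + 0.05j):
    # hypothetical contractive map on a small ball; NOT the map (D.21) itself
    a1, a2 = v
    return np.array([1.0 / (1.5 + 0.4 * a2 + z),
                     1.0 / (2.0 + 0.3 * a1 + z)])

v = np.array([0.5 + 0j, 0.5 + 0j])
for k in range(30):
    v_new = Phi(v)
    if k % 5 == 0:
        print(k, np.abs(v_new - v).max())   # errors shrink geometrically
    v = v_new
print("fixed point:", v)
\end{verbatim}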
As a by-product, we obtain the following stability result, which is useful for proving Theorem D.1. It roughly says that if the ’s on the right-hand sides of (D.19) are perturbed by small errors, the resulting solutions remain close to . We further make the following rescaling, which will prove handy in later derivations:
(D.29) |
where are defined by (D.14). Combining (D.29) with (D.16) yields
(D.30) |
for some (potentially different from before) constants .
Lemma D.3.
There exist constants such that the following statements hold. Suppose , and are analytic functions of such that
(D.31) |
Moreover, assume that satisfies the system of equations
(D.32) | ||||
for some (deterministic or random) errors such that . Then we have
D.1.3 Multivariate Gaussian Case
In this subsection we prove the result in the idealized setting where the entries of and are i.i.d. Gaussian. By the rotational invariance of the multivariate Gaussian distribution, we have
(D.34) |
Notice that since the entries of are i.i.d. Gaussian, they have bounded support by the remark after Definition B.3. Hence we can use the resolvent method (cf. [13]) to prove the following proposition.
Proposition D.4.
Proposition D.4 is a corollary of the following entrywise local law.
Lemma D.5 (Entrywise Local Law).
Proof of Proposition D.4.
In the remainder of this subsection we prove Lemma D.5, where the resolvent in Definition D.1 becomes
(D.36) |
Below we introduce resolvent minors to deal with the inverse using Schur’s complement formula.
Definition D.2 (Resolvent minors).
Given a matrix and , the minor of after removing the -th row and column is a matrix denoted by . We keep the names of indices when defining , i.e. for . Correspondingly, we define the resolvent minor of by
We further define the partial traces by replacing with in (D.7). For convenience, we adopt the notation if or .
The following formulae are identical to those in [57, Lemma C.10]. In fact they can all be proved using Schur’s complement formula.
Lemma D.6.
We have the following resolvent identities.
-
(i)
For and , we have
(D.37) -
(ii)
For , and , we have
(D.38) -
(iii)
For and , we have
(D.39)
The next lemma provides an a priori estimate on the resolvent for .
Lemma D.7.
Under the setting of Theorem D.1, there exists a constant such that with overwhelming probability the following estimates hold uniformly in :
(D.40) |
and
(D.41) |
Proof.
Now we are ready to prove Lemma D.5.
Proof of Lemma D.5.
In the setting of Lemma D.5, we have (D.34) and (D.36). Similar to [57], we divide our proof into four steps:
Step 1: Large deviation estimates. In this step, we provide large deviation estimates on the off-diagonal entries of . We introduce
, where denotes the partial expectation over the entries in the -th row and column of . Using (D.37), we have for ,
(D.44) | ||||
and for and ,
(D.45) |
We further introduce the random error
(D.46) |
which controls the size of the largest off-diagonal entries.
Lemma D.8.
In the setting of Proposition D.4, the following holds uniformly in :
(D.47) |
Proof.
Notice that for , and also satisfy the assumptions of Lemma D.7. Thus the estimates (D.40) and (D.41) also hold for with overwhelming probability. For any , since is independent of the entries in the -th row and column of , we can combine (B.8), (B.9) and (B.10) with (D.44) to get
where in the last step we used (D.40) to get that for , w.h.p.,
(D.48) |
The same bound for and can be obtained by combining (B.8), (B.9) and (B.10) with (D.45). This gives .
Step 2: Self-consistent equations. In this step we show that , defined in (D.7), satisfies the system of approximate self-consistent equations (D.32). By (D.18) and (D.29), we know the following estimates hold for :
Combining with (D.30), we know that uniformly in ,
(D.49) |
Further, combining (D.19) with (D.29), we get that uniformly in , for
(D.50) |
where we denote
(D.51) |
We will see later that are in fact the asymptotic limits of defined in (D.7). Applying (D.49) to (D.9) and using (D.29), we get that
(D.52) |
We now define the critical -dependent event:
(D.53) |
(D.49) yields that on ,
(D.54) |
We now state the following key lemma, which introduces the approximate self-consistent equations for on .
Lemma D.9.
Under the setting of Lemma D.5, the following estimates hold uniformly in for :
(D.55) | ||||
where we denote
(D.56) |
and
(D.57) | ||||
Proof.
Using (D.37),(D.44) and (D.45), we have
(D.58) | ||||
(D.59) | ||||
(D.60) |
where we denote (recall (D.7) and Definition D.2)
and
Using equations (D.39), (D.46) and (D.47), we know
(D.61) |
Similarly, we also have
(D.62) |
for . Combining (D.61) and (D.62) with (D.47), we have
(D.63) |
Now from equation (D.58), we know that on ,
(D.64) | ||||
where in the first step we use (D.63) and (D.54) on , and in the second step we use (D.56), (D.61) and (D.47). Plugging (D.64) into the definitions of in (D.7) and using (D.57), we get that on , for ,
(D.65) |
Comparing the equations above with (D.51) and applying (D.57) and (D.47), we get
(D.66) |
Together with equation (D.50), we have
(D.67) |
With a similar argument, from equations (D.59), we know that on ,
(D.68) |
where we used (D.61), (D.63) and (D.67). Taking the average of (D.68) over , we get that on ,
(D.69) |
which implies that on ,
(D.70) |
where we used (D.54) and (D.67). Finally, plugging (D.65) into (D.70), we get (D.55). ∎
Step 3: Entrywise Local Law. In this step, we show that the event in (D.53) actually holds with high probability, and then we apply Lemma D.9 to conclude the entrywise local law (D.35). We first claim that it suffices to show
(D.71) |
In fact, by (D.43) and arguments similar to those in the proof of Lemma D.7, we know that uniformly for , w.o.p.,
Thus, if (D.71) holds, we can obtain via the triangle inequality that
(D.72) |
which confirms that holds w.o.p., and also verifies the condition (D.31) of Lemma D.3. Now applying Lemma D.3 to (D.55), we get
which implies
(D.73) |
uniformly for , where we used (D.47). On the other hand, from equations (D.68) and (D.69), we know
which, combined with (D.73), yields
(D.74) |
Next, plugging (D.73) into (D.64) and recalling (D.9) and (D.29), we get that
Together with (D.74), we have the diagonal estimate
(D.75) |
which, combined with the previously established off-diagonal estimate, concludes the proof of the entrywise local law (D.35).
Thus we are left with proving (D.71). Using (D.43) and (D.40), we know for . Therefore we have
(D.76) |
Combining these estimates with (D.59), (D.60) and (D.63), we conclude that (D.69) holds at without requiring . This further gives that, w.o.p.,
Combining the above estimate with (D.58) and (D.63), we obtain that (D.65) also holds at without requiring . Finally, plugging (D.65) into (D.70), we conclude that (D.55) holds at .
Also, combining (D.69) at with (D.76), we know there exist constants such that
(D.77) |
Consequently, we know satisfies (D.15) and satisfies a similar system with right-hand sides replaced by . Denote for and defined in (D.15), and further let denote the Jacobian of . Similarly to the derivation of (D.23), because of (D.77) and (D.16), we know at both and . Hence we obtain that
thereby yielding (D.71).
Step 4: Averaged Local Law. Now we prove the averaged local laws (D.12). We have the following fluctuation averaging lemma:
Lemma D.10 (Fluctuation averaging).
Under the setting of Proposition D.4, suppose the entrywise local law holds uniformly in ; then we have
(D.78) |
uniformly in .
Proof.
The proof exactly follows [23, Theorem 4.7]. ∎
D.1.4 Non-Gaussian extension
D.2 Proof of Theorem A.3
First, using a standard truncation argument (cf. [57, proof of Theorem 3.1]), we know that Theorem D.1 holds for with high probability if now satisfy the bounded moment condition (2.3).
We first prove the variance term (A.15). Define the auxiliary variable
(D.80) | ||||
Since (D.19) reduces to the first two equations in (A.17) at , by (D.12) we have, with high probability,
where is defined in (A.17).
Similarly (but without randomness), we have
where satisfies the last two equations in (A.17). Combining the previous two displays and setting , we get
for any small constant , thereby obtaining (A.15).
D.3 Proof of Theorem 4.1
The proof involves the following three components.
-
(i)
First, we prove that the limiting risk of the ridge estimator is close to the limiting risk of the interpolator.
We first consider the limits of the variances, which are defined respectively as
(D.82) To this end, we first show that
(D.83) From (D.83), we know the first two equations of (A.17) become
(D.84) Let where . From (D.83) and (D.16), we know there exist constants such that . This yields . Also, similarly to how we derived (D.16), we can show that there exist constants such that
(D.85) . On the other hand, similarly to how we obtained (D.20) and (D.23), letting be the Jacobian of , we can show that for all . Together with the fact that from (4.4), we have established (D.83).
Now taking the derivative of (D.84) w.r.t. , we get
(D.86) where represents the respective derivatives w.r.t. . Since , we know , which means the RHS of (D.86) is . Thus the LHS is also , which yields . Now combining (D.86) and the last two equations of (4.4), we can use a similar analysis to conclude that
(D.87) Finally, from the second equation of (D.82) and the fact that , we know
which, when combined with (D.83) and (D.87), yields
(D.88) Using a similar analysis, we can also show that the bias limits, defined respectively as
(D.89) are close:
(D.90) -
(ii)
Next, we show that the ridge risk is close to the interpolator risk with high probability.
We begin by showing and are close, where the quantities are defined in (2.13) and (A.1) respectively. To this end, denote the singular value decomposition of by . Then the quantities (2.13) and (A.1) can be rewritten as
where is the diagonal matrix whose -th entry equals if and otherwise. Therefore, we have
(D.91) where denotes the minimum nonzero eigenvalue. Here, the second inequality follows from the fact that , the third inequality holds because for , and the final inequality is given by the following lemma.
Lemma D.11.
Consider the same setup as in Theorem 4.1. Recall . Let denote the smallest nonzero eigenvalue of ; then there exists a positive constant such that, with high probability,
(D.92) Proof.
We know where
and satisfies (2.3). From [14, Theorem 3.12], we know that with high probability,
(D.93) where and have i.i.d. standard Gaussian entries instead. Since we assumed and are simultaneously diagonalizable, we further have, by rotational invariance,
where has (blockwise) diagonal covariances. From [46], we obtain that with high probability, where is the solution to the following variational problem:
(D.94) where we recall that , and is defined by (4.1). The second equality above follows from the fact that the first line is maximized at , and . (A numerical sanity check of this smallest-eigenvalue bound is sketched after this list.)
Now we turn to showing that and are close, where the quantities are defined in (2.14) and (A.2) respectively. Since , we only need to compare and . Note that
where we defined . Also,
Therefore, we have
where the second inequality holds as and the last inequality follows from Lemma D.11. Moreover, since , we have
(D.95) - (iii)
D.4 Proof of Proposition 4.2
Assume throughout the example that . Here, we want to compare with . To this end, plugging in into Theorem 4.1 yields
(D.96) |
with high probability, where the first term is the variance and the second term is the bias.
Now consider any . Plugging in and combining (4.3) and the third equation of (4.5), we obtain that the bias is also . Therefore we are left with comparing the variance.
Plugging in and combining (4.2) and the third equation of (4.4), we obtain that the variance is . Moreover, the first and second equations of (4.4) yield . Therefore, using (D.96), the proof is complete upon showing if and only if .
Since , the second equation of (4.4) yields
(D.97) |
We know the right-hand side is a decreasing function of . Hence if and only if we have
Note that, by a straightforward calculation, , , and for all . Therefore, we have if and only if .
D.5 Proof of Proposition 4.3
Here, we want to study the effect of on the quantity .
To analyze the quantity , first note that, by the proof of Proposition 4.2, the bias stays identical for any and the variance is , where are defined by (4.4). Moreover, again by the proof of Proposition 4.2, one has . Hence, we are left with analyzing given by (D.97), which here simplifies to
If , the linear term in the above quadratic equation is positive, implying that is smaller for larger values of . Hence, is a decreasing function of , implying that decreases as increases.
Similarly, if , the linear term in the above quadratic equation is negative, and hence increases as increases.
Finally, if , the linear term vanishes, so that does not depend on .