
On the effect of noise on fitting linear regression models

Insha Ullah and A.H. Welsh  
Research School of Finance, Actuarial Studies and Statistics
Australian National University
The authors gratefully acknowledge funding support from Data61, CSIRO.
Abstract

In this study, we explore the effects of including noise predictors and noise observations when fitting linear regression models. We present empirical and theoretical results that show that double descent occurs in both cases, albeit with contradictory implications: the implication for noise predictors is that complex models are often better than simple ones, while the implication for noise observations is that simple models are often better than complex ones. We resolve this contradiction by showing that it is not the model complexity but rather the implicit shrinkage by the inclusion of noise in the model that drives the double descent. Specifically, we show how noise predictors or observations shrink the estimators of the regression coefficients and make the test error asymptote, and then how the asymptotes of the test error and the “condition number anomaly” ensure that double descent occurs. We also show that including noise observations in the model makes the (usually unbiased) ordinary least squares estimator biased and indicates that the ridge regression estimator may need a negative ridge parameter to avoid over-shrinkage.


Keywords: Bias-variance tradeoff; Overparameterization; Multiple descent; Negative ridge parameter; Regularization; Shrinkage; Sparsity.

1 Introduction

Statistical learning, particularly with large, high-dimensional datasets, currently rests on two opposing ideas, namely sparsity and overfitting. Sparsity is the idea that a small (unknown) number of predictors are related to the response, and the remaining predictors are irrelevant noise variables that should be identified and discarded. Overfitting is the idea that very complex models (i.e., models with a very large number of predictors and/or parameters) can perform better than simpler models in prediction and classification. In this paper, we explore the effect of overfitting linear regression models when sparsity holds.

We consider the prediction task in which we use predictors $\mathbf{x}'$ to predict the response or label $y'$ in an observation $(y',\mathbf{x}')$ drawn from an unknown distribution $\mathcal{D}$. Suppose we have an independent training dataset consisting of $n$ observations $(y_1,\mathbf{x}_1),\ldots,(y_n,\mathbf{x}_n)\in\mathbb{R}\times\mathbb{R}^d$ drawn independently from $\mathcal{D}$. We use the training data to fit a working linear regression model by calculating $\hat{\boldsymbol{\beta}}\in\mathbb{R}^d$, and then predict $y'$ by $\mathbf{x}'^T\hat{\boldsymbol{\beta}}$. We take $\hat{\boldsymbol{\beta}}$ to be the least squares estimator when $d<n$ (the underparameterized regime) and the minimum norm least squares estimator when $d\geq n$ (the overparameterized regime). Suppose we also have a test data set $(y'_1,\mathbf{x}'_1),\ldots,(y'_{n'},\mathbf{x}'_{n'})$ of size $n'$ drawn independently from $\mathcal{D}$ that is independent of the training data. Then we compare different models (e.g. using different elements of $\mathbf{x}'$) through their test error $n'^{-1}\sum_{i=1}^{n'}(y'_i-\mathbf{x}'^T_i\hat{\boldsymbol{\beta}})^2$ and base the prediction on the model with the minimum test error.
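
To fix ideas, the following minimal Python sketch (not part of the original analysis; the sizes, seed and coefficients are illustrative) fits the working model by (minimum norm) least squares via the pseudoinverse and computes the test error.

```python
# A minimal sketch of the prediction setup: np.linalg.pinv gives the OLS
# solution when d < n and the minimum norm least squares solution when d >= n,
# so the same line covers both regimes. All sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d, n_test, sigma = 50, 80, 1000, 0.5
beta = rng.normal(size=d)                      # illustrative "true" coefficients
X = rng.normal(size=(n, d))
y = X @ beta + sigma * rng.normal(size=n)

beta_hat = np.linalg.pinv(X) @ y               # (minimum norm) least squares fit

X_test = rng.normal(size=(n_test, d))
y_test = X_test @ beta + sigma * rng.normal(size=n_test)
test_error = np.mean((y_test - X_test @ beta_hat) ** 2)
print(f"test error: {test_error:.3f}")
```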

Under sparsity, only $d_0$ of the available predictors are related to the response, and the remaining variables are irrelevant noise variables. When $d_0\ll n$, the test error over the underparameterized regime ($d<n$) is typically roughly U-shaped; underfitting ($d<d_0$) by excluding variables related to the response increases the bias in the test error, while overfitting by including irrelevant noise variables in the model ($d>d_0$) increases the variance in the test error. Selecting the model with minimum test error is intended to produce a model of dimension $d=d_0$ that includes only the relevant predictors by balancing the effects of under- and over-fitting, referred to as the bias-variance trade-off. There are many different methods for model selection and a vast literature on these methods.

Recent work exploring increasing $d$ past the interpolation point ($d=n$) into the overparameterized regime ($d>n$) has found that the test error decreases again, exhibiting a second descent with increasing model complexity (Belkin et al., 2019; Geiger et al., 2019; Belkin et al., 2020; Bartlett et al., 2020; Kobak et al., 2020; Liang & Rakhlin, 2020; Liang et al., 2020; Muthukumar et al., 2020; Nakkiran et al., 2021; Zhang et al., 2021; Hastie et al., 2022; Deng et al., 2022; Rocks & Mehta, 2022). The second minimum (in the overparameterized regime) can be smaller than the first minimum (in the underparameterized regime), implying that large models with $d>n$ may give better predictions. This is called “double descent” by Belkin et al. (2019); “triple” (Nakkiran et al., 2020) and even “multiple descent” (Liang et al., 2020) have been demonstrated. Double descent also occurs in deep neural networks (Belkin et al., 2019; Spigler et al., 2019; Nakkiran et al., 2021) and random forests (Belkin et al., 2019).

Double descent directly challenges conventional statistical thinking by suggesting that fitting large models in the overparameterized regime ($d>n$) is better than trying to fit optimal sparse models in the underparameterized regime ($d<n$). Confusingly, double descent also occurs with increasing $n$ (e.g. Nakkiran et al., 2021). As this represents moving from the overparameterized to the underparameterized regime, it suggests the opposite conclusion, namely that for large datasets, fitting small models is better than fitting large models.

1.1 Our setting

This study aims to improve our understanding of sparsity and overfitting when fitting linear models by exploring, both empirically and theoretically, the effect of including either irrelevant predictors (increasing $d$) or irrelevant observations (increasing $n$). Specifically, suppose that we arrange the training data in a response vector $\mathbf{y}=[y_1,\ldots,y_{n_0}]^T\in\mathbb{R}^{n_0}$ and an $n_0\times d_0$ matrix of predictors $\mathbf{X}=[\mathbf{x}_1,\ldots,\mathbf{x}_{n_0}]^T$, such that the $\mathbf{x}_i$ are independent $N_{d_0}(\mathbf{0}_{d_0},\mathbf{I}_{d_0})$, where $\mathbf{0}_{d_0}$ is the $d_0$-vector of zeros and $\mathbf{I}_{d_0}$ is the $d_0\times d_0$ identity matrix, and $\mathbf{y}=\mathbf{X}\boldsymbol{\beta}_0+\mathbf{e}$ with $\mathbf{e}\sim N_{n_0}(\mathbf{0}_{n_0},\sigma^2\mathbf{I}_{n_0})$, where $\boldsymbol{\beta}_0\in\mathbb{R}^{d_0}$ is an unknown regression parameter and $\sigma^2>0$ is an unknown variance parameter. We consider two sequences of training sets $\mathcal{D}$:

I (Adding predictors; $d$ increases with $n=n_0$ fixed): for $d\leq d_0$, we set $\mathcal{D}^{(d)}=(\mathbf{y},\mathbf{X}^{(d)})$, where $\mathbf{X}^{(d)}$ contains the first $d$ columns of $\mathbf{X}$, and for $d>d_0$, we set $\mathcal{D}^{(d)}=(\mathbf{y},[\mathbf{X},\mathbf{Z}])$, where $\mathbf{Z}$ is an $n_0\times(d-d_0)$ matrix with independent $N_{d-d_0}(\mathbf{0}_{d-d_0},\mathbf{I}_{d-d_0})$ rows that are independent of $\mathbf{X}$ and $\mathbf{e}$;

II (Adding observations; $n$ increases with $d=d_0$ fixed): for $n\leq n_0$, we set $\mathcal{D}_{(n)}=(\mathbf{y}_{(n)},\mathbf{X}_{(n)})$, where $\mathbf{y}_{(n)}$ contains the first $n$ elements of $\mathbf{y}$ and $\mathbf{X}_{(n)}$ contains the first $n$ rows of $\mathbf{X}$, and for $n>n_0$, we set $\mathcal{D}_{(n)}=([\mathbf{y}^T,\mathbf{0}_{n-n_0}^T]^T,[\mathbf{X}^T,\mathbf{W}^T]^T)$, where $\mathbf{W}$ is an $(n-n_0)\times d_0$ matrix with independent $N_{d_0}(\mathbf{0}_{d_0},\mathbf{I}_{d_0})$ rows that are independent of $\mathbf{X}$ and $\mathbf{e}$.

We follow Kobak et al. (2020) in setting the responses corresponding to $\mathbf{W}$ equal to zero (the marginal mean response); a more complicated alternative would be to set them equal to $N_{n-n_0}(\mathbf{0}_{n-n_0},\sigma^2\mathbf{I}_{n-n_0})$ noise that is independent of $\mathbf{e}$, $\mathbf{X}$ and $\mathbf{W}$.
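
A minimal sketch of how the two training sequences can be constructed is given below; it is not the authors' code, and the sizes, seed and signal strength are illustrative only.

```python
# Sequence I appends independent N(0,1) noise predictors as extra columns;
# sequence II appends noise observations (rows of W) with responses set to zero.
import numpy as np

rng = np.random.default_rng(1)
n0, d0, sigma = 50, 25, 0.5
beta0 = rng.normal(size=d0)
X = rng.normal(size=(n0, d0))
y = X @ beta0 + sigma * rng.normal(size=n0)

def sequence_I(d):
    """Training data D^(d): the first d columns of X, or X augmented with noise columns Z."""
    if d <= d0:
        return y, X[:, :d]
    Z = rng.normal(size=(n0, d - d0))
    return y, np.hstack([X, Z])

def sequence_II(n):
    """Training data D_(n): the first n rows, or X augmented with noise rows W and zero responses."""
    if n <= n0:
        return y[:n], X[:n, :]
    W = rng.normal(size=(n - n0, d0))
    return np.concatenate([y, np.zeros(n - n0)]), np.vstack([X, W])

y_d, X_d = sequence_I(100)    # e.g. d = 100: 75 noise predictor columns appended
y_n, X_n = sequence_II(200)   # e.g. n = 200: 150 noise observations appended
```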

As we are interested in the conflicting consequences of the two types of sequences, we study them separately rather than together (i.e. both $d,n\rightarrow\infty$) as in Bartlett et al. (2020) and Hastie et al. (2022). These authors make all the predictors important ($d=d_0$), although Hastie et al. (2022) introduced noise by adding measurement error to the predictors. Nakkiran et al. (2021) increased the number of predictors and the number of observations separately, but made these all important predictors and true observations. Intuitively, insofar as many predictors are “less important” (have non-zero but small coefficients), they have a similar effect to noise predictors. However, clearly separating important and noise predictors as we do is essential for understanding their separate effects. Mitra (2019), albeit studying different estimators from us, included noise predictors but mixed them with important predictors and masked some of the effects we observe. Kobak et al. (2020) also considered both sequences I and II but used specially normalized noise, replacing $\mathbf{Z}$ by $(d-d_0)^{-1/2}\lambda_z^{1/2}\mathbf{Z}$ and $\mathbf{W}$ by $(n-n_0)^{-1/2}\lambda_w^{1/2}\mathbf{W}$, where $\lambda_z,\lambda_w>0$. We argue that our results for unnormalized noise predictors are more interesting and relevant because normalized noise predictors have to be constructed and added by the analyst, whereas we allow noise predictors to simply occur in the data and/or to be added by the analyst. The noise predictors of Kobak et al. (2020) are also asymptotically zero, so increasingly negligible, and following the general recommendation to standardize the predictors (to have variance one) to reduce heterogeneity would remove the normalization. Muthukumar et al. (2020) considered sequences similar to our sequence I, but in their empirical work they only considered one of the cases we consider, namely $d_0\ll n$. In this case, we need to include a large number of noise predictors to reach the interpolation point $d=n$, a circumstance that hides some findings we make by considering a wider set of relationships between $d_0$ and $n$. Finally, Hellkvist et al. (2023) also considered our sequence I (calling noise predictors fake features) when the unknown parameters are zero-mean random vectors. They did not consider noise observations.

1.2 Our results

In our empirical and theoretical study of the estimators and their test errors, we establish two important sets of results.

1) Adding noise predictors or observations to the data as in sequences I and II shrinks the estimators to zero as $d\rightarrow\infty$ or $n\rightarrow\infty$. We compute the test errors for the estimators under the two sequences and prove that they both asymptote to the same value as $d\rightarrow 0$ or $\infty$ and $n\rightarrow 0$ or $\infty$. This is a key result for understanding double descent: the convergence of test errors to the same value at both zero and infinity, combined with the “condition number anomaly” (Dax, 2022), which is reflected in the test error becoming infinite in Theorems 3.1 and 3.2, collectively drive the double descent phenomenon.

Although there are results relevant to sequence I in the literature (e.g. similar expressions for test errors in Hellkvist et al. (2023)), sequence II has not been considered and our results for sequence II are new. Even for sequence I, our results are not as obvious as they may appear. For example, the estimators do not shrink to zero along the sequences of Kobak et al. (2020) or in the setting without noise predictors of Belkin et al. (2020) who suggest that using as many predictors as possible may be optimal in their setting. For a particular case of sequence I and with a different focus on alias and spuriously correlated predictors, Muthukumar et al. (2020) discussed ‘signal bleeding’ into a large number of alias predictors – which is over-shrinkage – and overfitting of noise under parsimonious selection of noise predictors in scenarios where sparse or limited data makes it challenging to distinguish between true predictors and those that are spuriously correlated with the outcome in the training set. For the test errors when increasing the number of predictors, both Muthukumar et al. (2020) and Hastie et al. (2022) found as we do that the test error of the model approaches that of the null model. We show empirically and theoretically that this holds in greater generality than considered by Muthukumar et al. (2020), when we add noise predictors (rather than predictors with measurement error as in Hastie et al. (2022)), and for sequence II. We suggest that in a limited training set, the less-important predictors in Bartlett et al. (2020) behave similarly to noise, shrinking coefficient estimates toward zero. Importantly, this implies that overfitting noise is not always harmless or benign (Muthukumar et al., 2020; Bartlett et al., 2020; Tsigler & Bartlett, 2023). Finally, for well-specified models, Nakkiran (2019) related the peak in the test error to the condition number of the data matrix. In addition, we use the “condition number anomaly” (Dax, 2022) to obtain additional theoretical insights into why the peak always occurs. This result combined with the fixed asymptotes of the test error establishes that double descent has to occur in our setup (and incidentally, also in that of Hastie et al. (2022)). Showing this for both sequence I and II is important as it highlights the importance of shrinkage induced by noise rather than the specific form of the noise in driving double descent.

2) For both sequences I and II, the estimators are approximately ridge regression estimators. Moreover, for sequence II, the ridge regression estimator applied to the augmented data is also approximately a “double shrunk” ridge regression estimator. The “double shrinking” means that the optimal value of the ridge parameter in this estimator can be negative when $n\rightarrow\infty$, even when the predictors are independent and the number of predictors $d=d_0$ is small (i.e. a low-dimensional setting).

The interpretation of least squares estimators applied to noisy data as being like ridge regression estimators applied to “pure” data originated in the measurement error setting (Webb, 1994; Bishop, 1995; Hastie et al., 2022). This is different from adding additional noise predictors or observations. Kobak et al. (2020) added additional normalized noise predictors, but ignored the estimated coefficients of the noise predictors, and noted that a similar interpretation holds when we add normalized noise observations to the data. Our results for the full regression parameter estimator avoid the need to normalize the noise predictors and observations differently from the true versions. The seminal paper on ridge regression (Hoerl & Kennard, 1970) showed that for $d<n$, a ridge estimator with a positive ridge parameter $\lambda$ has a lower mean squared error than the OLS estimator, provided the predictor matrix is of full rank $d$. Subsequent research by Dobriban & Wager (2018) and Hastie et al. (2022) in the high-dimensional setting showed that, for predictors with any covariance matrix $\boldsymbol{\Sigma}$, as $d,n\to\infty$ with $d/n=\gamma$, the optimal ridge parameter $\lambda_{\text{opt}}$ is positive when the orientation of $\boldsymbol{\beta}$ is random. Nakkiran et al. (2020) showed that a positive $\lambda$ is optimal regardless of the rank of the predictor matrix or the orientation of $\boldsymbol{\beta}$, as long as the predictors are independent. Recently, Liang & Rakhlin (2020) and Kobak et al. (2020) have shown that $\lambda_{\text{opt}}$ can be zero or negative when the predictors are correlated. Furthermore, a recent non-asymptotic analysis by Tsigler & Bartlett (2023) showed that under a specific ‘spiked’ covariance structure for the data, in which $\boldsymbol{\Sigma}$ has a few large eigenvalues, a negative $\lambda$ can improve generalization bounds. We find that $\lambda_{\text{opt}}$ is positive under sequence I (which is consistent with Nakkiran et al. (2020)) but can be negative under sequence II, even when the predictors are independent and even in the low-dimensional setting. This is a completely new result.


Our results under sequence I show double descent occurring in the overparameterized regime, but our results for sequence II show it occurring in the underparameterized regime. This shows that double descent is not driven directly by overparameterization, but rather is driven by shrinkage, which can be induced by overparameterization due to including irrelevant noise predictors, by including irrelevant noise observations, or by measurement error in the predictors. The emphasis should be placed on choosing the correct shrinkage, rather than on fitting the exactly correct (possibly sparse) model ($d=d_0$) or an overparameterized (usually complex) model ($d\gg d_0$).

The remainder of the paper is organised as follows. We present numerical results to illustrate our key findings in Section 2 and then present theoretical results that explain our numerical results in Section 3. We present a data application in Section 4 before concluding with a discussion in Section 5. Additional material and proofs of our theorems are given in Supplementary Material.

2 Empirical results

In this section, we empirically explore the effect of noise in the form of noise predictors or noise observations, and its relationship to double descent. We intentionally consider linear regression to make the setup as simple as possible, examine the effect of increasing $d$ or $n$, and then isolate and interpret the observed effects.

Our first set of experiments explores the effect of adding an increasing number of irrelevant noise predictors (i.e. increasing $d$) for different fixed values of $d_0$ with $n=n_0=50$. Full details of the data generating steps and the construction of the figures are set out in Algorithm S1 in the Supplementary Material.
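
The sketch below conveys the structure of this experiment; it is in the spirit of, but not identical to, Algorithm S1 (the grid of $d$ values, the number of replicates and the seed are illustrative).

```python
# For each d, fit the working model by (minimum norm) least squares and record
# the root-mean-squared test error, averaged over replicates.
import numpy as np

rng = np.random.default_rng(2)
n0, d0, sigma, n_test = 50, 25, 0.5, 5000
d_grid = [5, 15, 25, 40, 48, 52, 60, 100, 200, 400]

def rmse_curve(n_rep=100):
    out = np.zeros(len(d_grid))
    for _ in range(n_rep):
        beta0 = rng.normal(size=d0)
        beta0 /= np.linalg.norm(beta0)                  # weak signal-to-noise ratio
        X = rng.normal(size=(n0, d0))
        y = X @ beta0 + sigma * rng.normal(size=n0)
        Xt = rng.normal(size=(n_test, d0))
        yt = Xt @ beta0 + sigma * rng.normal(size=n_test)
        for j, d in enumerate(d_grid):
            Z = rng.normal(size=(n0, max(d - d0, 0)))   # noise predictors (training)
            Zt = rng.normal(size=(n_test, max(d - d0, 0)))
            b = np.linalg.pinv(np.hstack([X[:, :min(d, d0)], Z])) @ y
            pred = np.hstack([Xt[:, :min(d, d0)], Zt]) @ b
            out[j] += np.sqrt(np.mean((yt - pred) ** 2)) / n_rep
    return out

print(dict(zip(d_grid, np.round(rmse_curve(), 3))))
```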

Figure 1 shows the effect on the test error of increasing $d$ by adding noise predictors. We observe a second descent in all cases, but the second descent does not always correspond to the optimal model in test error. In Figures 1a and 1d, the data are generated by a model with $d_0<n$ and an underparameterized model performs better than the overparameterized model defined by the second descent. When $d_0$ is near the interpolation point ($d_0=n$), the minimum test error is likely to occur during the second descent, after escaping the zone around the interpolation point (see Figures 1b and 1e). If $d_0>n$, we obtain a useful second descent in the overparameterized regime in the sense that the global minimum of the test error is achieved in the second descent (see Figures 1c and 1f). Clearly, we need to consider the dimension $d_0$ of the data generating model (also called the true dimension) as well as $d$ and $n$ to understand double descent. It is also known that the signal-to-noise ratio of the model affects double descent (Hastie et al., 2022). We demonstrate this by recreating Figures 1a, 1b, and 1c with a higher signal-to-noise ratio in Figures 1d, 1e, and 1f.


Figure 1: Root-mean-squared training (blue) and test (red) errors obtained using Algorithm S1. We used $n_0=50$ and varied $d_0$ in the subplots as (a) and (d) $d_0=25$, (b) and (e) $d_0=50$, and (c) and (f) $d_0=75$. In subplots (a), (b), and (c), we have a weak signal-to-noise ratio ($\boldsymbol{\beta}_0=\boldsymbol{\beta}/\|\boldsymbol{\beta}\|$), while in subplots (d), (e), and (f), we have a strong signal-to-noise ratio ($\boldsymbol{\beta}_0=\boldsymbol{\beta}$). The vertical grey and black lines mark $n_0$ and $d_0$, respectively, and the horizontal black line marks the error of the null model with $\hat{\boldsymbol{\beta}}^{(0)}=\mathbf{0}$.

The second descent in the overparameterized regime has led to questioning the classical bias-variance trade-off principle because bias and variance are considered irrelevant in the overparameterized regime. However, the test error can be decomposed into bias and variance terms in either regime, and we can still examine these functions. The bias and variance components of the test error shown in Figures 1a, 1b, and 1c are plotted in Figure 2. Overall, the bias starts out relatively high but decreases as $d$ increases. The bias reaches a minimum in the underparameterized regime and increases to its highest value near the interpolation point. It reaches another minimum in the overparameterized regime after escaping the zone around the interpolation threshold and increases again, ultimately asymptoting to a limit. The variance, on the other hand, starts relatively low, but as the complexity of the model increases, it increases to its highest point near the interpolation point. Subsequently, in the overparameterized regime, the variance decreases, eventually reaching a minimum equal to the error variance ($\sigma^2=0.25$ for these data) asymptotically.
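
The decomposition can be computed empirically in the obvious way; the following sketch (not the authors' code, with illustrative sizes) holds a set of test points fixed, redraws the training data, and reports one standard version of the decomposition, with the squared bias and the variance of the predictions separated from $\sigma^2$ (the convention used for Figure 2 may differ in detail).

```python
# Empirical bias-variance decomposition at fixed test points.
import numpy as np

rng = np.random.default_rng(3)
n0, d0, d, sigma, n_test, n_rep = 50, 25, 100, 0.5, 200, 100
beta0 = rng.normal(size=d0)
beta0 /= np.linalg.norm(beta0)
X_test = rng.normal(size=(n_test, d))        # first d0 columns real, rest noise
f_test = X_test[:, :d0] @ beta0              # true regression function at the test points

preds = np.zeros((n_rep, n_test))
for r in range(n_rep):
    X = rng.normal(size=(n0, d))             # training data with d - d0 noise predictors
    y = X[:, :d0] @ beta0 + sigma * rng.normal(size=n0)
    preds[r] = X_test @ (np.linalg.pinv(X) @ y)

bias2 = np.mean((preds.mean(axis=0) - f_test) ** 2)
variance = np.mean(preds.var(axis=0))
print(f"bias^2 = {bias2:.3f}, variance = {variance:.3f}, "
      f"expected test error ~ {bias2 + variance + sigma**2:.3f}")
```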


Figure 2: Bias and variance components of the test error shown in Figure 1. Row 1 shows the bias and row 2 shows the variance; columns (a), (b), and (c) correspond to the settings in subplots (a), (b), and (c) in Figure 1, respectively. The individual red lines are the averages of 100 predictions for each of the 5000 test samples, and the bold black line is the average of 5000 red lines.

Examining the effect of including noise predictors on the parameter estimates in the working model is also interesting. Figure 3 shows boxplots of the estimated intercept and the slope coefficients up to the first 60 predictors included in each working model with $d_0=50$ and $n=n_0=50$ from 1000 trials. In the top panel of Figure 3, 25 important predictors are omitted. Thus, the coefficient estimates for the included predictors are unbiased (the medians coincide with the true values, as indicated by red dots). This occurs because the omitted predictors are independent of the important predictors included in the model; hence, there is no omitted variable bias. However, it is important to note that the test error includes bias due to the underfitted model, as shown in Figure 2a. In the bottom three panels of Figure 3, the estimates become increasingly biased as we include more irrelevant noise predictors: the estimated coefficients shrink towards zero. Larger coefficients shrink at a faster rate than smaller ones. The variances of the estimates increase as we approach the interpolation point in the underparameterized regime, peaking at the interpolation point (results not shown), and decrease as we add more irrelevant predictors in the overparameterized regime. Thus, the noise predictors have a shrinkage effect on the estimated coefficients, resulting in them approaching zero as $d-d_0$ increases.


Figure 3: Estimated regression coefficients from 1000 trials, with $n_0=50$ and $d_0=50$. The four subplots correspond to four values of $d$ along the horizontal axis in Figure 1b. The red dots indicate the true nonzero coefficients; the true zero coefficients for the noise variables are not shown. The estimated coefficients associated with the $d-d_0$ noise variables behave similarly, so we show the results only for the first nine noise variables. No noise variables ($d<d_0$) are in the top subplot, and 25 out of $d_0=50$ variables were randomly selected to predict the response. In the bottom three subplots, all 50 variables were used together with $d-d_0\in\{25,50,250\}$ noise variables. The first boxplot in each subplot represents the estimated intercept.

So far, we have treated the size of the training set $n$ as fixed and varied the number of predictors $d$ (corresponding to parameters) in the working model. It is also plausible that if $n$ is large, some of the observations in the training set may be noise observations that are not generated by the data generating model. Therefore, we consider the problem in which $d$ is fixed and $n$ increases as we add irrelevant noise observations to the training data. In the additional observations, we set the response equal to zero for simplicity. Full details of the data generating steps and the construction of the figures are set out in Algorithm S2 in the Supplementary Material.
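
The following sketch is in the spirit of, but not identical to, Algorithm S2 (sizes, grid and seed are illustrative): it keeps $d=d_0$ fixed, augments the training set with noise observations whose responses are zero, and tracks the test error as $n$ grows.

```python
# Test error along sequence II: add noise rows W with zero responses.
import numpy as np

rng = np.random.default_rng(4)
d0, n0, sigma, n_test = 15, 25, 0.5, 5000
beta0 = rng.normal(size=d0)
beta0 /= np.linalg.norm(beta0)                         # weak signal-to-noise ratio
X = rng.normal(size=(n0, d0))
y = X @ beta0 + sigma * rng.normal(size=n0)
Xt = rng.normal(size=(n_test, d0))
yt = Xt @ beta0 + sigma * rng.normal(size=n_test)

for n in [5, 10, 14, 16, 25, 50, 200, 1000]:
    if n <= n0:
        yn, Xn = y[:n], X[:n]
    else:                                              # append n - n0 noise observations
        W = rng.normal(size=(n - n0, d0))
        yn, Xn = np.concatenate([y, np.zeros(n - n0)]), np.vstack([X, W])
    b = np.linalg.pinv(Xn) @ yn
    rmse = np.sqrt(np.mean((yt - Xt @ b) ** 2))
    print(f"n = {n:4d}, test RMSE = {rmse:.3f}")
```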

Figure 4 shows the effect on the training and test errors, and Figure 5 shows boxplots of the regression coefficients, as we increase $n$ from the overparameterized to the underparameterized regime. In Figures 4a, 4d, and 4e, the number of standard normal noise observations required to escape the zone near the interpolation point $n=d$ causes overshrinkage, as shown in Figure 5. Consequently, the minimum test error is achieved in the overparameterized regime. Figure 4b illustrates that, by incorporating random observations, a global minimum in the underparameterized regime can be achieved with an appropriate amount of noise. Figures 4c and 4f show that when the signal-to-noise ratio is low, more noise is required to reach the lowest test error than when the signal-to-noise ratio is high. Interestingly, Figure 4 reveals that for double descent, the number of predictors $d$ and the training sample size $n$ do not need to be large. Since double descent can occur for small $d_0$, it is not the overparameterization of the model with noise predictors that improves prediction performance but the regularization due to the inclusion of noise.


Figure 4: Root-mean-squared errors obtained using Algorithm S2. We used $d=d_0=15$ and varied $n_0$, the number of observations from the model in the training set, in the subplots as (a) and (d) $n_0=5$, (b) and (e) $n_0=15$, and (c) and (f) $n_0=25$. In subplots (a), (b), and (c), we have a weak signal-to-noise ratio ($\boldsymbol{\beta}_0=\boldsymbol{\beta}/\|\boldsymbol{\beta}\|$), while in subplots (d), (e), and (f), we have a strong signal-to-noise ratio ($\boldsymbol{\beta}_0=\boldsymbol{\beta}$). The vertical grey and black lines mark $n_0$ and $d_0$, respectively, and the horizontal black line marks the error of the null model with $\hat{\boldsymbol{\beta}}_{(n)}=\mathbf{0}$.


Figure 5: Estimated regression coefficients from 1000 trials with $d_0=15$ and $n_0=25$ (the setting in Figure 4c). The four subplots correspond to four values of $n$ along the horizontal axis in Figure 4c. The red dots indicate the true non-zero coefficients. The top subplot has no noise observations ($n=n_0$). The first boxplot in each subplot represents the estimated intercept.

We find that OLS/minimum norm OLS estimation performs regularization in the presence of noise. However, the amount of noise present in real data may not provide optimal regularization, resulting in undershrinkage or overshrinkage. Undershrinkage could be addressed by generating and adding artificial noise, but it seems easier to control the shrinkage by using explicit regularization as in ridge regression. If there is overshrinkage through too much noise in the model, then ridge regression with $\lambda>0$ increases the shrinkage and makes prediction worse. On the other hand, ridge regression with a negative $\lambda$ can reduce the shrinkage due to noise and, as shown by Kobak et al. (2020) in a particular scenario, the optimal $\lambda$ can be zero or negative. We contend that this phenomenon is not limited to such specific cases and can manifest in many situations when noise is present in the data. Figure 6 illustrates that the optimal value of $\lambda$ can be negative when there are large numbers of noise observations.
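
The sketch below (not the authors' code; all settings are illustrative) makes the point concrete: with many noise observations in the training set, a grid search over the ridge parameter, allowing negative values that keep the penalized Gram matrix positive definite, selects a negative $\lambda$.

```python
# Grid search for the test-error-minimizing ridge parameter on noise-augmented data.
import numpy as np

rng = np.random.default_rng(5)
d0, n0, n, sigma, n_test = 30, 200, 250, 0.5, 5000
beta0 = rng.normal(size=d0)                            # strong signal-to-noise ratio
X = rng.normal(size=(n0, d0))
y = X @ beta0 + sigma * rng.normal(size=n0)
W = rng.normal(size=(n - n0, d0))                      # noise observations
X_aug, y_aug = np.vstack([X, W]), np.concatenate([y, np.zeros(n - n0)])
Xt = rng.normal(size=(n_test, d0))
yt = Xt @ beta0 + sigma * rng.normal(size=n_test)

G = X_aug.T @ X_aug
lam_min = -0.8 * np.linalg.eigvalsh(G).min()           # keep G + lam*I positive definite
lams = np.linspace(lam_min, 50.0, 201)
errs = [np.mean((yt - Xt @ np.linalg.solve(G + lam * np.eye(d0), X_aug.T @ y_aug)) ** 2)
        for lam in lams]
print("optimal lambda:", round(float(lams[int(np.argmin(errs))]), 1))
```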


Figure 6: Heatmap of the average optimal ridge parameter $\lambda$ over 500 trials as a function of $n_0$ and $n-n_0$. Here, “optimal” means minimizing the test error. The data were generated following Algorithm S2 with $d_0=30$. We set $\boldsymbol{\beta}_0=\boldsymbol{\beta}$ for a high signal-to-noise ratio. Purple-blue values are positive and orange-red values are negative.

It has been established numerically in the literature that more than two descents can occur in particular settings (Nakkiran et al., 2020; Liang et al., 2020). For independent predictors, we found that multiple descents occur when the predictors are not on the same scale and that multiple descents can be reduced by ensuring that all predictors are on the same scale; however, double descent remains inevitable. This is shown numerically in Figure S1 in the Supplementary Material.

3 Theoretical results

3.1 The estimators

We construct the two sequences of training data $\mathcal{D}^{(d)}$ and $\mathcal{D}_{(n)}$ as described in the Introduction. Then we fit working linear regression models to the two sequences of training data by estimating the regression parameters using the ordinary least squares (OLS) and minimum norm OLS estimators when $d<n_0$ and $d>n_0$ for sequence I, and $n>d_0$ and $n<d_0$ for sequence II, respectively. We write these linear estimators as $\hat{\boldsymbol{\beta}}^{(d)}=\mathbf{M}^{(d)}\mathbf{y}$ for sequence I and $\hat{\boldsymbol{\beta}}_{(n)}=\mathbf{M}_{(n)}\mathbf{y}$ for sequence II, where $\mathbf{M}^{(d)}$ is a $d\times n_0$ matrix and $\mathbf{M}_{(n)}$ is a $d_0\times n_0$ matrix. The matrix $\mathbf{M}^{(d)}$ depends on $\mathbf{X}^{(d)}$ or $[\mathbf{X},\mathbf{Z}]$ and $\mathbf{M}_{(n)}$ depends on $\mathbf{X}_{(n)}$ or $[\mathbf{X}^T,\mathbf{W}^T]^T$, but neither depends on $\mathbf{y}$. For sequence I, when $d\leq\min(n_0,d_0)$, $\mathbf{M}^{(d)}=(\mathbf{X}^{(d)T}\mathbf{X}^{(d)})^{-1}\mathbf{X}^{(d)T}$. If $d_0>n_0$, we switch for $n_0<d\leq d_0$ to the minimum norm OLS estimator with $\mathbf{M}^{(d)}=\mathbf{X}^{(d)T}(\mathbf{X}^{(d)}\mathbf{X}^{(d)T})^{-1}$ and then, for $d>d_0$, $\mathbf{M}^{(d)}=[\mathbf{X},\mathbf{Z}]^T(\mathbf{X}\mathbf{X}^T+\mathbf{Z}\mathbf{Z}^T)^{-1}$; if $d_0<n_0$, for $d_0<d\leq n_0$, we use $\mathbf{M}^{(d)}=\{[\mathbf{X},\mathbf{Z}]^T[\mathbf{X},\mathbf{Z}]\}^{-1}[\mathbf{X},\mathbf{Z}]^T$ before we switch to the minimum norm OLS estimator for $d\geq n_0$. For sequence II, when $n\leq\min(n_0,d_0)$, $\mathbf{M}_{(n)}=\mathbf{X}_{(n)}^T(\mathbf{X}_{(n)}\mathbf{X}_{(n)}^T)^{-1}[\mathbf{I}_n,\mathbf{0}_{n,n_0-n}]$. If $d_0<n_0$, we switch to the OLS estimator for $d_0<n\leq n_0$ so that $\mathbf{M}_{(n)}=(\mathbf{X}_{(n)}^T\mathbf{X}_{(n)})^{-1}\mathbf{X}_{(n)}^T[\mathbf{I}_n,\mathbf{0}_{n,n_0-n}]$ and then, for $n>n_0$, $\mathbf{M}_{(n)}=(\mathbf{X}^T\mathbf{X}+\mathbf{W}^T\mathbf{W})^{-1}\mathbf{X}^T$; if $d_0>n_0$, for $n_0<n\leq d_0$, $\mathbf{M}_{(n)}=\mathbf{X}^T\mathbf{A}^{11}+\mathbf{W}^T\mathbf{A}^{21}$, where $\mathbf{A}^{11}=\{\mathbf{X}\mathbf{X}^T-\mathbf{X}\mathbf{W}^T(\mathbf{W}\mathbf{W}^T)^{-1}\mathbf{W}\mathbf{X}^T\}^{-1}$ and $\mathbf{A}^{21}=-(\mathbf{W}\mathbf{W}^T)^{-1}\mathbf{W}\mathbf{X}^T\mathbf{A}^{11}$, before we switch to the OLS estimator for $n>d_0$. The second expression for $\mathbf{M}_{(n)}$ is obtained by letting $\mathbf{A}=[\mathbf{X}^T,\mathbf{W}^T]^T[\mathbf{X}^T,\mathbf{W}^T]$, partitioning it into its natural four submatrices, and then using standard formulas for the inverse of a partitioned matrix to obtain the corresponding terms in $\mathbf{A}^{-1}$.
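
As a small numerical check of the partitioned-matrix form of $\mathbf{M}_{(n)}$ for $n_0<n\leq d_0$, the sketch below (illustrative sizes, not the authors' code) compares $(\mathbf{X}^T\mathbf{A}^{11}+\mathbf{W}^T\mathbf{A}^{21})\mathbf{y}$ with the minimum norm OLS solution computed directly from the pseudoinverse of the augmented data.

```python
# Verify M_(n) y = (X^T A^11 + W^T A^21) y against the direct minimum norm fit.
import numpy as np

rng = np.random.default_rng(6)
d0, n0, n = 40, 15, 30                                 # n0 < n <= d0
X = rng.normal(size=(n0, d0))
y = X @ rng.normal(size=d0) + 0.5 * rng.normal(size=n0)
W = rng.normal(size=(n - n0, d0))

# Direct minimum norm OLS on the augmented data ([y; 0], [X; W]).
D = np.vstack([X, W])
beta_direct = np.linalg.pinv(D) @ np.concatenate([y, np.zeros(n - n0)])

# Partitioned-matrix form.
A11 = np.linalg.inv(X @ X.T - X @ W.T @ np.linalg.inv(W @ W.T) @ W @ X.T)
A21 = -np.linalg.inv(W @ W.T) @ W @ X.T @ A11
beta_blocks = (X.T @ A11 + W.T @ A21) @ y

print(np.allclose(beta_direct, beta_blocks))           # True up to rounding error
```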

We showed in Figures 3 and 5 that adding noise predictors in sequence I or noise observations in sequence II induces shrinkage in $\hat{\boldsymbol{\beta}}^{(d)}$ and $\hat{\boldsymbol{\beta}}_{(n)}$, respectively, and eventually shrinks these estimators to zero. We now confirm this effect theoretically as $d$ or $n\rightarrow\infty$. We let ‘a.s.’ mean ‘almost surely’ and ‘P’ mean ‘in probability’. Also, let $\|\mathbf{x}\|=(\mathbf{x}^T\mathbf{x})^{1/2}$ denote the Euclidean norm of $\mathbf{x}$. For sequence I, for $d>\max(d_0,n_0)$, write $\hat{\boldsymbol{\beta}}^{(d)}=(\hat{\boldsymbol{\beta}}^{(d)T}_{d_0},\hat{\boldsymbol{\beta}}^{(d)T}_{-d_0})^T$, where $\hat{\boldsymbol{\beta}}^{(d)}_{d_0}=\mathbf{X}^T(\mathbf{X}\mathbf{X}^T+\mathbf{Z}\mathbf{Z}^T)^{-1}\mathbf{y}$ denotes the first $d_0$ elements and $\hat{\boldsymbol{\beta}}^{(d)}_{-d_0}=\mathbf{Z}^T(\mathbf{X}\mathbf{X}^T+\mathbf{Z}\mathbf{Z}^T)^{-1}\mathbf{y}$ denotes the remaining $d-d_0$ elements of $\hat{\boldsymbol{\beta}}^{(d)}$. By the strong law of large numbers, the $n_0\times n_0$ matrix $\mathbf{X}\mathbf{X}^T+\mathbf{Z}\mathbf{Z}^T=O(d)$ a.s. as $d\rightarrow\infty$. Since $\hat{\boldsymbol{\beta}}^{(d)}_{d_0}\in\mathbb{R}^{d_0}$ and $\hat{\boldsymbol{\beta}}^{(d)}_{-d_0}\in\mathbb{R}^{d-d_0}$, we have $\|\hat{\boldsymbol{\beta}}^{(d)}_{d_0}\|=O(d^{-1}d_0^{1/2})$ and $\|\hat{\boldsymbol{\beta}}^{(d)}_{-d_0}\|=O(d^{-1}(d-d_0)^{1/2})$ a.s. as $d\rightarrow\infty$. Similarly, for sequence II, for $n>\max(n_0,d_0)$, $\hat{\boldsymbol{\beta}}_{(n)}=(\mathbf{X}^T\mathbf{X}+\mathbf{W}^T\mathbf{W})^{-1}\mathbf{X}^T\mathbf{y}$, the $d_0\times d_0$ matrix $\mathbf{X}^T\mathbf{X}+\mathbf{W}^T\mathbf{W}=O(n)$ a.s. as $n\rightarrow\infty$, and $\|\hat{\boldsymbol{\beta}}_{(n)}\|=O(n^{-1}n_0)$ a.s. as $n\rightarrow\infty$. That is, for $d_0$ and $n_0$ fixed or increasing sufficiently slowly in sequences I and II, respectively, noise predictors or observations eventually shrink the estimated coefficients to zero.

3.2 The test error

We examine the performance of the working models fitted along the two training sequences through the test error. In our empirical work, we compute empirical test errors by computing the mean squared error over a large test sample. In theoretical work, we calculate the model-expectation of the squared error at a single observation $(y',\mathbf{x}'^{(d)})\in\mathbb{R}^{d+1}$ that is generated independently from the same data generating model as the training data, namely $\mathbf{x}'\sim N_{d_0}(\mathbf{0}_{d_0},\mathbf{I}_{d_0})$, $y'=\mathbf{x}'^T\boldsymbol{\beta}_0+e'$ with $e'\sim N(0,\sigma^2)$, and $\mathbf{x}'^{(d)}$ contains the first $d$ elements of $\mathbf{x}'$ if $d\leq d_0$, and $\mathbf{x}'^{(d)}=[\mathbf{x}'^T,\mathbf{z}'^T]$, where $\mathbf{z}'\sim N_{d-d_0}(\mathbf{0}_{d-d_0},\mathbf{I}_{d-d_0})$ independent of $\mathbf{x}'$, if $d>d_0$. We obtain different test errors, depending on what we condition on. For example, corresponding to the light red curves in our figures, we have the conditional (on the training data) test errors

\[
R^{(d)}(\mathcal{D}^{(d)})=\begin{cases}\mathrm{E}_{y',\mathbf{x}'}\big(y'-\mathbf{x}'^{(d)T}\hat{\boldsymbol{\beta}}^{(d)}\big)^2, & d\leq d_0\\[3pt] \mathrm{E}_{y',\mathbf{x}'}\big(y'-[\mathbf{x}'^T,\mathbf{z}'^T]\hat{\boldsymbol{\beta}}^{(d)}\big)^2, & d>d_0\end{cases}\qquad R_{(n)}(\mathcal{D}_{(n)})=\mathrm{E}_{y',\mathbf{x}'}\big(y'-\mathbf{x}'^T\hat{\boldsymbol{\beta}}_{(n)}\big)^2.
\]

Note that for sequence II, we compute the test error under the model for the real data $(y',\mathbf{x}')$ rather than for noise observations $(0,\mathbf{w}')$ because this is the prediction we are actually interested in; alternatives such as the test error under the noise observations or a linear combination of these two possibilities (e.g. with weights $n_0/n$ and $(n-n_0)/n$) could also be considered. The dark red curves in our figures are (unconditional) test errors $R^{(d)}$ and $R_{(n)}$ obtained by taking the expectations (over $\mathcal{D}^{(d)}$ or $\mathcal{D}_{(n)}$) of $R^{(d)}(\mathcal{D}^{(d)})$ and $R_{(n)}(\mathcal{D}_{(n)})$.
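
For sequence I with $d>d_0$, the conditional test error also has a simple closed form given the training data: because $\mathbf{x}'$ and $\mathbf{z}'$ are independent standard normal, $R^{(d)}(\mathcal{D}^{(d)})=\sigma^2+\|\boldsymbol{\beta}_0-\hat{\boldsymbol{\beta}}^{(d)}_{d_0}\|^2+\|\hat{\boldsymbol{\beta}}^{(d)}_{-d_0}\|^2$. The sketch below (not the authors' code; sizes are illustrative) checks this identity against a Monte Carlo estimate.

```python
# Conditional test error: closed form versus Monte Carlo for a single training set.
import numpy as np

rng = np.random.default_rng(7)
n0, d0, d, sigma = 50, 25, 100, 0.5
beta0 = rng.normal(size=d0)
X = rng.normal(size=(n0, d0))
y = X @ beta0 + sigma * rng.normal(size=n0)
Z = rng.normal(size=(n0, d - d0))
beta_hat = np.linalg.pinv(np.hstack([X, Z])) @ y

closed_form = sigma**2 + np.sum((beta0 - beta_hat[:d0])**2) + np.sum(beta_hat[d0:]**2)

n_mc = 200000
Xt = rng.normal(size=(n_mc, d))                        # first d0 columns real, rest noise
yt = Xt[:, :d0] @ beta0 + sigma * rng.normal(size=n_mc)
monte_carlo = np.mean((yt - Xt @ beta_hat)**2)
print(round(float(closed_form), 3), round(float(monte_carlo), 3))
```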

The test error under sequence I has been obtained by Breiman & Freedman (1983) and Belkin et al. (2019). We state the theorem for completeness.

Theorem 3.1.

(Breiman & Freedman, 1983; Belkin et al., 2019) Suppose that the training set is constructed as in sequence I. If $d<d_0$, partition $\boldsymbol{\beta}_0$ into the first $d$ elements and the remaining $d_0-d$ elements as $\boldsymbol{\beta}_0=[\boldsymbol{\beta}_0^{(d)T},\tilde{\boldsymbol{\beta}}_0^{(d)T}]^T$; if $d\geq d_0$, $\boldsymbol{\beta}_0^{(d)}=\boldsymbol{\beta}_0$ and $\tilde{\boldsymbol{\beta}}_0^{(d)}$ is replaced by zero. Then, for $d\leq n_0-2$, the estimator is OLS and the test error is

\[
R^{(d)}=(\|\tilde{\boldsymbol{\beta}}_0^{(d)}\|^2+\sigma^2)\Big(1+\frac{d}{n_0-1-d}\Big),
\]

for $d\in\{n_0-1,n_0,n_0+1\}$, the test error is infinite, and for $d\geq n_0+2$, the estimator is minimum norm OLS and the test error is

\[
R^{(d)}=(\|\tilde{\boldsymbol{\beta}}_0^{(d)}\|^2+\sigma^2)\Big(1+\frac{n_0}{d-1-n_0}\Big)+\|\boldsymbol{\beta}_0^{(d)}\|^2\Big(1-\frac{n_0}{d}\Big).\qquad (1)
\]

One insight that is not obvious from the form of Theorem 3.1 is that for $d>\max(n_0,d_0)$, the test error $R^{(d)}$ in (1) can be written as the sum of $\sigma^2+\operatorname{E}\|\{\mathbf{X}^T(\mathbf{X}\mathbf{X}^T+\mathbf{Z}\mathbf{Z}^T)^{-1}\mathbf{X}-\mathbf{I}_{d_0}\}\boldsymbol{\beta}_0\|^2+\sigma^2\operatorname{E}\operatorname{trace}\{\mathbf{X}\mathbf{X}^T(\mathbf{X}\mathbf{X}^T+\mathbf{Z}\mathbf{Z}^T)^{-2}\}$, the test error due to the first $d_0$ components of $\hat{\boldsymbol{\beta}}^{(d)}$, and $\operatorname{E}\|\mathbf{Z}^T(\mathbf{X}\mathbf{X}^T+\mathbf{Z}\mathbf{Z}^T)^{-1}\mathbf{X}\boldsymbol{\beta}_0\|^2+\sigma^2\operatorname{E}\operatorname{trace}\{\mathbf{Z}\mathbf{Z}^T(\mathbf{X}\mathbf{X}^T+\mathbf{Z}\mathbf{Z}^T)^{-2}\}$, the test error due to the last $d-d_0$ components of $\hat{\boldsymbol{\beta}}^{(d)}$, respectively. See the Supplementary Material for details. We do not have analytic expressions for these component terms. Nonetheless, they are important because they show that the estimated coefficients of the noise predictors contribute to the test error, so these predictors need not be beneficial or even harmless.

The test error under sequence II is more complicated to work with than that under sequence I because analytic expressions are not available once we augment the data with noise observations ($n>n_0$). Nonetheless, we obtain the following theorem.

Theorem 3.2.

Suppose the training set is constructed as in sequence II. For $n<\min(n_0,d_0)$, the estimator is minimum norm OLS and the test error is

\[
R_{(n)}=\begin{cases}\|\boldsymbol{\beta}_0\|^2\Big(1-\dfrac{n}{d_0}\Big)+\sigma^2\Big(1+\dfrac{n}{d_0-1-n}\Big), & n\leq d_0-2\\[4pt] \infty, & n\in\{d_0-1,d_0\}.\end{cases}
\]

If $n_0<d_0$ and $n_0\leq n<d_0$, the estimator is minimum norm OLS, so the test error is

\[
R_{(n)}=\sigma^2+\operatorname{E}\|(\mathbf{X}^T\mathbf{A}^{11}\mathbf{X}+\mathbf{W}^T\mathbf{A}^{21}\mathbf{X}-\mathbf{I}_{d_0})\boldsymbol{\beta}_0\|^2+\sigma^2\operatorname{E}\operatorname{trace}\{\mathbf{A}^{11}\mathbf{X}\mathbf{X}^T\mathbf{A}^{11}+(\mathbf{A}^{21})^T\mathbf{W}\mathbf{X}^T\mathbf{A}^{11}+\mathbf{A}^{11}\mathbf{X}\mathbf{W}^T\mathbf{A}^{21}+(\mathbf{A}^{21})^T\mathbf{W}\mathbf{W}^T\mathbf{A}^{21}\},
\]

where $\mathbf{A}^{11}=\{\mathbf{X}\mathbf{X}^T-\mathbf{X}\mathbf{W}^T(\mathbf{W}\mathbf{W}^T)^{-1}\mathbf{W}\mathbf{X}^T\}^{-1}$ and $\mathbf{A}^{21}=-(\mathbf{W}\mathbf{W}^T)^{-1}\mathbf{W}\mathbf{X}^T\mathbf{A}^{11}$.
If $d_0<n_0$ and $d_0\leq n<n_0$, the estimator is OLS and the test error is

\[
R_{(n)}=\begin{cases}\infty, & n\in\{d_0,d_0+1\}\\[3pt] \sigma^2\Big(1+\dfrac{d_0}{n-d_0-1}\Big), & n\geq d_0+2.\end{cases}
\]

If $n>\max(n_0,d_0)$, the training set is augmented by noise observations and the estimator is OLS, so the test error is

\[
R_{(n)}=\sigma^2+\operatorname{E}\|\{(\mathbf{X}^T\mathbf{X}+\mathbf{W}^T\mathbf{W})^{-1}\mathbf{X}^T\mathbf{X}-\mathbf{I}_{d_0}\}\boldsymbol{\beta}_0\|^2+\sigma^2\operatorname{E}\operatorname{trace}\{\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X}+\mathbf{W}^T\mathbf{W})^{-2}\}.\qquad (2)
\]

The proof is given in the Supplementary Material. Because $d=d_0$, there is no decomposition of the test error (2) corresponding to that of (1).

3.3 The effect of including noise on the test error

We saw empirically in Figures 3 and 5 and theoretically in Section 3.1 that adding noise to the training data shrinks the estimators to zero as $d$ or $n\rightarrow\infty$. This means that, as shown in Figure 1, the test errors should converge, as $d$ or $n\rightarrow\infty$, to the test error of the null model with $d=n=0$. That is, both ends of the test error should equal the same value $\operatorname{E}(y'^2)=\|\boldsymbol{\beta}_0\|^2+\sigma^2$. This has also been observed by Muthukumar et al. (2020) and, in a different setup, by Hastie et al. (2022), but it has not been proved to occur. That these asymptotes hold under sequence I follows immediately from Theorem 3.1. It is much more challenging to prove that they hold under sequence II (because there is no analytic form for the test error under sequence II). We now prove that this occurs in Theorem 3.3; the proof is given in the Supplementary Material.

Theorem 3.3.

Under sequence I, as $d\rightarrow\infty$ with $d_0$ fixed,

\[
R^{(d)}=\sigma^2+\|\boldsymbol{\beta}_0\|^2+O(d^{-1}).
\]

Under sequence II, as $n\rightarrow\infty$ with $n_0$ fixed,

\[
R_{(n)}=\sigma^2+\|\boldsymbol{\beta}_0\|^2+O(n^{-1}).
\]

We can extend the proof under sequence II to obtain further terms in the approximation. As these do not give any additional insight, we have kept the result to the simplest one we need and use.

3.4 Why double descent occurs

The fact that both ends of the test error equal the same value means double descent occurs whenever the test error increases above the initial test error. The test error can increase (e.g. Figure 1c) or decrease (e.g. Figure 1a) from its value at the origin. If the test error increases immediately, it has a local minimum at the origin (which we still call the first descent), while if it decreases and then increases above its initial value, it has a local minimum before the interpolation point ($d=n$). In either case, it will have to descend again (the second descent) to achieve its asymptote. If it decreases below the null model test error, the test error has to increase again to achieve its asymptote. The second descent does not necessarily achieve a global minimum; whether this occurs or not is related to the amount of shrinkage induced by the noise variables or observations.

What makes the test error increase above the null model test error? The estimators and their test errors involve $(\mathbf{D}^T\mathbf{D})^{-1}$ or $(\mathbf{D}\mathbf{D}^T)^{-1}$ with $\mathbf{D}$ equal to $\mathbf{X}^{(d)}$, $[\mathbf{X},\mathbf{Z}]$, $\mathbf{X}_{(n)}$ or $[\mathbf{X}^T,\mathbf{W}^T]^T$. The underlying driver behind the test error behavior is the generalized condition number $\kappa(\mathbf{D})$ of these matrices. In our examples, the elements of the matrix $\mathbf{D}\in\mathbb{R}^{n\times d}$ are independently sampled from the same probability distribution. The generalized condition number $\kappa(\mathbf{D})$ of an $n\times d$ rectangular matrix $\mathbf{D}$ of rank $r\leq\min(n,d)$ is $\kappa(\mathbf{D})=\gamma_1(\mathbf{D})/\gamma_r(\mathbf{D})$, where $\gamma_1(\mathbf{D})\geq\ldots\geq\gamma_r(\mathbf{D})$ are the non-zero singular values of $\mathbf{D}$ in descending order (Zielke, 1988). As a direct consequence of the Cauchy Interlacing Theorem (Horn & Johnson, 2012, Theorem 4.3.4), for a fixed $n$ and $d<n$, the condition number of $\mathbf{D}$ increases monotonically as $d\to n$; if $\mathbf{D}$ is of full rank, it reaches its maximum value when $d=n$. The increasing condition number of $\mathbf{D}$ causes the variance and the bias of the estimated coefficients to increase, with the maximum variance and maximum bias, and hence maximum test error, at the interpolation point. If this maximum is above the asymptote, we observe a second descent in the overparameterized regime. For $d>n$, the condition number of $\mathbf{D}$ is itself likely to decrease as we increase $d$ (the “condition number anomaly”, Dax, 2022). This behavior is illustrated in Figure S2 in the Supplementary Material. The shape of the curves is strikingly similar to that of the variance curves in the second row of Figure 2.
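
A quick simulation (not the authors' code; the sizes are illustrative) shows this behavior of the generalized condition number for standard Gaussian matrices: it rises as $d$ approaches $n$ and falls again as $d$ grows past $n$.

```python
# Average generalized condition number of an n x d standard Gaussian matrix.
import numpy as np

rng = np.random.default_rng(8)
n = 50
for d in [10, 25, 40, 48, 50, 52, 60, 100, 200, 500]:
    kappas = []
    for _ in range(50):
        s = np.linalg.svd(rng.normal(size=(n, d)), compute_uv=False)
        s = s[s > 1e-10 * s[0]]                        # keep the non-zero singular values
        kappas.append(s[0] / s[-1])
    print(d, round(float(np.mean(kappas)), 1))
```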

3.5 The inclusion of noise and ridge regression

We can obtain additional insight into the estimators by approximating them by the ridge regression estimator $\hat{\boldsymbol{\beta}}_\lambda=(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I}_{d_0})^{-1}\mathbf{X}^T\mathbf{y}=\mathbf{X}^T(\mathbf{X}\mathbf{X}^T+\lambda\mathbf{I}_{n_0})^{-1}\mathbf{y}$. For sequence I, write $\hat{\boldsymbol{\beta}}^{(d)}_{d_0}-\hat{\boldsymbol{\beta}}_\lambda=\mathbf{X}^T\mathbf{L}^{(d)}(\lambda)\mathbf{y}$, where $\mathbf{L}^{(d)}(\lambda)=(\mathbf{X}\mathbf{X}^T+\mathbf{Z}\mathbf{Z}^T)^{-1}(\lambda\mathbf{I}_{n_0}-\mathbf{Z}\mathbf{Z}^T)(\mathbf{X}\mathbf{X}^T+\lambda\mathbf{I}_{n_0})^{-1}$ is an $n_0\times n_0$ matrix. Let $d^{-1}d_0\rightarrow\delta$, with $0\leq\delta<1$, as $d\rightarrow\infty$. Then $(d-d_0)^{-1}\mathbf{Z}\mathbf{Z}^T-\mathbf{I}_{n_0}=O_p((d-d_0)^{-1/2})$ as $d\rightarrow\infty$ and hence $\mathbf{L}^{(d)}(d-d_0)=O_p(d^{-3/2})$, so $\|\hat{\boldsymbol{\beta}}^{(d)}_{d_0}-\hat{\boldsymbol{\beta}}_{d-d_0}\|=O_p(d^{-1})$ and, from above, $\|\hat{\boldsymbol{\beta}}^{(d)}_{-d_0}\|=O(d^{-1/2})$ a.s., so all the elements converge to zero. For sequence II, let $n^{-1}n_0\rightarrow\nu$ with $0\leq\nu<1$. Then write $\hat{\boldsymbol{\beta}}_{(n)}-\hat{\boldsymbol{\beta}}_{(n_0),\lambda}=\mathbf{L}_{(n)}(\lambda)\mathbf{X}^T\mathbf{y}$, where the $d_0\times d_0$ matrix $\mathbf{L}_{(n)}(\lambda)=(\mathbf{X}^T\mathbf{X}+\mathbf{W}^T\mathbf{W})^{-1}\{\lambda\mathbf{I}_{d_0}-\mathbf{W}^T\mathbf{W}\}\{\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I}_{d_0}\}^{-1}$. We have $\mathbf{L}_{(n)}(n-n_0)=O_p(n^{-3/2})$ and $\|\hat{\boldsymbol{\beta}}_{(n)}-\hat{\boldsymbol{\beta}}_{(n_0),n-n_0}\|=O_p(n^{-1/2})$. We gather these results into a theorem.

Theorem 3.4.

Let $\hat{\boldsymbol{\beta}}_\lambda=(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I}_{d_0})^{-1}\mathbf{X}^T\mathbf{y}$ be the ridge regression estimator applied to the data $(\mathbf{y},\mathbf{X})$, and let $\hat{\boldsymbol{\beta}}^{(d)}_{d_0}$ denote the first $d_0$ elements and $\hat{\boldsymbol{\beta}}^{(d)}_{-d_0}$ the remaining $d-d_0$ elements of $\hat{\boldsymbol{\beta}}^{(d)}$. Under sequence I, if $d^{-1}d_0\rightarrow\delta$, with $0\leq\delta<1$, as $d\rightarrow\infty$, then $\|\hat{\boldsymbol{\beta}}^{(d)}_{d_0}-\hat{\boldsymbol{\beta}}_{d-d_0}\|=O_p(d^{-1})$ and $\|\hat{\boldsymbol{\beta}}^{(d)}_{-d_0}\|=O_p(d^{-1/2})$. Under sequence II, if $n^{-1}n_0\rightarrow\nu$ with $0\leq\nu<1$, then $\|\hat{\boldsymbol{\beta}}_{(n)}-\hat{\boldsymbol{\beta}}_{(n_0),n-n_0}\|=O_p(n^{-1/2})$ as $n\rightarrow\infty$.

Related results were given by Kobak et al. (2020) for normalized noise variables. Specifically, they normalized the noise predictors and replaced $\mathbf{Z}$ in the training set $\mathcal{D}^{(d)}$ by $\tilde{\mathbf{Z}}=(d-d_0)^{-1/2}\lambda_z^{1/2}\mathbf{Z}$, and replaced $\mathbf{W}$ in the training set $\mathcal{D}_{(n)}$ by $\tilde{\mathbf{W}}=(n-n_0)^{-1/2}\lambda_w^{1/2}\mathbf{W}$. In the first case, $\mathbf{L}^{(d)}=O_p((d-d_0)^{-1/2})$, so $\|\hat{\boldsymbol{\beta}}^{(d)}_{d_0}-\hat{\boldsymbol{\beta}}_{\lambda_z}\|=O_p((d-d_0)^{-1/2})$; we can also easily obtain the result of Kobak et al. (2020) that $\|\hat{\boldsymbol{\beta}}^{(d)}_{d_0}-\hat{\boldsymbol{\beta}}_{\lambda_z}\|=o(1)$ a.s., although the rates of convergence give additional information. For the remaining elements, $\|\hat{\boldsymbol{\beta}}^{(d)}_{-d_0}\|=O_p(1)$, which is not useful. In the second case, $\mathbf{L}_{(n)}=O_p(n^{-1/2})$ and $\|\hat{\boldsymbol{\beta}}_{(n)}-\hat{\boldsymbol{\beta}}_{\lambda_w}\|=O_p(n^{-1/2})$. As explained in the Introduction, we argue that our results are more interesting and useful than these.
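
A small numerical check of the ridge approximation in Theorem 3.4 for sequence I is sketched below (not the authors' code; sizes are illustrative): the first $d_0$ coefficients of the minimum norm OLS fit with $d-d_0$ unnormalized noise predictors are close to the ridge estimator on $(\mathbf{y},\mathbf{X})$ with ridge parameter $d-d_0$.

```python
# Compare the augmented-data minimum norm OLS fit with ridge(lambda = d - d0) on (y, X).
import numpy as np

rng = np.random.default_rng(9)
n0, d0, d, sigma = 50, 25, 2000, 0.5
beta0 = rng.normal(size=d0)
X = rng.normal(size=(n0, d0))
y = X @ beta0 + sigma * rng.normal(size=n0)
Z = rng.normal(size=(n0, d - d0))                      # noise predictors

beta_minnorm = np.linalg.pinv(np.hstack([X, Z])) @ y   # minimum norm OLS, augmented data
beta_ridge = np.linalg.solve(X.T @ X + (d - d0) * np.eye(d0), X.T @ y)

print(np.linalg.norm(beta_minnorm[:d0] - beta_ridge))  # small; O(1/d) by Theorem 3.4
print(np.linalg.norm(beta_minnorm[d0:]))               # noise coefficients, O(d^{-1/2})
```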

3.6 Negative ridging

In Figure 6, we showed that the optimal value of the ridge penalty in a ridge regression estimator can be negative. Intuitively, this makes sense because including noise in the training set has the same effect as ridge regularisation, so, if we have already overshrunk through including too much noise, the optimal action may be to reduce the shrinkage by using a negative ridge parameter. Theorem 3.5 below, which extends Theorem 3.4 for sequence II, supports this intuition.

Suppose that for sequence II, we set $n_0=n\nu$ with $0<\nu<1$. In Figure 6, this corresponds to making an approximation along the line $n-n_0=n_0\nu^{-1}(1-\nu)$, which passes through the origin and has slope $\nu^{-1}(1-\nu)$. Let $\hat{\boldsymbol{\beta}}_{(n),\lambda}$ denote the ridge regression estimator computed along sequence II. Then write $\hat{\boldsymbol{\beta}}_{(n),n\lambda}-\hat{\boldsymbol{\beta}}_{n(1-\nu+\lambda)}=\mathbf{L}_{(n)}\mathbf{X}^T\mathbf{y}$, where the $d_0\times d_0$ matrix $\mathbf{L}_{(n)}=(\mathbf{X}^T\mathbf{X}+\mathbf{W}^T\mathbf{W}+n\lambda\mathbf{I}_{d_0})^{-1}\{(n-n_0)\mathbf{I}_{d_0}-\mathbf{W}^T\mathbf{W}\}\{\mathbf{X}^T\mathbf{X}+n(1-\nu+\lambda)\mathbf{I}_{d_0}\}^{-1}$. We have $\mathbf{L}_{(n)}=O_p(n^{-3/2})$ and $\|\hat{\boldsymbol{\beta}}_{(n),n\lambda}-\hat{\boldsymbol{\beta}}_{n(1-\nu+\lambda)}\|=O_p(n^{-1/2})$, which proves the following theorem.

Theorem 3.5.

Let $\hat{\boldsymbol{\beta}}_{(n_0),\lambda}=(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I}_{d_0})^{-1}\mathbf{X}^T\mathbf{y}$ be the ridge regression estimator applied to the data $(\mathbf{y},\mathbf{X})$. Under sequence II with $n_0=n\nu$, $\|\hat{\boldsymbol{\beta}}_{(n),n\lambda}-\hat{\boldsymbol{\beta}}_{n(1-\nu+\lambda)}\|=O_p(n^{-1/2})$ as $n\rightarrow\infty$.

The optimal choice of $n(1-\nu+\lambda)$ for the approximating ridge regression estimator $\hat{\boldsymbol{\beta}}_{n(1-\nu+\lambda)}$ is $\|\boldsymbol{\beta}_{0}\|^{-2}d_{0}\sigma^{2}$ (Nakkiran et al., 2020), so the optimal choice of $\lambda$ is $\lambda_{opt}=n^{-1}\|\boldsymbol{\beta}_{0}\|^{-2}d_{0}\sigma^{2}-(1-\nu)$, which can be zero or negative, as we showed in Figure 6.
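To illustrate, the following R sketch (illustrative only, with hypothetical parameter values; not the supplementary code) computes $\hat{\boldsymbol{\beta}}_{(n),n\lambda}$ directly from the expression $(\mathbf{X}^{T}\mathbf{X}+\mathbf{W}^{T}\mathbf{W}+n\lambda\mathbf{I}_{d_{0}})^{-1}\mathbf{X}^{T}\mathbf{y}$ used above, compares it with the approximating ridge estimator $\hat{\boldsymbol{\beta}}_{n(1-\nu+\lambda)}$, and evaluates $\lambda_{opt}$, which is negative for these values.

```r
# Illustrative sketch: ridging along sequence II versus the approximating ridge
# estimator on (y, X), and the implied optimal lambda (negative here).
set.seed(2)
nu <- 0.5; n <- 400; n0 <- n * nu; d0 <- 5; sigma <- 1; lambda <- 0.1
beta0 <- rep(0.3, d0)
X <- matrix(rnorm(n0 * d0), n0, d0)
y <- X %*% beta0 + sigma * rnorm(n0)
W <- matrix(rnorm((n - n0) * d0), n - n0, d0)          # noise observations
b_seqII <- solve(crossprod(X) + crossprod(W) + n * lambda * diag(d0), crossprod(X, y))
b_ridge <- solve(crossprod(X) + n * (1 - nu + lambda) * diag(d0), crossprod(X, y))
max(abs(b_seqII - b_ridge))                            # O_p(n^{-1/2}), as in Theorem 3.5
d0 * sigma^2 / (n * sum(beta0^2)) - (1 - nu)           # lambda_opt, about -0.47 here
```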

There is an analogous result for the normalized noise observations used by Kobak et al. (2020) instead of sequence II. Recall that Kobak et al. (2020) keep $n_{0}$ fixed and replace $\mathbf{W}$ in the training set $\mathcal{D}_{(n)}$ by $\tilde{\mathbf{W}}=(n-n_{0})^{-1/2}\lambda_{w}^{1/2}\mathbf{W}$. Then we can show that $\|\hat{\boldsymbol{\beta}}_{(n),\lambda}-\hat{\boldsymbol{\beta}}_{\lambda+\lambda_{w}}\|=O_{p}(n^{-1/2})$. The optimal ridge parameter in the approximating ridge regression estimator is $\lambda_{opt}=\|\boldsymbol{\beta}_{0}\|^{-2}d_{0}\sigma^{2}-\lambda_{w}$, which can also be zero or negative but does not explicitly depend on $n$. Figure 6 shows empirically that the optimal $\lambda$ should depend on $n$, as our theorem shows.

Theorem 3.5 does not hold for sequence I because the noise predictors are independent (Nakkiran et al., 2020).

4 Data Application

We used publicly available data from the high-density rice array (HDRA, featuring 1,568 accessions and 700,000 SNPs) described by McCouch et al. (2016) to illustrate our results. The genotype (allele frequencies) and phenotype (average grain length) data can be downloaded from the Rice Diversity database (http://www.ricediversity.org). We emphasize that our objective is not to reanalyze the dataset but to show that double descent occurs in high-dimensional real-world data.

We reduced the dataset to 332 accessions by including only those from the indica subpopulation with available phenotypic data. We focused on SNPs located on chromosome-3 because a functional SNP in the GS3 gene on rice chromosome-3 was found to be highly associated with grain length in McCouch et al. (2016). We removed SNP loci from the analysis if the call rate was less than 95% and the minor allele frequency (MAF) was less than 5%. We note that the covariates are non-normal and highly correlated, with many instances of perfect correlation due to linkage disequilibrium, that is, strong correlation between SNPs at adjacent loci (Ardlie et al., 2002; Malo et al., 2008). To reduce the high multicollinearity, we selected every fifth marker, producing a subset of 1,867 markers for our analysis. None of the retained SNPs has more than 5% missing values; we imputed the missing values in each SNP with the mean of its non-missing values. We standardized each SNP to have zero mean and unit standard deviation and centred the response variable (phenotype data) to have zero mean.
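For concreteness, the preprocessing can be sketched in a few lines of R. The sketch below is illustrative only: the object names `geno` (SNP matrix) and `pheno` (average grain length) and the 0/1/2 genotype coding are assumptions, not the script in the supplementary material, and the quality filter is interpreted as keeping SNPs with call rate at least 95% and MAF at least 5%.

```r
# Illustrative preprocessing sketch (assumed objects `geno`, `pheno`; 0/1/2 coding).
call_rate <- colMeans(!is.na(geno))                    # proportion of non-missing calls per SNP
maf <- apply(geno, 2, function(g) {                    # minor allele frequency per SNP
  p <- mean(g, na.rm = TRUE) / 2
  min(p, 1 - p)
})
geno <- geno[, call_rate >= 0.95 & maf >= 0.05]        # quality and MAF filter
geno <- geno[, seq(1, ncol(geno), by = 5)]             # every fifth marker to reduce multicollinearity
geno <- apply(geno, 2, function(g) {                   # mean-impute missing genotypes
  g[is.na(g)] <- mean(g, na.rm = TRUE); g
})
geno <- scale(geno)                                    # zero mean, unit standard deviation
pheno <- pheno - mean(pheno)                           # centre the response
```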

We randomly split the 332 accessions, allocating 300 to the training set and 32 to the test set. Starting from an intercept-only regression model, we added genetic markers sequentially in their natural order and assessed the trends in both training and test errors (as outlined in steps 9 to 12 of Algorithm S1). This process was repeated 100 times, each time with a new random split of the data into training and test sets.
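The following sketch continues the one above and carries out a single random split with one pass over increasing model sizes. It is an illustrative reimplementation (evaluating model sizes in steps of 10 for speed), not Algorithm S1 itself; the least squares fit uses the Moore-Penrose pseudoinverse, which gives the minimum-norm solution in the overparameterized regime.

```r
# One train/test split and a sweep from under- to overparameterized fits (illustrative).
set.seed(3)
idx <- sample(nrow(geno), 300)                         # 300 training, 32 test accessions
X_tr <- geno[idx, ];  y_tr <- pheno[idx]
X_te <- geno[-idx, ]; y_te <- pheno[-idx]
sizes <- seq(10, ncol(geno), by = 10)
rmse_test <- sapply(sizes, function(d) {
  bhat <- MASS::ginv(X_tr[, 1:d, drop = FALSE]) %*% y_tr   # (minimum-norm) least squares
  sqrt(mean((y_te - X_te[, 1:d, drop = FALSE] %*% bhat)^2))
})
plot(sizes, rmse_test, type = "l", xlab = "d", ylab = "Test RMSE")
```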

The test error in Figure 7 shows double descent, reaching its minimum in the overparameterized regime. The noticeable decrease in test error just after $d=800$ corresponds to the inclusion of a SNP previously identified as significantly associated with grain length.

[Image: applicationGWASplot.png]

Figure 7: Root mean squared error (RMSE) from regression analysis predicting average grain length in Indica rice using selected genetic markers on chromosome-3.

5 Discussion

Most datasets are assumed to include noise predictors; this is encapsulated in the widely used assumption of sparsity and underlies the use of model selection methods that try to identify the important predictors. Datasets may also include noise observations. Our study of the effect of including noise predictors and noise observations when fitting linear regression models shows that both forms of noise induce shrinkage in a similar way to ridge regression estimators. We develop the consequences of this shrinkage due to noise to provide simple, intuitive explanations for double descent (in terms of the asymptotes of the test error and the “condition number anomaly”) and negative ridging (in terms of “double shrinkage”). These explanations also apply to shrinkage due to measurement error (Webb, 1994; Bishop, 1995; Hastie et al., 2022).

Moreover, the fact that double descent occurs both in the overparameterized regime (when fitting noise predictors) and in the underparameterized regime (when fitting noise observations) shows that it is not directly driven by overparameterization (which occurs only in the first case), but rather is driven by the shrinkage (which occurs in both cases) due to fitting noise.

Our results raise the question of whether, under sparsity, it is better to select a model (often in the underparameterized regime) or to fit a (regularized) model in the overparameterized regime. Insofar as lasso regression (Tibshirani, 1996) represents the first approach, the question can be formulated as: should we prefer lasso or ridge regression? The lasso's ability to set some components of $\hat{\boldsymbol{\beta}}$ to zero may be less of an advantage than has hitherto been thought. Finally, beyond potentially better predictive performance, the fact that shrinkage provides a way to interpret overparameterized models may help to reduce or even overcome the claimed interpretability advantages of simple models. Addressing these questions and extending our results to other models, such as neural networks, are potential avenues for future research.


SUPPLEMENTARY MATERIAL

Supplementary material:

The supplementary materials include detailed algorithms for producing the figures, additional figures, technical lemmas with their proofs, and the proofs of the theorems presented in the main text. (pdf file)

R code:

The zipped folder contains R scripts for the exact replication of all analyses and figures from the main text and supplementary materials. (zipped file)

High-density rice array dataset:

The dataset used to illustrate the method is publicly available at http://www.ricediversity.org. The accompanying zipped file includes the R code to download the data and reproduce the analysis presented in Section 4.

References

  • Ardlie et al. (2002) Ardlie, K. G., Kruglyak, L. & Seielstad, M. (2002), ‘Patterns of linkage disequilibrium in the human genome’, Nature Reviews Genetics 3(4), 299–309.
  • Bartlett et al. (2020) Bartlett, P. L., Long, P. M., Lugosi, G. & Tsigler, A. (2020), ‘Benign overfitting in linear regression’, Proceedings of the National Academy of Sciences 117(48), 30063–30070.
  • Belkin et al. (2019) Belkin, M., Hsu, D., Ma, S. & Mandal, S. (2019), ‘Reconciling modern machine-learning practice and the classical bias–variance trade-off’, Proceedings of the National Academy of Sciences 116(32), 15849–15854.
  • Belkin et al. (2020) Belkin, M., Hsu, D. & Xu, J. (2020), ‘Two models of double descent for weak features’, SIAM Journal on Mathematics of Data Science 2(4), 1167–1180.
  • Bishop (1995) Bishop, C. M. (1995), ‘Training with noise is equivalent to Tikhonov regularization’, Neural Computation 7(1), 108–116.
  • Breiman & Freedman (1983) Breiman, L. & Freedman, D. (1983), ‘How many variables should be entered in a regression equation’, Journal of the American Statistical Association 78, 131–136.
  • Dax (2022) Dax, A. (2022), ‘The smallest singular value anomaly and the condition number anomaly’, Axioms 11(3), 99.
  • Deng et al. (2022) Deng, Z., Kammoun, A. & Thrampoulidis, C. (2022), ‘A model of double descent for high-dimensional binary linear classification’, Information and Inference: A Journal of the IMA 11(2), 435–495.
  • Dobriban & Wager (2018) Dobriban, E. & Wager, S. (2018), ‘High-dimensional asymptotics of prediction: Ridge regression and classification’, The Annals of Statistics 46(1), 247–279.
  • Geiger et al. (2019) Geiger, M., Spigler, S., d’Ascoli, S., Sagun, L., Baity-Jesi, M., Biroli, G. & Wyart, M. (2019), ‘Jamming transition as a paradigm to understand the loss landscape of deep neural networks’, Physical Review E 100(1), 012115.
  • Hastie et al. (2022) Hastie, T., Montanari, A., Rosset, S. & Tibshirani, R. J. (2022), ‘Surprises in high-dimensional ridgeless least squares interpolation’, The Annals of Statistics 50(2), 949–986.
  • Hellkvist et al. (2023) Hellkvist, M., Özçelikkale, A. & Ahlén, A. (2023), ‘Estimation under model misspecification with fake features’, IEEE Transactions on Signal Processing 71, 47–60.
  • Hoerl & Kennard (1970) Hoerl, A. E. & Kennard, R. W. (1970), ‘Ridge regression: Biased estimation for nonorthogonal problems’, Technometrics 12(1), 55–67.
  • Horn & Johnson (2012) Horn, R. A. & Johnson, C. R. (2012), Matrix Analysis, Cambridge University Press.
  • Kobak et al. (2020) Kobak, D., Lomond, J. & Sanchez, B. (2020), ‘The optimal ridge penalty for real-world high-dimensional data can be zero or negative due to the implicit ridge regularization’, Journal of Machine Learning Research 21(169).
  • Liang & Rakhlin (2020) Liang, T. & Rakhlin, A. (2020), ‘Just interpolate: Kernel “ridgeless” regression can generalize’, The Annals of Statistics 48(3), 1329–1347.
  • Liang et al. (2020) Liang, T., Rakhlin, A. & Zhai, X. (2020), On the multiple descent of minimum-norm interpolants and restricted lower isometry of kernels, in ‘Conference on Learning Theory’, PMLR, pp. 2683–2711.
  • Malo et al. (2008) Malo, N., Libiger, O. & Schork, N. J. (2008), ‘Accommodating linkage disequilibrium in genetic-association analyses via ridge regression’, The American Journal of Human Genetics 82(2), 375–385.
  • McCouch et al. (2016) McCouch, S. R., Wright, M. H., Tung, C.-W., Maron, L. G., McNally, K. L., Fitzgerald, M., Singh, N., DeClerck, G., Agosto-Perez, F., Korniliev, P. et al. (2016), ‘Open access resources for genome-wide association mapping in rice’, Nature Communications 7, 10532.
  • Mitra (2019) Mitra, P. P. (2019), ‘Understanding overfitting peaks in generalization error: Analytical risk curves for $l_{2}$ and $l_{1}$ penalized interpolation’, arXiv preprint arXiv:1906.03667.
  • Muthukumar et al. (2020) Muthukumar, V., Vodrahalli, K., Subramanian, V. & Sahai, A. (2020), ‘Harmless interpolation of noisy data in regression’, IEEE Journal on Selected Areas in Information Theory 1(1), 67–83.
  • Nakkiran (2019) Nakkiran, P. (2019), ‘More data can hurt for linear regression: Sample-wise double descent’, arXiv preprint arXiv:1912.07242 .
  • Nakkiran et al. (2021) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B. & Sutskever, I. (2021), ‘Deep double descent: Where bigger models and more data hurt’, Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003.
  • Nakkiran et al. (2020) Nakkiran, P., Venkat, P., Kakade, S. & Ma, T. (2020), ‘Optimal regularization can mitigate double descent’, arXiv preprint arXiv:2003.01897 .
  • Rocks & Mehta (2022) Rocks, J. W. & Mehta, P. (2022), ‘Memorizing without overfitting: Bias, variance, and interpolation in overparameterized models’, Physical Review Research 4(1), 013201.
  • Spigler et al. (2019) Spigler, S., Geiger, M., d’Ascoli, S., Sagun, L., Biroli, G. & Wyart, M. (2019), ‘A jamming transition from under-to over-parametrization affects generalization in deep learning’, Journal of Physics A: Mathematical and Theoretical 52(47), 474001.
  • Tibshirani (1996) Tibshirani, R. (1996), ‘Regression shrinkage and selection via the lasso’, Journal of the Royal Statistical Society: Series B (Methodological) 58(1), 267–288.
  • Tsigler & Bartlett (2023) Tsigler, A. & Bartlett, P. L. (2023), ‘Benign overfitting in ridge regression’, Journal of Machine Learning Research 24(123).
  • Webb (1994) Webb, A. R. (1994), ‘Functional approximation by feed-forward networks: a least-squares approach to generalization’, IEEE Transactions on Neural Networks 5(3), 363–371.
  • Zhang et al. (2021) Zhang, C., Bengio, S., Hardt, M., Recht, B. & Vinyals, O. (2021), ‘Understanding deep learning (still) requires rethinking generalization’, Communications of the ACM 64(3), 107–115.
  • Zielke (1988) Zielke, G. (1988), ‘Some remarks on matrix norms, condition numbers, and error estimates for linear equations’, Linear Algebra and its Applications 110, 29–41.