Generalization Error of Generalized Linear Models in High Dimensions
Abstract
At the heart of machine learning lies the question of generalizability of learned rules over previously unseen data. While over-parameterized models based on neural networks are now ubiquitous in machine learning applications, our understanding of their generalization capabilities is incomplete. This task is made harder by the non-convexity of the underlying learning problems. We provide a general framework to characterize the asymptotic generalization error for single-layer neural networks (i.e., generalized linear models) with arbitrary non-linearities, making it applicable to regression as well as classification problems. This framework enables analyzing the effect of (i) over-parameterization and non-linearity during modeling; and (ii) choices of loss function, initialization, and regularizer during learning. Our model also captures mismatch between training and test distributions. As examples, we analyze a few special cases, namely linear regression and logistic regression. We are also able to rigorously and analytically explain the double descent phenomenon in generalized linear models.
1 Introduction
A fundamental goal of machine learning is generalization: the ability to draw inferences about unseen data from finite training examples. Methods to quantify the generalization error are therefore critical in assessing the performance of any machine learning approach.
This paper seeks to characterize the generalization error for a class of generalized linear models (GLMs) of the form
(1)   y = φ(⟨x, w⁰⟩, d),
where x ∈ ℝᵖ is a vector of input features, y is a scalar output, w⁰ ∈ ℝᵖ are the weights to be learned, φ is a known link function, and d is random noise. The notation ⟨x, w⁰⟩ denotes an inner product. We use the superscript “0” to denote the “true” values in contrast to estimated or postulated quantities. The output y may be continuous or discrete to model either regression or classification problems.
We measure the generalization error in a standard manner: we are given training data {(x_n, y_n)}, n = 1, …, N, from which we learn some parameter estimate ŵ via a regularized empirical risk minimization of the form
(2)   ŵ := argmin_w F_out(y, Xw) + F_in(w),
where y = (y_1, …, y_N), X ∈ ℝ^{N×p} is the data matrix, F_out is some output loss function, and F_in is some regularizer on the weights. We are then given a new test sample, x_ts, for which the true and predicted values are given by
(3)   y_ts = φ(⟨x_ts, w⁰⟩, d_ts),   ŷ_ts = φ̂(⟨x_ts, ŵ⟩),
where d_ts is the noise in the test sample, and φ̂ is a postulated inverse link function that may be different from the true function φ. The generalization error is then defined as the expectation of some loss between y_ts and ŷ_ts of the form
(4)   𝔼 f_ts(y_ts, ŷ_ts),
for some test loss function f_ts, such as squared error or prediction error.
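To make (1)–(4) concrete, the following sketch estimates the generalization error by Monte Carlo for the simplest instance of the model: a linear link with additive Gaussian noise, an L2 regularizer, and squared test loss. All dimensions, scalings, and the closed-form ridge solution below are illustrative choices, not the experimental setup used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, N, N_test, noise_std, lam = 100, 200, 10_000, 0.1, 1e-3

# True weights; scaled so that <x, w0> has O(1) variance for standard Gaussian x.
w0 = rng.normal(size=p) / np.sqrt(p)

# Training data from the GLM (1) with the linear link phi(z, d) = z + d.
X = rng.normal(size=(N, p))
y = X @ w0 + noise_std * rng.normal(size=N)

# Regularized empirical risk minimization (2) with squared loss and L2 penalty
# has the closed-form ridge solution (X^T X + lam I)^{-1} X^T y.
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Monte Carlo estimate of the generalization error (4) with squared test loss f_ts.
X_ts = rng.normal(size=(N_test, p))
y_ts = X_ts @ w0 + noise_std * rng.normal(size=N_test)
y_hat = X_ts @ w_hat
print("estimated generalization error:", np.mean((y_ts - y_hat) ** 2))
```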
Even for this relatively simple GLM model, the behavior of the generalization error is not fully understood. Recent works (Montanari et al., 2019; Deng et al., 2019; Mei & Montanari, 2019; Salehi et al., 2019) have characterized the generalization error of various linear models for classification and regression in certain large random problem instances. Specifically, the number of samples N and the number of features p both grow without bound with their ratio p/N converging to a constant β > 0, and the samples in the training data are drawn randomly. In this limit, the generalization error can be exactly computed. The analysis can explain the so-called double descent phenomenon (Belkin et al., 2019a): in highly under-regularized settings, the test error may initially increase with the number of data samples before decreasing. See the prior work section below for more details.
Summary of Contributions.
Our main result (Theorem 1) provides a procedure for exactly computing the asymptotic value of the generalization error (4) for GLM models in a certain random high-dimensional regime called the Large System Limit (LSL). The procedure relates the generalization error to key problem parameters including the sampling ratio β = p/N, the regularizer, the output function, and the distributions of the true weights and noise. Importantly, our result holds under very general settings including: (i) arbitrary test metrics f_ts; (ii) arbitrary training loss functions F_out as well as decomposable regularizers F_in; (iii) arbitrary link functions φ; (iv) correlated covariates; (v) under-parameterized (β < 1) and over-parameterized (β > 1) regimes; and (vi) distributional mismatch in training and test data. Section 4 discusses in detail the general assumptions on these quantities under which Theorem 1 holds.
Prior Work.
Many recent works characterize the generalization error of various machine learning models, including special cases of the GLM model considered here. For example, a precise characterization of the asymptotic prediction error for least squares regression has been provided in (Belkin et al., 2019b; Hastie et al., 2019; Muthukumar et al., 2019). The former confirmed the double descent curve of (Belkin et al., 2019a) under a Fourier series model and a noisy Gaussian model for data in the over-parameterized regime. The latter obtained similar behavior under both linear and non-linear feature models for ridge regression and min-norm least squares using random matrix theory. Also, (Advani & Saxe, 2017) studied the same setting for deep linear and shallow non-linear networks.
The generalization error of max-margin linear classifiers in the high-dimensional regime was analyzed in (Montanari et al., 2019), where an exact expression for the asymptotic prediction error is derived and, in the specific case of a two-layer neural network with random first-layer weights, the double descent curve is obtained. A similar double descent curve for logistic regression as well as linear discriminant analysis has been reported by (Deng et al., 2019). Random feature learning in the same setting has also been studied for ridge regression in (Mei & Montanari, 2019). The authors show, in particular, that highly over-parametrized estimators with zero training error are statistically optimal at high signal-to-noise ratio (SNR). The asymptotic performance of regularized logistic regression in high dimensions is studied in (Salehi et al., 2019) using the Convex Gaussian Min-max Theorem in the under-parametrized regime. The results in the current paper include all these models as special cases. Bounds on the generalization error of over-parametrized linear models are also given in (Bartlett et al., 2019; Neyshabur et al., 2018).
Although this paper and several other recent works consider only simple linear models and GLMs, much of the motivation is to understand generalization in deep neural networks, where classical intuition may not hold (Belkin et al., 2018; Zhang et al., 2016; Neyshabur et al., 2018). In particular, a number of recent papers have shown the connection between neural networks in the over-parametrized regime and kernel methods. The works (Daniely, 2017; Daniely et al., 2016) showed that gradient descent on over-parametrized neural networks learns a function in the RKHS corresponding to the random feature kernel. Training dynamics of over-parametrized neural networks have been studied by (Jacot et al., 2018; Du et al., 2018; Arora et al., 2019; Allen-Zhu et al., 2019), where it is shown that the learned function lies in an RKHS corresponding to the neural tangent kernel.
Approximate Message Passing.
Our key tool to study the generalization error is approximate message passing (AMP), a class of inference algorithms originally developed in (Donoho et al., 2009, 2010; Bayati & Montanari, 2011) for compressed sensing. We show that the learning problem for the GLM can be formulated as an inference problem on a certain multi-layer network. Multi-layer AMP methods (He et al., 2017; Manoel et al., 2018; Fletcher et al., 2018; Pandit et al., 2019) can then be applied to perform the inference. The specific algorithm we use in this work is the multi-layer vector AMP (ML-VAMP) algorithm of (Fletcher et al., 2018; Pandit et al., 2019) which itself builds on several works (Opper & Winther, 2005; Fletcher et al., 2016; Rangan et al., 2019; Cakmak et al., 2014; Ma & Ping, 2017). The ML-VAMP algorithm is not necessarily the most computationally efficient procedure for the minimization (2). For our purposes, the key property is that ML-VAMP enables exact predictions of its performance in the large system limit. Specifically, the error of the algorithm estimates in each iteration can be predicted by a set of deterministic recursive equations called the state evolution or SE. The fixed points of these equations provide a way of computing the asymptotic performance of the algorithm. In certain cases, the algorithm can be proven to be Bayes optimal (Reeves, 2017; Gabrié et al., 2018; Barbier et al., 2019).
This approach of using AMP methods to characterize the generalization error of GLMs was also explored in (Barbier et al., 2019) for i.i.d. distributions on the data. Explicit formulae for the asymptotic mean squared error of regularized linear regression with rotationally invariant data matrices are proved in (Gerbelot et al., 2020). The ML-VAMP method in this work enables extensions to correlated features and to mismatch between training and test distributions.
2 Generalization Error: System Model
We consider the problem of estimating the weights w⁰ in the GLM model (1). As stated in the Introduction, we suppose we have training data {(x_n, y_n)}, n = 1, …, N, arranged as y = (y_1, …, y_N) and X = (x_1, …, x_N)ᵀ. Then we can write
(5)   y = φ(Xw⁰, d),
where φ is the vector-valued function acting componentwise, i.e., [φ(z, d)]_n = φ(z_n, d_n), and d = (d_1, …, d_N) is general noise.
Given the training data (X, y), we consider estimates ŵ of w⁰ given by a regularized empirical risk minimization of the form (2). We assume that the loss function F_out and regularizer F_in are separable, i.e., one can write
(6)   F_out(y, z) = Σ_{n=1}^N f_out(y_n, z_n),   F_in(w) = Σ_{j=1}^p f_in(w_j),
for some scalar functions f_out and f_in. Many standard optimization problems in machine learning can be written in this form, including logistic regression, support vector machines, linear regression, and Poisson regression.
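For illustration, the snippet below spells out the separable structure (6) for one concrete choice: a scalar logistic (cross-entropy) output loss and a scalar L2 penalty. The function names f_out, f_in, F_out, F_in mirror the notation used here, but the specific scalar forms are only examples.

```python
import numpy as np

def f_out(y_n, z_n):
    """Scalar logistic (cross-entropy) loss for one sample, with label y_n in {0, 1}."""
    # Numerically stable evaluation of log(1 + exp(z_n)) - y_n * z_n.
    return np.log1p(np.exp(-abs(z_n))) + max(z_n, 0.0) - y_n * z_n

def f_in(w_j, lam=1e-2):
    """Scalar L2 penalty on a single weight."""
    return 0.5 * lam * w_j ** 2

def F_out(y, z):
    """Separable output loss F_out(y, z) = sum_n f_out(y_n, z_n)."""
    return sum(f_out(y_n, z_n) for y_n, z_n in zip(y, z))

def F_in(w, lam=1e-2):
    """Separable regularizer F_in(w) = sum_j f_in(w_j)."""
    return sum(f_in(w_j, lam) for w_j in w)
```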
Large System Limit: We follow the LSL analysis of (Bayati & Montanari, 2011) commonly used for analyzing AMP-based methods. Specifically, we consider a sequence of problems indexed by the number of training samples N. For each N, we suppose that the number of features p = p(N) grows linearly with N, i.e.,
(7)   lim_{N→∞} p(N)/N = β,
for some constant β > 0. Note that β > 1 corresponds to the over-parameterized regime and β < 1 corresponds to the under-parameterized regime.
True parameter: We assume the true weight vector w⁰ has components whose empirical distribution converges as
(8)   {w⁰_j, j = 1, …, p}  →  W⁰   (PL(2)),
for some limiting random variable W⁰. The precise definition of empirical convergence is given in Appendix A. It means that the empirical distribution of the components converges, in the Wasserstein-2 metric (see Chap. 6 of (Villani, 2008)), to the distribution of the finite-variance random variable W⁰. Importantly, the limit (8) will hold if the components w⁰_j are drawn i.i.d. from the distribution of W⁰ with 𝔼(W⁰)² < ∞. However, as discussed in Appendix A, the convergence can also be satisfied by correlated sequences and deterministic sequences.
Training data input: For each , we assume that the training input data samples, , , are i.i.d. and drawn from a -dimensional Gaussian distribution with zero mean and covariance . The covariance can capture the effect of features being correlated. We assume the covariance matrix has an eigenvalue decomposition,
(9) |
where the diagonal factor contains the eigenvalues of the covariance and the orthogonal matrix contains its eigenvectors. The scaling ensures that the total variance of the samples does not grow with the dimension. We will place a certain random model on the eigenvalues and eigenvectors momentarily.
Using the covariance (9), we can write the data matrix as
(10) |
where has entries drawn i.i.d. from . For the purpose of analysis, it is useful to express the matrix in terms of its SVD:
(11) |
where and are orthogonal and with non-zero entries only along the principal diagonal. are the singular values of . A standard result of random matrix theory is that, since is i.i.d. Gaussian with entries , the matrices and are Haar-distributed on the group of orthogonal matrices and is such that
(12) |
where the limit is a non-negative random variable that follows a rescaled Marchenko-Pastur distribution. Details on this distribution are in Appendix H.
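The following sketch illustrates this limiting behavior numerically for an i.i.d. Gaussian matrix: the squared singular values concentrate on the Marchenko-Pastur support. The 1/√N entry scaling is an illustrative choice and does not reproduce the exact normalization used in Appendix H.

```python
import numpy as np

rng = np.random.default_rng(1)
N, beta = 2000, 0.5                      # beta = p / N, under-parameterized here
p = int(beta * N)

# i.i.d. Gaussian matrix with entries of variance 1/N (illustrative scaling).
U = rng.normal(size=(N, p)) / np.sqrt(N)
evals = np.linalg.eigvalsh(U.T @ U)      # squared singular values of U

# Marchenko-Pastur support edges for ratio beta: [(1 - sqrt(beta))^2, (1 + sqrt(beta))^2].
lam_minus, lam_plus = (1 - np.sqrt(beta)) ** 2, (1 + np.sqrt(beta)) ** 2
print("empirical range:", evals.min(), evals.max())
print("MP support     :", lam_minus, lam_plus)
print("mean eigenvalue:", evals.mean(), "(MP mean = 1)")
```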
Training data output: Given the input data , we assume that the training outputs are generated from (5), where the noise is independent of and has an empirical distribution which converges as
(13) |
Again, the limit (13) will be satisfied if the noise components are i.i.d. draws of a random variable with bounded second moment.
Test data: To measure the generalization error, we assume now that we are given a test point , and we obtain the true output and predicted output given by (3). We assume that the test data inputs are also Gaussian, i.e.,
(14) |
where has i.i.d. Gaussian components, , and and are the eigenvalues and eigenvectors of the test data covariance matrix. That is, the test data sample has a covariance matrix
(15) |
In comparison to (9), we see that we are assuming that the eigenvectors of the training and test data are the same, but the eigenvalues may be different. In this way, we can capture distributional mismatch between the training and test data. For example, we will be able to measure the generalization error when the test sample is outside a subspace explored by the training data.
To capture the relation between the training and test distributions, we assume that components of and converge as
(16) |
to some non-negative, bounded random vector . The joint distribution on captures the relation between the training and test data.
When the training and test eigenvalues coincide, our model corresponds to the case when the training and test distributions are matched. Isotropic Gaussian features in both training and test data correspond to scaled identity covariance matrices. We also require that the matrix of eigenvectors is uniformly distributed on the set of orthogonal matrices.
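A minimal sketch of this train-test model: the training and test covariances share a Haar-distributed eigenvector matrix but have different eigenvalues (lognormal here, as in the experiments of Section 5). The 1/p normalization is an assumed choice to keep the total variance of a sample O(1).

```python
import numpy as np
from scipy.stats import ortho_group

rng = np.random.default_rng(2)
p = 50

# Shared Haar-distributed eigenvector matrix V for training and test covariances.
V = ortho_group.rvs(p, random_state=2)

# Different (independent lognormal) eigenvalues for training and test; the 1/p
# factor keeps the expected squared norm of a sample O(1).
s_tr = rng.lognormal(mean=0.0, sigma=0.5, size=p) / p
s_ts = rng.lognormal(mean=0.0, sigma=0.5, size=p) / p

Sigma_tr = V @ np.diag(s_tr) @ V.T       # training covariance, as in (9)
Sigma_ts = V @ np.diag(s_ts) @ V.T       # test covariance, as in (15): same V, new eigenvalues

# Draw one training sample and one test sample with the respective covariances.
x_tr = V @ (np.sqrt(s_tr) * rng.normal(size=p))
x_ts = V @ (np.sqrt(s_ts) * rng.normal(size=p))
```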
Generalization error: From the training data, we obtain an estimate ŵ via the regularized empirical risk minimization (2). Given a test sample x_ts and parameter estimate ŵ, the true output y_ts and predicted output ŷ_ts are given by equation (3). We assume the test noise d_ts follows the same distribution as the training noise. The postulated inverse-link function φ̂ in (3) may be different from the true inverse-link function φ.
The generalization error is defined as the asymptotic expected loss,
(17)   lim_{N→∞} 𝔼 f_ts(y_ts, ŷ_ts),
where f_ts is some loss function relevant for the test error (which may be different from the training loss). The expectation in (17) is with respect to the randomness in the training as well as test data, and the noise. Our main result provides a formula for the generalization error (17).
3 Learning GLMs via ML-VAMP
There are many methods for solving the minimization problem (2). We apply the ML-VAMP algorithm of (Fletcher et al., 2018; Pandit et al., 2019). This algorithm is not necessarily the most computationally efficient method. For our purposes, however, the algorithm serves as a constructive proof technique, i.e., it enables exact predictions for the generalization error in the LSL described above. Moreover, when the objective in (2) is strictly convex, the problem has a unique global minimum, so the generalization error of this minimum is agnostic to the choice of algorithm used to find it. To that end, we start by reformulating (2) in a form that is amenable to the application of ML-VAMP, Algorithm 1.
Multi-Layer Representation.
The first step in applying ML-VAMP to the GLM learning problem is to represent the mapping from the true parameters to the output as a certain multi-layer network. We combine (5), (10) and (11), so that the mapping can be written as the following sequence of operations (as illustrated in Fig. 1):
(18) | |||||
where are the following vectors:
(19) |
and the functions are given by
(20) | ||||
We see from Fig. 1 that the mapping of true parameters to the observed response vector is described by a multi-layer network of alternating orthogonal operators and non-linear functions . Let denote the number of layers in this multi-layer network.
The minimization (2) can also be represented using a similar signal flow graph. Given a parameter candidate , the mapping can be written using the sequence of vectors
(21) | ||||||
There are steps in this sequence, and we let
denote the sets of vectors across the steps. The minimization in (2) can then be written in the following equivalent form:
(22) | ||||
where the penalty functions are defined as
(23) | ||||||
where is on the set , and on .
ML-VAMP for GLM Learning.
Using this multi-layer representation, we can now apply the ML-VAMP algorithm from (Fletcher et al., 2018; Pandit et al., 2019) to solve the optimization (22). The steps are shown in Algorithm 1. These steps are a special case of the “MAP version” of ML-VAMP in (Pandit et al., 2019), but with a slightly different set-up for the GLM problem. We will call these steps the ML-VAMP GLM Learning Algorithm.
The algorithm operates in a set of iterations indexed by . In each iteration, a “forward pass” through the layers generates estimates for the hidden variables , while a “backward pass” generates estimates for the variables . In each step, the estimates and are produced by functions and called estimators or denoisers.
For the MAP version of ML-VAMP algorithm in (Pandit et al., 2019), the denoisers are essentially proximal-type operators defined as
(24) |
An important property of the proximal operator is that, for separable functions of the form (6), it acts componentwise.
In the case of the GLM model, for and , on lines 7 and 19, the denoisers are proximal operators given by
(25a) | ||||
(25b) |
Note that in (25b), there is a dependence on through the term . For the middle terms, , i.e., lines 9 and 21, the denoisers are given by
(26a) | |||
(26b) |
where are the solutions to the minimization
(27) |
The quantity on lines 11 and 23 denotes the empirical mean .
Thus, the ML-VAMP algorithm in Algorithm 1 reduces the joint constrained minimization (22) over variables and to a set of proximal operations on pairs of variables . As discussed in (Pandit et al., 2019), this type of minimization is similar to ADMM with adaptive step-sizes. Details of the denoisers and other aspects of the algorithm are given in Appendix B.
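To give a feel for the proximal operations underlying the denoisers, the sketch below implements two generic scalar proximal maps, for an L2 penalty and a squared output loss, in the standard form argmin_u f(u) + (γ/2)(u − v)². The parameterization (a single weight γ) is a generic choice and does not reproduce how the quantities in (25a)-(25b) enter Algorithm 1; the point is only that separable penalties yield componentwise proximal updates.

```python
import numpy as np

def prox_l2(v, gamma, lam=1e-2):
    """prox of f(w) = (lam/2) w^2: argmin_w (lam/2) w^2 + (gamma/2)(w - v)^2."""
    return gamma * v / (lam + gamma)

def prox_square_loss(y, v, gamma):
    """prox of f(z) = (1/2)(y - z)^2: argmin_z (1/2)(y - z)^2 + (gamma/2)(z - v)^2."""
    return (y + gamma * v) / (1.0 + gamma)

# For separable penalties, the vector proximal operator acts componentwise, so a
# single scalar formula handles every entry of the vector at once.
v = np.array([1.0, -2.0, 0.5])
print(prox_l2(v, gamma=1.0))                                   # componentwise L2 prox
print(prox_square_loss(np.array([0.0, 1.0, 2.0]), v, gamma=1.0))
```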
4 Main Result
We make two assumptions. The first assumption imposes certain regularity conditions on the functions , , , and maps appearing in Algorithm 1. The precise definitions of pseudo-Lipschitz continuity and uniform Lipschitz continuity are given in Appendix A of the supplementary material.
Assumption 1.
The denoisers and link functions satisfy the following continuity conditions:
- (a)
- (b) The link function φ is Lipschitz continuous. The test error function f_ts is pseudo-Lipschitz continuous of order 2.
Our second assumption is that the ML-VAMP algorithm converges. Specifically, let be any set of outputs of Algorithm 1, at some iteration and dimension . For example, could be or for some , or a concatenation of signals such as .
Assumption 2.
Let be any finite set of outputs of the ML-VAMP algorithm as above. Then there exist limits
(28) |
satisfying
(29) |
We are now ready to state our main result.
Theorem 1.
Consider the GLM learning problem (2) solved by applying Algorithm 1 to the equivalent problem (22) under the assumptions of Section 2 along with Assumptions 1 and 2. Then, there exist constants and such that the following hold:
- (a)
- (b) The true parameter and its estimate empirically converge as
(30) where is the random variable from (8) and
(31) with independent of .
- (c)
Part (a) shows that, similar to gradient descent, Algorithm 1 finds the stationary points of problem (2). These stationary points are unique in strictly convex problems such as linear and logistic regression. Thus, in such cases, the same results hold for any algorithm that finds such stationary points. Hence, the fact that we are using ML-VAMP is immaterial – our results apply to any solver for (2). Note that convergence to the fixed points is assumed in Assumption 2.
Part (b) provides an exact description of the asymptotic statistical relation between the true parameter and its estimate . The parameters and can be explicitly computed using a set of recursive equations called the state evolution or SE described in Appendix C in the supplementary material.
We can use the expressions to compute a variety of relevant metrics. For example, the convergence shows that the MSE on the parameter estimate is
(33) |
The expectation on the right hand side of (33) can then be computed via integration over the joint density of from part (b). In this way, we have a simple and exact method to compute the parameter error. Other metrics such as parameter bias or variance, cosine angle or sparsity detection can also be computed.
Part (c) of Theorem 1 similarly characterizes the asymptotic generalization error exactly. In this case, we compute the expectation over the three limiting variables. In this way, we have provided a methodology for exactly predicting the generalization error from the key parameters of the problem, such as the sampling ratio β, the regularizer, the output function, and the distributions of the true weights and noise. We provide several examples, including linear regression, logistic regression, and SVM, in Appendix G, where we also recover the result of (Hastie et al., 2019).
Remarks on Assumptions.
Note that Assumption 1 is satisfied in many practical cases. For example, it can be verified that it is satisfied in the case when and are convex. Assumption 2 is somewhat more restrictive in that it requires that the ML-VAMP algorithm converges. The convergence properties of ML-VAMP are discussed in (Fletcher et al., 2016). The ML-VAMP algorithm may not always converge, and characterizing conditions under which convergence is possible is an open question. However, experiments in (Rangan et al., 2019) show that the algorithm does indeed often converge, and in these cases, our analysis applies. In any case, we will see below that the predictions from Theorem 1 agree closely with numerical experiments in several relevant cases.
5 Experiments
Training and Test Distributions.
We validate our theoretical results on a number of synthetic data experiments. For all the experiments, the training and test data is generated following the model in Section 2. We generate the training and test eigenvalues as i.i.d. with lognormal distributions,
where are bivariate zero-mean Gaussian with
In the case when , we obtain eigenvalues that are equal, corresponding to the i.i.d. case. With we can model correlated features. Also, when the correlation coefficient , , so there is no training and test mismatch. However, we can also select to experiment with cases when the training and test distributions differ. In the examples below, we consider the following three cases:
- (1) i.i.d. features ();
- (2) correlated features with matching training and test distributions ( dB, ); and
- (3) correlated features with train-test mismatch ( dB, ).
For all experiments below, the true model coefficients are generated as i.i.d. Gaussian and we use standard L2-regularization with some regularization parameter. Our framework can incorporate arbitrary i.i.d. distributions on the coefficients and arbitrary regularizers, but we illustrate just the Gaussian case with L2-regularization here.
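A sketch of how correlated, possibly mismatched train/test eigenvalues can be generated as jointly lognormal pairs. The parameter values, the dB-to-log-scale conversion, and the normalization to an average eigenvalue of 1/p are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
p, sigma_db, rho = 1000, 3.0, 0.5      # illustrative values; rho = 1 gives matched train/test

# Bivariate zero-mean Gaussian (z_tr, z_ts) with correlation rho, exponentiated to
# give jointly lognormal training/test eigenvalues.
sig = np.log(10.0) * sigma_db / 10.0   # assumed dB-to-natural-log conversion
cov = sig ** 2 * np.array([[1.0, rho], [rho, 1.0]])
z = rng.multivariate_normal(np.zeros(2), cov, size=p)
s_tr, s_ts = np.exp(z[:, 0]), np.exp(z[:, 1])

# Normalize so the average eigenvalue is 1/p, keeping the total variance O(1).
s_tr /= p * s_tr.mean()
s_ts /= p * s_ts.mean()
print("train/test eigenvalue correlation:", np.corrcoef(s_tr, s_ts)[0, 1])
```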
Under-regularized linear regression.
We first consider the case of under-regularized linear regression, where the output channel is linear with additive Gaussian noise. The noise variance is set for an SNR level of 10 dB. We use a standard mean-squared error (MSE) output loss. Since we are using the L2-regularizer, the minimization (2) is standard ridge regression. Moreover, with an appropriately chosen regularization parameter, the ridge regression estimate would correspond to the minimum mean-squared error (MMSE) estimate of the coefficients. However, to study the under-regularized regime, we take a much smaller regularization parameter.
Fig. 2 plots the test MSE for the three cases described above for the linear model. In the figure, we fix the number of features and vary the number of samples N from the over-parametrized to the under-parametrized regime. For each value of N, we take 100 random instances of the model, compute the ridge regression estimate using the sklearn package, and measure the test MSE on 1000 independent test samples. The simulated values in Fig. 2 are the median test error over the 100 random trials. The test MSE is plotted in a normalized dB scale,
Also plotted is the state evolution (SE) theoretical test MSE from Theorem 1.
[Figure 2 appears here: test MSE versus number of samples for under-regularized linear regression.]
In all three cases in Fig. 2, the SE theory exactly matches the simulated values for the test MSE. Note that the case of matched training and test distributions for this problem was studied in (Hastie et al., 2019; Mei & Montanari, 2019; Montanari et al., 2019), and we see the double descent phenomenon described in their work. Specifically, with highly under-regularized linear regression, the test MSE actually increases with more samples in the over-parametrized regime (N < p) and then decreases again in the under-parametrized regime (N > p).
Our SE theory can also provide predictions for the correlated feature case. In this particular setting, we see that in the correlated case the test error is slightly lower in the over-parametrized regime since the energy of data is concentrated in a smaller sub-space. Interestingly, there is minimal difference between the correlated and i.i.d. cases for the under-parametrized regime when the training and test data match. When the training and test data are not matched, the test error increases. In all cases, the SE theory can accurately predict these effects.
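The simulation just described can be reproduced in spirit with a few lines of sklearn. The sketch below uses a single trial per sample size, i.i.d. features, and an assumed normalization for the dB scale, so it only qualitatively mirrors Fig. 2 (including the error peak near N = p).

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
p, N_test, snr_db, lam = 500, 1000, 10.0, 1e-6

def test_mse_db(N):
    """One trial of the under-regularized ridge experiment with i.i.d. features."""
    w0 = rng.normal(size=p) / np.sqrt(p)           # unit signal power for <x, w0>
    noise_var = 10.0 ** (-snr_db / 10.0)           # 10 dB SNR
    X = rng.normal(size=(N, p))
    y = X @ w0 + np.sqrt(noise_var) * rng.normal(size=N)
    model = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
    X_ts = rng.normal(size=(N_test, p))
    y_ts = X_ts @ w0 + np.sqrt(noise_var) * rng.normal(size=N_test)
    mse = np.mean((y_ts - model.predict(X_ts)) ** 2)
    return 10.0 * np.log10(mse / np.mean(y_ts ** 2))   # assumed normalization for the dB scale

for N in [100, 250, 450, 500, 550, 1000, 2000]:        # sweep from over- to under-parametrized
    print(N, round(test_mse_db(N), 2))
```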
[Figure 3 appears here: test error rate for logistic regression.]
Logistic Regression.
Fig. 3 shows a plot similar to Fig. 2 for a logistic model. Specifically, we use a logistic output, a binary cross-entropy output loss, and an L2-regularization level chosen so that the estimate corresponds to the MAP estimate (we do not perform ridgeless regression in this case). The mean of the training and test eigenvalues is selected such that, if the true coefficients were known, we could obtain a 5% prediction error. As in the linear case, we generate random instances of the model, use the sklearn package to perform the logistic regression, and evaluate the estimates on 1000 new test samples. We compute the median error rate (one minus accuracy) and compare the simulated values with the SE theoretical estimates. The i.i.d. case was considered in (Salehi et al., 2019). Fig. 3 shows that our SE theory is able to predict the test error rate exactly in the i.i.d. case along with a correlated case and a case with training and test mismatch.
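A minimal version of this logistic experiment, restricted to i.i.d. features and a single trial, with an assumed feature scale in place of the eigenvalue tuning described above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
p, N, N_test = 200, 400, 1000

# i.i.d. Gaussian features; the scale is an assumed choice that keeps the
# noiseless (known-coefficient) error rate small.
w0 = rng.normal(size=p) / np.sqrt(p)
scale = 4.0
X = scale * rng.normal(size=(N, p))
y = (rng.uniform(size=N) < 1 / (1 + np.exp(-X @ w0))).astype(int)

# L2-regularized logistic regression; C is the inverse regularization strength.
clf = LogisticRegression(C=1.0, fit_intercept=False, max_iter=1000).fit(X, y)

X_ts = scale * rng.normal(size=(N_test, p))
y_ts = (rng.uniform(size=N_test) < 1 / (1 + np.exp(-X_ts @ w0))).astype(int)
print("test error rate:", np.mean(clf.predict(X_ts) != y_ts))
```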
[Figure 4 appears here: test MSE for the nonlinear regression problem.]
Nonlinear Regression.
The SE framework can also consider non-convex problems. As an example, we consider a non-linear regression problem where the output function is
(34) |
This models saturation in the output. Corresponding to this output, we use a non-linear MSE output loss
(35) |
This output loss is non-convex. We scale the data matrix so that the input to the output non-linearity is driven well into the saturated regime. We also use L2-regularization as in the previous experiments.
For the simulation, the non-convex loss is minimized using Tensorflow where the non-linear model is described as a two-layer model. We use the ADAM optimizer (Kingma & Ba, 2014) with 200 epochs to approach a local minimum of the objective (2). Fig. 4 plots the median test MSE for the estimate along with the SE theoretical test MSE. We again see that the SE theory is able to predict the test MSE in all cases even for this non-convex problem.
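As a self-contained stand-in for the Tensorflow/ADAM setup, the sketch below fits a saturating model by plain gradient descent on the non-convex loss. The tanh output, the data scaling, and the step size are assumptions for illustration and may differ from the specific choices in (34)-(35).

```python
import numpy as np

rng = np.random.default_rng(6)
p, N, lr, epochs, lam = 100, 300, 0.01, 3000, 1e-3
phi = np.tanh                                   # assumed saturating output non-linearity

w0 = rng.normal(size=p) / np.sqrt(p)
X = 3.0 * rng.normal(size=(N, p))               # scaled so <x, w0> is driven into saturation
y = phi(X @ w0) + 0.05 * rng.normal(size=N)

# Gradient descent on (1/2N) sum_n (y_n - phi(<x_n, w>))^2 + (lam/2) ||w||^2.
w = np.zeros(p)
for _ in range(epochs):
    z = X @ w
    resid = phi(z) - y
    grad = X.T @ (resid * (1 - np.tanh(z) ** 2)) / N + lam * w
    w -= lr * grad

X_ts = 3.0 * rng.normal(size=(1000, p))
y_ts = phi(X_ts @ w0) + 0.05 * rng.normal(size=1000)
print("test MSE:", np.mean((y_ts - phi(X_ts @ w)) ** 2))
```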
6 Conclusions
In this paper we provide a procedure for exactly computing the asymptotic generalization error of a solution in a generalized linear model (GLM). This procedure is based on scalar quantities which are fixed points of a recursive iteration. The formula holds for a large class of generalization metrics, loss functions, and regularization schemes. Our formula allows analysis of important modeling effects such as (i) overparameterization, (ii) dependence between covariates, and (iii) mismatch between train and test distributions, which play a significant role in the analysis and design of machine learning systems. We experimentally validate our theoretical results for linear as well as non-linear regression and logistic regression, where a strong agreement is seen between our formula and simulated results.
References
- Advani & Saxe (2017) Advani, M. S. and Saxe, A. M. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.
- Allen-Zhu et al. (2019) Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in Neural Information Processing Systems, pp. 6155–6166, 2019.
- Arora et al. (2019) Arora, S., Du, S. S., Hu, W., Li, Z., and Wang, R. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
- Barbier et al. (2019) Barbier, J., Krzakala, F., Macris, N., Miolane, L., and Zdeborová, L. Optimal errors and phase transitions in high-dimensional generalized linear models. Proc. National Academy of Sciences, 116(12):5451–5460, 2019.
- Bartlett et al. (2019) Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. Benign overfitting in linear regression. arXiv preprint arXiv:1906.11300, 2019.
- Bayati & Montanari (2011) Bayati, M. and Montanari, A. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Trans. Inform. Theory, 57(2):764–785, February 2011.
- Belkin et al. (2018) Belkin, M., Ma, S., and Mandal, S. To understand deep learning we need to understand kernel learning. arXiv preprint arXiv:1802.01396, 2018.
- Belkin et al. (2019a) Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proc. National Academy of Sciences, 116(32):15849–15854, 2019a.
- Belkin et al. (2019b) Belkin, M., Hsu, D., and Xu, J. Two models of double descent for weak features. arXiv preprint arXiv:1903.07571, 2019b.
- Cakmak et al. (2014) Cakmak, B., Winther, O., and Fleury, B. H. S-AMP: Approximate message passing for general matrix ensembles. In Proc. IEEE ITW, 2014.
- Daniely (2017) Daniely, A. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pp. 2422–2430, 2017.
- Daniely et al. (2016) Daniely, A., Frostig, R., and Singer, Y. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances In Neural Information Processing Systems, pp. 2253–2261, 2016.
- Deng et al. (2019) Deng, Z., Kammoun, A., and Thrampoulidis, C. A model of double descent for high-dimensional binary linear classification. arXiv preprint arXiv:1911.05822, 2019.
- Donoho et al. (2009) Donoho, D. L., Maleki, A., and Montanari, A. Message-passing algorithms for compressed sensing. Proc. National Academy of Sciences, 106(45):18914–18919, 2009.
- Donoho et al. (2010) Donoho, D. L., Maleki, A., and Montanari, A. Message passing algorithms for compressed sensing. In Proc. Inform. Theory Workshop, pp. 1–5, 2010.
- Du et al. (2018) Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
- Fletcher et al. (2016) Fletcher, A., Sahraee-Ardakan, M., Rangan, S., and Schniter, P. Expectation consistent approximate inference: Generalizations and convergence. In Proc. IEEE Int. Symp. Information Theory (ISIT), pp. 190–194. IEEE, 2016.
- Fletcher et al. (2018) Fletcher, A. K., Rangan, S., and Schniter, P. Inference in deep networks in high dimensions. Proc. IEEE Int. Symp. Information Theory, 2018.
- Gabrié et al. (2018) Gabrié, M., Manoel, A., Luneau, C., Barbier, J., Macris, N., Krzakala, F., and Zdeborová, L. Entropy and mutual information in models of deep neural networks. In Proc. NIPS, 2018.
- Gerbelot et al. (2020) Gerbelot, C., Abbara, A., and Krzakala, F. Asymptotic errors for convex penalized linear regression beyond gaussian matrices. arXiv preprint arXiv:2002.04372, 2020.
- Givens et al. (1984) Givens, C. R., Shortt, R. M., et al. A class of Wasserstein metrics for probability distributions. The Michigan Mathematical Journal, 31(2):231–240, 1984.
- Hastie et al. (2019) Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.
- He et al. (2017) He, H., Wen, C.-K., and Jin, S. Generalized expectation consistent signal recovery for nonlinear measurements. In 2017 IEEE International Symposium on Information Theory (ISIT), pp. 2333–2337. IEEE, 2017.
- Jacot et al. (2018) Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pp. 8571–8580, 2018.
- Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Ma & Ping (2017) Ma, J. and Ping, L. Orthogonal AMP. IEEE Access, 5:2020–2033, 2017.
- Manoel et al. (2018) Manoel, A., Krzakala, F., Varoquaux, G., Thirion, B., and Zdeborová, L. Approximate message-passing for convex optimization with non-separable penalties. arXiv preprint arXiv:1809.06304, 2018.
- Mei & Montanari (2019) Mei, S. and Montanari, A. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.
- Montanari et al. (2019) Montanari, A., Ruan, F., Sohn, Y., and Yan, J. The generalization error of max-margin linear classifiers: High-dimensional asymptotics in the overparametrized regime. arXiv preprint arXiv:1911.01544, 2019.
- Muthukumar et al. (2019) Muthukumar, V., Vodrahalli, K., and Sahai, A. Harmless interpolation of noisy data in regression. In 2019 IEEE International Symposium on Information Theory (ISIT), pp. 2299–2303. IEEE, 2019.
- Neyshabur et al. (2018) Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., and Srebro, N. Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076, 2018.
- Opper & Winther (2005) Opper, M. and Winther, O. Expectation consistent approximate inference. Journal of Machine Learning Research, 6(Dec):2177–2204, 2005.
- Pandit et al. (2019) Pandit, P., Sahraee, M., Rangan, S., and Fletcher, A. K. Asymptotics of MAP inference in deep networks. In Proc. IEEE Int. Symp. Information Theory, pp. 842–846, 2019.
- Pandit et al. (2019) Pandit, P., Sahraee-Ardakan, M., Rangan, S., Schniter, P., and Fletcher, A. K. Inference with deep generative priors in high dimensions. arXiv preprint arXiv:1911.03409, 2019.
- Rangan et al. (2019) Rangan, S., Schniter, P., and Fletcher, A. K. Vector approximate message passing. IEEE Trans. Information Theory, 65(10):6664–6684, 2019.
- Reeves (2017) Reeves, G. Additivity of information in multilayer networks via additive gaussian noise transforms. In Proc. 55th Annual Allerton Conf. Communication, Control, and Computing (Allerton), pp. 1064–1070. IEEE, 2017.
- Salehi et al. (2019) Salehi, F., Abbasi, E., and Hassibi, B. The impact of regularization on high-dimensional logistic regression. arXiv preprint arXiv:1906.03761, 2019.
- Tulino et al. (2004) Tulino, A. M., Verdú, S., et al. Random matrix theory and wireless communications. Foundations and Trends® in Communications and Information Theory, 1(1):1–182, 2004.
- Villani (2008) Villani, C. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
- Zhang et al. (2016) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
Appendix A Empirical Convergence of Vector Sequences
Definition 1 (Pseudo-Lipschitz continuity).
For a given , a function is called Pseudo-Lipschitz of order if
(36) |
for some constant .
Observe that pseudo-Lipschitz continuity of order 1 is equivalent to the standard definition of Lipschitz continuity.
Definition 2 (Uniform Lipschitz continuity).
Let be a function on and . We say that is uniformly Lipschitz continuous in at if there exists constants and an open neighborhood of such that
(37) |
for all and ; and
(38) |
for all and .
Definition 3 (Empirical convergence of a sequence).
Consider a sequence of vectors with . So, each is a block vector with a total of components. For a finite , we say that the vector sequence converges empirically with -th order moments if there exists a random variable such that
-
(i)
; and
-
(ii)
for any that is pseudo-Lipschitz continuous of order ,
(39)
In this case, with some abuse of notation, we will write
(40) |
where we have omitted the dependence on in . We note that the sequence can be random or deterministic. If it is random, we will require that for every pseudo-Lipschitz function , the limit (39) holds almost surely. In particular, if are i.i.d. and , then empirically converges to with order moments.
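A quick numerical illustration of Definition 3 for i.i.d. Gaussian draws and one pseudo-Lipschitz test function of order 2; the particular function and distribution are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(7)

# For i.i.d. draws of W ~ N(0, 1) and a pseudo-Lipschitz function f of order 2,
# the empirical average over components converges to E f(W).
f = lambda w: w ** 2 + np.abs(w)
exact = 1.0 + np.sqrt(2.0 / np.pi)          # E[W^2] + E|W| for W ~ N(0, 1)
for N in [100, 10_000, 1_000_000]:
    w = rng.normal(size=N)
    print(N, np.mean(f(w)), exact)
```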
Appendix B ML-VAMP Denoisers Details
Related to and from equation (11), we need to define two quantities and that are zero-padded versions of the singular values , so that for , we set . Observe that are eigenvalues of whereas are eigenvalues of . Since empirically converges to as given in (12), the vector empirically converges to random variable whereas the vector empirically converges to random variable , where a mass is placed at appropriately. Specifically, has a point mass of when , whereas has a point mass of when In Appendix H (eqn. (113)), we provide the densities over positive parts of and .
A key property of our analysis will be that the non-linear functions (20) and the denoisers have simple forms.
Non-linear functions : The non-linear functions all act componentwise. For example, for in (20), we have
where is the scalar-valued function,
(41) |
Similarly, for ,
where is the zero-padded version of , and
(42) |
Finally, the function in (20) acts componentwise with
(43) |
Input denoiser : Since , and given in (6), the denoiser (25a) acts componentwise in that,
where is the scalar-valued function,
(44) |
Thus, the vector optimization in (25a) reduces to a set of scalar optimizations (44) on each component.
Output denoiser : The output penalty where has the separable form (6). Thus, similar to the case of , the denoiser in (25b) also acts componentwise with the function,
(45) |
Linear denoiser : The expressions for both denoisers and are very similar and can be explained together. The penalty restricts , where is a square matrix. Hence, for the minimization in (27) is given by,
(46) |
and . This is a simple quadratic minimization and the components of and are given by
where
(47a) | ||||
(47b) |
Linear denoiser : This denoiser is identical to the case in that we need to impose the linear constraint . However is in general a rectangular matrix and the two resulting cases of needs to be treated separately.
Appendix C State Evolution Analysis of ML-VAMP
A key property of the ML-VAMP algorithm is that its performance in the LSL can be exactly described by a scalar equivalent system. In the scalar equivalent system, the vector-valued outputs of the algorithm are replaced by scalar random variables representing the typical behavior of the components of the vectors in the large-scale-limit (LSL). Each of the random variables are described by a set of parameters, where the parameters are given by a set of deterministic equations called the state evolution or SE.
The SE for the general ML-VAMP algorithm are derived in (Pandit et al., 2019) and the special case of the updates for ML-VAMP for GLM learning are shown in Algorithm 2 with details of functions in Appendix B. We see that the SE updates in Algorithm 2 parallel those in the ML-VAMP algorithm Algo. 1, except that vector quantities such as , , and are replaced by scalar random variables such as , , and . Each of these random variables are described by the deterministic parameters such as , and , .
The updates in the section labeled as “Initial”, provide the scalar equivalent model for the true system (18). In these updates, represent the limits of the vectors in (19). That is,
(49) |
Due to assumptions in Section 2, we have that the components of converge empirically as,
(50) |
So, the represent the asymptotic distribution of the components of the vectors .
The updates in sections labeled “Forward pass” and “Backward pass” in the SE equations in Algorithm 2 parallel those in Algorithm 1. The key quantities in these SE equations are the error variables,
which represent the errors of the estimates to the inputs of the denoisers. We will also be interested in their transforms,
The following Theorem is an adapted version of the main result from (Pandit et al., 2019) to the iterates of Algorithms 1 and 2.
Theorem 2.
Consider the outputs of the ML-VAMP for GLM Learning Algorithm under the assumptions of Section 2. Assume the denoisers satisfy the continuity conditions in Assumption 1. Also, assume that the outputs of the SE satisfy
for all and . Suppose Algo. 1 is initialized so that the following convergence holds
where are independent zero-mean Gaussians, independent of . Then,
- (a) For any fixed iteration in the forward direction and layer , we have that, almost surely,
(51) (52) where the variables on the right-hand side of (51) and (52) are the outputs of the SE equations in Algorithm 2. Similar equations hold for with the appropriate variables removed.
- (b) Similarly, in the reverse direction, for any fixed iteration and layer , we have that, almost surely,
(53) (54)
Furthermore, and are independent.
Proof.
This is a direct application of Theorem 3 from (Pandit et al., 2019) to the iterations of Algorithm 1. The convergence result in (Pandit et al., 2019) requires the uniform Lipschitz continuity of the denoisers. Assumption 1 provides the required uniform Lipschitz continuity assumption on the input and output denoisers. For the “middle” layers, the denoisers are linear and the uniform continuity assumption is valid since we have made the additional assumption that the relevant terms are bounded almost surely.
A key use of the theorem is to compute asymptotic empirical limits. Specifically, for a componentwise function, consider the empirical average of its values over the components. The above theorem then states that for any componentwise pseudo-Lipschitz function of order 2, as N → ∞, we have the following two properties
That is, we can compute the empirical average over components with the expected value of the random variable limit. This convergence is key to proving Theorem 1.
Appendix D Empirical Convergence of Fixed Points
A consequence of Assumption 2 is that we can take the limit of the random variables in the SE algorithm. Specifically, let be any set of outputs from the ML-VAMP for GLM Learning Algorithm under the assumptions of Theorem 2. Under Assumption 2, for each , there exists a vector
(55) |
representing the limit over . For each , Theorem 2 shows there also exists a random vector limit,
(56) |
representing the limit over . The following proposition shows that we can take the limits of the random variables .
Proposition 1.
Consider the outputs of the ML-VAMP for GLM Learning Algorithm under the assumptions of Theorem 2 and Assumption 2. Let be any set of outputs from the algorithm and let be its limit from (55) and let be the random variable limit (56). Then, there exists a random variable such that, for any pseudo-Lipschitz continuous ,
(57) |
In addition, the SE parameter limits and converge to limits,
(58) |
The proposition shows that, under the convergence assumption, Assumption 2, we can take the limits as of the random variables from the SE. To prove the proposition we first need the following simple lemma.
Lemma 1.
If and are sequences such that
(59) |
then, there exists a constant such that,
(60) |
In particular, the two limits in (60) exist.
Proof.
Proof of Proposition 1
Let be any pseudo-Lipschitz function of order 2, and define,
(63) |
Their difference can be written as,
(64) |
where
(65) | ||||
(66) |
Since converges to , we have,
(67) |
For the term ,
(68) |
where (a) follows from applying the triangle inequality to the definition of in (65); (b) follows from the definition of pseudo-Lipschitz continuity in Definition 1, where is the Lipschitz constant and
and (c) follows from the RMS-AM inequality:
By (29), we have that,
Hence, from (68), it follows that,
(69) |
Substituting (67) and (69) into (64) shows that and satisfy (59). Therefore, applying Lemma 1, we have that for any pseudo-Lipschitz function , there exists a limit such that,
(70) |
In particular, the two limits in (70) exist. When restricted to continuous, bounded functions with the supremum norm, it is easily verified that is a positive, linear, bounded function of , with . Therefore, by the Riesz representation theorem, there exists a random variable such that . This fact in combination with (70) shows (57).
It remains to prove the parameter limits in (58). We prove the result for the parameter ; the proof for the other parameters is similar. Using Stein’s lemma, it is shown in (Pandit et al., 2019) that
(71) |
Since the numerator and denominator of (71) are functions we have that the limit,
(72) |
where and are the limits of and . This completes the proof. ∎
Appendix E Proof of Theorem 1
From Assumption 2, we know that for every , every group of vectors converge to limits, . The parameters, , also converge to limits for all . By the continuity assumptions on the functions , the limits and are fixed points of the algorithms.
A proof similar to that in (Pandit et al., 2019) shows that the fixed points and satisfy the KKT condition of the constrained optimization (22). This proves part (a).
The estimate is the limit,
Also, the true parameter is . By Proposition 1, we have that the limits of these variables are
From line 15 of the SE Algorithm 2, we have
This proves part (b).
To prove part (c), we use the limit
(73) |
Since the fixed points are critical points of the constrained optimization (22), . We also have . Therefore,
(74) |
Here, the subscript denotes the dependence on N. Since , the pair is a zero-mean bivariate Gaussian with covariance matrix
The empirical convergence (73) yields the following limit,
(75) |
It suffices to show that the distribution of converges to the distribution of in the Wasserstein-2 metric as (See the discussion in Appendix A on the equivalence of convergence in Wasserstein-2 metric and PL(2) convergence.)
Now, the Wasserstein-2 distance between two probability measures and is defined as
(76) |
where is the set of probability distributions on the product space with marginals consistent with and . For Gaussian measures and we have (Givens et al., 1984)
Therefore, for the Gaussian distributions above, the convergence (75) implies convergence in the Wasserstein-2 distance. Hence,
where is the covariance matrix in (75). Hence the convergence holds in the PL(2) sense (see discussion in Appendix A on the equivalence of convergence in and PL(2) convergence).
Appendix F Formula for
For the special cases in the next appendix, it is useful to derive expressions for the entries of the covariance matrix in (75). For the term ,
(78) |
where we have used the fact that . Next, where,
(79) |
where are independent of . Hence,
(80) |
since and is the covariance matrix of from line 23.
Finally for we have,
(81) |
Appendix G Special Cases
G.1 Linear Output with Square Error
In this section we examine a few special cases of the GLM problem (2). We first consider a linear output with additive Gaussian noise and a squared error training and test loss. Specifically, consider the model,
(82) |
We consider estimates of such that:
(83) |
The factor is added above since the two terms scale with a ratio of ; it does not change the analysis. Consider the ML-VAMP GLM learning algorithm applied to this problem. The following corollary follows from the main result, Theorem 1.
Corollary 1 (Squared error).
For linear regression, i.e., , , we have
The quantities , depend on the choice of regularizer and the covariance between features.
Proof.
G.2 Ridge Regression with i.i.d. Covariates
We next consider the special case when the input features are independent, i.e., problem (83) where the rows of the data matrix corresponding to the training data have i.i.d. Gaussian features.
Although the solution to (83) exists in closed form , we can study the effect of the regularization parameter on the generalization error as detailed in the result below.
Corollary 2.
Consider the ridge regression problem (83) with regularization parameter . For the squared loss, i.i.d. Gaussian features, and no train-test mismatch, the generalization error is given by Corollary 1, with constants
where , with given in Appendix H, and where is given in equation (95) in the proof.
Proof of Corollary 2.
We are interested in identifying the following constants appearing in Corollary 1:
These quantities are obtained as fixed points of the State Evolution Equations in Algo. 2. We explain below how to obtain expressions for these constants. Since these are fixed points we ignore the subscript corresponding to the iteration number in Algo. 2.
In the case of problem (83), the maps and , i.e., and respectively, can be expressed as closed-form formulae. This leads to simplification of the SE equations as explained below.
We start by looking at the forward pass (finding quantities with superscript ’+’) of Algorithm 2 for different layers and then the backward pass (finding quantities with superscript ’-’) to get the parameters for .
To begin with, notice that , and therefore the denoiser in (44) is simply,
Using the random variable and substituting in the expression of the denoiser to get , we can now calculate using lines 20 and 22,
(84) |
Similarly, we have , whereby the output denoiser in the last layer for ridge regression is given by,
(85) |
By substituting this denoiser in line 30 of the algorithm we get and thus, following the lines 35-38 of the algorithm we have
(86) |
Having identified these constants , we will now sequentially identify the quantities
in the forward pass, and then the quantities
in the backward pass.
We also note that we have
(87) |
Forward Pass:
Backward Pass:
Since , line 36 of algorithm on simplification yields , whereby we can get ,
(91) |
Next, to calculate the terms , we use the denoiser defined in (47a) for line 33 of the algorithm to get .
(92) |
where we have used due to (90), and from lines 17, 32 and 4 respectively.
Then, we calculate and as This gives,
(93) |
Here, in the overparameterized case , the denoiser outputs with probability and with probability .
Next, from line 37 we get,
(94) |
Now from line 36 and equation (87) we get,
(95) |
where (a) follows from (90) and (87), and (b) follows from (92). From this one can obtain which can be calculated using the knowledge that are independent Gaussian with covariances , . Further, are independent of .
Observe that by (95) we have
(96) |
with some simplification we get
(97a) | ||||
(97b) |
where , with given in Appendix H, and is the derivative of calculated at .
Now consider the under-parametrized case ():
Let and . In this case we have
(98) |
Note that,
(99a) |
where is the R-transform defined in (Tulino et al., 2004) and (a) follows from the relationship between the R- and Stieltjes-transform and (b) follows from the fact that for Marchenko-Pastur distribution we have . Therefore,
(100) |
For the over-parametrized case () we have:
(101) |
In this case, as mentioned in Appendix H and following the results from (Tulino et al., 2004), the measure scales with and thus . Therefore, similar to (99a), satisfies
(102) |
Now can be calculated as follows:
(103) |
where
(104) |
and is the solution to the fixed points
(105) |
G.3 Ridgeless Linear Regression
Here we consider the case of ridge regression (83) when the regularization parameter tends to zero. Note that the solution to problem (83) remains unique in this limit. The following result was stated in (Hastie et al., 2019) and can be recovered using our methodology. Note, however, that we calculate the generalization error whereas they calculated the squared error, whereby we obtain an additional additive term. The result explains the double descent phenomenon for ridgeless linear regression.
Corollary 3.
For ridgeless linear regression, we have
Proof of Corollary 3.
We calculate the parameters , and when . Before starting off, we note that
(106) |
as described in Appendix H. Following the derivations in Corollary 2, we have
(107) |
Now for we have
(108) |
Using this in simplifying (95) for , we get
where during the evaluation of , for the case of we need to account for the point mass at for with weight .
G.4 Train-Test Mismatch
Observe that our formulation allows for analyzing the effect of mismatch in the training and test distribution. One can consider arbitrary joint distributions over that model the mismatch between training and test features. Here we give a simple example which highlights the effect of this mismatch.
Definition 4 (Bernoulli -mismatch).
has a bivariate Bernoulli distribution with
-
2
-
Notice that the marginal distributions in the Bernoulli mismatch model are such that half of the features extracted by the eigenvector matrix are relevant during training, and similarly for the test data. However, the features spanned by the test data do not exactly overlap with the features captured in the training data, and the fraction of features common to both the training and test data depends on the mismatch parameter. Hence, at one extreme value of the parameter there is no training-test mismatch, whereas at the other there is a complete mismatch.
The following result shows that the generalization error increases linearly with the mismatch parameter.
Corollary 4 (Mismatch).
Proof.
The quantities and in the result above can be calculated similar to the derivation in the proof of Corollary 2 and can in general depend on the regularization parameter and overparameterization parameter .
G.5 Logistic Regression
A precise analysis of the regularized logistic regression estimator with i.i.d. Gaussian features is provided in (Salehi et al., 2019). Consider the logistic regression model,
where is the standard logistic function.
In this problem we consider estimates of such that
where is the regularization function. This is a special case of the optimization problem (2) where
(109) |
Similar to the linear regression model, using the ML-VAMP GLM learning algorithm, we can characterize the generalization error for this model with quantities given by algorithm 2. We note that in this case, the output non-linearity is
(110) |
where . Also, the denoisers , and can be derived as the proximal operators of , and defined in (25).
G.6 Support Vector Machines
The asymptotic generalization error for support vector machines (SVM) is provided in (Deng et al., 2019). Our model can also handle SVMs. Similar to logistic regression, SVM finds a linear classifier using the hinge loss instead of the logistic loss. Assuming the class labels are ±1, the hinge loss is
(111) |
Therefore, if we take
(112) |
where x_n is the n-th row of the data matrix, the ML-VAMP algorithm for GLMs finds the SVM classifier. The algorithm uses the proximal map of the hinge loss, and our theory provides exact predictions for the estimation and prediction error of SVM.
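As an illustration, the scalar proximal map of the hinge loss has a simple closed form. The sketch below implements it and checks it against a generic scalar minimizer; the parameterization (a single quadratic weight γ) is a generic choice and may differ from the one used inside Algorithm 1.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def prox_hinge(y, v, gamma):
    """Proximal operator of the hinge loss f(z) = max(0, 1 - y*z), y in {-1, +1}:
    argmin_z max(0, 1 - y*z) + (gamma/2) * (z - v)**2, applied componentwise."""
    u = y * v
    if u >= 1.0:
        return v                      # loss is flat (zero) around v
    if u <= 1.0 - 1.0 / gamma:
        return v + y / gamma          # on the linear part of the hinge
    return y * 1.0                    # at the kink of the hinge

# Numerical sanity check against a generic scalar minimizer.
y, v, gamma = -1.0, 0.3, 2.0
obj = lambda z: max(0.0, 1.0 - y * z) + 0.5 * gamma * (z - v) ** 2
print(prox_hinge(y, v, gamma), minimize_scalar(obj).x)
```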
As with all other models considered in this work, the true underlying data generating model could be anything that can be represented by the graphical model in Figure 1, e.g. logistic or probit model, and our theory is able to exactly predict the error when SVM is applied to learn such linear classifiers in the large system limit.
Appendix H Marchenko-Pastur distribution
We describe the random variable defined in (12) where has a rescaled Marchenko-Pastur distribution. Notice that the positive entries of are the positive eigenvalues of (or ).
Observe that , whereas the standard scaling when studying the Marchenko-Pastur distribution is for matrices such that (see, e.g., equation (1.10) from (Tulino et al., 2004) and the discussion preceding it). Also notice that has the same distribution as . Thus the results from (Tulino et al., 2004) apply directly to the distributions of eigenvalues of and . We state their result below, taking into account this disparity in scaling.
The positive eigenvalues of have an empirical distribution which converges to the following density:
(113) |
where , . Similarly the positive eigenvalues of have an empirical distribution converging to the density . We note the following integral which is useful in our analysis:
(114) |
More generally, the Stieltjes transform of the density is given by:
(115) |
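The following sketch checks the Stieltjes transform numerically for the standard (variance-1, ratio β ≤ 1) Marchenko-Pastur law. The normalization differs from the rescaled version used in (113)-(115), so this is only a consistency check of the general formula, not a reproduction of the paper's scaling.

```python
import numpy as np

rng = np.random.default_rng(8)
N, beta = 2000, 0.5
p = int(beta * N)

# Sample covariance with the standard (variance-1) normalization; its eigenvalues
# follow the Marchenko-Pastur law with ratio beta as N, p -> infinity.
X = rng.normal(size=(N, p))
evals = np.linalg.eigvalsh(X.T @ X / N)

# Stieltjes transform at a point z off the support: empirical average of
# 1/(lambda - z) versus numerical integration of the MP density.
z = -1.0
empirical = np.mean(1.0 / (evals - z))

s_minus, s_plus = (1 - np.sqrt(beta)) ** 2, (1 + np.sqrt(beta)) ** 2
s = np.linspace(s_minus, s_plus, 200_001)[1:-1]
density = np.sqrt((s_plus - s) * (s - s_minus)) / (2 * np.pi * beta * s)
theory = np.sum(density / (s - z)) * (s[1] - s[0])
print(empirical, theory)
```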