Generalization Error of Generalized Linear Models in High Dimensions
Abstract
At the heart of machine learning lies the question of generalizability of learned rules over previously unseen data. While over-parameterized models based on neural networks are now ubiquitous in machine learning applications, our understanding of their generalization capabilities is incomplete. This task is made harder by the non-convexity of the underlying learning problems. We provide a general framework to characterize the asymptotic generalization error for single-layer neural networks (i.e., generalized linear models) with arbitrary non-linearities, making it applicable to regression as well as classification problems. This framework enables analyzing the effect of (i) over-parameterization and non-linearity during modeling; and (ii) choices of loss function, initialization, and regularizer during learning. Our model also captures mismatch between training and test distributions. As examples, we analyze a few special cases, namely linear regression and logistic regression. We are also able to rigorously and analytically explain the double descent phenomenon in generalized linear models.
1 Introduction
A fundamental goal of machine learning is generalization: the ability to draw inferences about unseen data from finite training examples. Methods to quantify the generalization error are therefore critical in assessing the performance of any machine learning approach.
This paper seeks to characterize the generalization error for a class of generalized linear models (GLMs) of the form
(1)   y = φ(⟨x, w⁰⟩, d),
where x ∈ ℝᵖ is a vector of input features, y is a scalar output, w⁰ ∈ ℝᵖ are the weights to be learned, φ is a known link function, and d is random noise. The notation ⟨x, w⁰⟩ denotes an inner product. We use the superscript “0” to denote the “true” values in contrast to estimated or postulated quantities. The output y may be continuous or discrete to model either regression or classification problems.
We measure the generalization error in a standard manner: we are given training data {(x_n, y_n)}, n = 1, …, N, from which we learn some parameter estimate ŵ via a regularized empirical risk minimization of the form
(2)   ŵ := argmin_w F_out(y, Xw) + F_in(w),
where y = (y_1, …, y_N), X ∈ ℝ^{N×p} is the data matrix, F_out is some output loss function, and F_in is some regularizer on the weights. We are then given a new test sample, x_ts, for which the true and predicted values are given by
(3)   y_ts = φ(⟨x_ts, w⁰⟩, d_ts),   ŷ_ts = φ̂(⟨x_ts, ŵ⟩),
where d_ts is the noise in the test sample, and φ̂ is a postulated inverse link function that may be different from the true function φ. The generalization error is then defined as the expectation of some loss between y_ts and ŷ_ts of the form
(4)   𝔼 f_ts(y_ts, ŷ_ts),
for some test loss function f_ts, such as squared error or prediction error.
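To make (1)–(4) concrete, the following sketch estimates the generalization error by Monte Carlo for the simplest instance of the model: a linear link with additive Gaussian noise, an L2 regularizer, and squared test loss. All dimensions, scalings, and the closed-form ridge solution below are illustrative choices, not the experimental setup used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, N, N_test, noise_std, lam = 100, 200, 10_000, 0.1, 1e-3

# True weights; scaled so that <x, w0> has O(1) variance for standard Gaussian x.
w0 = rng.normal(size=p) / np.sqrt(p)

# Training data from the GLM (1) with the linear link phi(z, d) = z + d.
X = rng.normal(size=(N, p))
y = X @ w0 + noise_std * rng.normal(size=N)

# Regularized empirical risk minimization (2) with squared loss and L2 penalty
# has the closed-form ridge solution (X^T X + lam I)^{-1} X^T y.
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Monte Carlo estimate of the generalization error (4) with squared test loss f_ts.
X_ts = rng.normal(size=(N_test, p))
y_ts = X_ts @ w0 + noise_std * rng.normal(size=N_test)
y_hat = X_ts @ w_hat
print("estimated generalization error:", np.mean((y_ts - y_hat) ** 2))
```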
Even for this relatively simple GLM model, the behavior of the generalization error is not fully understood. Recent works (Montanari et al., 2019; Deng et al., 2019; Mei & Montanari, 2019; Salehi et al., 2019) have characterized the generalization error of various linear models for classification and regression in certain large random problem instances. Specifically, the number of samples N and the number of features p both grow without bound with their ratio p/N converging to a constant β > 0, and the samples in the training data are drawn randomly. In this limit, the generalization error can be exactly computed. The analysis can explain the so-called double descent phenomenon (Belkin et al., 2019a): in highly under-regularized settings, the test error may initially increase with the number of data samples before decreasing. See the prior work section below for more details.
Summary of Contributions.
Our main result (Theorem 1) provides a procedure for exactly computing the asymptotic value of the generalization error (4) for GLM models in a certain random high-dimensional regime called the Large System Limit (LSL). The procedure relates the generalization error to key problem parameters including the sampling ratio β = p/N, the regularizer, the output function, and the distributions of the true weights and noise. Importantly, our result holds under very general settings including: (i) arbitrary test metrics f_ts; (ii) arbitrary training loss functions F_out as well as decomposable regularizers F_in; (iii) arbitrary link functions φ; (iv) correlated covariates; (v) under-parameterized (β < 1) and over-parameterized (β > 1) regimes; and (vi) distributional mismatch in training and test data. Section 4 discusses in detail the general assumptions on these quantities under which Theorem 1 holds.
Prior Work.
Many recent works characterize the generalization error of various machine learning models, including special cases of the GLM model considered here. For example, a precise characterization of the asymptotic prediction error for least squares regression has been provided in (Belkin et al., 2019b; Hastie et al., 2019; Muthukumar et al., 2019). The former confirmed the double descent curve of (Belkin et al., 2019a) under a Fourier series model and a noisy Gaussian model for data in the over-parameterized regime. The latter obtained similar behavior under both linear and non-linear feature models for ridge regression and min-norm least squares using random matrix theory. Also, (Advani & Saxe, 2017) studied the same setting for deep linear and shallow non-linear networks.
The generalization error of max-margin linear classifiers in the high-dimensional regime was analyzed in (Montanari et al., 2019), where an exact expression for the asymptotic prediction error is derived and, in the specific case of a two-layer neural network with random first-layer weights, the double descent curve is obtained. A similar double descent curve for logistic regression as well as linear discriminant analysis has been reported by (Deng et al., 2019). Random feature learning in the same setting has also been studied for ridge regression in (Mei & Montanari, 2019). The authors show, in particular, that highly over-parametrized estimators with zero training error are statistically optimal at high signal-to-noise ratio (SNR). The asymptotic performance of regularized logistic regression in high dimensions is studied in (Salehi et al., 2019) using the Convex Gaussian Min-max Theorem in the under-parametrized regime. The results in the current paper include all these models as special cases. Bounds on the generalization error of over-parametrized linear models are also given in (Bartlett et al., 2019; Neyshabur et al., 2018).
Although this paper and several other recent works consider only simple linear models and GLMs, much of the motivation is to understand generalization in deep neural networks, where classical intuition may not hold (Belkin et al., 2018; Zhang et al., 2016; Neyshabur et al., 2018). In particular, a number of recent papers have shown the connection between neural networks in the over-parametrized regime and kernel methods. The works (Daniely, 2017; Daniely et al., 2016) showed that gradient descent on over-parametrized neural networks learns a function in the RKHS corresponding to the random feature kernel. Training dynamics of over-parametrized neural networks have been studied by (Jacot et al., 2018; Du et al., 2018; Arora et al., 2019; Allen-Zhu et al., 2019), where it is shown that the learned function lies in an RKHS corresponding to the neural tangent kernel.
Approximate Message Passing.
Our key tool to study the generalization error is approximate message passing (AMP), a class of inference algorithms originally developed in (Donoho et al., 2009, 2010; Bayati & Montanari, 2011) for compressed sensing. We show that the learning problem for the GLM can be formulated as an inference problem on a certain multi-layer network. Multi-layer AMP methods (He et al., 2017; Manoel et al., 2018; Fletcher et al., 2018; Pandit et al., 2019) can then be applied to perform the inference. The specific algorithm we use in this work is the multi-layer vector AMP (ML-VAMP) algorithm of (Fletcher et al., 2018; Pandit et al., 2019) which itself builds on several works (Opper & Winther, 2005; Fletcher et al., 2016; Rangan et al., 2019; Cakmak et al., 2014; Ma & Ping, 2017). The ML-VAMP algorithm is not necessarily the most computationally efficient procedure for the minimization (2). For our purposes, the key property is that ML-VAMP enables exact predictions of its performance in the large system limit. Specifically, the error of the algorithm estimates in each iteration can be predicted by a set of deterministic recursive equations called the state evolution or SE. The fixed points of these equations provide a way of computing the asymptotic performance of the algorithm. In certain cases, the algorithm can be proven to be Bayes optimal (Reeves, 2017; Gabrié et al., 2018; Barbier et al., 2019).
This approach of using AMP methods to characterize the generalization error of GLMs was also explored in (Barbier et al., 2019) for i.i.d. distributions on the data. Explicit formulae for the asymptotic mean squared error of regularized linear regression with rotationally invariant data matrices are proved in (Gerbelot et al., 2020). The ML-VAMP method in this work enables extensions to correlated features and to mismatch between training and test distributions.
2 Generalization Error: System Model
We consider the problem of estimating the weights w⁰ in the GLM model (1). As stated in the Introduction, we suppose we have training data {(x_n, y_n)}, n = 1, …, N, arranged as y = (y_1, …, y_N) and X = (x_1, …, x_N)ᵀ. Then we can write
(5)   y = φ(Xw⁰, d),
where φ is the vector-valued function acting componentwise, i.e., [φ(z, d)]_n = φ(z_n, d_n), and d = (d_1, …, d_N) is general noise.
Given the training data (X, y), we consider estimates ŵ of w⁰ given by a regularized empirical risk minimization of the form (2). We assume that the loss function F_out and regularizer F_in are separable, i.e., one can write
(6)   F_out(y, z) = Σ_{n=1}^N f_out(y_n, z_n),   F_in(w) = Σ_{j=1}^p f_in(w_j),
for some scalar functions f_out and f_in. Many standard optimization problems in machine learning can be written in this form, including logistic regression, support vector machines, linear regression, and Poisson regression.
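For illustration, the snippet below spells out the separable structure (6) for one concrete choice: a scalar logistic (cross-entropy) output loss and a scalar L2 penalty. The function names f_out, f_in, F_out, F_in mirror the notation used here, but the specific scalar forms are only examples.

```python
import numpy as np

def f_out(y_n, z_n):
    """Scalar logistic (cross-entropy) loss for one sample, with label y_n in {0, 1}."""
    # Numerically stable evaluation of log(1 + exp(z_n)) - y_n * z_n.
    return np.log1p(np.exp(-abs(z_n))) + max(z_n, 0.0) - y_n * z_n

def f_in(w_j, lam=1e-2):
    """Scalar L2 penalty on a single weight."""
    return 0.5 * lam * w_j ** 2

def F_out(y, z):
    """Separable output loss F_out(y, z) = sum_n f_out(y_n, z_n)."""
    return sum(f_out(y_n, z_n) for y_n, z_n in zip(y, z))

def F_in(w, lam=1e-2):
    """Separable regularizer F_in(w) = sum_j f_in(w_j)."""
    return sum(f_in(w_j, lam) for w_j in w)
```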
Large System Limit: We follow the LSL analysis of (Bayati & Montanari, 2011) commonly used for analyzing AMP-based methods. Specifically, we consider a sequence of problems indexed by the number of training samples N. For each N, we suppose that the number of features p = p(N) grows linearly with N, i.e.,
(7)   lim_{N→∞} p(N)/N = β,
for some constant β > 0. Note that β > 1 corresponds to the over-parameterized regime and β < 1 corresponds to the under-parameterized regime.
True parameter: We assume the true weight vector w⁰ has components whose empirical distribution converges as
(8)   {w⁰_j, j = 1, …, p}  →  W⁰   (PL(2)),
for some limiting random variable W⁰. The precise definition of empirical convergence is given in Appendix A. It means that the empirical distribution of the components converges, in the Wasserstein-2 metric (see Chap. 6 of (Villani, 2008)), to the distribution of the finite-variance random variable W⁰. Importantly, the limit (8) will hold if the components w⁰_j are drawn i.i.d. from the distribution of W⁰ with 𝔼(W⁰)² < ∞. However, as discussed in Appendix A, the convergence can also be satisfied by correlated sequences and deterministic sequences.
Training data input: For each , we assume that the training input data samples, , , are i.i.d. and drawn from a -dimensional Gaussian distribution with zero mean and covariance . The covariance can capture the effect of features being correlated. We assume the covariance matrix has an eigenvalue decomposition,
(9) |
where the diagonal factor contains the eigenvalues of the covariance and the orthogonal matrix contains its eigenvectors. The scaling ensures that the total variance of the samples does not grow with the dimension. We will place a certain random model on the eigenvalues and eigenvectors momentarily.
Using the covariance (9), we can write the data matrix as
(10) |
where has entries drawn i.i.d. from . For the purpose of analysis, it is useful to express the matrix in terms of its SVD:
(11) |
where and are orthogonal and with non-zero entries only along the principal diagonal. are the singular values of . A standard result of random matrix theory is that, since is i.i.d. Gaussian with entries , the matrices and are Haar-distributed on the group of orthogonal matrices and is such that
(12) |
where the limit is a non-negative random variable that follows a rescaled Marchenko-Pastur distribution. Details on this distribution are in Appendix H.
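The following sketch illustrates this limiting behavior numerically for an i.i.d. Gaussian matrix: the squared singular values concentrate on the Marchenko-Pastur support. The 1/√N entry scaling is an illustrative choice and does not reproduce the exact normalization used in Appendix H.

```python
import numpy as np

rng = np.random.default_rng(1)
N, beta = 2000, 0.5                      # beta = p / N, under-parameterized here
p = int(beta * N)

# i.i.d. Gaussian matrix with entries of variance 1/N (illustrative scaling).
U = rng.normal(size=(N, p)) / np.sqrt(N)
evals = np.linalg.eigvalsh(U.T @ U)      # squared singular values of U

# Marchenko-Pastur support edges for ratio beta: [(1 - sqrt(beta))^2, (1 + sqrt(beta))^2].
lam_minus, lam_plus = (1 - np.sqrt(beta)) ** 2, (1 + np.sqrt(beta)) ** 2
print("empirical range:", evals.min(), evals.max())
print("MP support     :", lam_minus, lam_plus)
print("mean eigenvalue:", evals.mean(), "(MP mean = 1)")
```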
Training data output: Given the input data , we assume that the training outputs are generated from (5), where the noise is independent of and has an empirical distribution which converges as
(13) |
Again, the limit (13) will be satisfied if the noise components are i.i.d. draws of a random variable with bounded second moment.
Test data: To measure the generalization error, we assume now that we are given a test point , and we obtain the true output and predicted output given by (3). We assume that the test data inputs are also Gaussian, i.e.,
(14) |
where has i.i.d. Gaussian components, , and and are the eigenvalues and eigenvectors of the test data covariance matrix. That is, the test data sample has a covariance matrix
(15) |
In comparison to (9), we see that we are assuming that the eigenvectors of the training and test data are the same, but the eigenvalues may be different. In this way, we can capture distributional mismatch between the training and test data. For example, we will be able to measure the generalization error when the test sample is outside a subspace explored by the training data.
To capture the relation between the training and test distributions, we assume that components of and converge as
(16) |
to some non-negative, bounded random vector . The joint distribution on captures the relation between the training and test data.
When the training and test eigenvalues coincide, our model corresponds to the case when the training and test distributions are matched. Isotropic Gaussian features in both training and test data correspond to scaled identity covariance matrices. We also require that the matrix of eigenvectors is uniformly distributed on the set of orthogonal matrices.
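A minimal sketch of this train-test model: the training and test covariances share a Haar-distributed eigenvector matrix but have different eigenvalues (lognormal here, as in the experiments of Section 5). The 1/p normalization is an assumed choice to keep the total variance of a sample O(1).

```python
import numpy as np
from scipy.stats import ortho_group

rng = np.random.default_rng(2)
p = 50

# Shared Haar-distributed eigenvector matrix V for training and test covariances.
V = ortho_group.rvs(p, random_state=2)

# Different (independent lognormal) eigenvalues for training and test; the 1/p
# factor keeps the expected squared norm of a sample O(1).
s_tr = rng.lognormal(mean=0.0, sigma=0.5, size=p) / p
s_ts = rng.lognormal(mean=0.0, sigma=0.5, size=p) / p

Sigma_tr = V @ np.diag(s_tr) @ V.T       # training covariance, as in (9)
Sigma_ts = V @ np.diag(s_ts) @ V.T       # test covariance, as in (15): same V, new eigenvalues

# Draw one training sample and one test sample with the respective covariances.
x_tr = V @ (np.sqrt(s_tr) * rng.normal(size=p))
x_ts = V @ (np.sqrt(s_ts) * rng.normal(size=p))
```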
Generalization error: From the training data, we obtain an estimate ŵ via the regularized empirical risk minimization (2). Given a test sample x_ts and parameter estimate ŵ, the true output y_ts and predicted output ŷ_ts are given by equation (3). We assume the test noise d_ts follows the same distribution as the training noise. The postulated inverse-link function φ̂ in (3) may be different from the true inverse-link function φ.
The generalization error is defined as the asymptotic expected loss,
(17)   lim_{N→∞} 𝔼 f_ts(y_ts, ŷ_ts),
where f_ts is some loss function relevant for the test error (which may be different from the training loss). The expectation in (17) is with respect to the randomness in the training as well as test data, and the noise. Our main result provides a formula for the generalization error (17).
3 Learning GLMs via ML-VAMP
There are many methods for solving the minimization problem (2). We apply the ML-VAMP algorithm of (Fletcher et al., 2018; Pandit et al., 2019). This algorithm is not necessarily the most computationally efficient method. For our purposes, however, the algorithm serves as a constructive proof technique, i.e., it enables exact predictions for the generalization error in the LSL described above. Moreover, when the objective in (2) is strictly convex, the problem has a unique global minimum, so the generalization error of this minimum is agnostic to the choice of algorithm used to find it. To that end, we start by reformulating (2) in a form that is amenable to the application of ML-VAMP, Algorithm 1.
Multi-Layer Representation.
The first step in applying ML-VAMP to the GLM learning problem is to represent the mapping from the true parameters to the output as a certain multi-layer network. We combine (5), (10) and (11), so that the mapping can be written as the following sequence of operations (as illustrated in Fig. 1):
(18) | |||||
where are the following vectors:
(19) |
and the functions are given by
(20) | ||||
We see from Fig. 1 that the mapping of true parameters to the observed response vector is described by a multi-layer network of alternating orthogonal operators and non-linear functions . Let denote the number of layers in this multi-layer network.
The minimization (2) can also be represented using a similar signal flow graph. Given a parameter candidate , the mapping can be written using the sequence of vectors
(21) | ||||||
There are steps in this sequence, and we let
denote the sets of vectors across the steps. The minimization in (2) can then be written in the following equivalent form:
(22) | ||||
where the penalty functions are defined as
(23) | ||||||
where is on the set , and on .
ML-VAMP for GLM Learning.
Using this multi-layer representation, we can now apply the ML-VAMP algorithm from (Fletcher et al., 2018; Pandit et al., 2019) to solve the optimization (22). The steps are shown in Algorithm 1. These steps are a special case of the “MAP version” of ML-VAMP in (Pandit et al., 2019), but with a slightly different set-up for the GLM problem. We will call these steps the ML-VAMP GLM Learning Algorithm.
The algorithm operates in a set of iterations indexed by . In each iteration, a “forward pass” through the layers generates estimates for the hidden variables , while a “backward pass” generates estimates for the variables . In each step, the estimates and are produced by functions and called estimators or denoisers.
For the MAP version of ML-VAMP algorithm in (Pandit et al., 2019), the denoisers are essentially proximal-type operators defined as
(24) |
An important property of the proximal operator is that, for separable functions of the form (6), it acts componentwise.
In the case of the GLM model, for and , on lines 7 and 19, the denoisers are proximal operators given by
(25a) | ||||
(25b) |
Note that in (25b), there is a dependence on through the term . For the middle terms, , i.e., lines 9 and 21, the denoisers are given by
(26a) | |||
(26b) |
where are the solutions to the minimization
(27) |
The quantity on lines 11 and 23 denotes the empirical mean .
Thus, the ML-VAMP algorithm in Algorithm 1 reduces the joint constrained minimization (22) over variables and to a set of proximal operations on pairs of variables . As discussed in (Pandit et al., 2019), this type of minimization is similar to ADMM with adaptive step-sizes. Details of the denoisers and other aspects of the algorithm are given in Appendix B.
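To give a feel for the proximal operations underlying the denoisers, the sketch below implements two generic scalar proximal maps, for an L2 penalty and a squared output loss, in the standard form argmin_u f(u) + (γ/2)(u − v)². The parameterization (a single weight γ) is a generic choice and does not reproduce how the quantities in (25a)-(25b) enter Algorithm 1; the point is only that separable penalties yield componentwise proximal updates.

```python
import numpy as np

def prox_l2(v, gamma, lam=1e-2):
    """prox of f(w) = (lam/2) w^2: argmin_w (lam/2) w^2 + (gamma/2)(w - v)^2."""
    return gamma * v / (lam + gamma)

def prox_square_loss(y, v, gamma):
    """prox of f(z) = (1/2)(y - z)^2: argmin_z (1/2)(y - z)^2 + (gamma/2)(z - v)^2."""
    return (y + gamma * v) / (1.0 + gamma)

# For separable penalties, the vector proximal operator acts componentwise, so a
# single scalar formula handles every entry of the vector at once.
v = np.array([1.0, -2.0, 0.5])
print(prox_l2(v, gamma=1.0))                                   # componentwise L2 prox
print(prox_square_loss(np.array([0.0, 1.0, 2.0]), v, gamma=1.0))
```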
4 Main Result
We make two assumptions. The first assumption imposes certain regularity conditions on the functions , , , and maps appearing in Algorithm 1. The precise definitions of pseudo-Lipschitz continuity and uniform Lipschitz continuity are given in Appendix A of the supplementary material.
Assumption 1.
The denoisers and link functions satisfy the following continuity conditions:
- (a)
- (b) The link function φ is Lipschitz continuous. The test error function f_ts is pseudo-Lipschitz continuous of order 2.
Our second assumption is that the ML-VAMP algorithm converges. Specifically, let be any set of outputs of Algorithm 1, at some iteration and dimension . For example, could be or for some , or a concatenation of signals such as .
Assumption 2.
Let be any finite set of outputs of the ML-VAMP algorithm as above. Then there exist limits
(28) |
satisfying
(29) |
We are now ready to state our main result.
Theorem 1.
Consider the GLM learning problem (2) solved by applying Algorithm 1 to the equivalent problem (22) under the assumptions of Section 2 along with Assumptions 1 and 2. Then, there exist constants and such that the following hold:
- (a)
- (b) The true parameter and its estimate empirically converge as
(30) where is the random variable from (8) and
(31) with independent of .
- (c)
Part (a) shows that, similar to gradient descent, Algorithm 1 finds the stationary points of problem (2). These stationary points are unique in strictly convex problems such as linear and logistic regression. Thus, in such cases, the same results hold for any algorithm that finds such stationary points. Hence, the fact that we are using ML-VAMP is immaterial – our results apply to any solver for (2). Note that convergence to the fixed points is assumed in Assumption 2.
Part (b) provides an exact description of the asymptotic statistical relation between the true parameter and its estimate . The parameters and can be explicitly computed using a set of recursive equations called the state evolution or SE described in Appendix C in the supplementary material.
We can use the expressions to compute a variety of relevant metrics. For example, the convergence shows that the MSE on the parameter estimate is
(33) |
The expectation on the right hand side of (33) can then be computed via integration over the joint density of from part (b). In this way, we have a simple and exact method to compute the parameter error. Other metrics such as parameter bias or variance, cosine angle or sparsity detection can also be computed.
Part (c) of Theorem 1 similarly characterizes the asymptotic generalization error exactly. In this case, we compute the expectation over the three limiting variables. In this way, we have provided a methodology for exactly predicting the generalization error from the key parameters of the problem, such as the sampling ratio β, the regularizer, the output function, and the distributions of the true weights and noise. We provide several examples, including linear regression, logistic regression, and SVM, in Appendix G, where we also recover the result of (Hastie et al., 2019).
Remarks on Assumptions.
Note that Assumption 1 is satisfied in many practical cases. For example, it can be verified that it is satisfied in the case when and are convex. Assumption 2 is somewhat more restrictive in that it requires that the ML-VAMP algorithm converges. The convergence properties of ML-VAMP are discussed in (Fletcher et al., 2016). The ML-VAMP algorithm may not always converge, and characterizing conditions under which convergence is possible is an open question. However, experiments in (Rangan et al., 2019) show that the algorithm does indeed often converge, and in these cases, our analysis applies. In any case, we will see below that the predictions from Theorem 1 agree closely with numerical experiments in several relevant cases.
5 Experiments
Training and Test Distributions.
We validate our theoretical results on a number of synthetic data experiments. For all the experiments, the training and test data is generated following the model in Section 2. We generate the training and test eigenvalues as i.i.d. with lognormal distributions,
where are bivariate zero-mean Gaussian with
In the case when , we obtain eigenvalues that are equal, corresponding to the i.i.d. case. With we can model correlated features. Also, when the correlation coefficient , , so there is no training and test mismatch. However, we can also select to experiment with cases when the training and test distributions differ. In the examples below, we consider the following three cases:
- (1) i.i.d. features ();
- (2) correlated features with matching training and test distributions ( dB, ); and
- (3) correlated features with train-test mismatch ( dB, ).
For all experiments below, the true model coefficients are generated as i.i.d. Gaussian and we use standard L2-regularization with some regularization parameter. Our framework can incorporate arbitrary i.i.d. distributions on the coefficients and arbitrary regularizers, but we illustrate just the Gaussian case with L2-regularization here.
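A sketch of how correlated, possibly mismatched train/test eigenvalues can be generated as jointly lognormal pairs. The parameter values, the dB-to-log-scale conversion, and the normalization to an average eigenvalue of 1/p are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
p, sigma_db, rho = 1000, 3.0, 0.5      # illustrative values; rho = 1 gives matched train/test

# Bivariate zero-mean Gaussian (z_tr, z_ts) with correlation rho, exponentiated to
# give jointly lognormal training/test eigenvalues.
sig = np.log(10.0) * sigma_db / 10.0   # assumed dB-to-natural-log conversion
cov = sig ** 2 * np.array([[1.0, rho], [rho, 1.0]])
z = rng.multivariate_normal(np.zeros(2), cov, size=p)
s_tr, s_ts = np.exp(z[:, 0]), np.exp(z[:, 1])

# Normalize so the average eigenvalue is 1/p, keeping the total variance O(1).
s_tr /= p * s_tr.mean()
s_ts /= p * s_ts.mean()
print("train/test eigenvalue correlation:", np.corrcoef(s_tr, s_ts)[0, 1])
```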
Under-regularized linear regression.
We first consider the case of under-regularized linear regression, where the output channel is linear with additive Gaussian noise. The noise variance is set for an SNR level of 10 dB. We use a standard mean-squared error (MSE) output loss. Since we are using the L2-regularizer, the minimization (2) is standard ridge regression. Moreover, with an appropriately chosen regularization parameter, the ridge regression estimate would correspond to the minimum mean-squared error (MMSE) estimate of the coefficients. However, to study the under-regularized regime, we take a much smaller regularization parameter.
Fig. 2 plots the test MSE for the three cases described above for the linear model. In the figure, we fix the number of features and vary the number of samples N from the over-parametrized to the under-parametrized regime. For each value of N, we take 100 random instances of the model, compute the ridge regression estimate using the sklearn package, and measure the test MSE on 1000 independent test samples. The simulated values in Fig. 2 are the median test error over the 100 random trials. The test MSE is plotted in a normalized dB scale,
Also plotted is the state evolution (SE) theoretical test MSE from Theorem 1.
[Figure 2 appears here: test MSE versus number of samples for under-regularized linear regression.]
In all three cases in Fig. 2, the SE theory exactly matches the simulated values for the test MSE. Note that the case of matched training and test distributions for this problem was studied in (Hastie et al., 2019; Mei & Montanari, 2019; Montanari et al., 2019), and we see the double descent phenomenon described in their work. Specifically, with highly under-regularized linear regression, the test MSE actually increases with more samples in the over-parametrized regime (N < p) and then decreases again in the under-parametrized regime (N > p).
Our SE theory can also provide predictions for the correlated feature case. In this particular setting, we see that in the correlated case the test error is slightly lower in the over-parametrized regime since the energy of data is concentrated in a smaller sub-space. Interestingly, there is minimal difference between the correlated and i.i.d. cases for the under-parametrized regime when the training and test data match. When the training and test data are not matched, the test error increases. In all cases, the SE theory can accurately predict these effects.
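The simulation just described can be reproduced in spirit with a few lines of sklearn. The sketch below uses a single trial per sample size, i.i.d. features, and an assumed normalization for the dB scale, so it only qualitatively mirrors Fig. 2 (including the error peak near N = p).

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
p, N_test, snr_db, lam = 500, 1000, 10.0, 1e-6

def test_mse_db(N):
    """One trial of the under-regularized ridge experiment with i.i.d. features."""
    w0 = rng.normal(size=p) / np.sqrt(p)           # unit signal power for <x, w0>
    noise_var = 10.0 ** (-snr_db / 10.0)           # 10 dB SNR
    X = rng.normal(size=(N, p))
    y = X @ w0 + np.sqrt(noise_var) * rng.normal(size=N)
    model = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
    X_ts = rng.normal(size=(N_test, p))
    y_ts = X_ts @ w0 + np.sqrt(noise_var) * rng.normal(size=N_test)
    mse = np.mean((y_ts - model.predict(X_ts)) ** 2)
    return 10.0 * np.log10(mse / np.mean(y_ts ** 2))   # assumed normalization for the dB scale

for N in [100, 250, 450, 500, 550, 1000, 2000]:        # sweep from over- to under-parametrized
    print(N, round(test_mse_db(N), 2))
```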
[Figure 3 appears here: test error rate for logistic regression.]
Logistic Regression.
Fig. 3 shows a plot similar to Fig. 2 for a logistic model. Specifically, we use a logistic output, a binary cross-entropy output loss, and an L2-regularization level chosen so that the estimate corresponds to the MAP estimate (we do not perform ridgeless regression in this case). The mean of the training and test eigenvalues is selected such that, if the true coefficients were known, we could obtain a 5% prediction error. As in the linear case, we generate random instances of the model, use the sklearn package to perform the logistic regression, and evaluate the estimates on 1000 new test samples. We compute the median error rate (one minus accuracy) and compare the simulated values with the SE theoretical estimates. The i.i.d. case was considered in (Salehi et al., 2019). Fig. 3 shows that our SE theory is able to predict the test error rate exactly in the i.i.d. case along with a correlated case and a case with training and test mismatch.
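A minimal version of this logistic experiment, restricted to i.i.d. features and a single trial, with an assumed feature scale in place of the eigenvalue tuning described above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
p, N, N_test = 200, 400, 1000

# i.i.d. Gaussian features; the scale is an assumed choice that keeps the
# noiseless (known-coefficient) error rate small.
w0 = rng.normal(size=p) / np.sqrt(p)
scale = 4.0
X = scale * rng.normal(size=(N, p))
y = (rng.uniform(size=N) < 1 / (1 + np.exp(-X @ w0))).astype(int)

# L2-regularized logistic regression; C is the inverse regularization strength.
clf = LogisticRegression(C=1.0, fit_intercept=False, max_iter=1000).fit(X, y)

X_ts = scale * rng.normal(size=(N_test, p))
y_ts = (rng.uniform(size=N_test) < 1 / (1 + np.exp(-X_ts @ w0))).astype(int)
print("test error rate:", np.mean(clf.predict(X_ts) != y_ts))
```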
[Figure 4 appears here: test MSE for the nonlinear regression problem.]
Nonlinear Regression.
The SE framework can also consider non-convex problems. As an example, we consider a non-linear regression problem where the output function is
(34) |
This models saturation in the output. Corresponding to this output, we use a non-linear MSE output loss
(35) |
This output loss is non-convex. We scale the data matrix so that the input to the output non-linearity is driven well into the saturated regime. We also use L2-regularization as in the previous experiments.
For the simulation, the non-convex loss is minimized using Tensorflow where the non-linear model is described as a two-layer model. We use the ADAM optimizer (Kingma & Ba, 2014) with 200 epochs to approach a local minimum of the objective (2). Fig. 4 plots the median test MSE for the estimate along with the SE theoretical test MSE. We again see that the SE theory is able to predict the test MSE in all cases even for this non-convex problem.
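As a self-contained stand-in for the Tensorflow/ADAM setup, the sketch below fits a saturating model by plain gradient descent on the non-convex loss. The tanh output, the data scaling, and the step size are assumptions for illustration and may differ from the specific choices in (34)-(35).

```python
import numpy as np

rng = np.random.default_rng(6)
p, N, lr, epochs, lam = 100, 300, 0.01, 3000, 1e-3
phi = np.tanh                                   # assumed saturating output non-linearity

w0 = rng.normal(size=p) / np.sqrt(p)
X = 3.0 * rng.normal(size=(N, p))               # scaled so <x, w0> is driven into saturation
y = phi(X @ w0) + 0.05 * rng.normal(size=N)

# Gradient descent on (1/2N) sum_n (y_n - phi(<x_n, w>))^2 + (lam/2) ||w||^2.
w = np.zeros(p)
for _ in range(epochs):
    z = X @ w
    resid = phi(z) - y
    grad = X.T @ (resid * (1 - np.tanh(z) ** 2)) / N + lam * w
    w -= lr * grad

X_ts = 3.0 * rng.normal(size=(1000, p))
y_ts = phi(X_ts @ w0) + 0.05 * rng.normal(size=1000)
print("test MSE:", np.mean((y_ts - phi(X_ts @ w)) ** 2))
```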
6 Conclusions
In this paper we provide a procedure for exactly computing the asymptotic generalization error of a solution in a generalized linear model (GLM). This procedure is based on scalar quantities which are fixed points of a recursive iteration. The formula holds for a large class of generalization metrics, loss functions, and regularization schemes. Our formula allows analysis of important modeling effects such as (i) overparameterization, (ii) dependence between covariates, and (iii) mismatch between train and test distributions, which play a significant role in the analysis and design of machine learning systems. We experimentally validate our theoretical results for linear as well as non-linear regression and logistic regression, where a strong agreement is seen between our formula and simulated results.
References
- Advani & Saxe (2017) Advani, M. S. and Saxe, A. M. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.
- Allen-Zhu et al. (2019) Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in Neural Information Processing Systems, pp. 6155–6166, 2019.
- Arora et al. (2019) Arora, S., Du, S. S., Hu, W., Li, Z., and Wang, R. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
- Barbier et al. (2019) Barbier, J., Krzakala, F., Macris, N., Miolane, L., and Zdeborová, L. Optimal errors and phase transitions in high-dimensional generalized linear models. Proc. National Academy of Sciences, 116(12):5451–5460, 2019.
- Bartlett et al. (2019) Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. Benign overfitting in linear regression. arXiv preprint arXiv:1906.11300, 2019.
- Bayati & Montanari (2011) Bayati, M. and Montanari, A. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Trans. Inform. Theory, 57(2):764–785, February 2011.
- Belkin et al. (2018) Belkin, M., Ma, S., and Mandal, S. To understand deep learning we need to understand kernel learning. arXiv preprint arXiv:1802.01396, 2018.
- Belkin et al. (2019a) Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proc. National Academy of Sciences, 116(32):15849–15854, 2019a.
- Belkin et al. (2019b) Belkin, M., Hsu, D., and Xu, J. Two models of double descent for weak features. arXiv preprint arXiv:1903.07571, 2019b.
- Cakmak et al. (2014) Cakmak, B., Winther, O., and Fleury, B. H. S-AMP: Approximate message passing for general matrix ensembles. In Proc. IEEE ITW, 2014.
- Daniely (2017) Daniely, A. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pp. 2422–2430, 2017.
- Daniely et al. (2016) Daniely, A., Frostig, R., and Singer, Y. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances In Neural Information Processing Systems, pp. 2253–2261, 2016.
- Deng et al. (2019) Deng, Z., Kammoun, A., and Thrampoulidis, C. A model of double descent for high-dimensional binary linear classification. arXiv preprint arXiv:1911.05822, 2019.
- Donoho et al. (2009) Donoho, D. L., Maleki, A., and Montanari, A. Message-passing algorithms for compressed sensing. Proc. National Academy of Sciences, 106(45):18914–18919, 2009.
- Donoho et al. (2010) Donoho, D. L., Maleki, A., and Montanari, A. Message passing algorithms for compressed sensing. In Proc. Inform. Theory Workshop, pp. 1–5, 2010.
- Du et al. (2018) Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
- Fletcher et al. (2016) Fletcher, A., Sahraee-Ardakan, M., Rangan, S., and Schniter, P. Expectation consistent approximate inference: Generalizations and convergence. In Proc. IEEE Int. Symp. Information Theory (ISIT), pp. 190–194. IEEE, 2016.
- Fletcher et al. (2018) Fletcher, A. K., Rangan, S., and Schniter, P. Inference in deep networks in high dimensions. Proc. IEEE Int. Symp. Information Theory, 2018.
- Gabrié et al. (2018) Gabrié, M., Manoel, A., Luneau, C., Barbier, J., Macris, N., Krzakala, F., and Zdeborová, L. Entropy and mutual information in models of deep neural networks. In Proc. NIPS, 2018.
- Gerbelot et al. (2020) Gerbelot, C., Abbara, A., and Krzakala, F. Asymptotic errors for convex penalized linear regression beyond gaussian matrices. arXiv preprint arXiv:2002.04372, 2020.
- Givens et al. (1984) Givens, C. R., Shortt, R. M., et al. A class of Wasserstein metrics for probability distributions. The Michigan Mathematical Journal, 31(2):231–240, 1984.
- Hastie et al. (2019) Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.
- He et al. (2017) He, H., Wen, C.-K., and Jin, S. Generalized expectation consistent signal recovery for nonlinear measurements. In 2017 IEEE International Symposium on Information Theory (ISIT), pp. 2333–2337. IEEE, 2017.
- Jacot et al. (2018) Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pp. 8571–8580, 2018.
- Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Ma & Ping (2017) Ma, J. and Ping, L. Orthogonal AMP. IEEE Access, 5:2020–2033, 2017.
- Manoel et al. (2018) Manoel, A., Krzakala, F., Varoquaux, G., Thirion, B., and Zdeborová, L. Approximate message-passing for convex optimization with non-separable penalties. arXiv preprint arXiv:1809.06304, 2018.
- Mei & Montanari (2019) Mei, S. and Montanari, A. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.
- Montanari et al. (2019) Montanari, A., Ruan, F., Sohn, Y., and Yan, J. The generalization error of max-margin linear classifiers: High-dimensional asymptotics in the overparametrized regime. arXiv preprint arXiv:1911.01544, 2019.
- Muthukumar et al. (2019) Muthukumar, V., Vodrahalli, K., and Sahai, A. Harmless interpolation of noisy data in regression. In 2019 IEEE International Symposium on Information Theory (ISIT), pp. 2299–2303. IEEE, 2019.
- Neyshabur et al. (2018) Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., and Srebro, N. Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076, 2018.
- Opper & Winther (2005) Opper, M. and Winther, O. Expectation consistent approximate inference. Journal of Machine Learning Research, 6(Dec):2177–2204, 2005.
- Pandit et al. (2019) Pandit, P., Sahraee, M., Rangan, S., and Fletcher, A. K. Asymptotics of MAP inference in deep networks. In Proc. IEEE Int. Symp. Information Theory, pp. 842–846, 2019.
- Pandit et al. (2019) Pandit, P., Sahraee-Ardakan, M., Rangan, S., Schniter, P., and Fletcher, A. K. Inference with deep generative priors in high dimensions. arXiv preprint arXiv:1911.03409, 2019.
- Rangan et al. (2019) Rangan, S., Schniter, P., and Fletcher, A. K. Vector approximate message passing. IEEE Trans. Information Theory, 65(10):6664–6684, 2019.
- Reeves (2017) Reeves, G. Additivity of information in multilayer networks via additive gaussian noise transforms. In Proc. 55th Annual Allerton Conf. Communication, Control, and Computing (Allerton), pp. 1064–1070. IEEE, 2017.
- Salehi et al. (2019) Salehi, F., Abbasi, E., and Hassibi, B. The impact of regularization on high-dimensional logistic regression. arXiv preprint arXiv:1906.03761, 2019.
- Tulino et al. (2004) Tulino, A. M., Verdú, S., et al. Random matrix theory and wireless communications. Foundations and Trends® in Communications and Information Theory, 1(1):1–182, 2004.
- Villani (2008) Villani, C. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
- Zhang et al. (2016) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
Appendix A Empirical Convergence of Vector Sequences
Definition 1 (Pseudo-Lipschitz continuity).
For a given , a function is called Pseudo-Lipschitz of order if
(36) |
for some constant .
Observe that pseudo-Lipschitz continuity of order 1 is equivalent to the standard definition of Lipschitz continuity.
Definition 2 (Uniform Lipschitz continuity).
Let be a function on and . We say that is uniformly Lipschitz continuous in at if there exists constants and an open neighborhood of such that
(37) |
for all and ; and
(38) |
for all and .
Definition 3 (Empirical convergence of a sequence).
Consider a sequence of vectors with . So, each is a block vector with a total of components. For a finite , we say that the vector sequence converges empirically with -th order moments if there exists a random variable such that
-
(i)
; and
-
(ii)
for any that is pseudo-Lipschitz continuous of order ,
(39)
In this case, with some abuse of notation, we will write
(40) |
where we have omitted the dependence on in . We note that the sequence can be random or deterministic. If it is random, we will require that for every pseudo-Lipschitz function , the limit (39) holds almost surely. In particular, if are i.i.d. and , then empirically converges to with order moments.
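A quick numerical illustration of Definition 3 for i.i.d. Gaussian draws and one pseudo-Lipschitz test function of order 2; the particular function and distribution are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(7)

# For i.i.d. draws of W ~ N(0, 1) and a pseudo-Lipschitz function f of order 2,
# the empirical average over components converges to E f(W).
f = lambda w: w ** 2 + np.abs(w)
exact = 1.0 + np.sqrt(2.0 / np.pi)          # E[W^2] + E|W| for W ~ N(0, 1)
for N in [100, 10_000, 1_000_000]:
    w = rng.normal(size=N)
    print(N, np.mean(f(w)), exact)
```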
Appendix B ML-VAMP Denoisers Details
Related to and from equation (11), we need to define two quantities and that are zero-padded versions of the singular values , so that for , we set . Observe that are eigenvalues of whereas are eigenvalues of . Since empirically converges to as given in (12), the vector empirically converges to random variable whereas the vector empirically converges to random variable , where a mass is placed at appropriately. Specifically, has a point mass of when , whereas has a point mass of when In Appendix H (eqn. (113)), we provide the densities over positive parts of and .
A key property of our analysis will be that the non-linear functions (20) and the denoisers have simple forms.
Non-linear functions : The non-linear functions all act componentwise. For example, for in (20), we have
where is the scalar-valued function,
(41) |
Similarly, for ,
where is the zero-padded version of , and
(42) |
Finally, the function in (20) acts componentwise with
(43) |
Input denoiser : Since , and given in (6), the denoiser (25a) acts componentwise in that,
where is the scalar-valued function,
(44) |
Thus, the vector optimization in (25a) reduces to a set of scalar optimizations (44) on each component.
Output denoiser : The output penalty where has the separable form (6). Thus, similar to the case of , the denoiser in (25b) also acts componentwise with the function,
(45) |
Linear denoiser : The expressions for both denoisers and are very similar and can be explained together. The penalty restricts , where is a square matrix. Hence, for the minimization in (27) is given by,
(46) |
and . This is a simple quadratic minimization and the components of and are given by
where
(47a) | ||||
(47b) |
Linear denoiser : This denoiser is identical to the case in that we need to impose the linear constraint . However is in general a rectangular matrix and the two resulting cases of needs to be treated separately.
Appendix C State Evolution Analysis of ML-VAMP
A key property of the ML-VAMP algorithm is that its performance in the LSL can be exactly described by a scalar equivalent system. In the scalar equivalent system, the vector-valued outputs of the algorithm are replaced by scalar random variables representing the typical behavior of the components of the vectors in the large-scale-limit (LSL). Each of the random variables are described by a set of parameters, where the parameters are given by a set of deterministic equations called the state evolution or SE.
The SE for the general ML-VAMP algorithm are derived in (Pandit et al., 2019) and the special case of the updates for ML-VAMP for GLM learning are shown in Algorithm 2 with details of functions in Appendix B. We see that the SE updates in Algorithm 2 parallel those in the ML-VAMP algorithm Algo. 1, except that vector quantities such as , , and are replaced by scalar random variables such as , , and . Each of these random variables are described by the deterministic parameters such as , and , .
The updates in the section labeled as “Initial”, provide the scalar equivalent model for the true system (18). In these updates, represent the limits of the vectors in (19). That is,
(49) |
Due to assumptions in Section 2, we have that the components of converge empirically as,
(50) |
So, the represent the asymptotic distribution of the components of the vectors .
The updates in sections labeled “Forward pass” and “Backward pass” in the SE equations in Algorithm 2 parallel those in Algorithm 1. The key quantities in these SE equations are the error variables,
which represent the errors of the estimates to the inputs of the denoisers. We will also be interested in their transforms,
The following Theorem is an adapted version of the main result from (Pandit et al., 2019) to the iterates of Algorithms 1 and 2.
Theorem 2.
Consider the outputs of the ML-VAMP for GLM Learning Algorithm under the assumptions of Section 2. Assume the denoisers satisfy the continuity conditions in Assumption 1. Also, assume that the outputs of the SE satisfy
for all and . Suppose Algo. 1 is initialized so that the following convergence holds
where are independent zero-mean Gaussians, independent of . Then,
- (a) For any fixed iteration in the forward direction and layer , we have that, almost surely,
(51) (52) where the variables on the right-hand side of (51) and (52) are the outputs of the SE equations in Algorithm 2. Similar equations hold for with the appropriate variables removed.
- (b) Similarly, in the reverse direction, for any fixed iteration and layer , we have that, almost surely,
(53) (54)
Furthermore, and are independent.
Proof.
This is a direct application of Theorem 3 from (Pandit et al., 2019) to the iterations of Algorithm 1. The convergence result in (Pandit et al., 2019) requires the uniform Lipschitz continuity of the denoisers. Assumption 1 provides the required uniform Lipschitz continuity assumption on the input and output denoisers. For the “middle” layers, the denoisers are linear and the uniform continuity assumption is valid since we have made the additional assumption that the relevant terms are bounded almost surely.
A key use of the theorem is to compute asymptotic empirical limits. Specifically, for a componentwise function, consider the empirical average of its values over the components. The above theorem then states that for any componentwise pseudo-Lipschitz function of order 2, as N → ∞, we have the following two properties
That is, we can compute the empirical average over components with the expected value of the random variable limit. This convergence is key to proving Theorem 1.
Appendix D Empirical Convergence of Fixed Points
A consequence of Assumption 2 is that we can take the limit of the random variables in the SE algorithm. Specifically, let be any set of outputs from the ML-VAMP for GLM Learning Algorithm under the assumptions of Theorem 2. Under Assumption 2, for each , there exists a vector
(55) |
representing the limit over . For each , Theorem 2 shows there also exists a random vector limit,
(56) |
representing the limit over . The following proposition shows that we can take the limits of the random variables .
Proposition 1.
Consider the outputs of the ML-VAMP for GLM Learning Algorithm under the assumptions of Theorem 2 and Assumption 2. Let be any set of outputs from the algorithm and let be its limit from (55) and let be the random variable limit (56). Then, there exists a random variable such that, for any pseudo-Lipschitz continuous ,
(57) |
In addition, the SE parameter limits and converge to limits,
(58) |
The proposition shows that, under the convergence assumption, Assumption 2, we can take the limits as of the random variables from the SE. To prove the proposition we first need the following simple lemma.
Lemma 1.
If and are sequences such that
(59) |
then, there exists a constant such that,
(60) |
In particular, the two limits in (60) exist.
Proof.
Proof of Proposition 1
Let be any pseudo-Lipschitz function of order 2, and define,
(63) |
Their difference can be written as,
(64) |
where
(65) | ||||
(66) |
Since converges to , we have,
(67) |
For the term ,
(68) |
where (a) follows from applying the triangle inequality to the definition of in (65); (b) follows from the definition of pseudo-Lipschitz continuity in Definition 1, where is the Lipschitz constant and
and (c) follows from the RMS-AM inequality:
By (29), we have that,
Hence, from (68), it follows that,
(69) |
Substituting (67) and (69) into (64) shows that and satisfy (59). Therefore, applying Lemma 1, we have that for any pseudo-Lipschitz function , there exists a limit such that,
(70) |
In particular, the two limits in (70) exist. When restricted to continuous, bounded functions with the supremum norm, it is easily verified that is a positive, linear, bounded function of , with . Therefore, by the Riesz representation theorem, there exists a random variable such that . This fact in combination with (70) shows (57).
It remains to prove the parameter limits in (58). We prove the result for the parameter ; the proof for the other parameters is similar. Using Stein’s lemma, it is shown in (Pandit et al., 2019) that
(71) |
Since the numerator and denominator of (71) are functions we have that the limit,
(72) |
where and are the limits of and . This completes the proof. ∎
Appendix E Proof of Theorem 1
From Assumption 2, we know that for every , every group of vectors converge to limits, . The parameters, , also converge to limits for all . By the continuity assumptions on the functions , the limits and are fixed points of the algorithms.
A proof similar to that in (Pandit et al., 2019) shows that the fixed points and satisfy the KKT condition of the constrained optimization (22). This proves part (a).
The estimate is the limit,
Also, the true parameter is . By Proposition 1, we have that the limits of these variables are
From line 15 of the SE Algorithm 2, we have
This proves part (b).
To prove part (c), we use the limit
(73) |
Since the fixed points are critical points of the constrained optimization (22), . We also have . Therefore,
(74) |
Here, the subscript denotes the dependence on N. Since , the pair is a zero-mean bivariate Gaussian with covariance matrix
The empirical convergence (73) yields the following limit,
(75) |
It suffices to show that the distribution of converges to the distribution of in the Wasserstein-2 metric as (See the discussion in Appendix A on the equivalence of convergence in Wasserstein-2 metric and PL(2) convergence.)
Now, the Wasserstein-2 distance between two probability measures and is defined as
(76) |
where is the set of probability distributions on the product space with marginals consistent with and . For Gaussian measures and we have (Givens et al., 1984)
Therefore, for the Gaussian distributions above, the convergence (75) implies convergence in the Wasserstein-2 distance. Hence,
where is the covariance matrix in (75). Hence the convergence holds in the PL(2) sense (see discussion in Appendix A on the equivalence of convergence in and PL(2) convergence).
Appendix F Formula for
For the special cases in the next appendix, it is useful to derive expressions for the entries of the covariance matrix in (75). For the term ,
(78) |
where we have used the fact that . Next, where,
(79) |
where are independent of . Hence,
(80) |
since and is the covariance matrix of from line 23.
Finally for we have,
(81) |
Appendix G Special Cases
G.1 Linear Output with Square Error
In this section we examine a few special cases of the GLM problem (2). We first consider a linear output with additive Gaussian noise and a squared error training and test loss. Specifically, consider the model,
(82) |
We consider estimates of such that:
(83) |
The factor is added above since the two terms scale with a ratio of ; it does not change the analysis. Consider the ML-VAMP GLM learning algorithm applied to this problem. The following corollary follows from the main result, Theorem 1.
Corollary 1 (Squared error).
For linear regression, i.e., , , we have
The quantities , depend on the choice of regularizer and the covariance between features.
Proof.
G.2 Ridge Regression with i.i.d. Covariates
We next consider the special case when the input features are independent, i.e., problem (83) where the rows of the data matrix corresponding to the training data have i.i.d. Gaussian features.
Although the solution to (83) exists in closed form , we can study the effect of the regularization parameter on the generalization error as detailed in the result below.
Corollary 2.
Consider the ridge regression problem (83) with regularization parameter . For the squared loss, i.i.d. Gaussian features, and no train-test mismatch, the generalization error is given by Corollary 1, with constants
where , with given in Appendix H, and where is given in equation (95) in the proof.
Proof of Corollary 2.
We are interested in identifying the following constants appearing in Corollary 1:
These quantities are obtained as fixed points of the State Evolution Equations in Algo. 2. We explain below how to obtain expressions for these constants. Since these are fixed points we ignore the subscript corresponding to the iteration number in Algo. 2.
In the case of problem (83), the maps and , i.e., and respectively, can be expressed as closed-form formulae. This leads to simplification of the SE equations as explained below.
We start by looking at the forward pass (finding quantities with superscript ’+’) of Algorithm 2 for different layers and then the backward pass (finding quantities with superscript ’-’) to get the parameters for .
To begin with, notice that , and therefore the denoiser in (44) is simply,
Using the random variable and substituting in the expression of the denoiser to get , we can now calculate using lines 20 and 22,
(84) |
Similarly, we have , whereby the output denoiser in the last layer for ridge regression is given by,
(85) |
By substituting this denoiser in line 30 of the algorithm we get and thus, following the lines 35-38 of the algorithm we have
(86) |
Having identified these constants , we will now sequentially identify the quantities
in the forward pass, and then the quantities
in the backward pass.
We also note that we have
(87) |
Forward Pass:
Backward Pass:
Since , line 36 of algorithm on simplification yields , whereby we can get ,
(91) |
Next, to calculate the terms , we use the denoiser defined in (47a) for line 33 of the algorithm to get .
(92) |
where we have used due to (90), and from lines 17, 32 and 4 respectively.
Then, we calculate and as This gives,
(93) |
Here, in the overparameterized case , the denoiser outputs with probability and with probability .
Next, from line 37 we get,
(94) |
Now from line 36 and equation (87) we get,
(95) |
where (a) follows from (90) and (87), and (b) follows from (92). From this one can obtain which can be calculated using the knowledge that are independent Gaussian with covariances , . Further, are independent of .
Observe that by (95) we have
(96) |
with some simplification we get
(97a) | ||||
(97b) |
where , with given in Appendix H, and is the derivative of calculated at .
Now consider the under-parametrized case ():
Let and . In this case we have
(98) |
Note that,
(99a) |
where is the R-transform defined in (Tulino et al., 2004) and (a) follows from the relationship between the R- and Stieltjes-transform and (b) follows from the fact that for Marchenko-Pastur distribution we have . Therefore,
(100) |
For the over-parametrized case () we have:
(101) |
In this case, as mentioned in Appendix H and following the results from (Tulino et al., 2004), the measure scales with and thus . Therefore, similar to (99a), satisfies
(102) |
Now can be calculated as follows:
(103) |
where
(104) |
and is the solution to the fixed points
(105) |
G.3 Ridgeless Linear Regression
Here we consider the case of ridge regression (83) when the regularization parameter tends to zero. Note that the solution to problem (83) remains unique in this limit. The following result was stated in (Hastie et al., 2019) and can be recovered using our methodology. Note, however, that we calculate the generalization error whereas they calculated the squared error, whereby we obtain an additional additive term. The result explains the double descent phenomenon for ridgeless linear regression.
Corollary 3.
For ridgeless linear regression, we have
Proof of Corollary 3.
We calculate the parameters , and when . Before starting off, we note that
(106) |
as described in Appendix H. Following the derivations in Corollary 2, we have
(107) |
Now for we have
(108) |
Using this in simplifying (95) for , we get
where during the evaluation of , for the case of we need to account for the point mass at for with weight .
G.4 Train-Test Mismatch
Observe that our formulation allows for analyzing the effect of mismatch in the training and test distribution. One can consider arbitrary joint distributions over that model the mismatch between training and test features. Here we give a simple example which highlights the effect of this mismatch.
Definition 4 (Bernoulli -mismatch).
has a bivariate Bernoulli distribution with
-
2
-
Notice that the marginal distributions in the Bernoulli mismatch model are such that half of the features extracted by the eigenvector matrix are relevant during training, and similarly for the test data. However, the features spanned by the test data do not exactly overlap with the features captured in the training data, and the fraction of features common to both the training and test data depends on the mismatch parameter. Hence, at one extreme value of the parameter there is no training-test mismatch, whereas at the other there is a complete mismatch.
The following result shows that the generalization error increases linearly with the mismatch parameter.
Corollary 4 (Mismatch).
Proof.
The quantities and in the result above can be calculated similar to the derivation in the proof of Corollary 2 and can in general depend on the regularization parameter and overparameterization parameter .
G.5 Logistic Regression
A precise analysis of the regularized logistic regression estimator with i.i.d. Gaussian features is provided in (Salehi et al., 2019). Consider the logistic regression model,
where is the standard logistic function.
In this problem we consider estimates of such that
where is the regularization function. This is a special case of the optimization problem (2) where
(109) |
Similar to the linear regression model, using the ML-VAMP GLM learning algorithm, we can characterize the generalization error for this model with quantities given by algorithm 2. We note that in this case, the output non-linearity is
(110) |
where . Also, the denoisers , and can be derived as the proximal operators of , and defined in (25).
G.6 Support Vector Machines
The asymptotic generalization error for support vector machines (SVM) is provided in (Deng et al., 2019). Our model can also handle SVMs. Similar to logistic regression, SVM finds a linear classifier using the hinge loss instead of the logistic loss. Assuming the class labels are ±1, the hinge loss is
(111) |
Therefore, if we take
(112) |
where x_n is the n-th row of the data matrix, the ML-VAMP algorithm for GLMs finds the SVM classifier. The algorithm uses the proximal map of the hinge loss, and our theory provides exact predictions for the estimation and prediction error of SVM.
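As an illustration, the scalar proximal map of the hinge loss has a simple closed form. The sketch below implements it and checks it against a generic scalar minimizer; the parameterization (a single quadratic weight γ) is a generic choice and may differ from the one used inside Algorithm 1.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def prox_hinge(y, v, gamma):
    """Proximal operator of the hinge loss f(z) = max(0, 1 - y*z), y in {-1, +1}:
    argmin_z max(0, 1 - y*z) + (gamma/2) * (z - v)**2, applied componentwise."""
    u = y * v
    if u >= 1.0:
        return v                      # loss is flat (zero) around v
    if u <= 1.0 - 1.0 / gamma:
        return v + y / gamma          # on the linear part of the hinge
    return y * 1.0                    # at the kink of the hinge

# Numerical sanity check against a generic scalar minimizer.
y, v, gamma = -1.0, 0.3, 2.0
obj = lambda z: max(0.0, 1.0 - y * z) + 0.5 * gamma * (z - v) ** 2
print(prox_hinge(y, v, gamma), minimize_scalar(obj).x)
```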
As with all other models considered in this work, the true underlying data generating model could be anything that can be represented by the graphical model in Figure 1, e.g. logistic or probit model, and our theory is able to exactly predict the error when SVM is applied to learn such linear classifiers in the large system limit.
Appendix H Marchenko-Pastur distribution
We describe the random variable defined in (12) where has a rescaled Marchenko-Pastur distribution. Notice that the positive entries of are the positive eigenvalues of (or ).
Observe that , whereas the standard scaling when studying the Marchenko-Pastur distribution is for matrices such that (see, e.g., equation (1.10) from (Tulino et al., 2004) and the discussion preceding it). Also notice that has the same distribution as . Thus the results from (Tulino et al., 2004) apply directly to the distributions of eigenvalues of and . We state their result below, taking into account this disparity in scaling.
The positive eigenvalues of have an empirical distribution which converges to the following density:
(113) |
where , . Similarly the positive eigenvalues of have an empirical distribution converging to the density . We note the following integral which is useful in our analysis:
(114) |
More generally, the Stieltjes transform of the density is given by:
(115) |
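The following sketch checks the Stieltjes transform numerically for the standard (variance-1, ratio β ≤ 1) Marchenko-Pastur law. The normalization differs from the rescaled version used in (113)-(115), so this is only a consistency check of the general formula, not a reproduction of the paper's scaling.

```python
import numpy as np

rng = np.random.default_rng(8)
N, beta = 2000, 0.5
p = int(beta * N)

# Sample covariance with the standard (variance-1) normalization; its eigenvalues
# follow the Marchenko-Pastur law with ratio beta as N, p -> infinity.
X = rng.normal(size=(N, p))
evals = np.linalg.eigvalsh(X.T @ X / N)

# Stieltjes transform at a point z off the support: empirical average of
# 1/(lambda - z) versus numerical integration of the MP density.
z = -1.0
empirical = np.mean(1.0 / (evals - z))

s_minus, s_plus = (1 - np.sqrt(beta)) ** 2, (1 + np.sqrt(beta)) ** 2
s = np.linspace(s_minus, s_plus, 200_001)[1:-1]
density = np.sqrt((s_plus - s) * (s - s_minus)) / (2 * np.pi * beta * s)
theory = np.sum(density / (s - z)) * (s[1] - s[0])
print(empirical, theory)
```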