
Generalization Error of Generalized Linear Models in High Dimensions

Melikasadat Emami    Mojtaba Sahraee-Ardakan    Parthe Pandit    Sundeep Rangan    Alyson K. Fletcher
Abstract

At the heart of machine learning lies the question of generalizability of learned rules over previously unseen data. While over-parameterized models based on neural networks are now ubiquitous in machine learning applications, our understanding of their generalization capabilities is incomplete. This task is made harder by the non-convexity of the underlying learning problems. We provide a general framework to characterize the asymptotic generalization error for single-layer neural networks (i.e., generalized linear models) with arbitrary non-linearities, making it applicable to regression as well as classification problems. This framework enables analyzing the effect of (i) over-parameterization and non-linearity during modeling; and (ii) choices of loss function, initialization, and regularizer during learning. Our model also captures mismatch between training and test distributions. As examples, we analyze a few special cases, namely linear regression and logistic regression. We are also able to rigorously and analytically explain the double descent phenomenon in generalized linear models.

Keywords: Machine Learning, Generalization Error, Double Descent

1 Introduction

A fundamental goal of machine learning is generalization: the ability to draw inferences about unseen data from finite training examples. Methods to quantify the generalization error are therefore critical in assessing the performance of any machine learning approach.

This paper seeks to characterize the generalization error for a class of generalized linear models (GLMs) of the form

$y=\phi_{\rm out}(\langle\bm{x},\bm{w}^{0}\rangle,d),$ (1)

where $\bm{x}\in{\mathbb{R}}^{p}$ is a vector of input features, $y$ is a scalar output, $\bm{w}^{0}\in{\mathbb{R}}^{p}$ are the weights to be learned, $\phi_{\rm out}(\cdot)$ is a known link function, and $d$ is random noise. The notation $\langle\bm{x},\bm{w}^{0}\rangle$ denotes an inner product. We use the superscript "0" to denote the "true" values, in contrast to estimated or postulated quantities. The output may be continuous or discrete to model either regression or classification problems.

We measure the generalization error in a standard manner: we are given training data $(\bm{x}_{i},y_{i})$, $i=1,\ldots,N$, from which we learn a parameter estimate $\widehat{\bm{w}}$ via a regularized empirical risk minimization of the form

$\widehat{\bm{w}}=\operatorname*{argmin}_{\bm{w}}F_{\rm out}(\bm{y},\mathbf{X}\bm{w})+F_{\rm in}(\bm{w}),$ (2)

where $\mathbf{X}=[\bm{x}_{1}\,\bm{x}_{2}\,\ldots\,\bm{x}_{N}]^{\sf T}$ is the data matrix, $F_{\rm out}$ is some output loss function, and $F_{\rm in}$ is some regularizer on the weights. We are then given a new test sample, $\bm{x}_{\rm ts}$, for which the true and predicted values are given by

$y_{\rm ts}=\phi_{\rm out}(\langle\bm{x}_{\rm ts},\bm{w}^{0}\rangle,d_{\rm ts}),\quad\widehat{y}_{\rm ts}=\phi(\langle\bm{x}_{\rm ts},\widehat{\bm{w}}\rangle),$ (3)

where $d_{\rm ts}$ is the noise in the test sample, and $\phi(\cdot)$ is a postulated inverse link function that may be different from the true function $\phi_{\rm out}(\cdot)$. The generalization error is then defined as the expected loss between $y_{\rm ts}$ and $\widehat{y}_{\rm ts}$ of the form

$\mathbb{E}\,f_{\rm ts}(y_{\rm ts},\widehat{y}_{\rm ts}),$ (4)

for some test loss function $f_{\rm ts}(\cdot)$ such as squared error or prediction error.
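To make this setup concrete, the following sketch instantiates (1)-(4) for a linear link $\phi_{\rm out}(z,d)=z+d$, a squared-error training loss with L2 regularization, and a squared-error test loss; the dimensions, noise level, and regularization strength are illustrative choices rather than those used in the experiments of Section 5.

```python
# Minimal sketch of the setup in (1)-(4), assuming a linear link
# phi_out(z, d) = z + d, a squared-error training loss with L2 regularization,
# and a squared-error test loss; all constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
N, p = 500, 250                       # training samples and features (beta = p/N)
sigma_d, lam = 0.1, 1e-3              # noise level and regularization strength

w0 = rng.normal(size=p)               # "true" weights w^0
X = rng.normal(size=(N, p)) / np.sqrt(p)
y = X @ w0 + sigma_d * rng.normal(size=N)        # y = phi_out(<x, w^0>, d)

# Regularized empirical risk minimization (2): ridge regression,
# argmin_w ||y - Xw||^2/2 + lam*||w||^2/2, solved via the normal equations.
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Generalization error (4): Monte Carlo estimate of E f_ts(y_ts, y_hat_ts)
# with f_ts the squared error and phi the identity.
X_ts = rng.normal(size=(10000, p)) / np.sqrt(p)
y_ts = X_ts @ w0 + sigma_d * rng.normal(size=10000)
y_hat_ts = X_ts @ w_hat
print("test MSE:", np.mean((y_ts - y_hat_ts) ** 2))
```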

Even for this relatively simple GLM model, the behavior of the generalization error is not fully understood. Recent works (Montanari et al., 2019; Deng et al., 2019; Mei & Montanari, 2019; Salehi et al., 2019) have characterized the generalization error of various linear models for classification and regression in certain large random problem instances. Specifically, the number of samples $N$ and the number of features $p$ both grow without bound with their ratio satisfying $p/N\rightarrow\beta\in(0,\infty)$, and the training samples $\bm{x}_{i}$ are drawn randomly. In this limit, the generalization error can be computed exactly. The analysis can explain the so-called double descent phenomenon (Belkin et al., 2019a): in highly under-regularized settings, the test error may initially increase with the number of data samples $N$ before decreasing. See the prior work section below for more details.

Summary of Contributions.

Our main result (Theorem 1) provides a procedure for exactly computing the asymptotic value of the generalization error (4) for GLM models in a certain random high-dimensional regime called the Large System Limit (LSL). The procedure enables the generalization error to be related to key problem parameters including the sampling ratio $\beta=p/N$, the regularizer, the output function, and the distributions of the true weights and noise. Importantly, our result holds under very general settings including: (i) arbitrary test metrics $f_{\rm ts}$; (ii) arbitrary training loss functions $F_{\rm out}$ as well as decomposable regularizers $F_{\rm in}$; (iii) arbitrary link functions $\phi_{\rm out}$; (iv) correlated covariates $\bm{x}$; (v) underparameterized ($\beta<1$) and overparameterized ($\beta>1$) regimes; and (vi) distributional mismatch in training and test data. Section 4 discusses in detail the general assumptions on the quantities $f_{\rm ts}$, $F_{\rm out}$, $F_{\rm in}$, and $\phi_{\rm out}$ under which Theorem 1 holds.

Prior Work.

Many recent works characterize the generalization error of various machine learning models, including special cases of the GLM model considered here. For example, precise asymptotic characterizations of the prediction error of least squares regression have been provided in (Belkin et al., 2019b; Hastie et al., 2019; Muthukumar et al., 2019). The first of these confirmed the double descent curve of (Belkin et al., 2019a) under a Fourier series model and a noisy Gaussian model for data in the over-parameterized regime. The second obtained the same behavior under both linear and non-linear feature models for ridge regression and min-norm least squares using random matrix theory. In addition, (Advani & Saxe, 2017) studied the same setting for deep linear and shallow non-linear networks.

The generalization error of max-margin linear classifiers in the high-dimensional regime was analyzed in (Montanari et al., 2019). The exact expression for the asymptotic prediction error is derived there and, in the specific case of a two-layer neural network with random first-layer weights, the double descent curve was obtained. A similar double descent curve for logistic regression as well as linear discriminant analysis has been reported by (Deng et al., 2019). Random feature learning in the same setting has also been studied for ridge regression in (Mei & Montanari, 2019). The authors show, in particular, that highly over-parameterized estimators with zero training error are statistically optimal at high signal-to-noise ratio (SNR). The asymptotic performance of regularized logistic regression in high dimensions is studied in (Salehi et al., 2019) using the Convex Gaussian Min-max Theorem in the under-parameterized regime. The results in the current paper include all these models as special cases. Bounds on the generalization error of over-parameterized linear models are also given in (Bartlett et al., 2019; Neyshabur et al., 2018).

Although this paper and several other recent works consider only simple linear models and GLMs, much of the motivation is to understand generalization in deep neural networks, where classical intuition may not hold (Belkin et al., 2018; Zhang et al., 2016; Neyshabur et al., 2018). In particular, a number of recent papers have shown the connection between neural networks in the over-parameterized regime and kernel methods. The works (Daniely, 2017; Daniely et al., 2016) showed that gradient descent on over-parameterized neural networks learns a function in the RKHS corresponding to the random feature kernel. The training dynamics of over-parameterized neural networks have been studied by (Jacot et al., 2018; Du et al., 2018; Arora et al., 2019; Allen-Zhu et al., 2019), where it is shown that the learned function lies in an RKHS corresponding to the neural tangent kernel.

Approximate Message Passing.

Our key tool to study the generalization error is approximate message passing (AMP), a class of inference algorithms originally developed in (Donoho et al., 2009, 2010; Bayati & Montanari, 2011) for compressed sensing. We show that the learning problem for the GLM can be formulated as an inference problem on a certain multi-layer network. Multi-layer AMP methods (He et al., 2017; Manoel et al., 2018; Fletcher et al., 2018; Pandit et al., 2019) can then be applied to perform the inference. The specific algorithm we use in this work is the multi-layer vector AMP (ML-VAMP) algorithm of (Fletcher et al., 2018; Pandit et al., 2019) which itself builds on several works (Opper & Winther, 2005; Fletcher et al., 2016; Rangan et al., 2019; Cakmak et al., 2014; Ma & Ping, 2017). The ML-VAMP algorithm is not necessarily the most computationally efficient procedure for the minimization (2). For our purposes, the key property is that ML-VAMP enables exact predictions of its performance in the large system limit. Specifically, the error of the algorithm estimates in each iteration can be predicted by a set of deterministic recursive equations called the state evolution or SE. The fixed points of these equations provide a way of computing the asymptotic performance of the algorithm. In certain cases, the algorithm can be proven to be Bayes optimal (Reeves, 2017; Gabrié et al., 2018; Barbier et al., 2019).

This approach of using AMP methods to characterize the generalization error of GLMs was also explored in (Barbier et al., 2019) for i.i.d. distributions on the data. Explicit formulae for the asymptotic mean squared error of regularized linear regression with rotationally invariant data matrices are proved in (Gerbelot et al., 2020). The ML-VAMP method in this work enables extensions to correlated features and to mismatch between training and test distributions.

2 Generalization Error: System Model

We consider the problem of estimating the weights $\bm{w}$ in the GLM model (1). As stated in the Introduction, we suppose we have training data $\{(\bm{x}_{i},y_{i})\}_{i=1}^{N}$ arranged as $\mathbf{X}:=[\bm{x}_{1}\,\bm{x}_{2}\ldots\bm{x}_{N}]^{\sf T}\in{\mathbb{R}}^{N\times p}$, $\bm{y}:=[y_{1}\,y_{2}\ldots y_{N}]^{\sf T}\in{\mathbb{R}}^{N}$. Then we can write

$\bm{y}={\bm{\phi}}_{\rm out}(\mathbf{X}\bm{w}^{0},\mathbf{d}),$ (5)

where ${\bm{\phi}}_{\rm out}(\mathbf{z},\mathbf{d})$ is the vector-valued function such that $[{\bm{\phi}}_{\rm out}(\mathbf{z},\mathbf{d})]_{n}=\phi_{\rm out}(z_{n},d_{n})$ and $\{d_{n}\}_{n=1}^{N}$ are general noise samples.

Given the training data $(\mathbf{X},\bm{y})$, we consider estimates of $\bm{w}^{0}$ given by a regularized empirical risk minimization of the form (2). We assume that the loss function $F_{\rm out}$ and regularizer $F_{\rm in}$ are separable functions, i.e., one can write

$F_{\rm out}(\bm{y},\mathbf{z})=\sum_{n=1}^{N}f_{\rm out}(y_{n},z_{n}),\quad F_{\rm in}(\bm{w})=\sum_{j=1}^{p}f_{\rm in}(w_{j}),$ (6)

for some functions $f_{\rm out}:\mathbb{R}^{2}\rightarrow\mathbb{R}$ and $f_{\rm in}:\mathbb{R}\rightarrow\mathbb{R}$. Many standard optimization problems in machine learning can be written in this form: logistic regression, support vector machines, linear regression, and Poisson regression.
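For example, ridge regression corresponds to $f_{\rm out}(y,z)=(y-z)^{2}/(2\sigma_{d}^{2})$ and $f_{\rm in}(w)=\lambda w^{2}/2$; $\ell_{2}$-regularized logistic regression with labels $y\in\{0,1\}$ corresponds to the binary cross-entropy loss $f_{\rm out}(y,z)=\log(1+e^{z})-yz$ with the same regularizer; and the lasso replaces the regularizer with $f_{\rm in}(w)=\lambda|w|$.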

Large System Limit: We follow the LSL analysis of (Bayati & Montanari, 2011) commonly used for analyzing AMP-based methods. Specifically, we consider a sequence of problems indexed by the number of training samples $N$. For each $N$, we suppose that the number of features $p=p(N)$ grows linearly with $N$, i.e.,

$\lim_{N\rightarrow\infty}\frac{p(N)}{N}=\beta$ (7)

for some constant $\beta\in(0,\infty)$. Note that $\beta>1$ corresponds to the over-parameterized regime and $\beta<1$ corresponds to the under-parameterized regime.

True parameter: We assume the true weight vector $\bm{w}^{0}$ has components whose empirical distribution converges as

$\lim_{N\rightarrow\infty}\{w^{0}_{n}\}\stackrel{PL(2)}{=}W^{0},$ (8)

for some limiting random variable $W^{0}$. The precise definition of empirical convergence is given in Appendix A. It means that the empirical distribution $\tfrac{1}{p}\sum_{i=1}^{p}\delta_{w^{0}_{i}}$ converges, in the Wasserstein-2 metric (see Chap. 6 of (Villani, 2008)), to the distribution of the finite-variance random variable $W^{0}$. Importantly, the limit (8) will hold if the components $\{w^{0}_{i}\}_{i=1}^{p}$ are drawn i.i.d. from the distribution of $W^{0}$ with $\mathbb{E}(W^{0})^{2}<\infty$. However, as discussed in Appendix A, the convergence can also be satisfied by correlated and deterministic sequences.

Training data input: For each $N$, we assume that the training input data samples, $\bm{x}_{i}\in{\mathbb{R}}^{p}$, $i=1,\ldots,N$, are i.i.d. and drawn from a $p$-dimensional Gaussian distribution with zero mean and covariance ${\bm{\Sigma}}_{\rm tr}\in{\mathbb{R}}^{p\times p}$. The covariance can capture the effect of features being correlated. We assume the covariance matrix has an eigenvalue decomposition,

${\bm{\Sigma}}_{\rm tr}=\tfrac{1}{p}\mathbf{V}_{0}^{\sf T}\mathrm{diag}(\mathbf{s}_{\rm tr}^{2})\mathbf{V}_{0},$ (9)

where $\mathbf{s}_{\rm tr}^{2}$ are the eigenvalues of ${\bm{\Sigma}}_{\rm tr}$ and $\mathbf{V}_{0}\in{\mathbb{R}}^{p\times p}$ is the orthogonal matrix of eigenvectors. The scaling $\tfrac{1}{p}$ ensures that the total variance of the samples, $\mathbb{E}\|\bm{x}_{i}\|^{2}$, does not grow with $N$. We will place a certain random model on $\mathbf{s}_{\rm tr}$ and $\mathbf{V}_{0}$ momentarily.

Using the covariance (9), we can write the data matrix as

$\mathbf{X}=\mathbf{U}\,\mathrm{diag}(\mathbf{s}_{\rm tr})\mathbf{V}_{0},$ (10)

where $\mathbf{U}\in{\mathbb{R}}^{N\times p}$ has entries drawn i.i.d. from ${\mathcal{N}}(0,\tfrac{1}{p})$. For the purpose of analysis, it is useful to express the matrix $\mathbf{U}$ in terms of its SVD:

$\mathbf{U}=\mathbf{V}_{2}\mathbf{S}_{\rm mp}\mathbf{V}_{1},\quad\mathbf{S}_{\rm mp}:=\left[\begin{array}{cc}\mathrm{diag}(\mathbf{s}_{\rm mp})&\mathbf{0}\\ \mathbf{0}&\mathbf{0}\end{array}\right],$ (11)

where $\mathbf{V}_{1}\in{\mathbb{R}}^{N\times N}$ and $\mathbf{V}_{2}\in{\mathbb{R}}^{p\times p}$ are orthogonal, and $\mathbf{S}_{\rm mp}\in\mathbb{R}^{N\times p}$ has non-zero entries $\mathbf{s}_{\rm mp}\in{\mathbb{R}}^{\min\{N,p\}}$ only along its principal diagonal; $\mathbf{s}_{\rm mp}$ are the singular values of $\mathbf{U}$. A standard result of random matrix theory is that, since $\mathbf{U}$ is i.i.d. Gaussian with entries ${\mathcal{N}}(0,\tfrac{1}{p})$, the matrices $\mathbf{V}_{1}$ and $\mathbf{V}_{2}$ are Haar-distributed on the group of orthogonal matrices and $\mathbf{s}_{\rm mp}$ is such that

$\lim_{N\rightarrow\infty}\{s_{{\rm mp},i}\}\stackrel{PL(2)}{=}S_{\rm mp},$ (12)

where $S_{\rm mp}\geq 0$ is a non-negative random variable such that $S_{\rm mp}^{2}$ follows the Marchenko-Pastur distribution. Details on this distribution are in Appendix H.

Training data output: Given the input data $\mathbf{X}$, we assume that the training outputs $\bm{y}$ are generated from (5), where the noise $\mathbf{d}$ is independent of $\mathbf{X}$ and has an empirical distribution which converges as

$\lim_{N\rightarrow\infty}\{d_{i}\}\stackrel{PL(2)}{=}D.$ (13)

Again, the limit (13) will be satisfied if $\{d_{i}\}_{i=1}^{N}$ are i.i.d. draws of a random variable $D$ with bounded second moment.

Test data: To measure the generalization error, we assume that we are given a test point $\bm{x}_{\rm ts}$, and we obtain the true output $y_{\rm ts}$ and predicted output $\widehat{y}_{\rm ts}$ given by (3). We assume that the test data inputs are also Gaussian, i.e.,

$\bm{x}_{\rm ts}^{\sf T}=\mathbf{u}^{\sf T}\mathrm{diag}(\mathbf{s}_{\rm ts})\mathbf{V}_{0},$ (14)

where $\mathbf{u}\in{\mathbb{R}}^{p}$ has i.i.d. Gaussian components ${\mathcal{N}}(0,\tfrac{1}{p})$, and $\mathbf{s}_{\rm ts}$ and $\mathbf{V}_{0}$ are the eigenvalues and eigenvectors of the test data covariance matrix. That is, the test data sample has covariance matrix

${\bm{\Sigma}}_{\rm ts}=\tfrac{1}{p}\mathbf{V}_{0}^{\sf T}\mathrm{diag}(\mathbf{s}_{\rm ts}^{2})\mathbf{V}_{0}.$ (15)

In comparison to (9), we see that we are assuming that the eigenvectors of the training and test data are the same, but the eigenvalues may be different. In this way, we can capture distributional mismatch between the training and test data. For example, we will be able to measure the generalization error when the test sample is outside a subspace explored by the training data.

To capture the relation between the training and test distributions, we assume that the components of $\mathbf{s}_{\rm tr}$ and $\mathbf{s}_{\rm ts}$ converge as

$\lim_{N\rightarrow\infty}\{(s_{{\rm tr},i},s_{{\rm ts},i})\}\stackrel{PL(2)}{=}(S_{\rm tr},S_{\rm ts}),$ (16)

for some non-negative, bounded random vector $(S_{\rm tr},S_{\rm ts})$. The joint distribution of $(S_{\rm tr},S_{\rm ts})$ captures the relation between the training and test data.

When $S_{\rm tr}=S_{\rm ts}$, our model corresponds to the case where the training and test distributions are matched. Isotropic Gaussian features in both training and test data correspond to covariance matrices ${\bm{\Sigma}}_{\rm tr}=\tfrac{1}{p}\sigma^{2}_{\rm tr}\mathbf{I}$ and ${\bm{\Sigma}}_{\rm ts}=\tfrac{1}{p}\sigma^{2}_{\rm ts}\mathbf{I}$, which can be modeled as $S_{\rm tr}=\sigma_{\rm tr}$, $S_{\rm ts}=\sigma_{\rm ts}$. We also require that the matrix $\mathbf{V}_{0}$ is uniformly distributed on the set of $p\times p$ orthogonal matrices.
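The following sketch illustrates how training and test covariates with shared eigenvectors but different eigenvalues can be generated according to (9), (10), and (14); the particular eigenvalue distributions used here are illustrative and not those of Section 5.

```python
# Sketch of the covariate model: training data as in (9)-(10) and a test sample
# as in (14), with shared eigenvectors V_0 but different eigenvalues; the
# eigenvalue distributions below are illustrative.
import numpy as np

rng = np.random.default_rng(1)
N, p = 400, 200

# Shared eigenvectors V_0: a Haar-distributed orthogonal matrix.
V0, _ = np.linalg.qr(rng.normal(size=(p, p)))

# Training and test eigenvalues (s_tr, s_ts), drawn jointly so that the pairs
# (s_tr_i, s_ts_i) have a limiting joint distribution as in (16).
s_tr = np.exp(rng.normal(size=p))
s_ts = 0.5 * s_tr + 0.5 * np.exp(rng.normal(size=p))   # correlated with s_tr

# Training data matrix X = U diag(s_tr) V_0 with U i.i.d. N(0, 1/p), as in (10).
U = rng.normal(size=(N, p)) / np.sqrt(p)
X = U @ np.diag(s_tr) @ V0

# A test sample x_ts^T = u^T diag(s_ts) V_0, as in (14).
u = rng.normal(size=p) / np.sqrt(p)
x_ts = (u * s_ts) @ V0
```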

Generalization error: From the training data, we obtain an estimate $\widehat{\bm{w}}$ via the regularized empirical risk minimization (2). Given a test sample $\bm{x}_{\rm ts}$ and parameter estimate $\widehat{\bm{w}}$, the true output $y_{\rm ts}$ and predicted output $\widehat{y}_{\rm ts}$ are given by equation (3). We assume the test noise is distributed as $d_{\rm ts}\sim D$, following the same distribution as the training data. The postulated inverse-link function $\phi(\cdot)$ in (3) may be different from the true inverse-link function $\phi_{\rm out}(\cdot)$.

The generalization error is defined as the asymptotic expected loss,

${\mathcal{E}}_{\rm ts}:=\lim_{N\rightarrow\infty}\mathbb{E}\,f_{\rm ts}(\widehat{y}_{\rm ts},y_{\rm ts}),$ (17)

where $f_{\rm ts}(\cdot)$ is some loss function relevant for the test error (which may be different from the training loss). The expectation in (17) is with respect to the randomness in the training as well as test data, and the noise. Our main result provides a formula for the generalization error (17).

3 Learning GLMs via ML-VAMP

There are many methods for solving the minimization problem (2). We apply the ML-VAMP algorithm of (Fletcher et al., 2018; Pandit et al., 2019). This algorithm is not necessarily the most computationally efficient method. For our purposes, however, the algorithm serves as a constructive proof technique, i.e., it enables exact predictions for the generalization error in the LSL described above. Moreover, when the loss function in (2) is strictly convex, the problem has a unique global minimum, so the generalization error of this minimum is agnostic to the choice of algorithm used to find it. To that end, we start by reformulating (2) in a form that is amenable to the application of ML-VAMP, Algorithm 1.

Multi-Layer Representation.

The first step in applying ML-VAMP to the GLM learning problem is to represent the mapping from the true parameters $\bm{w}^{0}$ to the output $\bm{y}$ as a certain multi-layer network. We combine (5), (10) and (11), so that the mapping $\bm{w}^{0}\mapsto\bm{y}$ can be written as the following sequence of operations (as illustrated in Fig. 1):

$\begin{aligned}\mathbf{z}^{0}_{0}&:=\bm{w}^{0}, &\mathbf{p}^{0}_{0}&:=\mathbf{V}_{0}\mathbf{z}^{0}_{0},\\ \mathbf{z}^{0}_{1}&:={\bm{\phi}}_{1}(\mathbf{p}^{0}_{0},{\bm{\xi}}_{1}), &\mathbf{p}^{0}_{1}&:=\mathbf{V}_{1}\mathbf{z}^{0}_{1},\\ \mathbf{z}^{0}_{2}&:={\bm{\phi}}_{2}(\mathbf{p}^{0}_{1},{\bm{\xi}}_{2}), &\mathbf{p}^{0}_{2}&:=\mathbf{V}_{2}\mathbf{z}^{0}_{2},\\ \mathbf{z}^{0}_{3}&:={\bm{\phi}}_{3}(\mathbf{p}^{0}_{2},{\bm{\xi}}_{3})=\bm{y},\end{aligned}$ (18)

where the ${\bm{\xi}}_{\ell}$ are the following vectors:

${\bm{\xi}}_{1}:=\mathbf{s}_{\rm tr},\quad{\bm{\xi}}_{2}:=\mathbf{s}_{\rm mp},\quad{\bm{\xi}}_{3}:=\mathbf{d},$ (19)

and the functions ${\bm{\phi}}_{\ell}(\cdot)$ are given by

$\begin{aligned}{\bm{\phi}}_{1}(\mathbf{p}_{0},\mathbf{s}_{\rm tr})&:=\mathrm{diag}(\mathbf{s}_{\rm tr})\mathbf{p}_{0},\\ {\bm{\phi}}_{2}(\mathbf{p}_{1},\mathbf{s}_{\rm mp})&:=\mathbf{S}_{\rm mp}\mathbf{p}_{1},\\ {\bm{\phi}}_{3}(\mathbf{p}_{2},\mathbf{d})&:={\bm{\phi}}_{\rm out}(\mathbf{p}_{2},\mathbf{d}).\end{aligned}$ (20)

We see from Fig. 1 that the mapping of the true parameters $\bm{w}^{0}=\mathbf{z}^{0}_{0}$ to the observed response vector $\bm{y}=\mathbf{z}^{0}_{3}$ is described by a multi-layer network of alternating orthogonal operators $\mathbf{V}_{\ell}$ and non-linear functions ${\bm{\phi}}_{\ell}(\cdot)$. Let $L=3$ denote the number of layers in this multi-layer network.

Figure 1: Sequence flow representing the mapping from the unknown parameter values $\bm{w}^{0}=\mathbf{z}^{0}_{0}$, through the alternating operators $\mathbf{V}_{0},{\bm{\phi}}_{1},\mathbf{V}_{1},{\bm{\phi}}_{2},\mathbf{V}_{2},{\bm{\phi}}_{3}$ with disturbances ${\bm{\xi}}_{1},{\bm{\xi}}_{2},{\bm{\xi}}_{3}$, to the vector of responses $\bm{y}=\mathbf{z}^{0}_{3}$ on the training data.

The minimization (2) can also be represented using a similar signal flow graph. Given a parameter candidate $\bm{w}$, the mapping $\bm{w}\mapsto\mathbf{X}\bm{w}$ can be written using the sequence of vectors

$\begin{aligned}\mathbf{z}_{0}&:=\bm{w}, &\mathbf{p}_{0}&:=\mathbf{V}_{0}\mathbf{z}_{0},\\ \mathbf{z}_{1}&:=\mathbf{S}_{\rm tr}\mathbf{p}_{0}, &\mathbf{p}_{1}&:=\mathbf{V}_{1}\mathbf{z}_{1},\\ \mathbf{z}_{2}&:=\mathbf{S}_{\rm mp}\mathbf{p}_{1}, &\mathbf{p}_{2}&:=\mathbf{V}_{2}\mathbf{z}_{2}=\mathbf{X}\bm{w}.\end{aligned}$ (21)

There are $L=3$ steps in this sequence, and we let

$\mathbf{z}=\{\mathbf{z}_{0},\mathbf{z}_{1},\mathbf{z}_{2}\},\quad\mathbf{p}=\{\mathbf{p}_{0},\mathbf{p}_{1},\mathbf{p}_{2}\}$

denote the sets of vectors across the steps. The minimization in (2) can then be written in the following equivalent form:

$\begin{aligned}\min_{\mathbf{z},\mathbf{p}}\ &F_{0}(\mathbf{z}_{0})+F_{1}(\mathbf{p}_{0},\mathbf{z}_{1})+F_{2}(\mathbf{p}_{1},\mathbf{z}_{2})+F_{3}(\mathbf{p}_{2})\\ {\rm subject\ to}\ &\mathbf{p}_{\ell}=\mathbf{V}_{\ell}\mathbf{z}_{\ell},\quad\ell=0,1,2,\end{aligned}$ (22)

where the penalty functions $F_{\ell}$ are defined as

$\begin{aligned}F_{0}(\cdot)&=F_{\rm in}(\cdot), &F_{1}(\cdot,\cdot)&=\delta_{\{\mathbf{z}_{1}=\mathbf{S}_{\rm tr}\mathbf{p}_{0}\}}(\cdot,\cdot),\\ F_{2}(\cdot,\cdot)&=\delta_{\{\mathbf{z}_{2}=\mathbf{S}_{\rm mp}\mathbf{p}_{1}\}}(\cdot,\cdot), &F_{3}(\cdot)&=F_{\rm out}(\bm{y},\cdot),\end{aligned}$ (23)

where $\delta_{\mathcal{A}}(\cdot)$ is $0$ on the set $\mathcal{A}$ and $+\infty$ on $\mathcal{A}^{c}$.

ML-VAMP for GLM Learning.

Using this multi-layer representation, we can now apply the ML-VAMP algorithm from (Fletcher et al., 2018; Pandit et al., 2019) to solve the optimization (22). The steps are shown in Algorithm 1. These steps are a special case of the “MAP version” of ML-VAMP in (Pandit et al., 2019), but with a slightly different set-up for the GLM problem. We will call these steps the ML-VAMP GLM Learning Algorithm.

Algorithm 1 ML-VAMP GLM Learning Algorithm
1:  Initialize $\gamma^{-}_{0\ell}>0$, $\mathbf{r}^{-}_{0\ell}=0$ for $\ell=0,\ldots,L-1$
2:
3:  for $k=0,1,\dots$ do
4:     // Forward Pass
5:     for $\ell=0,\ldots,L-1$ do
6:        if $\ell=0$ then
7:           $\widehat{\mathbf{z}}_{k0}=\mathbf{g}^{+}_{0}(\mathbf{r}_{k0}^{-},\gamma^{-}_{k0})$
8:        else
9:           $\widehat{\mathbf{z}}_{k\ell}=\mathbf{g}^{+}_{\ell}(\mathbf{r}_{k,\ell-1}^{+},\mathbf{r}_{k\ell}^{-},\gamma^{+}_{k,\ell-1},\gamma^{-}_{k\ell})$
10:       end if
11:       $\alpha_{k\ell}^{+}=\langle\partial\widehat{\mathbf{z}}_{k\ell}/\partial\mathbf{r}_{k\ell}^{-}\rangle$
12:       $\mathbf{r}_{k\ell}^{+}=\dfrac{\mathbf{V}_{\ell}(\widehat{\mathbf{z}}_{k\ell}-\alpha_{k\ell}^{+}\mathbf{r}_{k\ell}^{-})}{1-\alpha^{+}_{k\ell}}$
13:       $\gamma_{k\ell}^{+}=(1/\alpha_{k\ell}^{+}-1)\gamma_{k\ell}^{-}$
14:    end for
15:
16:    // Backward Pass
17:    for $\ell=L,\ldots,1$ do
18:       if $\ell=L$ then
19:          $\widehat{\mathbf{p}}_{k,L-1}=\mathbf{g}_{L}^{-}(\mathbf{r}^{+}_{k,L-1},\gamma^{+}_{k,L-1})$
20:       else
21:          $\widehat{\mathbf{p}}_{k,\ell-1}=\mathbf{g}^{-}_{\ell}(\mathbf{r}_{k,\ell-1}^{+},\mathbf{r}_{k+1,\ell}^{-},\gamma^{+}_{k,\ell-1},\gamma^{-}_{k+1,\ell})$
22:       end if
23:       $\alpha_{k,\ell-1}^{-}=\langle\partial\widehat{\mathbf{p}}_{k,\ell-1}/\partial\mathbf{r}_{k,\ell-1}^{+}\rangle$
24:       $\mathbf{r}_{k+1,\ell-1}^{-}=\dfrac{\mathbf{V}_{\ell-1}^{\sf T}(\widehat{\mathbf{p}}_{k,\ell-1}-\alpha_{k,\ell-1}^{-}\mathbf{r}_{k,\ell-1}^{+})}{1-\alpha^{-}_{k,\ell-1}}$
25:       $\gamma_{k+1,\ell-1}^{-}=(1/\alpha_{k,\ell-1}^{-}-1)\gamma_{k,\ell-1}^{+}$
26:    end for
27: end for

The algorithm operates in a set of iterations indexed by $k$. In each iteration, a "forward pass" through the layers generates estimates $\widehat{\mathbf{z}}_{k\ell}$ for the hidden variables $\mathbf{z}^{0}_{\ell}$, while a "backward pass" generates estimates $\widehat{\mathbf{p}}_{k\ell}$ for the variables $\mathbf{p}^{0}_{\ell}$. In each step, the estimates $\widehat{\mathbf{z}}_{k\ell}$ and $\widehat{\mathbf{p}}_{k\ell}$ are produced by functions $\mathbf{g}^{+}_{\ell}(\cdot)$ and $\mathbf{g}^{-}_{\ell}(\cdot)$ called estimators or denoisers.

For the MAP version of the ML-VAMP algorithm in (Pandit et al., 2019), the denoisers are essentially proximal-type operators defined as

$\operatorname{prox}_{F/\gamma}(\bm{u}):=\operatorname*{argmin}_{\bm{x}}\ F(\bm{x})+\tfrac{\gamma}{2}\left\|\bm{x}-\bm{u}\right\|^{2}.$ (24)

An important property of the proximal operator is that, for separable functions $F$ of the form (6), it acts componentwise: $[\operatorname{prox}_{F/\gamma}(\bm{u})]_{i}=\operatorname{prox}_{f/\gamma}(u_{i})$.
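As an illustration, the following sketch implements the scalar proximal operators for two common separable penalties, the L2 penalty (closed form) and the L1 penalty (soft thresholding); both act componentwise, consistent with the property above.

```python
# Scalar proximal operators (24) for two separable penalties: the L2 penalty
# f(w) = lam*w^2/2 (closed form) and the L1 penalty f(w) = lam*|w| (soft
# thresholding). Both act componentwise on vectors.
import numpy as np

def prox_l2(u, gamma, lam):
    # argmin_x lam*x^2/2 + gamma/2*(x-u)^2  =>  x = gamma*u / (lam + gamma)
    return gamma * u / (lam + gamma)

def prox_l1(u, gamma, lam):
    # argmin_x lam*|x| + gamma/2*(x-u)^2  =>  soft threshold at lam/gamma
    return np.sign(u) * np.maximum(np.abs(u) - lam / gamma, 0.0)

u = np.array([-2.0, -0.1, 0.3, 1.5])
print(prox_l2(u, gamma=1.0, lam=0.5))   # shrinks every component toward 0
print(prox_l1(u, gamma=1.0, lam=0.5))   # zeros out the small components
```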

In the case of the GLM model, for $\ell=0$ and $\ell=L$ (lines 7 and 19), the denoisers are the proximal operators

$\begin{aligned}\mathbf{g}_{0}^{+}(\mathbf{r}_{0}^{-},\gamma_{0}^{-})&=\operatorname{prox}_{F_{\rm in}/\gamma_{0}^{-}}(\mathbf{r}^{-}_{0}), &\text{(25a)}\\ \mathbf{g}_{3}^{-}(\mathbf{r}_{2}^{+},\bm{y},\gamma_{2}^{+})&=\operatorname{prox}_{F_{\rm out}/\gamma_{2}^{+}}(\mathbf{r}^{+}_{2}). &\text{(25b)}\end{aligned}$

Note that in (25b) there is a dependence on $\bm{y}$ through the term $F_{\rm out}(\bm{y},\cdot)$. For the middle layers, $\ell=1,2$ (lines 9 and 21), the denoisers are given by

$\begin{aligned}\mathbf{g}_{\ell}^{+}(\mathbf{r}_{\ell-1}^{+},\mathbf{r}_{\ell}^{-},\gamma_{\ell-1}^{+},\gamma_{\ell}^{-})&:=\widehat{\mathbf{z}}_{\ell}, &\text{(26a)}\\ \mathbf{g}_{\ell}^{-}(\mathbf{r}_{\ell-1}^{+},\mathbf{r}_{\ell}^{-},\gamma_{\ell-1}^{+},\gamma_{\ell}^{-})&:=\widehat{\mathbf{p}}_{\ell-1}, &\text{(26b)}\end{aligned}$

where $(\widehat{\mathbf{p}}_{\ell-1},\widehat{\mathbf{z}}_{\ell})$ are the solutions to the minimization

$(\widehat{\mathbf{p}}_{\ell-1},\widehat{\mathbf{z}}_{\ell}):=\operatorname*{argmin}_{(\mathbf{p}_{\ell-1},\mathbf{z}_{\ell})}\ F_{\ell}(\mathbf{p}_{\ell-1},\mathbf{z}_{\ell})+\frac{\gamma_{\ell}^{-}}{2}\|\mathbf{z}_{\ell}-\mathbf{r}_{\ell}^{-}\|^{2}+\frac{\gamma_{\ell-1}^{+}}{2}\|\mathbf{p}_{\ell-1}-\mathbf{r}_{\ell-1}^{+}\|^{2}.$ (27)

The quantity $\langle\partial\bm{v}/\partial\bm{u}\rangle$ on lines 11 and 23 denotes the empirical mean $\tfrac{1}{N}\sum_{n=1}^{N}\partial v_{n}/\partial u_{n}$.

Thus, the ML-VAMP algorithm in Algorithm 1 reduces the joint constrained minimization (22) over the variables $(\mathbf{z}_{0},\mathbf{z}_{1},\mathbf{z}_{2})$ and $(\mathbf{p}_{0},\mathbf{p}_{1},\mathbf{p}_{2})$ to a set of proximal operations on pairs of variables $(\mathbf{p}_{\ell-1},\mathbf{z}_{\ell})$. As discussed in (Pandit et al., 2019), this type of minimization is similar to ADMM with adaptive step-sizes. Details of the denoisers $\mathbf{g}_{\ell}^{\pm}$ and other aspects of the algorithm are given in Appendix B.
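For intuition, the sketch below traces one forward-pass update of Algorithm 1 (lines 7 and 11-13) at layer $\ell=0$, assuming the separable L2 proximal denoiser; the input values are illustrative.

```python
# One forward-pass update of Algorithm 1 at layer l = 0 (lines 7 and 11-13),
# assuming the separable L2 prox denoiser g_0^+; inputs are illustrative.
import numpy as np

rng = np.random.default_rng(2)
p, lam = 200, 0.5
V0, _ = np.linalg.qr(rng.normal(size=(p, p)))     # orthogonal V_0
r_minus = rng.normal(size=p)                      # incoming message r_{k0}^-
gamma_minus = 1.3

# Line 7: z_hat = g_0^+(r^-, gamma^-) = prox_{f_in/gamma^-}(r^-), f_in = lam*w^2/2.
z_hat = gamma_minus * r_minus / (lam + gamma_minus)

# Line 11: alpha^+ = empirical mean of d z_hat / d r^- (constant for this prox).
alpha_plus = gamma_minus / (lam + gamma_minus)

# Lines 12-13: corrected outgoing message and its precision.
r_plus = V0 @ (z_hat - alpha_plus * r_minus) / (1.0 - alpha_plus)
gamma_plus = (1.0 / alpha_plus - 1.0) * gamma_minus
print(gamma_plus)    # equals lam for the L2 prox, since 1/alpha - 1 = lam/gamma^-
```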

4 Main Result

We make two assumptions. The first assumption imposes certain regularity conditions on the functions $f_{\rm ts}$, $\phi$, $\phi_{\rm out}$, and the maps $\mathbf{g}_{\ell}^{\pm}$ appearing in Algorithm 1. The precise definitions of pseudo-Lipschitz continuity and uniform Lipschitz continuity are given in Appendix A of the supplementary material.

Assumption 1.

The denoisers and link functions satisfy the following continuity conditions:

(a) The proximal operators in (25),

$\mathbf{g}_{0}^{+}(\mathbf{r}_{0}^{-},\gamma_{0}^{-}),\quad\mathbf{g}_{3}^{-}(\mathbf{r}_{2}^{+},\bm{y},\gamma_{2}^{+}),$

are uniformly Lipschitz continuous in $\mathbf{r}^{-}_{0}$ and $(\mathbf{r}^{+}_{2},\bm{y})$ over the parameters $\gamma^{-}_{0}$ and $\gamma^{+}_{2}$.

(b) The link function $\phi_{\rm out}(p,d)$ is Lipschitz continuous in $(p,d)$. The test error function $f_{\rm ts}(\phi(\widehat{z}),\phi_{\rm out}(z,d))$ is pseudo-Lipschitz continuous in $(\widehat{z},z,d)$ of order 2.

Our second assumption is that the ML-VAMP algorithm converges. Specifically, let $\bm{x}_{k}=\bm{x}_{k}(N)$ be any set of outputs of Algorithm 1 at some iteration $k$ and dimension $N$. For example, $\bm{x}_{k}(N)$ could be $\widehat{\mathbf{z}}_{k\ell}(N)$ or $\widehat{\mathbf{p}}_{k\ell}(N)$ for some $\ell$, or a concatenation of signals such as $[\mathbf{z}^{0}_{\ell}(N)\ \ \widehat{\mathbf{z}}_{k\ell}(N)]$.

Assumption 2.

Let $\bm{x}_{k}(N)$ be any finite set of outputs of the ML-VAMP algorithm as above. Then there exist limits

$\bm{x}(N)=\lim_{k\rightarrow\infty}\bm{x}_{k}(N)$ (28)

satisfying

$\lim_{k\rightarrow\infty}\lim_{N\rightarrow\infty}\frac{1}{N}\|\bm{x}_{k}(N)-\bm{x}(N)\|^{2}=0.$ (29)

We are now ready to state our main result.

Theorem 1.

Consider the GLM learning problem (2) solved by applying Algorithm 1 to the equivalent problem (22) under the assumptions of Section 2 along with Assumptions 1 and 2. Then there exist constants $\tau_{0}^{-},\overline{\gamma}_{0}^{+}>0$ and a positive definite matrix $\mathbf{M}\in\mathbb{R}^{2\times 2}$ such that the following hold:

(a) The fixed points $\{\widehat{\mathbf{z}}_{\ell},\widehat{\mathbf{p}}_{\ell}\}$, $\ell=0,1,2$, of Algorithm 1 satisfy the KKT conditions for the constrained optimization problem (22). Equivalently, $\widehat{\bm{w}}:=\widehat{\mathbf{z}}_{0}$ is a stationary point of (2).

(b) The true parameter $\bm{w}^{0}$ and its estimate $\widehat{\bm{w}}$ empirically converge as

$\lim_{N\rightarrow\infty}\{(w^{0}_{i},\widehat{w}_{i})\}\stackrel{PL(2)}{=}(W^{0},\widehat{W}),$ (30)

where $W^{0}$ is the random variable from (8) and

$\widehat{W}=\operatorname{prox}_{f_{\rm in}/\overline{\gamma}_{0}^{+}}(W^{0}+Q_{0}^{-}),$ (31)

with $Q_{0}^{-}\sim{\mathcal{N}}(0,\tau_{0}^{-})$ independent of $W^{0}$.

(c) The asymptotic generalization error (17), with $(y_{\rm ts},\widehat{y}_{\rm ts})$ defined as in (3), is given by

$\mathcal{E}_{\rm ts}=\mathbb{E}\,f_{\rm ts}\!\left(\phi_{\rm out}(Z_{\rm ts},D),\phi(\widehat{Z}_{\rm ts})\right),$ (32)

where $(Z_{\rm ts},\widehat{Z}_{\rm ts})\sim\mathcal{N}(\mathbf{0}_{2},\mathbf{M})$, independent of $D$.

Part (a) shows that, similar to gradient descent, Algorithm 1 finds the stationary points of problem (2). These stationary points are unique in strictly convex problems such as linear and logistic regression. Thus, in such cases, the same results hold for any algorithm that finds such stationary points. Hence, the fact that we are using ML-VAMP is immaterial: our results apply to any solver for (2). Note that the convergence to the fixed points $\{\widehat{\mathbf{z}}_{\ell},\widehat{\mathbf{p}}_{\ell}\}$ is guaranteed by Assumption 2.

Part (b) provides an exact description of the asymptotic statistical relation between the true parameter $\bm{w}^{0}$ and its estimate $\widehat{\bm{w}}$. The parameters $\tau_{0}^{-},\overline{\gamma}_{0}^{+}>0$ and $\mathbf{M}$ can be explicitly computed using a set of recursive equations called the state evolution or SE, described in Appendix C in the supplementary material.

We can use these expressions to compute a variety of relevant metrics. For example, the $PL(2)$ convergence shows that the MSE of the parameter estimate is

$\lim_{N\rightarrow\infty}\frac{1}{N}\sum_{n=1}^{N}(w_{n}^{0}-\widehat{w}_{n})^{2}=\mathbb{E}(W^{0}-\widehat{W})^{2}.$ (33)

The expectation on the right-hand side of (33) can then be computed via integration over the joint density of $(W^{0},\widehat{W})$ from part (b). In this way, we have a simple and exact method to compute the parameter error. Other metrics such as parameter bias or variance, cosine angle, or sparsity detection can also be computed.
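For instance, with a Gaussian prior on $W^{0}$ and the L2 regularizer used in our experiments, the right-hand side of (33) can be evaluated by simple Monte Carlo once $(\tau_{0}^{-},\overline{\gamma}_{0}^{+})$ are available; the sketch below uses placeholder values for these SE parameters rather than an actual SE fixed point.

```python
# Monte Carlo evaluation of the right-hand side of (33) via Theorem 1(b),
# assuming W0 ~ N(0,1) and the L2 regularizer f_in(w) = lam*w^2/2; the SE
# parameters (tau0_minus, gamma0_plus) below are placeholders, in practice
# obtained from the state-evolution recursion of Appendix C.
import numpy as np

rng = np.random.default_rng(3)
lam, tau0_minus, gamma0_plus = 0.5, 0.2, 1.0       # placeholder values
M = 10**6

W0 = rng.normal(size=M)                            # W0 ~ N(0, 1)
Q0 = np.sqrt(tau0_minus) * rng.normal(size=M)      # Q0^- ~ N(0, tau0^-)

# (31): W_hat = prox_{f_in/gamma0^+}(W0 + Q0^-) = gamma*(W0 + Q0)/(lam + gamma).
W_hat = gamma0_plus * (W0 + Q0) / (lam + gamma0_plus)

print("E (W0 - W_hat)^2 =", np.mean((W0 - W_hat) ** 2))
```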

Part (c) of Theorem 1 similarly exactly characterizes the asymptotic generalization error. In this case, we compute the expectation over the three variables $(Z_{\rm ts},\widehat{Z}_{\rm ts},D)$. In this way, we have provided a methodology for exactly predicting the generalization error from the key parameters of the problem such as the sampling ratio $\beta=p/N$, the regularizer, the output function, and the distributions of the true weights and noise. We provide several examples such as linear regression, logistic regression, and SVM in Appendix G. We also recover the result of (Hastie et al., 2019) in Appendix G.

Remarks on Assumptions.

Note that Assumption 1 is satisfied in many practical cases. For example, it can be verified that it is satisfied when $f_{\rm in}(\cdot)$ and $f_{\rm out}(\cdot)$ are convex. Assumption 2 is somewhat more restrictive in that it requires that the ML-VAMP algorithm converges. The convergence properties of ML-VAMP are discussed in (Fletcher et al., 2016). The ML-VAMP algorithm may not always converge, and characterizing conditions under which convergence is possible is an open question. However, experiments in (Rangan et al., 2019) show that the algorithm does indeed often converge, and in these cases our analysis applies. In any case, we will see below that the predictions from Theorem 1 agree closely with numerical experiments in several relevant cases.

In some special cases equation (32) simplifies to yield quantitative insights for interesting modeling artifacts. We discuss these in Appendix G in the supplementary material.

5 Experiments

Training and Test Distributions.

We validate our theoretical results on a number of synthetic data experiments. For all experiments, the training and test data are generated following the model in Section 2. We generate the training and test eigenvalues as i.i.d. with lognormal distributions,

$S_{\rm tr}^{2}=A(10)^{0.1u_{\rm tr}},\quad S_{\rm ts}^{2}=A(10)^{0.1u_{\rm ts}},$

where $(u_{\rm tr},u_{\rm ts})$ are bivariate zero-mean Gaussian with

$\mathrm{cov}(u_{\rm tr},u_{\rm ts})=\sigma^{2}_{u}\left[\begin{array}{cc}1&\rho\\ \rho&1\end{array}\right].$

When $\sigma^{2}_{u}=0$, we obtain eigenvalues that are all equal, corresponding to the i.i.d. case. With $\sigma^{2}_{u}>0$ we can model correlated features. Also, when the correlation coefficient $\rho=1$, $S_{\rm tr}=S_{\rm ts}$, so there is no training and test mismatch. However, we can also select $\rho<1$ to experiment with cases where the training and test distributions differ. In the examples below, we consider the following three cases:

(1) i.i.d. features ($\sigma_{u}=0$);

(2) correlated features with matching training and test distributions ($\sigma_{u}=3$ dB, $\rho=1$); and

(3) correlated features with train-test mismatch ($\sigma_{u}=3$ dB, $\rho=0.5$).

For all experiments below, the true model coefficients are generated as i.i.d. Gaussian $w_{j}^{0}\sim{\mathcal{N}}(0,1)$ and we use standard L2-regularization, $f_{\rm in}(w)=\lambda w^{2}/2$ for some $\lambda>0$. Our framework can incorporate arbitrary i.i.d. distributions on $w_{j}$ and arbitrary regularizers, but we illustrate just the Gaussian case with L2-regularization here.
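A sketch of this data-generation procedure is given below; the normalization of $A$ (so that the mean eigenvalue is one) is an illustrative choice.

```python
# Sketch of the synthetic eigenvalue and coefficient generation used in the
# experiments; A is normalized here so that the mean eigenvalue is one, which
# is an illustrative choice.
import numpy as np

def gen_eigenvalues(p, sigma_u, rho, rng):
    # Bivariate zero-mean Gaussian (u_tr, u_ts) with std sigma_u (in dB) and corr rho.
    cov = sigma_u**2 * np.array([[1.0, rho], [rho, 1.0]])
    u = rng.multivariate_normal(np.zeros(2), cov, size=p)
    S2 = 10.0 ** (0.1 * u)               # S^2 = A * 10^(0.1 u) with A = 1
    S2 /= S2.mean(axis=0)                # normalize the mean eigenvalue to one
    return np.sqrt(S2[:, 0]), np.sqrt(S2[:, 1])      # (s_tr, s_ts)

rng = np.random.default_rng(4)
p = 1000
s_tr, s_ts = gen_eigenvalues(p, sigma_u=3.0, rho=0.5, rng=rng)   # case (3)
w0 = rng.normal(size=p)                  # true coefficients, w_j^0 ~ N(0, 1)
lam = 1e-4 / np.mean(w0**2)              # under-regularized L2 level used below
```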

Under-regularized linear regression.

We first consider the case of under-regularized linear regression where the output channel is $\phi_{\rm out}(p,d)=p+d$ with $d\sim{\mathcal{N}}(0,\sigma^{2}_{d})$. The noise variance $\sigma^{2}_{d}$ is set for an SNR level of 10 dB. We use a standard mean-squared error (MSE) output loss, $f_{\rm out}(y,p)=(y-p)^{2}/(2\sigma^{2}_{d})$. Since we are using the L2-regularizer, $f_{\rm in}(w)=\lambda w^{2}/2$, the minimization (2) is standard ridge regression. Moreover, if we were to select $\lambda=1/\mathbb{E}(w_{j}^{0})^{2}$, the ridge regression estimate would correspond to the minimum mean-squared error (MMSE) estimate of the coefficients $\bm{w}^{0}$. However, to study the under-regularized regime, we take $\lambda=10^{-4}/\mathbb{E}(w_{j}^{0})^{2}$.

Fig. 2 plots the test MSE for the three cases described above for the linear model. In the figure, we take $p=1000$ features and vary the number of samples $n$ from $0.2p$ (over-parameterized) to $3p$ (under-parameterized). For each value of $n$, we take 100 random instances of the model, compute the ridge regression estimate using the sklearn package, and measure the test MSE on 1000 independent test samples. The simulated values in Fig. 2 are the median test error over the 100 random trials. The test MSE is plotted in a normalized dB scale,

$\mbox{Test MSE (dB)}=10\log_{10}\left(\frac{\mathbb{E}(\widehat{y}_{\rm ts}-y_{\rm ts})^{2}}{\mathbb{E}y_{\rm ts}^{2}}\right).$

Also plotted is the state evolution (SE) theoretical test MSE from Theorem 1.
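The sketch below reproduces a single trial of this experiment for the matched correlated case (2); the sweep over $n$ and the median over 100 trials are omitted for brevity.

```python
# Single trial of the ridge-regression experiment for the matched correlated
# case (2); the sweep over n and the median over 100 trials are omitted.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
p, n, n_ts, snr_db, sigma_u = 1000, 500, 1000, 10.0, 3.0

V0, _ = np.linalg.qr(rng.normal(size=(p, p)))            # shared eigenvectors
s2 = 10.0 ** (0.1 * sigma_u * rng.normal(size=p))        # lognormal eigenvalues
s = np.sqrt(s2 / s2.mean())                              # s_tr = s_ts (rho = 1)
w0 = rng.normal(size=p)
lam = 1e-4 / np.mean(w0 ** 2)                            # under-regularized

def draw(num):
    U = rng.normal(size=(num, p)) / np.sqrt(p)
    return U @ np.diag(s) @ V0

X, X_ts = draw(n), draw(n_ts)
z, z_ts = X @ w0, X_ts @ w0
sigma_d = np.sqrt(np.mean(z ** 2) / 10 ** (0.1 * snr_db))      # 10 dB SNR
y = z + sigma_d * rng.normal(size=n)
y_ts = z_ts + sigma_d * rng.normal(size=n_ts)

# sklearn's Ridge minimizes ||y - Xw||^2 + alpha*||w||^2, so alpha = lam*sigma_d^2
# matches f_out(y,p) = (y-p)^2/(2 sigma_d^2) plus f_in(w) = lam*w^2/2 in (2).
w_hat = Ridge(alpha=lam * sigma_d ** 2, fit_intercept=False).fit(X, y).coef_
y_hat = X_ts @ w_hat
mse_db = 10 * np.log10(np.mean((y_hat - y_ts) ** 2) / np.mean(y_ts ** 2))
print("Test MSE (dB):", mse_db)
```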

Figure 2: Test error for under-regularized linear regression under various train and test distributions

In all three cases in Fig. 2, the SE theory exactly matches the simulated values of the test MSE. Note that the case of matched training and test distributions for this problem was studied in (Hastie et al., 2019; Mei & Montanari, 2019; Montanari et al., 2019), and we see the double descent phenomenon described in their work. Specifically, with highly under-regularized linear regression, the test MSE actually increases with more samples $n$ in the over-parameterized regime ($n/p<1$) and then decreases again in the under-parameterized regime ($n/p>1$).

Our SE theory can also provide predictions for the correlated feature case. In this particular setting, we see that in the correlated case the test error is slightly lower in the over-parameterized regime, since the energy of the data is concentrated in a smaller subspace. Interestingly, there is minimal difference between the correlated and i.i.d. cases in the under-parameterized regime when the training and test data match. When the training and test data are not matched, the test error increases. In all cases, the SE theory accurately predicts these effects.

Figure 3: Classification error rate with logistic regression under various train and test distributions

Logistic Regression.

Fig. 3 shows a similar plot as Fig. 2 for a logistic model. Specifically, we use a logistic output $P(y=1)=1/(1+e^{-p})$, a binary cross-entropy output loss $f_{\rm out}(y,p)$, and an $\ell_{2}$-regularization level $\lambda$ such that the output corresponds to the MAP estimate (we do not perform ridgeless regression in this case). The means of the training and test eigenvalues, $\mathbb{E}S_{\rm tr}^{2}=\mathbb{E}S_{\rm ts}^{2}$, are selected such that, if the true coefficients $\bm{w}^{0}$ were known, we could obtain a 5% prediction error. As in the linear case, we generate random instances of the model, use the sklearn package to perform the logistic regression, and evaluate the estimates on 1000 new test samples. We compute the median error rate ($1-$accuracy) and compare the simulated values with the SE theoretical estimates. The i.i.d. case was considered in (Salehi et al., 2019). Fig. 3 shows that our SE theory predicts the test error rate exactly in the i.i.d. case as well as in a correlated case and a case with training and test mismatch.
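A single-trial sketch of this experiment for the i.i.d. case is shown below; the feature scaling and regularization level are illustrative rather than tuned to the 5% oracle error rate described above.

```python
# Single trial of the logistic-regression experiment for i.i.d. features
# (case 1); the feature scaling and lambda are illustrative rather than tuned
# to the 5% oracle error rate.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
p, n, n_ts, lam = 500, 1000, 1000, 1e-2

w0 = rng.normal(size=p)
scale = 3.0                                      # controls the oracle error rate

def draw(num):
    X = scale * rng.normal(size=(num, p)) / np.sqrt(p)
    prob = 1.0 / (1.0 + np.exp(-(X @ w0)))       # P(y = 1) = 1/(1 + e^{-p})
    return X, (rng.uniform(size=num) < prob).astype(int)

X, y = draw(n)
X_ts, y_ts = draw(n_ts)

# L2-penalized logistic regression; sklearn's objective is equivalent to the
# binary cross-entropy loss plus (1/(2C))||w||^2, so C = 1/lam matches (2).
clf = LogisticRegression(C=1.0 / lam, penalty="l2", fit_intercept=False,
                         max_iter=1000).fit(X, y)
print("test error rate:", np.mean(clf.predict(X_ts) != y_ts))
```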

Figure 4: Test MSE under a non-linear least square estimation.

Nonlinear Regression.

The SE framework can also consider non-convex problems. As an example, we consider a non-linear regression problem where the output function is

$\phi_{\rm out}(p,d)=\tanh(p)+d,\quad d\sim{\mathcal{N}}(0,\sigma^{2}_{d}).$ (34)

The $\tanh(p)$ models saturation in the output. Corresponding to this output, we use a non-linear MSE output loss

$f_{\rm out}(y,p)=\frac{1}{2\sigma^{2}_{d}}(y-\tanh(p))^{2}.$ (35)

This output loss is non-convex. We scale the data matrix so that the input satisfies $\mathbb{E}(p^{2})=9$, which drives $\tanh(p)$ well into the non-linear regime. We also take $\sigma^{2}_{d}=0.01$.

For the simulation, the non-convex loss is minimized using TensorFlow, where the non-linear model is described as a two-layer model. We use the ADAM optimizer (Kingma & Ba, 2014) with 200 epochs to approach a local minimum of the objective (2). Fig. 4 plots the median test MSE of the estimate along with the SE theoretical test MSE. We again see that the SE theory is able to predict the test MSE in all cases, even for this non-convex problem.
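The sketch below illustrates the same non-convex fit using plain NumPy gradient descent in place of the TensorFlow/ADAM setup; the step size, regularization level, and iteration count are illustrative.

```python
# Sketch of the non-linear least-squares fit: NumPy gradient descent on the
# non-convex objective (2) with the loss (35) and L2 regularization, in place
# of the TensorFlow/ADAM setup described above; all constants are illustrative.
import numpy as np

rng = np.random.default_rng(7)
p, n, n_ts = 500, 1000, 1000
sigma_d2, lam, lr, iters = 0.01, 1e-3, 0.1, 5000

w0 = rng.normal(size=p)

def draw(num):
    X = rng.normal(size=(num, p)) / np.sqrt(p)
    X *= 3.0 / np.sqrt(np.mean((X @ w0) ** 2))    # scale so that E(p^2) = 9
    return X, np.tanh(X @ w0) + np.sqrt(sigma_d2) * rng.normal(size=num)

X, y = draw(n)
X_ts, y_ts = draw(n_ts)

w = np.zeros(p)
for _ in range(iters):
    t = np.tanh(X @ w)
    # gradient of sum_n (y_n - tanh(z_n))^2/(2 sigma_d^2) + lam*||w||^2/2
    grad = X.T @ ((t - y) * (1.0 - t ** 2)) / sigma_d2 + lam * w
    w -= (lr / n) * grad

y_hat = np.tanh(X_ts @ w)
print("Test MSE (dB):",
      10 * np.log10(np.mean((y_hat - y_ts) ** 2) / np.mean(y_ts ** 2)))
```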

6 Conclusions

In this paper we provide a procedure for exactly computing the asymptotic generalization error of a solution in a generalized linear model (GLM). This procedure is based on scalar quantities which are fixed points of a recursive iteration. The formula holds for a large class of generalization metrics, loss functions, and regularization schemes. Our formula allows analysis of important modeling effects such as (i) overparameterization, (ii) dependence between covariates, and (iii) mismatch between train and test distributions, which play a significant role in the analysis and design of machine learning systems. We experimentally validate our theoretical results for linear as well as non-linear regression and logistic regression, where a strong agreement is seen between our formula and simulated results.

References

  • Advani & Saxe (2017) Advani, M. S. and Saxe, A. M. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.
  • Allen-Zhu et al. (2019) Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in Neural Information Processing Systems, pp. 6155–6166, 2019.
  • Arora et al. (2019) Arora, S., Du, S. S., Hu, W., Li, Z., and Wang, R. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
  • Barbier et al. (2019) Barbier, J., Krzakala, F., Macris, N., Miolane, L., and Zdeborová, L. Optimal errors and phase transitions in high-dimensional generalized linear models. Proc. National Academy of Sciences, 116(12):5451–5460, 2019.
  • Bartlett et al. (2019) Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. Benign overfitting in linear regression. arXiv preprint arXiv:1906.11300, 2019.
  • Bayati & Montanari (2011) Bayati, M. and Montanari, A. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Trans. Inform. Theory, 57(2):764–785, February 2011.
  • Belkin et al. (2018) Belkin, M., Ma, S., and Mandal, S. To understand deep learning we need to understand kernel learning. arXiv preprint arXiv:1802.01396, 2018.
  • Belkin et al. (2019a) Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proc. National Academy of Sciences, 116(32):15849–15854, 2019a.
  • Belkin et al. (2019b) Belkin, M., Hsu, D., and Xu, J. Two models of double descent for weak features. arXiv preprint arXiv:1903.07571, 2019b.
  • Cakmak et al. (2014) Cakmak, B., Winther, O., and Fleury, B. H. S-AMP: Approximate message passing for general matrix ensembles. In Proc. IEEE ITW, 2014.
  • Daniely (2017) Daniely, A. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pp. 2422–2430, 2017.
  • Daniely et al. (2016) Daniely, A., Frostig, R., and Singer, Y. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances In Neural Information Processing Systems, pp. 2253–2261, 2016.
  • Deng et al. (2019) Deng, Z., Kammoun, A., and Thrampoulidis, C. A model of double descent for high-dimensional binary linear classification. arXiv preprint arXiv:1911.05822, 2019.
  • Donoho et al. (2009) Donoho, D. L., Maleki, A., and Montanari, A. Message-passing algorithms for compressed sensing. Proc. National Academy of Sciences, 106(45):18914–18919, 2009.
  • Donoho et al. (2010) Donoho, D. L., Maleki, A., and Montanari, A. Message passing algorithms for compressed sensing. In Proc. Inform. Theory Workshop, pp.  1–5, 2010.
  • Du et al. (2018) Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
  • Fletcher et al. (2016) Fletcher, A., Sahraee-Ardakan, M., Rangan, S., and Schniter, P. Expectation consistent approximate inference: Generalizations and convergence. In Proc. IEEE Int. Symp. Information Theory (ISIT), pp. 190–194. IEEE, 2016.
  • Fletcher et al. (2018) Fletcher, A. K., Rangan, S., and Schniter, P. Inference in deep networks in high dimensions. Proc. IEEE Int. Symp. Information Theory, 2018.
  • Gabrié et al. (2018) Gabrié, M., Manoel, A., Luneau, C., Barbier, J., Macris, N., Krzakala, F., and Zdeborová, L. Entropy and mutual information in models of deep neural networks. In Proc. NIPS, 2018.
  • Gerbelot et al. (2020) Gerbelot, C., Abbara, A., and Krzakala, F. Asymptotic errors for convex penalized linear regression beyond gaussian matrices. arXiv preprint arXiv:2002.04372, 2020.
  • Givens et al. (1984) Givens, C. R., Shortt, R. M., et al. A class of Wasserstein metrics for probability distributions. The Michigan Mathematical Journal, 31(2):231–240, 1984.
  • Hastie et al. (2019) Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.
  • He et al. (2017) He, H., Wen, C.-K., and Jin, S. Generalized expectation consistent signal recovery for nonlinear measurements. In 2017 IEEE International Symposium on Information Theory (ISIT), pp.  2333–2337. IEEE, 2017.
  • Jacot et al. (2018) Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pp. 8571–8580, 2018.
  • Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Ma & Ping (2017) Ma, J. and Ping, L. Orthogonal AMP. IEEE Access, 5:2020–2033, 2017.
  • Manoel et al. (2018) Manoel, A., Krzakala, F., Varoquaux, G., Thirion, B., and Zdeborová, L. Approximate message-passing for convex optimization with non-separable penalties. arXiv preprint arXiv:1809.06304, 2018.
  • Mei & Montanari (2019) Mei, S. and Montanari, A. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.
  • Montanari et al. (2019) Montanari, A., Ruan, F., Sohn, Y., and Yan, J. The generalization error of max-margin linear classifiers: High-dimensional asymptotics in the overparametrized regime. arXiv preprint arXiv:1911.01544, 2019.
  • Muthukumar et al. (2019) Muthukumar, V., Vodrahalli, K., and Sahai, A. Harmless interpolation of noisy data in regression. In 2019 IEEE International Symposium on Information Theory (ISIT), pp.  2299–2303. IEEE, 2019.
  • Neyshabur et al. (2018) Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., and Srebro, N. Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076, 2018.
  • Opper & Winther (2005) Opper, M. and Winther, O. Expectation consistent approximate inference. Journal of Machine Learning Research, 6(Dec):2177–2204, 2005.
  • Pandit et al. (2019) Pandit, P., Sahraee, M., Rangan, S., and Fletcher, A. K. Asymptotics of MAP inference in deep networks. In Proc. IEEE Int. Symp. Information Theory, pp.  842–846, 2019.
  • Pandit et al. (2019) Pandit, P., Sahraee-Ardakan, M., Rangan, S., Schniter, P., and Fletcher, A. K. Inference with deep generative priors in high dimensions. arXiv preprint arXiv:1911.03409, 2019.
  • Rangan et al. (2019) Rangan, S., Schniter, P., and Fletcher, A. K. Vector approximate message passing. IEEE Trans. Information Theory, 65(10):6664–6684, 2019.
  • Reeves (2017) Reeves, G. Additivity of information in multilayer networks via additive gaussian noise transforms. In Proc. 55th Annual Allerton Conf. Communication, Control, and Computing (Allerton), pp.  1064–1070. IEEE, 2017.
  • Salehi et al. (2019) Salehi, F., Abbasi, E., and Hassibi, B. The impact of regularization on high-dimensional logistic regression. arXiv preprint arXiv:1906.03761, 2019.
  • Tulino et al. (2004) Tulino, A. M., Verdú, S., et al. Random matrix theory and wireless communications. Foundations and Trends® in Communications and Information Theory, 1(1):1–182, 2004.
  • Villani (2008) Villani, C. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
  • Zhang et al. (2016) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Appendix A Empirical Convergence of Vector Sequences

The LSL model in Section 2 and our main result in Section 4 require certain technical definitions.

Definition 1 (Pseudo-Lipschitz continuity).

For a given $p\geq 1$, a function $\mathbf{f}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{m}$ is called pseudo-Lipschitz of order $p$ if

$\|\mathbf{f}(\bm{x}_{1})-\mathbf{f}(\bm{x}_{2})\|\leq C\|\bm{x}_{1}-\bm{x}_{2}\|\left(1+\|\bm{x}_{1}\|^{p-1}+\|\bm{x}_{2}\|^{p-1}\right)$ (36)

for some constant $C>0$.

Observe that for $p=1$, pseudo-Lipschitz continuity is equivalent to the standard definition of Lipschitz continuity.

Definition 2 (Uniform Lipschitz continuity).

Let ${\bm{\phi}}(\bm{x},\theta)$ be a function of $\bm{x}\in{\mathbb{R}}^{d}$ and $\theta\in{\mathbb{R}}^{s}$. We say that ${\bm{\phi}}(\bm{x},\theta)$ is uniformly Lipschitz continuous in $\bm{x}$ at $\theta={\overline{\theta}}$ if there exist constants $L_{1},L_{2}\geq 0$ and an open neighborhood $U$ of ${\overline{\theta}}$ such that

$\|{\bm{\phi}}(\bm{x}_{1},\theta)-{\bm{\phi}}(\bm{x}_{2},\theta)\|\leq L_{1}\|\bm{x}_{1}-\bm{x}_{2}\|$ (37)

for all $\bm{x}_{1},\bm{x}_{2}\in{\mathbb{R}}^{d}$ and $\theta\in U$; and

$\|{\bm{\phi}}(\bm{x},\theta_{1})-{\bm{\phi}}(\bm{x},\theta_{2})\|\leq L_{2}\left(1+\|\bm{x}\|\right)\|\theta_{1}-\theta_{2}\|,$ (38)

for all $\bm{x}\in{\mathbb{R}}^{d}$ and $\theta_{1},\theta_{2}\in U$.

Definition 3 (Empirical convergence of a sequence).

Consider a sequence of vectors $\bm{x}(N)=\{\bm{x}_{n}(N)\}_{n=1}^{N}$ with $\bm{x}_{n}(N)\in\mathbb{R}^{d}$. Thus, each $\bm{x}(N)$ is a block vector with a total of $Nd$ components. For a finite $p\geq 1$, we say that the vector sequence $\bm{x}(N)$ converges empirically with $p$-th order moments if there exists a random variable $X\in\mathbb{R}^{d}$ such that

(i) $\mathbb{E}\|X\|_{p}^{p}<\infty$; and

(ii) for any $f:{\mathbb{R}}^{d}\rightarrow{\mathbb{R}}$ that is pseudo-Lipschitz continuous of order $p$,

$\lim_{N\rightarrow\infty}\tfrac{1}{N}\sum_{n=1}^{N}f(\bm{x}_{n}(N))=\mathbb{E}\left[f(X)\right].$ (39)

In this case, with some abuse of notation, we will write

$\lim_{N\rightarrow\infty}\left\{\bm{x}_{n}\right\}\stackrel{PL(p)}{=}X,$ (40)

where we have omitted the dependence on $N$ in $\bm{x}_{n}(N)$. We note that the sequence $\{\bm{x}(N)\}$ can be random or deterministic. If it is random, we require that for every pseudo-Lipschitz function $f(\cdot)$ the limit (39) holds almost surely. In particular, if $\bm{x}_{n}\sim X$ are i.i.d. and $\mathbb{E}\|X\|^{p}_{p}<\infty$, then $\bm{x}$ empirically converges to $X$ with $p$-th order moments.

$PL(p)$ convergence is equivalent to weak convergence plus convergence of the $p$-th moment (Bayati & Montanari, 2011), and hence $PL(p)$ convergence is also equivalent to convergence in the Wasserstein-$p$ metric (see Chapter 6 of (Villani, 2008)). We use this fact later in proving Theorem 1.
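As a quick numerical illustration of Definition 3, the sketch below checks the empirical-average limit (39) for an i.i.d. Gaussian sequence and the pseudo-Lipschitz function $f(x)=x^{2}$ of order 2; the specific distribution and function are illustrative.

```python
# Numerical illustration of Definition 3: for i.i.d. x_n ~ X with finite second
# moment and a pseudo-Lipschitz f of order 2, the empirical average converges
# to E f(X). Here X ~ N(0, 1) and f(x) = x^2, so E f(X) = 1.
import numpy as np

rng = np.random.default_rng(8)
f = lambda x: x ** 2
for N in (10**2, 10**4, 10**6):
    x = rng.normal(size=N)
    print(N, np.mean(f(x)))      # approaches E f(X) = 1 as N grows
```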

Appendix B ML-VAMP Denoisers Details

Related to $\mathbf{S}_{\rm mp}$ and $\mathbf{s}_{\rm mp}$ from equation (11), we need to define two quantities $\mathbf{s}_{\rm mp}^{+}\in\mathbb{R}^{N}$ and $\mathbf{s}_{\rm mp}^{-}\in\mathbb{R}^{p}$ that are zero-padded versions of the singular values $\mathbf{s}_{\rm mp}$, so that for $n>\min\{N,p\}$ we set $s^{\pm}_{{\rm mp},n}=0$. Observe that $(\mathbf{s}_{\rm mp}^{+})^{2}$ are the eigenvalues of $\mathbf{U}\mathbf{U}^{\sf T}$ whereas $(\mathbf{s}_{\rm mp}^{-})^{2}$ are the eigenvalues of $\mathbf{U}^{\sf T}\mathbf{U}$. Since $\mathbf{s}_{\rm mp}$ empirically converges to $S_{\rm mp}$ as given in (12), the vector $\mathbf{s}_{\rm mp}^{+}$ empirically converges to a random variable $S_{\rm mp}^{+}$ and the vector $\mathbf{s}_{\rm mp}^{-}$ empirically converges to a random variable $S_{\rm mp}^{-}$, where a mass is placed at $0$ appropriately. Specifically, $S_{\rm mp}^{+}$ has a point mass of $(1-\beta)_{+}\delta_{\{0\}}$ when $\beta<1$, whereas $S_{\rm mp}^{-}$ has a point mass of $(1-\frac{1}{\beta})_{+}\delta_{\{0\}}$ when $\beta>1$. In Appendix H (eqn. (113)), we provide the densities over the positive parts of $S_{\rm mp}^{+}$ and $S_{\rm mp}^{-}$.

A key property of our analysis will be that the non-linear functions (20) and the denoisers 𝐠±()\mathbf{g}_{\ell}^{\pm}(\cdot) have simple forms.

Non-linear functions ϕ(){\bm{\phi}}_{\ell}(\cdot): The non-linear functions all act componentwise. For example, for ϕ1(){\bm{\phi}}_{1}(\cdot) in (20), we have

𝐳1=ϕ1(𝐩0,𝐬tr)=diag(𝐬tr)𝐩0z1,n=ϕ1(p0,n,str,n),\displaystyle\mathbf{z}_{1}={\bm{\phi}}_{1}(\mathbf{p}_{0},\mathbf{s}_{\rm tr})=\mathrm{diag}(\mathbf{s}_{\rm tr})\mathbf{p}_{0}\Longleftrightarrow z_{1,n}=\phi_{1}(p_{0,n},s_{{\rm tr},n}),

where ϕ1()\phi_{1}(\cdot) is the scalar-valued function,

ϕ1(p0,s)=sp0.\phi_{1}(p_{0},s)=sp_{0}. (41)

Similarly, for ϕ2(){\bm{\phi}}_{2}(\cdot),

𝐳2=ϕ2(𝐩1,𝐬mp+)⟺z2,n=ϕ2(p¯1,n,smp,n+),n=1,…,N\displaystyle\mathbf{z}_{2}={\bm{\phi}}_{2}(\mathbf{p}_{1},\mathbf{s}^{+}_{\rm mp})\Longleftrightarrow z_{2,n}=\phi_{2}(\overline{p}_{1,n},{s}^{+}_{{\rm mp},n}),\quad n=1,\ldots,N

where 𝐩¯1N\overline{\mathbf{p}}_{1}\in\mathbb{R}^{N} is the zero-padded version of 𝐩1\mathbf{p}_{1}, and

ϕ2(p1,s)=sp1.\phi_{2}(p_{1},s)=s\,p_{1}. (42)

Finally, the function ϕ3(){\bm{\phi}}_{3}(\cdot) in (20) acts componentwise with

ϕ3(p2,d)=ϕout(p2,d).\phi_{3}(p_{2},d)=\phi_{\rm out}(p_{2},d). (43)

Input denoiser 𝐠0+()\mathbf{g}_{0}^{+}(\cdot): Since F0(𝐳0)=Fin(𝐳0)F_{0}(\mathbf{z}_{0})=F_{\rm in}(\mathbf{z}_{0}), with Fin()F_{\rm in}(\cdot) given in (6), the denoiser (25a) acts componentwise in that,

𝐳^0=𝐠0+(𝐫0,γ0)z^0,n=g0+(r0,n,γ0),\widehat{\mathbf{z}}_{0}=\mathbf{g}_{0}^{+}(\mathbf{r}_{0}^{-},\gamma_{0}^{-})\Longleftrightarrow\widehat{z}_{0,n}=g_{0}^{+}(r_{0,n}^{-},\gamma_{0}^{-}),

where g0+()g_{0}^{+}(\cdot) is the scalar-valued function,

g0+(r0,γ0):=argminz0fin(z0)+γ02(z0r0)2.g_{0}^{+}(r_{0}^{-},\gamma_{0}^{-}):=\operatorname*{argmin}_{z_{0}}\ f_{\rm in}(z_{0})+\frac{\gamma_{0}^{-}}{2}(z_{0}-r_{0}^{-})^{2}. (44)

Thus, the vector optimization in (25a) reduces to a set of scalar optimizations (44) on each component.
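
For instance, for a ridge penalty fin(z)=(λ/2)z² the scalar minimization (44) is g0+(r,γ)=γr/(γ+λ), while for an ℓ1 penalty fin(z)=λ|z| it is the soft-threshold. The sketch below (our own illustration; these two penalties are only examples) evaluates (44) componentwise.

```python
import numpy as np

def g0_plus_ridge(r, gamma, lam):
    # argmin_z (lam/2) z^2 + (gamma/2)(z - r)^2  =  gamma r / (gamma + lam)
    return gamma * r / (gamma + lam)

def g0_plus_l1(r, gamma, lam):
    # argmin_z lam |z| + (gamma/2)(z - r)^2  =  soft-threshold at level lam/gamma
    return np.sign(r) * np.maximum(np.abs(r) - lam / gamma, 0.0)

r = np.array([-2.0, -0.1, 0.05, 1.5])   # components of r_0^-
print(g0_plus_ridge(r, gamma=2.0, lam=1.0))
print(g0_plus_l1(r, gamma=2.0, lam=1.0))
```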

Output denoiser 𝐠3−()\mathbf{g}_{3}^{-}(\cdot): The output penalty is F3(𝐩2,𝒚)=Fout(𝐩2,𝒚)F_{3}(\mathbf{p}_{2},\bm{y})=F_{\rm out}(\mathbf{p}_{2},\bm{y}), where Fout(𝐩2,𝒚)F_{\rm out}(\mathbf{p}_{2},\bm{y}) has the separable form (6). Thus, similarly to 𝐠0+()\mathbf{g}_{0}^{+}(\cdot), the denoiser 𝐠3−()\mathbf{g}_{3}^{-}(\cdot) in (25b) also acts componentwise with the function,

g3(r2+,γ2+,y):=argminp2fout(p2,y)+γ2+2(p2r2+)2.\displaystyle g_{3}^{-}(r_{2}^{+},\gamma_{2}^{+},y):=\operatorname*{argmin}_{p_{2}}\ f_{\rm out}(p_{2},y)+\tfrac{\gamma_{2}^{+}}{2}(p_{2}-r_{2}^{+})^{2}. (45)

Linear denoiser 𝐠1±()\mathbf{g}_{1}^{\pm}(\cdot): The expressions for both denoisers g1±g_{1}^{\pm} and g2±g_{2}^{\pm} are very similar and can be explained together. The penalty F1()F_{1}(\cdot) restricts 𝐳1=𝐒tr𝐩0\mathbf{z}_{1}=\mathbf{S}_{\rm tr}\mathbf{p}_{0}, where 𝐒tr\mathbf{S}_{\rm tr} is a square matrix. Hence, for =1,\ell=1, the minimization in (27) is given by,

𝐩^0:=argmin𝐩0γ0+2𝐩0𝐫0+2+γ12𝐒tr𝐩0𝐫12,\displaystyle\widehat{\mathbf{p}}_{0}:=\operatorname*{argmin}_{\mathbf{p}_{0}}\ \tfrac{\gamma_{0}^{+}}{2}\|\mathbf{p}_{0}-\mathbf{r}_{0}^{+}\|^{2}+\tfrac{\gamma_{1}^{-}}{2}\|\mathbf{S}_{\rm tr}\mathbf{p}_{0}-\mathbf{r}_{1}^{-}\|^{2}, (46)

and 𝐳^1=𝐒tr𝐩^0\widehat{\mathbf{z}}_{1}=\mathbf{S}_{\rm tr}\widehat{\mathbf{p}}_{0}. This is a simple quadratic minimization and the components of 𝐩^0\widehat{\mathbf{p}}_{0} and 𝐳^1\widehat{\mathbf{z}}_{1} are given by

p^0,n\displaystyle\widehat{p}_{0,n} =g1(r0,n+,r1,n,γ0+,γ1,str,n)\displaystyle=g_{1}^{-}(r_{0,n}^{+},r_{1,n}^{-},\gamma_{0}^{+},\gamma^{-}_{1},s_{{\rm tr},n})
z^1,n\displaystyle\widehat{z}_{1,n} =g1+(r0,n+,r1,n,γ0+,γ1,str,n),\displaystyle=g_{1}^{+}(r_{0,n}^{+},r_{1,n}^{-},\gamma_{0}^{+},\gamma^{-}_{1},s_{{\rm tr},n}),

where

g1(r0+,r1,γ0+,γ1,s)\displaystyle g_{1}^{-}(r_{0}^{+},r_{1}^{-},\gamma_{0}^{+},\gamma^{-}_{1},s) :=γ0+r0++sγ1r1γ0++s2γ1\displaystyle:=\frac{\gamma^{+}_{0}r^{+}_{0}+s\gamma^{-}_{1}r^{-}_{1}}{\gamma^{+}_{0}+s^{2}\gamma^{-}_{1}} (47a)
g1+(r0+,r1,γ0+,γ1,s)\displaystyle g_{1}^{+}(r_{0}^{+},r_{1}^{-},\gamma_{0}^{+},\gamma^{-}_{1},s) :=s(γ0+r0++sγ1r1)γ0++s2γ1\displaystyle:=\frac{s(\gamma^{+}_{0}r^{+}_{0}+s\gamma^{-}_{1}r^{-}_{1})}{\gamma^{+}_{0}+s^{2}\gamma^{-}_{1}} (47b)
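
As a quick check of (47a)–(47b), the following sketch (an illustration of ours with arbitrary numbers) compares the closed-form expressions against a direct numerical minimization of the scalar version of (46).

```python
import numpy as np

def g1_minus(r0p, r1m, g0p, g1m, s):
    return (g0p * r0p + s * g1m * r1m) / (g0p + s**2 * g1m)   # (47a)

def g1_plus(r0p, r1m, g0p, g1m, s):
    return s * g1_minus(r0p, r1m, g0p, g1m, s)                # (47b)

r0p, r1m, g0p, g1m, s = 0.7, -1.3, 2.0, 0.5, 1.8

# Direct minimization of the scalar version of the objective in (46) on a fine grid.
p = np.linspace(-5, 5, 200001)
obj = 0.5 * g0p * (p - r0p)**2 + 0.5 * g1m * (s * p - r1m)**2
p_hat = p[np.argmin(obj)]

print(p_hat, g1_minus(r0p, r1m, g0p, g1m, s))     # the p-hat values agree
print(s * p_hat, g1_plus(r0p, r1m, g0p, g1m, s))  # the z-hat values agree
```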

Linear denoiser 𝐠2±()\mathbf{g}_{2}^{\pm}(\cdot): This denoiser is identical to the case 𝐠1±()\mathbf{g}_{1}^{\pm}(\cdot) in that we need to impose the linear constraint 𝐳2=𝐒mp𝐩1\mathbf{z}_{2}=\mathbf{S}_{\rm mp}\mathbf{p}_{1}. However, 𝐒mp\mathbf{S}_{\rm{mp}} is in general a rectangular matrix, and the two resulting cases β1\beta\lessgtr 1 need to be treated separately.

Recall the definitions of vectors 𝐬mp+\mathbf{s}_{\rm{mp}}^{+} and 𝐬mp\mathbf{s}_{\rm{mp}}^{-} at the beginning of this section. Then, for =2\ell=2, with the penalty F2(𝐩1,𝐳2)=δ{𝐳2=𝐒mp𝐩1}F_{2}(\mathbf{p}_{1},\mathbf{z}_{2})=\delta_{\{\mathbf{z}_{2}=\mathbf{S}_{\rm mp}\mathbf{p}_{1}\}}, the solution to (27) has components,

p^1,n\displaystyle\widehat{p}_{1,n} =g2(r1,n+,r2,n,γ1+,γ2+,smp,n)\displaystyle=g_{2}^{-}(r_{1,n}^{+},r_{2,n}^{-},\gamma_{1}^{+},\gamma^{+}_{2},s^{-}_{{\rm mp},n}) (48a)
z^2,n\displaystyle\widehat{z}_{2,n} =g2+(r1,n+,r2,n,γ1+,γ2+,smp,n+),\displaystyle=g_{2}^{+}(r_{1,n}^{+},r_{2,n}^{-},\gamma_{1}^{+},\gamma^{+}_{2},s^{+}_{{\rm mp},n}), (48b)

with the identical functions g2=g1g_{2}^{-}=g_{1}^{-} and g2+=g1+g_{2}^{+}=g_{1}^{+} as given by (47a) and (47b). Note that in (48a), n=1,,pn=1,\ldots,p and in (48b), n=1,,Nn=1,\ldots,N.

Appendix C State Evolution Analysis of ML-VAMP

A key property of the ML-VAMP algorithm is that its performance in the large-scale limit (LSL) can be exactly described by a scalar equivalent system, in which the vector-valued outputs of the algorithm are replaced by scalar random variables representing the typical behavior of the components of those vectors. Each of these random variables is described by a set of parameters, and the parameters are given by a set of deterministic equations called the state evolution (SE).

Algorithm 2 SE for ML-VAMP for GLM Learning
1:  // Initial
2:  Initialize γ¯0=γ0\overline{\gamma}^{-}_{0\ell}=\gamma_{0\ell}^{-} from Algorithm 1.
3:  Q0𝒩(0,τ0)Q^{-}_{0\ell}\sim\mathcal{N}(0,\tau_{0\ell}^{-}) for some τ0>0\tau_{0\ell}^{-}>0 for =0,1,2\ell=0,1,2
4:  Z00=W0Z^{0}_{0}=W^{0}
5:  for =0,,L1\ell=0,\ldots,L\!-\!1 do
6:     P0=𝒩(0,τ0),τ0=var(Z0)P^{0}_{\ell}={\mathcal{N}}(0,\tau^{0}_{\ell}),\quad\tau^{0}_{\ell}=\mathrm{var}(Z^{0}_{\ell})
7:     Z+10=ϕ+1(P0,Ξ+1)Z^{0}_{\ell\!+\!1}=\phi_{\ell\!+\!1}(P^{0}_{\ell},\Xi_{\ell\!+\!1})
8:  end for
9:  
10:  for k=0,1,k=0,1,\dots do
11:     // Forward Pass
12:     for =0,,L1\ell=0,\ldots,L-1 do
13:        if =0\ell=0 then
14:           Rk0=Z0+Qk0R^{-}_{k0}=Z^{0}_{\ell}+Q^{-}_{k0}
15:           Z^k0=g0+(Rk0,γ¯k0)\widehat{Z}_{k0}=g^{+}_{0}(R^{-}_{k0},\overline{\gamma}^{-}_{k0})
16:        else
17:           Rk,1+=P10+Pk,1+R^{+}_{k,\ell\!-\!1}=P^{0}_{\ell\!-\!1}+P^{+}_{k,\ell\!-\!1}, Rk=Z0+QkR^{-}_{k\ell}=Z^{0}_{\ell}+Q^{-}_{k\ell}
18:           Z^k=g+(Rk,1+,Rk,γ¯k,1+,γ¯k,Ξ)\widehat{Z}_{k\ell}=g^{+}_{\ell}(R^{+}_{k,\ell\!-\!1},R^{-}_{k\ell},\overline{\gamma}^{+}_{k,\ell\!-\!1},\overline{\gamma}^{-}_{k\ell},\Xi_{\ell})
19:        end if
20:        α¯k+=𝔼Z^k/Qk\overline{\alpha}_{k\ell}^{+}=\mathbb{E}\partial\widehat{Z}_{k\ell}/\partial Q_{k\ell}^{-}
21:        Qk+=Z^kZ0α¯k+Qk1α¯k+\displaystyle Q_{k\ell}^{+}=\frac{\widehat{Z}_{k\ell}-Z^{0}_{\ell}-\overline{\alpha}_{k\ell}^{+}Q_{k\ell}^{-}}{1-\overline{\alpha}^{+}_{k\ell}}
22:        γ¯k+=(1α¯k+1)γ¯k\overline{\gamma}_{k\ell}^{+}=(\tfrac{1}{\overline{\alpha}_{k\ell}^{+}}-1)\overline{\gamma}_{k\ell}^{-}
23:        (P0,Pk+)𝒩(0,𝐊k+),𝐊k+=cov(Z0,Qk+)(P^{0}_{\ell},P^{+}_{k\ell})\sim{\mathcal{N}}(0,\mathbf{K}_{k\ell}^{+}),~{}\mathbf{K}_{k\ell}^{+}=\mathrm{cov}(Z^{0}_{\ell},Q^{+}_{k\ell})
24:     end for
25:     
26:     // Backward Pass
27:     for =L,,1\ell=L,\ldots,1 do
28:        if =L\ell=L then
29:           Rk,L1+=PL10+Pk,L1+R^{+}_{k,L\!-\!1}=P^{0}_{L\!-\!1}+P^{+}_{k,L\!-\!1}
30:           P^k,L1=gL(Rk,L1+,γ¯k,L1+,ZL0)\widehat{P}_{k,L\!-\!1}=g_{L}^{-}(R^{+}_{k,L\!-\!1},\overline{\gamma}^{+}_{k,L\!-\!1},Z^{0}_{L})
31:        else
32:           Rk,1+=P10+Pk,1+R^{+}_{k,\ell\!-\!1}=P^{0}_{\ell\!-\!1}+P^{+}_{k,\ell\!-\!1}, Rk+1,=Z0+Qk+1,R^{-}_{k\!+\!1,\ell}=Z^{0}_{\ell}+Q^{-}_{k\!+\!1,\ell}
33:           P^k,1=g(Rk,1+,Rk+1,,γ¯k,1+,γ¯k+1,,Ξ)\widehat{P}_{k,\ell\!-\!1}=g_{\ell}^{-}(R^{+}_{k,\ell\!-\!1},R^{-}_{k\!+\!1,\ell},\overline{\gamma}^{+}_{k,\ell\!-\!1},\overline{\gamma}^{-}_{k\!+\!1,\ell},\Xi_{\ell})
34:        end if
35:        α¯k,1=𝔼P^k,1/Pk,1+\overline{\alpha}_{k,\ell\!-\!1}^{-}=\mathbb{E}\partial\widehat{P}_{k,\ell\!-\!1}/\partial P_{k,\ell\!-\!1}^{+}
36:        Pk+1,1=P^k,1P10α¯k,1Pk,1+1α¯k,1\displaystyle P_{k\!+\!1,\ell\!-\!1}^{-}=\frac{\widehat{P}_{k,\ell\!-\!1}-P^{0}_{\ell\!-\!1}-\overline{\alpha}_{k,\ell\!-\!1}^{-}P_{k,\ell\!-\!1}^{+}}{1-\overline{\alpha}^{-}_{k,\ell\!-\!1}}
37:        γ¯k+1,1=(1α¯k,11)γ¯k,1+\overline{\gamma}_{k\!+\!1,\ell\!-\!1}^{-}=(\tfrac{1}{\overline{\alpha}_{k,\ell\!-\!1}^{-}}-1)\overline{\gamma}_{k,\ell\!-\!1}^{+}
38:        Qk+1,ℓ−1−∼𝒩(0,τk+1,ℓ−1−)Q^{-}_{k\!+\!1,\ell\!-\!1}\sim{\mathcal{N}}(0,\tau_{k\!+\!1,\ell\!-\!1}^{-}), τk+1,ℓ−1−=𝔼(Pk+1,ℓ−1−)2\tau_{k\!+\!1,\ell\!-\!1}^{-}=\mathbb{E}(P^{-}_{k\!+\!1,\ell\!-\!1})^{2}
39:     end for
40:  end for

The SE for the general ML-VAMP algorithm is derived in (Pandit et al., 2019); the special case of the updates for ML-VAMP for GLM learning is shown in Algorithm 2, with details of the functions 𝐠ℓ±\mathbf{g}_{\ell}^{\pm} given in Appendix B. The SE updates in Algorithm 2 parallel those in the ML-VAMP algorithm, Algo. 1, except that vector quantities such as 𝐳^kℓ\widehat{\mathbf{z}}_{k\ell}, 𝐩^kℓ\widehat{\mathbf{p}}_{k\ell}, 𝐫kℓ+\mathbf{r}_{k\ell}^{+} and 𝐫kℓ−\mathbf{r}_{k\ell}^{-} are replaced by scalar random variables such as Z^kℓ\widehat{Z}_{k\ell}, P^kℓ\widehat{P}_{k\ell}, Rkℓ+R_{k\ell}^{+} and Rkℓ−R_{k\ell}^{-}. Each of these random variables is described by deterministic parameters such as 𝐊kℓ∈ℝ≻02×2\mathbf{K}_{k\ell}\in\mathbb{R}^{2\times 2}_{\succ 0}, τℓ0\tau^{0}_{\ell} and τkℓ−∈ℝ+\tau^{-}_{k\ell}\in\mathbb{R}_{+}.

The updates in the section labeled “Initial” provide the scalar equivalent model for the true system (18). In these updates, Ξℓ\Xi_{\ell} represents the limit of the vectors 𝝃ℓ{\bm{\xi}}_{\ell} in (19). That is,

Ξ1:=Str,Ξ2:=Smp+,Ξ3:=D.\Xi_{1}:=S_{\rm tr},\quad\Xi_{2}:=S_{\rm mp}^{+},\quad\Xi_{3}:=D. (49)

Due to the assumptions in Section 2, the components of 𝝃ℓ{\bm{\xi}}_{\ell} converge empirically as

limN{ξ,i}=PL(2)Ξ,\lim_{N\rightarrow\infty}\{\xi_{\ell,i}\}\stackrel{{\scriptstyle PL(2)}}{{=}}\Xi_{\ell}, (50)

Thus, Ξℓ\Xi_{\ell} represents the asymptotic distribution of the components of the vectors 𝝃ℓ{\bm{\xi}}_{\ell}.

The updates in sections labeled “Forward pass” and “Backward pass” in the SE equations in Algorithm 2 parallel those in Algorithm 1. The key quantities in these SE equations are the error variables,

𝐩k+:=𝐫k+𝐩0,𝐪k:=𝐫k𝐳0,\mathbf{p}_{k\ell}^{+}:=\mathbf{r}_{k\ell}^{+}-\mathbf{p}^{0}_{\ell},\quad\mathbf{q}_{k\ell}^{-}:=\mathbf{r}_{k\ell}^{-}-\mathbf{z}^{0}_{\ell},\quad

which represent the errors of the estimates to the inputs of the denoisers. We will also be interested in their transforms,

𝐪k+=𝐕T𝐩k,+1+,𝐩k=𝐕𝐪k.\mathbf{q}_{k\ell}^{+}=\mathbf{V}_{\ell}^{\text{\sf T}}\mathbf{p}_{k,\ell\!+\!1}^{+},\quad\mathbf{p}_{k\ell}^{-}=\mathbf{V}_{\ell}\mathbf{q}_{k\ell}^{-}.

The following theorem adapts the main result of (Pandit et al., 2019) to the iterates of Algorithms 1 and 2.

Theorem 2.

Consider the outputs of the ML-VAMP for GLM Learning Algorithm under the assumptions of Section 2. Assume the denoisers satisfy the continuity conditions in Assumption 1. Also, assume that the outputs of the SE satisfy

α¯k±(0,1),\overline{\alpha}_{k\ell}^{\pm}\in(0,1),

for all kk and \ell. Suppose Algo. 1 is initialized so that the following convergence holds

limN{𝐫0𝐳0}=PL(2)Q0\displaystyle\lim_{N\rightarrow\infty}\{\mathbf{r}^{-}_{0\ell}-\mathbf{z}_{\ell}^{0}\}\overset{PL(2)}{=}Q_{0\ell}^{-}

where (Q00,Q01,Q02)(Q_{00}^{-},Q_{01}^{-},Q_{02}^{-}) are independent zero-mean Gaussians, independent of {Ξ}\{\Xi_{\ell}\}. Then,

  1. (a)

    For any fixed iteration k0k\geq 0 in the forward direction and layer =1,,L1\ell=1,\ldots,L\!-\!1, we have that, almost surely,

    limN(γk,1+,γk)=(γ¯k,1+,γ¯k),and,\displaystyle\lim_{N\rightarrow\infty}(\gamma_{k,\ell\!-\!1}^{+},\gamma_{k\ell}^{-})=(\overline{\gamma}_{k,\ell\!-\!1}^{+},\overline{\gamma}_{k\ell}^{-}),\quad{\rm and}, (51)
    limN{(𝐳^k+,𝐳0,𝐩10,𝐫k,1+,𝐫)}\displaystyle\lim_{N\rightarrow\infty}\{(\widehat{\mathbf{z}}^{+}_{k\ell},\mathbf{z}^{0}_{\ell},\mathbf{p}_{\ell\!-\!1}^{0},\mathbf{r}^{+}_{k,\ell\!-\!1},\mathbf{r}^{-}_{\ell})\}
    =PL(2)(Z^k+,Z0,P10,Rk,1+,R)\displaystyle\overset{PL(2)}{=}(\widehat{Z}^{+}_{k\ell},Z^{0}_{\ell},P_{\ell\!-\!1}^{0},R^{+}_{k,\ell\!-\!1},R^{-}_{\ell}) (52)

where the variables on the right-hand sides of (51) and (52) are the outputs of the SE equations in Algorithm 2. Similar equations hold for =0\ell=0 with the appropriate variables removed.

  2. (b)

Similarly, in the reverse direction, for any fixed iteration k0k\geq 0 and layer =1,,L2\ell=1,\ldots,L-2, we have that, almost surely,

    limN(γk,1+,γk+1,)=(γ¯k,1+,γ¯k+1,),and\displaystyle\lim_{N\rightarrow\infty}(\gamma_{k,\ell\!-\!1}^{+},\gamma_{k\!+\!1,\ell}^{-})=(\overline{\gamma}_{k,\ell\!-\!1}^{+},\overline{\gamma}_{k\!+\!1,\ell}^{-}),\quad{\rm and} (53)
    limN{(𝐩^k+1,1+,𝐳0,𝐩10,𝐫k,1+,𝐫k+1,)}\displaystyle\lim_{N\rightarrow\infty}\{(\widehat{\mathbf{p}}^{+}_{k\!+\!1,\ell\!-\!1},\mathbf{z}^{0}_{\ell},\mathbf{p}_{\ell\!-\!1}^{0},\mathbf{r}^{+}_{k,\ell\!-\!1},\mathbf{r}^{-}_{k\!+\!1,\ell})\}
    =PL(2)(P^k+1,1+,Z0,P10,Rk,1+,Rk+1,).\displaystyle\quad\overset{PL(2)}{=}(\widehat{P}^{+}_{k\!+\!1,\ell\!-\!1},Z^{0}_{\ell},P_{\ell\!-\!1}^{0},R^{+}_{k,\ell\!-\!1},R^{-}_{k\!+\!1,\ell}). (54)

Furthermore, (P10,Pk1+)(P_{\ell\!-\!1}^{0},P_{k\ell\!-\!1}^{+}) and QkQ_{k\ell}^{-} are independent.

Proof.

This is a direct application of Theorem 3 from (Pandit et al., 2019) to the iterations of Algorithm 1. The convergence result in (Pandit et al., 2019) requires the uniform Lipschitz continuity of the functions 𝐠ℓ±()\mathbf{g}_{\ell}^{\pm}(\cdot). Assumption 1 provides the required uniform Lipschitz continuity assumption on 𝐠0+()\mathbf{g}_{0}^{+}(\cdot) and 𝐠3−()\mathbf{g}_{3}^{-}(\cdot). For the “middle” layers ℓ=1,2\ell=1,2, the denoisers 𝐠ℓ±()\mathbf{g}_{\ell}^{\pm}(\cdot) are linear, and the uniform Lipschitz continuity assumption holds since we have made the additional assumption that 𝐬tr\mathbf{s}_{\rm tr} and 𝐬mp\mathbf{s}_{\rm mp} are bounded almost surely. \Box

A key use of the theorem is to compute asymptotic empirical limits. Specifically, for a componentwise function ψ()\psi(\cdot), let ψ(𝒙)\langle{\psi(\bm{x})}\rangle denote the average 1Nn=1Nψ(xn)\tfrac{1}{N}\sum_{n=1}^{N}\psi(x_{n}). The above theorem then states that, for any componentwise pseudo-Lipschitz function ψ()\psi(\cdot) of order 2, as NN\rightarrow\infty we have the following two properties:

limNψ(𝐳^k+,𝐳0,𝐩10,𝐫k,1+,𝐫)\displaystyle\lim_{N\rightarrow\infty}\langle{\psi(\widehat{\mathbf{z}}^{+}_{k\ell},\mathbf{z}^{0}_{\ell},\mathbf{p}_{\ell\!-\!1}^{0},\mathbf{r}^{+}_{k,\ell\!-\!1},\mathbf{r}^{-}_{\ell})}\rangle
=𝔼ψ(Z^k+,Z0,P10,Rk,1+,R)\displaystyle=\mathbb{E}\,\psi(\widehat{Z}^{+}_{k\ell},Z^{0}_{\ell},P_{\ell\!-\!1}^{0},R^{+}_{k,\ell\!-\!1},R^{-}_{\ell})
limNψ(𝐩^k+1,1+,𝐳0,𝐩10,𝐫k,1+,𝐫k+1,)\displaystyle\lim_{N\rightarrow\infty}\langle{\psi(\widehat{\mathbf{p}}^{+}_{k\!+\!1,\ell\!-\!1},\mathbf{z}^{0}_{\ell},\mathbf{p}_{\ell\!-\!1}^{0},\mathbf{r}^{+}_{k,\ell\!-\!1},\mathbf{r}^{-}_{k\!+\!1,\ell)}}\rangle
=𝔼ψ(P^k+1,1+,Z0,P10,Rk,1+,Rk+1,).\displaystyle=\mathbb{E}\,\psi(\widehat{P}^{+}_{k\!+\!1,\ell\!-\!1},Z^{0}_{\ell},P_{\ell\!-\!1}^{0},R^{+}_{k,\ell\!-\!1},R^{-}_{k\!+\!1,\ell}).

That is, we can compute the limiting empirical average over components as the expected value of the corresponding random-variable limit. This convergence is key to proving Theorem 1.

Appendix D Empirical Convergence of Fixed Points

A consequence of Assumption 2 is that we can take the limit kk\rightarrow\infty of the random variables in the SE algorithm. Specifically, let 𝒙k=𝒙k(N)\bm{x}_{k}=\bm{x}_{k}(N) be any set of dd outputs from the ML-VAMP for GLM Learning Algorithm under the assumptions of Theorem 2. Under Assumption 2, for each NN, there exists a vector

𝒙(N)=limk𝒙k(N),\bm{x}(N)=\lim_{k\rightarrow\infty}\bm{x}_{k}(N), (55)

representing the limit over kk. For each kk, Theorem 2 shows there also exists a random vector limit,

limN{𝒙k,i(N)}=PL(2)Xk,\lim_{N\rightarrow\infty}\{\bm{x}_{k,i}(N)\}\stackrel{{\scriptstyle PL(2)}}{{=}}X_{k}, (56)

representing the limit over NN. The following proposition shows that we can take the limits of the random variables XkX_{k}.

Proposition 1.

Consider the outputs of the ML-VAMP for GLM Learning Algorithm under the assumptions of Theorem 2 and Assumption 2. Let 𝐱k=𝐱k(N)\bm{x}_{k}=\bm{x}_{k}(N) be any set of dd outputs from the algorithm and let 𝐱(N)\bm{x}(N) be its limit from (55) and let XkX_{k} be the random variable limit (56). Then, there exists a random variable XdX\in{\mathbb{R}}^{d} such that, for any pseudo-Lipschitz continuous f:df:{\mathbb{R}}^{d}\rightarrow{\mathbb{R}},

limk𝔼f(Xk)=𝔼f(X)=limN1Ni=1Nf(xi(N)).\lim_{k\rightarrow\infty}\mathbb{E}f(X_{k})=\mathbb{E}f(X)=\lim_{N\rightarrow\infty}\frac{1}{N}\sum_{i=1}^{N}f(x_{i}(N)). (57)

In addition, the SE parameter limits α¯k±\overline{\alpha}_{k\ell}^{\pm} and γ¯k±\overline{\gamma}_{k\ell}^{\pm} converge to limits,

limkα¯k±=α¯±,limkγ¯k±=γ¯±.\lim_{k\rightarrow\infty}\overline{\alpha}_{k\ell}^{\pm}=\overline{\alpha}_{\ell}^{\pm},\quad\lim_{k\rightarrow\infty}\overline{\gamma}_{k\ell}^{\pm}=\overline{\gamma}_{\ell}^{\pm}. (58)

The proposition shows that, under the convergence assumption, Assumption 2, we can take the limits as kk\rightarrow\infty of the random variables from the SE. To prove the proposition we first need the following simple lemma.

Lemma 1.

If αN\alpha_{N} and βk\beta_{k}\in{\mathbb{R}} are sequences such that

limklimN|αNβk|=0,\lim_{k\rightarrow\infty}\lim_{N\rightarrow\infty}|\alpha_{N}-\beta_{k}|=0, (59)

then, there exists a constant CC such that,

limNαN=limkβk=C.\lim_{N\rightarrow\infty}\alpha_{N}=\lim_{k\rightarrow\infty}\beta_{k}=C. (60)

In particular, the two limits in (60) exist.

Proof.

For any ϵ>0\epsilon>0, the limit (59) implies that there exists a kϵk_{\epsilon} (with kϵk_{\epsilon}\uparrow\infty as ϵ0\epsilon\downarrow 0) such that for all k>kϵk>k_{\epsilon},

limN|αNβk|<ϵ,\lim_{N\rightarrow\infty}|\alpha_{N}-\beta_{k}|<\epsilon,

from which we can conclude,

lim infNαN>βkϵ\liminf_{N\rightarrow\infty}\alpha_{N}>\beta_{k}-\epsilon

for all k>kϵk>k_{\epsilon}. Therefore,

lim infNαNsupkkϵβkϵ.\liminf_{N\rightarrow\infty}\alpha_{N}\geq\sup_{k\geq k_{\epsilon}}\beta_{k}-\epsilon.

Since this is true for all ϵ>0\epsilon>0, it follows that

lim infNαNlim supkβk.\liminf_{N\rightarrow\infty}\alpha_{N}\geq\limsup_{k\rightarrow\infty}\beta_{k}. (61)

Similarly, lim supNαNinfk>kϵβk+ϵ\limsup_{N\rightarrow\infty}\alpha_{N}\leq\inf_{k>k_{\epsilon}}\beta_{k}+\epsilon, whereby

lim supNαNlim infkβk.\limsup_{N\rightarrow\infty}\alpha_{N}\leq\liminf_{k\rightarrow\infty}\beta_{k}. (62)

Equations (61) and (62) together show that the limits in (60) exist and are equal. \Box

Proof of Proposition 1

Let f:df:{\mathbb{R}}^{d}\rightarrow{\mathbb{R}} be any pseudo-Lipschitz function of order 2, and define,

αN=1Ni=1Nf(xi(N)),βk=𝔼f(Xk).\alpha_{N}=\frac{1}{N}\sum_{i=1}^{N}f(x_{i}(N)),\quad\beta_{k}=\mathbb{E}f(X_{k}). (63)

Their difference can be written as,

αNβk=AN,k+BN,k,\alpha_{N}-\beta_{k}=A_{N,k}+B_{N,k}, (64)

where

AN,k\displaystyle A_{N,k} :=1Ni=1Nf(𝒙i(N))f(𝒙k,i(N)),\displaystyle:=\frac{1}{N}\sum_{i=1}^{N}f(\bm{x}_{i}(N))-f(\bm{x}_{k,i}(N)), (65)
BN,k\displaystyle B_{N,k} :=1Ni=1Nf(𝒙k,i(N))𝔼f(Xk).\displaystyle:=\frac{1}{N}\sum_{i=1}^{N}f(\bm{x}_{k,i}(N))-\mathbb{E}f(X_{k}). (66)

Since {xk,i(N)}\{x_{k,i}(N)\} converges PL(2)PL(2) to XkX_{k}, we have,

limNBN,k=0.\lim_{N\rightarrow\infty}B_{N,k}=0. (67)

For the term AN,kA_{N,k},

|AN,k|(a)limN1Ni=1N|f(𝒙i(N))f(𝒙k,i(N))|\displaystyle|A_{N,k}|\stackrel{{\scriptstyle\rm(a)}}{{\leq}}\lim_{N\rightarrow\infty}\frac{1}{N}\sum_{i=1}^{N}\left|f(\bm{x}_{i}(N))-f(\bm{x}_{k,i}(N))\right|
(b)limNCNi=1Naki(N)(1+aki(N))\displaystyle\stackrel{{\scriptstyle\rm(b)}}{{\leq}}\lim_{N\rightarrow\infty}\frac{C}{N}\sum_{i=1}^{N}a_{ki}(N)(1+a_{ki}(N))
(c)≤ClimN→∞(1N∑i=1Naki2(N)+1N∑i=1Naki2(N))\displaystyle\stackrel{{\scriptstyle\rm(c)}}{{\leq}}C\lim_{N\rightarrow\infty}\left(\sqrt{\frac{1}{N}\sum_{i=1}^{N}a^{2}_{ki}(N)}+\frac{1}{N}\sum_{i=1}^{N}a_{ki}^{2}(N)\right)
=ClimNϵk(N)(1+ϵk(N)),\displaystyle=C\lim_{N\rightarrow\infty}\epsilon_{k}(N)(1+\epsilon_{k}(N)), (68)

where (a) follows from applying the triangle inequality to the definition of AN,kA_{N,k} in (65); (b) follows from the definition of pseudo-Lipschitz continuity in Definition 1, where C>0C>0 is the pseudo-Lipschitz constant and

aki(N):=𝒙k,i(N)𝒙i(N)2,a_{ki}(N):=\|\bm{x}_{k,i}(N)-\bm{x}_{i}(N)\|_{2},

and (c) follows from the RMS-AM inequality:

(1Ni=1Naki(N))21Ni=1Naki2(N)=:ϵk2(N).\displaystyle\left(\frac{1}{N}\sum_{i=1}^{N}a_{ki}(N)\right)^{2}\leq\frac{1}{N}\sum_{i=1}^{N}a_{ki}^{2}(N)=:\epsilon_{k}^{2}(N).

By (29), we have that,

limklimNϵk(N)=0.\lim_{k\rightarrow\infty}\lim_{N\rightarrow\infty}\epsilon_{k}(N)=0.

Hence, from (68), it follows that,

limklimNAN,k=0.\lim_{k\rightarrow\infty}\lim_{N\rightarrow\infty}A_{N,k}=0. (69)

Substituting (67) and (69) into (64) shows that αN\alpha_{N} and βk\beta_{k} satisfy (59). Therefore, applying Lemma 1 we have that for any pseudo-Lipschitz function f()f(\cdot), there exists a limit Φ(f)\Phi(f) such that,

limN1Ni=1Nf(xi(N))=limk𝔼f(Xk)=Φ(f).\lim_{N\rightarrow\infty}\frac{1}{N}\sum_{i=1}^{N}f(x_{i}(N))=\lim_{k\rightarrow\infty}\mathbb{E}f(X_{k})=\Phi(f). (70)

In particular, the two limits in (70) exist. When restricted to continuous, bounded functions equipped with the f\|f\|_{\infty} norm, it is easily verified that Φ(f)\Phi(f) is a positive, linear, bounded function of ff, with Φ(1)=1\Phi(1)=1. Therefore, by the Riesz representation theorem, there exists a random variable XX such that Φ(f)=𝔼f(X)\Phi(f)=\mathbb{E}f(X). This fact, in combination with (70), shows (57).

It remains to prove the parameter limits in (58). We prove the result for the parameter α¯k+\overline{\alpha}_{k\ell}^{+}; the proof for the other parameters is similar. Using Stein's lemma, it is shown in (Pandit et al., 2019) that

α¯kℓ+=𝔼(Z^kℓQkℓ−)𝔼(Qkℓ−)2.\overline{\alpha}_{k\ell}^{+}=\frac{\mathbb{E}(\widehat{Z}_{k\ell}Q^{-}_{k\ell})}{\mathbb{E}(Q_{k\ell}^{-})^{2}}. (71)

Since the numerator and denominator of (71) are expectations of PL(2)PL(2) functions, (57) implies that we can pass to the limit kk\rightarrow\infty:

α¯+\displaystyle\overline{\alpha}_{\ell}^{+} :=limkα¯k+=limk𝔼(Z^kQk)𝔼(Qk)2\displaystyle:=\lim_{k\rightarrow\infty}\overline{\alpha}_{k\ell}^{+}=\lim_{k\rightarrow\infty}\frac{\mathbb{E}(\widehat{Z}_{k\ell}Q^{-}_{k\ell})}{\mathbb{E}(Q_{k\ell}^{-})^{2}}
=𝔼(Z^Q)𝔼(Q)2,\displaystyle=\frac{\mathbb{E}(\widehat{Z}_{\ell}Q^{-}_{\ell})}{\mathbb{E}(Q_{\ell}^{-})^{2}}, (72)

where Z^\widehat{Z}_{\ell} and QQ^{-}_{\ell} are the limits of Z^k\widehat{Z}_{k\ell} and QkQ^{-}_{k\ell}. This completes the proof. ∎
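
The identity (71) is an instance of Stein's lemma: for Q⁻∼𝒩(0,τ) independent of Z⁰, 𝔼[g(Z⁰+Q⁻)Q⁻]=τ𝔼[g′(Z⁰+Q⁻)], so the correlation form above agrees with the divergence form of α̅ in line 20 of Algorithm 2. The Monte Carlo sketch below (our own illustration; the soft-threshold denoiser is merely an example) checks the two forms numerically.

```python
import numpy as np

rng = np.random.default_rng(2)

def g(r, thresh=0.5):
    # soft-threshold denoiser (an arbitrary Lipschitz example)
    return np.sign(r) * np.maximum(np.abs(r) - thresh, 0.0)

def g_prime(r, thresh=0.5):
    return (np.abs(r) > thresh).astype(float)

n, tau = 10**6, 0.8
Z0 = rng.choice([-1.0, 1.0], size=n)        # any distribution independent of Q
Q = np.sqrt(tau) * rng.standard_normal(n)
Zhat = g(Z0 + Q)

alpha_corr = np.mean(Zhat * Q) / np.mean(Q**2)   # correlation form, cf. (71)
alpha_div = np.mean(g_prime(Z0 + Q))             # divergence form, cf. line 20
print(alpha_corr, alpha_div)                     # agree up to Monte Carlo error
```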

Appendix E Proof of Theorem 1

From Assumption 2, we know that, for every NN, every group of vectors 𝒙k\bm{x}_{k} converges to a limit 𝒙:=limk𝒙k\bm{x}:=\lim_{k\rightarrow\infty}\bm{x}_{k}. The parameters γk±\gamma_{k\ell}^{\pm} also converge to limits γ¯±:=limkγk±\overline{\gamma}_{\ell}^{\pm}:=\lim_{k\rightarrow\infty}\gamma_{k\ell}^{\pm} for all \ell. By the continuity assumptions on the functions 𝐠±()\mathbf{g}_{\ell}^{\pm}(\cdot), the limits 𝒙\bm{x} and γ¯±\overline{\gamma}_{\ell}^{\pm} are fixed points of the algorithm.

A proof similar to that in (Pandit et al., 2019) shows that the fixed points 𝐳^\widehat{\mathbf{z}}_{\ell} and 𝐩^\widehat{\mathbf{p}}_{\ell} satisfy the KKT condition of the constrained optimization (22). This proves part (a).

The estimate 𝒘^\widehat{\bm{w}} is the limit,

𝒘^=𝐳^0=limk𝐳^k0.\widehat{\bm{w}}=\widehat{\mathbf{z}}_{0}=\lim_{k\rightarrow\infty}\widehat{\mathbf{z}}_{k0}.

Also, the true parameter is 𝐳00=𝒘0\mathbf{z}_{0}^{0}=\bm{w}^{0}. By Proposition 1, we have that the PL(2)PL(2) limits of these variables are

limN{(𝒘^,𝒘0)}=PL(2)(W^,W0):=(Z^0,Z00).\displaystyle\lim_{N\rightarrow\infty}\{(\widehat{\bm{w}},\bm{w}_{0})\}\stackrel{{\scriptstyle PL(2)}}{{=}}(\widehat{W},W_{0}):=(\widehat{Z}_{0},Z_{0}^{0}).

From line 15 of the SE Algorithm 2, we have

W^=Z^0=g0+(R0,γ¯0)=proxfin/γ¯0(W0+Q0).\displaystyle\widehat{W}=\widehat{Z}_{0}=g_{0}^{+}(R_{0}^{-},\overline{\gamma}_{0}^{-})=\operatorname{prox}_{f_{\rm in}/\overline{\gamma}_{0}^{-}}(W^{0}+Q_{0}^{-}).

This proves part (b).

To prove part (c), we use the limit

limN{p0,n0,p^0,n}=PL(2)(P00,P^0).\displaystyle\lim_{N\rightarrow\infty}\{p^{0}_{0,n},\widehat{p}_{0,n}\}\stackrel{{\scriptstyle PL(2)}}{{=}}(P_{0}^{0},\widehat{P}_{0}). (73)

Since the fixed points are critical points of the constrained optimization (22), 𝐩^0=𝐕0𝒘^\widehat{\mathbf{p}}_{0}=\mathbf{V}_{0}\widehat{\bm{w}}. We also have 𝐩00=𝐕0𝒘0\mathbf{p}^{0}_{0}=\mathbf{V}_{0}\bm{w}^{0}. Therefore,

[zts(N)z^ts(N)]:=𝐮TDiag(𝐬ts)𝐕0[𝒘0𝒘^]\displaystyle\left[z_{\rm ts}^{(N)}\ \widehat{z}_{\rm ts}^{(N)}\right]:=\mathbf{u}^{\text{\sf T}}\operatorname{Diag}(\mathbf{s}_{\rm ts})\mathbf{V}_{0}[\bm{w}^{0}\ \widehat{\bm{w}}]
=𝐮TDiag(𝐬ts)[𝐩00𝐩^0].\displaystyle=\mathbf{u}^{\text{\sf T}}\operatorname{Diag}(\mathbf{s}_{\rm ts})[\mathbf{p}^{0}_{0}\ \widehat{\mathbf{p}}_{0}]. (74)

Here, (N)(N) in the superscript denotes the dependence on NN. Since 𝐮𝒩(0,1p𝐈)\mathbf{u}\sim{\mathcal{N}}(0,\tfrac{1}{p}\mathbf{I}), [zts(N)z^ts(N)][z_{\rm ts}^{(N)}\ \widehat{z}_{\rm ts}^{(N)}] is a zero-mean bivariate Gaussian with covariance matrix

𝐌(N)=1pn=1p[sts,n2p0,n0p0,n0sts,n2p0,n0p^0,nsts,n2p0,n0p^0,nsts,n2p^0,np^0,n]\mathbf{M}^{(N)}=\tfrac{1}{p}\sum_{n=1}^{p}\begin{bmatrix}s_{{\rm ts},n}^{2}p^{0}_{0,n}p^{0}_{0,n}&s_{{\rm ts},n}^{2}p^{0}_{0,n}\widehat{p}_{0,n}\\ s_{{\rm ts},n}^{2}p^{0}_{0,n}\widehat{p}_{0,n}&s_{{\rm ts},n}^{2}\widehat{p}_{0,n}\widehat{p}_{0,n}\end{bmatrix}

The empirical convergence (73) yields the following limit,

limN𝐌(N)=𝐌:=𝔼Sts2[P00P00P00P^0P00P^0P^0P^0].\displaystyle\lim_{N\rightarrow\infty}\mathbf{M}^{(N)}=\mathbf{M}:=\mathbb{E}\,S_{{\rm ts}}^{2}\begin{bmatrix}P^{0}_{0}P^{0}_{0}&P^{0}_{0}\widehat{P}_{0}\\ P^{0}_{0}\widehat{P}_{0}&\widehat{P}_{0}\widehat{P}_{0}\end{bmatrix}. (75)

It suffices to show that the distribution of [zts(N)z^ts(N)][z_{\rm ts}^{(N)}\,\widehat{z}_{\rm ts}^{(N)}] converges to the distribution of [ZtsZ^ts][Z_{\rm ts}\,\widehat{Z}_{\rm ts}] in the Wasserstein-2 metric as N.N\rightarrow\infty. (See the discussion in Appendix A on the equivalence of convergence in Wasserstein-2 metric and PL(2) convergence.)

Now, the Wasserstein-2 distance between two probability measures ν1\nu_{1} and ν2\nu_{2} is defined as

W2(ν1,ν2)=(infγΓ𝔼γX1X22)1/2,W_{2}(\nu_{1},\nu_{2})=\left(\inf_{\gamma\in\Gamma}\mathbb{E}_{\gamma}\left\|X_{1}-X_{2}\right\|^{2}\right)^{1/2}, (76)

where Γ\Gamma is the set of probability distributions on the product space with marginals consistent with ν1\nu_{1} and ν2\nu_{2}. For Gaussian measures ν1=𝒩(𝟎,Σ1)\nu_{1}=\mathcal{N}(\mathbf{0},\Sigma_{1}) and ν2=𝒩(𝟎,Σ2)\nu_{2}=\mathcal{N}(\mathbf{0},\Sigma_{2}) we have (Givens et al., 1984)

W22(ν1,ν2)=tr(Σ12(Σ11/2Σ2Σ11/2)1/2+Σ2).W^{2}_{2}(\nu_{1},\nu_{2})={\rm tr}(\Sigma_{1}-2(\Sigma_{1}^{1/2}\Sigma_{2}\Sigma_{1}^{1/2})^{1/2}+\Sigma_{2}).

Therefore, for Gaussian distributions ν1(N)=𝒩(𝟎,𝐌(N))\nu_{1}^{(N)}=\mathcal{N}(\mathbf{0},\mathbf{M}^{(N)}), and ν2=𝒩(𝟎,𝐌)\nu_{2}=\mathcal{N}(\mathbf{0},\mathbf{M}), the convergence (75) implies W2(ν1(N),ν2)0,W_{2}(\nu_{1}^{(N)},\nu_{2})\rightarrow 0, i.e., convergence in Wasserstein-2 distance. Hence,

(zts(N),z^ts(N))W2(Zts,Z^ts)𝒩(0,𝐌),(z_{{\rm ts}}^{(N)},\widehat{z}_{{\rm ts}}^{(N)})\stackrel{{\scriptstyle W_{2}}}{{\longrightarrow}}(Z_{\rm ts},\widehat{Z}_{\rm ts})\sim{\mathcal{N}}(0,\mathbf{M}),

where 𝐌\mathbf{M} is the covariance matrix in (75). Hence the convergence holds in the PL(2) sense (see discussion in Appendix A on the equivalence of convergence in W2W_{2} and PL(2) convergence).

Hence the asymptotic generalization error (17) is

ts:=limN𝔼fts(y^ts,yts)\displaystyle{\mathcal{E}}_{\rm ts}:=\lim_{N\rightarrow\infty}\mathbb{E}f_{\rm ts}(\widehat{y}_{\rm ts},y_{\rm ts})
=(a)limN𝔼fts(ϕout(zts(N),D),ϕ(z^ts(N)))\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}\lim_{N\rightarrow\infty}\mathbb{E}f_{\rm ts}(\phi_{\rm out}(z_{{\rm ts}}^{(N)},D),\phi(\widehat{z}_{{\rm ts}}^{(N)}))
=(b)𝔼fts(ϕout(Zts,D),ϕ(Z^ts)),\displaystyle\stackrel{{\scriptstyle(b)}}{{=}}\mathbb{E}f_{\rm ts}(\phi_{\rm out}(Z_{{\rm ts}},D),\phi(\widehat{Z}_{{\rm ts}})), (77)

where (a) follows from (3), and step (b) follows from the continuity assumption in Assumption 1(b) along with the definition of PL(2) convergence in Def. 3. This proves part (c).
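
The limit (77) is easy to evaluate numerically once 𝐌 is known. The sketch below (our own illustration, specialized to the linear output ϕ_out(z,d)=z+d, identity link ϕ(z)=z and squared test loss, so that the answer can be checked against m11+m22−2m12+σd²) draws (Z_ts, Ẑ_ts)∼𝒩(0,𝐌) and averages f_ts.

```python
import numpy as np

rng = np.random.default_rng(3)

def gen_error_mc(M, sigma_d, n=10**6):
    # (Z_ts, Zhat_ts) ~ N(0, M), D ~ N(0, sigma_d^2); squared loss, linear output/link.
    Z = rng.multivariate_normal(np.zeros(2), M, size=n)
    D = sigma_d * rng.standard_normal(n)
    y_ts = Z[:, 0] + D          # phi_out(z, d) = z + d
    yhat_ts = Z[:, 1]           # phi(zhat) = zhat
    return np.mean((y_ts - yhat_ts)**2)

M = np.array([[1.0, 0.6],
              [0.6, 0.9]])
sigma_d = 0.3
closed_form = M[0, 0] + M[1, 1] - 2 * M[0, 1] + sigma_d**2
print(gen_error_mc(M, sigma_d), closed_form)    # agree up to Monte Carlo error
```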

Appendix F Formula for 𝐌\mathbf{M}

For the special cases in the next appendix, it is useful to derive expressions for the entries of the covariance matrix 𝐌\mathbf{M} in (75). For the term m11m_{11},

m11=𝔼Sts2(P00)2=𝔼Sts2𝔼(P00)2=𝔼Sts2k11,\displaystyle m_{11}=\mathbb{E}\,S_{{\rm ts}}^{2}(P_{0}^{0})^{2}=\mathbb{E}\,S_{{\rm ts}}^{2}\mathbb{E}(P_{0}^{0})^{2}=\mathbb{E}\,S_{{\rm ts}}^{2}\cdot k_{11}, (78)

where we have used the fact that P00⊥⊥(Sts,Str)P_{0}^{0}\perp\!\!\!\perp(S_{\rm ts},S_{\rm tr}). Next, m12=𝔼Sts2P00P^0m_{12}=\mathbb{E}\,S_{{\rm ts}}^{2}\ P_{0}^{0}\widehat{P}_{0}, where

P^0\displaystyle\widehat{P}_{0} =g1−(P00+P0+,Z10+Q1−,γ¯0+,γ¯1−,Str)\displaystyle=g_{1}^{-}(P_{0}^{0}+P_{0}^{+},Z_{1}^{0}+Q_{1}^{-},\overline{\gamma}_{0}^{+},\overline{\gamma}_{1}^{-},S_{\rm tr})
=γ¯0+P0++Strγ¯1Q1γ¯0++Str2γ¯1+P00,\displaystyle=\frac{\overline{\gamma}^{+}_{0}P_{0}^{+}+S_{\rm tr}\overline{\gamma}^{-}_{1}Q_{1}^{-}}{\overline{\gamma}^{+}_{0}+S^{2}_{\rm tr}\overline{\gamma}^{-}_{1}}+P^{0}_{0}, (79)

where (P00,P0+,Q1−)(P_{0}^{0},P_{0}^{+},Q_{1}^{-}) are independent of (Str,Sts)(S_{{\rm tr}},S_{\rm ts}). Hence,

m12=𝔼Sts2𝔼(P00)2+𝔼Sts2γ¯0+γ¯0++Str2γ¯1𝔼[P00P0+]\displaystyle m_{12}=\mathbb{E}\,S_{{\rm ts}}^{2}\cdot\mathbb{E}(P_{0}^{0})^{2}+\mathbb{E}\frac{S_{{\rm ts}}^{2}\overline{\gamma}_{0}^{+}}{\overline{\gamma}_{0}^{+}+S_{\rm tr}^{2}\overline{\gamma}_{1}^{-}}\mathbb{E}[P_{0}^{0}P_{0}^{+}]
=m11+𝔼(Sts2γ¯0+γ¯0++Str2γ¯1)k12,\displaystyle=m_{11}+\mathbb{E}\left(\frac{S_{\rm ts}^{2}\overline{\gamma}_{0}^{+}}{\overline{\gamma}_{0}^{+}+S_{\rm tr}^{2}\overline{\gamma}_{1}^{-}}\right)\cdot k_{12}, (80)

since 𝔼[P00Q1]=0\mathbb{E}[P^{0}_{0}Q_{1}^{-}]=0 and 𝐊0+\mathbf{K}_{0}^{+} is the covariance matrix of (P00,P0+)(P_{0}^{0},P_{0}^{+}) from line 23.

Finally for m22m_{22} we have,

m22=𝔼Sts2P^0P^0\displaystyle m_{22}=\mathbb{E}\,S_{{\rm ts}}^{2}\widehat{P}_{0}\widehat{P}_{0}
=𝔼(Stsγ¯0+γ¯0++Str2γ¯1)2𝔼(P0+)2\displaystyle=\mathbb{E}\left(\frac{S_{{\rm ts}}\overline{\gamma}_{0}^{+}}{\overline{\gamma}_{0}^{+}+S_{\rm tr}^{2}\overline{\gamma}_{1}^{-}}\right)^{2}\mathbb{E}(P_{0}^{+})^{2}
+𝔼(StsStrγ¯1γ¯0++Str2γ¯1)2𝔼(Q1)2\displaystyle+\mathbb{E}\left(\frac{S_{{\rm ts}}S_{{\rm tr}}\overline{\gamma}_{1}^{-}}{\overline{\gamma}_{0}^{+}+S_{\rm tr}^{2}\overline{\gamma}_{1}^{-}}\right)^{2}\mathbb{E}(Q_{1}^{-})^{2}
+𝔼Sts2𝔼(P00)2+2𝔼γ¯0+Sts2γ¯0++γ¯1Str2𝔼P00P0+\displaystyle+\mathbb{E}\,S_{\rm ts}^{2}\mathbb{E}(P_{0}^{0})^{2}+2\mathbb{E}\frac{\overline{\gamma}_{0}^{+}S_{\rm ts}^{2}}{\overline{\gamma}_{0}^{+}+\overline{\gamma}_{1}^{-}S_{\rm tr}^{2}}\cdot\mathbb{E}\,P_{0}^{0}P_{0}^{+}
=k22𝔼(Stsγ¯0+γ¯0++Str2γ¯1)2\displaystyle=k_{22}\mathbb{E}\left(\frac{S_{{\rm ts}}\overline{\gamma}_{0}^{+}}{\overline{\gamma}_{0}^{+}+S_{\rm tr}^{2}\overline{\gamma}_{1}^{-}}\right)^{2}
+τ1𝔼(StsStrγ¯1γ¯0++Str2γ¯1)2m11+2m12.\displaystyle+\tau_{1}^{-}\mathbb{E}\left(\frac{S_{{\rm ts}}S_{{\rm tr}}\overline{\gamma}_{1}^{-}}{\overline{\gamma}_{0}^{+}+S_{\rm tr}^{2}\overline{\gamma}_{1}^{-}}\right)^{2}-m_{11}+2m_{12}. (81)
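
The three entries above depend only on the scalars k11, k12, k22, τ1−, γ̄0+, γ̄1− and on expectations over the joint law of (S_tr, S_ts). The helper below (a sketch of ours; the samples of (S_tr, S_ts) are left as inputs) evaluates (78), (80) and (81) by Monte Carlo over those samples.

```python
import numpy as np

def M_entries(s_tr, s_ts, k11, k12, k22, tau1m, g0p, g1m):
    """Evaluate (78), (80), (81) given samples (s_tr, s_ts) of (S_tr, S_ts)."""
    w = g0p / (g0p + s_tr**2 * g1m)           # gamma_0^+ / (gamma_0^+ + S_tr^2 gamma_1^-)
    m11 = np.mean(s_ts**2) * k11                                   # (78)
    m12 = m11 + np.mean(s_ts**2 * w) * k12                         # (80)
    m22 = (k22 * np.mean((s_ts * w)**2)
           + tau1m * np.mean((s_ts * s_tr * g1m / (g0p + s_tr**2 * g1m))**2)
           - m11 + 2 * m12)                                        # (81)
    return m11, m12, m22

# Example: no train/test mismatch, constant singular values S_tr = S_ts = 1.
s = np.ones(10**5)
print(M_entries(s, s, k11=1.0, k12=-1.0, k22=1.0, tau1m=0.5, g0p=0.2, g1m=1.0))
```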

Appendix G Special Cases

G.1 Linear Output with Square Error

In this section we examine a few special cases of the GLM problem (2). We first consider a linear output with additive Gaussian noise and a squared error training and test loss. Specifically, consider the model,

𝒚=𝐗𝒘0+𝐝\bm{y}=\mathbf{X}\bm{w}^{0}+\mathbf{d} (82)

We consider estimates of 𝒘0\bm{w}^{0} such that:

𝒘^=argmin𝒘12‖𝒚−𝐗𝒘‖2+λ2β‖𝒘‖2\widehat{\bm{w}}=\operatorname*{argmin}_{\bm{w}}\ \tfrac{1}{2}\left\|\bm{y}-\mathbf{X}\bm{w}\right\|^{2}+\tfrac{\lambda}{2\beta}\left\|\bm{w}\right\|^{2} (83)

The factor β\beta is added above since the two terms scale with a ratio of β\beta; it does not change the analysis. Consider the ML-VAMP GLM learning algorithm applied to this problem. The following corollary follows from the main result in Theorem 1.

Corollary 1 (Squared error).

For linear regression, i.e., ϕ(t)=t,\phi(t)=t, ϕout(t,d)=t+d,\phi_{\rm out}(t,d)=t+d, fts(y,y^)=(ytsy^ts)2f_{\rm ts}(y,\widehat{y})=(y_{\rm ts}-\widehat{y}_{\rm ts})^{2}, Fout(𝐩2)=1N𝐲𝐩22F_{\rm out}(\mathbf{p}_{2})=\tfrac{1}{N}\left\|\bm{y}-\mathbf{p}_{2}\right\|^{2}, we have

ts𝖫𝖱=𝔼(γ¯0+Stsγ¯0++Str2γ¯1)2k22+𝔼(γ¯1StrStsγ¯0++Str2γ¯1)2τ1+σd2.\displaystyle\mathcal{E}_{\rm ts}^{\mathsf{LR}}\!=\!\mathbb{E}\left(\tfrac{\overline{\gamma}_{0}^{+}S_{\rm ts}}{\overline{\gamma}_{0}^{+}+S_{\rm tr}^{2}\overline{\gamma}_{1}^{-}}\right)^{2}k_{22}+\mathbb{E}\left(\tfrac{\overline{\gamma}_{1}^{-}S_{\rm tr}S_{\rm ts}}{\overline{\gamma}_{0}^{+}+S_{\rm tr}^{2}\overline{\gamma}_{1}^{-}}\right)^{2}\tau_{1}^{-}+\sigma_{d}^{2}.

The quantities k22k_{22}, τ1,γ¯0+,γ¯1\tau_{1}^{-},\overline{\gamma}_{0}^{+},\overline{\gamma}_{1}^{-} depend on the choice of regularizer λ\lambda and the covariance between features.

Proof.

This follows directly from the following observation:

ts𝖫𝖱\displaystyle\mathcal{E}^{\mathsf{LR}}_{\rm ts} =𝔼(Zts+D−Z^ts)2=𝔼(Zts−Z^ts)2+𝔼D2\displaystyle=\mathbb{E}(Z_{\rm ts}+D-\widehat{Z}_{\rm ts})^{2}=\mathbb{E}(Z_{\rm ts}-\widehat{Z}_{\rm ts})^{2}+\mathbb{E}\,D^{2}
=m11+m222m12+σd2.\displaystyle=m_{11}+m_{22}-2m_{12}+\sigma_{d}^{2}.

Substituting equation (81) proves the claim. \Box

G.2 Ridge Regression with i.i.d. Covariates

We next consider the special case in which the input features are i.i.d., i.e., (83) where the rows of 𝐗\mathbf{X} corresponding to the training data have i.i.d. Gaussian features with covariance 𝐏train=σtr2p𝐈\mathbf{P}_{\rm train}=\frac{\sigma_{\rm tr}^{2}}{p}\mathbf{I} and Str=σtrS_{\rm tr}=\sigma_{\rm tr}.

Although the solution to (83) exists in closed form, 𝒘^=(𝐗T𝐗+(λ/β)𝐈)−1𝐗T𝒚\widehat{\bm{w}}=(\mathbf{X}^{\text{\sf T}}\mathbf{X}+(\lambda/\beta)\mathbf{I})^{-1}\mathbf{X}^{\text{\sf T}}\bm{y}, we can study the effect of the regularization parameter λ\lambda on the generalization error ts\mathcal{E}_{\rm ts} as detailed in the result below.

Corollary 2.

Consider the ridge regression problem (83) with regularization parameter λ>0\lambda>0. For the squared loss i.e., fts(y,y^)=(yy^)2f_{\rm ts}(y,\widehat{y})=(y-\widehat{y})^{2}, i.i.d Gaussian features without train-test mismatch, i.e., Str=Sts=σtrS_{\rm tr}=S_{\rm ts}=\sigma_{\rm tr}, the generalization error ts𝖱𝖱\mathcal{E}_{{\rm ts}}^{\mathsf{RR}} is given by Corollary 1, with constants

k22=Var(W0),γ¯0+=λ/β,\displaystyle k_{22}={\rm Var}(W^{0}),\qquad\overline{\gamma}_{0}^{+}=\lambda/\beta,
γ¯1−={1G−λσtr2ββ<1λσtr2β(1G−λσtr2β)β−1G+λσtr2ββ>1\displaystyle\overline{\gamma}_{1}^{-}=\begin{cases}\tfrac{1}{G}-\tfrac{\lambda}{\sigma_{\rm tr}^{2}\beta}&\beta<1\\ \frac{\tfrac{\lambda}{\sigma_{\rm tr}^{2}\beta}(\tfrac{1}{G}-\tfrac{\lambda}{\sigma_{\rm tr}^{2}\beta})}{\tfrac{\beta-1}{G}+\tfrac{\lambda}{\sigma_{\rm tr}^{2}\beta}}&\beta>1\end{cases}

where G=Gmp(λσtr2β)G=G_{\rm{mp}}(-\frac{\lambda}{\sigma_{\rm tr}^{2}\beta}), with GmpG_{\rm{mp}} given in Appendix H, and τ1=𝔼(P1)2\tau_{1}^{-}=\mathbb{E}(P_{1}^{-})^{2} where P1P_{1}^{-} is given in equation (95) in the proof.

Proof of Corollary 2.

We are interested in identifying the following constants appearing in Corollary 1:

𝐊0+,τ1,γ¯0+,γ¯1.\displaystyle\mathbf{K}_{0}^{+},\tau_{1}^{-},\overline{\gamma}_{0}^{+},\overline{\gamma}_{1}^{-}.

These quantities are obtained as fixed points of the State Evolution Equations in Algo. 2. We explain below how to obtain expressions for these constants. Since these are fixed points we ignore the subscript kk corresponding to the iteration number in Algo. 2.

In the case of problem (83), the maps proxfin\operatorname{prox}_{f_{\rm in}} and proxfout\operatorname{prox}_{f_{\rm out}}, i.e., g0+g_{0}^{+} and g3g_{3}^{-} respectively, can be expressed as closed-form formulae. This leads to simplification of the SE equations as explained below.

We start by looking at the forward pass (finding quantities with superscript ’+’) of Algorithm 2 for different layers and then the backward pass (finding quantities with superscript ’-’) to get the parameters {𝐊+,τ,α¯±,γ¯±}\{\mathbf{K}_{\ell}^{+},\tau_{\ell}^{-},\overline{\alpha}_{\ell}^{\pm},\overline{\gamma}_{\ell}^{\pm}\} for =0,1,2\ell=0,1,2.

To begin with, notice that fin(w)=λ2βw2f_{\rm in}(w)=\frac{\lambda}{2\beta}w^{2}, and therefore the denoiser g0+()g_{0}^{+}(\cdot) in (44) is simply,

g0+(r0−,γ0−)=γ0−γ0−+λ/βr0−,and∂g0+∂r0−=γ0−γ0−+λ/β\displaystyle g_{0}^{+}(r_{0}^{-},\gamma_{0}^{-})=\tfrac{\gamma_{0}^{-}}{\gamma_{0}^{-}+\lambda/\beta}r_{0}^{-},\quad{\rm and}\quad\tfrac{\partial g_{0}^{+}}{\partial r_{0}^{-}}=\tfrac{\gamma_{0}^{-}}{\gamma_{0}^{-}+\lambda/\beta}

Substituting the random variable R0−R_{0}^{-} into the expression for the denoiser to obtain Z^0\widehat{Z}_{0}, we can now calculate α¯0+\overline{\alpha}_{0}^{+} using lines 20 and 22,

α¯0+=γ¯0−γ¯0−+λ/β,γ¯0+=λ/β.\displaystyle\overline{\alpha}_{0}^{+}=\tfrac{\overline{\gamma}_{0}^{-}}{\overline{\gamma}_{0}^{-}+\lambda/\beta},\qquad\overline{\gamma}_{0}^{+}=\lambda/\beta. (84)

Similarly, we have fout(p2)=12(p2y)2f_{\rm out}(p_{2})={\tfrac{1}{2}}(p_{2}-y)^{2}, whereby the output denoiser g3()g_{3}^{-}(\cdot) in the last layer for ridge regression is given by,

g3(r2+,γ2+,y)\displaystyle g_{3}^{-}(r_{2}^{+},\gamma_{2}^{+},y) =γ2+r2++yγ2++1.\displaystyle=\frac{\gamma_{2}^{+}r_{2}^{+}+y}{\gamma_{2}^{+}+1}. (85)

By substituting this denoiser in line 30 of the algorithm we get P^2−\widehat{P}_{2}^{-}, and thus, following lines 35–38 of the algorithm, we have

α¯2−=γ¯2+γ¯2++1,wherebyγ¯2−=1.\displaystyle\overline{\alpha}_{2}^{-}=\tfrac{\overline{\gamma}_{2}^{+}}{\overline{\gamma}_{2}^{+}+1},\qquad{\rm whereby}\quad\overline{\gamma}_{2}^{-}=1. (86)

Having identified these constants α¯0+,γ¯0+,α¯2,γ¯2\overline{\alpha}_{0}^{+},\overline{\gamma}_{0}^{+},\overline{\alpha}_{2}^{-},\overline{\gamma}_{2}^{-}, we will now sequentially identify the quantities

(α¯0+,γ¯0+)𝐊0+(α¯1+,γ¯1+)𝐊1+(α¯2+,γ¯2+)𝐊2+\displaystyle(\overline{\alpha}_{0}^{+},\overline{\gamma}_{0}^{+})\rightarrow\mathbf{K}_{0}^{+}\rightarrow(\overline{\alpha}_{1}^{+},\overline{\gamma}_{1}^{+})\rightarrow\mathbf{K}_{1}^{+}\rightarrow(\overline{\alpha}_{2}^{+},\overline{\gamma}_{2}^{+})\rightarrow\mathbf{K}_{2}^{+}

in the forward pass, and then the quantities

τ0(α¯0,γ¯0)τ1(α¯1,γ¯1)τ2(α¯2,γ¯2)\displaystyle\tau_{0}^{-}\leftarrow(\overline{\alpha}_{0}^{-},\overline{\gamma}_{0}^{-})\leftarrow\tau_{1}^{-}\leftarrow(\overline{\alpha}_{1}^{-},\overline{\gamma}_{1}^{-})\leftarrow\tau_{2}^{-}\leftarrow(\overline{\alpha}_{2}^{-},\overline{\gamma}_{2}^{-})

in the backward pass.

We also note that we have

α¯++α¯=1\displaystyle\overline{\alpha}_{\ell}^{+}+\overline{\alpha}_{\ell}^{-}=1 (87)

Forward Pass:

Observe that 𝐊0+=Cov(Z0,Q0+)\mathbf{K}_{0}^{+}=\mathrm{Cov}(Z_{0},Q_{0}^{+}). Now, from line 21, on simplification we get Q0+=−W0Q_{0}^{+}=-W^{0}, whereby

𝐊0+=var(W0)[1111].\displaystyle\mathbf{K}_{0}^{+}=\mathrm{var}(W^{0})\begin{bmatrix}1&-1\\ -1&1\end{bmatrix}. (88)

Notice that from line 23, the pair (P00,P0+)(P_{0}^{0},P_{0}^{+}) is jointly Gaussian with covariance matrix 𝐊0+\mathbf{K}_{0}^{+}. But the above equation means that P0+=P00P_{0}^{+}=-P^{0}_{0}, whereby R0+=0R_{0}^{+}=0 from line 17.

Now, the linear denoiser g1+()g_{1}^{+}(\cdot) is defined as in (47b). Note that since we are considering i.i.d. Gaussian features for this problem, the random variable StrS_{{\rm tr}} in this layer is a constant σtr\sigma_{{\rm tr}}. Therefore, similarly to layer =0\ell=0, evaluating lines 17–23 of the algorithm gives Q1+=−Z10Q_{1}^{+}=-Z_{1}^{0}, whereby

α¯1+=σtr2γ¯1−γ¯0++σtr2γ¯1−,γ¯1+=γ¯0+σtr2=λσtr2β,𝐊1+=σtr2𝐊0+.\displaystyle\overline{\alpha}_{1}^{+}=\tfrac{\sigma_{{\rm tr}}^{2}\overline{\gamma}_{1}^{-}}{\overline{\gamma}_{0}^{+}+\sigma_{\rm tr}^{2}\overline{\gamma}_{1}^{-}},\ \overline{\gamma}_{1}^{+}=\tfrac{\overline{\gamma}_{0}^{+}}{\sigma_{\rm tr}^{2}}=\tfrac{\lambda}{\sigma_{{\rm tr}}^{2}\beta},\ \mathbf{K}_{1}^{+}=\sigma_{{\rm tr}}^{2}\mathbf{K}_{0}^{+}. (89)

Observe that this means

P1+=P10.\displaystyle P_{1}^{+}=-P_{1}^{0}. (90)

Backward Pass:

Since Y=ϕout(P20,D)=P20+DY=\phi_{\rm out}(P_{2}^{0},D)=P_{2}^{0}+D, line 36 of algorithm on simplification yields P2=DP_{2}^{-}=D, whereby we can get τ2\tau_{2}^{-},

τ2=𝔼(P2)2=𝔼[D2]=σd2.\displaystyle\tau_{2}^{-}=\mathbb{E}(P_{2}^{-})^{2}=\mathbb{E}[D^{2}]=\sigma_{d}^{2}. (91)

Next, to calculate the terms (α¯1,γ¯1)(\overline{\alpha}_{1}^{-},\overline{\gamma}^{-}_{1}), we use the denoiser g2g_{2}^{-} defined in (47a) for line 33 of the algorithm to get P^1\widehat{P}_{1}:

P^1=γ¯1+R1++Smpγ¯2R2γ¯1++(Smp)2γ¯2=Smp(Smp+P10+Q2)γ¯1++(Smp)2,\displaystyle\widehat{P}_{1}=\tfrac{\overline{\gamma}_{1}^{+}R_{1}^{+}+S_{\rm{mp}}^{-}\overline{\gamma}_{2}^{-}R_{2}^{-}}{\overline{\gamma}_{1}^{+}+(S_{\rm{mp}}^{-})^{2}\overline{\gamma}_{2}^{-}}=\tfrac{S_{\rm{mp}}^{-}(S_{\rm{mp}}^{+}P_{1}^{0}+Q_{2}^{-})}{\overline{\gamma}_{1}^{+}+(S_{\rm{mp}}^{-})^{2}}, (92)

where we have used γ¯2=1,\overline{\gamma}_{2}^{-}=1, R1+=P10+P1+=0R_{1}^{+}=P_{1}^{0}+P_{1}^{+}=0 due to (90), and R2=Z20+Q2=Smp+P10+Q2R_{2}^{-}=Z_{2}^{0}+Q_{2}^{-}=S_{\rm{mp}}^{+}P_{1}^{0}+Q_{2}^{-} from lines 17, 32 and 4 respectively.

Then, we calculate α¯1\overline{\alpha}_{1}^{-} and γ¯1\overline{\gamma}_{1}^{-} as α¯1=𝔼g2P1+=𝔼γ¯1+γ¯1++(Smp)2.\overline{\alpha}_{1}^{-}=\mathbb{E}\frac{\partial g_{2}^{-}}{\partial P_{1}^{+}}=\mathbb{E}\tfrac{\overline{\gamma}_{1}^{+}}{\overline{\gamma}_{1}^{+}+(S_{\rm{mp}}^{-})^{2}}. This gives,

α¯1−={λσtr2βGβ<1(1−1β)+1βλσtr2βGβ≥1\displaystyle\overline{\alpha}_{1}^{-}=\begin{cases}\tfrac{\lambda}{\sigma_{\rm tr}^{2}\beta}G&\beta<1\\ (1-\tfrac{1}{\beta})+\tfrac{1}{\beta}\tfrac{\lambda}{\sigma_{\rm tr}^{2}\beta}G&\beta\geq 1\end{cases} (93)

Here, in the overparameterized case (β>1)(\beta>1), the point mass at Smp−=0S_{\rm{mp}}^{-}=0 (weight 11β1-\tfrac{1}{\beta}) contributes 1 to the expectation, since the denoiser g2g_{2}^{-} then simply outputs R1+R_{1}^{+}; the positive part of SmpS_{\rm{mp}}^{-} contributes the remaining term 1βλσtr2βG\tfrac{1}{\beta}\tfrac{\lambda}{\sigma_{\rm tr}^{2}\beta}G.

Next, from line 37 we get,

γ¯1−=(1α¯1−−1)γ¯1+={1G−λσtr2ββ<1λσtr2β(1G−λσtr2β)β−1G+λσtr2ββ>1\displaystyle\overline{\gamma}_{1}^{-}=(\tfrac{1}{\overline{\alpha}_{1}^{-}}-1)\overline{\gamma}_{1}^{+}=\begin{cases}\tfrac{1}{G}-\tfrac{\lambda}{\sigma_{\rm tr}^{2}\beta}&\beta<1\\ \frac{\tfrac{\lambda}{\sigma_{\rm tr}^{2}\beta}(\tfrac{1}{G}-\tfrac{\lambda}{\sigma_{\rm tr}^{2}\beta})}{\tfrac{\beta-1}{G}+\tfrac{\lambda}{\sigma_{\rm tr}^{2}\beta}}&\beta>1\end{cases} (94)

Now from line 36 and equation (87) we get,

α¯1+P1\displaystyle\overline{\alpha}_{1}^{+}P_{1}^{-} =P^1P10α¯1P1+=(a)P^1α¯1+P10\displaystyle=\widehat{P}_{1}-P_{1}^{0}-\overline{\alpha}_{1}^{-}P_{1}^{+}\overset{\rm(a)}{=}\widehat{P}_{1}-\overline{\alpha}_{1}^{+}P_{1}^{0}
=(b)(SmpSmp+λσtr2β+(Smp)2α¯1+)AP10+Smpλσtr2β+(Smp)2BQ2\displaystyle\overset{\rm(b)}{=}\underbrace{\left(\tfrac{S_{\rm{mp}}^{-}S_{\rm{mp}}^{+}}{\tfrac{\lambda}{\sigma_{\rm tr}^{2}\beta}+(S_{\rm{mp}}^{-})^{2}}-\overline{\alpha}_{1}^{+}\right)}_{A}P_{1}^{0}+\underbrace{\tfrac{S_{\rm{mp}}^{-}}{\tfrac{\lambda}{\sigma_{\rm tr}^{2}\beta}+(S_{\rm{mp}}^{-})^{2}}}_{B}Q_{2}^{-} (95)

where (a) follows from (90) and (87), and (b) follows from (92). From this one can obtain τ1=𝔼(P1)2\tau_{1}^{-}=\mathbb{E}(P_{1}^{-})^{2}, which can be calculated using the fact that P10,Q2P_{1}^{0},Q_{2}^{-} are independent Gaussians with variances 𝔼(P10)2=σtr2Var(W0)\mathbb{E}(P_{1}^{0})^{2}=\sigma_{\rm tr}^{2}{\rm Var}(W^{0}) and 𝔼(Q2)2=σd2\mathbb{E}(Q_{2}^{-})^{2}=\sigma_{d}^{2}. Further, P10,Q2P_{1}^{0},Q_{2}^{-} are independent of (Smp+,Smp)(S_{\rm{mp}}^{+},S_{\rm{mp}}^{-}).

Observe that by (95) we have

τ1=1(α¯1+)2(𝔼(A2)σtr2Var(W0)+𝔼(B2)σd2).\displaystyle\tau_{1}^{-}=\frac{1}{(\overline{\alpha}_{1}^{+})^{2}}\Bigg{(}\mathbb{E}{(A^{2})}\sigma_{\rm tr}^{2}{\rm Var}(W^{0})+\mathbb{E}{(B^{2})}\sigma_{d}^{2}\Bigg{)}. (96)

With some simplification, we get

𝔼(A2)\displaystyle\mathbb{E}{(A^{2})} =(λσtr2β)2G(λσtr2βG)2,\displaystyle=(\frac{\lambda}{\sigma_{\rm{tr}}^{2}\beta})^{2}G^{\prime}-(\frac{\lambda}{\sigma_{\rm{tr}}^{2}\beta}G)^{2}, (97a)
𝔼(B2)\displaystyle\mathbb{E}{(B^{2})} =Gλσtr2βG,\displaystyle=G-\frac{\lambda}{\sigma_{\rm{tr}}^{2}\beta}G^{\prime}, (97b)

where G=Gmp(λσtr2β)G=G_{\rm{mp}}(-\frac{\lambda}{\sigma_{\rm tr}^{2}\beta}), with GmpG_{\rm{mp}} given in Appendix H, and GG^{\prime} is the derivative of GmpG_{\rm{mp}} calculated at λσtr2β-\frac{\lambda}{\sigma_{\rm tr}^{2}\beta}.

Now consider the under-parametrized case (β<1\beta<1):

Let u=λσtr2βu=-\frac{\lambda}{\sigma_{\rm{tr}}^{2}\beta} and z=Gmp(u)z=G_{\rm{mp}}(u). In this case we have

α¯1+=1λσtr2βG=1+uz.\displaystyle\overline{\alpha}_{1}^{+}=1-\frac{\lambda}{\sigma_{\rm{tr}}^{2}\beta}G=1+uz. (98)

Note that,

Gmp1(z)=u\displaystyle G_{\rm{mp}}^{-1}(z)=u (a)Rmp(z)+1z=u\displaystyle\quad\overset{\rm(a)}{\Rightarrow}R_{\rm{mp}}(z)+\frac{1}{z}=u
(b)11βz+1z=u,\displaystyle\quad\overset{\rm(b)}{\Rightarrow}\frac{1}{1-\beta z}+\frac{1}{z}=u, (99a)

where Rmp(.)R_{\rm{mp}}(.) is the R-transform defined in (Tulino et al., 2004) and (a) follows from the relationship between the R- and Stieltjes-transform and (b) follows from the fact that for Marchenko-Pastur distribution we have Rmp(z)=11zβR_{\rm{mp}}(z)=\frac{1}{1-z\beta}. Therefore,

Gmp(11βz+1z)=z\displaystyle G_{\rm{mp}}(\frac{1}{1-\beta z}+\frac{1}{z})=z
Gmp(11βz+1z)=G=1β(1βz)21z2.\displaystyle\Rightarrow G_{\rm{mp}}^{\prime}(\frac{1}{1-\beta z}+\frac{1}{z})=G^{\prime}=\frac{1}{\frac{\beta}{(1-\beta z)^{2}}-\frac{1}{z^{2}}}. (100)

For the over-parametrized case (β>1\beta>1) we have:

α¯1+=1β(1+λσtr2βG)=1−uzβ.\displaystyle\overline{\alpha}_{1}^{+}=\tfrac{1}{\beta}(1+\tfrac{\lambda}{\sigma_{\rm tr}^{2}\beta}G)=\frac{1-uz}{\beta}. (101)

In this case, as mentioned in Appendix H and following the results from (Tulino et al., 2004), the measure μβ\mu_{\beta} scales with β\beta and thus Rmp(z)=β1zR_{\rm{mp}}(z)=\frac{\beta}{1-z}. Therefore, similar to (99a), zz satisfies

β1z+1z=uG=1β(1z)21z2.\displaystyle\frac{\beta}{1-z}+\frac{1}{z}=u\quad\Rightarrow G^{\prime}=\frac{1}{\frac{\beta}{(1-z)^{2}}-\frac{1}{z^{2}}}. (102)

Now τ1\tau_{1}^{-} can be calculated as follows:

τ1\displaystyle\tau_{1}^{-} =η2(u2z2σtr2var(W0)(κ1)+σd2z(uzκ+1))\displaystyle=\eta^{2}\bigg{(}u^{2}z^{2}\sigma_{\rm tr}^{2}var(W^{0})(\kappa-1)+\sigma_{d}^{2}z(uz\kappa+1)\bigg{)} (103)

where

η={1(1+uz)β<1β(1uz)β1,κ={(1βz)2βz2(1βz)2β<1(1z)2βz2(1z)2β1\displaystyle\eta=\begin{cases}\frac{1}{(1+uz)}&\beta<1\\ \frac{\beta}{(1-uz)}&\beta\geq 1\end{cases},\quad\kappa=\begin{cases}\frac{(1-\beta z)^{2}}{\beta z^{2}-(1-\beta z)^{2}}&\beta<1\\ \frac{(1-z)^{2}}{\beta z^{2}-(1-z)^{2}}&\beta\geq 1\end{cases} (104)

and zz is the solution to the fixed points

{11βz+1z=uβ<1β1z+1z=uβ1.\displaystyle\begin{cases}\frac{1}{1-\beta z}+\frac{1}{z}=u&\beta<1\\ \frac{\beta}{1-z}+\frac{1}{z}=u&\beta\geq 1\end{cases}. (105)

\Box
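
As a numerical companion to Corollary 2 (a simulation sketch of our own, not part of the proof), one can also estimate ℰ_ts directly by solving (83) at finite (N, p) with i.i.d. 𝒩(0, σ_tr²/p) features and averaging the squared error on fresh test samples; for large N the average should approach the prediction of Corollary 1 with the constants derived above.

```python
import numpy as np

rng = np.random.default_rng(4)

def ridge_test_error(N, beta, lam, sigma_tr, sigma_d, var_w0, trials=20):
    p = int(round(beta * N))
    errs = []
    for _ in range(trials):
        w0 = np.sqrt(var_w0) * rng.standard_normal(p)
        X = sigma_tr / np.sqrt(p) * rng.standard_normal((N, p))
        y = X @ w0 + sigma_d * rng.standard_normal(N)
        # Minimizer of (83): (X^T X + (lam/beta) I)^{-1} X^T y
        w_hat = np.linalg.solve(X.T @ X + (lam / beta) * np.eye(p), X.T @ y)
        # E_ts = E (y_ts - x_ts^T w_hat)^2 with x_ts ~ N(0, (sigma_tr^2/p) I):
        errs.append(sigma_tr**2 / p * np.sum((w0 - w_hat)**2) + sigma_d**2)
    return np.mean(errs)

for beta in [0.5, 1.5]:
    print(beta, ridge_test_error(N=400, beta=beta, lam=0.1,
                                 sigma_tr=1.0, sigma_d=0.5, var_w0=1.0))
```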

G.3 Ridgeless Linear Regression

Here we consider the case of ridge regression (83) as λ0+\lambda\rightarrow 0^{+}. Note that the solution (𝐗T𝐗+(λ/β)𝐈)−1𝐗T𝒚(\mathbf{X}^{\text{\sf T}}\mathbf{X}+(\lambda/\beta)\mathbf{I})^{-1}\mathbf{X}^{\text{\sf T}}\bm{y} to problem (83) remains unique as long as λ>0\lambda>0. The following result was stated in (Hastie et al., 2019) and can be recovered using our methodology. Note, however, that we calculate the generalization error whereas they calculate the squared error, whereby we obtain an additional additive term of σd2\sigma_{d}^{2}. The result explains the double-descent phenomenon for ridgeless linear regression.

Corollary 3.

For ridgeless linear regression, we have

limλ0+ts𝖱𝖱={11βσd2β<1ββ1σd2+(11β)σtr2Var(W0)β1\displaystyle\lim_{\lambda\rightarrow 0^{+}}\mathcal{E}^{\mathsf{RR}}_{\rm ts}=\begin{cases}\tfrac{1}{1-\beta}\sigma_{d}^{2}&\beta<1\\ \frac{\beta}{\beta-1}\sigma_{d}^{2}+(1-\tfrac{1}{\beta})\sigma_{\rm tr}^{2}{\rm Var}(W^{0})&\beta\geq 1\end{cases}
Proof of Corollary 3.

We calculate the parameters γ¯0+,γ¯1\overline{\gamma}_{0}^{+},\overline{\gamma}_{1}^{-}, k22k_{22} and τ1\tau_{1}^{-} when λ0+\lambda\rightarrow 0^{+}. Before starting off, we note that

G0\displaystyle G_{0} :=limz0+Gmp(z)={β1ββ<1ββ1β>1,\displaystyle:=\lim_{z\rightarrow 0^{+}}G_{\rm{mp}}(-z)=\begin{cases}\frac{\beta}{1-\beta}&\beta<1\\ \frac{\beta}{\beta-1}&\beta>1\end{cases}, (106)

as described in Appendix H. Following the derivations in Corollary 2, we have

γ¯0+=λ/β,k22=Var(W0)\displaystyle\overline{\gamma}_{0}^{+}=\lambda/\beta,\quad k_{22}={\rm Var}(W^{0}) (107)

Now for λ0+,\lambda\rightarrow 0^{+}, we have

1−α¯1−={1β<11ββ≥1,γ¯1−={1G0=1−βββ<1λ(β−1)σtr2ββ>1,\displaystyle 1-\overline{\alpha}_{1}^{-}=\begin{cases}1&\beta<1\\ \tfrac{1}{\beta}&\beta\geq 1\end{cases},\quad\overline{\gamma}_{1}^{-}=\begin{cases}\frac{1}{G_{0}}=\tfrac{1-\beta}{\beta}&\beta<1\\ \frac{\lambda}{(\beta-1)\sigma_{\rm tr}^{2}\beta}&\beta>1\end{cases}, (108)

Using this in simplifying (95) for λ0+\lambda\rightarrow 0^{+}, we get

τ1=𝔼(P1)2={σd2G0β<1βσd2G0+σtr2Var(W0)(β1)β1\displaystyle\tau_{1}^{-}=\mathbb{E}(P_{1}^{-})^{2}=\begin{cases}\sigma_{d}^{2}G_{0}&\beta<1\\ \beta\sigma_{d}^{2}G_{0}+{\sigma_{\rm tr}^{2}{\rm Var}(W^{0})}{(\beta-1)}&\beta\geq 1\end{cases}

where during the evaluation of 𝔼(Smpγ¯1++(Smp)2)2\mathbb{E}\left(\frac{S_{\rm{mp}}^{-}}{\overline{\gamma}_{1}^{+}+(S_{\rm{mp}}^{-})^{2}}\right)^{2}, for the case of β>1,\beta>1, we need to account for the point mass at 0 for SmpS_{\rm{mp}}^{-} with weight 11β1-\frac{1}{\beta}.

Next, notice that

a:=γ¯0+σtrγ¯0++γ¯1σtr2={0β<1(11β)σtrβ1,\displaystyle a:=\frac{\overline{\gamma}_{0}^{+}\sigma_{\rm tr}}{\overline{\gamma}_{0}^{+}+\overline{\gamma}_{1}^{-}\sigma_{\rm tr}^{2}}=\begin{cases}0&\beta<1\\ (1-\tfrac{1}{\beta})\sigma_{\rm tr}&\beta\geq 1\end{cases},

and,

b:=γ¯1σtr2γ¯0++γ¯1σtr2={1β<11ββ1,\displaystyle b:=\frac{\overline{\gamma}_{1}^{-}\sigma^{2}_{\rm tr}}{\overline{\gamma}_{0}^{+}+\overline{\gamma}_{1}^{-}\sigma_{\rm tr}^{2}}=\begin{cases}1&\beta<1\\ \tfrac{1}{\beta}&\beta\geq 1\end{cases},\quad

Thus applying Corollary 1, we get

ts𝖱𝖱\displaystyle\mathcal{E}^{\mathsf{RR}}_{\rm ts} =a2k22+b2τ1+σd2\displaystyle=a^{2}k_{22}+b^{2}\tau_{1}^{-}+\sigma_{d}^{2}
={11βσd2β<1ββ1σd2+(11β)σtr2Var(W0)β1\displaystyle=\begin{cases}\tfrac{1}{1-\beta}\sigma_{d}^{2}&\beta<1\\ \frac{\beta}{\beta-1}\sigma_{d}^{2}+(1-\tfrac{1}{\beta})\sigma_{\rm tr}^{2}{\rm Var}(W^{0})&\beta\geq 1\end{cases}

This proves the claim. \Box
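
The closed-form limit in Corollary 3 makes the double descent explicit: the error blows up as β→1 from either side and decreases again as β grows beyond 1. A direct evaluation of the stated formula (purely illustrative):

```python
import numpy as np

def ridgeless_gen_error(beta, sigma_d, sigma_tr, var_w0):
    """lim_{lambda -> 0+} E_ts^RR from Corollary 3."""
    if beta < 1:
        return sigma_d**2 / (1 - beta)
    return beta / (beta - 1) * sigma_d**2 + (1 - 1/beta) * sigma_tr**2 * var_w0

for beta in [0.2, 0.5, 0.9, 0.99, 1.01, 1.1, 2.0, 5.0, 20.0]:
    print(beta, ridgeless_gen_error(beta, sigma_d=0.5, sigma_tr=1.0, var_w0=1.0))
```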

G.4 Train-Test Mismatch

Observe that our formulation allows for analyzing the effect of mismatch in the training and test distribution. One can consider arbitrary joint distributions over (Str,Sts)(S_{\rm tr},S_{\rm ts}) that model the mismatch between training and test features. Here we give a simple example which highlights the effect of this mismatch.

Definition 4 (Bernoulli ε\varepsilon-mismatch).

(Sts,Str)(S_{\rm ts},S_{\rm tr}) has a bivariate Bernoulli distribution with

  1. \bullet

    Pr{Str=Sts=0}={Str=Sts=1}=(1ε)/\Pr\{S_{\rm tr}\!=\!S_{\rm ts}\!=\!0\}=\mathbb{P}\{S_{\rm tr}\!=\!S_{\rm ts}\!=\!1\}=(1-\varepsilon)/2

  2. \bullet

    Pr{Str=0,Sts=1}={Str=1,Sts=0}=ε/2\Pr\{S_{\rm tr}\!=\!0,S_{\rm ts}\!=\!1\}=\mathbb{P}\{S_{\rm tr}\!=\!1,S_{\rm ts}\!=\!0\}=\varepsilon/2

Notice that the marginal distribution of the StrS_{\rm tr} in the Bernoulli ε\varepsilon-mismatch model is such that (Str0)=12.\mathbb{P}(S_{\rm tr}\neq 0)=\frac{1}{2}. Hence half of the features extracted by the matrix V0V_{0} are relevant during training. Similarly, (Sts0)=12.\mathbb{P}(S_{\rm ts}\neq 0)=\frac{1}{2}. However the features spanned by the test data do not exactly overlap with the features captured in the training data, and the fraction of features common to both the training and test data is 1ε1-\varepsilon. Hence for ε=0\varepsilon=0, there is no training-test mismatch, whereas for ε=1\varepsilon=1 there is a complete mismatch.

The following result shows that the generalization error increases linearly with the mismatch parameter ε.\varepsilon.

Corollary 4 (Mismatch).

Consider the problem of Linear Regression (83) under the conditions of Corollary 1. Additionally suppose we have Bernoulli ε\varepsilon-mismatch between the training and test distributions. Then

ts=k222((1−ε)γ∗2+ε)+τ1−2(1−γ∗)2(1−ε)+σd2,\displaystyle\mathcal{E}_{\rm ts}=\tfrac{k_{22}}{2}((1-\varepsilon)\gamma^{*2}+\varepsilon)+\tfrac{\tau_{1}^{-}}{2}(1-\gamma^{*})^{2}(1-\varepsilon)+\sigma_{d}^{2},

where γ:=γ¯0+γ¯0++γ¯1\gamma^{*}:=\frac{\overline{\gamma}_{0}^{+}}{\overline{\gamma}_{0}^{+}+\overline{\gamma}_{1}^{-}}. The terms k22,τ1,γk_{22},\tau_{1}^{-},\gamma^{*} are independent of ε\varepsilon.

Proof.

This follows directly by calculating the expectations of the terms in Corollary 1, with the joint distribution of (Str,Sts)(S_{\rm tr},S_{\rm ts}) given in Definition 4. \Box

The quantities k22k_{22} and τ1\tau_{1}^{-} in the result above can be calculated similarly to the derivation in the proof of Corollary 2 and can in general depend on the regularization parameter λ\lambda and the overparameterization ratio β\beta.
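
Since (S_tr, S_ts) enters Corollary 1 only through two expectations, the effect of the Bernoulli ε-mismatch can also be computed directly from that corollary. The sketch below (our own illustration; the constants k22, τ1−, γ̄0+, γ̄1− are arbitrary example values) does exactly this and exhibits the linear growth in ε.

```python
import numpy as np

def mismatch_gen_error(eps, k22, tau1m, g0p, g1m, sigma_d):
    """Corollary 1 evaluated under the Bernoulli eps-mismatch law of Definition 4."""
    # (S_tr, S_ts) in {0,1}^2 with P(1,1) = P(0,0) = (1-eps)/2, P(1,0) = P(0,1) = eps/2.
    gamma_star = g0p / (g0p + g1m)
    term1 = 0.5 * ((1 - eps) * gamma_star**2 + eps)    # E[(g0p S_ts / (g0p + S_tr^2 g1m))^2]
    term2 = 0.5 * (1 - eps) * (1 - gamma_star)**2      # E[(g1m S_tr S_ts / (g0p + S_tr^2 g1m))^2]
    return k22 * term1 + tau1m * term2 + sigma_d**2

for eps in np.linspace(0, 1, 6):
    print(round(eps, 1), mismatch_gen_error(eps, k22=1.0, tau1m=0.5,
                                            g0p=0.2, g1m=1.0, sigma_d=0.3))
```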

G.5 Logistic Regression

The precise analysis for the special case of regularized logistic regression estimator with i.i.d Gaussian features is provided in (Salehi et al., 2019). Consider the logistic regression model,

(yi=1|𝒙i):=ρ(𝒙iT𝒘)for i=1,,N\displaystyle\mathbb{P}(y_{i}=1|\bm{x}_{i}):=\rho(\bm{x}_{i}^{\text{\sf T}}\bm{w})~{}~{}~{}\text{for }i=1,\cdots,N

where ρ(x)=11+ex\rho(x)=\frac{1}{1+e^{-x}} is the standard logistic function.

In this problem we consider estimates of 𝒘0\bm{w}^{0} such that

𝒘^:=argmin𝒘𝟏Tlog(1+e𝐗𝒘)𝒚T𝐗𝒘+Fin(𝒘).\displaystyle\widehat{\bm{w}}:=\operatorname*{argmin}_{\bm{w}}\mathbf{1}^{\text{\sf T}}\log(1+e^{\mathbf{X}\bm{w}})-\bm{y}^{\text{\sf T}}\mathbf{X}\bm{w}+F_{\rm in}(\bm{w}).

where F_{\rm in} is the regularization function. This is a special case of the optimization problem (2), where

Fout(𝒚,𝐗𝒘)=𝟏Tlog(1+e𝐗𝒘)𝒚T𝐗𝒘.F_{\rm out}(\bm{y},\mathbf{X}\bm{w})=\mathbf{1}^{\text{\sf T}}\log(1+e^{\mathbf{X}\bm{w}})-\bm{y}^{\text{\sf T}}\mathbf{X}\bm{w}. (109)

As in the linear regression model, using the ML-VAMP GLM learning algorithm we can characterize the generalization error for this model in terms of the quantities \mathbf{K}_{0}^{+},\tau_{1}^{-},\overline{\gamma}_{0}^{+},\overline{\gamma}_{1}^{-} given by Algorithm 2. We note that in this case, the output non-linearity is

ϕout(p2,d)=𝟙{ρ(p2)>d}\displaystyle\phi_{\rm out}(p_{2},d)=\mathbbm{1}_{\{{\rho(p_{2})>d}\}} (110)

where d\sim\text{Unif}(0,1). Also, the denoisers g_{0}^{+} and g_{3}^{-} can be derived as the proximal operators of F_{\rm in} and F_{\rm out} defined in (25).
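To illustrate how these quantities can be computed in practice, the following Python sketch generates labels through the output channel (110) and evaluates the proximal operator of the per-sample logistic loss from (109) by Newton's method on the scalar objective. The function names and the Newton solver are our own choices for this illustration, not part of the ML-VAMP specification.

import numpy as np

def rho(x):
    # Standard logistic function.
    return 1.0 / (1.0 + np.exp(-x))

def sample_output(p2, rng):
    # Output channel phi_out(p2, d) = 1{rho(p2) > d} with d ~ Unif(0, 1).
    return (rho(p2) > rng.uniform(size=np.shape(p2))).astype(float)

def prox_logistic(v, y, tau, iters=50):
    # prox of z -> log(1 + e^z) - y*z (per-sample form of the loss in (109)),
    # computed by Newton's method; the objective is strongly convex in z.
    z = np.array(v, dtype=float)
    for _ in range(iters):
        grad = rho(z) - y + (z - v) / tau
        hess = rho(z) * (1.0 - rho(z)) + 1.0 / tau
        z = z - grad / hess
    return z

rng = np.random.default_rng(0)
p2 = rng.normal(size=5)
y = sample_output(p2, rng)
print(y, prox_logistic(np.zeros(5), y, tau=1.0))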

G.6 Support Vector Machines

The asymptotic generalization error for the support vector machine (SVM) is provided in (Deng et al., 2019). Our model can also handle SVMs. Similar to logistic regression, the SVM finds a linear classifier, but it uses the hinge loss instead of the logistic loss. Assuming the class labels are y=\pm 1, the hinge loss is

hinge(y,y^)=max(0,1yy^).\ell_{\rm hinge}(y,\widehat{y})=\max(0,1-y\widehat{y}). (111)

Therefore, if we take

Fout(𝒚,𝐗𝒘)=imax(0,1yi𝐗i𝒘),F_{\rm out}(\bm{y},\mathbf{X}\bm{w})=\sum_{i}\max(0,1-y_{i}\mathbf{X}_{i}\bm{w}), (112)

where \mathbf{X}_{i} is the i^{th} row of the data matrix, the ML-VAMP algorithm for GLMs finds the SVM classifier. In this case the algorithm uses the proximal map of the hinge loss, and our theory provides exact predictions for the estimation and prediction errors of the SVM.
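In contrast to the logistic loss, the proximal operator of the hinge loss admits a simple closed form. The following Python sketch (our own illustrative implementation, not taken from the paper) evaluates it and checks the result against a brute-force grid minimization.

import numpy as np

def prox_hinge(v, y, tau):
    # Closed-form prox of z -> max(0, 1 - y*z) with step tau and label y in {-1, +1}:
    # writing u = y*v, the minimizer is v if u >= 1, v + tau*y if u <= 1 - tau, and y otherwise.
    u = y * v
    if u >= 1.0:
        return v
    if u <= 1.0 - tau:
        return v + tau * y
    return y

def prox_hinge_bruteforce(v, y, tau):
    # Direct minimization of max(0, 1 - y*z) + (z - v)^2 / (2*tau) on a fine grid.
    z = np.linspace(v - 5.0 - 5.0 * tau, v + 5.0 + 5.0 * tau, 200001)
    obj = np.maximum(0.0, 1.0 - y * z) + (z - v) ** 2 / (2.0 * tau)
    return z[np.argmin(obj)]

rng = np.random.default_rng(0)
for _ in range(5):
    v, y, tau = rng.normal(), rng.choice([-1.0, 1.0]), rng.uniform(0.1, 2.0)
    print(round(prox_hinge(v, y, tau), 4), round(prox_hinge_bruteforce(v, y, tau), 4))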

As with all other models considered in this work, the true underlying data-generating model can be anything representable by the graphical model in Figure 1, e.g., a logistic or probit model, and our theory exactly predicts the error when the SVM is used to learn such linear classifiers in the large-system limit.

Appendix H Marchenko-Pastur distribution

We describe the random variable S_{\rm mp} defined in (12), where S_{\rm mp}^{2} has a rescaled Marchenko-Pastur distribution. Notice that the squares of the positive entries of \mathbf{s}_{\rm mp} are the positive eigenvalues of \mathbf{U}^{\text{\sf T}}\mathbf{U} (or \mathbf{U}\mathbf{U}^{\text{\sf T}}).

Observe that U_{ij}\sim\mathcal{N}(0,\frac{1}{p}), whereas the standard scaling in the study of the Marchenko-Pastur distribution is for matrices \mathbf{H} such that H_{ij}\sim\mathcal{N}(0,\frac{1}{N}) (see, e.g., equation (1.10) of (Tulino et al., 2004) and the discussion preceding it). Also notice that \sqrt{\beta}\mathbf{U} has the same distribution as \mathbf{H}. Thus the results from (Tulino et al., 2004) apply directly to the distributions of the eigenvalues of \beta\mathbf{U}^{\text{\sf T}}\mathbf{U} and \beta\mathbf{U}\mathbf{U}^{\text{\sf T}}. We state their result below, taking this difference in scaling into account.

The positive eigenvalues of β𝐔T𝐔\beta\mathbf{U}^{\text{\sf T}}\mathbf{U} have an empirical distribution which converges to the following density:

μβ(x)=(bβx)+(xaβ)+2πβx\displaystyle\mu_{\beta}(x)=\frac{\sqrt{(b_{\beta}-x)_{+}(x-a_{\beta})_{+}}}{2\pi\beta x} (113)

where aβ=(1β)2a_{\beta}=(1-\sqrt{\beta})^{2}, bβ:=(1+β)2b_{\beta}:=(1+\sqrt{\beta})^{2}. Similarly the positive eigenvalues of β𝐔𝐔T\beta\mathbf{U}\mathbf{U}^{\text{\sf T}} have an empirical distribution converging to the density βμβ\beta\mu_{\beta}. We note the following integral which is useful in our analysis:

G0:\displaystyle G_{0}: =limz0𝔼1Smp2z𝟙{Smp>0}\displaystyle=\lim_{z\rightarrow 0^{-}}\mathbb{E}\frac{1}{S_{{\rm{mp}}}^{2}-z}\mathbbm{1}_{\{{S_{\rm{mp}}>0}\}}
=limz0aβbβ1x/βzμβ(x)𝑑x=β|β1|.\displaystyle=\lim_{z\rightarrow 0^{-}}\int_{a_{\beta}}^{b_{\beta}}\frac{1}{x/\beta-z}\mu_{\beta}(x)dx=\frac{\beta}{|\beta-1|}. (114)

More generally, the Stieltjes transform of the density is given by:

Gmp(z)=𝔼1Smp2z𝟙{Smp>0}=aβbβ1x/βzμβ(x)𝑑x\displaystyle G_{\rm{mp}}(z)=\mathbb{E}\frac{1}{S_{\rm{mp}}^{2}-z}\mathbbm{1}_{\{{S_{\rm{mp}}>0}\}}=\int_{a_{\beta}}^{b_{\beta}}\frac{1}{x/\beta-z}\mu_{\beta}(x)dx (115)
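As a numerical sanity check of the density (113) and the limit (114), the following Python sketch (our own illustrative script) compares the empirical eigenvalues of \beta\mathbf{U}^{\text{\sf T}}\mathbf{U} with the support edges a_\beta, b_\beta and with the value \beta/|\beta-1|. The matrix size and the choice \beta<1, for which all eigenvalues of \mathbf{U}^{\text{\sf T}}\mathbf{U} are positive, are assumptions of the example.

import numpy as np

beta, N = 0.5, 2000
p = int(beta * N)
rng = np.random.default_rng(0)
U = rng.normal(0.0, np.sqrt(1.0 / p), (N, p))        # U_ij ~ N(0, 1/p)
eigs = np.linalg.eigvalsh(beta * (U.T @ U))           # eigenvalues of beta * U^T U

a_beta, b_beta = (1.0 - np.sqrt(beta)) ** 2, (1.0 + np.sqrt(beta)) ** 2
print(eigs.min(), a_beta)                             # lower edge of the support
print(eigs.max(), b_beta)                             # upper edge of the support
print(np.mean(beta / eigs), beta / abs(beta - 1.0))   # empirical value of G_0 vs. (114)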