
Bayesian inference with finitely wide neural networks

Chi-Ken Lu Faculty of Data Science Minor, [email protected] Department of Mathematics and Computer Science, Rutgers University, Newark, New Jersey 07102, USA
Abstract

Analytic inference, e.g. a predictive distribution in closed form, is an appealing benefit for machine learning practitioners who treat wide neural networks as Gaussian processes in the Bayesian setting. Realistic widths, however, are finite and cause a weak deviation from the Gaussianity under which partial marginalization of the random variables in a model is straightforward. On the basis of the multivariate Edgeworth expansion, we propose a non-Gaussian distribution in differential form to model a finite set of outputs from a random neural network, and we derive the corresponding marginal and conditional properties. We are thus able to derive the non-Gaussian posterior distribution in a Bayesian regression task. In addition, for bottlenecked deep neural networks, a weight-space representation of deep Gaussian processes, the non-Gaussianity is investigated through the marginal kernel and the accompanying small parameters.

I Introduction

Neal, in his seminal work [1], pointed out that a shallow but infinitely wide random neural network is a Gaussian process (GP) [2] in the statistical sense. Subsequent work [3, 4] interpreting neural networks with specific nonlinear activation units as kernel machines was also inspired by this idea. More recent reports [5, 6] further claimed the equivalence between GPs and deep neural networks when each hidden layer in the latter is of infinite width. Consequently, machine learning practitioners can perform Bayesian inference by treating a deep and wide neural network as a GP, and exploit the analytic marginal and conditional properties of the multivariate Gaussian distribution. Otherwise, one needs to employ gradient-based learning and bootstrap sampling to obtain a predictive distribution [7].

In reality, all neural networks have finite width. The deviation from Gaussianity therefore requires a quantitative account, as practitioners may wonder about the corrections to the predictive mean and variance in, for example, a regression task. Yaida [8] and colleagues [9] proposed a perturbative approach for computing the multivariate cumulants by direct application of Wick's contraction theorem. Moreover, the fourth cumulants are shown to be nonzero, scaled by the sum of reciprocal widths, $1/N_{1}+1/N_{2}+\cdots+1/N_{L-1}$, signaling the non-Gaussian aspect of the random processes representing finite-width deep neural networks with $L$ hidden layers [8]. A quartic energy functional for a fixed set of network outputs is formulated field-theoretically, with which the corrections to the posterior mean and variance due to the weak non-Gaussianity are obtained [8].

While Yaida's approach is appealing from a field-theory perspective (see also [10, 11]), the loss of the elegant marginal property, due to the fourth-power term in the exponent, is critical for analytic inference with finite-width networks. An alternative approach is to modify the multivariate Gaussian distribution so that the new distribution matches the higher cumulants associated with the networks. In this paper, we shall use the multivariate Edgeworth series expansion [12] to construct the non-Gaussian distribution for the network's outputs. In particular, we find that the differential representation of the Edgeworth expansion greatly facilitates the derivation of marginal and conditional properties of the non-Gaussian distribution. Three main results are reported in this paper. First, the marginal property is intact, and the corrections to the conditional mean and variance are derived. Second, with observed data, the non-Gaussian posterior distribution associated with an unobserved output is derived. Third, we derive the marginal covariance [13] of a bottlenecked deep neural network [14], which represents a deep Gaussian process [15] in weight space. It is worthwhile to note that some of the hidden layers in the bottlenecked network are narrow, and strong non-Gaussianity may be induced.

The paper is organized as follows. In Sec. II we begin by reviewing the computational structure of a shallow cosine network, its equivalence to a GP with a Gaussian kernel, and the emergence of non-Gaussianity due to finite width. The shallow network with random parameters in the Bayesian setting is a non-Gaussian prior over functions. In Sec. III the non-Gaussian prior is written in a differential representation of the Edgeworth expansion around the multivariate Gaussian distribution, and its marginal and conditional properties are established. Application of the non-Gaussian prior in a Bayesian regression task is discussed in Sec. IV. Finally, Sec. V is devoted to investigating the combined effect of nonlinear activation, depth, and finite width on the non-Gaussian prior for a deep bottlenecked cosine network, which is followed by a discussion in Sec. VI.

II Wide feed forward neural network

Let us start with a random shallow feed forward network with cosine activation. In the infinite-width limit, the network is statistically equivalent to a GP with a Gaussian kernel [16]. The goal is to see the emergence of a nonzero fourth cumulant when the width is finite. Consider a single-output network with $N$ activation units. The real function value $z_{\alpha}\in\mathbb{R}$ with Greek subscript is an indexed random variable associated with its input ${\bf x}_{\alpha}\in\mathbb{R}^{d}$. Explicitly, the output of the network is related to its input by

z_{\alpha}=\sqrt{\frac{2}{N}}\sum_{i=1}^{N}w_{i}\cos(\frac{\Omega_{i}\cdot{\bf x}_{\alpha}}{\sqrt{d}}+\phi_{i})\>, (1)

where the weight variables $w$'s are sampled from the Gaussian distribution $\mathcal{N}(0,1)$, the scaling variables $\Omega$'s from $\mathcal{N}(0,I_{d})$, and the phase variables $\phi$'s from the uniform distribution $\mathcal{U}([0,2\pi])$. The normalization factors $\sqrt{2/N}$ and $1/\sqrt{d}$ above follow the parameterization used in [17]. Because $w$ is zero-mean, the first relevant statistical moment is the covariance, $k({\bf x}_{\alpha},{\bf x}_{\beta})$, namely the expectation of the product of two function values at two inputs,

\begin{split}k({\bf x}_{\alpha},{\bf x}_{\beta})&:=\mathbb{E}[z_{\alpha}z_{\beta}]\\ &=\frac{1}{N}\sum_{i}\mathbb{E}\bigg{\{}\cos[\frac{\Omega_{i}\cdot({\bf x}_{\alpha}-{\bf x}_{\beta})}{\sqrt{d}}]+\cos[\frac{\Omega_{i}\cdot({\bf x}_{\alpha}+{\bf x}_{\beta})}{\sqrt{d}}+2\phi_{i}]\bigg{\}}\\ &=e^{-\frac{1}{2d}|{\bf x}_{\alpha}-{\bf x}_{\beta}|^{2}}\>.\end{split} (2)

In the above, the independence between the random variables $w$'s is used to arrive at the second equality. The third equality follows from the fact that the Fourier transform of a Gaussian is Gaussian and that the average of a cosine with uniformly random phase vanishes. The fourth moment can be computed in a similar manner, but one should notice that the product of two cosine terms containing the random phases can generate a nonzero contribution. Here, we focus on the fourth cumulant tensor as the signature of non-Gaussianity,

\begin{split}V_{\alpha\beta\gamma\delta}&:=\mathbb{E}[z_{\alpha}z_{\beta}z_{\gamma}z_{\delta}]-K_{\alpha\beta}K_{\gamma\delta}-K_{\alpha\gamma}K_{\beta\delta}-K_{\alpha\delta}K_{\beta\gamma}\\ &=\frac{1}{N^{2}}\sum_{ij}\mathbb{E}\big{\{}\cos[\frac{\Omega_{i}\cdot({\bf x}_{\alpha}+{\bf x}_{\beta})}{\sqrt{d}}+2\phi_{i}]\cos[\frac{\Omega_{j}\cdot({\bf x}_{\gamma}+{\bf x}_{\delta})}{\sqrt{d}}+2\phi_{j}]\big{\}}+{\rm sym.\ perm.}\\ &=\frac{1}{2N}\bigg{(}e^{-\frac{1}{2d}|{\bf x}_{\alpha}+{\bf x}_{\beta}-{\bf x}_{\gamma}-{\bf x}_{\delta}|^{2}}+e^{-\frac{1}{2d}|{\bf x}_{\alpha}+{\bf x}_{\gamma}-{\bf x}_{\beta}-{\bf x}_{\delta}|^{2}}+e^{-\frac{1}{2d}|{\bf x}_{\alpha}+{\bf x}_{\delta}-{\bf x}_{\beta}-{\bf x}_{\gamma}|^{2}}\bigg{)}\>.\end{split} (3)

Hence, the fourth cumulant tensor receives a contribution proportional to the reciprocal width, $1/N$, and it is symmetric with respect to permutation of indices.
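
As a minimal numerical sketch, the kernel of Eq. (2) and the $1/N$ scaling of the fourth cumulant in Eq. (3) can be checked by Monte Carlo sampling of the random cosine network of Eq. (1); the helper name cosine_network_outputs, the inputs, and the sample sizes below are illustrative choices, and the cumulant estimate is noisy for larger widths.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_network_outputs(X, N, n_draws, batch=50_000):
    """Sample z_alpha (Eq. 1) at the rows of X over many random parameter draws."""
    m, d = X.shape
    out = []
    for start in range(0, n_draws, batch):
        S = min(batch, n_draws - start)
        w = rng.standard_normal((S, N))                     # w_i ~ N(0, 1)
        Omega = rng.standard_normal((S, N, d))              # Omega_i ~ N(0, I_d)
        phi = rng.uniform(0.0, 2.0 * np.pi, (S, N))         # phi_i ~ U([0, 2pi])
        pre = np.einsum("ad,snd->san", X, Omega) / np.sqrt(d) + phi[:, None, :]
        out.append(np.sqrt(2.0 / N) * np.einsum("san,sn->sa", np.cos(pre), w))
    return np.concatenate(out)                              # shape (n_draws, m)

x_a, x_b = np.zeros(3), 0.8 * np.ones(3)                    # two inputs in d = 3
X = np.stack([x_a, x_b])
k_exact = np.exp(-np.sum((x_a - x_b) ** 2) / (2 * X.shape[1]))

for N in (5, 10, 20):
    Z = cosine_network_outputs(X, N, n_draws=300_000)
    k_mc = np.mean(Z[:, 0] * Z[:, 1])
    # fourth cumulant at a single input, E[z^4] - 3 (E[z^2])^2, expected 3/(2N) from Eq. (3)
    V_mc = np.mean(Z[:, 0] ** 4) - 3.0 * np.mean(Z[:, 0] ** 2) ** 2
    print(f"N={N:3d}  k_mc={k_mc:+.3f} (exact {k_exact:.3f})  "
          f"V_mc={V_mc:+.4f} (3/(2N) = {3 / (2 * N):.4f})")
```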

III Multivariate Edgeworth expansion and non-Gaussian prior

The fourth cumulant tensor in Eq. (3) signifies that the distribution over the network outputs is non-Gaussian. In other words, the random parameters $w$'s, $\Omega$'s, and $\phi$'s drawn from a prior distribution induce a non-Gaussian prior function distribution due to the finite width. The first interesting consequence of the nonzero fourth cumulant is that the prior distribution over a single output, say $z_{\alpha}$, receives a correction term deviating from the Gaussian [18],

q(z_{\alpha})=\mathcal{N}(z_{\alpha}|0,\sigma_{0}^{2})\bigg{[}1+\frac{V_{\alpha\alpha\alpha\alpha}}{24}(3-6\frac{z_{\alpha}^{2}}{\sigma_{0}^{2}}+\frac{z_{\alpha}^{4}}{\sigma_{0}^{4}})\bigg{]}\>. (4)

The variance is $\sigma_{0}^{2}:=k({\bf x}_{\alpha},{\bf x}_{\alpha})$, and the new distribution $q$ has matched moments, i.e. $\mathbb{E}_{q}[1]=1$, $\mathbb{E}_{q}[z_{\alpha}^{2}]=\sigma_{0}^{2}$, and $\mathbb{E}_{q}[z_{\alpha}^{4}]=3\sigma_{0}^{4}+V_{\alpha\alpha\alpha\alpha}$. Although the work [18] derived the above non-Gaussian distribution from a renormalization group perspective, such an expansion in Hermite polynomials around the Gaussian distribution has been known as the Edgeworth expansion in the statistics literature [19, 20]. When the width $N\rightarrow\infty$, the fourth cumulant $V$ vanishes, and for notational convenience we denote the limiting distribution, which is Gaussian, by $q_{\infty}$.
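
A quick quadrature check of the moment matching of Eq. (4) is sketched below; for the cosine network the prior variance is $\sigma_{0}^{2}=k({\bf x}_{\alpha},{\bf x}_{\alpha})=1$ and $V_{\alpha\alpha\alpha\alpha}=3/(2N)$ from Eq. (3), while the grid limits and the width value are arbitrary choices for the illustration.

```python
import numpy as np

def edgeworth_1d(z, V):
    """Univariate Edgeworth density of Eq. (4) with sigma_0^2 = 1."""
    gauss = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return gauss * (1.0 + V / 24.0 * (3.0 - 6.0 * z**2 + z**4))

N = 10
V = 3.0 / (2 * N)                      # fourth cumulant of the cosine network, Eq. (3)
z = np.linspace(-10.0, 10.0, 100001)
dz = z[1] - z[0]
q = edgeworth_1d(z, V)
print("int q dz    :", np.sum(q) * dz, "(expect 1)")
print("int z^2 q dz:", np.sum(z**2 * q) * dz, "(expect 1)")
print("int z^4 q dz:", np.sum(z**4 * q) * dz, "(expect", 3.0 + V, ")")
```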

III.1 Joint distribution

As we are mainly interested in the prior distribution over a finite set of function values $\{z_{1},z_{2},\cdots\}$, the objective here is to construct the non-Gaussian joint distribution as a prior for Bayesian learning. The work [12] suggested constructing the multivariate Edgeworth expansion by replacing the univariate Gaussian $\mathcal{N}(z|0,\sigma_{0}^{2})$ with the multivariate one $\mathcal{N}({\bf z}|0,K)$ and the Hermite polynomials with contracted tensor terms containing the inverse covariance matrix $K^{-1}$ and the vector of function values ${\bf z}$. Please see Appendix A for the expression.

An apparent difficulty with such a representation is demonstrating consistency after marginalization, i.e., $\int dz_{1}\,q(z_{1},z_{2},z_{3},\cdots)=q(z_{2},z_{3},\cdots)$, which is of critical importance for deriving the conditional distribution and for subsequent Bayesian learning. Inspired by the fact that the Hermite polynomials are obtained by taking derivatives of a Gaussian, we can rewrite the multivariate Edgeworth expansion in the following differential form,

q({\bf z})=\big{(}1+\frac{V_{ijml}}{24}\partial_{z_{i}}\partial_{z_{j}}\partial_{z_{m}}\partial_{z_{l}}\big{)}\mathcal{N}({\bf z}|0,K) (5)

where summation over repeated indices is employed. An advantage of this representation is that the adjoint property of the derivative, $\partial_{z}^{{\dagger}}=-\partial_{z}$, can be exploited in the integration of functions which vanish at infinity. Hence, it is easy to see that the distribution $q$ is normalized and that the second and fourth moments match; for instance,

\begin{split}\mathbb{E}_{q}[z_{1}z_{2}]&=K_{12}\>,\\ \mathbb{E}_{q}[z_{1}z_{2}z_{3}z_{4}]&=K_{12}K_{34}+K_{13}K_{24}+K_{14}K_{23}+V_{1234}\>.\end{split} (6)
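
For instance, the second-moment identity follows from moving the derivatives onto the polynomial factor by repeated integration by parts,

\mathbb{E}_{q}[z_{1}z_{2}]=K_{12}+\frac{V_{ijml}}{24}\int d{\bf z}\ z_{1}z_{2}\,\partial_{z_{i}}\partial_{z_{j}}\partial_{z_{m}}\partial_{z_{l}}\mathcal{N}({\bf z}|0,K)=K_{12}+\frac{V_{ijml}}{24}\int d{\bf z}\ \big{[}\partial_{z_{i}}\partial_{z_{j}}\partial_{z_{m}}\partial_{z_{l}}(z_{1}z_{2})\big{]}\mathcal{N}({\bf z}|0,K)=K_{12}\>,

since the four sign flips from $\partial_{z}^{{\dagger}}=-\partial_{z}$ cancel and the fourth derivative of the quadratic $z_{1}z_{2}$ vanishes. Repeating the same steps for $z_{1}z_{2}z_{3}z_{4}$ leaves $\partial_{z_{i}}\partial_{z_{j}}\partial_{z_{m}}\partial_{z_{l}}(z_{1}z_{2}z_{3}z_{4})$, whose contraction with the symmetric tensor $V_{ijml}/24$ yields exactly $V_{1234}$, giving the second line of Eq. (6).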

The above non-Gaussian prior can be viewed as a perturbative extension of the multivariate Gaussian. In the setting of Bayesian regression, the marginal and conditional properties of the Gaussian are appealing when conjugate likelihood functions are employed. In the following, we shall examine the effects of the perturbation in differential form on these properties.

III.2 Marginal consistency and conditional statistics

We first show that marginal consistency remains intact for the Edgeworth expansion in differential form. Namely, without loss of generality, it suffices to show that marginalizing $z_{1}$ out of the joint distribution $q(z_{1},z_{2},z_{3},\cdots)$,

\begin{split}\int dz_{1}q(z_{1},\tilde{\bf z})&=\mathcal{N}(\tilde{\bf z}|0,\tilde{K})+v_{\tilde{i}\tilde{j}\tilde{m}\tilde{l}}\partial_{z_{\tilde{i}}}\partial_{z_{\tilde{j}}}\partial_{z_{\tilde{m}}}\partial_{z_{\tilde{l}}}\int dz_{1}\mathcal{N}(z_{1},\tilde{\bf z}|0,K)\\ &=\big{(}1+v_{\tilde{i}\tilde{j}\tilde{m}\tilde{l}}\partial_{z_{\tilde{i}}}\partial_{z_{\tilde{j}}}\partial_{z_{\tilde{m}}}\partial_{z_{\tilde{l}}}\big{)}\mathcal{N}(\tilde{\bf z}|0,\tilde{K})\\ &=q(\tilde{\bf z})\>,\end{split} (7)

indeed reproduces the joint distribution for $z_{2},z_{3},\cdots$ in the original form of Eq. (5), where the smaller covariance matrix $\tilde{K}$ is the submatrix of $K$ excluding the first row and column. For simplicity, we use $v$ to denote $V/24$. In deriving the above, we have employed the fact that $\int dz_{1}(\partial_{z_{1}})^{s}\mathcal{N}(z_{1},\tilde{\bf z})=0$ for any integer power $s\geq 1$, meaning that the indices $i,j,m,l$ shall exclude any contribution associated with $z_{1}$. Thus the tilded indices run over $\tilde{i}\in\{2,3,4,\cdots\}$, and the corresponding derivatives can be factored out of the integral.

Here we wish to stress that the above marginal property of the non-Gaussian prior may seem straightforward with the differential representation of the Edgeworth expansion, but it does not seem promising to reach the same conclusion if the marginalization is carried out with the explicit representation in Appendix A. With marginal consistency in hand, one can easily write the conditional distribution $q(z_{1}|\tilde{\bf z})=q(z_{1},\tilde{\bf z})/q(\tilde{\bf z})$. More interestingly, the conditional mean is useful for prediction with noiseless observations of data, and it also sheds light on the effect of the non-Gaussianity due to the finite width. The details of deriving the following conditional mean

\begin{split}\mathbb{E}[z_{1}|\tilde{\bf z}]&=\int dz_{1}z_{1}\frac{q(z_{1},\tilde{\bf z})}{q(\tilde{\bf z})}\\ &=\mu(\tilde{\bf z})+\frac{1}{q(\tilde{\bf z})}A_{\tilde{i}\tilde{j}\tilde{m}}\partial_{z_{\tilde{i}}}\partial_{z_{\tilde{j}}}\partial_{z_{\tilde{m}}}\mathcal{N}(\tilde{\bf z}|0,\tilde{K})\>,\end{split} (8)

with the finite-width correction term proportional to the third-order tensor,

A_{\tilde{i}\tilde{j}\tilde{m}}=\frac{V_{\tilde{i}\tilde{j}\tilde{m}\tilde{l}}}{6}[{\tilde{K}}^{-1}{\bf k}]_{\tilde{l}}-\frac{V_{1\tilde{i}\tilde{j}\tilde{m}}}{6}\>, (9)

can be found in Appendix B. In the GP limit, the conditional mean becomes the well-known result $\mu(\tilde{\bf z})={\bf k}^{t}{\tilde{K}}^{-1}{\tilde{\bf z}}$, as the fourth cumulant $V$ vanishes in the large-$N$ limit. We also remind the reader that the tilded symbols are associated with the conditioned outputs $z_{2,3,\cdots}$. In a similar manner, we can also show the conditional second moment,

\begin{split}\mathbb{E}[z_{1}^{2}|\tilde{\bf z}]&=\int dz_{1}z_{1}^{2}\frac{q(z_{1},\tilde{\bf z})}{q(\tilde{\bf z})}\\ &=\mu^{2}(\tilde{\bf z})+\sigma^{2}+\frac{1}{q(\tilde{\bf z})}\bigg{[}2\mu(\tilde{\bf z})A_{\tilde{i}\tilde{j}\tilde{m}}\partial_{z_{\tilde{i}}}\partial_{z_{\tilde{j}}}\partial_{z_{\tilde{m}}}+B_{\tilde{i}\tilde{j}}\partial_{z_{\tilde{i}}}\partial_{z_{\tilde{j}}}\bigg{]}\mathcal{N}(\tilde{\bf z}|0,\tilde{K})\>,\end{split} (10)

with the second-order tensor

B_{\tilde{i}\tilde{j}}=\frac{V_{\tilde{i}\tilde{j}\tilde{m}\tilde{l}}}{2}[{\tilde{K}}^{-1}{\bf k}]_{\tilde{m}}[{\tilde{K}}^{-1}{\bf k}]_{\tilde{l}}-V_{1\tilde{i}\tilde{j}\tilde{m}}[{\tilde{K}}^{-1}{\bf k}]_{\tilde{m}}+\frac{V_{11\tilde{i}\tilde{j}}}{2}\>. (11)

Here, $\sigma^{2}=K_{11}-{\bf k}^{t}{\tilde{K}}^{-1}{\bf k}$ coincides with the conditional variance in the GP limit. The details of the derivation can also be found in Appendix B. It is worthwhile to note that the conditional variance, $\mathbb{E}_{q}[z_{1}^{2}|\tilde{\bf z}]-(\mathbb{E}_{q}[z_{1}|\tilde{\bf z}])^{2}$, can include contributions related to the function values $\tilde{\bf z}$ through the nonzero $A$ and $B$ terms in the above expressions. The lack of such dependence in the GP limit is in fact a shortcoming for modelling, which has motivated the study of non-Gaussian priors in the machine learning community (for example the Student-t process in [21]).

III.3 An example: bivariate distribution and prediction

Now let us pause to consider a simple example where a shallow network defined in Eq. (1) with width $N$ is used to model two input-output pairs $({\bf x}_{1},z_{1})$ and $({\bf x}_{2},z_{2})$. In the end, we shall present the predictive mean and variance for $z_{1}$ conditioned on $z_{2}$. The following notations are used for simplification: the covariance $c:=\exp(-|{\bf x}_{1}-{\bf x}_{2}|^{2}/2d)$, the 2-by-2 covariance matrix $\Sigma=\bigl{(}\begin{smallmatrix}1&c\\ c&1\end{smallmatrix}\bigr{)}$ for the bivariate prior, and $z_{21}:=(z_{2}-cz_{1})/\sqrt{1-c^{2}}$. In addition, the following relation between derivatives of a Gaussian and Hermite polynomials will be useful,

\partial^{n}_{z_{1}}\partial^{m}_{z_{2}}\mathcal{N}(z_{1},z_{2}|0,\Sigma)\propto\partial^{n}_{z_{1}}\big{[}\mathcal{N}(z_{1}|0,1)\partial^{m}_{z_{2}}\mathcal{N}(z_{21}|0,1)\big{]}\propto(-1)^{m}\frac{\mathcal{N}(z_{2}|0,1)}{\sqrt{(1-c^{2})^{m}}}\partial^{n}_{z_{1}}[H_{m}(z_{21})\mathcal{N}(z_{12}|0,1)]\>, (12)

where we have used $\mathcal{N}(z_{1},z_{2}|0,\Sigma)\propto\mathcal{N}(z_{1}|0,1)\mathcal{N}(z_{21}|0,1)$ up to some irrelevant constant. The probabilist's Hermite polynomials are defined as $H_{n}(z)=(-1)^{n}[\mathcal{N}(z|0,1)]^{-1}d^{n}_{z}\mathcal{N}(z|0,1)$ [19]. Consequently, following the previous discussion, the non-Gaussian bivariate distribution is shown to be

\begin{split}q(z_{1},z_{2})&=\mathcal{N}(z_{1},z_{2}|0,\Sigma)\bigg{\{}1+\frac{\gamma^{4}}{16N}[H_{4}(z_{21})+H_{4}(z_{12})]+\frac{c\gamma^{4}}{4N}[3cH_{2}(z_{21})+H_{3}(z_{21})H_{1}(z_{12})+z_{12}\leftrightarrow z_{21}]\\ &+\frac{(c^{4}+2)\gamma^{4}}{8N}[2c^{2}+4cH_{1}(z_{21})H_{1}(z_{12})+H_{2}(z_{21})H_{2}(z_{12})]\bigg{\}}\>,\end{split} (13)

where the symbol $\gamma=1/\sqrt{1-c^{2}}$ is used to simplify the expression. In addition, the fourth cumulants for the cosine network, $V_{1111}=3/2N$, $V_{1112}=3c/2N$, and $V_{1122}=(c^{4}+2)/2N$, have been plugged in above. As for the conditional distribution, e.g. $q(z_{1}|z_{2})=q(z_{1},z_{2})/q(z_{2})$ in the bivariate case, one can simply divide the above by $q(z_{2})=\mathcal{N}(z_{2}|0,1)[1+H_{4}(z_{2})/16N]$, which yields a conditional Gaussian $\mathcal{N}(z_{1}|z_{2})$ multiplied by a factor representing the finite-width effect.

We can directly apply Eq. (8) to obtain the conditional mean in this simple example, where the tilded indices only refer to $z_{2}$. Surprisingly, the conditional mean for such a noiseless case is

\mathbb{E}[z_{1}|z_{2}]=cz_{2}\>, (14)

coinciding with that in the GP limit because of the accidental cancellation in the third-order tensor $A_{222}\propto(cV_{2222}-V_{1222})$. We shall show that the cancellation does not occur in the noisy (next Section) and more general cases. Application of Eq. (10), together with the above conditional mean, results in the following conditional variance,

{\rm Var}[z_{1}|z_{2}]=(1-c^{2})\bigg{[}1+\frac{4(2-c^{2})}{16N+H_{4}(z_{2})}H_{2}(z_{2})\bigg{]}\>, (15)

which consists of the corresponding variance in the GP limit along with a term depending on $z^{2}_{2}$.
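
As a numerical illustration, the bivariate density can be built directly from the differential form of Eq. (5) with the cosine-network cumulants, and the conditional mean and variance can then be evaluated by quadrature for comparison with Eqs. (14) and (15). The sketch below uses the standard sympy and scipy libraries; the width, the covariance value, and the conditioning point are arbitrary choices.

```python
import itertools
import numpy as np
import sympy as sp
from scipy.integrate import quad

z1, z2 = sp.symbols("z1 z2", real=True)
N_width = 8                                   # width of the cosine network
c = sp.Rational(1, 2)                         # covariance k(x1, x2)

det = 1 - c**2                                # bivariate Gaussian N(z | 0, Sigma)
gauss = sp.exp(-(z1**2 - 2*c*z1*z2 + z2**2) / (2 * det)) / (2 * sp.pi * sp.sqrt(det))

# fourth cumulants of the shallow cosine network, Eq. (3)
V = {(1, 1, 1, 1): sp.Rational(3, 2 * N_width), (2, 2, 2, 2): sp.Rational(3, 2 * N_width),
     (1, 1, 1, 2): 3 * c / (2 * N_width), (1, 2, 2, 2): 3 * c / (2 * N_width),
     (1, 1, 2, 2): (c**4 + 2) / (2 * N_width)}

# differential form of Eq. (5): sum over all ordered index tuples (i, j, m, l)
zs = {1: z1, 2: z2}
corr = sum(V[tuple(sorted(idx))] * sp.diff(gauss, *[zs[i] for i in idx])
           for idx in itertools.product((1, 2), repeat=4))
q = sp.lambdify((z1, z2), gauss + corr / 24, "numpy")

v, cf = 0.7, float(c)                         # condition on z2 = 0.7
norm = quad(lambda t: q(t, v), -8, 8)[0]
mean = quad(lambda t: t * q(t, v), -8, 8)[0] / norm
var = quad(lambda t: t**2 * q(t, v), -8, 8)[0] / norm - mean**2
he2, he4 = v**2 - 1, v**4 - 6 * v**2 + 3
print("mean:", mean, "  Eq. (14):", cf * v)
print("var :", var, "  Eq. (15):",
      (1 - cf**2) * (1 + 4 * (2 - cf**2) * he2 / (16 * N_width + he4)))
```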

IV Bayesian regression with non-Gaussian prior

Having established the marginal and conditional properties of the non-Gaussian distribution in Eq. (5), we now investigate the posterior distribution over an unseen function value $z_{*}$ associated with input ${\bf x}_{*}$ when the noisy observations $\mathcal{D}=\{({\bf x}_{1},y_{1}),({\bf x}_{2},y_{2}),\cdots\}$ are given. According to Bayes's rule, the posterior distribution of interest is defined as

q(z_{*}|\mathcal{D})=\frac{q(z_{*},\mathcal{D})}{q(\mathcal{D})}=\frac{\int d{\bf z}\ q(z_{*},{\bf z})\mathcal{N}({\bf y}|{\bf z},\Lambda)}{\int d{\bf z}\ q({\bf z})\mathcal{N}({\bf y}|{\bf z},\Lambda)}\>, (16)

with $\Lambda$ denoting the diagonal matrix representing the noise in the Gaussian likelihood. When the $q$'s in the numerator and denominator are both Gaussian, which corresponds to the infinitely wide network, the limiting distribution reads

\begin{split}q_{\infty}(z_{*}|\mathcal{D})&=\int d{\bf z}\ \mathcal{N}(z_{*}|{\bf k}_{*}^{t}K^{-1}{\bf z},k_{**}-{\bf k}_{*}^{t}K^{-1}{\bf k}_{*})\ \mathcal{N}({\bf z}|K(K+\Lambda)^{-1}{\bf y},\Lambda K(K+\Lambda)^{-1})\\ &=\mathcal{N}\big{(}z_{*}|{\bf k}_{*}^{t}(K+\Lambda)^{-1}{\bf y},k_{**}-{\bf k}_{*}^{t}(K+\Lambda)^{-1}{\bf k}_{*}\big{)}\>.\end{split}

In the first equality above, the first and second Gaussian distributions on the right hand side represent the conditional density $q_{\infty}(z_{*}|{\bf z})$ and the posterior $q_{\infty}({\bf z}|\mathcal{D})$, respectively.

As for the finite-width case, the Gaussian likelihood can be dealt with by repeated application of the adjoint property of the differential operator as well as the derivative identities for independent Gaussians. The details of the derivation of the evidence $q(\mathcal{D})$ can be found in Appendix C, and the expression in terms of derivatives with respect to the observed $y$'s reads

\begin{split}q(\mathcal{D})&=\int d{\bf z}\ \mathcal{N}({\bf y}|{\bf z},\Lambda)q({\bf z})\\ &=(1+v_{ijml}\partial_{y_{i}}\partial_{y_{j}}\partial_{y_{m}}\partial_{y_{l}})\mathcal{N}({\bf y}|0,K+\Lambda)\>.\end{split} (17)

Despite its simple form, the evidence term cannot be manipulated further as in the infinite-width case. The details of the derivation of the posterior distribution for the output $z_{*}$ at the unseen input,

\begin{split}q(z_{*}|\mathcal{D})&=\frac{\int d{\bf z}\ \mathcal{N}({\bf y}|{\bf z},\Lambda)q(z_{*},{\bf z})}{q(\mathcal{D})}\\ &=(1+v_{\hat{i}\hat{j}\hat{m}\hat{l}}\partial_{\hat{i}\hat{j}\hat{m}\hat{l}})\mathcal{N}\big{[}\bigl{(}\begin{smallmatrix}z_{*}\\ {\bf y}\end{smallmatrix}\bigr{)}\big{|}0,\bigl{(}\begin{smallmatrix}1&{\bf k}_{*}^{t}\\ {\bf k}_{*}&K+\Lambda\end{smallmatrix}\bigr{)}\big{]}/q(\mathcal{D})\>,\end{split} (18)

can also be found in Appendix C. Here, the notation can be understood as follows: the hatted indices, $\hat{i}$ for instance, additionally include the symbol $*$ associated with the test function value $z_{*}$, and the derivatives refer to $\partial_{*}:=\partial_{z_{*}}$ (with respect to the function value) and $\partial_{i}:=\partial_{y_{i}}$ (with respect to the observed values).

We conclude this Section by considering the same simple regression example, but now with a noisy observation $y_{2}$ at input ${\bf x}_{2}$, and we wish to predict the unobserved function value $z_{1}$. To stress the role of the noise parameter $\sigma_{n}$, we only show the predictive mean here,

\mathbb{E}[z_{1}|y_{2}]=c\alpha_{n}^{2}y_{2}-\frac{c\sigma_{n}^{2}\alpha_{n}^{5}H_{3}(\alpha_{n}y_{2})}{4N+\alpha_{n}^{4}H_{4}(\alpha_{n}y_{2})}\>, (19)

with $\alpha_{n}=1/\sqrt{1+\sigma_{n}^{2}}$. The correction term vanishes as the noise parameter does, and it is odd under a sign change of $y_{2}$.
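
A minimal sketch that evaluates Eq. (19) makes both properties visible; the covariance, width, and observation values chosen below are arbitrary.

```python
import numpy as np

def predictive_mean(y2, c, sigma_n2, N):
    """Predictive mean of Eq. (19); H_3 and H_4 are the probabilist's Hermite polynomials."""
    a = 1.0 / np.sqrt(1.0 + sigma_n2)                 # alpha_n
    u = a * y2
    he3, he4 = u**3 - 3.0 * u, u**4 - 6.0 * u**2 + 3.0
    return c * a**2 * y2 - c * sigma_n2 * a**5 * he3 / (4.0 * N + a**4 * he4)

c, N = 0.7, 10
for s2 in (0.0, 0.1, 0.5):
    m_plus, m_minus = predictive_mean(1.5, c, s2, N), predictive_mean(-1.5, c, s2, N)
    gp = c * 1.5 / (1.0 + s2)                         # GP-limit mean c * alpha_n^2 * y2
    # the correction vanishes with the noise and flips sign with y2
    print(f"sigma_n^2={s2}:  mean(+1.5)={m_plus:+.4f}  mean(-1.5)={m_minus:+.4f}  GP={gp:+.4f}")
```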

V Deep Bottlenecked Network

Up to this point, we have focused on the emergence of non-Gaussianity in the prior function distribution of a wide and shallow network, as well as on how the non-Gaussian prior affects the inference in a regression task. Our approach can be extended to deep and wide neural networks, as the corresponding prior distributions approach a GP [5, 6]. However, such a wide-width assumption, which leads to weak non-Gaussianity, is not valid for bottlenecked networks in which the hidden layers are alternately wide and narrow [22, 14].

Let us consider a two-layer bottlenecked feed forward network modeling the hierarchical mapping ${\bf x}\in\mathbb{R}^{d}\mapsto{\bf z}^{(1)}\in\mathbb{R}^{H}\mapsto z^{(2)}\in\mathbb{R}$. The hierarchy consists of the following computations,

\begin{split}z^{(1)}_{i,\alpha}&=\sqrt{\frac{2}{N_{1}}}\sum_{j=1}^{N_{1}}w^{(1)}_{ij}\cos(\frac{\Omega^{(1)}_{j}\cdot{\bf x}_{\alpha}}{\sqrt{d}}+\phi^{(1)}_{j})\>,\\ z^{(2)}_{\alpha}&=\sqrt{\frac{2}{N_{2}}}\sum_{j=1}^{N_{2}}w^{(2)}_{j}\cos(\frac{\Omega^{(2)}_{j}\cdot{\bf z}^{(1)}_{\alpha}}{\sqrt{H}}+\phi^{(2)}_{j})\>,\end{split} (20)

where the parameters $w$'s, $\Omega$'s, and $\phi$'s share the same prior distributions as their counterparts in the shallow network. The widths of the hidden layers, $N_{1}$ and $N_{2}$, are large, but the hidden output dimension $H$ is not. In the limit $N_{1,2}\rightarrow\infty$ while $H$ remains finite, the two-layer bottlenecked cosine network corresponds to the prior of a deep Gaussian process [15, 23, 24] with Gaussian kernel, which is a flexible and expressive function prior due to its compositional nature. As its name suggests, the conditional prior $z^{(2)}|{\bf z}^{(1)}$ is a GP, and so is each component $z_{i}^{(1)}|{\bf x}$ of the hidden output. However, the marginal distribution of $z^{(2)}|{\bf x}$ is non-Gaussian, as [13] showed that the fourth cumulant is positive, i.e., the distribution is heavy-tailed [25].

Here, we are interested in tracking how the small parameters $1/N_{1}$ and $1/N_{2}$ and the bottleneck parameter $1/H$ enter the second moment. For a deep and finitely wide linear network, the prior is non-Gaussian [8, 26], but the second moment does not receive a correction due to the finite width [8, 9]. Thus, the effects of nonlinear activation on the higher statistical moments are interesting (also see the recent work [27]). For the deep cosine network in Eq. (20), we can first consider the wide limit, $N_{1,2}\rightarrow\infty$, while the bottleneck width $H$ remains finite. Following the result in Eq. (2), the covariance for the deep model can be computed as follows,

\begin{split}k^{(2)}_{\infty}({\bf x}_{\alpha},{\bf x}_{\beta})&=\mathbb{E}_{q_{\infty}}\bigg{\{}\exp\big{(}-\frac{|{\bf z}^{(1)}_{\alpha}-{\bf z}^{(1)}_{\beta}|^{2}}{2H}\big{)}\bigg{\}}\\ &=\bigg{[}\int dz_{\alpha}dz_{\beta}\ e^{-\frac{(z_{\alpha}-z_{\beta})^{2}}{2H}}q_{\infty}(z_{\alpha},z_{\beta})\bigg{]}^{H}\\ &=\big{[}1+\frac{1-\exp(-\frac{|{\bf x}_{\alpha}-{\bf x}_{\beta}|^{2}}{2d})}{H/2}\big{]}^{-H/2}\>.\end{split} (21)

Note that the decomposition $\mathcal{N}(z_{1},z_{2}|0,K)=\mathcal{N}(z_{1}+z_{2}|0,\sigma_{+}^{2})\mathcal{N}(z_{1}-z_{2}|0,\sigma_{-}^{2})$ [28], with the variances $\sigma_{\pm}^{2}=2(K_{11}\pm K_{12})$, has been used in the derivation. We stress that the above kernel is exact for any $H$. For the very narrow case, $H=1$, the above result coincides with that in [13], while in the very wide bottleneck case, $H\gg 1$, it can be shown that

k^{(2)}_{\infty}({\bf x}_{\alpha},{\bf x}_{\beta})\approx\exp[k({\bf x}_{\alpha},{\bf x}_{\beta})-1]\big{\{}1+\frac{1}{H}[k({\bf x}_{\alpha},{\bf x}_{\beta})-1]^{2}\big{\}}\>, (22)

where the covariance $k$ of the shallow network is given in Eq. (2). In addition, the exponential of the shallow kernel above corresponds to $H\rightarrow\infty$, which is the same as the deep kernel in [29]. Indeed, as suggested in [9], the appearance of the reciprocal width, $1/H$, in the correction term is associated with the weak non-Gaussianity even though the other widths $N_{1,2}$ are infinite.
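
A small numerical sketch comparing the exact bottleneck kernel of Eq. (21), its wide-bottleneck approximation of Eq. (22), and the $H\rightarrow\infty$ exponential kernel is given below; the value of the shallow covariance $k$ is an arbitrary choice.

```python
import numpy as np

def k_deep_exact(k, H):
    """Exact bottleneck kernel of Eq. (21) for infinite N1, N2 and finite H."""
    return (1.0 + 2.0 * (1.0 - k) / H) ** (-H / 2.0)

def k_deep_wide_bottleneck(k, H):
    """Wide-bottleneck approximation of Eq. (22)."""
    return np.exp(k - 1.0) * (1.0 + (k - 1.0) ** 2 / H)

k = 0.6                            # shallow covariance k(x_alpha, x_beta) from Eq. (2)
for H in (1, 2, 5, 20, 100):
    print(f"H={H:3d}  exact={k_deep_exact(k, H):.5f}  "
          f"Eq.(22)={k_deep_wide_bottleneck(k, H):.5f}  H->inf={np.exp(k - 1.0):.5f}")
```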

Next, we compute the covariance for the case where all widths are finite. In fact, the outer width $N_{2}$ does not enter the kernel, and we only need a large but finite $N_{1}$. The computation is similar, and one only needs to replace the limiting joint distribution $q_{\infty}(z^{(1)}_{\alpha},z^{(1)}_{\beta})$ with the non-Gaussian $q$. The details of the derivation, which uses the derivative-of-Gaussian property, can be found in Appendix D. The kernel of the two-layer bottlenecked cosine network with $N_{1,2}<\infty$ reads

k^{(2)}({\bf x}_{\alpha},{\bf x}_{\beta})=\bigg{[}1+\frac{1-k({\bf x}_{\alpha},{\bf x}_{\beta})}{H/2}\bigg{]}^{-H/2}(1+\epsilon)^{H}\>, (23)

with the correction due to finite $N_{1}$,

\epsilon=\frac{V_{\alpha\alpha\alpha\alpha}+V_{\beta\beta\beta\beta}-4V_{\alpha\beta\beta\beta}-4V_{\beta\alpha\alpha\alpha}+6V_{\alpha\alpha\beta\beta}}{24}\big{[}\frac{3}{H^{2}}-\frac{6\sigma_{-}^{2}}{H^{3}(1+\sigma_{-}^{2}/H)}+\frac{3\sigma_{-}^{4}}{H^{4}(1+\sigma_{-}^{2}/H)^{2}}\big{]}\>, (24)

where $\sigma_{-}^{2}=2(1-k)$ is used to ease the notation. Again, the above result holds for general $H$. In the wide bottleneck limit, one can show that these small parameters enter the kernel with corrections of $O(\frac{1}{H})$ followed by $O(\frac{1}{N_{1}H})$. Hence, the deep model as a function prior serves as a GP with a random kernel whose mean value converges to Eq. (22) when all widths $N_{1,2}$ and $H$ approach infinity. However, the role of $H$ differs from that of $N_{1}$ in our perturbative analysis, and $1/N_{1}$ does not appear alone in the small parameters. The influence of finite width on the kernel of deep nonlinear and convolutional models is also studied in [30, 31, 32].
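
The following sketch evaluates the correction of Eq. (24) and the finite-$N_{1}$ kernel of Eq. (23), assuming the hidden-layer cumulants take the shallow-network form of Eq. (3), i.e. $V_{\alpha\alpha\alpha\alpha}=V_{\beta\beta\beta\beta}=3/(2N_{1})$, $V_{\alpha\beta\beta\beta}=V_{\beta\alpha\alpha\alpha}=3k/(2N_{1})$, and $V_{\alpha\alpha\beta\beta}=(k^{4}+2)/(2N_{1})$; the values of $k$, $H$, and $N_{1}$ below are arbitrary.

```python
import numpy as np

def epsilon_correction(k, H, N1):
    """Finite-N1 correction of Eq. (24) for a pair of inputs with shallow covariance k."""
    v_sum = (3.0 + 3.0 - 4.0 * 3.0 * k - 4.0 * 3.0 * k + 6.0 * (k**4 + 2.0)) / (2.0 * N1)
    s2 = 2.0 * (1.0 - k)                               # sigma_minus^2
    bracket = (3.0 / H**2 - 6.0 * s2 / (H**3 * (1.0 + s2 / H))
               + 3.0 * s2**2 / (H**4 * (1.0 + s2 / H) ** 2))
    return v_sum / 24.0 * bracket

def k_deep_finite_width(k, H, N1):
    """Kernel of Eq. (23): the infinite-width factor of Eq. (21) times (1 + eps)^H."""
    return (1.0 + 2.0 * (1.0 - k) / H) ** (-H / 2.0) * (1.0 + epsilon_correction(k, H, N1)) ** H

k = 0.6
for H, N1 in [(2, 50), (2, 500), (20, 50), (20, 500)]:
    base = (1.0 + 2.0 * (1.0 - k) / H) ** (-H / 2.0)
    print(f"H={H:3d} N1={N1:4d}  k2={k_deep_finite_width(k, H, N1):.5f}  (N1->inf: {base:.5f})")
```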

The fourth moment can be computed in a similar manner. The closed form for $N_{1,2}\rightarrow\infty$ can be found in [13]. For the finite-width case, the fourth cumulant $V_{\alpha\beta\gamma\delta}^{(2)}$ becomes the $H$-th power of the permutational symmetrization of $\frac{1}{N_{2}}\mathbb{E}_{q}\{\exp[-(z^{(1)}_{\alpha}+z^{(1)}_{\beta}-z^{(1)}_{\gamma}-z^{(1)}_{\delta})^{2}/(2H)]\}$, whose small parameters consist of $1/N_{2}$, $1/(N_{2}H)$, and $1/(N_{2}HN_{1})$. The analysis of the scaling with respect to the reciprocal widths is thus more complex than for the deep linear network.

VI Discussion

In essence, the finite width in random neural networks induces a nonzero variance of the kernel, since the fourth cumulant is not zero. From a function-space perspective, a shallow network with finitely large width can be regarded as a GP whose kernel is itself a random variable, so the learned data representation is a distribution over kernels. Therefore, a neural network with very narrow layers may not have enough expressive power for learning, while a network with very wide layers is not flexible enough because it is equivalent to GP learning with one fixed kernel. It was recently suggested in [33], through a study of the partition function [34, 35] of a finite deep network, that a Student-t process [21] may suitably represent a finite-width network, which offers an alternative description to the one in this paper. The reports in [30, 24] observed a degradation of performance for Bayesian deep neural networks when expanding the widths, while the study in [36] suggested otherwise.

Taking the wide limit with the perturbative approach is appealing because of the analytic and elegant expressions for inference, which may shed light on future investigations of alternative algorithms for classification [37], and on studies of the average test error in regression [38, 39] and classification [40] tasks using finite-width networks. In this paper, the conditional statistics and the perturbed posterior distribution over an unseen output under the non-Gaussian prior are obtained with the help of the differential representation of the multivariate Edgeworth expansion. In parallel with the equivalence between the GP prior and random wide neural networks, the investigation of the neural tangent kernel [17, 41, 42] is important for understanding the learning dynamics during optimization and its relation to Bayesian neural network learning [43, 44].

Acknowledgement

The research is supported by the Dean's Office of the School of Arts and Sciences at Rutgers University-Newark. Correspondence with Jacob Zavatone-Veth and Pietro Rotondo is acknowledged.

References

  • Neal [1997] R. M. Neal, Monte carlo implementation of Gaussian process models for Bayesian regression and classification, arXiv preprint physics/9701026  (1997).
  • Rasmussen and Williams [2006] C. E. Rasmussen and C. K. I. Williams, Gaussian Process for Machine Learning (MIT press, Cambridge, MA, 2006).
  • Williams [1997] C. K. Williams, Computing with infinite networks, in Advances in neural information processing systems (1997) pp. 295–301.
  • Cho and Saul [2009] Y. Cho and L. K. Saul, Kernel methods for deep learning, in Advances in neural information processing systems (2009) pp. 342–350.
  • Matthews et al. [2018] A. G. d. G. Matthews, J. Hron, M. Rowland, R. E. Turner, and Z. Ghahramani, Gaussian process behaviour in wide deep neural networks, in International Conference on Learning Representations (2018).
  • Lee et al. [2018] J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein, Deep neural networks as Gaussian processes, in International Conference on Learning Representations (2018).
  • Osband [2016] I. Osband, Risk versus uncertainty in deep learning: Bayes, bootstrap and the dangers of dropout, in Workshop on Bayesian Deep Learning (NIPS, 2016).
  • Yaida [2020] S. Yaida, Non-Gaussian processes and neural networks at finite widths, in Mathematical and Scientific Machine Learning (PMLR, 2020) pp. 165–192.
  • Roberts et al. [2021] D. A. Roberts, S. Yaida, and B. Hanin, The principles of deep learning theory, arXiv preprint arXiv:2106.10165  (2021).
  • Dyer and Gur-Ari [2019] E. Dyer and G. Gur-Ari, Asymptotics of wide networks from Feynman diagrams, in International Conference on Learning Representations (2019).
  • Gabrié [2020] M. Gabrié, Mean-field inference methods for neural networks, Journal of Physics A: Mathematical and Theoretical 53, 223002 (2020).
  • Skovgaard [1986] I. M. Skovgaard, On multivariate Edgeworth expansions, International Statistical Review/Revue Internationale de Statistique , 169 (1986).
  • Lu et al. [2020] C.-K. Lu, S. C.-H. Yang, X. Hao, and P. Shafto, Interpretable deep Gaussian processes with moments, in International Conference on Artificial Intelligence and Statistics (2020) pp. 613–623.
  • Agrawal et al. [2020] D. Agrawal, T. Papamarkou, and J. D. Hinkle, Wide neural networks with bottlenecks are deep Gaussian processes., J. Mach. Learn. Res. 21, 175 (2020).
  • Damianou and Lawrence [2013] A. Damianou and N. Lawrence, Deep Gaussian processes, in Artificial Intelligence and Statistics (2013) pp. 207–215.
  • Rahimi and Recht [2008] A. Rahimi and B. Recht, Random features for large-scale kernel machines, in Advances in neural information processing systems (2008) pp. 1177–1184.
  • Jacot et al. [2018] A. Jacot, F. Gabriel, and C. Hongler, Neural tangent kernel: Convergence and generalization in neural networks, in Advances in neural information processing systems (2018) pp. 8571–8580.
  • Antognini [2019] J. M. Antognini, Finite size corrections for neural network Gaussian processes, arXiv preprint arXiv:1908.10030  (2019).
  • Papoulis and Unnikrishna Pillai [2002] A. Papoulis and S. Unnikrishna Pillai, Probability, random variables and stochastic processes (2002).
  • Welling [2005] M. Welling, Robust higher order statistics, in International Workshop on Artificial Intelligence and Statistics (PMLR, 2005) pp. 405–412.
  • Shah et al. [2014] A. Shah, A. Wilson, and Z. Ghahramani, Student-t processes as alternatives to Gaussian processes, in Artificial intelligence and statistics (2014) pp. 877–885.
  • Cutajar et al. [2017] K. Cutajar, E. V. Bonilla, P. Michiardi, and M. Filippone, Random feature expansions for deep Gaussian processes, in Proceedings of the 34th International Conference on Machine Learning-Volume 70 (JMLR. org, 2017) pp. 884–893.
  • Dunlop et al. [2018] M. M. Dunlop, M. A. Girolami, A. M. Stuart, and A. L. Teckentrup, How deep are deep Gaussian processes?, The Journal of Machine Learning Research 19, 2100 (2018).
  • Pleiss and Cunningham [2021] G. Pleiss and J. P. Cunningham, The limitations of large width in neural networks: A deep Gaussian process perspective, Advances in Neural Information Processing Systems 34 (2021).
  • Vladimirova et al. [2019] M. Vladimirova, J. Verbeek, P. Mesejo, and J. Arbel, Understanding priors in Bayesian neural networks at the unit level, in International Conference on Machine Learning (PMLR, 2019) pp. 6458–6467.
  • Zavatone-Veth and Pehlevan [2021] J. Zavatone-Veth and C. Pehlevan, Exact marginal prior distributions of finite Bayesian neural networks, Advances in Neural Information Processing Systems 34 (2021).
  • Zavatone-Veth et al. [2021] J. Zavatone-Veth, A. Canatar, B. Ruben, and C. Pehlevan, Asymptotics of representation learning in finite bayesian neural networks, Advances in neural information processing systems 34, 24765 (2021).
  • Lu et al. [2018] C.-K. Lu, S. C.-H. Yang, and P. Shafto, Standing-wave-decomposition Gaussian process, Physical Review E 98, 032303 (2018).
  • Duvenaud et al. [2014] D. Duvenaud, O. Rippel, R. Adams, and Z. Ghahramani, Avoiding pathologies in very deep networks, in Artificial Intelligence and Statistics (2014) pp. 202–210.
  • Aitchison [2020] L. Aitchison, Why bigger is not always better: on finite and infinite neural networks, in International Conference on Machine Learning (PMLR, 2020) pp. 156–164.
  • Li et al. [2022] M. B. Li, M. Nica, and D. M. Roy, The neural covariance sde: Shaped infinite depth-and-width networks at initialization, arXiv preprint arXiv:2206.02768  (2022).
  • Hanin [2022] B. Hanin, Correlation functions in random fully connected neural networks at finite width, arXiv preprint arXiv:2204.01058  (2022).
  • Ariosto et al. [2022] S. Ariosto, R. Pacelli, M. Pastore, F. Ginelli, M. Gherardi, and P. Rotondo, Statistical mechanics of deep learning beyond the infinite-width limit, arXiv preprint arXiv:2209.04882  (2022).
  • Seung et al. [1992] H. S. Seung, H. Sompolinsky, and N. Tishby, Statistical mechanics of learning from examples, Physical review A 45, 6056 (1992).
  • Li and Sompolinsky [2021] Q. Li and H. Sompolinsky, Statistical mechanics of deep linear neural networks: The backpropagating kernel renormalization, Physical Review X 11, 031059 (2021).
  • Lee et al. [2020] J. Lee, S. Schoenholz, J. Pennington, B. Adlam, L. Xiao, R. Novak, and J. Sohl-Dickstein, Finite versus infinite neural networks: an empirical study, Advances in Neural Information Processing Systems 33, 15156 (2020).
  • Csató et al. [1999] L. Csató, E. Fokoué, M. Opper, B. Schottky, and O. Winther, Efficient approaches to Gaussian process classification, Advances in neural information processing systems 12 (1999).
  • Sollich [1998] P. Sollich, Learning curves for Gaussian processes, Advances in neural information processing systems 11 (1998).
  • Williams and Vivarelli [2000] C. K. Williams and F. Vivarelli, Upper and lower bounds on the learning curve for Gaussian processes, Machine Learning 40, 77 (2000).
  • Seeger [2002] M. Seeger, PAC-Bayesian generalisation error bounds for Gaussian process classification, Journal of machine learning research 3, 233 (2002).
  • Arora et al. [2019] S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang, On exact computation with an infinitely wide neural net, Advances in Neural Information Processing Systems 32, 8141 (2019).
  • Hanin and Nica [2019] B. Hanin and M. Nica, Finite depth and width corrections to the neural tangent kernel, in International Conference on Learning Representations (2019).
  • Karakida et al. [2019] R. Karakida, S. Akaho, and S.-i. Amari, Universal statistics of Fisher information in deep neural networks: Mean field approach, in The 22nd International Conference on Artificial Intelligence and Statistics (2019) pp. 1032–1041.
  • Khan et al. [2019] M. E. E. Khan, A. Immer, E. Abedi, and M. Korzepa, Approximate inference turns deep networks into Gaussian processes, in Advances in neural information processing systems (2019) pp. 3094–3104.

Appendix A Expression for non-Gaussian prior distribution

After explicitly taking the derivatives in Eq. (5), it is straightforward to show that the expression for the multivariate non-Gaussian distribution reads

q({\bf z})=\frac{1}{\sqrt{(2\pi)^{N}|K|}}e^{-\frac{1}{2}{\bf z}^{t}K^{-1}{\bf z}}\bigg{[}1+\frac{V_{\alpha\beta\gamma\delta}}{24}(\chi^{(0)}_{\alpha\beta\gamma\delta}-\chi^{(1)}_{\alpha\beta\gamma\delta\mu\nu}z_{\mu}z_{\nu}+\chi^{(2)}_{\alpha\beta\gamma\delta\mu\nu\kappa\eta}z_{\mu}z_{\nu}z_{\kappa}z_{\eta})\bigg{]}\>, (25)

in which the first perturbation term, a scalar term, is the contraction between the fourth cumulant tensor $V_{\alpha\beta\gamma\delta}\propto 1/N_{1}$ and the following tensor,

\chi^{(0)}_{\alpha\beta\gamma\delta}=(K^{-1})_{\alpha\beta}(K^{-1})_{\gamma\delta}+(K^{-1})_{\alpha\gamma}(K^{-1})_{\beta\delta}+(K^{-1})_{\alpha\delta}(K^{-1})_{\beta\gamma}\>. (26)

The next contribution results from the contraction among the fourth cumulant, the following rank-6 tensor,

\begin{split}\chi^{(1)}_{\alpha\beta\gamma\delta\mu\nu}=&(K^{-1})_{\alpha\mu}(K^{-1})_{\nu\beta}(K^{-1})_{\gamma\delta}+(K^{-1})_{\alpha\beta}(K^{-1})_{\gamma\mu}(K^{-1})_{\nu\delta}\ +\\ &(K^{-1})_{\alpha\mu}(K^{-1})_{\nu\gamma}(K^{-1})_{\beta\delta}+(K^{-1})_{\alpha\gamma}(K^{-1})_{\beta\mu}(K^{-1})_{\nu\delta}\ +\\ &(K^{-1})_{\alpha\mu}(K^{-1})_{\nu\delta}(K^{-1})_{\beta\gamma}+(K^{-1})_{\alpha\delta}(K^{-1})_{\beta\mu}(K^{-1})_{\nu\gamma}\>,\end{split} (27)

and the rank-2 tensor $z_{\mu}z_{\nu}$. The last contribution contains the rank-8 tensor,

\chi^{(2)}_{\alpha\beta\gamma\delta\mu\nu\kappa\eta}=(K^{-1})_{\alpha\mu}(K^{-1})_{\beta\nu}(K^{-1})_{\gamma\kappa}(K^{-1})_{\delta\eta}\>, (28)

the rank-4 tensor $z_{\mu}z_{\nu}z_{\kappa}z_{\eta}$, and the fourth cumulant. It becomes extremely tedious, if at all possible, to prove the marginal and conditional properties with this representation.
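
For completeness, the explicit form of Eqs. (25)-(28) can nevertheless be evaluated directly. The sketch below uses the full symmetry of $V$ to collapse the permutations in the $\chi$ tensors into single contractions and then checks the normalization and the moments of Eq. (6) on a two-dimensional grid; the covariance, width, and grid ranges are arbitrary choices.

```python
import numpy as np

def edgeworth_density(z, K, V):
    """Explicit non-Gaussian density of Eqs. (25)-(28) at a point z.

    z: (m,) vector, K: (m, m) covariance, V: (m, m, m, m) symmetric fourth cumulant.
    """
    P = np.linalg.inv(K)
    h = P @ z                                          # (K^{-1} z)_mu
    gauss = np.exp(-0.5 * z @ P @ z) / np.sqrt((2.0 * np.pi) ** len(z) * np.linalg.det(K))
    # symmetry of V collapses chi^(0), chi^(1) z z, and chi^(2) z z z z
    s0 = 3.0 * np.einsum("abcd,ab,cd->", V, P, P)
    s1 = 6.0 * np.einsum("abcd,ab,c,d->", V, P, h, h)
    s2 = np.einsum("abcd,a,b,c,d->", V, h, h, h, h)
    return gauss * (1.0 + (s0 - s1 + s2) / 24.0)

# bivariate check with the shallow cosine-network cumulants of Eq. (3)
c, N = 0.5, 8
K = np.array([[1.0, c], [c, 1.0]])
V = np.empty((2, 2, 2, 2))
for idx in np.ndindex(2, 2, 2, 2):
    V[idx] = {0: 3.0, 1: 3.0 * c, 2: c**4 + 2.0, 3: 3.0 * c, 4: 3.0}[sum(idx)] / (2.0 * N)

grid = np.linspace(-8.0, 8.0, 201)
dz = grid[1] - grid[0]
Q = np.array([[edgeworth_density(np.array([a, b]), K, V) for b in grid] for a in grid])
Z1, Z2 = np.meshgrid(grid, grid, indexing="ij")
print("normalization:", Q.sum() * dz**2, "(expect 1)")
print("E[z1 z2]     :", (Z1 * Z2 * Q).sum() * dz**2, "(expect", c, ")")
print("E[z1^4]      :", (Z1**4 * Q).sum() * dz**2, "(expect", 3.0 + 3.0 / (2.0 * N), ")")
```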

Appendix B Proofs of conditional mean Eq. (8) and second moment Eq. (10)

By plugging in the corresponding expression for the non-Gaussian joint distribution $q(z_{1},\tilde{\bf z})$, using the conditional property of the distribution, and temporarily suppressing the coefficients and indices related to the fourth cumulant $V$, we can compute the conditional first moment as follows,

\begin{split}\mathbb{E}[z_{1}|\tilde{\bf z}]&\propto\int dz_{1}z_{1}q(z_{1},\tilde{\bf z})\\ &=\int dz_{1}z_{1}(1+V\tilde{\partial}^{4}+V\tilde{\partial}^{3}\partial_{z_{1}}+V\tilde{\partial}^{2}\partial^{2}_{z_{1}}+V\tilde{\partial}\partial^{3}_{z_{1}}+V\partial^{4}_{z_{1}})[\mathcal{N}(z_{1}|\tilde{\bf z})\mathcal{N}(\tilde{\bf z})]\\ &=(1+V\tilde{\partial}^{4})[\mu(\tilde{\bf z})\mathcal{N}(\tilde{\bf z})]-V\tilde{\partial}^{3}\mathcal{N}(\tilde{\bf z})\>.\end{split} (29)

In the second line, the derivative terms of order $\partial^{2}_{z_{1}}$ or higher do not contribute, as they produce zero when acting on $z_{1}$. In the last line, one can further reduce the first term to $\mu(1+V\tilde{\partial}^{4})\mathcal{N}+4(\tilde{\partial}\mu)\tilde{\partial}^{3}\mathcal{N}$, since the $\mu$ term is linear in $\tilde{\bf z}$; note that the former term is in fact $\mu q(\tilde{\bf z})$. In addition, the minus sign preceding $\tilde{\partial}^{3}$ arises from the adjoint property when moving $\partial_{z_{1}}$ to act on $z_{1}$. By restoring the coefficients with some combinatorics and using $\partial_{\tilde{i}}\mu(\tilde{\bf z})=[{\tilde{K}}^{-1}{\bf k}]_{\tilde{i}}$, Eq. (8) is proved.

The conditional second moment can be computed in a similar manner,

\begin{split}\mathbb{E}[z_{1}^{2}|\tilde{\bf z}]&\propto\int dz_{1}z_{1}^{2}q(z_{1},\tilde{\bf z})\\ &=\int dz_{1}z_{1}^{2}(1+V\tilde{\partial}^{4}+V\tilde{\partial}^{3}\partial_{z_{1}}+V\tilde{\partial}^{2}\partial^{2}_{z_{1}}+V\tilde{\partial}\partial^{3}_{z_{1}}+V\partial^{4}_{z_{1}})[\mathcal{N}(z_{1}|\tilde{\bf z})\mathcal{N}(\tilde{\bf z})]\\ &=(1+V\tilde{\partial}^{4})\bigg{\{}[\mu^{2}(\tilde{\bf z})+\sigma^{2}]\mathcal{N}(\tilde{\bf z})\bigg{\}}-2V\tilde{\partial}^{3}[\mu(\tilde{\bf z})\mathcal{N}(\tilde{\bf z})]+2V\tilde{\partial}^{2}[\mathcal{N}(\tilde{\bf z})]\>.\end{split} (30)

The first term in the last line contributes a term proportional to $q(\tilde{\bf z})$; the remaining details are similar to the above.

Appendix C Proofs of evidence in Eq. (17) and posterior in Eq. (18)

We start with the definition in Eq. (17) and focus on the correction term. The computation involves the marginalization over all $z$'s as well as a complication arising from the derivatives with respect to four of the $z$'s. Using the adjoint property of the derivative, i.e. $\int f(x)[\partial g(x)/\partial x]dx=\int[-\partial f(x)/\partial x]g(x)dx$, provided that the product $fg$ vanishes at infinity, we can shuffle the operators $\partial_{z}$ acting on the Gaussian prior on the right to act on the Gaussian likelihood on the left. Then, using the property $(\partial_{t}+\partial_{s})\exp[-(s-t)^{2}]=0$, we can replace $\partial_{z}$ with $-\partial_{y}$ and pull these derivatives out of the integral. The following summarizes these steps.

\begin{split}&\int d{\bf z}\ \mathcal{N}({\bf y}|{\bf z},\Lambda)\big{[}v_{ijml}\partial_{z_{i}}\partial_{z_{j}}\partial_{z_{m}}\partial_{z_{l}}\mathcal{N}({\bf z}|0,K)\big{]}\\ &=v_{ijml}\int d{\bf z}\big{[}\partial_{z_{i}}\partial_{z_{j}}\partial_{z_{m}}\partial_{z_{l}}\mathcal{N}({\bf y}|{\bf z},\Lambda)\big{]}\mathcal{N}({\bf z}|0,K)\\ &=v_{ijml}\partial_{y_{i}}\partial_{y_{j}}\partial_{y_{m}}\partial_{y_{l}}\int d{\bf z}\mathcal{N}({\bf y}|{\bf z},\Lambda)\mathcal{N}({\bf z}|0,K)\\ &=v_{ijml}\partial_{y_{i}}\partial_{y_{j}}\partial_{y_{m}}\partial_{y_{l}}\mathcal{N}({\bf y}|0,K+\Lambda)\>.\end{split} (31)
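
A one-dimensional numerical sketch of the derivative-shuffling step in Eq. (31) is given below, comparing the integral on the left with the closed-form fourth derivative of the marginal Gaussian on the right; the variances and the evaluation point are arbitrary choices, and scipy is used only for the quadrature.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def d4_gauss(x, var):
    """Fourth derivative of a centered one-dimensional Gaussian density with variance var."""
    u = x / np.sqrt(var)
    return norm.pdf(x, scale=np.sqrt(var)) * (u**4 - 6.0 * u**2 + 3.0) / var**2

k, lam, y = 1.0, 0.3, 0.8
lhs = quad(lambda z: norm.pdf(y, loc=z, scale=np.sqrt(lam)) * d4_gauss(z, k), -10, 10)[0]
rhs = d4_gauss(y, k + lam)
print(lhs, rhs)        # the two sides agree, as in the last line of Eq. (31)
```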

As for the proof of Eq. (18), we proceed with the computation of the numerator there, and note that the derivative $\partial_{z_{*}}$ in $q(z_{*},{\bf z})$ can be moved out of the integral directly. To accommodate this extra variable, we place a hat on top of the indices to stress that $*$ is included. The following spells out the meaning of our notation as well as some details.

\begin{split}v_{\hat{i}\hat{j}\hat{m}\hat{l}}&\int d{\bf z}\ \mathcal{N}({\bf y}|{\bf z},\Lambda)\big{[}\partial_{z_{\hat{i}}}\partial_{z_{\hat{j}}}\partial_{z_{\hat{m}}}\partial_{z_{\hat{l}}}\mathcal{N}(z_{*},{\bf z}|0,\hat{K})\big{]}\\ &=\bigg{[}v_{ijml}\partial_{y_{i}}\partial_{y_{j}}\partial_{y_{m}}\partial_{y_{l}}+v_{ijm*}\partial_{y_{i}}\partial_{y_{j}}\partial_{y_{m}}\partial_{z_{*}}+v_{ij**}\partial_{y_{i}}\partial_{y_{j}}\partial^{2}_{z_{*}}+v_{i***}\partial_{y_{i}}\partial^{3}_{z_{*}}+v_{****}\partial^{4}_{z_{*}}\bigg{]}\times\\ &\int d{\bf z}\mathcal{N}({\bf y}|{\bf z},\Lambda)\mathcal{N}(z_{*},{\bf z}|0,\hat{K})\\ &=v_{\hat{i}\hat{j}\hat{m}\hat{l}}\partial_{\hat{i}\hat{j}\hat{m}\hat{l}}\mathcal{N}\big{[}\bigl{(}\begin{smallmatrix}z_{*}\\ {\bf y}\end{smallmatrix}\bigr{)}\big{|}0,\bigl{(}\begin{smallmatrix}1&{\bf k}_{*}^{t}\\ {\bf k}_{*}&K+\Lambda\end{smallmatrix}\bigr{)}\big{]}\>,\end{split} (32)

where the reader can refer to the text around Eq. (2.21) in [2] for the similar integration leading to the second equality.

Appendix D Proof of Eq. (21)

To prove the correction term to the second moment, the previous tricks for dealing with Gaussians remain useful. Below, we first shuffle the four derivatives acting on the bivariate Gaussian to act on the Gaussian factor on the left. One can further use $\partial_{z_{1,2}}\exp[-(z_{1}-z_{2})^{2}/2]=\pm\partial_{z_{-}}\exp(-z_{-}^{2}/2)$, so that one can rewrite $\partial_{ijlm}=(-1)^{n_{1}}\partial^{4}_{z_{-}}$ with $n_{1}$ denoting the number of $z_{1}$ derivatives involved. The rest of the computation is straightforward.

\begin{split}&\int dz_{1}dz_{2}\ e^{-\frac{(z_{1}-z_{2})^{2}}{2H}}(v_{ijml}\partial_{i}\partial_{j}\partial_{m}\partial_{l})\mathcal{N}(z_{1},z_{2}|0,K_{2})\\ &=\int dz_{1}dz_{2}\big{[}(v_{ijml}\partial_{i}\partial_{j}\partial_{m}\partial_{l})e^{-\frac{(z_{1}-z_{2})^{2}}{2H}}\big{]}\mathcal{N}(z_{1},z_{2}|0,K_{2})\\ &=\frac{V_{1111}+V_{2222}-4V_{1222}-4V_{2111}+6V_{1122}}{24}\times\\ &\int\frac{dz_{+}dz_{-}}{2}[(\partial_{z_{-}})^{4}e^{-\frac{z_{-}^{2}}{2H}}]\mathcal{N}(z_{+}|0,\sigma_{+}^{2})\mathcal{N}(z_{-}|0,\sigma_{-}^{2})\\ &=V^{\prime}\int dz(\frac{3}{H^{2}}-\frac{6z^{2}}{H^{3}}+\frac{z^{4}}{H^{4}})e^{-\frac{z^{2}}{2H}}\mathcal{N}(z|0,\sigma_{-}^{2})\\ &=\frac{V^{\prime}}{\sqrt{1+\frac{\sigma_{-}^{2}}{H}}}\big{[}\frac{3}{H^{2}}-\frac{6\sigma_{-}^{2}}{H^{3}(1+\sigma_{-}^{2}/H)}+\frac{3\sigma_{-}^{4}}{H^{4}(1+\sigma_{-}^{2}/H)^{2}}\big{]}\>.\end{split} (33)

We have used the notation $\sigma_{\pm}^{2}=2[1\pm k({\bf x}_{1},{\bf x}_{2})]$ for the variances.