
Bayesian inference with finitely wide neural networks

Chi-Ken Lu Faculty of Data Science Minor, [email protected] Department of Mathematics and Computer Science, Rutgers University, Newark, New Jersey 07102, USA
Abstract

Analytic inference, e.g. a predictive distribution in closed form, is an appealing benefit for machine learning practitioners who treat wide neural networks as Gaussian processes in the Bayesian setting. Realistic widths, however, are finite and cause a weak deviation from the Gaussianity under which partial marginalization of the random variables in a model is straightforward. On the basis of the multivariate Edgeworth expansion, we propose a non-Gaussian distribution in differential form to model a finite set of outputs from a random neural network, and we derive the corresponding marginal and conditional properties. We are thus able to derive the non-Gaussian posterior distribution in a Bayesian regression task. In addition, for bottlenecked deep neural networks, a weight-space representation of deep Gaussian processes, the non-Gaussianity is investigated through the marginal kernel and the accompanying small parameters.

I Introduction

Neal, in his seminal work [1], pointed out that a shallow but infinitely wide random neural network is a Gaussian process (GP) [2] in the statistical sense. Subsequent work [3, 4] interpreting neural networks with specific nonlinear activation units as kernel machines was also inspired by this idea. More recent reports [5, 6] further claimed the equivalence between GPs and deep neural networks when each hidden layer in the latter is of infinite width. Consequently, machine learning practitioners can perform Bayesian inference by treating a deep and wide neural network as a GP, and exploit the analytic marginal and conditional properties of the multivariate Gaussian distribution. Otherwise, one needs to employ gradient-based learning and bootstrap sampling to obtain a predictive distribution [7].

In reality, all neural networks have finite width. The deviation from Gaussianity therefore requires a quantitative account, as practitioners may wonder about the corrections to the predictive mean and variance in, for example, a regression task. Yaida [8] and colleagues [9] proposed a perturbative approach for computing the multivariate cumulants by direct application of Wick's contraction theorem. Moreover, the fourth cumulants are shown to be nonzero, scaled by the sum of reciprocal widths, $1/N_{1}+1/N_{2}+\cdots+1/N_{L-1}$, signaling the non-Gaussian aspect of the random processes representing finite-width deep neural networks with $L$ hidden layers [8]. A quartic energy functional for a fixed set of network outputs is formulated field-theoretically, with which the corrections to the posterior mean and variance due to the weak non-Gaussianity are obtained [8].

While Yaida's approach is appealing from a field-theory perspective (see also [10, 11]), the loss of the elegant marginal property, due to the fourth-power term in the exponent, is critical for analytic inference with finite-width networks. An alternative approach is to modify the multivariate Gaussian distribution so that the new distribution matches the higher cumulants associated with the networks. In this paper, we shall use the multivariate Edgeworth series expansion [12] to construct the non-Gaussian distribution for the network's outputs. In particular, we find that the differential representation of the Edgeworth expansion greatly facilitates the derivation of marginal and conditional properties of the non-Gaussian distribution. Three main results are reported in this paper. First, the marginal property is intact, and the corrections to the conditional mean and variance are derived. Second, with observed data, the non-Gaussian posterior distribution associated with an unobserved output is derived. Third, we derive the marginal covariance [13] of a bottlenecked deep neural network [14], which represents a deep Gaussian process [15] in weight space. It is worthwhile to note that some of the hidden layers in the bottlenecked network are narrow, and strong non-Gaussianity may be induced.

The paper is organized as follows. In Sec. II we begin by reviewing the computational structure of a shallow cosine network, its equivalence to a GP with a Gaussian kernel, and the emergence of non-Gaussianity due to finite width. The shallow network with random parameters in the Bayesian setting is a non-Gaussian prior over functions. In Sec. III the non-Gaussian prior is written in a differential representation of the Edgeworth expansion around the multivariate Gaussian distribution, and its marginal and conditional properties are established. Application of the non-Gaussian prior in a Bayesian regression task is discussed in Sec. IV. Finally, Sec. V is devoted to investigating the combined effect of nonlinear activation, depth, and finite width on the non-Gaussian prior for a deep bottlenecked cosine network, which is followed by a discussion in Sec. VI.

II Wide feed forward neural network

Let us start with a random shallow feed forward network with cosine activation. In the infinite-width limit, the network is statistically equivalent to a GP with a Gaussian kernel [16]. The goal is to see the emergence of a nonzero fourth cumulant when the width is finite. Consider a single-output network with $N$ activation units. The real function value $z_{\alpha}\in\mathbb{R}$ with Greek subscript is an indexed random variable associated with its input ${\bf x}_{\alpha}\in\mathbb{R}^{d}$. Explicitly, the output of the network is related to its input by

z_{\alpha}=\sqrt{\frac{2}{N}}\sum_{i=1}^{N}w_{i}\cos(\frac{\Omega_{i}\cdot{\bf x}_{\alpha}}{\sqrt{d}}+\phi_{i})\>, (1)

where the weight variables $w$'s are sampled from the Gaussian distribution $\mathcal{N}(0,1)$, the scaling variables $\Omega$'s from $\mathcal{N}(0,I_{d})$, and the phase variables $\phi$'s from the uniform distribution $\mathcal{U}([0,2\pi])$. The normalization factors $\sqrt{2/N}$ and $1/\sqrt{d}$ above follow the parameterization used in [17]. Because $w$ is zero-mean, the first relevant statistical moment is the covariance, $k({\bf x}_{\alpha},{\bf x}_{\beta})$, namely the expectation of the product of two function values at two inputs,

\begin{split}k({\bf x}_{\alpha},{\bf x}_{\beta})&:=\mathbb{E}[z_{\alpha}z_{\beta}]\\ &=\frac{1}{N}\sum_{i}\mathbb{E}\bigg{\{}\cos[\frac{\Omega_{i}\cdot({\bf x}_{\alpha}-{\bf x}_{\beta})}{\sqrt{d}}]+\cos[\frac{\Omega_{i}\cdot({\bf x}_{\alpha}+{\bf x}_{\beta})}{\sqrt{d}}+2\phi_{i}]\bigg{\}}\\ &=e^{-\frac{1}{2d}|{\bf x}_{\alpha}-{\bf x}_{\beta}|^{2}}\>.\end{split} (2)

In the above, the independence between the random variables $w$'s is used to arrive at the second equality. The third equality follows from the fact that the Fourier transform of a Gaussian is Gaussian and that the average of a cosine with uniformly random phase vanishes. The fourth moment can be computed in a similar manner, but one should notice that the product of two cosine terms containing the random phases can generate a nonzero contribution. Here, we focus on the fourth cumulant tensor as the signature of non-Gaussianity,

\begin{split}V_{\alpha\beta\gamma\delta}&:=\mathbb{E}[z_{\alpha}z_{\beta}z_{\gamma}z_{\delta}]-K_{\alpha\beta}K_{\gamma\delta}-K_{\alpha\gamma}K_{\beta\delta}-K_{\alpha\delta}K_{\beta\gamma}\\ &=\frac{1}{N^{2}}\sum_{ij}\mathbb{E}\big{\{}\cos[\frac{\Omega_{i}\cdot({\bf x}_{\alpha}+{\bf x}_{\beta})}{\sqrt{d}}+2\phi_{i}]\cos[\frac{\Omega_{j}\cdot({\bf x}_{\gamma}+{\bf x}_{\delta})}{\sqrt{d}}+2\phi_{j}]\big{\}}+{\rm sym.\ perm.}\\ &=\frac{1}{2N}\bigg{(}e^{-\frac{1}{2d}|{\bf x}_{\alpha}+{\bf x}_{\beta}-{\bf x}_{\gamma}-{\bf x}_{\delta}|^{2}}+e^{-\frac{1}{2d}|{\bf x}_{\alpha}+{\bf x}_{\gamma}-{\bf x}_{\beta}-{\bf x}_{\delta}|^{2}}+e^{-\frac{1}{2d}|{\bf x}_{\alpha}+{\bf x}_{\delta}-{\bf x}_{\beta}-{\bf x}_{\gamma}|^{2}}\bigg{)}\>.\end{split} (3)

Hence, the fourth cumulant tensor receives a contribution proportional to the reciprocal width, $1/N$, and it is symmetric with respect to permutation of indices.
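
As a minimal numerical sketch, the kernel of Eq. (2) and the $1/N$ scaling of the fourth cumulant in Eq. (3) can be checked by Monte Carlo sampling of the random cosine network of Eq. (1); the helper name cosine_network_outputs, the inputs, and the sample sizes below are illustrative choices, and the cumulant estimate is noisy for larger widths.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_network_outputs(X, N, n_draws, batch=50_000):
    """Sample z_alpha (Eq. 1) at the rows of X over many random parameter draws."""
    m, d = X.shape
    out = []
    for start in range(0, n_draws, batch):
        S = min(batch, n_draws - start)
        w = rng.standard_normal((S, N))                     # w_i ~ N(0, 1)
        Omega = rng.standard_normal((S, N, d))              # Omega_i ~ N(0, I_d)
        phi = rng.uniform(0.0, 2.0 * np.pi, (S, N))         # phi_i ~ U([0, 2pi])
        pre = np.einsum("ad,snd->san", X, Omega) / np.sqrt(d) + phi[:, None, :]
        out.append(np.sqrt(2.0 / N) * np.einsum("san,sn->sa", np.cos(pre), w))
    return np.concatenate(out)                              # shape (n_draws, m)

x_a, x_b = np.zeros(3), 0.8 * np.ones(3)                    # two inputs in d = 3
X = np.stack([x_a, x_b])
k_exact = np.exp(-np.sum((x_a - x_b) ** 2) / (2 * X.shape[1]))

for N in (5, 10, 20):
    Z = cosine_network_outputs(X, N, n_draws=300_000)
    k_mc = np.mean(Z[:, 0] * Z[:, 1])
    # fourth cumulant at a single input, E[z^4] - 3 (E[z^2])^2, expected 3/(2N) from Eq. (3)
    V_mc = np.mean(Z[:, 0] ** 4) - 3.0 * np.mean(Z[:, 0] ** 2) ** 2
    print(f"N={N:3d}  k_mc={k_mc:+.3f} (exact {k_exact:.3f})  "
          f"V_mc={V_mc:+.4f} (3/(2N) = {3 / (2 * N):.4f})")
```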

III Multivariate Edgeworth expansion and non-Gaussian prior

The fourth cumulant tensor in Eq. (3) signifies that the distribution over the network outputs is non-Gaussian. In other words, the random parameters $w$'s, $\Omega$'s, and $\phi$'s drawn from a prior distribution induce a non-Gaussian prior function distribution due to the finite width. The first interesting consequence of the nonzero fourth cumulant is that the prior distribution over a single output, say $z_{\alpha}$, receives a correction term deviating from the Gaussian [18],

q(z_{\alpha})=\mathcal{N}(z_{\alpha}|0,\sigma_{0}^{2})\bigg{[}1+\frac{V_{\alpha\alpha\alpha\alpha}}{24}(3-6\frac{z_{\alpha}^{2}}{\sigma_{0}^{2}}+\frac{z_{\alpha}^{4}}{\sigma_{0}^{4}})\bigg{]}\>. (4)

The variance is $\sigma_{0}^{2}:=k({\bf x}_{\alpha},{\bf x}_{\alpha})$, and the new distribution $q$ has matched moments, i.e. $\mathbb{E}_{q}[1]=1$, $\mathbb{E}_{q}[z_{\alpha}^{2}]=\sigma_{0}^{2}$, and $\mathbb{E}_{q}[z_{\alpha}^{4}]=3\sigma_{0}^{4}+V_{\alpha\alpha\alpha\alpha}$. Although the work [18] derived the above non-Gaussian distribution from a renormalization group perspective, such an expansion in Hermite polynomials around the Gaussian distribution has been known as the Edgeworth expansion in the statistics literature [19, 20]. When the width $N\rightarrow\infty$, the fourth cumulant $V$ vanishes, and for notational convenience we denote the limiting distribution, which is Gaussian, by $q_{\infty}$.
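
A quick quadrature check of the moment matching of Eq. (4) is sketched below; for the cosine network the prior variance is $\sigma_{0}^{2}=k({\bf x}_{\alpha},{\bf x}_{\alpha})=1$ and $V_{\alpha\alpha\alpha\alpha}=3/(2N)$ from Eq. (3), while the grid limits and the width value are arbitrary choices for the illustration.

```python
import numpy as np

def edgeworth_1d(z, V):
    """Univariate Edgeworth density of Eq. (4) with sigma_0^2 = 1."""
    gauss = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return gauss * (1.0 + V / 24.0 * (3.0 - 6.0 * z**2 + z**4))

N = 10
V = 3.0 / (2 * N)                      # fourth cumulant of the cosine network, Eq. (3)
z = np.linspace(-10.0, 10.0, 100001)
dz = z[1] - z[0]
q = edgeworth_1d(z, V)
print("int q dz    :", np.sum(q) * dz, "(expect 1)")
print("int z^2 q dz:", np.sum(z**2 * q) * dz, "(expect 1)")
print("int z^4 q dz:", np.sum(z**4 * q) * dz, "(expect", 3.0 + V, ")")
```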

III.1 Joint distribution

As we are mainly interested in the prior distribution over a finite set of function values $\{z_{1},z_{2},\cdots\}$, the objective here is to construct the non-Gaussian joint distribution as a prior for Bayesian learning. The work [12] suggested constructing the multivariate Edgeworth expansion by replacing the univariate Gaussian $\mathcal{N}(z|0,\sigma_{0}^{2})$ with the multivariate one $\mathcal{N}({\bf z}|0,K)$ and the Hermite polynomials with contracted tensor terms containing the inverse covariance matrix $K^{-1}$ and the vector of function values ${\bf z}$. Please see Appendix A for the expression.

An apparent difficulty with such a representation is demonstrating consistency after marginalization, i.e., $\int dz_{1}\,q(z_{1},z_{2},z_{3},\cdots)=q(z_{2},z_{3},\cdots)$, which is of critical importance for deriving the conditional distribution and for subsequent Bayesian learning. Inspired by the fact that the Hermite polynomials are obtained by taking derivatives of a Gaussian, we can rewrite the multivariate Edgeworth expansion in the following differential form,

q({\bf z})=\big{(}1+\frac{V_{ijml}}{24}\partial_{z_{i}}\partial_{z_{j}}\partial_{z_{m}}\partial_{z_{l}}\big{)}\mathcal{N}({\bf z}|0,K) (5)

where summation over repeated indices is employed. An advantage of this representation is that the adjoint property of the derivative, $\partial_{z}^{{\dagger}}=-\partial_{z}$, can be exploited in the integration of functions which vanish at infinity. Hence, it is easy to see that the distribution $q$ is normalized and that the second and fourth moments match; for instance,

\begin{split}\mathbb{E}_{q}[z_{1}z_{2}]&=K_{12}\>,\\ \mathbb{E}_{q}[z_{1}z_{2}z_{3}z_{4}]&=K_{12}K_{34}+K_{13}K_{24}+K_{14}K_{23}+V_{1234}\>.\end{split} (6)
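
For instance, the second-moment identity follows from moving the derivatives onto the polynomial factor by repeated integration by parts,

\mathbb{E}_{q}[z_{1}z_{2}]=K_{12}+\frac{V_{ijml}}{24}\int d{\bf z}\ z_{1}z_{2}\,\partial_{z_{i}}\partial_{z_{j}}\partial_{z_{m}}\partial_{z_{l}}\mathcal{N}({\bf z}|0,K)=K_{12}+\frac{V_{ijml}}{24}\int d{\bf z}\ \big{[}\partial_{z_{i}}\partial_{z_{j}}\partial_{z_{m}}\partial_{z_{l}}(z_{1}z_{2})\big{]}\mathcal{N}({\bf z}|0,K)=K_{12}\>,

since the four sign flips from $\partial_{z}^{{\dagger}}=-\partial_{z}$ cancel and the fourth derivative of the quadratic $z_{1}z_{2}$ vanishes. Repeating the same steps for $z_{1}z_{2}z_{3}z_{4}$ leaves $\partial_{z_{i}}\partial_{z_{j}}\partial_{z_{m}}\partial_{z_{l}}(z_{1}z_{2}z_{3}z_{4})$, whose contraction with the symmetric tensor $V_{ijml}/24$ yields exactly $V_{1234}$, giving the second line of Eq. (6).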

The above non-Gaussian prior can be viewed as a perturbative extension of the multivariate Gaussian. In the setting of Bayesian regression, the marginal and conditional properties of the Gaussian are appealing when conjugate likelihood functions are employed. In the following, we shall examine the effects of the perturbation in differential form on these properties.

III.2 Marginal consistency and conditional statistics

We first show that marginal consistency remains intact for the Edgeworth expansion in differential form. Namely, without loss of generality, it suffices to show that marginalizing $z_{1}$ out of the joint distribution $q(z_{1},z_{2},z_{3},\cdots)$,

\begin{split}\int dz_{1}q(z_{1},\tilde{\bf z})&=\mathcal{N}(\tilde{\bf z}|0,\tilde{K})+v_{\tilde{i}\tilde{j}\tilde{m}\tilde{l}}\partial_{z_{\tilde{i}}}\partial_{z_{\tilde{j}}}\partial_{z_{\tilde{m}}}\partial_{z_{\tilde{l}}}\int dz_{1}\mathcal{N}(z_{1},\tilde{\bf z}|0,K)\\ &=\big{(}1+v_{\tilde{i}\tilde{j}\tilde{m}\tilde{l}}\partial_{z_{\tilde{i}}}\partial_{z_{\tilde{j}}}\partial_{z_{\tilde{m}}}\partial_{z_{\tilde{l}}}\big{)}\mathcal{N}(\tilde{\bf z}|0,\tilde{K})\\ &=q(\tilde{\bf z})\>,\end{split} (7)

indeed reproduces the joint distribution for $z_{2},z_{3},\cdots$ in the original form of Eq. (5), where the smaller covariance matrix $\tilde{K}$ is the submatrix of $K$ excluding the first row and column. For simplicity, we use $v$ to denote $V/24$. In deriving the above, we have employed the fact that $\int dz_{1}(\partial_{z_{1}})^{s}\mathcal{N}(z_{1},\tilde{\bf z})=0$ for any integer power $s\geq 1$, meaning that the indices $i,j,m,l$ shall exclude any contribution associated with $z_{1}$. Thus the tilded indices run over $\tilde{i}\in\{2,3,4,\cdots\}$, and the corresponding derivatives can be factored out of the integral.

Here we wish to stress that the above marginal property of the non-Gaussian prior may seem straightforward with the differential representation of the Edgeworth expansion, but it does not seem promising to reach the same conclusion if the marginalization is carried out with the explicit representation in Appendix A. With marginal consistency in hand, one can easily write the conditional distribution $q(z_{1}|\tilde{\bf z})=q(z_{1},\tilde{\bf z})/q(\tilde{\bf z})$. More interestingly, the conditional mean is useful for prediction with noiseless observations of data, and it also sheds light on the effect of the non-Gaussianity due to the finite width. The details of deriving the following conditional mean

\begin{split}\mathbb{E}[z_{1}|\tilde{\bf z}]&=\int dz_{1}z_{1}\frac{q(z_{1},\tilde{\bf z})}{q(\tilde{\bf z})}\\ &=\mu(\tilde{\bf z})+\frac{1}{q(\tilde{\bf z})}A_{\tilde{i}\tilde{j}\tilde{m}}\partial_{z_{\tilde{i}}}\partial_{z_{\tilde{j}}}\partial_{z_{\tilde{m}}}\mathcal{N}(\tilde{\bf z}|0,\tilde{K})\>,\end{split} (8)

with the finite-width correction term proportional to the third-order tensor,

A_{\tilde{i}\tilde{j}\tilde{m}}=\frac{V_{\tilde{i}\tilde{j}\tilde{m}\tilde{l}}}{6}[{\tilde{K}}^{-1}{\bf k}]_{\tilde{l}}-\frac{V_{1\tilde{i}\tilde{j}\tilde{m}}}{6}\>, (9)

can be found in Appendix B. In the GP limit, the conditional mean becomes the well-known result $\mu(\tilde{\bf z})={\bf k}^{t}{\tilde{K}}^{-1}{\tilde{\bf z}}$, as the fourth cumulant $V$ vanishes in the large-$N$ limit. We also remind the reader that the tilded symbols are associated with the conditioned outputs $z_{2,3,\cdots}$. In a similar manner, we can also show the conditional second moment,

\begin{split}\mathbb{E}[z_{1}^{2}|\tilde{\bf z}]&=\int dz_{1}z_{1}^{2}\frac{q(z_{1},\tilde{\bf z})}{q(\tilde{\bf z})}\\ &=\mu^{2}(\tilde{\bf z})+\sigma^{2}+\frac{1}{q(\tilde{\bf z})}\bigg{[}2\mu(\tilde{\bf z})A_{\tilde{i}\tilde{j}\tilde{m}}\partial_{z_{\tilde{i}}}\partial_{z_{\tilde{j}}}\partial_{z_{\tilde{m}}}+B_{\tilde{i}\tilde{j}}\partial_{z_{\tilde{i}}}\partial_{z_{\tilde{j}}}\bigg{]}\mathcal{N}(\tilde{\bf z}|0,\tilde{K})\>,\end{split} (10)

with the second-order tensor

B_{\tilde{i}\tilde{j}}=\frac{V_{\tilde{i}\tilde{j}\tilde{m}\tilde{l}}}{2}[{\tilde{K}}^{-1}{\bf k}]_{\tilde{m}}[{\tilde{K}}^{-1}{\bf k}]_{\tilde{l}}-V_{1\tilde{i}\tilde{j}\tilde{m}}[{\tilde{K}}^{-1}{\bf k}]_{\tilde{m}}+\frac{V_{11\tilde{i}\tilde{j}}}{2}\>. (11)

Here, $\sigma^{2}=K_{11}-{\bf k}^{t}{\tilde{K}}^{-1}{\bf k}$ coincides with the conditional variance in the GP limit. The details of the derivation can also be found in Appendix B. It is worthwhile to note that the conditional variance, $\mathbb{E}_{q}[z_{1}^{2}|\tilde{\bf z}]-(\mathbb{E}_{q}[z_{1}|\tilde{\bf z}])^{2}$, can include contributions related to the function values $\tilde{\bf z}$ through the nonzero $A$ and $B$ terms in the above expressions. The lack of such dependence in the GP limit is in fact a shortcoming for modelling, which has motivated the study of non-Gaussian priors in the machine learning community (for example the Student-t process in [21]).

III.3 An example: bivariate distribution and prediction

Now let us pause to consider a simple example where a shallow network defined in Eq. (1) with width $N$ is used to model two input-output pairs $({\bf x}_{1},z_{1})$ and $({\bf x}_{2},z_{2})$. In the end, we shall present the predictive mean and variance for $z_{1}$ conditioned on $z_{2}$. The following notations are used for simplification: the covariance $c:=\exp(-|{\bf x}_{1}-{\bf x}_{2}|^{2}/2d)$, the 2-by-2 covariance matrix $\Sigma=\bigl{(}\begin{smallmatrix}1&c\\ c&1\end{smallmatrix}\bigr{)}$ for the bivariate prior, and $z_{21}:=(z_{2}-cz_{1})/\sqrt{1-c^{2}}$. In addition, the following relation between derivatives of a Gaussian and Hermite polynomials will be useful,

\partial^{n}_{z_{1}}\partial^{m}_{z_{2}}\mathcal{N}(z_{1},z_{2}|0,\Sigma)\propto\partial^{n}_{z_{1}}\big{[}\mathcal{N}(z_{1}|0,1)\partial^{m}_{z_{2}}\mathcal{N}(z_{21}|0,1)\big{]}\propto(-1)^{m}\frac{\mathcal{N}(z_{2}|0,1)}{\sqrt{(1-c^{2})^{m}}}\partial^{n}_{z_{1}}[H_{m}(z_{21})\mathcal{N}(z_{12}|0,1)]\>, (12)

where we have used $\mathcal{N}(z_{1},z_{2}|0,\Sigma)\propto\mathcal{N}(z_{1}|0,1)\mathcal{N}(z_{21}|0,1)$ up to some irrelevant constant. The probabilist's Hermite polynomials are defined as $H_{n}(z)=(-1)^{n}[\mathcal{N}(z|0,1)]^{-1}d^{n}_{z}\mathcal{N}(z|0,1)$ [19]. Consequently, following the previous discussion, the non-Gaussian bivariate distribution is shown to be

\begin{split}q(z_{1},z_{2})&=\mathcal{N}(z_{1},z_{2}|0,\Sigma)\bigg{\{}1+\frac{\gamma^{4}}{16N}[H_{4}(z_{21})+H_{4}(z_{12})]+\frac{c\gamma^{4}}{4N}[3cH_{2}(z_{21})+H_{3}(z_{21})H_{1}(z_{12})+z_{12}\leftrightarrow z_{21}]\\ &+\frac{(c^{4}+2)\gamma^{4}}{8N}[2c^{2}+4cH_{1}(z_{21})H_{1}(z_{12})+H_{2}(z_{21})H_{2}(z_{12})]\bigg{\}}\>,\end{split} (13)

where the symbol $\gamma=1/\sqrt{1-c^{2}}$ is used to simplify the expression. In addition, the fourth cumulants for the cosine network, $V_{1111}=3/2N$, $V_{1112}=3c/2N$, and $V_{1122}=(c^{4}+2)/2N$, have been plugged in above. As for the conditional distribution, e.g. $q(z_{1}|z_{2})=q(z_{1},z_{2})/q(z_{2})$ in the bivariate case, one can simply divide the above by $q(z_{2})=\mathcal{N}(z_{2}|0,1)[1+H_{4}(z_{2})/16N]$, which yields a conditional Gaussian $\mathcal{N}(z_{1}|z_{2})$ multiplied by a factor representing the finite-width effect.

We can directly apply Eq. (8) to obtain the conditional mean in this simple example, where the tilded indices only refer to $z_{2}$. Surprisingly, the conditional mean for such a noiseless case is

\mathbb{E}[z_{1}|z_{2}]=cz_{2}\>, (14)

coinciding with that in the GP limit because of the accidental cancellation in the third-order tensor $A_{222}\propto(cV_{2222}-V_{1222})$. We shall show that the cancellation does not occur in the noisy (next Section) and more general cases. Application of Eq. (10), together with the above conditional mean, results in the following conditional variance,

{\rm Var}[z_{1}|z_{2}]=(1-c^{2})\bigg{[}1+\frac{4(2-c^{2})}{16N+H_{4}(z_{2})}H_{2}(z_{2})\bigg{]}\>, (15)

which consists of the corresponding variance in the GP limit along with a term depending on $z^{2}_{2}$.
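
As a numerical illustration, the bivariate density can be built directly from the differential form of Eq. (5) with the cosine-network cumulants, and the conditional mean and variance can then be evaluated by quadrature for comparison with Eqs. (14) and (15). The sketch below uses the standard sympy and scipy libraries; the width, the covariance value, and the conditioning point are arbitrary choices.

```python
import itertools
import numpy as np
import sympy as sp
from scipy.integrate import quad

z1, z2 = sp.symbols("z1 z2", real=True)
N_width = 8                                   # width of the cosine network
c = sp.Rational(1, 2)                         # covariance k(x1, x2)

det = 1 - c**2                                # bivariate Gaussian N(z | 0, Sigma)
gauss = sp.exp(-(z1**2 - 2*c*z1*z2 + z2**2) / (2 * det)) / (2 * sp.pi * sp.sqrt(det))

# fourth cumulants of the shallow cosine network, Eq. (3)
V = {(1, 1, 1, 1): sp.Rational(3, 2 * N_width), (2, 2, 2, 2): sp.Rational(3, 2 * N_width),
     (1, 1, 1, 2): 3 * c / (2 * N_width), (1, 2, 2, 2): 3 * c / (2 * N_width),
     (1, 1, 2, 2): (c**4 + 2) / (2 * N_width)}

# differential form of Eq. (5): sum over all ordered index tuples (i, j, m, l)
zs = {1: z1, 2: z2}
corr = sum(V[tuple(sorted(idx))] * sp.diff(gauss, *[zs[i] for i in idx])
           for idx in itertools.product((1, 2), repeat=4))
q = sp.lambdify((z1, z2), gauss + corr / 24, "numpy")

v, cf = 0.7, float(c)                         # condition on z2 = 0.7
norm = quad(lambda t: q(t, v), -8, 8)[0]
mean = quad(lambda t: t * q(t, v), -8, 8)[0] / norm
var = quad(lambda t: t**2 * q(t, v), -8, 8)[0] / norm - mean**2
he2, he4 = v**2 - 1, v**4 - 6 * v**2 + 3
print("mean:", mean, "  Eq. (14):", cf * v)
print("var :", var, "  Eq. (15):",
      (1 - cf**2) * (1 + 4 * (2 - cf**2) * he2 / (16 * N_width + he4)))
```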

IV Bayesian regression with non-Gaussian prior

Having established the marginal and conditional properties of the non-Gaussian distribution in Eq. (5), we now investigate the posterior distribution over an unseen function value $z_{*}$ associated with input ${\bf x}_{*}$ when the noisy observations $\mathcal{D}=\{({\bf x}_{1},y_{1}),({\bf x}_{2},y_{2}),\cdots\}$ are given. According to Bayes's rule, the posterior distribution of interest is defined as

q(z_{*}|\mathcal{D})=\frac{q(z_{*},\mathcal{D})}{q(\mathcal{D})}=\frac{\int d{\bf z}\ q(z_{*},{\bf z})\mathcal{N}({\bf y}|{\bf z},\Lambda)}{\int d{\bf z}\ q({\bf z})\mathcal{N}({\bf y}|{\bf z},\Lambda)}\>, (16)

with $\Lambda$ denoting the diagonal matrix representing the noise in the Gaussian likelihood. When the $q$'s in the numerator and denominator are both Gaussian, which corresponds to the infinitely wide network, the limiting distribution reads

\begin{split}q_{\infty}(z_{*}|\mathcal{D})&=\int d{\bf z}\ \mathcal{N}(z_{*}|{\bf k}_{*}^{t}K^{-1}{\bf z},k_{**}-{\bf k}_{*}^{t}K^{-1}{\bf k}_{*})\ \mathcal{N}({\bf z}|K(K+\Lambda)^{-1}{\bf y},\Lambda K(K+\Lambda)^{-1})\\ &=\mathcal{N}\big{(}z_{*}|{\bf k}_{*}^{t}(K+\Lambda)^{-1}{\bf y},k_{**}-{\bf k}_{*}^{t}(K+\Lambda)^{-1}{\bf k}_{*}\big{)}\>.\end{split}

In the first equality above, the first and second Gaussian distributions on the right hand side represent the conditional density $q_{\infty}(z_{*}|{\bf z})$ and the posterior $q_{\infty}({\bf z}|\mathcal{D})$, respectively.

As for the finite-width case, the Gaussian likelihood can be dealt with by repeated application of the adjoint property of the differential operator as well as the derivative identities for independent Gaussians. The details of the derivation of the evidence $q(\mathcal{D})$ can be found in Appendix C, and the expression in terms of derivatives with respect to the observed $y$'s reads

\begin{split}q(\mathcal{D})&=\int d{\bf z}\ \mathcal{N}({\bf y}|{\bf z},\Lambda)q({\bf z})\\ &=(1+v_{ijml}\partial_{y_{i}}\partial_{y_{j}}\partial_{y_{m}}\partial_{y_{l}})\mathcal{N}({\bf y}|0,K+\Lambda)\>.\end{split} (17)

Despite its simple form, the evidence term cannot be manipulated further as in the infinite-width case. The details of the derivation of the posterior distribution for the output $z_{*}$ at the unseen input,

\begin{split}q(z_{*}|\mathcal{D})&=\frac{\int d{\bf z}\ \mathcal{N}({\bf y}|{\bf z},\Lambda)q(z_{*},{\bf z})}{q(\mathcal{D})}\\ &=(1+v_{\hat{i}\hat{j}\hat{m}\hat{l}}\partial_{\hat{i}\hat{j}\hat{m}\hat{l}})\mathcal{N}\big{[}\bigl{(}\begin{smallmatrix}z_{*}\\ {\bf y}\end{smallmatrix}\bigr{)}\big{|}0,\bigl{(}\begin{smallmatrix}1&{\bf k}_{*}^{t}\\ {\bf k}_{*}&K+\Lambda\end{smallmatrix}\bigr{)}\big{]}/q(\mathcal{D})\>,\end{split} (18)

can also be found in Appendix C. Here, the notation can be understood as follows: the hatted indices, $\hat{i}$ for instance, additionally include the symbol $*$ associated with the test function value $z_{*}$, and the derivatives refer to $\partial_{*}:=\partial_{z_{*}}$ (with respect to the function value) and $\partial_{i}:=\partial_{y_{i}}$ (with respect to the observed values).

We conclude this Section by considering the same simple regression example, but now with a noisy observation $y_{2}$ at input ${\bf x}_{2}$, and we wish to predict the unobserved function value $z_{1}$. To stress the role of the noise parameter $\sigma_{n}$, we only show the predictive mean here,

\mathbb{E}[z_{1}|y_{2}]=c\alpha_{n}^{2}y_{2}-\frac{c\sigma_{n}^{2}\alpha_{n}^{5}H_{3}(\alpha_{n}y_{2})}{4N+\alpha_{n}^{4}H_{4}(\alpha_{n}y_{2})}\>, (19)

with $\alpha_{n}=1/\sqrt{1+\sigma_{n}^{2}}$. The correction term vanishes as the noise parameter does, and it is odd under a sign change of $y_{2}$.
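
A minimal sketch that evaluates Eq. (19) makes both properties visible; the covariance, width, and observation values chosen below are arbitrary.

```python
import numpy as np

def predictive_mean(y2, c, sigma_n2, N):
    """Predictive mean of Eq. (19); H_3 and H_4 are the probabilist's Hermite polynomials."""
    a = 1.0 / np.sqrt(1.0 + sigma_n2)                 # alpha_n
    u = a * y2
    he3, he4 = u**3 - 3.0 * u, u**4 - 6.0 * u**2 + 3.0
    return c * a**2 * y2 - c * sigma_n2 * a**5 * he3 / (4.0 * N + a**4 * he4)

c, N = 0.7, 10
for s2 in (0.0, 0.1, 0.5):
    m_plus, m_minus = predictive_mean(1.5, c, s2, N), predictive_mean(-1.5, c, s2, N)
    gp = c * 1.5 / (1.0 + s2)                         # GP-limit mean c * alpha_n^2 * y2
    # the correction vanishes with the noise and flips sign with y2
    print(f"sigma_n^2={s2}:  mean(+1.5)={m_plus:+.4f}  mean(-1.5)={m_minus:+.4f}  GP={gp:+.4f}")
```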

V Deep Bottlenecked Network

Up to this point, we have focused on the emergence of non-Gaussianity in the prior function distribution of a wide and shallow network, as well as on how the non-Gaussian prior affects the inference in a regression task. Our approach can be extended to deep and wide neural networks, as the corresponding prior distributions approach a GP [5, 6]. However, such a wide-width assumption, which leads to weak non-Gaussianity, is not valid for bottlenecked networks in which the hidden layers are alternately wide and narrow [22, 14].

Let us consider a two-layer bottlenecked feed forward network modeling the hierarchical mapping ${\bf x}\in\mathbb{R}^{d}\mapsto{\bf z}^{(1)}\in\mathbb{R}^{H}\mapsto z^{(2)}\in\mathbb{R}$. The hierarchy consists of the following computations,

\begin{split}z^{(1)}_{i,\alpha}&=\sqrt{\frac{2}{N_{1}}}\sum_{j=1}^{N_{1}}w^{(1)}_{ij}\cos(\frac{\Omega^{(1)}_{j}\cdot{\bf x}_{\alpha}}{\sqrt{d}}+\phi^{(1)}_{j})\>,\\ z^{(2)}_{\alpha}&=\sqrt{\frac{2}{N_{2}}}\sum_{j=1}^{N_{2}}w^{(2)}_{j}\cos(\frac{\Omega^{(2)}_{j}\cdot{\bf z}^{(1)}_{\alpha}}{\sqrt{H}}+\phi^{(2)}_{j})\>,\end{split} (20)

where the parameters $w$'s, $\Omega$'s, and $\phi$'s share the same prior distributions as their counterparts in the shallow network. The widths of the hidden layers, $N_{1}$ and $N_{2}$, are large, but the hidden output dimension $H$ is not. In the limit $N_{1,2}\rightarrow\infty$ while $H$ remains finite, the two-layer bottlenecked cosine network corresponds to the prior of a deep Gaussian process [15, 23, 24] with Gaussian kernel, which is a flexible and expressive function prior due to its compositional nature. As its name suggests, the conditional prior $z^{(2)}|{\bf z}^{(1)}$ is a GP, and so is each component $z_{i}^{(1)}|{\bf x}$ of the hidden output. However, the marginal distribution of $z^{(2)}|{\bf x}$ is non-Gaussian, as [13] showed that the fourth cumulant is positive, i.e., the distribution is heavy-tailed [25].

Here, we are interested in tracking how the small parameters $1/N_{1}$ and $1/N_{2}$ and the bottleneck parameter $1/H$ enter the second moment. For a deep and finitely wide linear network, the prior is non-Gaussian [8, 26], but the second moment does not receive a correction due to the finite width [8, 9]. Thus, the effects of nonlinear activation on the higher statistical moments are interesting (also see the recent work [27]). For the deep cosine network in Eq. (20), we can first consider the wide limit, $N_{1,2}\rightarrow\infty$, while the bottleneck width $H$ remains finite. Following the result in Eq. (2), the covariance for the deep model can be computed as follows,

\begin{split}k^{(2)}_{\infty}({\bf x}_{\alpha},{\bf x}_{\beta})&=\mathbb{E}_{q_{\infty}}\bigg{\{}\exp\big{(}-\frac{|{\bf z}^{(1)}_{\alpha}-{\bf z}^{(1)}_{\beta}|^{2}}{2H}\big{)}\bigg{\}}\\ &=\bigg{[}\int dz_{\alpha}dz_{\beta}\ e^{-\frac{(z_{\alpha}-z_{\beta})^{2}}{2H}}q_{\infty}(z_{\alpha},z_{\beta})\bigg{]}^{H}\\ &=\big{[}1+\frac{1-\exp(-\frac{|{\bf x}_{\alpha}-{\bf x}_{\beta}|^{2}}{2d})}{H/2}\big{]}^{-H/2}\>.\end{split} (21)

Note that the decomposition $\mathcal{N}(z_{1},z_{2}|0,K)=\mathcal{N}(z_{1}+z_{2}|0,\sigma_{+}^{2})\mathcal{N}(z_{1}-z_{2}|0,\sigma_{-}^{2})$ [28], with the variances $\sigma_{\pm}^{2}=2(K_{11}\pm K_{12})$, has been used in the derivation. We stress that the above kernel is exact for any $H$. For the very narrow case, $H=1$, the above result coincides with that in [13], while in the very wide bottleneck case, $H\gg 1$, it can be shown that

k^{(2)}_{\infty}({\bf x}_{\alpha},{\bf x}_{\beta})\approx\exp[k({\bf x}_{\alpha},{\bf x}_{\beta})-1]\big{\{}1+\frac{1}{H}[k({\bf x}_{\alpha},{\bf x}_{\beta})-1]^{2}\big{\}}\>, (22)

where the covariance $k$ of the shallow network is given in Eq. (2). In addition, the exponential of the shallow kernel above corresponds to $H\rightarrow\infty$, which is the same as the deep kernel in [29]. Indeed, as suggested in [9], the appearance of the reciprocal width, $1/H$, in the correction term is associated with the weak non-Gaussianity even though the other widths $N_{1,2}$ are infinite.
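
A small numerical sketch comparing the exact bottleneck kernel of Eq. (21), its wide-bottleneck approximation of Eq. (22), and the $H\rightarrow\infty$ exponential kernel is given below; the value of the shallow covariance $k$ is an arbitrary choice.

```python
import numpy as np

def k_deep_exact(k, H):
    """Exact bottleneck kernel of Eq. (21) for infinite N1, N2 and finite H."""
    return (1.0 + 2.0 * (1.0 - k) / H) ** (-H / 2.0)

def k_deep_wide_bottleneck(k, H):
    """Wide-bottleneck approximation of Eq. (22)."""
    return np.exp(k - 1.0) * (1.0 + (k - 1.0) ** 2 / H)

k = 0.6                            # shallow covariance k(x_alpha, x_beta) from Eq. (2)
for H in (1, 2, 5, 20, 100):
    print(f"H={H:3d}  exact={k_deep_exact(k, H):.5f}  "
          f"Eq.(22)={k_deep_wide_bottleneck(k, H):.5f}  H->inf={np.exp(k - 1.0):.5f}")
```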

Next, we compute the covariance for the case where all widths are finite. In fact, the outer width $N_{2}$ does not enter the kernel, and we only need a large but finite $N_{1}$. The computation is similar, and one only needs to replace the limiting joint distribution $q_{\infty}(z^{(1)}_{\alpha},z^{(1)}_{\beta})$ with the non-Gaussian $q$. The details of the derivation, which uses the derivative-of-Gaussian property, can be found in Appendix D. The kernel of the two-layer bottlenecked cosine network with $N_{1,2}<\infty$ reads

k^{(2)}({\bf x}_{\alpha},{\bf x}_{\beta})=\bigg{[}1+\frac{1-k({\bf x}_{\alpha},{\bf x}_{\beta})}{H/2}\bigg{]}^{-H/2}(1+\epsilon)^{H}\>, (23)

with the correction due to finite $N_{1}$,

\epsilon=\frac{V_{\alpha\alpha\alpha\alpha}+V_{\beta\beta\beta\beta}-4V_{\alpha\beta\beta\beta}-4V_{\beta\alpha\alpha\alpha}+6V_{\alpha\alpha\beta\beta}}{24}\big{[}\frac{3}{H^{2}}-\frac{6\sigma_{-}^{2}}{H^{3}(1+\sigma_{-}^{2}/H)}+\frac{3\sigma_{-}^{4}}{H^{4}(1+\sigma_{-}^{2}/H)^{2}}\big{]}\>, (24)

where $\sigma_{-}^{2}=2(1-k)$ is used to ease the notation. Again, the above result holds for general $H$. In the wide bottleneck limit, one can show that these small parameters enter the kernel with corrections of $O(\frac{1}{H})$ followed by $O(\frac{1}{N_{1}H})$. Hence, the deep model as a function prior serves as a GP with a random kernel whose mean value converges to Eq. (22) when all widths $N_{1,2}$ and $H$ approach infinity. However, the role of $H$ differs from that of $N_{1}$ in our perturbative analysis, and $1/N_{1}$ does not appear alone in the small parameters. The influence of finite width on the kernel of deep nonlinear and convolutional models is also studied in [30, 31, 32].
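
The following sketch evaluates the correction of Eq. (24) and the finite-$N_{1}$ kernel of Eq. (23), assuming the hidden-layer cumulants take the shallow-network form of Eq. (3), i.e. $V_{\alpha\alpha\alpha\alpha}=V_{\beta\beta\beta\beta}=3/(2N_{1})$, $V_{\alpha\beta\beta\beta}=V_{\beta\alpha\alpha\alpha}=3k/(2N_{1})$, and $V_{\alpha\alpha\beta\beta}=(k^{4}+2)/(2N_{1})$; the values of $k$, $H$, and $N_{1}$ below are arbitrary.

```python
import numpy as np

def epsilon_correction(k, H, N1):
    """Finite-N1 correction of Eq. (24) for a pair of inputs with shallow covariance k."""
    v_sum = (3.0 + 3.0 - 4.0 * 3.0 * k - 4.0 * 3.0 * k + 6.0 * (k**4 + 2.0)) / (2.0 * N1)
    s2 = 2.0 * (1.0 - k)                               # sigma_minus^2
    bracket = (3.0 / H**2 - 6.0 * s2 / (H**3 * (1.0 + s2 / H))
               + 3.0 * s2**2 / (H**4 * (1.0 + s2 / H) ** 2))
    return v_sum / 24.0 * bracket

def k_deep_finite_width(k, H, N1):
    """Kernel of Eq. (23): the infinite-width factor of Eq. (21) times (1 + eps)^H."""
    return (1.0 + 2.0 * (1.0 - k) / H) ** (-H / 2.0) * (1.0 + epsilon_correction(k, H, N1)) ** H

k = 0.6
for H, N1 in [(2, 50), (2, 500), (20, 50), (20, 500)]:
    base = (1.0 + 2.0 * (1.0 - k) / H) ** (-H / 2.0)
    print(f"H={H:3d} N1={N1:4d}  k2={k_deep_finite_width(k, H, N1):.5f}  (N1->inf: {base:.5f})")
```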

The fourth moment can be computed in a similar manner. The closed form for $N_{1,2}\rightarrow\infty$ can be found in [13]. For the finite-width case, the fourth cumulant $V_{\alpha\beta\gamma\delta}^{(2)}$ becomes the $H$-th power of the permutational symmetrization of $\frac{1}{N_{2}}\mathbb{E}_{q}\{\exp[-(z^{(1)}_{\alpha}+z^{(1)}_{\beta}-z^{(1)}_{\gamma}-z^{(1)}_{\delta})^{2}/(2H)]\}$, whose small parameters consist of $1/N_{2}$, $1/(N_{2}H)$, and $1/(N_{2}HN_{1})$. The analysis of the scaling with respect to the reciprocal widths is thus more complex than for the deep linear network.

VI Discussion

In essence, the finite width in random neural networks induces a nonzero variance of the kernel, since the fourth cumulant is not zero. From a function-space perspective, a shallow network with finitely large width can be regarded as a GP whose kernel is itself a random variable, so the learned data representation is a distribution over kernels. Therefore, a neural network with very narrow layers may not have enough expressive power for learning, while a network with very wide layers is not flexible enough because it is equivalent to GP learning with one fixed kernel. It was recently suggested in [33], through a study of the partition function [34, 35] of a finite deep network, that a Student-t process [21] may suitably represent a finite-width network, which offers an alternative description to the one in this paper. The reports in [30, 24] observed a degradation of performance for Bayesian deep neural networks when expanding the widths, while the study in [36] suggested otherwise.

Taking the wide limit with the perturbative approach is appealing because of the analytic and elegant expressions for inference, which may shed light on future investigations of alternative algorithms for classification [37], and on studies of the average test error in regression [38, 39] and classification [40] tasks using finite-width networks. In this paper, the conditional statistics and the perturbed posterior distribution over an unseen output under the non-Gaussian prior are obtained with the help of the differential representation of the multivariate Edgeworth expansion. In parallel with the equivalence between the GP prior and random wide neural networks, the investigation of the neural tangent kernel [17, 41, 42] is important for understanding the learning dynamics during optimization and its relation to Bayesian neural network learning [43, 44].

Acknowledgement

The research is supported by the Dean's Office of the School of Arts and Sciences at Rutgers University-Newark. Correspondence with Jacob Zavatone-Veth and Pietro Rotondo is acknowledged.

References

  • Neal [1997] R. M. Neal, Monte carlo implementation of Gaussian process models for Bayesian regression and classification, arXiv preprint physics/9701026  (1997).
  • Rasmussen and Williams [2006] C. E. Rasmussen and C. K. I. Williams, Gaussian Process for Machine Learning (MIT press, Cambridge, MA, 2006).
  • Williams [1997] C. K. Williams, Computing with infinite networks, in Advances in neural information processing systems (1997) pp. 295–301.
  • Cho and Saul [2009] Y. Cho and L. K. Saul, Kernel methods for deep learning, in Advances in neural information processing systems (2009) pp. 342–350.
  • Matthews et al. [2018] A. G. d. G. Matthews, J. Hron, M. Rowland, R. E. Turner, and Z. Ghahramani, Gaussian process behaviour in wide deep neural networks, in International Conference on Learning Representations (2018).
  • Lee et al. [2018] J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein, Deep neural networks as Gaussian processes, in International Conference on Learning Representations (2018).
  • Osband [2016] I. Osband, Risk versus uncertainty in deep learning: Bayes, bootstrap and the dangers of dropout, in Workshop on Bayesian Deep Learning (NIPS, 2016).
  • Yaida [2020] S. Yaida, Non-Gaussian processes and neural networks at finite widths, in Mathematical and Scientific Machine Learning (PMLR, 2020) pp. 165–192.
  • Roberts et al. [2021] D. A. Roberts, S. Yaida, and B. Hanin, The principles of deep learning theory, arXiv preprint arXiv:2106.10165  (2021).
  • Dyer and Gur-Ari [2019] E. Dyer and G. Gur-Ari, Asymptotics of wide networks from Feynman diagrams, in International Conference on Learning Representations (2019).
  • Gabrié [2020] M. Gabrié, Mean-field inference methods for neural networks, Journal of Physics A: Mathematical and Theoretical 53, 223002 (2020).
  • Skovgaard [1986] I. M. Skovgaard, On multivariate Edgeworth expansions, International Statistical Review/Revue Internationale de Statistique , 169 (1986).
  • Lu et al. [2020] C.-K. Lu, S. C.-H. Yang, X. Hao, and P. Shafto, Interpretable deep Gaussian processes with moments, in International Conference on Artificial Intelligence and Statistics (2020) pp. 613–623.
  • Agrawal et al. [2020] D. Agrawal, T. Papamarkou, and J. D. Hinkle, Wide neural networks with bottlenecks are deep Gaussian processes., J. Mach. Learn. Res. 21, 175 (2020).
  • Damianou and Lawrence [2013] A. Damianou and N. Lawrence, Deep Gaussian processes, in Artificial Intelligence and Statistics (2013) pp. 207–215.
  • Rahimi and Recht [2008] A. Rahimi and B. Recht, Random features for large-scale kernel machines, in Advances in neural information processing systems (2008) pp. 1177–1184.
  • Jacot et al. [2018] A. Jacot, F. Gabriel, and C. Hongler, Neural tangent kernel: Convergence and generalization in neural networks, in Advances in neural information processing systems (2018) pp. 8571–8580.
  • Antognini [2019] J. M. Antognini, Finite size corrections for neural network Gaussian processes, arXiv preprint arXiv:1908.10030  (2019).
  • Papoulis and Unnikrishna Pillai [2002] A. Papoulis and S. Unnikrishna Pillai, Probability, random variables and stochastic processes (2002).
  • Welling [2005] M. Welling, Robust higher order statistics, in International Workshop on Artificial Intelligence and Statistics (PMLR, 2005) pp. 405–412.
  • Shah et al. [2014] A. Shah, A. Wilson, and Z. Ghahramani, Student-t processes as alternatives to Gaussian processes, in Artificial intelligence and statistics (2014) pp. 877–885.
  • Cutajar et al. [2017] K. Cutajar, E. V. Bonilla, P. Michiardi, and M. Filippone, Random feature expansions for deep Gaussian processes, in Proceedings of the 34th International Conference on Machine Learning-Volume 70 (JMLR. org, 2017) pp. 884–893.
  • Dunlop et al. [2018] M. M. Dunlop, M. A. Girolami, A. M. Stuart, and A. L. Teckentrup, How deep are deep Gaussian processes?, The Journal of Machine Learning Research 19, 2100 (2018).
  • Pleiss and Cunningham [2021] G. Pleiss and J. P. Cunningham, The limitations of large width in neural networks: A deep Gaussian process perspective, Advances in Neural Information Processing Systems 34 (2021).
  • Vladimirova et al. [2019] M. Vladimirova, J. Verbeek, P. Mesejo, and J. Arbel, Understanding priors in Bayesian neural networks at the unit level, in International Conference on Machine Learning (PMLR, 2019) pp. 6458–6467.
  • Zavatone-Veth and Pehlevan [2021] J. Zavatone-Veth and C. Pehlevan, Exact marginal prior distributions of finite Bayesian neural networks, Advances in Neural Information Processing Systems 34 (2021).
  • Zavatone-Veth et al. [2021] J. Zavatone-Veth, A. Canatar, B. Ruben, and C. Pehlevan, Asymptotics of representation learning in finite bayesian neural networks, Advances in neural information processing systems 34, 24765 (2021).
  • Lu et al. [2018] C.-K. Lu, S. C.-H. Yang, and P. Shafto, Standing-wave-decomposition Gaussian process, Physical Review E 98, 032303 (2018).
  • Duvenaud et al. [2014] D. Duvenaud, O. Rippel, R. Adams, and Z. Ghahramani, Avoiding pathologies in very deep networks, in Artificial Intelligence and Statistics (2014) pp. 202–210.
  • Aitchison [2020] L. Aitchison, Why bigger is not always better: on finite and infinite neural networks, in International Conference on Machine Learning (PMLR, 2020) pp. 156–164.
  • Li et al. [2022] M. B. Li, M. Nica, and D. M. Roy, The neural covariance sde: Shaped infinite depth-and-width networks at initialization, arXiv preprint arXiv:2206.02768  (2022).
  • Hanin [2022] B. Hanin, Correlation functions in random fully connected neural networks at finite width, arXiv preprint arXiv:2204.01058  (2022).
  • Ariosto et al. [2022] S. Ariosto, R. Pacelli, M. Pastore, F. Ginelli, M. Gherardi, and P. Rotondo, Statistical mechanics of deep learning beyond the infinite-width limit, arXiv preprint arXiv:2209.04882  (2022).
  • Seung et al. [1992] H. S. Seung, H. Sompolinsky, and N. Tishby, Statistical mechanics of learning from examples, Physical review A 45, 6056 (1992).
  • Li and Sompolinsky [2021] Q. Li and H. Sompolinsky, Statistical mechanics of deep linear neural networks: The backpropagating kernel renormalization, Physical Review X 11, 031059 (2021).
  • Lee et al. [2020] J. Lee, S. Schoenholz, J. Pennington, B. Adlam, L. Xiao, R. Novak, and J. Sohl-Dickstein, Finite versus infinite neural networks: an empirical study, Advances in Neural Information Processing Systems 33, 15156 (2020).
  • Csató et al. [1999] L. Csató, E. Fokoué, M. Opper, B. Schottky, and O. Winther, Efficient approaches to Gaussian process classification, Advances in neural information processing systems 12 (1999).
  • Sollich [1998] P. Sollich, Learning curves for Gaussian processes, Advances in neural information processing systems 11 (1998).
  • Williams and Vivarelli [2000] C. K. Williams and F. Vivarelli, Upper and lower bounds on the learning curve for Gaussian processes, Machine Learning 40, 77 (2000).
  • Seeger [2002] M. Seeger, PAC-Bayesian generalisation error bounds for Gaussian process classification, Journal of machine learning research 3, 233 (2002).
  • Arora et al. [2019] S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang, On exact computation with an infinitely wide neural net, Advances in Neural Information Processing Systems 32, 8141 (2019).
  • Hanin and Nica [2019] B. Hanin and M. Nica, Finite depth and width corrections to the neural tangent kernel, in International Conference on Learning Representations (2019).
  • Karakida et al. [2019] R. Karakida, S. Akaho, and S.-i. Amari, Universal statistics of Fisher information in deep neural networks: Mean field approach, in The 22nd International Conference on Artificial Intelligence and Statistics (2019) pp. 1032–1041.
  • Khan et al. [2019] M. E. E. Khan, A. Immer, E. Abedi, and M. Korzepa, Approximate inference turns deep networks into Gaussian processes, in Advances in neural information processing systems (2019) pp. 3094–3104.

Appendix A Expression for non-Gaussian prior distribution

After explicitly taking the derivatives in Eq. (5), it is straightforward to show that the expression for the multivariate non-Gaussian distribution reads

q({\bf z})=\frac{1}{\sqrt{(2\pi)^{N}|K|}}e^{-\frac{1}{2}{\bf z}^{t}K^{-1}{\bf z}}\bigg{[}1+\frac{V_{\alpha\beta\gamma\delta}}{24}(\chi^{(0)}_{\alpha\beta\gamma\delta}-\chi^{(1)}_{\alpha\beta\gamma\delta\mu\nu}z_{\mu}z_{\nu}+\chi^{(2)}_{\alpha\beta\gamma\delta\mu\nu\kappa\eta}z_{\mu}z_{\nu}z_{\kappa}z_{\eta})\bigg{]}\>, (25)

in which the first perturbation term, a scalar term, is the contraction between the fourth cumulant tensor $V_{\alpha\beta\gamma\delta}\propto 1/N_{1}$ and the following tensor,

\chi^{(0)}_{\alpha\beta\gamma\delta}=(K^{-1})_{\alpha\beta}(K^{-1})_{\gamma\delta}+(K^{-1})_{\alpha\gamma}(K^{-1})_{\beta\delta}+(K^{-1})_{\alpha\delta}(K^{-1})_{\beta\gamma}\>. (26)

The next contribution results from the contraction among the fourth cumulant, the following rank-6 tensor,

\begin{split}\chi^{(1)}_{\alpha\beta\gamma\delta\mu\nu}=&(K^{-1})_{\alpha\mu}(K^{-1})_{\nu\beta}(K^{-1})_{\gamma\delta}+(K^{-1})_{\alpha\beta}(K^{-1})_{\gamma\mu}(K^{-1})_{\nu\delta}\ +\\ &(K^{-1})_{\alpha\mu}(K^{-1})_{\nu\gamma}(K^{-1})_{\beta\delta}+(K^{-1})_{\alpha\gamma}(K^{-1})_{\beta\mu}(K^{-1})_{\nu\delta}\ +\\ &(K^{-1})_{\alpha\mu}(K^{-1})_{\nu\delta}(K^{-1})_{\beta\gamma}+(K^{-1})_{\alpha\delta}(K^{-1})_{\beta\mu}(K^{-1})_{\nu\gamma}\>,\end{split} (27)

and the rank-2 tensor $z_{\mu}z_{\nu}$. The last contribution contains the rank-8 tensor,

\chi^{(2)}_{\alpha\beta\gamma\delta\mu\nu\kappa\eta}=(K^{-1})_{\alpha\mu}(K^{-1})_{\beta\nu}(K^{-1})_{\gamma\kappa}(K^{-1})_{\delta\eta}\>, (28)

the rank-4 tensor $z_{\mu}z_{\nu}z_{\kappa}z_{\eta}$, and the fourth cumulant. It becomes extremely tedious, if at all possible, to prove the marginal and conditional properties with this representation.
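
For completeness, the explicit form of Eqs. (25)-(28) can nevertheless be evaluated directly. The sketch below uses the full symmetry of $V$ to collapse the permutations in the $\chi$ tensors into single contractions and then checks the normalization and the moments of Eq. (6) on a two-dimensional grid; the covariance, width, and grid ranges are arbitrary choices.

```python
import numpy as np

def edgeworth_density(z, K, V):
    """Explicit non-Gaussian density of Eqs. (25)-(28) at a point z.

    z: (m,) vector, K: (m, m) covariance, V: (m, m, m, m) symmetric fourth cumulant.
    """
    P = np.linalg.inv(K)
    h = P @ z                                          # (K^{-1} z)_mu
    gauss = np.exp(-0.5 * z @ P @ z) / np.sqrt((2.0 * np.pi) ** len(z) * np.linalg.det(K))
    # symmetry of V collapses chi^(0), chi^(1) z z, and chi^(2) z z z z
    s0 = 3.0 * np.einsum("abcd,ab,cd->", V, P, P)
    s1 = 6.0 * np.einsum("abcd,ab,c,d->", V, P, h, h)
    s2 = np.einsum("abcd,a,b,c,d->", V, h, h, h, h)
    return gauss * (1.0 + (s0 - s1 + s2) / 24.0)

# bivariate check with the shallow cosine-network cumulants of Eq. (3)
c, N = 0.5, 8
K = np.array([[1.0, c], [c, 1.0]])
V = np.empty((2, 2, 2, 2))
for idx in np.ndindex(2, 2, 2, 2):
    V[idx] = {0: 3.0, 1: 3.0 * c, 2: c**4 + 2.0, 3: 3.0 * c, 4: 3.0}[sum(idx)] / (2.0 * N)

grid = np.linspace(-8.0, 8.0, 201)
dz = grid[1] - grid[0]
Q = np.array([[edgeworth_density(np.array([a, b]), K, V) for b in grid] for a in grid])
Z1, Z2 = np.meshgrid(grid, grid, indexing="ij")
print("normalization:", Q.sum() * dz**2, "(expect 1)")
print("E[z1 z2]     :", (Z1 * Z2 * Q).sum() * dz**2, "(expect", c, ")")
print("E[z1^4]      :", (Z1**4 * Q).sum() * dz**2, "(expect", 3.0 + 3.0 / (2.0 * N), ")")
```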

Appendix B Proofs of conditional mean Eq. (8) and second moment Eq. (10)

By plugging in the corresponding expression for the non-Gaussian joint distribution $q(z_{1},\tilde{\bf z})$, using the conditional property of the distribution, and temporarily suppressing the coefficients and indices related to the fourth cumulant $V$, we can compute the conditional first moment as follows,

\begin{split}\mathbb{E}[z_{1}|\tilde{\bf z}]&\propto\int dz_{1}z_{1}q(z_{1},\tilde{\bf z})\\ &=\int dz_{1}z_{1}(1+V\tilde{\partial}^{4}+V\tilde{\partial}^{3}\partial_{z_{1}}+V\tilde{\partial}^{2}\partial^{2}_{z_{1}}+V\tilde{\partial}\partial^{3}_{z_{1}}+V\partial^{4}_{z_{1}})[\mathcal{N}(z_{1}|\tilde{\bf z})\mathcal{N}(\tilde{\bf z})]\\ &=(1+V\tilde{\partial}^{4})[\mu(\tilde{\bf z})\mathcal{N}(\tilde{\bf z})]-V\tilde{\partial}^{3}\mathcal{N}(\tilde{\bf z})\>.\end{split} (29)

In the second line, the derivative terms of order $\partial^{2}_{z_{1}}$ or higher do not contribute, as they produce zero when acting on $z_{1}$. In the last line, one can further reduce the first term to $\mu(1+V\tilde{\partial}^{4})\mathcal{N}+4(\tilde{\partial}\mu)\tilde{\partial}^{3}\mathcal{N}$, since the $\mu$ term is linear in $\tilde{\bf z}$; note that the former term is in fact $\mu q(\tilde{\bf z})$. In addition, the minus sign preceding $\tilde{\partial}^{3}$ arises from the adjoint property when moving $\partial_{z_{1}}$ to act on $z_{1}$. By restoring the coefficients with some combinatorics and using $\partial_{\tilde{i}}\mu(\tilde{\bf z})=[{\tilde{K}}^{-1}{\bf k}]_{\tilde{i}}$, Eq. (8) is proved.

The conditional second moment can be computed in a similar manner,

\begin{split}\mathbb{E}[z_{1}^{2}|\tilde{\bf z}]&\propto\int dz_{1}z_{1}^{2}q(z_{1},\tilde{\bf z})\\ &=\int dz_{1}z_{1}^{2}(1+V\tilde{\partial}^{4}+V\tilde{\partial}^{3}\partial_{z_{1}}+V\tilde{\partial}^{2}\partial^{2}_{z_{1}}+V\tilde{\partial}\partial^{3}_{z_{1}}+V\partial^{4}_{z_{1}})[\mathcal{N}(z_{1}|\tilde{\bf z})\mathcal{N}(\tilde{\bf z})]\\ &=(1+V\tilde{\partial}^{4})\bigg{\{}[\mu^{2}(\tilde{\bf z})+\sigma^{2}]\mathcal{N}(\tilde{\bf z})\bigg{\}}-2V\tilde{\partial}^{3}[\mu(\tilde{\bf z})\mathcal{N}(\tilde{\bf z})]+2V\tilde{\partial}^{2}[\mathcal{N}(\tilde{\bf z})]\>.\end{split} (30)

The first term in the last line contributes a term proportional to $q(\tilde{\bf z})$; the remaining details are similar to the above.

Appendix C Proofs of evidence in Eq. (17) and posterior in Eq. (18)

We start with the definition in Eq. (17) and focus on the correction term. The computation involves the marginalization over all $z$'s as well as a complication arising from the derivatives with respect to four of the $z$'s. Using the adjoint property of the derivative, i.e. $\int f(x)[\partial g(x)/\partial x]dx=\int[-\partial f(x)/\partial x]g(x)dx$, provided that the product $fg$ vanishes at infinity, we can shuffle the operators $\partial_{z}$ acting on the Gaussian prior on the right to act on the Gaussian likelihood on the left. Then, using the property $(\partial_{t}+\partial_{s})\exp[-(s-t)^{2}]=0$, we can replace $\partial_{z}$ with $-\partial_{y}$ and pull these derivatives out of the integral. The following summarizes these steps.

\begin{split}&\int d{\bf z}\ \mathcal{N}({\bf y}|{\bf z},\Lambda)\big{[}v_{ijml}\partial_{z_{i}}\partial_{z_{j}}\partial_{z_{m}}\partial_{z_{l}}\mathcal{N}({\bf z}|0,K)\big{]}\\ &=v_{ijml}\int d{\bf z}\big{[}\partial_{z_{i}}\partial_{z_{j}}\partial_{z_{m}}\partial_{z_{l}}\mathcal{N}({\bf y}|{\bf z},\Lambda)\big{]}\mathcal{N}({\bf z}|0,K)\\ &=v_{ijml}\partial_{y_{i}}\partial_{y_{j}}\partial_{y_{m}}\partial_{y_{l}}\int d{\bf z}\mathcal{N}({\bf y}|{\bf z},\Lambda)\mathcal{N}({\bf z}|0,K)\\ &=v_{ijml}\partial_{y_{i}}\partial_{y_{j}}\partial_{y_{m}}\partial_{y_{l}}\mathcal{N}({\bf y}|0,K+\Lambda)\>.\end{split} (31)
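
A one-dimensional numerical sketch of the derivative-shuffling step in Eq. (31) is given below, comparing the integral on the left with the closed-form fourth derivative of the marginal Gaussian on the right; the variances and the evaluation point are arbitrary choices, and scipy is used only for the quadrature.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def d4_gauss(x, var):
    """Fourth derivative of a centered one-dimensional Gaussian density with variance var."""
    u = x / np.sqrt(var)
    return norm.pdf(x, scale=np.sqrt(var)) * (u**4 - 6.0 * u**2 + 3.0) / var**2

k, lam, y = 1.0, 0.3, 0.8
lhs = quad(lambda z: norm.pdf(y, loc=z, scale=np.sqrt(lam)) * d4_gauss(z, k), -10, 10)[0]
rhs = d4_gauss(y, k + lam)
print(lhs, rhs)        # the two sides agree, as in the last line of Eq. (31)
```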

As for the proof of Eq. (18), we proceed with the computation of the numerator there, and note that the derivative $\partial_{z_{*}}$ in $q(z_{*},{\bf z})$ can be moved out of the integral directly. To accommodate this extra variable, we place a hat on top of the indices to stress that $*$ is included. The following spells out the meaning of our notation as well as some details.

\begin{split}v_{\hat{i}\hat{j}\hat{m}\hat{l}}&\int d{\bf z}\ \mathcal{N}({\bf y}|{\bf z},\Lambda)\big{[}\partial_{z_{\hat{i}}}\partial_{z_{\hat{j}}}\partial_{z_{\hat{m}}}\partial_{z_{\hat{l}}}\mathcal{N}(z_{*},{\bf z}|0,\hat{K})\big{]}\\ &=\bigg{[}v_{ijml}\partial_{y_{i}}\partial_{y_{j}}\partial_{y_{m}}\partial_{y_{l}}+v_{ijm*}\partial_{y_{i}}\partial_{y_{j}}\partial_{y_{m}}\partial_{z_{*}}+v_{ij**}\partial_{y_{i}}\partial_{y_{j}}\partial^{2}_{z_{*}}+v_{i***}\partial_{y_{i}}\partial^{3}_{z_{*}}+v_{****}\partial^{4}_{z_{*}}\bigg{]}\times\\ &\int d{\bf z}\mathcal{N}({\bf y}|{\bf z},\Lambda)\mathcal{N}(z_{*},{\bf z}|0,\hat{K})\\ &=v_{\hat{i}\hat{j}\hat{m}\hat{l}}\partial_{\hat{i}\hat{j}\hat{m}\hat{l}}\mathcal{N}\big{[}\bigl{(}\begin{smallmatrix}z_{*}\\ {\bf y}\end{smallmatrix}\bigr{)}\big{|}0,\bigl{(}\begin{smallmatrix}1&{\bf k}_{*}^{t}\\ {\bf k}_{*}&K+\Lambda\end{smallmatrix}\bigr{)}\big{]}\>,\end{split} (32)

where the reader can refer to the text around Eq. (2.21) in [2] for the similar integration leading to the second equality.

Appendix D Proof of Eq. (21)

To prove the correction term to the second moment, the previous tricks for dealing with Gaussians remain useful. Below, we first shuffle the four derivatives acting on the bivariate Gaussian to act on the Gaussian factor on the left. One can further use $\partial_{z_{1,2}}\exp[-(z_{1}-z_{2})^{2}/2]=\pm\partial_{z_{-}}\exp(-z_{-}^{2}/2)$, so that one can rewrite $\partial_{ijlm}=(-1)^{n_{1}}\partial^{4}_{z_{-}}$ with $n_{1}$ denoting the number of $z_{1}$ derivatives involved. The rest of the computation is straightforward.

\begin{split}&\int dz_{1}dz_{2}\ e^{-\frac{(z_{1}-z_{2})^{2}}{2H}}(v_{ijml}\partial_{i}\partial_{j}\partial_{m}\partial_{l})\mathcal{N}(z_{1},z_{2}|0,K_{2})\\ &=\int dz_{1}dz_{2}\big{[}(v_{ijml}\partial_{i}\partial_{j}\partial_{m}\partial_{l})e^{-\frac{(z_{1}-z_{2})^{2}}{2H}}\big{]}\mathcal{N}(z_{1},z_{2}|0,K_{2})\\ &=\frac{V_{1111}+V_{2222}-4V_{1222}-4V_{2111}+6V_{1122}}{24}\times\\ &\int\frac{dz_{+}dz_{-}}{2}[(\partial_{z_{-}})^{4}e^{-\frac{z_{-}^{2}}{2H}}]\mathcal{N}(z_{+}|0,\sigma_{+}^{2})\mathcal{N}(z_{-}|0,\sigma_{-}^{2})\\ &=V^{\prime}\int dz(\frac{3}{H^{2}}-\frac{6z^{2}}{H^{3}}+\frac{z^{4}}{H^{4}})e^{-\frac{z^{2}}{2H}}\mathcal{N}(z|0,\sigma_{-}^{2})\\ &=\frac{V^{\prime}}{\sqrt{1+\frac{\sigma_{-}^{2}}{H}}}\big{[}\frac{3}{H^{2}}-\frac{6\sigma_{-}^{2}}{H^{3}(1+\sigma_{-}^{2}/H)}+\frac{3\sigma_{-}^{4}}{H^{4}(1+\sigma_{-}^{2}/H)^{2}}\big{]}\>.\end{split} (33)

We have used the notation $\sigma_{\pm}^{2}=2[1\pm k({\bf x}_{1},{\bf x}_{2})]$ for the variances.