
Infinitely Wide Tensor Networks as Gaussian Process

Erdong Guo, University of California, Santa Cruz, email: [email protected]
David Draper, University of California, Santa Cruz, email: [email protected]
Abstract

The Gaussian Process is a non-parametric prior that can be understood intuitively as a distribution on function space. It is known that, by placing appropriate priors on the weights of a neural network, a Gaussian Process is obtained as the infinite-width limit of the Bayesian neural network. In this paper, we explore infinitely wide Tensor Networks and show the equivalence of the infinitely wide Tensor Network and the Gaussian Process. We study the pure Tensor Network and two extended Tensor Network structures, the Neural Kernel Tensor Network and the Tensor Network hidden layer Neural Network, and prove that each converges to a Gaussian Process as the width of the model goes to infinity. (We note that a Gaussian Process can also be obtained by taking the infinite limit of at least one of the bond dimensions $\alpha_{i}$ in the chain of tensor nodes, and the proofs follow the same ideas as in the infinite-width cases.) We calculate the mean function (mean vector) and the covariance function (covariance matrix) of the finite dimensional distribution of the Gaussian Process induced by the infinite-width tensor network in a general set-up. We study the properties of the covariance function and derive an approximation of it when the integral in the expectation operator is intractable. In the numerical experiments, we implement the Gaussian Processes corresponding to the infinite-limit tensor networks and plot sample paths of these models. We study the hyperparameters and plot families of sample paths of the induced Gaussian Processes while varying the standard deviations of the prior distributions. As expected, the parameters of the prior distribution, namely the hyperparameters of the induced Gaussian Process, control the characteristic lengthscales of the Gaussian Process.

1 Introduction

The Gaussian Process (GP), viewed as a prior over infinite neural network functions, has attracted remarkable attention since the discovery of the equivalence between GPs and infinite-width one hidden layer Neural Networks (Neal, 1995). The idea of letting the number of neurons in the hidden layer go to infinity is motivated by the following question: how can our knowledge (background information) be used to suggest the prior distribution? It turns out that, as the number of neurons in the hidden layer goes to infinity, the response of the neural network with weight prior $p(\mathbf{w})$ converges to a GP. The interesting point of this limit is that we obtain a prior distribution on the function space, namely the distribution $p(\mathbf{y})$ of the response function, by introducing a prior distribution $p(\mathbf{w})$ on the weights. More concretely, in the infinite-width limit of the neural network, the response function $y(\mathbf{w};\mathbf{x})$ is sampled from the function space with density $p(y(\mathbf{w};\mathbf{x})|\mathbf{w},\mathbf{x})$, rather than parameterized by a fixed functional form whose parameters are learned from the data set.

In some sense, the GP is more flexible and has stronger representation power than a fixed neural network. It is therefore appealing to work with the non-parametric Gaussian Process model, which is not committed to a particular parameterization before the data arrive, instead of a neural network with a fixed parameterized function. D. MacKay gives a nice review of neural networks and Gaussian Processes; see (MacKay, 1998). In the work by Williams (Williams, 1997), the GPs induced by the limit of neural networks with specific neurons, namely sigmoidal and Gaussian neurons, are constructed. Subsequent work studied the infinite limit of neural networks with more general structures: general shallow infinite-width neural networks, convolutional neural networks and other architectures (Roux and Bengio, 2007; Novak et al., 2018; Garriga-Alonso et al., 2018; Yang, 2019). In (Hazan and Jaakkola, 2015; Lee et al., 2017; Matthews et al., 2018), the equivalence between infinite-width deep neural networks and GPs is proved, and the performance of the induced GP is compared with the standard model. Recently, the connection between kernel machines and sufficiently wide neural networks has been studied in a series of works; see (Cho and Saul, 2009; Daniely et al., 2016; Daniely, 2017; Jacot et al., 2018).

2 Preliminaries: Gaussian Process by Neural Networks

2.1 Basis of Gaussian Process

As is well known, the Gaussian Process can be understood as the generalization of the multivariate Gaussian distribution on a vector space to an infinite-dimensional function space, defined by a mean function $\mu(x)$ and a covariance function $C(x,x^{\prime})$. Here $\mu(x)$ is the mean of the random variable $X$ with index $x$, and $C(x,x^{\prime})$ is the covariance between the two random variables $X$ and $X^{\prime}$ with indices $x$ and $x^{\prime}$. In applications with a finite training data set, the finite dimensional distribution (f.d.d.s.) of the Gaussian Process is used to model the data; the mean function then degenerates to a finite dimensional vector $\mathbf{\mu}$ and the covariance function degenerates to a covariance matrix $C_{ij}$, which define the f.d.d.s. of the GP, namely the multivariate Gaussian distribution $\mathcal{N}(\mathbf{\mu},C_{ij})$.

A nice property of the GP is that Bayesian updating of the GP prior is analytically tractable and simple. Consider a GP prior

\[
(f|\mu,C) \sim \mathcal{GP}(f|\mu,C). \qquad (1)
\]

We can easily write the joint distribution of the function values at the training inputs $\mathbf{x}$ and the test inputs $\mathbf{x}^{*}$ as

\[
\begin{bmatrix} f \\ f^{*} \end{bmatrix} \sim \mathcal{N}\!\left(\begin{bmatrix} \mu(\mathbf{x}) \\ \mu(\mathbf{x}^{*}) \end{bmatrix},\; \begin{bmatrix} C(\mathbf{x},\mathbf{x}) & C(\mathbf{x},\mathbf{x}^{*}) \\ C(\mathbf{x}^{*},\mathbf{x}) & C(\mathbf{x}^{*},\mathbf{x}^{*}) \end{bmatrix}\right). \qquad (2)
\]

The predictive posterior distribution $p(f^{*}|\mathbf{x}^{*},\mathbf{x},f)$ can be obtained analytically as

\[
(f^{*}|f,\mathbf{x}^{*},\mathbf{x}) \sim \mathcal{N}\!\big(C(\mathbf{x}^{*},\mathbf{x})C(\mathbf{x},\mathbf{x})^{-1}f,\; C(\mathbf{x}^{*},\mathbf{x}^{*}) - C(\mathbf{x}^{*},\mathbf{x})C(\mathbf{x},\mathbf{x})^{-1}C(\mathbf{x},\mathbf{x}^{*})\big). \qquad (3)
\]

It is easy to extend the above analysis to the noisy label model, namely by introducing noise into the labels, $\mathbf{y}=f(\mathbf{x})+\epsilon$.
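As a concrete illustration of Eqs. (2) and (3), the following is a minimal NumPy sketch of the predictive update, assuming a squared-exponential covariance function and a noise level sigma_n; both choices are illustrative and not taken from the paper.

```python
import numpy as np

def sq_exp_cov(xa, xb, lengthscale=1.0, amplitude=1.0):
    """Squared-exponential covariance C(x, x') evaluated on two 1-d input arrays."""
    d = xa[:, None] - xb[None, :]
    return amplitude**2 * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_test, sigma_n=0.1):
    """Predictive mean and covariance of Eq. (3), with the noise variance sigma_n^2
    added to the training block as in the noisy-label model y = f(x) + eps."""
    K = sq_exp_cov(x_train, x_train) + sigma_n**2 * np.eye(len(x_train))
    K_star = sq_exp_cov(x_test, x_train)
    K_ss = sq_exp_cov(x_test, x_test)
    mean = K_star @ np.linalg.solve(K, y_train)
    cov = K_ss - K_star @ np.linalg.solve(K, K_star.T)
    return mean, cov

# usage:
# x = np.linspace(0, 1, 10); y = np.sin(2 * np.pi * x)
# mu, Sigma = gp_posterior(x, y, np.linspace(0, 1, 50))
```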

A non-parametric model can adapt itself to the data after the data arrive, but we can still use the maximum likelihood estimator (M.L.E.) or the maximum a posteriori (M.A.P.) estimator to optimize the hyperparameters. The log marginal likelihood has the convenient form

\[
\log p(\mathbf{y}|\mathbf{x},\theta) \propto -\frac{1}{2}\mathbf{y}^{T}C_{\mathbf{y}}^{-1}\mathbf{y} - \frac{1}{2}\log|C_{\mathbf{y}}|, \qquad (4)
\]

where $C_{\mathbf{y}}=C_{f}+\sigma_{n}^{2}\mathbb{I}$.

The training data only enter the first term of the above formula, which evaluates how well the model fits the given training data; the second term depends only on the covariance matrix and acts as a model complexity penalty. We can understand the competition between data fitting and model capacity by considering the length-scale: as the length-scale decreases, the model capacity increases and the data fit improves, but the model complexity penalty grows. From a computational perspective, finding the optimum of the marginal likelihood requires the inverse of the covariance matrix $C_{\mathbf{y}}^{-1}$, which costs $O(n^{3})$, where $n$ is the training sample size. For large data sets this matrix inversion is highly time-consuming, so matrix approximation methods such as the outer-product approximation can be considered. See (Williams and Rasmussen, 2006) for a good review of GPs in Machine Learning.
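The Cholesky factorization makes the $O(n^{3})$ cost explicit; a minimal sketch of Eq. (4) (up to an additive constant, reusing the sq_exp_cov helper from the previous sketch):

```python
import numpy as np

def gp_log_marginal_likelihood(x, y, cov_fn, sigma_n=0.1):
    """Log marginal likelihood of Eq. (4) up to an additive constant,
    computed with a Cholesky factorization, which is the O(n^3) step."""
    C_y = cov_fn(x, x) + sigma_n**2 * np.eye(len(x))
    L = np.linalg.cholesky(C_y)                       # C_y = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha                       # -1/2 y^T C_y^{-1} y
    complexity = -np.sum(np.log(np.diag(L)))          # -1/2 log|C_y|
    return data_fit + complexity

# usage: gp_log_marginal_likelihood(x, y, sq_exp_cov)
```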

Consider an infinite-width neural network with response function $f(\mathbf{x})$ given by

\[
f(\mathbf{x}) = b^{[1]} + \sum_{i} w^{[1]}_{i} a_{i}(\mathbf{w}^{[0]};\mathbf{x}), \qquad (5)
\]

where $\mathbf{w}^{[i]}$ and $b^{[i]}$ denote the weights and the bias of the $i$'th layer and $a_{i}(\mathbf{w};\mathbf{x})$ denotes the activation of the $i$'th hidden unit. Since all parameters in the neural network are independent and identically distributed (i.i.d.), by the Central Limit Theorem (C.L.T.) we know that as the number of hidden units $i$ goes to infinity,

\[
\sum_{i} w_{i}^{[1]} a_{i}(\mathbf{w}^{[0]};\mathbf{x}) \to X, \qquad (6)
\]

where $X$ is normally distributed. Since $b^{[1]}$ is normal and independent of $\sum_{i}w_{i}^{[1]}a(w_{i}^{[0]};x)$, we know $f(\mathbf{x})$ is a normal random variable. Evaluating the infinite neural network on the data set gives a sequence of random variables $\{f(\mathbf{x}^{(i)}), i\in\{1,\cdots,m\}\}$. For neural networks with several special non-linear activation functions, analytical expressions for the induced Gaussian Process have been obtained (Williams, 1997).
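A quick numerical check of this limit, under the illustrative assumptions of a tanh activation, standard normal input weights and output weights scaled as $1/\sqrt{\text{width}}$ (so that the sum stays of order one, as discussed in Appendix A):

```python
import numpy as np

rng = np.random.default_rng(0)

def wide_nn_samples(x, width, n_samples=5000):
    """Draw samples of f(x) = b + sum_i w_i a_i(w^{[0]}; x) for a one hidden layer
    tanh network with i.i.d. N(0, 1/width) output weights."""
    d = len(x)
    w0 = rng.normal(0.0, 1.0, size=(n_samples, width, d))
    w1 = rng.normal(0.0, np.sqrt(1.0 / width), size=(n_samples, width))
    b = rng.normal(0.0, 1.0, size=n_samples)
    hidden = np.tanh(w0 @ x)                  # shape (n_samples, width)
    return b + np.sum(w1 * hidden, axis=1)

x = np.array([0.3, -1.2])
for width in (10, 100, 1000):
    f = wide_nn_samples(x, width)
    z = (f - f.mean()) / f.std()
    # skewness and excess kurtosis both shrink toward 0 as the width grows
    print(width, np.mean(z**3), np.mean(z**4) - 3)
```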

The above discussion is an intuitive explanation of the equivalence between GPs and infinite-width neural networks; here we state our first theorem on this equivalence and provide a rigorous proof in Appendix A.

Theorem 1.

Let the response $f(\mathbf{x})$ of the one hidden layer neural network with $j$ neurons be

\[
f(\mathbf{x}) = b^{[1]} + \sum_{j} w^{[1]}_{j} a_{j}(\mathbf{w}^{[0]};\mathbf{x}), \qquad (7)
\]

where

\[
w_{j}^{[i]} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,\sigma_{i}), \qquad b^{[i]} \sim \mathcal{N}(0,\sigma_{b}).
\]

As $j$ goes to infinity, $f$ converges to a Gaussian Process $\mathcal{GP}(\mu,\Sigma)$, with $\mu$ and $\Sigma$ the mean and covariance functions.

3 Infinite Limits of Tensor Networks

3.1 Introduction of Tensor Network

For a Tensor Network with $n$ nodes, specifically a Matrix Product State (M.P.S.) (Biamonte and Bergholm, 2017; Orús, 2014; Cichocki, 2014), the scalar response $\psi(\mathbf{x})$ is as follows,

\[
\psi(\mathbf{x}) = \sum_{\{\alpha_{i},s_{i}\}} A^{s_{1}}_{\alpha_{1}\alpha_{2}} \cdots A^{s_{i}}_{\alpha_{i}\alpha_{i+1}} \cdots A^{s_{n}}_{\alpha_{n}\alpha_{1}} \Phi^{s_{1}\cdots s_{n}}(\mathbf{x}), \qquad (8)
\]

where

\[
\Phi^{s_{1}s_{2}\cdots s_{n}}(\mathbf{x}) = \phi^{s_{1}}(x_{1}) \otimes \cdots \otimes \phi^{s_{i}}(x_{i}) \otimes \cdots \otimes \phi^{s_{n}}(x_{n}).
\]

Here $\phi^{s_{i}}(x_{i})$ is the function that maps the component $x_{i}$ of the input into a higher dimensional feature space, and $A^{s_{i}}_{\alpha_{i}\alpha_{i+1}}$ is a rank-3 tensor node, where $s_{i}$ is the physical index that contracts with the index $s_{i}$ of the kernel $\phi^{s_{i}}$, and $\alpha_{i}$ is the bond index, by which we can tune the representation power of the tensor network.

The M.P.S. was introduced in quantum physics to approximate a quantum state $\Ket{\psi}$ efficiently (Orús, 2019; Chabuda et al., 2020; Schröder et al., 2019; McMahon et al., 2020). To fully describe an $n$-party quantum state $\Ket{\psi}$, we need $2^{n}$ basis coefficients, which grow exponentially with $n$. By expanding the $n$-party quantum state as a chain of matrix products, we can approximate the original quantum state with fewer parameters: for the M.P.S., the number of parameters grows only linearly with $n$.

\[
\Ket{\psi} = \psi_{s_{1}\cdots s_{n}}\Ket{s_{1}\cdots s_{n}} \approx A^{s_{1}}_{\alpha_{1}\alpha_{2}} \cdots A^{s_{n}}_{\alpha_{n}\alpha_{1}} \Ket{s_{1}\cdots s_{n}} \qquad (9)
\]

Here we write a general $n$-party state as a matrix product chain with $O(s\alpha^{2}n)$ parameters.
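A minimal sketch of the contraction in Eq. (8), assuming the $[x_{i}, 1-x_{i}]$ local feature map that is used later in Section 3.2.2 and a periodic boundary (the trace over $\alpha_{n}\alpha_{1}$); the tensor shapes and sizes are illustrative:

```python
import numpy as np

def phi(x_i):
    """Local feature map; the [x, 1-x] kernel of Section 3.2.2 is used here as an example."""
    return np.array([x_i, 1.0 - x_i])

def mps_response(tensors, x):
    """psi(x) of Eq. (8): contract each rank-3 node A^{s}_{alpha alpha'} with phi(x_i)
    over the physical index s, multiply the resulting bond matrices, take the trace."""
    chain = np.eye(tensors[0].shape[1])
    for A, x_i in zip(tensors, x):              # A has shape (s, alpha_i, alpha_{i+1})
        Z = np.einsum('sab,s->ab', A, phi(x_i))
        chain = chain @ Z
    return np.trace(chain)                      # periodic boundary: alpha_n contracts with alpha_1

# usage:
# n, s, bond = 5, 2, 8
# tensors = [np.random.normal(0, 0.5, size=(s, bond, bond)) for _ in range(n)]
# print(mps_response(tensors, np.random.rand(n)))
```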

Tensor Networks are a powerful machinery that has recently been widely studied for constructing statistical learning models in the Machine Learning community, with notable achievements in tasks such as supervised pattern recognition (classification and regression), natural language processing and density estimation (unsupervised generative modeling) (Stoudenmire and Schwab, 2016; Han et al., 2018; Pestun et al., 2017; Novikov et al., 2018). Because of the special structure of tensor networks, the model easily blows up or decays to dead nodes during training, and an efficient and robust initialization strategy was proposed in (Guo and Draper, 2021). It is known that the correlation function of tensor networks decays exponentially (Evenbly and Vidal, 2011b, a), so extending the tensor network to more complicated structures, such as a multi-scale tree structure, has been discussed to mitigate the correlation decay problem in machine learning applications (Stoudenmire, 2018). Further extensions of tensor networks can be found in (Cheng et al., 2020; Li and Zhang, 2018; Cheng et al., 2020; Liu et al., 2018; Glasser et al., 2018).

3.2 Infinite-Width Pure Tensor Networks

3.2.1 Theory

Because of the special structure of the MPS, which is the contraction of a chain of tensor nodes, we can obtain a 'trivial' Gaussian Process from the pure MPS without activation functions. Our first theorem on the GP equivalent to the infinite-width MPS is as follows.

Theorem 2.

For an infinite-width MPS with independent random tensor nodes (we note that i.i.d. is not necessary), namely

\[
(A^{s_{i}}_{\alpha_{i}\alpha_{i+1}}|\theta_{i}) \sim p(A^{s_{i}}_{\alpha_{i}\alpha_{i+1}}|\theta_{i}), \qquad (10)
\]

with at least one bond index whose dimension $|\alpha_{i}|\geq 2$, as the number of tensor nodes goes to infinity the response converges to a Gaussian Process,

\[
\psi \sim \mathcal{GP}(\mu,\Sigma), \qquad (11)
\]

where $\psi$ is the response of the infinite-width MPS, $\mu$ is the mean function and $\Sigma$ is the covariance function.

Note: the GP limit can also be obtained by taking the infinite limit of at least one bond index $\alpha_{i}$, $i\in\{1,\cdots,n\}$, if the width of the tensor network is fixed and finite.

A proof of Theorem 2 is provided in Appendix B.1. Without hidden layers and activations, the response of the MPS is linear with respect to the tensor weights $A^{s_{i}}_{\alpha_{i}\alpha_{i+1}}$, which means the sample paths of the equivalent GP are linear functions. We can write down the joint distribution of the sequence of random variables given by evaluating the MPS response at the different training points $\psi(\mathbf{x}^{(i)})$:

\[
p(\psi_{1},\cdots,\psi_{m}) = p(\psi_{1})\,\delta(\psi_{1}-k_{2}\psi_{2})\cdots\delta(\psi_{1}-k_{m}\psi_{m}), \qquad (12)
\]

where $\psi_{i}=\psi(x^{(i)})$, $k_{i}$ is the coefficient of the linear transformation between $\psi_{i}$ and $\psi_{1}$, namely $\psi_{i}=k_{i}\psi_{1}$, and $p(\psi_{1})$ is a normal distribution. It is easy to verify that each random variable $\psi_{i}$ is normal, and hence the sequence $\{\psi_{i}, i\in\{1,\cdots,m\}\}$ is a GP.

Formally, we can write the joint distribution of $\psi_{i}$ and $\psi_{j}$ as

\[
p(\psi_{i},\psi_{j}|A^{s_{i}}_{\alpha_{i}\alpha_{j}}) = \mathcal{N}(\psi_{i},\psi_{j}|\mathbf{0},\Sigma),
\]

where

\[
\Sigma = \begin{bmatrix} \sigma_{1}^{2} & \rho\sigma_{1}\sigma_{2} \\ \rho\sigma_{1}\sigma_{2} & \sigma_{2}^{2} \end{bmatrix}.
\]

Since any two random variables $\psi_{i}$ and $\psi_{j}$ in the sequence are related by a linear transformation, the correlation $\rho$ between them is $1$ or $-1$. Using the conditional probability rule, we have

\[
p(\psi_{i},\psi_{j}|A^{s_{i}}_{\alpha_{i}\alpha_{j}}) = p(\psi_{i}|A^{s_{i}}_{\alpha_{i}\alpha_{j}})\,p(\psi_{j}|\psi_{i},A^{s_{i}}_{\alpha_{i}\alpha_{j}}),
\]

where

\[
p(\psi_{j}|\psi_{i},A^{s_{i}}_{\alpha_{i}\alpha_{j}}) = \lim_{\sigma\to 0}\mathcal{N}\!\left(\psi_{j}-\rho\frac{\sigma_{j}}{\sigma_{i}}\psi_{i},\,\sigma\right) = \delta\!\left(\psi_{j}-\rho\frac{\sigma_{j}}{\sigma_{i}}\psi_{i}\right).
\]

So in the pure MPS case the GP is 'trivial', since the uncertainty band of $\psi_{j}$ given $\psi_{i}$ is $0$, and the conditional distribution $p(\psi_{j}|\psi_{i},A^{s_{i}}_{\alpha_{i}\alpha_{j}})$ is a generalized function, defined as the limit of a sequence of normal distributions as shown above.

We can get the sample path of the GP as follows,

\[
\psi(x) = \frac{\psi_{2}-\psi_{1}}{x^{(2)}-x^{(1)}}\,(x-x^{(1)}) + \psi_{1},
\]

where $\psi_{1}$ and $\psi_{2}$ are the responses of the MPS evaluated at the data points $x^{(1)}$ and $x^{(2)}$.

3.2.2 Discussion on Equivalent GP induced by Specific Kernel

We can obtain the mean function $\mu(\mathbf{x})$ and the covariance function $\Sigma(\mathbf{x},\mathbf{x}^{\prime})$ of the GP equivalent to the pure tensor network as follows,

\[
\mu(x^{(i)}) = \mathrm{E}[\psi(x^{(i)})] = 0, \qquad (13)
\]
\[
\Sigma(x^{(i)},x^{(j)}) = \mathrm{E}[\psi(x^{(i)})\psi(x^{(j)})] = \alpha^{n-1}\prod_{k}\mathrm{E}_{A}[Z(A;x^{(i)}_{k})Z(A;x^{(j)}_{k})], \qquad (14)
\]

where

\[
Z(x_{i}) = A^{s}_{\alpha_{i-1}\alpha_{i}}\phi^{s}(x_{i}).
\]

Here we consider a specific normal prior and kernel as follows,

\[
(A^{s}_{\alpha_{i}\alpha_{i+1}}|\sigma_{A}) \sim \mathcal{N}(A^{s}_{\alpha_{i}\alpha_{i+1}}|0,\sigma_{A}),
\]
\[
\phi(x_{i}) = [f(x_{i}),\,g(x_{i})] = [x_{i},\,1-x_{i}],
\]

and we obtain the variance and covariance of the response as

\[
\mathrm{E}_{A}[\psi(\mathbf{x};A)^{2}] = \alpha^{n-1}\sigma_{A}^{2n}\prod_{i}(2x_{i}^{2}-2x_{i}+1),
\]
\[
\mathrm{E}_{A}[\psi(\mathbf{x};A)\psi(\mathbf{x}^{\prime};A)] = \alpha^{n-1}\sigma_{A}^{2n}\prod_{i}(2x_{i}x^{\prime}_{i}-x_{i}-x_{i}^{\prime}+1).
\]

To avoid the power explosion problem, in applications we need to set

\[
\alpha^{n-1}\sigma_{A}^{2n} = 1,
\]

namely,

\[
\sigma_{A} = \alpha^{\frac{1-n}{2n}}.
\]

In the infinite limit, we should set $\sigma_{A}$ to $\alpha^{-\frac{1}{2}}$. In practice this means that when the tensor network is very wide, we need to fine-tune the standard deviation of the prior distribution to keep the network in a stable range, as discussed in (Guo and Draper, 2021). In Fig. 1, we show sample paths of the pure MPS and also plot sample paths drawn from the MPS with different standard deviations $\sigma$.
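A minimal sketch that implements the covariance formula above with the scaling $\sigma_{A}=\alpha^{\frac{1-n}{2n}}$ and draws from the induced finite dimensional distribution; the number of input points and the seed are arbitrary choices:

```python
import numpy as np

def pure_mps_cov(x, x_prime, bond_dim, sigma_A):
    """Covariance of the pure-MPS GP for the [x, 1-x] kernel:
    alpha^{n-1} sigma_A^{2n} prod_i (2 x_i x'_i - x_i - x'_i + 1)."""
    n = len(x)
    prefactor = bond_dim ** (n - 1) * sigma_A ** (2 * n)
    return prefactor * np.prod(2 * x * x_prime - x - x_prime + 1)

n, bond_dim = 5, 8
sigma_A = bond_dim ** ((1 - n) / (2 * n))     # keeps alpha^{n-1} sigma_A^{2n} = 1
rng = np.random.default_rng(1)
X = rng.random((6, n))                        # six n-dimensional input points
Sigma = np.array([[pure_mps_cov(a, b, bond_dim, sigma_A) for b in X] for a in X])
samples = rng.multivariate_normal(np.zeros(len(X)), Sigma, size=4)
print(samples.shape)                          # four draws of (psi(x^(1)), ..., psi(x^(6)))
```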

3.3 Hybrid Tensor Neural Networks

For the infinite-width pure MPS, no activation functions and no latent variables are introduced, so the response function $\psi(\mathbf{x})$ is linear with respect to the weights $A^{s_{i}}_{\alpha_{i}\alpha_{i+1}}$ of each tensor node. The representation power of the pure tensor network is therefore limited, since the pure MPS can only represent linear functions. To increase the representation capacity of the MPS, we can introduce another layer of neural network with activation functions. Here we propose two ways to combine the MPS with a neural network:

  • Neural network kernel MPS;

  • MPS hidden layer neural network.

For the neural network kernel MPS, we use a one layer neural network as the kernel function, and 'non-linearity' is introduced by the hidden layer and the activation function of the neural network. For the MPS hidden layer neural network, the response $\psi^{l}(\mathbf{w},\mathbf{A};\mathbf{x})$ of the tensor network is fed into the neural network, by which the non-linearity is injected into the model.

In the following sections, we discuss these two hybrid tensor neural networks and the theorems on their Gaussian Process equivalence properties.

3.3.1 Infinite-width Neural kernel Tensor Networks

The set-up of the neural network kernel MPS is as follows,

\[
a_{i}(\mathbf{w}^{[0]};\mathbf{x}) = a\Big(\sum_{j}w^{[0]}_{ij}x_{j}\Big), \qquad (15)
\]
\[
\psi(\mathbf{A},\mathbf{w};\mathbf{x}) = \sum_{\{s,\alpha\}} A^{s_{1}}_{\alpha_{1}\alpha_{2}} \cdots A^{s_{i}}_{\alpha_{i}\alpha_{i+1}} \cdots A^{s_{n}}_{\alpha_{n}\alpha_{1}} \Phi^{s_{1}\cdots s_{n}}(\mathbf{w};\mathbf{x}), \qquad (16)
\]

where

\[
\phi^{s_{i}}(\mathbf{w};\mathbf{x}) = [a_{(i-1)s+1},\cdots,a_{is}],
\]
\[
\Phi^{s_{1}\cdots s_{n}}(\mathbf{w};\mathbf{x}) = \phi^{s_{1}}\otimes\cdots\otimes\phi^{s_{n}} = [[a_{1},\cdots,a_{s}],\cdots,[a_{(n-1)s+1},\cdots,a_{ns}]],
\]

where $\mathbf{w}$ denotes the weights of the neural kernel and $f(\cdot)$ is the output of the hidden neural network. After we get the response of the neural network, we reshape its output into an $(n,s)$ tensor and feed it into the MPS.
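A minimal sketch of the forward pass in Eqs. (15)-(16), assuming a tanh activation and the same periodic MPS contraction as before; all shapes are illustrative:

```python
import numpy as np

def neural_kernel_mps(tensors, W, x, s_dim):
    """Eqs. (15)-(16): push x through one neural layer, reshape the activations into an
    (n, s) feature array, then contract it with the MPS exactly as in Eq. (8)."""
    a = np.tanh(W @ x)                       # hidden activations a_i(w^{[0]}; x), length n * s_dim
    n = len(tensors)
    features = a.reshape(n, s_dim)           # phi^{s_i}(w; x) = [a_{(i-1)s+1}, ..., a_{i s}]
    chain = np.eye(tensors[0].shape[1])
    for A, f in zip(tensors, features):      # A has shape (s, bond, bond)
        chain = chain @ np.einsum('sab,s->ab', A, f)
    return np.trace(chain)

# usage:
# n, s_dim, bond, d = 4, 3, 8, 2
# tensors = [np.random.normal(0, 0.3, size=(s_dim, bond, bond)) for _ in range(n)]
# W = np.random.normal(0, 1.0, size=(n * s_dim, d))
# print(neural_kernel_mps(tensors, W, np.random.rand(d), s_dim))
```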

The following theorem states the equivalence between the infinite-width neural kernel tensor network and GP.

Theorem 3.

For a neural kernel tensor network with independent tensor nodes (we note that i.i.d. is not necessary) whose distributions are

\[
(A^{s_{i}}_{\alpha_{i}\alpha_{i+1}}|\theta_{i}) \sim p(A^{s_{i}}_{\alpha_{i}\alpha_{i+1}}|\theta_{i}),
\]

and with neural weights $\mathbf{w}$ whose components are independent and follow the distribution

\[
(w_{ij}|\sigma_{w}) \sim p(w_{ij}|\sigma_{w}),
\]

with at least one bond index whose dimension $|\alpha_{i}|\geq 2$. As the number of tensor nodes, and also the number of nodes of the hidden layer and the output layer of the neural network, go to infinity, the response of the neural kernel tensor network converges to a GP,

\[
\psi \sim \mathcal{GP}(\mu,\Sigma),
\]

where $\psi$ is the response of the infinite-width MPS, $\mu$ is the mean function and $\Sigma$ is the covariance function.

Note: the GP limit can also be obtained by taking the infinite limit of at least one bond index $\alpha_{i}$, $i\in\{1,\cdots,n\}$, if the width of the tensor network is fixed and finite.

A proof of Theorem 3 is given in Appendix B.2. The interesting point of the above theorem is that we only need to assume that the tensor nodes $A^{s_{i}}_{\alpha_{i}\alpha_{i+1}}$ of the tensor network are independent of each other, and that all components of the neural weights are independent; the response of the model then converges to a GP without any further condition such as the identical distribution requirement in the infinite-width neural network case. Roughly, the reason why the identical distribution requirement is not needed here is the factorization property of the tensor network, which decomposes a high rank tensor into a product of low rank tensor nodes. Thanks to the bond indices $\alpha_{i}$ of the tensor network, a sum of infinitely many i.i.d. random variables is obtained, and by the C.L.T. we get a sequence of normal random variables.

The mean function $\mu(\mathbf{x})$ and covariance function $\Sigma(\mathbf{x},\mathbf{x}^{\prime})$ of the neural kernel tensor network are as follows,

\[
\mathrm{E}_{A,\mathbf{w}}[\psi(x;A,\mathbf{w})] = 0,
\]
\[
\mathrm{E}_{A,\mathbf{w}}[\psi(x;A,\mathbf{w})\psi(x^{\prime};A,\mathbf{w})] = 2^{n}\alpha^{n-1}\sigma_{A}^{2n}\,\mathrm{E}_{\mathbf{w}}^{n}[a(x;\mathbf{w})a(x^{\prime};\mathbf{w})],
\]
\[
\mathrm{E}_{A,\mathbf{w}}[\psi(x;A,\mathbf{w})\psi(x;A,\mathbf{w})] = 2^{n}\alpha^{n-1}\sigma_{A}^{2n}\,\mathrm{E}_{\mathbf{w}}^{n}[a(x;\mathbf{w})^{2}].
\]

When the activation function $a(x;\mathbf{w})$ is sigmoidal or Gaussian, analytical results for $\mathrm{E}_{\mathbf{w}}[a(x;\mathbf{w})a(x^{\prime};\mathbf{w})]$ are available (Williams, 1997). In Fig. 2, we plot sample paths of the MPS with neural kernel.
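When the chosen activation admits no closed form, $\mathrm{E}_{\mathbf{w}}[a(x;\mathbf{w})a(x^{\prime};\mathbf{w})]$ can be estimated by Monte Carlo; a minimal sketch with a tanh activation and a normal prior on $\mathbf{w}$ (both illustrative choices), reading $\mathrm{E}_{\mathbf{w}}^{n}$ as the $n$-th power of the expectation:

```python
import numpy as np

def mc_neural_kernel(x, x_prime, sigma_w=1.0, n_samples=200_000,
                     rng=np.random.default_rng(2)):
    """Monte Carlo estimate of E_w[a(x; w) a(x'; w)] for a single tanh unit
    with weights w ~ N(0, sigma_w^2), i.i.d. over the input dimensions."""
    w = rng.normal(0.0, sigma_w, size=(n_samples, len(x)))
    return np.mean(np.tanh(w @ x) * np.tanh(w @ x_prime))

def neural_kernel_mps_cov(x, x_prime, n, bond_dim, sigma_A):
    """Covariance 2^n alpha^{n-1} sigma_A^{2n} E_w[a(x;w) a(x';w)]^n from Section 3.3.1."""
    return (2 ** n) * bond_dim ** (n - 1) * sigma_A ** (2 * n) \
        * mc_neural_kernel(x, x_prime) ** n
```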

3.4 Infinite-Width MPS hidden layer Neural Networks

3.4.1 Theory

In this part, we feed the output of the MPS, passed through an activation, into a fully connected neuron layer. We write the formula of the $l$-dimensional one hidden layer MPS Neural Network as follows,

\[
\psi^{l}(\mathbf{x}) = \sum_{\{\alpha_{i},s_{i}\}} A^{s_{1}}_{\alpha_{1}\alpha_{2}} \cdots A^{s_{i}l}_{\alpha_{i}\alpha_{i+1}} \cdots A^{s_{n}}_{\alpha_{n}\alpha_{1}} \Phi^{s_{1}\cdots s_{n}}(\mathbf{x}), \qquad (17)
\]
\[
f(\mathbf{x}) = b + \sum_{l} w_{l}\,a(\psi^{l}), \qquad (18)
\]

where $a(\cdot)$ is the activation function, usually the Sigmoid, ReLU or Tanh function.
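A minimal sketch of Eqs. (17)-(18), placing the extra output index $l$ on the first tensor node (an illustrative choice; Eq. (17) places it on a generic node $i$) and reusing the $[x_{i},1-x_{i}]$ feature map and a tanh activation:

```python
import numpy as np

def mps_hidden_layer_nn(tensors, w_out, b, x, act=np.tanh):
    """Eqs. (17)-(18): one node carries an extra output index l, so the contraction
    returns a vector psi^l(x); the response is f(x) = b + sum_l w_l a(psi^l(x))."""
    phi = lambda t: np.array([t, 1.0 - t])                 # local feature map (illustrative)
    M = np.einsum('slab,s->lab', tensors[0], phi(x[0]))    # first node: shape (s, l, bond, bond)
    for A, x_i in zip(tensors[1:], x[1:]):                 # remaining nodes: shape (s, bond, bond)
        M = np.einsum('lab,bc->lac', M, np.einsum('sbc,s->bc', A, phi(x_i)))
    psi_l = np.einsum('laa->l', M)                         # periodic boundary: trace over bonds
    return b + w_out @ act(psi_l)

# usage:
# n, s, bond, l_dim = 5, 2, 8, 16                          # s = 2 matches the feature map
# tensors = [np.random.normal(0, 0.4, size=(s, l_dim, bond, bond))] + \
#           [np.random.normal(0, 0.4, size=(s, bond, bond)) for _ in range(n - 1)]
# w_out = np.random.normal(0, 1 / np.sqrt(l_dim), size=l_dim)
# print(mps_hidden_layer_nn(tensors, w_out, np.random.normal(), np.random.rand(n)))
```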

In the following part, we will state the theorem on the equivalence between the infinite-width MPS hidden layer Neural Network and the GP.

Theorem 4.

For the MPS hidden layer Neural Network whose tensor nodes $A^{s_{i}}_{\alpha_{i}\alpha_{i+1}}$ are all independent of each other, and whose neural weight components $w_{ij}$ are all independent of each other, with the following distributions,

\[
A^{s_{i}}_{\alpha_{i}\alpha_{i+1}} \sim p(A^{s_{i}}_{\alpha_{i}\alpha_{i+1}}|\theta_{i}), \qquad (19)
\]
\[
(w_{ij}|\sigma_{w}) \sim p(w_{ij}|\sigma_{w}), \qquad (20)
\]

with at least one bond index whose dimension $|\alpha_{i}|\geq 2$. As the number of tensor nodes goes to infinity, the response converges to a GP,

\[
f \sim \mathcal{GP}(\mu,\Sigma),
\]

where $f$ is the response of the MPS hidden layer Neural Network in the infinite-width limit, $\mu$ is the mean function and $\Sigma$ is the covariance function.

Note: the GP limit can also be obtained by taking the infinite limit of at least one bond index $\alpha_{i}$, $i\in\{1,\cdots,n\}$, if the width of the tensor network is fixed and finite.

A proof of Theorem 4 is given in Appendix B.3. We note that as the width of the tensor network goes to infinity, each component of the response of the tensor network converges to a GP, and the components are independent of each other. The number of neurons in the neural network also goes to infinity as the width of the tensor network does, so the C.L.T. can be used to show that the sequence of random variables indexed by the data points is a GP.

We calculate the mean function and the covariance function as

\[
\mathrm{E}_{A,w,b}[f(\mathbf{x})] = \mathrm{E}[b] + \sum_{l}\mathrm{E}[w_{l}]\,\mathrm{E}[a(\psi^{l})] = 0, \qquad (21)
\]
\[
\mathrm{E}_{A}[f(\mathbf{x})f(\mathbf{x}^{\prime})] = \sigma_{b}^{2} + \sigma_{w}^{2}\,\mathrm{E}_{A}\Big[\sum_{l}a^{l}(\mathbf{x})a^{l}(\mathbf{x}^{\prime})\Big]. \qquad (22)
\]

In the computation of the covariance matrix, the neural activations at different data points are coupled with each other, which means an analytical result is difficult to obtain. We can instead use Monte Carlo simulation to compute the matrix elements of the covariance function evaluated at different data pairs by sampling hierarchically: we draw sample activations by sampling from the prior and plugging the prior samples into the hybrid MPS neural network, which is straightforward. If the tensor network structure is not huge, the time complexity is acceptable.
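A minimal sketch of this hierarchical Monte Carlo estimate of Eq. (22), with the same illustrative choices as above (tanh activation, $[x_{i},1-x_{i}]$ feature map, $l$ index on the first node); the prior standard deviations and sample sizes are placeholders:

```python
import numpy as np

def mc_hidden_cov(x, x_prime, n, s, bond, l_dim, sigma_A, sigma_w, sigma_b,
                  n_samples=2000, rng=np.random.default_rng(3)):
    """Monte Carlo estimate of Eq. (22): sigma_b^2 + sigma_w^2 E_A[sum_l a^l(x) a^l(x')],
    sampling the tensor-node prior and pushing both inputs through the same MPS.
    Note: s must match the feature map dimension (2 for [x, 1-x])."""
    phi = lambda t: np.array([t, 1.0 - t])

    def act_vector(tensors, inp):
        M = np.einsum('slab,s->lab', tensors[0], phi(inp[0]))
        for A, t in zip(tensors[1:], inp[1:]):
            M = np.einsum('lab,bc->lac', M, np.einsum('sbc,s->bc', A, phi(t)))
        return np.tanh(np.einsum('laa->l', M))

    total = 0.0
    for _ in range(n_samples):
        tensors = [rng.normal(0, sigma_A, size=(s, l_dim, bond, bond))] + \
                  [rng.normal(0, sigma_A, size=(s, bond, bond)) for _ in range(n - 1)]
        total += act_vector(tensors, x) @ act_vector(tensors, x_prime)
    return sigma_b**2 + sigma_w**2 * total / n_samples
```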

In Fig. 3, we plot the sample path of the MPS hidden layer neural network. In Fig. 4, we plot the two dimensional sample path of the MPS hidden layer Neural Network.

3.4.2 Approximation of Covariance matrix

For the one hidden layer neural network, when the activation is sigmoidal or Gaussian, an analytical result for the covariance function was obtained (Williams, 1997). However, in our MPS hidden layer neural network case the covariance integral is much more complicated, since the different components of the MPS are coupled with each other through the bond dimensions of the tensor network. We obtain an approximation of $\mathrm{E}_{\mathbf{A}}[a^{l}(\mathbf{x})a^{l}(\mathbf{x}^{\prime})]$ by Taylor-expanding the product around the mean of the tensor nodes.

\[
\begin{aligned}
\mathrm{E}_{A}[a^{l}(\mathbf{x};A)a^{l}(\mathbf{x}^{\prime};A)] \approx\;& a^{l}(\mathbf{x};A_{0})a^{l}(\mathbf{x}^{\prime};A_{0}) + \frac{1}{2}\mathrm{Var}[A]\cdot\big(2\nabla a^{l}(\mathbf{x};A_{0})\otimes\nabla a^{l}(\mathbf{x}^{\prime};A_{0}) \\
&+ \nabla^{2}a^{l}(\mathbf{x};A_{0})\otimes a^{l}(\mathbf{x}^{\prime};A_{0}) + a^{l}(\mathbf{x};A_{0})\otimes\nabla^{2}a^{l}(\mathbf{x}^{\prime};A_{0})\big),
\end{aligned}
\]

where $\mathrm{Var}[A]$ is the diagonal part of the covariance matrix of the random vector $A$, namely the vector whose components are the variances of the components of the tensor node vector $A$. See Appendix C for more details of the calculation.

4 Experiments

We implement all three tensor network induced GPs in the numerical experiments. We study the sample paths of each GP and also explore families of sample paths by varying the hyperparameters, namely the standard deviations $\sigma_{\mathbf{A}}$ and $\sigma_{\mathbf{w}}$ of the priors on the tensor nodes and the neural weights.

In Fig. 1(a), we show several random sample paths from the pure MPS GP. In Fig. 1(b), we plot the family of sample paths of the pure MPS GP for different standard deviations $\sigma_{\mathbf{A}}$. We find that the slope of the sample path increases sharply as the standard deviation $\sigma_{\mathbf{A}}$ of the tensor node prior increases.

Figure 1: Sample paths of the pure MPS model. (a) Four randomly sampled paths of the pure MPS; (b) four sample paths from models with different variances. Since no activations are used in this model, all paths are linear functions. The structure of the model is as follows: the number of tensor nodes is 5, the bond dimension is 8 and the kernel dimension is 2.

Figure 2: Sample paths of the neural kernel MPS model, in which non-linearity is introduced by the hidden neural layer and the activations. (a) Four randomly sampled paths of the neural kernel MPS; (b) four sample paths from models with different variances; (c) four randomly sampled paths of the tensor network; (d) four sample paths from models with different variances.

Figure 3: Sample paths of the MPS hidden neural network. (a) Four randomly sampled paths with $\sigma=0.5$; (b) four sample paths from the model with $\sigma=1$; (c) a randomly sampled family of paths with varying standard deviation $\sigma$; (d) a family of sampled paths with different variances.

Figure 4: Sample paths of the MPS hidden neural network when the input data point is a two dimensional vector. (a) A sample path (surface) with $\sigma=0.7$; (b) a sampled path (surface) with a two dimensional data vector.

In Fig. 2, we show the numerical experiments for the neural kernel MPS. In Fig. 2(a), we plot sample paths drawn from the neural kernel MPS. Because of the neural hidden layer and the non-linear activation, the sample paths are no longer linear and are more complicated than those of the pure MPS model. In Fig. 2(b), we plot how one sample path varies as we increase the standard deviation $\sigma_{\mathbf{A}}$. As expected, the variation of the sample path increases as the standard deviation $\sigma_{\mathbf{A}}$ of the tensor node prior increases. In the $d$-dimensional data vector case, the sample path is a $d$-dimensional surface. In Fig. 2(c) and 2(d), we plot sample paths of the neural kernel MPS model with two dimensional data points; the sample path is now a two dimensional surface.

In Fig. 3(a) and Fig. 3(b), we compare sample paths from the MPS hidden neural network as the standard deviation increases from $\sigma=0.5$ to $\sigma=1.0$. We find that as the standard deviation of the prior increases, the sample paths become more complicated, which reflects a greater data fitting ability. In Fig. 3(c) and Fig. 3(d), the standard deviation $\sigma$ is increased and the model complexity increases accordingly.

In Fig. 4(a) and Fig. 4(b), we plot sample surfaces from the MPS hidden neural network in the two dimensional input case. We can see that the complexity of the two dimensional sample surfaces increases as the standard deviation increases.

5 Conclusion

We study the infinite-width limit of tensor networks, specifically the pure MPS and the extended hybrid tensor neural networks obtained by combining the MPS layer with neural layers. We prove that the pure MPS and the hybrid tensor networks converge to GPs as the width of the tensor network goes to infinity. For the pure MPS, since no non-linearity is injected into the model, the equivalent GP induced by the pure MPS is trivial: each Gaussian random variable indexed by a specific data point has zero uncertainty band. However, we can extend the pure MPS to more complicated structures by introducing hidden layers or using neural kernels, so we study the infinite limit of two kinds of hybrid tensor neural networks. We prove that these two structures converge to GPs as their widths go to infinity and calculate the GP equivalent to the infinite-width tensor network. In the numerical experiments, we explore the sample paths and the roles of the hyperparameters of the tensor network induced GPs, and plot families of sample paths obtained by varying the standard deviations of the priors. Tensor networks have rich structures (intuitively, the number of contraction patterns grows exponentially with the number of tensor nodes) and powerful fitting ability, so the function space supported by tensor networks has a profound structure that admits well behaved limit processes. From the application perspective, over-parameterization and the huge function space supported by tensor networks also lead to training instability, poor generalization and scaling difficulties, which are essential problems in applying tensor networks to large data sets.

Acknowledgements

The authors wish to thank David Helmbold, Hongyun Wang, Qi Gong, Torsten Ehrhardt, Hai Lin and Francois Monard for their helpful discussions. Erdong Guo is grateful for the financial support by Ming-Ren Teahouse and Uncertainty Quantification LLC for this work.

References

  • Biamonte and Bergholm (2017) Biamonte, J. and Bergholm, V. (2017). Tensor networks in a nutshell. arXiv preprint arXiv:1708.00006.
  • Chabuda et al. (2020) Chabuda, K., Dziarmaga, J., Osborne, T. J. and Demkowicz-Dobrzański, R. (2020). Tensor-network approach for quantum metrology in many-body quantum systems. Nature Communications, 11 1–12.
  • Cheng et al. (2020) Cheng, S., Wang, L. and Zhang, P. (2020). Supervised learning with projected entangled pair states. arXiv preprint arXiv:2009.09932.
  • Cho and Saul (2009) Cho, Y. and Saul, L. (2009). Kernel methods for deep learning. Advances in neural information processing systems, 22 342–350.
  • Cichocki (2014) Cichocki, A. (2014). Era of big data processing: A new approach via tensor networks and tensor decompositions. arXiv preprint arXiv:1403.2048.
  • Daniely (2017) Daniely, A. (2017). Sgd learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems.
  • Daniely et al. (2016) Daniely, A., Frostig, R. and Singer, Y. (2016). Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances In Neural Information Processing Systems.
  • Evenbly and Vidal (2011a) Evenbly, G. and Vidal, G. (2011a). Quantum criticality with the multi-scale entanglement renormalization ansatz. Chapter in "Strongly Correlated Systems, Numerical Methods", edited by Adolfo Avella and Ferdinando Mancini. Springer Series in Solid-State Sciences, 176.
  • Evenbly and Vidal (2011b) Evenbly, G. and Vidal, G. (2011b). Tensor network states and geometry. Journal of Statistical Physics, 145 891–918.
  • Garriga-Alonso et al. (2018) Garriga-Alonso, A., Rasmussen, C. E. and Aitchison, L. (2018). Deep convolutional networks as shallow gaussian processes. arXiv preprint arXiv:1808.05587.
  • Glasser et al. (2018) Glasser, I., Pancotti, N. and Cirac, J. I. (2018). Supervised learning with generalized tensor networks. arXiv preprint arXiv:1806.05964.
  • Guo and Draper (2021) Guo, E. and Draper, D. (2021). The bayesian method of tensor networks. arXiv preprint arXiv:2101.00245.
  • Han et al. (2018) Han, Z.-Y., Wang, J., Fan, H., Wang, L. and Zhang, P. (2018). Unsupervised generative modeling using matrix product states. Physical Review X, 8 031012.
  • Hazan and Jaakkola (2015) Hazan, T. and Jaakkola, T. (2015). Steps toward deep kernel methods from infinite neural networks. arXiv preprint arXiv:1508.05133.
  • Jacot et al. (2018) Jacot, A., Gabriel, F. and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems.
  • Lee et al. (2017) Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J. and Sohl-Dickstein, J. (2017). Deep neural networks as gaussian processes. arXiv preprint arXiv:1711.00165.
  • Li and Zhang (2018) Li, Z. and Zhang, P. (2018). Shortcut matrix product states and its applications. arXiv preprint arXiv:1812.05248.
  • Liu et al. (2018) Liu, D., Ran, S.-J., Wittek, P., Peng, C., García, R. B., Su, G. and Lewenstein, M. (2018). Machine learning by two-dimensional hierarchical tensor networks: A quantum information theoretic perspective on deep architectures. stat, 1050 23.
  • MacKay (1998) MacKay, D. J. (1998). Introduction to gaussian processes. NATO ASI Series F Computer and Systems Sciences, 168 133–166.
  • Matthews et al. (2018) Matthews, A. G. d. G., Rowland, M., Hron, J., Turner, R. E. and Ghahramani, Z. (2018). Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271.
  • McMahon et al. (2020) McMahon, N. A., Singh, S. and Brennen, G. K. (2020). A holographic duality from lifted tensor networks. npj Quantum Information, 6 1–13.
  • Neal (1995) Neal, R. M. (1995). Bayesian Learning for Neural Networks. Ph.D. thesis, University of Toronto.
  • Novak et al. (2018) Novak, R., Xiao, L., Lee, J., Bahri, Y., Yang, G., Hron, J., Abolafia, D. A., Pennington, J. and Sohl-Dickstein, J. (2018). Bayesian deep convolutional networks with many channels are gaussian processes. arXiv preprint arXiv:1810.05148.
  • Novikov et al. (2018) Novikov, A., Trofimov, M. and Oseledets, I. (2018). Exponential machines. Bulletin of the Polish Academy of Sciences. Technical Sciences, 66.
  • Orús (2014) Orús, R. (2014). A practical introduction to tensor networks: Matrix product states and projected entangled pair states. Annals of Physics, 349 117–158.
  • Orús (2019) Orús, R. (2019). Tensor networks for complex quantum systems. Nature Reviews Physics, 1 538–550.
  • Pestun et al. (2017) Pestun, V., Terilla, J. and Vlassopoulos, Y. (2017). Language as a matrix product state. arXiv preprint arXiv:1711.01416.
  • Roux and Bengio (2007) Roux, N. L. and Bengio, Y. (2007). Continuous neural networks. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (M. Meila and X. Shen, eds.), vol. 2 of Proceedings of Machine Learning Research. PMLR, San Juan, Puerto Rico.
    http://proceedings.mlr.press/v2/leroux07a.html
  • Schröder et al. (2019) Schröder, F. A., Turban, D. H., Musser, A. J., Hine, N. D. and Chin, A. W. (2019). Tensor network simulation of multi-environmental open quantum dynamics via machine learning and entanglement renormalisation. Nature communications, 10 1–10.
  • Stoudenmire and Schwab (2016) Stoudenmire, E. and Schwab, D. J. (2016). Supervised learning with tensor networks. In Advances in Neural Information Processing Systems.
  • Stoudenmire (2018) Stoudenmire, E. M. (2018). Learning relevant features of data with multi-scale tensor networks. Quantum Science and Technology, 3 034003.
  • Williams (1997) Williams, C. K. (1997). Computing with infinite networks. In Advances in neural information processing systems.
  • Williams and Rasmussen (2006) Williams, C. K. and Rasmussen, C. E. (2006). Gaussian processes for machine learning, vol. 2. MIT press Cambridge, MA.
  • Yang (2019) Yang, G. (2019). Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760.

A. Infinite Width Neural Networks

In this appendix we prove the equivalence of the infinite width one hidden layer Neural Net and GP.

Lemma 1. Consider a sequence of random variables $\{X_{i}, i\in\{1,\cdots,k\}\}$,

\[
\forall i\in\{1,\cdots,k\},\quad X_{i}\sim\mathcal{N}(\mu_{i},\sigma_{i}).
\]

If $\forall\,\mathbf{t}\in\mathrm{R}^{k}$, $\mathbf{t}\cdot\mathbf{X}\sim\mathcal{N}(\mu,\sigma)$, then

\[
\mathbf{X} \sim \mathcal{N}(\mathbf{\mu},\mathbf{\Sigma}).
\]

Proof. Since $\forall\,\mathbf{t}\in\mathrm{R}^{k}$, $\mathbf{t}\cdot\mathbf{X}\sim\mathcal{N}(\mu,\sigma)$, we have

\[
\mathrm{E}[\mathbf{t}\cdot\mathbf{X}] = \mathbf{t}\cdot\mathbf{\mu}, \qquad \mathrm{Var}[\mathbf{t}\cdot\mathbf{X}] = \mathbf{t}^{T}\mathbf{\Sigma}\,\mathbf{t}.
\]

We can write the characteristic function $\phi_{\mathbf{X}}(\mathbf{t})$ of the random vector $\mathbf{X}=[X_{1},\cdots,X_{k}]$ as

\[
\phi_{\mathbf{X}}(\mathbf{t}) = \phi_{\mathbf{t}\cdot\mathbf{X}}(1) = \exp\Big(i\,\mathbf{t}\cdot\mathbf{\mu} - \frac{1}{2}\mathbf{t}^{T}\mathbf{\Sigma}\,\mathbf{t}\Big),
\]

which is just the characteristic function of the multivariate normal distribution $\mathcal{N}(\mathbf{\mu},\mathbf{\Sigma})$.

Theorem 1. Let the response $f(\mathbf{x})$ of the one hidden layer neural network with $j$ neurons be

\[
f(\mathbf{x}) = b + \sum_{j} w^{[1]}_{j} z_{j}(\mathbf{w}^{[0]};\mathbf{x}),
\]

where

\[
w_{j}^{[i]} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,\sigma_{i}), \qquad b \sim \mathcal{N}(0,\sigma_{b}).
\]

As $j$ goes to infinity, $f$ converges to the Gaussian Process

\[
f \sim \mathcal{GP}(m,\Sigma),
\]

where $m$ and $\Sigma$ are the mean and covariance functions.

Proof. Since all components of the first layer weights $\mathbf{w}^{[0]}$ are i.i.d., all components of the output of the first layer $z_{j}(\mathbf{w}^{[0]};\mathbf{x})$ are i.i.d.,

\[
(z_{j}|\mathbf{w}^{[0]},\mathbf{x}) \overset{\text{i.i.d.}}{\sim} p(z_{j}),
\]

then we can easily verify that

\[
(w^{[1]}_{j}z_{j}|\mathbf{w}^{[0]},\mathbf{w}^{[1]},\mathbf{x}) \overset{\text{i.i.d.}}{\sim} p(w^{[1]}_{j}z_{j}).
\]

In neural networks the activation functions are mostly well behaved, so the mean and the variance of the output $z_{j}$ are bounded:

\[
\mathrm{E}[w_{j}^{[1]}z_{j}] = 0, \qquad \mathrm{Var}[w^{[1]}_{j}z_{j}] = \mathrm{Var}[w^{[1]}_{j}]\,\mathrm{Var}[z_{j}] < +\infty.
\]

By the central limit theorem, we know that as $j$ goes to $\infty$ the normalized sum $\sum_{j}w_{j}^{[1]}z_{j}$ converges to a random variable with a Gaussian distribution, namely

\[
\frac{1}{n}\sum_{j}^{n} w_{j}^{[1]} z_{j} \sim \mathcal{N}(0,\sigma_{h}),
\]

where

\[
\sigma_{h}^{2} = \frac{1}{n}\sigma_{w^{[1]}}^{2}\sigma^{2}_{z}.
\]

Since the random variable $b$ is Gaussian, the sum of the two Gaussian variables, namely $f(\mathbf{x})$, satisfies

\[
f(\mathbf{x}) = b + \sum_{j} w_{j}^{[1]} z_{j}(\mathbf{w}^{[0]};\mathbf{x}) \sim \mathcal{N}\Big(0,\sqrt{\sigma_{b}^{2}+n\sigma_{h}^{2}}\Big).
\]

We can keep $n\sigma^{2}_{w^{[1]}}$ fixed during the limit process so that $f(\mathbf{x})$ remains well behaved. We have now proved that for a finite index set $\mathbf{T}$, which is just the training set $\{\mathbf{x}^{(i)}, i\in\mathrm{N}\}$, there exists a sequence of Gaussian random variables $\{f(\mathbf{x}^{(i)}),\mathbf{x}^{(i)}\in\mathbf{T}\}$. To extend the finite sequence to an infinite index set such as an interval $\mathbf{T}$, we need to verify that the finite dimensional distributions satisfy the Kolmogorov consistency (extension) theorem. It is known that as long as the f.d.d.s. are consistent, the stochastic process exists. In our neural network case the f.d.d.s. satisfy the two consistency conditions, so we have proved that the infinite width one hidden layer neural network is equivalent to a Gaussian process.

B. Infinite Width Tensor Networks

In this appendix, we prove the theorems on the equivalence of infinite width tensor networks and GPs. The main ideas of the proofs are similar to the proof of Theorem 1.

B.1 Proof of Theorem 2

There are three indices in each tensor node $\mathbf{A}^{s_{i}}_{\alpha_{i}\alpha_{i+1}}$, so as a first step we contract the kernel $\Phi^{s_{1},\cdots,s_{n}}(\mathbf{x})$ with the index sequence $\{s_{1},\cdots,s_{n}\}$ of the tensor network, and we get

\[
\sum_{\{\alpha_{i}\}} Z_{\alpha_{1}\alpha_{2}} \cdots Z_{\alpha_{n}\alpha_{1}}, \qquad (23)
\]

where

\[
Z_{\alpha_{i}\alpha_{i+1}} = \sum_{s_{i}} A^{s_{i}}_{\alpha_{i}\alpha_{i+1}} \phi^{s_{i}}(x_{i}). \qquad (24)
\]

It is easy to verify that all components of $Z_{\alpha_{i}\alpha_{i+1}}$ are i.i.d., since all components of a given tensor node $\mathbf{A}^{s_{i}}_{\alpha_{i}\alpha_{i+1}}$ are i.i.d. We note that in the assumptions of Theorem 2 we only require the tensor nodes to be independent, not i.i.d. For two specific assignments of the bond indices, namely two different value sequences $\{i_{1},\cdots,i_{n}\}$ and $\{j_{1},\cdots,j_{n}\}$, the products of the corresponding chains of nodes are i.i.d. We introduce a new notation for the product of a chain,

\[
K_{\{i_{1}\cdots i_{n}\}} = Z_{i_{1}i_{2}} \cdots Z_{i_{n}i_{1}}. \qquad (25)
\]

All different assignments of the index sequence $\{\alpha_{1}\cdots\alpha_{n}\}$ of $K$ are i.i.d. As the width of the tensor network goes to infinity, the number of possible assignments of the index sequence goes to infinity. By the C.L.T., $\psi$ converges to a GP.

B.2 Proof of Theorem 3

Building on the proof of Theorem 2, to prove Theorem 3 we just need to show that all components of the neural kernel are independent. It is obvious that all components of the output of the neural kernel $a_{i}(\mathbf{w};\mathbf{x})$ are independent. By reshaping the output of the neural kernel, we get the kernel $\Phi(\mathbf{x})$, all of whose components are independent. Then, letting the width of the MPS go to infinity, by the C.L.T. and Theorem 2 the response $\psi$ of the neural kernel MPS converges to a GP.

B.3 Proof of Theorem 4

By taking the infinite width limit of the hidden layer MPS, the components $\psi^{l}$ of the MPS response are all i.i.d. and follow a normal distribution by Theorem 2; it is then easy to verify that $\sum_{l}w_{l}a(\psi^{l})$ follows a Gaussian distribution by the C.L.T., and hence $b+\sum_{l}w_{l}a(\psi^{l})$ is a normal random variable. So $f$ converges to a GP as $l$ goes to infinity.

C. Approximation of Covariance Function

In this appendix, we explain in detail the calculation of the approximation of the covariance function of the MPS hidden layer neural network. For a general function $f(\mathbf{x};\mathbf{A})$, the expectation $\mathrm{E}_{A}[f(\mathbf{x};\mathbf{A})]$ can be written as

\[
\begin{aligned}
\mathrm{E}_{A}[f(\mathbf{x};A)] &= \mathrm{E}_{A}\Big[f(\mathbf{x};A_{0}) + \nabla f(\mathbf{x};A_{0})(A-A_{0}) + \frac{1}{2}(A-A_{0})^{T}\nabla^{2}f(\mathbf{x};A_{0})(A-A_{0}) + R(\mathbf{x};A)\Big] \qquad (26)\\
&= f(\mathbf{x};A_{0}) + \frac{1}{2}\mathrm{Var}[A]\cdot\mathrm{Diag}[H] + \mathrm{E}_{A}[R(\mathbf{x},A)], \qquad (27)
\end{aligned}
\]

where

\[
R(\mathbf{x},A) = \frac{1}{3!} f^{(3)}(\mathbf{x};\xi)(A-A_{0})^{3}, \quad \xi\in[A_{0},A].
\]

We write the second derivative matrix $H=\nabla^{2}f(\mathbf{x};A_{0})$ as

\[
H = (H^{\frac{1}{2}})^{T} H^{\frac{1}{2}}, \qquad (28)
\]

where $H^{\frac{1}{2}}$ is a formal square root of the second derivative matrix $H$. Since $H$ is a real symmetric matrix, $H^{\frac{1}{2}}$ exists and is symmetric. Then

\[
\begin{aligned}
\mathrm{E}_{A}\Big[\frac{1}{2}(A-A_{0})^{T}H(A-A_{0})\Big] &= \frac{1}{2}\mathrm{Tr}\,\mathrm{E}_{A}\big[(A-A_{0})^{T}(H^{\frac{1}{2}})^{T}H^{\frac{1}{2}}(A-A_{0})\big] \qquad (29)\\
&= \frac{1}{2}\mathrm{Tr}\,\mathrm{E}_{A}\big[\big(H^{\frac{1}{2}}(A-A_{0})\big)\big((A-A_{0})^{T}(H^{\frac{1}{2}})^{T}\big)\big] \qquad (30)\\
&= \frac{1}{2}\mathrm{Tr}\big[H^{\frac{1}{2}}\,\mathrm{E}_{A}[(A-A_{0})(A-A_{0})^{T}]\,(H^{\frac{1}{2}})^{T}\big] \qquad (31)\\
&= \frac{1}{2}\mathrm{Tr}\big[\mathrm{Cov}[A-A_{0},A-A_{0}]\,H\big] \qquad (32)\\
&= \frac{1}{2}\mathrm{Tr}\big[\mathrm{Cov}[A,A]\,H\big] \qquad (33)\\
&= \frac{1}{2}\mathrm{Var}[A]\cdot\mathrm{Diag}[H]. \qquad (34)
\end{aligned}
\]

Here we use $\mathrm{Var}[A]$ to denote the diagonal part of the covariance matrix $\mathrm{Cov}[A,A]$, and $\mathrm{Diag}[H]$ to denote the diagonal part of the second derivative matrix $\nabla^{2}f(\mathbf{x};A)$ of the integrand $f(\mathbf{x};A)$.
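A quick numerical check of the quadratic-form identity used in Eqs. (29)-(34), with a diagonal covariance matching the independence assumption on the components of $A$; the dimension and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 6
A0 = rng.normal(size=d)
var = rng.random(d) + 0.1                      # diagonal covariance (independent components)
H = rng.normal(size=(d, d)); H = H + H.T       # a symmetric second derivative matrix

# Monte Carlo estimate of E[(A - A0)^T H (A - A0)]
A = A0 + rng.normal(size=(500_000, d)) * np.sqrt(var)
mc = np.mean(np.einsum('ni,ij,nj->n', A - A0, H, A - A0))

# the identity: Tr[Cov H], which reduces to Var[A] . Diag[H] for diagonal Cov
print(mc, np.trace(np.diag(var) @ H), var @ np.diag(H))
```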