
Infinitely Wide Tensor Networks as Gaussian Process

Erdong Guo, University of California, Santa Cruz, email: [email protected]
David Draper, University of California, Santa Cruz, email: [email protected]
Abstract

The Gaussian Process is a non-parametric prior that can be understood intuitively as a distribution on function space. It is known that, by placing appropriate priors on the weights of a neural network, a Gaussian Process is obtained as the infinite-width limit of the Bayesian neural network. In this paper, we explore infinitely wide Tensor Networks and show the equivalence of the infinitely wide Tensor Network and the Gaussian Process. We study the pure Tensor Network and two extended Tensor Network structures, the Neural Kernel Tensor Network and the Tensor Network hidden layer Neural Network, and prove that each converges to a Gaussian Process as the width of the model goes to infinity. (We note that a Gaussian Process can also be obtained by taking the infinite limit of at least one of the bond dimensions $\alpha_{i}$ in the chain of tensor nodes, and the proofs follow the same ideas as in the infinite-width cases.) We calculate the mean function (mean vector) and the covariance function (covariance matrix) of the finite dimensional distribution of the Gaussian Process induced by the infinite-width tensor network in a general set-up. We study the properties of the covariance function and derive an approximation of it when the integral in the expectation operator is intractable. In the numerical experiments, we implement the Gaussian Processes corresponding to the infinite-limit tensor networks and plot sample paths of these models. We study the hyperparameters and plot families of sample paths of the induced Gaussian Processes while varying the standard deviations of the prior distributions. As expected, the parameters of the prior distribution, namely the hyperparameters of the induced Gaussian Process, control the characteristic lengthscales of the Gaussian Process.

1 Introduction

The Gaussian Process (GP), viewed as a prior over infinite neural network functions, has attracted remarkable attention since the discovery of the equivalence between GPs and infinite-width one hidden layer Neural Networks (Neal, 1995). The idea of letting the number of neurons in the hidden layer go to infinity is motivated by the following question: how can our knowledge (background information) be used to suggest the prior distribution? It turns out that, as the number of neurons in the hidden layer goes to infinity, the response of the neural network with weight prior $p(\mathbf{w})$ converges to a GP. The interesting point of this limit is that we obtain a prior distribution on the function space, namely the distribution $p(\mathbf{y})$ of the response function, by introducing a prior distribution $p(\mathbf{w})$ on the weights. More concretely, in the infinite-width limit of the neural network, the response function $y(\mathbf{w};\mathbf{x})$ is sampled from the function space with density $p(y(\mathbf{w};\mathbf{x})|\mathbf{w},\mathbf{x})$, rather than parameterized by a fixed functional form whose parameters are learned from the data set.

In some sense, the GP is more flexible and has stronger representation power than a fixed neural network. It is therefore appealing to work with the non-parametric Gaussian Process model, which is not committed to a particular parameterization before the data arrive, instead of a neural network with a fixed parameterized function. D. MacKay gives a nice review of neural networks and Gaussian Processes; see (MacKay, 1998). In the work by Williams (Williams, 1997), the GPs induced by the limit of neural networks with specific neurons, namely sigmoidal and Gaussian neurons, are constructed. Subsequent work studied the infinite limit of neural networks with more general structures: general shallow infinite-width neural networks, convolutional neural networks and other architectures (Roux and Bengio, 2007; Novak et al., 2018; Garriga-Alonso et al., 2018; Yang, 2019). In (Hazan and Jaakkola, 2015; Lee et al., 2017; Matthews et al., 2018), the equivalence between infinite-width deep neural networks and GPs is proved, and the performance of the induced GP is compared with the standard model. Recently, the connection between kernel machines and sufficiently wide neural networks has been studied in a series of works; see (Cho and Saul, 2009; Daniely et al., 2016; Daniely, 2017; Jacot et al., 2018).

2 Preliminaries: Gaussian Process by Neural Networks

2.1 Basis of Gaussian Process

As is well known, the Gaussian Process can be understood as the generalization of the multivariate Gaussian distribution on a vector space to an infinite-dimensional function space, defined by a mean function $\mu(x)$ and a covariance function $C(x,x^{\prime})$. Here $\mu(x)$ is the mean of the random variable $X$ with index $x$, and $C(x,x^{\prime})$ is the covariance between the two random variables $X$ and $X^{\prime}$ with indices $x$ and $x^{\prime}$. In applications with a finite training data set, the finite dimensional distribution (f.d.d.s.) of the Gaussian Process is used to model the data; the mean function then degenerates to a finite dimensional vector $\mathbf{\mu}$ and the covariance function degenerates to a covariance matrix $C_{ij}$, which define the f.d.d.s. of the GP, namely the multivariate Gaussian distribution $\mathcal{N}(\mathbf{\mu},C_{ij})$.

A nice property of the GP is that Bayesian updating of the GP prior is analytically tractable and simple. Consider a GP prior

\[
(f|\mu,C) \sim \mathcal{GP}(f|\mu,C). \qquad (1)
\]

We can easily write the joint distribution of the function values at the training inputs $\mathbf{x}$ and the test inputs $\mathbf{x}^{*}$ as

\[
\begin{bmatrix} f \\ f^{*} \end{bmatrix} \sim \mathcal{N}\!\left(\begin{bmatrix} \mu(\mathbf{x}) \\ \mu(\mathbf{x}^{*}) \end{bmatrix},\; \begin{bmatrix} C(\mathbf{x},\mathbf{x}) & C(\mathbf{x},\mathbf{x}^{*}) \\ C(\mathbf{x}^{*},\mathbf{x}) & C(\mathbf{x}^{*},\mathbf{x}^{*}) \end{bmatrix}\right). \qquad (2)
\]

The predictive posterior distribution $p(f^{*}|\mathbf{x}^{*},\mathbf{x},f)$ can be obtained analytically as

\[
(f^{*}|f,\mathbf{x}^{*},\mathbf{x}) \sim \mathcal{N}\!\big(C(\mathbf{x}^{*},\mathbf{x})C(\mathbf{x},\mathbf{x})^{-1}f,\; C(\mathbf{x}^{*},\mathbf{x}^{*}) - C(\mathbf{x}^{*},\mathbf{x})C(\mathbf{x},\mathbf{x})^{-1}C(\mathbf{x},\mathbf{x}^{*})\big). \qquad (3)
\]

It is easy to extend the above analysis to the noisy label model, namely by introducing noise into the labels, $\mathbf{y}=f(\mathbf{x})+\epsilon$.
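As a concrete illustration of Eqs. (2) and (3), the following is a minimal NumPy sketch of the predictive update, assuming a squared-exponential covariance function and a noise level sigma_n; both choices are illustrative and not taken from the paper.

```python
import numpy as np

def sq_exp_cov(xa, xb, lengthscale=1.0, amplitude=1.0):
    """Squared-exponential covariance C(x, x') evaluated on two 1-d input arrays."""
    d = xa[:, None] - xb[None, :]
    return amplitude**2 * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_test, sigma_n=0.1):
    """Predictive mean and covariance of Eq. (3), with the noise variance sigma_n^2
    added to the training block as in the noisy-label model y = f(x) + eps."""
    K = sq_exp_cov(x_train, x_train) + sigma_n**2 * np.eye(len(x_train))
    K_star = sq_exp_cov(x_test, x_train)
    K_ss = sq_exp_cov(x_test, x_test)
    mean = K_star @ np.linalg.solve(K, y_train)
    cov = K_ss - K_star @ np.linalg.solve(K, K_star.T)
    return mean, cov

# usage:
# x = np.linspace(0, 1, 10); y = np.sin(2 * np.pi * x)
# mu, Sigma = gp_posterior(x, y, np.linspace(0, 1, 50))
```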

A non-parametric model can adapt itself to the data after the data arrive, but we can still use the maximum likelihood estimator (M.L.E.) or the maximum a posteriori (M.A.P.) estimator to optimize the hyperparameters. The log marginal likelihood has the convenient form

\[
\log p(\mathbf{y}|\mathbf{x},\theta) \propto -\frac{1}{2}\mathbf{y}^{T}C_{\mathbf{y}}^{-1}\mathbf{y} - \frac{1}{2}\log|C_{\mathbf{y}}|, \qquad (4)
\]

where $C_{\mathbf{y}}=C_{f}+\sigma_{n}^{2}\mathbb{I}$.

The training data only enter the first term of the above formula, which evaluates how well the model fits the given training data; the second term depends only on the covariance matrix and acts as a model complexity penalty. We can understand the competition between data fitting and model capacity by considering the length-scale: as the length-scale decreases, the model capacity increases and the data fit improves, but the model complexity penalty grows. From a computational perspective, finding the optimum of the marginal likelihood requires the inverse of the covariance matrix $C_{\mathbf{y}}^{-1}$, which costs $O(n^{3})$, where $n$ is the training sample size. For large data sets this matrix inversion is highly time-consuming, so matrix approximation methods such as the outer-product approximation can be considered. See (Williams and Rasmussen, 2006) for a good review of GPs in Machine Learning.
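The Cholesky factorization makes the $O(n^{3})$ cost explicit; a minimal sketch of Eq. (4) (up to an additive constant, reusing the sq_exp_cov helper from the previous sketch):

```python
import numpy as np

def gp_log_marginal_likelihood(x, y, cov_fn, sigma_n=0.1):
    """Log marginal likelihood of Eq. (4) up to an additive constant,
    computed with a Cholesky factorization, which is the O(n^3) step."""
    C_y = cov_fn(x, x) + sigma_n**2 * np.eye(len(x))
    L = np.linalg.cholesky(C_y)                       # C_y = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha                       # -1/2 y^T C_y^{-1} y
    complexity = -np.sum(np.log(np.diag(L)))          # -1/2 log|C_y|
    return data_fit + complexity

# usage: gp_log_marginal_likelihood(x, y, sq_exp_cov)
```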

Consider an infinite-width neural network with response function $f(\mathbf{x})$ given by

\[
f(\mathbf{x}) = b^{[1]} + \sum_{i} w^{[1]}_{i} a_{i}(\mathbf{w}^{[0]};\mathbf{x}), \qquad (5)
\]

where $\mathbf{w}^{[i]}$ and $b^{[i]}$ denote the weights and the bias of the $i$'th layer and $a_{i}(\mathbf{w};\mathbf{x})$ denotes the activation of the $i$'th hidden unit. Since all parameters in the neural network are independent and identically distributed (i.i.d.), by the Central Limit Theorem (C.L.T.) we know that as the number of hidden units $i$ goes to infinity,

\[
\sum_{i} w_{i}^{[1]} a_{i}(\mathbf{w}^{[0]};\mathbf{x}) \to X, \qquad (6)
\]

where $X$ is normally distributed. Since $b^{[1]}$ is normal and independent of $\sum_{i}w_{i}^{[1]}a(w_{i}^{[0]};x)$, we know $f(\mathbf{x})$ is a normal random variable. Evaluating the infinite neural network on the data set gives a sequence of random variables $\{f(\mathbf{x}^{(i)}), i\in\{1,\cdots,m\}\}$. For neural networks with several special non-linear activation functions, analytical expressions for the induced Gaussian Process have been obtained (Williams, 1997).
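A quick numerical check of this limit, under the illustrative assumptions of a tanh activation, standard normal input weights and output weights scaled as $1/\sqrt{\text{width}}$ (so that the sum stays of order one, as discussed in Appendix A):

```python
import numpy as np

rng = np.random.default_rng(0)

def wide_nn_samples(x, width, n_samples=5000):
    """Draw samples of f(x) = b + sum_i w_i a_i(w^{[0]}; x) for a one hidden layer
    tanh network with i.i.d. N(0, 1/width) output weights."""
    d = len(x)
    w0 = rng.normal(0.0, 1.0, size=(n_samples, width, d))
    w1 = rng.normal(0.0, np.sqrt(1.0 / width), size=(n_samples, width))
    b = rng.normal(0.0, 1.0, size=n_samples)
    hidden = np.tanh(w0 @ x)                  # shape (n_samples, width)
    return b + np.sum(w1 * hidden, axis=1)

x = np.array([0.3, -1.2])
for width in (10, 100, 1000):
    f = wide_nn_samples(x, width)
    z = (f - f.mean()) / f.std()
    # skewness and excess kurtosis both shrink toward 0 as the width grows
    print(width, np.mean(z**3), np.mean(z**4) - 3)
```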

The above discussion is an intuitive explanation of the equivalence between GPs and infinite-width neural networks; here we state our first theorem on this equivalence and provide a rigorous proof in Appendix A.

Theorem 1.

Let the response $f(\mathbf{x})$ of the one hidden layer neural network with $j$ neurons be

\[
f(\mathbf{x}) = b^{[1]} + \sum_{j} w^{[1]}_{j} a_{j}(\mathbf{w}^{[0]};\mathbf{x}), \qquad (7)
\]

where

\[
w_{j}^{[i]} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,\sigma_{i}), \qquad b^{[i]} \sim \mathcal{N}(0,\sigma_{b}).
\]

As $j$ goes to infinity, $f$ converges to a Gaussian Process $\mathcal{GP}(\mu,\Sigma)$, with $\mu$ and $\Sigma$ the mean and covariance functions.

3 Infinite Limits of Tensor Networks

3.1 Introduction of Tensor Network

For a Tensor Network with $n$ nodes, specifically a Matrix Product State (M.P.S.) (Biamonte and Bergholm, 2017; Orús, 2014; Cichocki, 2014), the scalar response $\psi(\mathbf{x})$ is as follows,

\[
\psi(\mathbf{x}) = \sum_{\{\alpha_{i},s_{i}\}} A^{s_{1}}_{\alpha_{1}\alpha_{2}} \cdots A^{s_{i}}_{\alpha_{i}\alpha_{i+1}} \cdots A^{s_{n}}_{\alpha_{n}\alpha_{1}} \Phi^{s_{1}\cdots s_{n}}(\mathbf{x}), \qquad (8)
\]

where

\[
\Phi^{s_{1}s_{2}\cdots s_{n}}(\mathbf{x}) = \phi^{s_{1}}(x_{1}) \otimes \cdots \otimes \phi^{s_{i}}(x_{i}) \otimes \cdots \otimes \phi^{s_{n}}(x_{n}).
\]

Here $\phi^{s_{i}}(x_{i})$ is the function that maps the component $x_{i}$ of the input into a higher dimensional feature space, and $A^{s_{i}}_{\alpha_{i}\alpha_{i+1}}$ is a rank-3 tensor node, where $s_{i}$ is the physical index that contracts with the index $s_{i}$ of the kernel $\phi^{s_{i}}$, and $\alpha_{i}$ is the bond index, by which we can tune the representation power of the tensor network.

The M.P.S. was introduced in quantum physics to approximate a quantum state $\Ket{\psi}$ efficiently (Orús, 2019; Chabuda et al., 2020; Schröder et al., 2019; McMahon et al., 2020). To fully describe an $n$-party quantum state $\Ket{\psi}$, we need $2^{n}$ basis coefficients, which grow exponentially with $n$. By expanding the $n$-party quantum state as a chain of matrix products, we can approximate the original quantum state with fewer parameters: for the M.P.S., the number of parameters grows only linearly with $n$.

\[
\Ket{\psi} = \psi_{s_{1}\cdots s_{n}}\Ket{s_{1}\cdots s_{n}} \approx A^{s_{1}}_{\alpha_{1}\alpha_{2}} \cdots A^{s_{n}}_{\alpha_{n}\alpha_{1}} \Ket{s_{1}\cdots s_{n}} \qquad (9)
\]

Here we write a general $n$-party state as a matrix product chain with $O(s\alpha^{2}n)$ parameters.
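A minimal sketch of the contraction in Eq. (8), assuming the $[x_{i}, 1-x_{i}]$ local feature map that is used later in Section 3.2.2 and a periodic boundary (the trace over $\alpha_{n}\alpha_{1}$); the tensor shapes and sizes are illustrative:

```python
import numpy as np

def phi(x_i):
    """Local feature map; the [x, 1-x] kernel of Section 3.2.2 is used here as an example."""
    return np.array([x_i, 1.0 - x_i])

def mps_response(tensors, x):
    """psi(x) of Eq. (8): contract each rank-3 node A^{s}_{alpha alpha'} with phi(x_i)
    over the physical index s, multiply the resulting bond matrices, take the trace."""
    chain = np.eye(tensors[0].shape[1])
    for A, x_i in zip(tensors, x):              # A has shape (s, alpha_i, alpha_{i+1})
        Z = np.einsum('sab,s->ab', A, phi(x_i))
        chain = chain @ Z
    return np.trace(chain)                      # periodic boundary: alpha_n contracts with alpha_1

# usage:
# n, s, bond = 5, 2, 8
# tensors = [np.random.normal(0, 0.5, size=(s, bond, bond)) for _ in range(n)]
# print(mps_response(tensors, np.random.rand(n)))
```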

Tensor Networks are a powerful machinery that has recently been widely studied for constructing statistical learning models in the Machine Learning community, with notable achievements in tasks such as supervised pattern recognition (classification and regression), natural language processing and density estimation (unsupervised generative modeling) (Stoudenmire and Schwab, 2016; Han et al., 2018; Pestun et al., 2017; Novikov et al., 2018). Because of the special structure of tensor networks, the model easily blows up or decays to dead nodes during training, and an efficient and robust initialization strategy was proposed in (Guo and Draper, 2021). It is known that the correlation function of tensor networks decays exponentially (Evenbly and Vidal, 2011b, a), so extending the tensor network to more complicated structures, such as a multi-scale tree structure, has been discussed to mitigate the correlation decay problem in machine learning applications (Stoudenmire, 2018). Further extensions of tensor networks can be found in (Cheng et al., 2020; Li and Zhang, 2018; Cheng et al., 2020; Liu et al., 2018; Glasser et al., 2018).

3.2 Infinite-Width Pure Tensor Networks

3.2.1 Theory

Because of the special structure of the MPS, which is the contraction of a chain of tensor nodes, we can obtain a 'trivial' Gaussian Process from the pure MPS without activation functions. Our first theorem on the GP equivalent to the infinite-width MPS is as follows.

Theorem 2.

For an infinite-width MPS with independent random tensor nodes (we note that i.i.d. is not necessary), namely

\[
(A^{s_{i}}_{\alpha_{i}\alpha_{i+1}}|\theta_{i}) \sim p(A^{s_{i}}_{\alpha_{i}\alpha_{i+1}}|\theta_{i}), \qquad (10)
\]

with at least one bond index whose dimension $|\alpha_{i}|\geq 2$, as the number of tensor nodes goes to infinity the response converges to a Gaussian Process,

\[
\psi \sim \mathcal{GP}(\mu,\Sigma), \qquad (11)
\]

where $\psi$ is the response of the infinite-width MPS, $\mu$ is the mean function and $\Sigma$ is the covariance function.

Note: the GP limit can also be obtained by taking the infinite limit of at least one bond index $\alpha_{i}$, $i\in\{1,\cdots,n\}$, if the width of the tensor network is fixed and finite.

A proof of Theorem 2 is provided in Appendix B.1. Without hidden layers and activations, the response of the MPS is linear with respect to the tensor weights $A^{s_{i}}_{\alpha_{i}\alpha_{i+1}}$, which means the sample paths of the equivalent GP are linear functions. We can write down the joint distribution of the sequence of random variables given by evaluating the MPS response at the different training points $\psi(\mathbf{x}^{(i)})$:

\[
p(\psi_{1},\cdots,\psi_{m}) = p(\psi_{1})\,\delta(\psi_{1}-k_{2}\psi_{2})\cdots\delta(\psi_{1}-k_{m}\psi_{m}), \qquad (12)
\]

where $\psi_{i}=\psi(x^{(i)})$, $k_{i}$ is the coefficient of the linear transformation between $\psi_{i}$ and $\psi_{1}$, namely $\psi_{i}=k_{i}\psi_{1}$, and $p(\psi_{1})$ is a normal distribution. It is easy to verify that each random variable $\psi_{i}$ is normal, and hence the sequence $\{\psi_{i}, i\in\{1,\cdots,m\}\}$ is a GP.

Formally, we can write the joint distribution of $\psi_{i}$ and $\psi_{j}$ as

\[
p(\psi_{i},\psi_{j}|A^{s_{i}}_{\alpha_{i}\alpha_{j}}) = \mathcal{N}(\psi_{i},\psi_{j}|\mathbf{0},\Sigma),
\]

where

\[
\Sigma = \begin{bmatrix} \sigma_{1}^{2} & \rho\sigma_{1}\sigma_{2} \\ \rho\sigma_{1}\sigma_{2} & \sigma_{2}^{2} \end{bmatrix}.
\]

Since any two random variables $\psi_{i}$ and $\psi_{j}$ in the sequence are related by a linear transformation, the correlation $\rho$ between them is $1$ or $-1$. Using the conditional probability rule, we have

\[
p(\psi_{i},\psi_{j}|A^{s_{i}}_{\alpha_{i}\alpha_{j}}) = p(\psi_{i}|A^{s_{i}}_{\alpha_{i}\alpha_{j}})\,p(\psi_{j}|\psi_{i},A^{s_{i}}_{\alpha_{i}\alpha_{j}}),
\]

where

\[
p(\psi_{j}|\psi_{i},A^{s_{i}}_{\alpha_{i}\alpha_{j}}) = \lim_{\sigma\to 0}\mathcal{N}\!\left(\psi_{j}-\rho\frac{\sigma_{j}}{\sigma_{i}}\psi_{i},\,\sigma\right) = \delta\!\left(\psi_{j}-\rho\frac{\sigma_{j}}{\sigma_{i}}\psi_{i}\right).
\]

So in the pure MPS case the GP is 'trivial', since the uncertainty band of $\psi_{j}$ given $\psi_{i}$ is $0$, and the conditional distribution $p(\psi_{j}|\psi_{i},A^{s_{i}}_{\alpha_{i}\alpha_{j}})$ is a generalized function, defined as the limit of a sequence of normal distributions as shown above.

We can get the sample path of the GP as follows,

\[
\psi(x) = \frac{\psi_{2}-\psi_{1}}{x^{(2)}-x^{(1)}}\,(x-x^{(1)}) + \psi_{1},
\]

where $\psi_{1}$ and $\psi_{2}$ are the responses of the MPS evaluated at the data points $x^{(1)}$ and $x^{(2)}$.

3.2.2 Discussion on Equivalent GP induced by Specific Kernel

We can obtain the mean function $\mu(\mathbf{x})$ and the covariance function $\Sigma(\mathbf{x},\mathbf{x}^{\prime})$ of the GP equivalent to the pure tensor network as follows,

\[
\mu(x^{(i)}) = \mathrm{E}[\psi(x^{(i)})] = 0, \qquad (13)
\]
\[
\Sigma(x^{(i)},x^{(j)}) = \mathrm{E}[\psi(x^{(i)})\psi(x^{(j)})] = \alpha^{n-1}\prod_{k}\mathrm{E}_{A}[Z(A;x^{(i)}_{k})Z(A;x^{(j)}_{k})], \qquad (14)
\]

where

\[
Z(x_{i}) = A^{s}_{\alpha_{i-1}\alpha_{i}}\phi^{s}(x_{i}).
\]

Here we consider a specific normal prior and kernel as follows,

\[
(A^{s}_{\alpha_{i}\alpha_{i+1}}|\sigma_{A}) \sim \mathcal{N}(A^{s}_{\alpha_{i}\alpha_{i+1}}|0,\sigma_{A}),
\]
\[
\phi(x_{i}) = [f(x_{i}),\,g(x_{i})] = [x_{i},\,1-x_{i}],
\]

and we obtain the variance and covariance of the response as

\[
\mathrm{E}_{A}[\psi(\mathbf{x};A)^{2}] = \alpha^{n-1}\sigma_{A}^{2n}\prod_{i}(2x_{i}^{2}-2x_{i}+1),
\]
\[
\mathrm{E}_{A}[\psi(\mathbf{x};A)\psi(\mathbf{x}^{\prime};A)] = \alpha^{n-1}\sigma_{A}^{2n}\prod_{i}(2x_{i}x^{\prime}_{i}-x_{i}-x_{i}^{\prime}+1).
\]

To avoid the power explosion problem, in applications we need to set

\[
\alpha^{n-1}\sigma_{A}^{2n} = 1,
\]

namely,

\[
\sigma_{A} = \alpha^{\frac{1-n}{2n}}.
\]

In the infinite limit, we should set $\sigma_{A}$ to $\alpha^{-\frac{1}{2}}$. In practice this means that when the tensor network is very wide, we need to fine-tune the standard deviation of the prior distribution to keep the network in a stable range, as discussed in (Guo and Draper, 2021). In Fig. 1, we show sample paths of the pure MPS and also plot sample paths drawn from the MPS with different standard deviations $\sigma$.
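A minimal sketch that implements the covariance formula above with the scaling $\sigma_{A}=\alpha^{\frac{1-n}{2n}}$ and draws from the induced finite dimensional distribution; the number of input points and the seed are arbitrary choices:

```python
import numpy as np

def pure_mps_cov(x, x_prime, bond_dim, sigma_A):
    """Covariance of the pure-MPS GP for the [x, 1-x] kernel:
    alpha^{n-1} sigma_A^{2n} prod_i (2 x_i x'_i - x_i - x'_i + 1)."""
    n = len(x)
    prefactor = bond_dim ** (n - 1) * sigma_A ** (2 * n)
    return prefactor * np.prod(2 * x * x_prime - x - x_prime + 1)

n, bond_dim = 5, 8
sigma_A = bond_dim ** ((1 - n) / (2 * n))     # keeps alpha^{n-1} sigma_A^{2n} = 1
rng = np.random.default_rng(1)
X = rng.random((6, n))                        # six n-dimensional input points
Sigma = np.array([[pure_mps_cov(a, b, bond_dim, sigma_A) for b in X] for a in X])
samples = rng.multivariate_normal(np.zeros(len(X)), Sigma, size=4)
print(samples.shape)                          # four draws of (psi(x^(1)), ..., psi(x^(6)))
```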

3.3 Hybrid Tensor Neural Networks

For the infinite-width pure MPS, no activation functions and no latent variables are introduced, so the response function $\psi(\mathbf{x})$ is linear with respect to the weights $A^{s_{i}}_{\alpha_{i}\alpha_{i+1}}$ of each tensor node. The representation power of the pure tensor network is therefore limited, since the pure MPS can only represent linear functions. To increase the representation capacity of the MPS, we can introduce another layer of neural network with activation functions. Here we propose two ways to combine the MPS with a neural network:

  • Neural network kernel MPS;

  • MPS hidden layer neural network.

For the neural network kernel MPS, we use a one layer neural network as the kernel function, and 'non-linearity' is introduced by the hidden layer and the activation function of the neural network. For the MPS hidden layer neural network, the response $\psi^{l}(\mathbf{w},\mathbf{A};\mathbf{x})$ of the tensor network is fed into the neural network, by which the non-linearity is injected into the model.

In the following sections, we discuss these two hybrid tensor neural networks and the theorems on their Gaussian Process equivalence properties.

3.3.1 Infinite-width Neural kernel Tensor Networks

The set-up of the neural network kernel MPS is as follows,

\[
a_{i}(\mathbf{w}^{[0]};\mathbf{x}) = a\Big(\sum_{j}w^{[0]}_{ij}x_{j}\Big), \qquad (15)
\]
\[
\psi(\mathbf{A},\mathbf{w};\mathbf{x}) = \sum_{\{s,\alpha\}} A^{s_{1}}_{\alpha_{1}\alpha_{2}} \cdots A^{s_{i}}_{\alpha_{i}\alpha_{i+1}} \cdots A^{s_{n}}_{\alpha_{n}\alpha_{1}} \Phi^{s_{1}\cdots s_{n}}(\mathbf{w};\mathbf{x}), \qquad (16)
\]

where

\[
\phi^{s_{i}}(\mathbf{w};\mathbf{x}) = [a_{(i-1)s+1},\cdots,a_{is}],
\]
\[
\Phi^{s_{1}\cdots s_{n}}(\mathbf{w};\mathbf{x}) = \phi^{s_{1}}\otimes\cdots\otimes\phi^{s_{n}} = [[a_{1},\cdots,a_{s}],\cdots,[a_{(n-1)s+1},\cdots,a_{ns}]],
\]

where $\mathbf{w}$ denotes the weights of the neural kernel and $f(\cdot)$ is the output of the hidden neural network. After we get the response of the neural network, we reshape its output into an $(n,s)$ tensor and feed it into the MPS.
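A minimal sketch of the forward pass in Eqs. (15)-(16), assuming a tanh activation and the same periodic MPS contraction as before; all shapes are illustrative:

```python
import numpy as np

def neural_kernel_mps(tensors, W, x, s_dim):
    """Eqs. (15)-(16): push x through one neural layer, reshape the activations into an
    (n, s) feature array, then contract it with the MPS exactly as in Eq. (8)."""
    a = np.tanh(W @ x)                       # hidden activations a_i(w^{[0]}; x), length n * s_dim
    n = len(tensors)
    features = a.reshape(n, s_dim)           # phi^{s_i}(w; x) = [a_{(i-1)s+1}, ..., a_{i s}]
    chain = np.eye(tensors[0].shape[1])
    for A, f in zip(tensors, features):      # A has shape (s, bond, bond)
        chain = chain @ np.einsum('sab,s->ab', A, f)
    return np.trace(chain)

# usage:
# n, s_dim, bond, d = 4, 3, 8, 2
# tensors = [np.random.normal(0, 0.3, size=(s_dim, bond, bond)) for _ in range(n)]
# W = np.random.normal(0, 1.0, size=(n * s_dim, d))
# print(neural_kernel_mps(tensors, W, np.random.rand(d), s_dim))
```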

The following theorem states the equivalence between the infinite-width neural kernel tensor network and GP.

Theorem 3.

For a neural kernel tensor network with independent tensor nodes (we note that i.i.d. is not necessary) whose distributions are

\[
(A^{s_{i}}_{\alpha_{i}\alpha_{i+1}}|\theta_{i}) \sim p(A^{s_{i}}_{\alpha_{i}\alpha_{i+1}}|\theta_{i}),
\]

and with neural weights $\mathbf{w}$ whose components are independent and follow the distribution

\[
(w_{ij}|\sigma_{w}) \sim p(w_{ij}|\sigma_{w}),
\]

with at least one bond index whose dimension $|\alpha_{i}|\geq 2$. As the number of tensor nodes, and also the number of nodes of the hidden layer and the output layer of the neural network, go to infinity, the response of the neural kernel tensor network converges to a GP,

\[
\psi \sim \mathcal{GP}(\mu,\Sigma),
\]

where $\psi$ is the response of the infinite-width MPS, $\mu$ is the mean function and $\Sigma$ is the covariance function.

Note: the GP limit can also be obtained by taking the infinite limit of at least one bond index $\alpha_{i}$, $i\in\{1,\cdots,n\}$, if the width of the tensor network is fixed and finite.

A proof of Theorem 3 is given in Appendix B.2. The interesting point of the above theorem is that we only need to assume that the tensor nodes $A^{s_{i}}_{\alpha_{i}\alpha_{i+1}}$ of the tensor network are independent of each other, and that all components of the neural weights are independent; the response of the model then converges to a GP without any further condition such as the identical distribution requirement in the infinite-width neural network case. Roughly, the reason why the identical distribution requirement is not needed here is the factorization property of the tensor network, which decomposes a high rank tensor into a product of low rank tensor nodes. Thanks to the bond indices $\alpha_{i}$ of the tensor network, a sum of infinitely many i.i.d. random variables is obtained, and by the C.L.T. we get a sequence of normal random variables.

The mean function $\mu(\mathbf{x})$ and covariance function $\Sigma(\mathbf{x},\mathbf{x}^{\prime})$ of the neural kernel tensor network are as follows,

\[
\mathrm{E}_{A,\mathbf{w}}[\psi(x;A,\mathbf{w})] = 0,
\]
\[
\mathrm{E}_{A,\mathbf{w}}[\psi(x;A,\mathbf{w})\psi(x^{\prime};A,\mathbf{w})] = 2^{n}\alpha^{n-1}\sigma_{A}^{2n}\,\mathrm{E}_{\mathbf{w}}^{n}[a(x;\mathbf{w})a(x^{\prime};\mathbf{w})],
\]
\[
\mathrm{E}_{A,\mathbf{w}}[\psi(x;A,\mathbf{w})\psi(x;A,\mathbf{w})] = 2^{n}\alpha^{n-1}\sigma_{A}^{2n}\,\mathrm{E}_{\mathbf{w}}^{n}[a(x;\mathbf{w})^{2}].
\]

When the activation function $a(x;\mathbf{w})$ is sigmoidal or Gaussian, analytical results for $\mathrm{E}_{\mathbf{w}}[a(x;\mathbf{w})a(x^{\prime};\mathbf{w})]$ are available (Williams, 1997). In Fig. 2, we plot sample paths of the MPS with neural kernel.
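When the chosen activation admits no closed form, $\mathrm{E}_{\mathbf{w}}[a(x;\mathbf{w})a(x^{\prime};\mathbf{w})]$ can be estimated by Monte Carlo; a minimal sketch with a tanh activation and a normal prior on $\mathbf{w}$ (both illustrative choices), reading $\mathrm{E}_{\mathbf{w}}^{n}$ as the $n$-th power of the expectation:

```python
import numpy as np

def mc_neural_kernel(x, x_prime, sigma_w=1.0, n_samples=200_000,
                     rng=np.random.default_rng(2)):
    """Monte Carlo estimate of E_w[a(x; w) a(x'; w)] for a single tanh unit
    with weights w ~ N(0, sigma_w^2), i.i.d. over the input dimensions."""
    w = rng.normal(0.0, sigma_w, size=(n_samples, len(x)))
    return np.mean(np.tanh(w @ x) * np.tanh(w @ x_prime))

def neural_kernel_mps_cov(x, x_prime, n, bond_dim, sigma_A):
    """Covariance 2^n alpha^{n-1} sigma_A^{2n} E_w[a(x;w) a(x';w)]^n from Section 3.3.1."""
    return (2 ** n) * bond_dim ** (n - 1) * sigma_A ** (2 * n) \
        * mc_neural_kernel(x, x_prime) ** n
```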

3.4 Infinite-Width MPS hidden layer Neural Networks

3.4.1 Theory

In this part, we feed the output of the MPS, passed through an activation, into a fully connected neuron layer. We write the formula of the $l$-dimensional one hidden layer MPS Neural Network as follows,

\[
\psi^{l}(\mathbf{x}) = \sum_{\{\alpha_{i},s_{i}\}} A^{s_{1}}_{\alpha_{1}\alpha_{2}} \cdots A^{s_{i}l}_{\alpha_{i}\alpha_{i+1}} \cdots A^{s_{n}}_{\alpha_{n}\alpha_{1}} \Phi^{s_{1}\cdots s_{n}}(\mathbf{x}), \qquad (17)
\]
\[
f(\mathbf{x}) = b + \sum_{l} w_{l}\,a(\psi^{l}), \qquad (18)
\]

where $a(\cdot)$ is the activation function, usually the Sigmoid, ReLU or Tanh function.
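A minimal sketch of Eqs. (17)-(18), placing the extra output index $l$ on the first tensor node (an illustrative choice; Eq. (17) places it on a generic node $i$) and reusing the $[x_{i},1-x_{i}]$ feature map and a tanh activation:

```python
import numpy as np

def mps_hidden_layer_nn(tensors, w_out, b, x, act=np.tanh):
    """Eqs. (17)-(18): one node carries an extra output index l, so the contraction
    returns a vector psi^l(x); the response is f(x) = b + sum_l w_l a(psi^l(x))."""
    phi = lambda t: np.array([t, 1.0 - t])                 # local feature map (illustrative)
    M = np.einsum('slab,s->lab', tensors[0], phi(x[0]))    # first node: shape (s, l, bond, bond)
    for A, x_i in zip(tensors[1:], x[1:]):                 # remaining nodes: shape (s, bond, bond)
        M = np.einsum('lab,bc->lac', M, np.einsum('sbc,s->bc', A, phi(x_i)))
    psi_l = np.einsum('laa->l', M)                         # periodic boundary: trace over bonds
    return b + w_out @ act(psi_l)

# usage:
# n, s, bond, l_dim = 5, 2, 8, 16                          # s = 2 matches the feature map
# tensors = [np.random.normal(0, 0.4, size=(s, l_dim, bond, bond))] + \
#           [np.random.normal(0, 0.4, size=(s, bond, bond)) for _ in range(n - 1)]
# w_out = np.random.normal(0, 1 / np.sqrt(l_dim), size=l_dim)
# print(mps_hidden_layer_nn(tensors, w_out, np.random.normal(), np.random.rand(n)))
```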

In the following part, we will state the theorem on the equivalence between the infinite-width MPS hidden layer Neural Network and the GP.

Theorem 4.

For the MPS hidden layer Neural Network whose tensor nodes $A^{s_{i}}_{\alpha_{i}\alpha_{i+1}}$ are all independent of each other, and whose neural weight components $w_{ij}$ are all independent of each other, with the following distributions,

\[
A^{s_{i}}_{\alpha_{i}\alpha_{i+1}} \sim p(A^{s_{i}}_{\alpha_{i}\alpha_{i+1}}|\theta_{i}), \qquad (19)
\]
\[
(w_{ij}|\sigma_{w}) \sim p(w_{ij}|\sigma_{w}), \qquad (20)
\]

with at least one bond index whose dimension $|\alpha_{i}|\geq 2$. As the number of tensor nodes goes to infinity, the response converges to a GP,

\[
f \sim \mathcal{GP}(\mu,\Sigma),
\]

where $f$ is the response of the MPS hidden layer Neural Network in the infinite-width limit, $\mu$ is the mean function and $\Sigma$ is the covariance function.

Note: the GP limit can also be obtained by taking the infinite limit of at least one bond index $\alpha_{i}$, $i\in\{1,\cdots,n\}$, if the width of the tensor network is fixed and finite.

A proof of Theorem 4 is given in Appendix B.3. We note that as the width of the tensor network goes to infinity, each component of the response of the tensor network converges to a GP, and the components are independent of each other. The number of neurons in the neural network also goes to infinity as the width of the tensor network does, so the C.L.T. can be used to show that the sequence of random variables indexed by the data points is a GP.

We calculate the mean function and the covariance function as

\[
\mathrm{E}_{A,w,b}[f(\mathbf{x})] = \mathrm{E}[b] + \sum_{l}\mathrm{E}[w_{l}]\,\mathrm{E}[a(\psi^{l})] = 0, \qquad (21)
\]
\[
\mathrm{E}_{A}[f(\mathbf{x})f(\mathbf{x}^{\prime})] = \sigma_{b}^{2} + \sigma_{w}^{2}\,\mathrm{E}_{A}\Big[\sum_{l}a^{l}(\mathbf{x})a^{l}(\mathbf{x}^{\prime})\Big]. \qquad (22)
\]

In the computation of the covariance matrix, the neural activations at different data points are coupled with each other, which means an analytical result is difficult to obtain. We can instead use Monte Carlo simulation to compute the matrix elements of the covariance function evaluated at different data pairs by sampling hierarchically: we draw sample activations by sampling from the prior and plugging the prior samples into the hybrid MPS neural network, which is straightforward. If the tensor network structure is not huge, the time complexity is acceptable.
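A minimal sketch of this hierarchical Monte Carlo estimate of Eq. (22), with the same illustrative choices as above (tanh activation, $[x_{i},1-x_{i}]$ feature map, $l$ index on the first node); the prior standard deviations and sample sizes are placeholders:

```python
import numpy as np

def mc_hidden_cov(x, x_prime, n, s, bond, l_dim, sigma_A, sigma_w, sigma_b,
                  n_samples=2000, rng=np.random.default_rng(3)):
    """Monte Carlo estimate of Eq. (22): sigma_b^2 + sigma_w^2 E_A[sum_l a^l(x) a^l(x')],
    sampling the tensor-node prior and pushing both inputs through the same MPS.
    Note: s must match the feature map dimension (2 for [x, 1-x])."""
    phi = lambda t: np.array([t, 1.0 - t])

    def act_vector(tensors, inp):
        M = np.einsum('slab,s->lab', tensors[0], phi(inp[0]))
        for A, t in zip(tensors[1:], inp[1:]):
            M = np.einsum('lab,bc->lac', M, np.einsum('sbc,s->bc', A, phi(t)))
        return np.tanh(np.einsum('laa->l', M))

    total = 0.0
    for _ in range(n_samples):
        tensors = [rng.normal(0, sigma_A, size=(s, l_dim, bond, bond))] + \
                  [rng.normal(0, sigma_A, size=(s, bond, bond)) for _ in range(n - 1)]
        total += act_vector(tensors, x) @ act_vector(tensors, x_prime)
    return sigma_b**2 + sigma_w**2 * total / n_samples
```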

In Fig. 3, we plot the sample path of the MPS hidden layer neural network. In Fig. 4, we plot the two dimensional sample path of the MPS hidden layer Neural Network.

3.4.2 Approximation of Covariance matrix

For the one hidden layer neural network, when the activation is sigmoidal or Gaussian, an analytical result for the covariance function was obtained (Williams, 1997). However, in our MPS hidden layer neural network case the covariance integral is much more complicated, since the different components of the MPS are coupled with each other through the bond dimensions of the tensor network. We obtain an approximation of $\mathrm{E}_{\mathbf{A}}[a^{l}(\mathbf{x})a^{l}(\mathbf{x}^{\prime})]$ by Taylor-expanding the product around the mean of the tensor nodes.

\[
\begin{aligned}
\mathrm{E}_{A}[a^{l}(\mathbf{x};A)a^{l}(\mathbf{x}^{\prime};A)] \approx\;& a^{l}(\mathbf{x};A_{0})a^{l}(\mathbf{x}^{\prime};A_{0}) + \frac{1}{2}\mathrm{Var}[A]\cdot\big(2\nabla a^{l}(\mathbf{x};A_{0})\otimes\nabla a^{l}(\mathbf{x}^{\prime};A_{0}) \\
&+ \nabla^{2}a^{l}(\mathbf{x};A_{0})\otimes a^{l}(\mathbf{x}^{\prime};A_{0}) + a^{l}(\mathbf{x};A_{0})\otimes\nabla^{2}a^{l}(\mathbf{x}^{\prime};A_{0})\big),
\end{aligned}
\]

where $\mathrm{Var}[A]$ is the diagonal part of the covariance matrix of the random vector $A$, namely the vector whose components are the variances of the components of the tensor node vector $A$. See Appendix C for more details of the calculation.

4 Experiments

We implement all three tensor network induced GPs in the numerical experiments. We study the sample paths of each GP and also explore families of sample paths by varying the hyperparameters, namely the standard deviations $\sigma_{\mathbf{A}}$ and $\sigma_{\mathbf{w}}$ of the priors on the tensor nodes and the neural weights.

In Fig. 1(a), we show several random sample paths from the pure MPS GP. In Fig. 1(b), we plot the family of sample paths of the pure MPS GP for different standard deviations $\sigma_{\mathbf{A}}$. We find that the slope of the sample path increases sharply as the standard deviation $\sigma_{\mathbf{A}}$ of the tensor node prior increases.

Figure 1: Sample paths of the pure MPS model. (a) Four randomly sampled paths of the pure MPS; (b) four sample paths from models with different variances. Since no activations are used in this model, all paths are linear functions. The structure of the model is as follows: the number of tensor nodes is 5, the bond dimension is 8 and the kernel dimension is 2.

Figure 2: Sample paths of the neural kernel MPS model, in which non-linearity is introduced by the hidden neural layer and the activations. (a) Four randomly sampled paths of the neural kernel MPS; (b) four sample paths from models with different variances; (c) four randomly sampled paths of the tensor network; (d) four sample paths from models with different variances.

Figure 3: Sample paths of the MPS hidden neural network. (a) Four randomly sampled paths with $\sigma=0.5$; (b) four sample paths from the model with $\sigma=1$; (c) a randomly sampled family of paths with varying standard deviation $\sigma$; (d) a family of sampled paths with different variances.

Figure 4: Sample paths of the MPS hidden neural network when the input data point is a two dimensional vector. (a) A sample path (surface) with $\sigma=0.7$; (b) a sampled path (surface) with a two dimensional data vector.

In Fig. 2, we show the numerical experiments for the neural kernel MPS. In Fig. 2(a), we plot sample paths drawn from the neural kernel MPS. Because of the neural hidden layer and the non-linear activation, the sample paths are no longer linear and are more complicated than those of the pure MPS model. In Fig. 2(b), we plot how one sample path varies as we increase the standard deviation $\sigma_{\mathbf{A}}$. As expected, the variation of the sample path increases as the standard deviation $\sigma_{\mathbf{A}}$ of the tensor node prior increases. In the $d$-dimensional data vector case, the sample path is a $d$-dimensional surface. In Fig. 2(c) and 2(d), we plot sample paths of the neural kernel MPS model with two dimensional data points; the sample path is now a two dimensional surface.

In Fig. 3(a) and Fig. 3(b), we compare sample paths from the MPS hidden neural network as the standard deviation increases from $\sigma=0.5$ to $\sigma=1.0$. We find that as the standard deviation of the prior increases, the sample paths become more complicated, which reflects a greater data fitting ability. In Fig. 3(c) and Fig. 3(d), the standard deviation $\sigma$ is increased and the model complexity increases accordingly.

In Fig. 4(a) and Fig. 4(b), we plot sample surfaces from the MPS hidden neural network in the two dimensional input case. We can see that the complexity of the two dimensional sample surfaces increases as the standard deviation increases.

5 Conclusion

We study the infinite-width limit of tensor networks, specifically the pure MPS and the extended hybrid tensor neural networks obtained by combining the MPS layer with neural layers. We prove that the pure MPS and the hybrid tensor networks converge to GPs as the width of the tensor network goes to infinity. For the pure MPS, since no non-linearity is injected into the model, the equivalent GP induced by the pure MPS is trivial: each Gaussian random variable indexed by a specific data point has zero uncertainty band. However, we can extend the pure MPS to more complicated structures by introducing hidden layers or using neural kernels, so we study the infinite limit of two kinds of hybrid tensor neural networks. We prove that these two structures converge to GPs as their widths go to infinity and calculate the GP equivalent to the infinite-width tensor network. In the numerical experiments, we explore the sample paths and the roles of the hyperparameters of the tensor network induced GPs, and plot families of sample paths obtained by varying the standard deviations of the priors. Tensor networks have rich structures (intuitively, the number of contraction patterns grows exponentially with the number of tensor nodes) and powerful fitting ability, so the function space supported by tensor networks has a profound structure that admits well behaved limit processes. From the application perspective, over-parameterization and the huge function space supported by tensor networks also lead to training instability, poor generalization and scaling difficulties, which are essential problems in applying tensor networks to large data sets.

Acknowledgements

The authors wish to thank David Helmbold, Hongyun Wang, Qi Gong, Torsten Ehrhardt, Hai Lin and Francois Monard for their helpful discussions. Erdong Guo is grateful for the financial support by Ming-Ren Teahouse and Uncertainty Quantification LLC for this work.

References

  • Biamonte and Bergholm (2017) Biamonte, J. and Bergholm, V. (2017). Tensor networks in a nutshell. arXiv preprint arXiv:1708.00006.
  • Chabuda et al. (2020) Chabuda, K., Dziarmaga, J., Osborne, T. J. and Demkowicz-Dobrzański, R. (2020). Tensor-network approach for quantum metrology in many-body quantum systems. Nature Communications, 11 1–12.
  • Cheng et al. (2020) Cheng, S., Wang, L. and Zhang, P. (2020). Supervised learning with projected entangled pair states. arXiv preprint arXiv:2009.09932.
  • Cho and Saul (2009) Cho, Y. and Saul, L. (2009). Kernel methods for deep learning. Advances in neural information processing systems, 22 342–350.
  • Cichocki (2014) Cichocki, A. (2014). Era of big data processing: A new approach via tensor networks and tensor decompositions. arXiv preprint arXiv:1403.2048.
  • Daniely (2017) Daniely, A. (2017). Sgd learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems.
  • Daniely et al. (2016) Daniely, A., Frostig, R. and Singer, Y. (2016). Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances In Neural Information Processing Systems.
  • Evenbly and Vidal (2011a) Evenbly, G. and Vidal, G. (2011a). Quantum criticality with the multi-scale entanglement renormalization ansatz. Chapter in "Strongly Correlated Systems, Numerical Methods", edited by Adolfo Avella and Ferdinando Mancini. Springer Series in Solid-State Sciences, 176.
  • Evenbly and Vidal (2011b) Evenbly, G. and Vidal, G. (2011b). Tensor network states and geometry. Journal of Statistical Physics, 145 891–918.
  • Garriga-Alonso et al. (2018) Garriga-Alonso, A., Rasmussen, C. E. and Aitchison, L. (2018). Deep convolutional networks as shallow gaussian processes. arXiv preprint arXiv:1808.05587.
  • Glasser et al. (2018) Glasser, I., Pancotti, N. and Cirac, J. I. (2018). Supervised learning with generalized tensor networks. arXiv preprint arXiv:1806.05964.
  • Guo and Draper (2021) Guo, E. and Draper, D. (2021). The bayesian method of tensor networks. arXiv preprint arXiv:2101.00245.
  • Han et al. (2018) Han, Z.-Y., Wang, J., Fan, H., Wang, L. and Zhang, P. (2018). Unsupervised generative modeling using matrix product states. Physical Review X, 8 031012.
  • Hazan and Jaakkola (2015) Hazan, T. and Jaakkola, T. (2015). Steps toward deep kernel methods from infinite neural networks. arXiv preprint arXiv:1508.05133.
  • Jacot et al. (2018) Jacot, A., Gabriel, F. and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems.
  • Lee et al. (2017) Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J. and Sohl-Dickstein, J. (2017). Deep neural networks as gaussian processes. arXiv preprint arXiv:1711.00165.
  • Li and Zhang (2018) Li, Z. and Zhang, P. (2018). Shortcut matrix product states and its applications. arXiv preprint arXiv:1812.05248.
  • Liu et al. (2018) Liu, D., Ran, S.-J., Wittek, P., Peng, C., García, R. B., Su, G. and Lewenstein, M. (2018). Machine learning by two-dimensional hierarchical tensor networks: A quantum information theoretic perspective on deep architectures. stat, 1050 23.
  • MacKay (1998) MacKay, D. J. (1998). Introduction to gaussian processes. NATO ASI Series F Computer and Systems Sciences, 168 133–166.
  • Matthews et al. (2018) Matthews, A. G. d. G., Rowland, M., Hron, J., Turner, R. E. and Ghahramani, Z. (2018). Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271.
  • McMahon et al. (2020) McMahon, N. A., Singh, S. and Brennen, G. K. (2020). A holographic duality from lifted tensor networks. npj Quantum Information, 6 1–13.
  • Neal (1995) Neal, R. M. (1995). Bayesian Learning for Neural Networks. Ph.D. thesis, University of Toronto.
  • Novak et al. (2018) Novak, R., Xiao, L., Lee, J., Bahri, Y., Yang, G., Hron, J., Abolafia, D. A., Pennington, J. and Sohl-Dickstein, J. (2018). Bayesian deep convolutional networks with many channels are gaussian processes. arXiv preprint arXiv:1810.05148.
  • Novikov et al. (2018) Novikov, A., Trofimov, M. and Oseledets, I. (2018). Exponential machines. Bulletin of the Polish Academy of Sciences. Technical Sciences, 66.
  • Orús (2014) Orús, R. (2014). A practical introduction to tensor networks: Matrix product states and projected entangled pair states. Annals of Physics, 349 117–158.
  • Orús (2019) Orús, R. (2019). Tensor networks for complex quantum systems. Nature Reviews Physics, 1 538–550.
  • Pestun et al. (2017) Pestun, V., Terilla, J. and Vlassopoulos, Y. (2017). Language as a matrix product state. arXiv preprint arXiv:1711.01416.
  • Roux and Bengio (2007) Roux, N. L. and Bengio, Y. (2007). Continuous neural networks. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (M. Meila and X. Shen, eds.), vol. 2 of Proceedings of Machine Learning Research. PMLR, San Juan, Puerto Rico.
    http://proceedings.mlr.press/v2/leroux07a.html
  • Schröder et al. (2019) Schröder, F. A., Turban, D. H., Musser, A. J., Hine, N. D. and Chin, A. W. (2019). Tensor network simulation of multi-environmental open quantum dynamics via machine learning and entanglement renormalisation. Nature communications, 10 1–10.
  • Stoudenmire and Schwab (2016) Stoudenmire, E. and Schwab, D. J. (2016). Supervised learning with tensor networks. In Advances in Neural Information Processing Systems.
  • Stoudenmire (2018) Stoudenmire, E. M. (2018). Learning relevant features of data with multi-scale tensor networks. Quantum Science and Technology, 3 034003.
  • Williams (1997) Williams, C. K. (1997). Computing with infinite networks. In Advances in neural information processing systems.
  • Williams and Rasmussen (2006) Williams, C. K. and Rasmussen, C. E. (2006). Gaussian processes for machine learning, vol. 2. MIT press Cambridge, MA.
  • Yang (2019) Yang, G. (2019). Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760.

A. Infinite Width Neural Networks

In this appendix we prove the equivalence of the infinite width one hidden layer Neural Net and GP.

Lemma 1. Consider a sequence of random variables $\{X_{i}, i\in\{1,\cdots,k\}\}$,

\[
\forall i\in\{1,\cdots,k\},\quad X_{i}\sim\mathcal{N}(\mu_{i},\sigma_{i}).
\]

If $\forall\,\mathbf{t}\in\mathrm{R}^{k}$, $\mathbf{t}\cdot\mathbf{X}\sim\mathcal{N}(\mu,\sigma)$, then

\[
\mathbf{X} \sim \mathcal{N}(\mathbf{\mu},\mathbf{\Sigma}).
\]

Proof. Since $\forall\,\mathbf{t}\in\mathrm{R}^{k}$, $\mathbf{t}\cdot\mathbf{X}\sim\mathcal{N}(\mu,\sigma)$, we have

\[
\mathrm{E}[\mathbf{t}\cdot\mathbf{X}] = \mathbf{t}\cdot\mathbf{\mu}, \qquad \mathrm{Var}[\mathbf{t}\cdot\mathbf{X}] = \mathbf{t}^{T}\mathbf{\Sigma}\,\mathbf{t}.
\]

We can write the characteristic function $\phi_{\mathbf{X}}(\mathbf{t})$ of the random vector $\mathbf{X}=[X_{1},\cdots,X_{k}]$ as

\[
\phi_{\mathbf{X}}(\mathbf{t}) = \phi_{\mathbf{t}\cdot\mathbf{X}}(1) = \exp\Big(i\,\mathbf{t}\cdot\mathbf{\mu} - \frac{1}{2}\mathbf{t}^{T}\mathbf{\Sigma}\,\mathbf{t}\Big),
\]

which is just the characteristic function of the multivariate normal distribution $\mathcal{N}(\mathbf{\mu},\mathbf{\Sigma})$.

Theorem 1. Let the response $f(\mathbf{x})$ of the one hidden layer neural network with $j$ neurons be

\[
f(\mathbf{x}) = b + \sum_{j} w^{[1]}_{j} z_{j}(\mathbf{w}^{[0]};\mathbf{x}),
\]

where

\[
w_{j}^{[i]} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,\sigma_{i}), \qquad b \sim \mathcal{N}(0,\sigma_{b}).
\]

As $j$ goes to infinity, $f$ converges to the Gaussian Process

\[
f \sim \mathcal{GP}(m,\Sigma),
\]

where $m$ and $\Sigma$ are the mean and covariance functions.

Proof. Since all components of the first layer weights $\mathbf{w}^{[0]}$ are i.i.d., all components of the output of the first layer $z_{j}(\mathbf{w}^{[0]};\mathbf{x})$ are i.i.d.,

\[
(z_{j}|\mathbf{w}^{[0]},\mathbf{x}) \overset{\text{i.i.d.}}{\sim} p(z_{j}),
\]

then we can easily verify that

\[
(w^{[1]}_{j}z_{j}|\mathbf{w}^{[0]},\mathbf{w}^{[1]},\mathbf{x}) \overset{\text{i.i.d.}}{\sim} p(w^{[1]}_{j}z_{j}).
\]

In neural networks the activation functions are mostly well behaved, so the mean and the variance of the output $z_{j}$ are bounded:

\[
\mathrm{E}[w_{j}^{[1]}z_{j}] = 0, \qquad \mathrm{Var}[w^{[1]}_{j}z_{j}] = \mathrm{Var}[w^{[1]}_{j}]\,\mathrm{Var}[z_{j}] < +\infty.
\]

By the central limit theorem, we know that as $j$ goes to $\infty$ the normalized sum $\sum_{j}w_{j}^{[1]}z_{j}$ converges to a random variable with a Gaussian distribution, namely

\[
\frac{1}{n}\sum_{j}^{n} w_{j}^{[1]} z_{j} \sim \mathcal{N}(0,\sigma_{h}),
\]

where

\[
\sigma_{h}^{2} = \frac{1}{n}\sigma_{w^{[1]}}^{2}\sigma^{2}_{z}.
\]

Since the random variable $b$ is Gaussian, the sum of the two Gaussian variables, namely $f(\mathbf{x})$, satisfies

\[
f(\mathbf{x}) = b + \sum_{j} w_{j}^{[1]} z_{j}(\mathbf{w}^{[0]};\mathbf{x}) \sim \mathcal{N}\Big(0,\sqrt{\sigma_{b}^{2}+n\sigma_{h}^{2}}\Big).
\]

We can keep $n\sigma^{2}_{w^{[1]}}$ fixed during the limit process so that $f(\mathbf{x})$ remains well behaved. We have now proved that for a finite index set $\mathbf{T}$, which is just the training set $\{\mathbf{x}^{(i)}, i\in\mathrm{N}\}$, there exists a sequence of Gaussian random variables $\{f(\mathbf{x}^{(i)}),\mathbf{x}^{(i)}\in\mathbf{T}\}$. To extend the finite sequence to an infinite index set such as an interval $\mathbf{T}$, we need to verify that the finite dimensional distributions satisfy the Kolmogorov consistency (extension) theorem. It is known that as long as the f.d.d.s. are consistent, the stochastic process exists. In our neural network case the f.d.d.s. satisfy the two consistency conditions, so we have proved that the infinite width one hidden layer neural network is equivalent to a Gaussian process.

B. Infinite Width Tensor Networks

In this appendix, we prove the theorems on the equivalence of infinite width tensor networks and GPs. The main ideas of the proofs are similar to the proof of Theorem 1.

B.1 Proof of Theorem 2

There are three indices in each tensor node $\mathbf{A}^{s_{i}}_{\alpha_{i}\alpha_{i+1}}$, so as a first step we contract the kernel $\Phi^{s_{1},\cdots,s_{n}}(\mathbf{x})$ with the index sequence $\{s_{1},\cdots,s_{n}\}$ of the tensor network, and we get

\[
\sum_{\{\alpha_{i}\}} Z_{\alpha_{1}\alpha_{2}} \cdots Z_{\alpha_{n}\alpha_{1}}, \qquad (23)
\]

where

\[
Z_{\alpha_{i}\alpha_{i+1}} = \sum_{s_{i}} A^{s_{i}}_{\alpha_{i}\alpha_{i+1}} \phi^{s_{i}}(x_{i}). \qquad (24)
\]

It is easy to verify that all components of $Z_{\alpha_{i}\alpha_{i+1}}$ are i.i.d., since all components of a given tensor node $\mathbf{A}^{s_{i}}_{\alpha_{i}\alpha_{i+1}}$ are i.i.d. We note that in the assumptions of Theorem 2 we only require the tensor nodes to be independent, not i.i.d. For two specific assignments of the bond indices, namely two different value sequences $\{i_{1},\cdots,i_{n}\}$ and $\{j_{1},\cdots,j_{n}\}$, the products of the corresponding chains of nodes are i.i.d. We introduce a new notation for the product of a chain,

\[
K_{\{i_{1}\cdots i_{n}\}} = Z_{i_{1}i_{2}} \cdots Z_{i_{n}i_{1}}. \qquad (25)
\]

All different assignments of the index sequence $\{\alpha_{1}\cdots\alpha_{n}\}$ of $K$ are i.i.d. As the width of the tensor network goes to infinity, the number of possible assignments of the index sequence goes to infinity. By the C.L.T., $\psi$ converges to a GP.

B.2 Proof of Theorem 3

Building on the proof of Theorem 2, to prove Theorem 3 we just need to show that all components of the neural kernel are independent. It is obvious that all components of the output of the neural kernel $a_{i}(\mathbf{w};\mathbf{x})$ are independent. By reshaping the output of the neural kernel, we get the kernel $\Phi(\mathbf{x})$, all of whose components are independent. Then, letting the width of the MPS go to infinity, by the C.L.T. and Theorem 2 the response $\psi$ of the neural kernel MPS converges to a GP.

B.3 Proof of Theorem 4

By taking the infinite width limit of the hidden layer MPS, the components $\psi^{l}$ of the MPS response are all i.i.d. and follow a normal distribution by Theorem 2; it is then easy to verify that $\sum_{l}w_{l}a(\psi^{l})$ follows a Gaussian distribution by the C.L.T., and hence $b+\sum_{l}w_{l}a(\psi^{l})$ is a normal random variable. So $f$ converges to a GP as $l$ goes to infinity.

C. Approximation of Covariance Function

In this appendix, we explain in detail the calculation of the approximation of the covariance function of the MPS hidden layer neural network. For a general function $f(\mathbf{x};\mathbf{A})$, the expectation $\mathrm{E}_{A}[f(\mathbf{x};\mathbf{A})]$ can be written as

\[
\begin{aligned}
\mathrm{E}_{A}[f(\mathbf{x};A)] &= \mathrm{E}_{A}\Big[f(\mathbf{x};A_{0}) + \nabla f(\mathbf{x};A_{0})(A-A_{0}) + \frac{1}{2}(A-A_{0})^{T}\nabla^{2}f(\mathbf{x};A_{0})(A-A_{0}) + R(\mathbf{x};A)\Big] \qquad (26)\\
&= f(\mathbf{x};A_{0}) + \frac{1}{2}\mathrm{Var}[A]\cdot\mathrm{Diag}[H] + \mathrm{E}_{A}[R(\mathbf{x},A)], \qquad (27)
\end{aligned}
\]

where

\[
R(\mathbf{x},A) = \frac{1}{3!} f^{(3)}(\mathbf{x};\xi)(A-A_{0})^{3}, \quad \xi\in[A_{0},A].
\]

We write the second derivative matrix $H=\nabla^{2}f(\mathbf{x};A_{0})$ as

\[
H = (H^{\frac{1}{2}})^{T} H^{\frac{1}{2}}, \qquad (28)
\]

where $H^{\frac{1}{2}}$ is a formal square root of the second derivative matrix $H$. Since $H$ is a real symmetric matrix, $H^{\frac{1}{2}}$ exists and is symmetric. Then

\[
\begin{aligned}
\mathrm{E}_{A}\Big[\frac{1}{2}(A-A_{0})^{T}H(A-A_{0})\Big] &= \frac{1}{2}\mathrm{Tr}\,\mathrm{E}_{A}\big[(A-A_{0})^{T}(H^{\frac{1}{2}})^{T}H^{\frac{1}{2}}(A-A_{0})\big] \qquad (29)\\
&= \frac{1}{2}\mathrm{Tr}\,\mathrm{E}_{A}\big[\big(H^{\frac{1}{2}}(A-A_{0})\big)\big((A-A_{0})^{T}(H^{\frac{1}{2}})^{T}\big)\big] \qquad (30)\\
&= \frac{1}{2}\mathrm{Tr}\big[H^{\frac{1}{2}}\,\mathrm{E}_{A}[(A-A_{0})(A-A_{0})^{T}]\,(H^{\frac{1}{2}})^{T}\big] \qquad (31)\\
&= \frac{1}{2}\mathrm{Tr}\big[\mathrm{Cov}[A-A_{0},A-A_{0}]\,H\big] \qquad (32)\\
&= \frac{1}{2}\mathrm{Tr}\big[\mathrm{Cov}[A,A]\,H\big] \qquad (33)\\
&= \frac{1}{2}\mathrm{Var}[A]\cdot\mathrm{Diag}[H]. \qquad (34)
\end{aligned}
\]

Here we use $\mathrm{Var}[A]$ to denote the diagonal part of the covariance matrix $\mathrm{Cov}[A,A]$, and $\mathrm{Diag}[H]$ to denote the diagonal part of the second derivative matrix $\nabla^{2}f(\mathbf{x};A)$ of the integrand $f(\mathbf{x};A)$.
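A quick numerical check of the quadratic-form identity used in Eqs. (29)-(34), with a diagonal covariance matching the independence assumption on the components of $A$; the dimension and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 6
A0 = rng.normal(size=d)
var = rng.random(d) + 0.1                      # diagonal covariance (independent components)
H = rng.normal(size=(d, d)); H = H + H.T       # a symmetric second derivative matrix

# Monte Carlo estimate of E[(A - A0)^T H (A - A0)]
A = A0 + rng.normal(size=(500_000, d)) * np.sqrt(var)
mc = np.mean(np.einsum('ni,ij,nj->n', A - A0, H, A - A0))

# the identity: Tr[Cov H], which reduces to Var[A] . Diag[H] for diagonal Cov
print(mc, np.trace(np.diag(var) @ H), var @ np.diag(H))
```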