
On Infinite-Width Hypernetworks

Etai Littwin
School of Computer Science
Tel Aviv University
Tel Aviv, Israel
[email protected]
Tomer Galanti
School of Computer Science
Tel Aviv University
Tel Aviv, Israel
[email protected]
Lior Wolf
School of Computer Science
Tel Aviv University
Tel Aviv, Israel
[email protected]
Greg Yang
Microsoft Research AI
[email protected]
Equal Contribution
Abstract

Hypernetworks are architectures that produce the weights of a task-specific primary network. A notable application of hypernetworks in the recent literature involves learning to output functional representations. In these scenarios, the hypernetwork learns a representation corresponding to the weights of a shallow MLP, which typically encodes shape or image information. While such representations have seen considerable success in practice, they lack the theoretical guarantees that standard architectures enjoy in the wide regime. In this work, we study wide over-parameterized hypernetworks. We show that, unlike typical architectures, infinitely wide hypernetworks do not guarantee convergence to a global minimum under gradient descent. We further show that convexity can be achieved by increasing the dimensionality of the hypernetwork's output, so that it represents wide MLPs. In the dually infinite-width regime, we identify the functional priors of these architectures by deriving their corresponding GP and NTK kernels, the latter of which we refer to as the hyperkernel. As part of this study, we make a mathematical contribution by deriving tight bounds on high order Taylor expansion terms of standard fully connected ReLU networks.

1 Introduction


In this work, we analyze the training dynamics of over-parameterized meta networks, which are networks that output the weights of other networks, often referred to as hypernetworks. In the typical framework, a function h involves two networks, f and g. The hypernetwork f takes the input x (typically an image) and returns the weights of the primary network, g, which then takes the input z and returns the output of h.

The literature of hypernetworks is roughly divided into two main categories. In the functional representation literature [22, 33, 39, 30, 16], the input to the hypernetwork f is typically an image. For shape reconstruction tasks, the network g represents the shape via a signed distance field, where the inputs are coordinates in 3D space. In image completion tasks, the inputs to g are image coordinates, and the output is the corresponding pixel intensity. In these settings, f is typically a large network and g is typically a shallow fully connected network.

In the second category [4, 23, 36, 44], hypernetworks are typically used for hyper-parameter search, where x is treated as a hyperparameter descriptor and is optimized alongside the network's weights. In this paper, we consider models corresponding to the first group of methods.

Following a prominent thread in the recent literature, our study takes place in the regime of wide networks. [11] recently showed that, when the width of the network approaches infinity, the gradient-descent training dynamics of a fully connected network f can be characterized by a kernel, called the Neural Tangent Kernel (or NTK for short). In other words, as the width of each layer approaches infinity, provided with proper scaling and initialization of the weights, it holds that:

\frac{\partial f(x;w)}{\partial w}\cdot\frac{\partial^{\top}f(x^{\prime};w)}{\partial w}\to\Theta^{f}(x,x^{\prime}) (1)

as the width n of f tends to infinity. Here, w are the weights of the network f(x;w). As shown in [11], as the width tends to infinity, when minimizing the squared loss using gradient descent, the evolution through time of the function computed by the network follows the dynamics of kernel gradient descent with kernel \Theta^{f}. To prove this phenomenon, various papers [20, 3, 2] introduce a Taylor expansion of the network output around the point of initialization and consider its values. It is shown that the first-order term remains deterministic during SGD optimization and that the higher-order terms converge to zero as the width n tends to infinity.
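
To make Eq. 1 concrete, here is a minimal sketch (not the authors' code) of the empirical NTK of a small NTK-parameterized ReLU network, computed in JAX as a Jacobian inner product; the function names, widths, and inputs are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def init_mlp(key, sizes):
    # One standard-normal weight matrix per layer (NTK parameterization).
    keys = jax.random.split(key, len(sizes) - 1)
    return [jax.random.normal(k, (m, n)) for k, m, n in zip(keys, sizes[1:], sizes[:-1])]

def mlp(weights, x):
    # Forward pass with the 1/sqrt(fan_in) and sqrt(2) factors applied explicitly.
    h = x
    for W in weights[:-1]:
        h = jnp.sqrt(2.0) * jax.nn.relu(W @ h / jnp.sqrt(h.shape[0]))
    return (weights[-1] @ h / jnp.sqrt(h.shape[0]))[0]  # scalar output

def empirical_ntk(weights, x1, x2):
    # Jacobian inner product of Eq. 1, summed over all weight matrices.
    g1, g2 = jax.grad(mlp)(weights, x1), jax.grad(mlp)(weights, x2)
    return sum(jnp.vdot(a, b) for a, b in zip(g1, g2))

key = jax.random.PRNGKey(0)
w = init_mlp(key, [10, 2048, 2048, 1])
x1, x2 = jnp.ones(10), jnp.linspace(-1.0, 1.0, 10)
print(empirical_ntk(w, x1, x2))  # concentrates around a deterministic limit as the width grows
```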

A natural question that arises when considering hypernetworks is whether a similar “wide” regime exists, where trained and untrained networks may be functionally approximated by kernels. If so, since this architecture involves two networks, the “wide” regime needs a more refined definition, taking into account both networks.

Our contributions:

  1.

    We show that infinitely wide hypernetworks can induce highly non-convex training dynamics under gradient descent. The complexity of the optimization problem is highly dependent on the architecture of the primary network g, which may considerably impair the trainability of the architecture if not defined appropriately.

  2.

    However, when the widths of both the hypernetwork f and the primary network g tend to infinity, the optimization dynamics of the hypernetwork simplifies, and its neural tangent kernel (which we call the hyperkernel) has a well defined infinite-width limit governing the network evolution.

  3.

    We verify our theory empirically and also demonstrate the utility of this hyperkernel on several functional representation tasks. Consistent with prior observations on kernel methods, the hypernetwork-induced kernels also outperform a trained hypernetwork when the training data is small.

  4.

    We make a technical contribution by deriving asymptotically tight bounds on high order Taylor expansion terms in ReLU MLPs. Our result partially settles a conjecture posed in [6] regarding the asymptotic behavior of general correlation functions.

1.1 Related Works

Hypernetworks

Hypernetworks, first introduced under this name in [8], are networks that generate the weights of a second, primary network that computes the actual task. However, the idea of having one network predict the weights of another was proposed earlier and has reemerged multiple times [15, 29, 13]. The tool can naturally be applied to image representation tasks. In [22], hypernetworks were applied to 3D shape reconstruction from a single image. In [33], hypernetworks were shown to be useful for learning shared image representations. Hypernetworks were also shown to be effective in non-image domains. For instance, hypernetworks achieve state-of-the-art results on the task of decoding error correcting codes [25].

Several publications consider a different framework, in which the inputs x of the hypernetwork are optimized alongside the weights of the hypernetwork. In this setting, hypernetworks were recently used for continual learning by [36]. Hypernetworks can be efficiently used for neural architecture search, as was demonstrated by [4, 44], where a feedforward regression (with the network f) replaces direct gradient-based learning of the weights of the primary network while its architecture is being explored. Lorraine et al. applied hypernetworks to hyperparameter selection [23].

Despite their success and increasing prominence, little theoretical work has been done to better understand hypernetworks and their behavior. A recent paper [12] studies the role of multiplicative interactions within a unifying framework that describes a range of classical and modern neural network architectural motifs, such as gating, attention layers, hypernetworks, and dynamic convolutions, amongst others. It is shown that standard neural networks are a strict subset of neural networks with multiplicative interactions. In [7], the authors theoretically study the modular properties of hypernetworks. In particular, they show that, compared to standard embedding methods, hypernetworks are exponentially more expressive when the primary network is of small complexity. In this work, we provide a complementary perspective and show that a shallow primary network is a requirement for successful training. [5] showed that applying standard initializations to a hypernetwork produces a sub-optimal initialization of the primary network. A principled technique for weight initialization in hypernetworks is then developed.

Gaussian Processes and Neural Tangent Kernel

The connection between infinitely wide neural networks, Gaussian processes, and kernel methods has been the focus of many recent papers [11, 19, 40, 42, 32, 31, 38, 37, 26]. Empirical studies have demonstrated the power of the CNTK (convolutional neural tangent kernel) on popular datasets, establishing new state-of-the-art results for kernel methods [1, 43]. [21] showed that ReLU ResNets [10] can have NTK convergence occur even when depth and width simultaneously tend to infinity, provided proper initialization. In this work, we extend the kernel analysis of networks to hypernetworks, and characterize the regime in which the kernels converge and the training dynamics simplify.

2 Setup

In this section, we introduce the setting of the analysis considered in this paper. We begin by defining fully connected neural networks and hypernetworks in the context of the NTK framework.

Neural networks   In the NTK framework, a fully connected neural network, f(x;w)=y^{L}(x), is defined in the following manner:

\begin{cases}y^{l}(x)=\sqrt{\frac{1}{n_{l-1}}}W^{l}q^{l-1}(x)\\ q^{l}(x)=\sqrt{2}\cdot\sigma(y^{l}(x))\end{cases}\textnormal{ and }q^{0}(x)=x\,, (2)

where \sigma:\mathbb{R}\to\mathbb{R} is the activation function of f. Throughout the paper, we specifically take \sigma to be a piece-wise linear function with a finite number of pieces (e.g., the ReLU activation \textnormal{ReLU}(x):=\max(0,x) and the Leaky ReLU activation \textnormal{ReLU}_{\alpha}(x)=\begin{cases}x&\textnormal{ if }x\geq 0\\ \alpha x&\textnormal{ if }x<0\end{cases}). The weight matrices W^{l}\in\mathbb{R}^{n_{l}\times n_{l-1}} are trainable variables, initialized independently according to a standard normal distribution, W^{l}_{i,j}\sim\mathcal{N}(0,1). The width of f is denoted by n:=\min(n_{1},\dots,n_{L-1}). The parameters w are aggregated as a long vector w=(vec(W^{1}),\dots,vec(W^{L})). The coefficients \sqrt{1/n_{l-1}} serve for normalizing the activations of each layer. This parametrization is nonstandard, and we will refer to it as the NTK parameterization. It has already been employed in several recent works [14, 35, 27]. For simplicity, in many cases, we will omit to specify the weights w associated with our model.

Hypernetworks   Given the input tuple u=(x,z)\in\mathbb{R}^{n_{0}+m_{0}}, we consider models of the form: h(u;w):=g(z;f(x;w)), where f(x;w) and g(z;v) are two neural network architectures of depth L and H, respectively. The function f(x;w), referred to as the hypernetwork, takes the input x and computes the weights v=f(x;w) of a second neural network g(z;v), referred to as the primary network, which is assumed to output a scalar. As before, the variable w stands for a vector of trainable parameters (v is not trained directly and is given by f).

We parameterize the primary network g(z;v)=g^{H}(z;v) as follows:

\begin{cases}g^{l}(z;v)=\sqrt{\frac{1}{m_{l-1}}}V^{l}\cdot a^{l-1}(z;v)\\ a^{l}(z;v)=\sqrt{2}\cdot\phi(g^{l}(z;v))\end{cases}\textnormal{ and }a^{0}(z)=z (3)

Here, the weights of the primary network V^{l}(x)\in\mathbb{R}^{m_{l}\times m_{l-1}} are given in a concatenated vector form by the output of the hypernetwork f(x;w)=v=(vec(V^{1}),\dots,vec(V^{H})). The output dimension of the hypernetwork f is therefore n_{L}=\sum^{H}_{i=1}m_{i}\cdot m_{i-1}. We denote by f^{d}(x;w):=V^{d}(x;w):=V^{d} the d'th output matrix of f(x;w). The width of g is denoted by m:=\min(m_{1},\dots,m_{H-1}). The function \phi is an element-wise continuously differentiable function or a piece-wise linear function with a finite number of pieces.
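
To make Eqs. 2 and 3 concrete, the sketch below (our own illustration, not the paper's implementation) builds an NTK-parameterized hypernetwork f whose final layer emits the concatenated weights v of a two-layer primary network g, which is then applied to z; the layer sizes and function names are assumptions.

```python
import jax
import jax.numpy as jnp

def scaled_matmul(W, h):
    # Applies the 1/sqrt(fan_in) factor of the NTK parameterization.
    return W @ h / jnp.sqrt(h.shape[0])

def init_f(key, sizes):
    keys = jax.random.split(key, len(sizes) - 1)
    return [jax.random.normal(k, (m, n)) for k, m, n in zip(keys, sizes[1:], sizes[:-1])]

def forward_f(weights, x):
    # Hypernetwork f (Eq. 2): a ReLU MLP whose output is the flattened weight vector v.
    h = x
    for W in weights[:-1]:
        h = jnp.sqrt(2.0) * jax.nn.relu(scaled_matmul(W, h))
    return scaled_matmul(weights[-1], h)

def forward_g(v, z, m1):
    # Primary network g (Eq. 3) with one hidden layer of width m1; v = (vec(V^1), vec(V^2)).
    m0 = z.shape[0]
    V1 = v[: m1 * m0].reshape(m1, m0)
    V2 = v[m1 * m0:].reshape(1, m1)
    a1 = jnp.sqrt(2.0) * jax.nn.relu(V1 @ z / jnp.sqrt(m0))
    return (V2 @ a1 / jnp.sqrt(m1))[0]

def hypernet(weights, x, z, m1=64):
    # h(u; w) = g(z; f(x; w)) with u = (x, z); only w, the weights of f, is trainable.
    return forward_g(forward_f(weights, x), z, m1)

key = jax.random.PRNGKey(0)
n0, m0, m1 = 16, 2, 64
w = init_f(key, [n0, 512, 512, m1 * m0 + 1 * m1])
x, z = jnp.ones(n0), jnp.array([1.0, -1.0])
print(hypernet(w, x, z))
```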

Optimization   Let S=\{(u_{i},y_{i})\}^{N}_{i=1}, where u_{i}=(x_{i},z_{i}), be some dataset and let \ell(a,b):=|a-b|^{p} be the \ell^{p}-loss function. For a given hypernetwork h(u;w), we are interested in selecting the parameters w that minimize the empirical risk:

c(w):=\sum^{N}_{i=1}\ell(h(u_{i};w),y_{i}) (4)

For simplicity, oftentimes we will simply write \ell_{i}(a):=\ell(a,y_{i}) and h_{i}(w):=h(u_{i}):=h(u_{i};w), depending on the context. In order to minimize the empirical error c(w), we consider the SGD method with learning rate \mu>0 and steps of the form w_{t+1}\leftarrow w_{t}-\mu\nabla_{w}\ell_{j_{t}}(h_{j_{t}}(w_{t})) for some index j_{t}\sim U[N] that is selected uniformly at random for the t'th iteration. A continuous version of the GD method is the gradient flow method, in which \dot{w}=-\mu\nabla_{w}c(w). In recent works [14, 20, 1, 35], the optimization dynamics of the gradient method for standard fully-connected neural networks was analyzed, as the network width tends to infinity. In our work, since hypernetworks consist of two interacting neural networks, there are multiple ways in which the size can tend to infinity. We consider two cases: (i) the width of f tends to infinity and that of g is fixed, and (ii) the widths of both f and g tend to infinity.

3 Dynamics of Hypernetworks

Infinitely wide f without infinitely wide g induces non-convex optimization

In the NTK literature, it is common to adopt a functional view of the network evolution by analyzing the dynamics of the output of the network along with the cost, typically a convex function, as a function of that output. In the hypernetwork case, this presents us with two possible viewpoints of the same optimization problem of h(u)=g(z;f(x)). On one hand, since only the hypernetwork f contains the trainable parameters, we can view the optimization of h under the loss \ell as training of f under the loss \ell\circ g. The classical NTK theory would imply that f evolves linearly when its width tends to infinity, but because \ell\circ g is in general no longer convex, even when \ell originally is, an infinitely wide f without an infinitely wide g does not guarantee convergence to a global optimum. In what follows, we make this point precise by characterising how nonlinear the dynamics becomes in terms of the depth of g.

After a single stochastic gradient descent step with learning rate \mu, the hypernetwork output for example i is given by h_{i}\big{(}w-\mu\nabla_{w}\ell_{j}\big{)}. When computing the Taylor approximation around w with respect to the function h at the point w^{\prime}=w-\mu\nabla_{w}\ell_{j}, it holds that:

\displaystyle h_{i}\big{(}w-\mu\nabla_{w}\ell_{j}\big{)}=\sum_{r=0}^{\infty}\frac{1}{r!}\langle\nabla^{(r)}_{w}h_{i},(-\mu\nabla_{w}\ell_{j})^{r}\rangle=\sum_{r=0}^{\infty}\frac{1}{r!}\left(-\mu\frac{\partial\ell_{j}}{\partial h_{j}}\right)^{r}\cdot\mathcal{K}^{(r)}_{i,j} (5)

where \mathcal{K}^{(r)}_{i,j}:=\langle\nabla^{(r)}_{w}h_{i},(\nabla_{w}h_{j})^{r}\rangle, and \nabla^{(r)}_{w}h_{i} is the rank-r tensor that holds the r'th derivative of the output h_{i}. The terms \langle\nabla^{(r)}_{w}h_{i},(-\mu\nabla_{w}\ell_{j})^{r}\rangle are the multivariate extensions of the Taylor expansion terms h^{(r)}_{i}(w)(w^{\prime}-w)^{r}, and take the general form of correlation functions as introduced in Appendix C. This equation holds exactly for neural networks with smooth activation functions (including hypernetworks), and holds approximately for piece-wise linear activation functions.

Previous works have shown that, if h is a wide fully connected network, the first order term (r=1) converges to the NTK, while higher order terms (r>1) scale like \mathcal{O}(1/\sqrt{n}) [20, 6]. Hence, for large widths and small learning rates, these higher order terms vanish, and the loss surface appears deterministic and linear at initialization, and remains so during training.

However, the situation is more complex for hypernetworks. As shown in the following theorem, for infinitely wide hypernetworks and a finite primary network, the behaviour depends on the depth and width of the generated primary network. Specifically, when the primary network is deep and narrow, the higher order terms in Eq. 5 may not vanish, and the parameter dynamics can be highly non-convex.

Theorem 1 (Higher order terms for hypernetworks).

Let h(u)=g(z;f(x)) for a hypernetwork f and a primary network g. Then, we have:

\mathcal{K}^{(r)}_{i,j}\sim\begin{cases}n^{H-r}&\text{if $r>H$}\\ 1&\text{otherwise.}\end{cases} (6)

Thm. 1 illustrates the effect of the depth of the primary network g on the evolution of the output h. The larger H is, the more non-linear the evolution is, even when f is infinitely wide. Indeed, we observe empirically that when f is wide and kept fixed, a deeper g incurs slower training and lower overall test performance, as illustrated in Fig. 2.

As a special case of this theorem, when taking H=1, we can also derive the asymptotic behaviour \mathcal{K}^{(r)}_{i,j}\sim n^{1-r} for a standard neural network h. This provides a tighter bound than the previously conjectured \mathcal{O}(1/n) upper bound [6]. The following remark is a consequence of this result and is validated in the supplementary material.

Remark 1.

The r'th order term of the Taylor expansion in Eq. 5 is of order \mathcal{O}(\frac{\mu^{r}}{r!\cdot n^{r-1}}) instead of the previously postulated \mathcal{O}(\frac{\mu^{r}}{r!\cdot n}). Therefore, it is evident that for any choice \mu=o(\sqrt{n}), all of the high order terms tend to zero as n\to\infty. This is in contrast to the previous bound, which guarantees that all of the high order terms tend to zero as n\to\infty only when \mu is constant.

4 Dually Infinite Hypernetworks

It has been shown by [11, 19] that NNGPs and neural tangent kernels fully characterise the training dynamics of infinitely wide networks. As a result, in various publications [21, 9], these kernels are being treated as functional priors of neural networks. In the previous section, we have shown that the Taylor expansion of the hypernetwork is non-linear when the size of the primary network is finite. In this section, we consider the case when both hyper and primary networks are infinitely wide, with the intention of gaining insight into the functional prior of wide hypernetworks. For this purpose, we draw a formal correspondence between infinitely wide hypernetworks and GPs, and use this connection to derive the corresponding neural tangent kernel.

4.1 The NNGP kernel

Previous works have shown the equivalence between popular architectures and Gaussian processes, when the width of the architecture tends to infinity. This equivalence has sparked renewed interest in kernel methods, through the corresponding NNGP kernel and the Neural Tangent Kernel (NTK) induced by the architecture, which fully characterise the training dynamics of infinitely wide networks. This equivalence has recently been unified to encompass most architectures that use a pre-defined set of generic computational blocks [40, 41]. Hypernetworks represent a different class of neural networks, where the parameters are randomly initialized matrices, except for the last layer, whose parameters are aggregated as a rank-3 tensor. All of the matrix/tensor dimensions tend to infinity. This means that the results of [40, 41] do not apply to hypernetworks. Nevertheless, by considering sequential limit taking, where we take the limit of the width of f ahead of the width of g, we show that the output of f exhibits GP behaviour, essentially feeding g with Gaussian distributed weights with adaptive variances. A formal argument is presented in the following theorem.

Figure 1: Convergence to the hyperkernel. (a) Empirical variance of kernel values in log-scale for a single entry, for varying widths of f and g. The variance of the kernel converges to zero only when the widths of f and g both increase. (b) Empirical kernel value for z=(1,0), z^{\prime}=(\cos(\theta),\sin(\theta)), and x=x^{\prime}=(1,0) for different values of \theta\in[-\frac{\pi}{2},\frac{\pi}{2}]. Convergence to a deterministic kernel is observed only when both f and g are wide.
Theorem 2 (Hypernetworks as GPs).

Let h(u)=g(z;f(x)) be a hypernetwork. For any pair of inputs u=(x,z) and u^{\prime}=(x^{\prime},z^{\prime}), let \Sigma^{0}(z,z^{\prime})=\frac{z^{\top}z^{\prime}}{m_{0}}, S^{0}(x,x^{\prime})=\frac{x^{\top}x^{\prime}}{n_{0}}. Then, it holds for any unit i in layer 0<l\leq H of the primary network:

g^{l}_{i}(z;f(x))\stackrel{{\scriptstyle d}}{{\longrightarrow}}\mathcal{G}_{i}^{l}(u) (7)

as m,n\to\infty sequentially. Here, \{\mathcal{G}_{i}^{l}(u)\}_{i=1}^{m_{l}} are independent Gaussian processes, such that, (\mathcal{G}_{i}^{l}(u),\mathcal{G}_{i}^{l}(u^{\prime}))\sim\mathcal{N}\big{(}0,\Lambda^{l}(u,u^{\prime})\big{)}, defined by the following recursion:

\displaystyle\Lambda^{l+1}(u,u^{\prime})=\begin{pmatrix}\Sigma^{l}(u,u)&\Sigma^{l}(u^{\prime},u)\\ \Sigma^{l}(u,u^{\prime})&\Sigma^{l}(u^{\prime},u^{\prime})\end{pmatrix}\bigodot\begin{pmatrix}S^{L}(x,x)&S^{L}(x^{\prime},x)\\ S^{L}(x,x^{\prime})&S^{L}(x^{\prime},x^{\prime})\end{pmatrix} (8)
\Sigma^{l}(u,u^{\prime})=2\operatorname*{\mathbb{E}}_{(u,v)\sim\mathcal{N}(0,\Lambda^{l})}[\sigma(u)\cdot\sigma(v)] (9)

where S^{L}(x,x^{\prime}) is defined recursively:

\displaystyle S^{l}(x,x^{\prime})=2\operatorname*{\mathbb{E}}_{(u,v)\sim\mathcal{N}(0,\Gamma^{l})}[\sigma(u)\cdot\sigma(v)]~{}\textnormal{ and }~{}\Gamma^{l}(x,x^{\prime})=\begin{pmatrix}S^{l}(x,x)&S^{l}(x^{\prime},x)\\ S^{l}(x,x^{\prime})&S^{l}(x^{\prime},x^{\prime})\end{pmatrix} (10)

In other words, the NNGP kernel, governing the behaviour of wide untrained hypernetworks, is given by the Hadamard product of the GP kernels of f and g (see Eq. 8).

As a consequence of the above theorem, we observe that the NNGP kernel of h at each layer, \Lambda^{l}(u,u^{\prime}), is simply a function of \Sigma^{0}(z,z^{\prime}) and S^{0}(x,x^{\prime}).

Corollary 1.

Let h(u)=g(z;f(x)) be a hypernetwork. For any 0<l\leq H, there exists a function \mathcal{F}^{l}, such that, for all pairs of inputs u=(x,z) and u^{\prime}=(x^{\prime},z^{\prime}), it holds that:

\Lambda^{H}(u,u^{\prime})=\mathcal{F}\left(\Sigma^{0}(z,z^{\prime}),S^{0}(x,x^{\prime})\right) (11)

The factorization of the NNGP kernel into a function of \Sigma^{0}(z,z^{\prime}) and S^{0}(x,x^{\prime}) provides a convenient way to explicitly encode useful invariances into the kernel.
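
For ReLU activations, the Gaussian expectations in Eqs. 9-10 have a closed form (the degree-1 arc-cosine kernel). The sketch below uses that closed form to evaluate the recursion of Theorem 2 and obtain \Lambda^{H}(u,u^{\prime}); it is our own illustration, the function names are assumptions, and the layer-index conventions may differ from the paper's by one.

```python
import jax.numpy as jnp

def relu_expect(kaa, kbb, kab):
    # Closed form of 2*E[relu(u)relu(v)] for (u, v) ~ N(0, [[kaa, kab], [kab, kbb]])
    # (degree-1 arc-cosine kernel, matching the sqrt(2) scaling used in Eqs. 9-10).
    norm = jnp.sqrt(kaa * kbb)
    cos_t = jnp.clip(kab / norm, -1.0, 1.0)
    theta = jnp.arccos(cos_t)
    return norm * (jnp.sin(theta) + (jnp.pi - theta) * cos_t) / jnp.pi

def nngp_f(x, xp, depth_L):
    # S^l recursion for the hypernetwork f (Eq. 10): returns S^L(x, x), S^L(x', x'), S^L(x, x').
    sxx, spp, sxp = x @ x / len(x), xp @ xp / len(xp), x @ xp / len(x)
    for _ in range(depth_L):
        sxx, spp, sxp = (relu_expect(sxx, sxx, sxx),
                         relu_expect(spp, spp, spp),
                         relu_expect(sxx, spp, sxp))
    return sxx, spp, sxp

def nngp_h(x, z, xp, zp, depth_L, depth_H):
    # Lambda^H(u, u') of Theorem 2: each layer is the Hadamard product of Sigma^l and S^L (Eq. 8).
    sxx, spp, sxp = nngp_f(x, xp, depth_L)
    lxx = (z @ z / len(z)) * sxx
    lpp = (zp @ zp / len(zp)) * spp
    lxp = (z @ zp / len(z)) * sxp
    for _ in range(depth_H - 1):
        lxx, lpp, lxp = (relu_expect(lxx, lxx, lxx) * sxx,
                         relu_expect(lpp, lpp, lpp) * spp,
                         relu_expect(lxx, lpp, lxp) * sxp)
    return lxp

x, xp = jnp.array([1.0, -1.0]), jnp.array([1.0, 1.0])
z, zp = jnp.array([1.0, 0.0]), jnp.array([0.0, 1.0])
print(nngp_h(x, z, xp, zp, depth_L=4, depth_H=4))
```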

As an example, in the following remark, we investigate the behaviour of the NNGP kernel of h when the inputs z are preprocessed with random Fourier features, as suggested by [28, 34].

Remark 2.

Let p(z)=[\cos(W^{1}_{i}z+b^{1}_{i})]^{k}_{i=1} be a Fourier features preprocessing, where W^{1}_{i,j}\sim\mathcal{N}(0,1) and the biases b_{i}\sim U[-\pi,\pi]. Let h(u)=g(p(z);f(x)) be a hypernetwork, with z preprocessed according to p. Let u=(x,z) and u^{\prime}=(x^{\prime},z^{\prime}) be two pairs of inputs. Then, \Lambda^{l}(u,u^{\prime}) is a function of \exp[-\|z-z^{\prime}\|^{2}_{2}/2] and S^{L}(x,x^{\prime}).

The above remark shows that for any given inputs x,x^{\prime}, the NNGP kernel depends on z,z^{\prime} only through the distance between z and z^{\prime}, which has been shown to be especially useful in implicit neural representation [34].
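
A small sketch of the preprocessing in Remark 2, under the stated sampling of W^{1} and b; the feature count and the 2/k normalization used in the check are our assumptions. As k grows, the empirical feature inner product approaches a function of \|z-z^{\prime}\|_{2} only, consistent with the remark.

```python
import jax
import jax.numpy as jnp

def fourier_features(key, z, k):
    # p(z) = [cos(W_i z + b_i)]_{i=1..k} with W ~ N(0, 1) and b ~ U[-pi, pi], as in Remark 2.
    kw, kb = jax.random.split(key)
    W = jax.random.normal(kw, (k, z.shape[0]))
    b = jax.random.uniform(kb, (k,), minval=-jnp.pi, maxval=jnp.pi)
    return jnp.cos(W @ z + b)

key = jax.random.PRNGKey(0)
z, zp = jnp.array([0.3, -0.7]), jnp.array([0.1, 0.4])
k = 100_000
p_z, p_zp = fourier_features(key, z, k), fourier_features(key, zp, k)
# Empirical feature inner product vs. the Gaussian kernel exp(-||z - z'||^2 / 2)
print(2.0 * (p_z @ p_zp) / k, jnp.exp(-jnp.sum((z - zp) ** 2) / 2.0))
```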

We next derive the corresponding neural tangent kernel of hypernetworks, referred to as hyperkernels.

Figure 2: A hypernetwork with a wider and shallower primary network g trains faster and achieves better test performance on the MNIST (top) and CIFAR10 (bottom) rotation prediction task. We fix the hypernetwork f and the depth of g at (a) 3, (b) 6 and (c) 8, while varying the width m of g. The x-axis specifies the epoch and the y-axis the accuracy at test time.

4.2 The Hyperkernel

Recall the definition of the NTK as the infinite width limit of the Jacobian inner product given by:

\displaystyle\mathcal{K}^{h}(u,u^{\prime})=\frac{\partial h(u)}{\partial w}\cdot\frac{\partial^{\top}h(u^{\prime})}{\partial w}=\frac{\partial g(z;f(x))}{\partial f(x)}\cdot\mathcal{K}^{f}(x,x^{\prime})\cdot\frac{\partial^{\top}g(z^{\prime};f(x^{\prime}))}{\partial f(x^{\prime})} (12)

where \mathcal{K}^{f}(x,x^{\prime}):=\frac{\partial f(x)}{\partial w}\cdot\frac{\partial^{\top}f(x^{\prime})}{\partial w} and \mathcal{K}^{g}(u,u^{\prime}):=\frac{\partial g(z;f(x))}{\partial f(x)}\cdot\frac{\partial^{\top}g(z^{\prime};f(x^{\prime}))}{\partial f(x^{\prime})}. In the following theorem, we show that \mathcal{K}^{h}(u,u^{\prime}) converges in probability at initialization to a limiting kernel, denoted by \Theta^{h}(u,u^{\prime}), in the sequentially infinite width limit of f and g. Furthermore, we show that the hyperkernel decomposes into the Hadamard product of the kernels corresponding to f and g. In addition, we show that the derivative of the hyperkernel with respect to time tends to zero at initialization.

Theorem 3 (Hyperkernel decomposition and convergence at initialization).

Let h(u;w)=g(z;f(x;w)) be a hypernetwork. Then,

\mathcal{K}^{h}(u,u^{\prime})\stackrel{{\scriptstyle p}}{{\longrightarrow}}\Theta^{h}(u,u^{\prime}) (13)

where:

\Theta^{h}(u,u^{\prime})=\Theta^{f}(x,x^{\prime})\cdot\Theta^{g}(u,u^{\prime},S^{L}(x,x^{\prime})) (14)

such that:

\mathcal{K}^{f}(x,x^{\prime})\stackrel{{\scriptstyle p}}{{\longrightarrow}}\Theta^{f}(x,x^{\prime})\cdot I\textnormal{ and }\mathcal{K}^{g}(u,u^{\prime})\stackrel{{\scriptstyle p}}{{\longrightarrow}}\Theta^{g}(u,u^{\prime},S^{L}(x,x^{\prime})) (15)

Moreover, if w evolves according to gradient flow, we have:

\frac{\partial\mathcal{K}^{h}(u,u^{\prime})}{\partial t}\Big{|}_{t=0}\stackrel{{\scriptstyle p}}{{\longrightarrow}}0 (16)

where the limits are taken with respect to m,n\to\infty sequentially.

As a consequence of Thm. 3, when applying a Fourier features preprocessing to z, one obtains that \Theta^{g}(u,u^{\prime}) becomes shift invariant.

Remark 3.

Let p(z) be as in Remark 2. Let h(u)=g(p(z);f(x)) be a hypernetwork, where z is preprocessed according to p. Let u=(x,z) and u^{\prime}=(x^{\prime},z^{\prime}) be two pairs of inputs. Then, \Theta^{g}(u,u^{\prime}) is a function of \exp[-\|z-z^{\prime}\|^{2}_{2}/2] and S^{0}(x,x^{\prime}).

Note that \Theta^{f} is the standard limiting NTK of f and depends only on the inputs \{x_{i}\}_{i=1}^{N}. However, from Eq. 8, the term \Theta^{g} requires computing the NNGP kernel of f in advance in order to obtain the terms \{\Sigma^{l},\dot{\Sigma}^{l}\}. This form provides an intuitive factorization of the hyperkernel into a term \Theta^{f}, which depends on the meta function and data, and a term \Theta^{g}, which can be thought of as a conditional term.
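
Eq. 14 combines two ingredients computed from f alone: its standard limiting NTK \Theta^{f} and its NNGP kernel S^{L} (the \Theta^{g} factor is left abstract here). The following is a hedged sketch of the standard ReLU NTK/NNGP recursion for f under the parameterization of Eq. 2; the names are ours and the index conventions may differ from the paper's by one layer.

```python
import jax.numpy as jnp

def relu_maps(kxx, kpp, kxp):
    # 2*E[relu(u)relu(v)] and 2*E[relu'(u)relu'(v)] for (u, v) ~ N(0, [[kxx, kxp], [kxp, kpp]]).
    norm = jnp.sqrt(kxx * kpp)
    cos_t = jnp.clip(kxp / norm, -1.0, 1.0)
    theta = jnp.arccos(cos_t)
    k = norm * (jnp.sin(theta) + (jnp.pi - theta) * cos_t) / jnp.pi  # NNGP map
    k_dot = (jnp.pi - theta) / jnp.pi                                # derivative map
    return k, k_dot

def ntk_f(x, xp, depth_L):
    # Returns (Theta^f(x, x'), NNGP kernel of the output of f) for a depth-L ReLU MLP.
    kxx, kpp, kxp = x @ x / len(x), xp @ xp / len(xp), x @ xp / len(x)
    theta = kxp
    for _ in range(depth_L - 1):
        sxx, _ = relu_maps(kxx, kxx, kxx)
        spp, _ = relu_maps(kpp, kpp, kpp)
        sxp, sxp_dot = relu_maps(kxx, kpp, kxp)
        theta = theta * sxp_dot + sxp
        kxx, kpp, kxp = sxx, spp, sxp
    return theta, kxp  # Eq. 14 then multiplies Theta^f by the Theta^g factor

x, xp = jnp.array([1.0, -1.0]), jnp.array([1.0, 1.0])
print(ntk_f(x, xp, depth_L=4))
```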

5 Experiments

Our experiments are divided into two main parts. In the first part, we validate the ideas presented in our theoretical analysis and study the effect of the width and depth of g on the optimization of a hypernetwork. In the second part, we evaluate the performance of the NNGP and NTK kernels on image representation tasks. For further implementation details on all experiments, see Appendix A.

           Representation                 Inpainting
N          50     100    200    500      50     100    200    500
HK         0.055  0.050  0.043  0.032    0.057  0.051  0.047  0.038
NNGP       0.051  0.045  0.037  0.026    0.054  0.047  0.043  0.034
HN         0.12   0.08   0.052  0.041    0.16   0.098  0.066  0.49
Table 1: Results on image representation and inpainting. Reported is the MSE of the reconstructed image on the test set, where N is the number of training samples. As can be seen, in the low-data regime the kernels outperform a trained hypernetwork on both tasks, and the NNGP consistently outperforms the rest.

5.1 Convergence of the Hyperkernel

We verified the results of Thm. 3 by constructing a simple hypernetwork, for which both f and g are four-layer fully connected networks with ReLU activations. For the input of f, we used a fixed 2D vector x=(1,-1). The input z(\theta) of g varied according to z(\theta)=(\sin(\theta),\cos(\theta)), where \theta\in[-\frac{\pi}{2},\frac{\pi}{2}]. We then compute the empirical hyperkernel as follows, while varying the width of both f and g:

\mathcal{K}^{h}(u,u^{\prime})=\nabla_{w}h(x,z(\theta))\cdot\nabla_{w}^{\top}h(x,z(\theta^{\prime})) (17)

Results are presented in Fig. 1. As can be seen, convergence to a fixed kernel is only observed in the dually wide regime, as stated in Thm. 3.

5.2 Training Dynamics

We consider a rotation prediction task. In this task, the hypernetwork f is provided with a randomly rotated image x, and the primary network g is provided with a version z of x rotated by a random angle \alpha. The setting is cast as a classification task, where the goal is to predict the closest value to \alpha/360 within \{\alpha_{i}=30i/360\mid i=0,\dots,11\}. We experimented with the MNIST [18] and CIFAR10 [17] datasets. For each dataset we took 10000 training samples only.

We investigate the effect of the depth and width of g on the training dynamics of a hypernetwork by comparing the performance of hypernetworks of various architectures. The architectures of the hypernetwork and the primary network are as follows. The hypernetwork, f, is a fully-connected ReLU neural network of depth 4 and width 200. The inputs of f are flattened vectors of dimension c\cdot h^{2}, where c specifies the number of channels and h the height/width of each image (c=1 and h=28 for MNIST and c=3 and h=32 for CIFAR10). The primary network g is a fully-connected ReLU neural network of depth \in\{3,6,8\}. Since the MNIST rotations dataset is simpler, we varied the width of g in \{10,50,100\}, and for the CIFAR10 variation we selected the width of g to be \in\{100,200,300\}. The network outputs 12 values and is trained using the cross-entropy loss.

We trained the hypernetworks for 100 epochs, using the SGD method with batch size 100 and learning rate \mu=0.01. For completeness, we conducted a sensitivity study on the learning rate for both datasets, to show that the reported behaviour is consistent for any chosen learning rate (see appendix). In Fig. 2(a-c) we compare the performance of the various architectures on the MNIST and CIFAR10 rotation prediction tasks. The performance is computed as an average and standard deviation (error bars) over 100 runs. As can be seen, we observe a clear improvement in test performance as the width of g increases, especially at initialization. When comparing the three plots, we observe that when f is wide and kept fixed, a deeper g incurs slower training and lower overall test performance. This is aligned with the conclusions of Thm. 1.

5.3 Image representation and Inpainting

We compare the performance of a hypernetwork and kernel regression with the hyperkernel on two visual tasks: functional image representation and inpainting. In the MNIST image representation task, the goal of the hypernetwork f is to represent an input image via the network g, which receives image pixel coordinates and outputs pixel values. In the inpainting task, the goal is the same, except that only half of the image is observed by f.

Figure 3: Results on image inpainting. (Row 1) ground-truth images. (Row 2) corresponding inputs of the meta-network f. (Row 3) reconstruction by the hypernetwork. (Row 4) reconstruction by the hyperkernel. See Section 5.3 for experimental details.

Problem Setup  We cast these problems as a meta-learning problem, where f receives an image, and the goal of the primary network g:[28]^{2}\rightarrow[0,1] is then to learn a conditional mapping from pixel coordinates to pixel values for all the pixels in the image, with the MSE as the metric. Our training dataset S=\{(u_{i},y_{i})\}_{i=1}^{N} then consists of samples u_{i}=(x_{i},z_{i}), such that x_{i} are images, z_{i} are random pixel locations (i.e., tuples \in[28]^{2}), and y_{i} is a label specifying the pixel value at the specified location (normalized between 0 and 1). In both experiments, training was done on randomly sampled training data of varying size, specified by N.
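
As a small illustration of this setup (our own sketch, with hypothetical array names), training samples u_{i}=(x_{i},z_{i}) and labels y_{i} can be assembled from a batch of 28x28 grayscale images as follows.

```python
import jax
import jax.numpy as jnp

def build_samples(key, images, pixels_per_image=20):
    # images: array of shape (num_images, 28, 28) with values in [0, 1].
    num_images = images.shape[0]
    krow, kcol = jax.random.split(key)
    rows = jax.random.randint(krow, (num_images, pixels_per_image), 0, 28)
    cols = jax.random.randint(kcol, (num_images, pixels_per_image), 0, 28)
    xs = jnp.repeat(images.reshape(num_images, -1), pixels_per_image, axis=0)  # inputs x_i to f
    zs = jnp.stack([rows.reshape(-1), cols.reshape(-1)], axis=1)               # coordinates z_i for g
    img_idx = jnp.repeat(jnp.arange(num_images), pixels_per_image)
    ys = images[img_idx, rows.reshape(-1), cols.reshape(-1)]                   # pixel-value labels y_i
    return xs, zs, ys

key = jax.random.PRNGKey(0)
fake_images = jax.random.uniform(jax.random.PRNGKey(1), (5, 28, 28))
xs, zs, ys = build_samples(key, fake_images)
print(xs.shape, zs.shape, ys.shape)  # (100, 784), (100, 2), (100,)
```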

Evaluation  We evaluate the performance of both training a hypernetwork and using kernel regression with the hyperkernel. For kernel regression, we use the following formula to infer the pixel value of a test point u:

(\Theta^{h}(u,u_{1}),...,\Theta^{h}(u,u_{N}))\cdot\big{(}\Theta^{h}(U,U)+\epsilon\cdot I\big{)}^{-1}\cdot Y (18)

where \Theta^{h}(U,U)=(\Theta^{h}(u_{i},u_{j}))_{i,j\in[N]} is the hyperkernel matrix evaluated on all of the training data, Y=(y_{i})^{N}_{i=1} is the vector of labels in the training dataset, and \epsilon=0.001.
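
Eq. 18 is standard kernel (ridge) regression with the hyperkernel as the Gram matrix. Below is a minimal sketch of that inference step, assuming some kernel function is available (e.g., the NNGP/NTK recursions sketched earlier); the toy kernel and names are ours, for illustration only.

```python
import jax.numpy as jnp

def kernel_predict(kernel_fn, train_us, train_ys, test_u, eps=1e-3):
    # Eq. 18: k(u, U) (K(U, U) + eps*I)^{-1} Y, with a small ridge term eps for stability.
    K = jnp.array([[kernel_fn(ui, uj) for uj in train_us] for ui in train_us])
    k_test = jnp.array([kernel_fn(test_u, ui) for ui in train_us])
    alpha = jnp.linalg.solve(K + eps * jnp.eye(len(train_us)), jnp.array(train_ys))
    return k_test @ alpha

# Toy stand-in kernel over u = (x, z), purely for demonstration purposes.
def toy_kernel(u, v):
    (x1, z1), (x2, z2) = u, v
    return (x1 @ x2) * jnp.exp(-jnp.sum((z1 - z2) ** 2) / 2.0)

us = [(jnp.array([1.0, 0.0]), jnp.array([0.1, 0.2])),
      (jnp.array([0.0, 1.0]), jnp.array([0.3, 0.4]))]
ys = [0.5, 0.9]
print(kernel_predict(toy_kernel, us, ys, (jnp.array([1.0, 1.0]), jnp.array([0.2, 0.3]))))
```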

In Tab. 1, we compare the results of the hyperkernel and the NNGP kernel with the corresponding hypernetwork. The reported numbers are averages over 20 runs. As can be seen, in the case of a small dataset, the kernels outperform the hypernetwork, and the NNGP outperforms the rest.

6 Conclusions

In this paper, we apply the well-established large width analysis to hypernetwork type models. For the class of models analyzed, we have shown that a wide hypernetwork must be coupled with a wide primary network in order to achieve simplified, convex training dynamics as in standard architectures. The deeper g is, the more complicated the evolution is. In the dually infinite case, when the widths of both the hyper and primary networks tend to infinity, the optimization of the hypernetwork becomes convex and is governed by the proposed hyperkernel. The analysis presented in this paper is limited to a specific type of hypernetwork used in the literature, typically found in the functional neural representation literature, and we leave the extension of this work to additional types of hyper models to future work.
Some of the tools developed in this study also apply to regular NTKs. Specifically, [6] pose a conjecture, one consequence of which is that \mathcal{K}^{(r)}_{i,j}=\mathcal{O}(1/n). In Thm. 1 we prove that this hypothesized upper bound is increasingly loose as r increases, and establish an asymptotic behaviour of the order \mathcal{K}^{(r)}_{i,j}\sim 1/n^{r-1}.

Broader Impact

This work improves our understanding and design of hypernetworks and hopefully will help us improve the transparency of machine learning involving them. Beyond that, this work falls under the category of basic research and does not seem to have particular societal or ethical implications.

Acknowledgements and Funding Disclosure

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant ERC CoG 725974). The contribution of Tomer Galanti is part of Ph.D. thesis research conducted at Tel Aviv University.

References

  • [1] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019.
  • [2] Yu Bai and Jason D. Lee. Beyond linearization: On quadratic and higher-order approximation of wide neural networks. In International Conference on Learning Representations, 2020.
  • [3] Yunru Bai, Ben Krause, Haiquan Wang, Caiming Xiong, and Richard Socher. Taylorized training: Towards better approximation of neural network training at finite width. ArXiv, 2020.
  • [4] Andrew Brock, Theo Lim, J.M. Ritchie, and Nick Weston. SMASH: One-shot model architecture search through hypernetworks. In International Conference on Learning Representations, 2018.
  • [5] Oscar Chang, Lampros Flokas, and Hod Lipson. Principled weight initialization for hypernetworks. In International Conference on Learning Representations, 2020.
  • [6] Ethan Dyer and Guy Gur-Ari. Asymptotics of wide networks from feynman diagrams. In International Conference on Learning Representations, 2020.
  • [7] Tomer Galanti and Lior Wolf. On the modularity of hypernetworks. In Advances in Neural Information Processing Systems 33. Curran Associates, Inc., 2020.
  • [8] David Ha, Andrew M. Dai, and Quoc V. Le. Hypernetworks. In International Conference on Learning Representations, 2016.
  • [9] Boris Hanin and Mihai Nica. Finite depth and width corrections to the neural tangent kernel. In International Conference on Learning Representations, 2020.
  • [10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [11] Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2018.
  • [12] Siddhant M. Jayakumar, Jacob Menick, Wojciech M. Czarnecki, Jonathan Schwarz, Jack Rae, Simon Osindero, Yee Whye Teh, Tim Harley, and Razvan Pascanu. Multiplicative interactions and where to find them. In International Conference on Learning Representations, 2020.
  • [13] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc V Gool. Dynamic filter networks. In Advances in Neural Information Processing Systems 29. Curran Associates, Inc., 2016.
  • [14] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
  • [15] Benjamin Klein, Lior Wolf, and Yehuda Afek. A dynamic convolutional layer for short range weather prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [16] Sylwester Klocek, Lukasz Maziarka, Maciej Wolczyk, Jacek Tabor, Jakub Nowak, and Marek Śmieja. Hypernetwork functional image representation. Lecture Notes in Computer Science, 2019.
  • [17] Alex Krizhevsky. Convolutional deep belief networks on cifar-10, 2010.
  • [18] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.
  • [19] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as gaussian processes. In International Conference on Learning Representations, 2018.
  • [20] Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019.
  • [21] Etai Littwin, Tomer Galanti, and Lior Wolf. On random kernels of residual architectures. Arxiv, 2020.
  • [22] Gidi Littwin and Lior Wolf. Deep meta functionals for shape representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
  • [23] Jonathan Lorraine and David Duvenaud. Stochastic hyperparameter optimization through hypernetworks, 2018.
  • [24] Henry B. Mann and Abraham Wald. On stochastic limit and order relationships. Annals of Mathematical Statistics, 14(3):217–226, 09 1943.
  • [25] Eliya Nachmani and Lior Wolf. Hyper-graph-network decoders for block codes. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019.
  • [26] Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-dickstein. Bayesian deep convolutional networks with many channels are gaussian processes. In International Conference on Learning Representations, 2019.
  • [27] Daniel Park, Jascha Sohl-Dickstein, Quoc Le, and Samuel Smith. The effect of network width on stochastic gradient descent and generalization: an empirical study. In Proceedings of Machine Learning Research, volume 97, pages 5042–5051. PMLR, 2019.
  • [28] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20. Curran Associates, Inc., 2008.
  • [29] G. Riegler, S. Schulter, M. Rüther, and H. Bischof. Conditioned regression models for non-blind single image super-resolution. In IEEE International Conference on Computer Vision (ICCV), pages 522–530, 2015.
  • [30] M. Rotman and L. Wolf. Electric analog circuit design with hypernetworks and a differential simulator. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
  • [31] Tim G. J. Rudner. On the connection between neural processes and gaussian processes with deep kernels. In Workshop on Bayesian Deep Learning (NeurIPS), 2018.
  • [32] Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. ArXiv, 2016.
  • [33] Vincent Sitzmann, Julien N. P. Martel, Alexander W. Bergman, David B. Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. Arxiv, 2020.
  • [34] Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. Arxiv, 2020.
  • [35] Twan van Laarhoven. L2 regularization versus batch and weight normalization. Arxiv, 2017.
  • [36] Johannes von Oswald, Christian Henning, João Sacramento, and Benjamin F. Grewe. Continual learning with hypernetworks. In International Conference on Learning Representations, 2020.
  • [37] Colin Wei, Jason D. Lee, Qiang Liu, and Tengyu Ma. On the margin theory of feedforward neural networks. ArXiv, 2018.
  • [38] Blake Woodworth, Suriya Gunasekar, Jason D. Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro. Kernel and rich regimes in overparametrized models. In Proceedings of Machine Learning Research, volume 125, pages 3635–3673. PMLR, 2020.
  • [39] Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. Pay less attention with lightweight and dynamic convolutions. In International Conference on Learning Representations, 2019.
  • [40] Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. ArXiv, 2019.
  • [41] Greg Yang. Tensor programs I: Wide feedforward or recurrent neural networks of any architecture are gaussian processes. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019.
  • [42] Greg Yang and Samuel Schoenholz. Mean field residual networks: On the edge of chaos. In Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 2017.
  • [43] Dingli Yu, Ruosong Wang, Zhiyuan Li, Wei Hu, Ruslan Salakhutdinov, Sanjeev Arora, and Simon S. Du. Enhanced convolutional neural tangent kernels, 2020.
  • [44] Chris Zhang, Mengye Ren, and Raquel Urtasun. Graph hypernetworks for neural architecture search. In International Conference on Learning Representations, 2019.

Appendix A Implementation Details

A.1 Convergence of the Hyperkernel

In Fig. 1(a) (main text) we plot the variance of the kernel values \mathcal{K}^{h}(u,u^{\prime}) in log-scale, as a function of the width of both f and g. The variance was computed empirically over k=100 normally distributed samples w. As can be seen, the variance of the kernel tends to zero only when both widths increase. In Fig. 1(b) (main text) we plot the value of \mathcal{K}^{h}(u,u^{\prime}) and its variance for a fixed hypernetwork f of width 500 and g of width 10 or 800. The x-axis specifies the value of \theta\in[-\frac{\pi}{2},\frac{\pi}{2}] and the y-axis specifies the value of the kernel. As can be seen, the expected value of the empirical kernel, \mathcal{K}^{h}(u,u^{\prime}), is equal to the width-limit kernel (i.e., the theoretical kernel) for both widths 10 and 800. In addition, convergence to the width-limit kernel is guaranteed only when the widths of both networks increase, highlighting the importance of wide architectures for both the hyper and implicit networks for stable training.

A.2 Image Completion and Inpainting

Architectures  In both tasks, we used fully connected architectures, where f contains two hidden layers and g contains one hidden layer. The hyperkernel used corresponds to the infinite width limit of the same architecture. For the input of g, we used random Fourier features [34] of the pixel coordinates as inputs for both the hyperkernel and the hypernetwork. To ease the computational burden of computing the full kernel matrix \Theta^{h}(U,U) when evaluating the hyperkernel, we compute smaller kernel matrices on subsets of the data \{U^{s}_{i}\}=\{x_{i}^{s},z_{i}^{s}\}_{s\in[10]}, where each subset contains 1k input images \{x^{s}_{i}\} and 20 random image coordinates per input, producing a kernel matrix of size 20k\times 20k. The final output prediction is then given by:

\frac{1}{10}\sum_{s}(\Theta^{h}(u,u_{1}^{s}),...,\Theta^{h}(u,u_{N}^{s}))\cdot\big{(}\Theta^{h}(U^{s},U^{s})+\epsilon\cdot I\big{)}^{-1}\cdot Y^{s} (19)

where Y^{s} are the corresponding labels of the subset U^{s}. For the hypernetwork evaluation, we used the same inputs \{U^{s}_{i}\}_{s\in[10]} to train the hypernetwork, using a batch size of 20 and a learning rate of 0.01, which was found to produce the best results.
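
A hedged sketch of the block-averaged prediction in Eq. 19; the subset format, toy kernel, and names are our own illustration rather than the authors' implementation.

```python
import jax.numpy as jnp

def block_kernel_predict(kernel_fn, subsets, test_u, eps=1e-3):
    # Eq. 19: average the kernel-regression predictions computed on each data subset s.
    preds = []
    for us, ys in subsets:
        K = jnp.array([[kernel_fn(a, b) for b in us] for a in us])
        k_test = jnp.array([kernel_fn(test_u, a) for a in us])
        alpha = jnp.linalg.solve(K + eps * jnp.eye(len(us)), jnp.array(ys))
        preds.append(k_test @ alpha)
    return jnp.mean(jnp.array(preds))

def toy_kernel(u, v):
    # Stand-in for the hyperkernel, for demonstration only.
    return jnp.exp(-jnp.sum((u - v) ** 2))

subsets = [([jnp.array([0.0, 0.1]), jnp.array([0.2, 0.3])], [0.1, 0.4]),
           ([jnp.array([0.5, 0.5]), jnp.array([0.7, 0.1])], [0.8, 0.3])]
print(block_kernel_predict(toy_kernel, subsets, jnp.array([0.3, 0.2])))
```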

Appendix B Additional Experiments

B.1 Sensitivity Study

To further demonstrate the behavior reported in Fig. 2 (main text), we verified that it is consistent regardless of the value of the learning rate. We used the same architectures as in the rotation prediction experiments, i.e., f is a fully-connected ReLU neural network of depth 4 and width 200, and g is of depth \in\{3,6,8\} and width \in\{50,100,200\}. We vary the learning rate: \mu=10^{j-7}, for j=0,\dots,7. For each value of the learning rate, we report the average performance (and standard deviation over 100 runs) of the various architectures after 40 epochs of training.

As can be seen in Fig. 4, when f is wide and kept fixed, there is a clear improvement in test performance as the width of g increases, for every learning rate in which the networks provide non-trivial performance. When f is wide and kept fixed, a deeper g incurs slower training and lower overall test performance. We note that it might seem that the width of g does not affect the performance when the learning rate is \mu=0.01 in all settings except Figs. 4(c,f). Indeed, we can verify from Fig. 2 (main text) that the performance at epoch 40 is indeed similar for different widths. However, for earlier epochs, the performance improves for shallower and wider architectures.

Figure 4: Sensitivity experiment. We compare the performance of hypernetworks with an implicit network of different widths and depths after 40 epochs, when varying the learning rate. The x-axis specifies the value of the learning rate and the y-axis specifies the averaged accuracy rate at test time. (a-c) Results on MNIST and (d-f) results on CIFAR10. In the first column, g's depth is 3, in the second, it is 6, and in the third, it is 8.

B.2 Training wide networks with a large learning rate

Remark 1 (main text) implies that one is able to train wide networks with a learning rate that grows with the width n. To validate this, we trained shallow networks of varying width n\in\{10^{2},10^{3},10^{4},2.5\cdot 10^{5}\} with learning rate \mu=\sqrt{n} on MNIST. As can be seen in Fig. 5, training these networks is possible despite the very large learning rate. In fact, we observe that the accuracy rate and loss improve as we increase the width of the network.

Figure 5: Results of training wide networks with a large learning rate. The y-axis is (a) the accuracy rate or (b) the average loss at test time. We vary the width n\in\{10^{2},10^{3},10^{4},2.5\cdot 10^{5}\} and take the learning rate to be \sqrt{n}.

Appendix C Correlation Functions

Correlation functions are products of general high order tensors representing high order derivatives of a network's output with respect to the weights. In [6], a conjecture is posed on the order of magnitude of general correlation functions involving high order derivative tensors, which arise when analysing the dynamics of gradient descent. Roughly speaking, given inputs \{x_{i}\}_{i=1}^{r} and the outputs of a neural network f(x_{1};w),...,f(x_{r};w)\in\mathbb{R} with normally distributed parameters w\in\mathbb{R}^{N}, correlation functions take the form:

\sum_{\eta_{k_{0}},...,\eta_{k_{r}}\in[N]}\prod_{j=1}^{r}\Gamma_{\eta_{k_{j}+1},...,\eta_{k_{j+1}}}(x_{j}) (20)

where

\Gamma_{\eta_{1},...,\eta_{k}}(x_{j}):=\frac{\partial^{k}f(x_{j};w)}{\partial w_{\eta_{1}}...\partial w_{\eta_{k}}} (21)

For instance, the following are two examples of correlation functions:

f(x_{1};w)\cdot\frac{\partial f(x_{2};w)}{\partial w_{\mu_{1}}},\quad\frac{\partial^{2}f(x_{1};w)}{\partial w_{\mu_{1}}\partial w_{\mu_{2}}}\cdot\frac{\partial f(x_{2};w)}{\partial w_{\mu_{1}}} (22)

Computing the expected value of these correlation functions involves keeping track of various moments of normally distributed weights along paths, as done in recent finite width correction works [9, 21]. [6] employ Feynman diagrams to efficiently compute the expected values (orders of magnitude) of general correlation functions, albeit at the cost of only being provably accurate for deep linear or shallow ReLU networks. In this work, we analyze the asymptotic behaviour of correlation functions of the form:

\displaystyle\mathcal{T}^{r}(x_{0},...,x_{r}):=\sum_{\eta_{k_{0}}...\eta_{k_{r}}\in[N]}\Gamma_{\eta_{k_{1}},...,\eta_{k_{r}}}(x_{0})\prod_{j=1}^{r}\Gamma_{\eta_{k_{j}}}(x_{j})=\bigg{\langle}\nabla^{(r)}_{w}f(x_{0}),\bigotimes^{r}_{j=1}\nabla_{w}f(x_{j})\bigg{\rangle} (23)

where \nabla^{(r)}_{w}f(x_{0}) is a rank r tensor, representing the r'th derivative of the output, and \bigotimes^{r}_{j=1}\nabla_{w}f(x_{j}) denotes the outer product of the gradients for different examples. As we have shown in Sec. 3, terms of the form in Eq. 23 represent high order terms in the multivariate Taylor expansion of outputs, and are, therefore, relevant for the full understanding of training dynamics. As a consequence of Thm. 1, we prove that \mathcal{T}^{r}(x_{0},...,x_{r})\sim 1/n^{\max(r-1,0)} for vanilla neural networks, where n is the width of the network.

This result is a partial solution to the open problem suggested by [6]. In their paper, they conjecture the asymptotic behaviour of general correlation functions, and predict an upper bound on the asymptotic behaviour of terms of the form in Eq. 23 of the order \mathcal{O}(1/n). Our result therefore proves a stronger version of the conjecture, while giving the exact behaviour as a function of the width.

Appendix D Proofs of the Main Results

Terminology and Notations  Throughout the appendix, we denote by A\otimes B and A\odot B the outer and Hadamard products of the tensors A and B (resp.). When considering the outer product of a sequence of tensors \{A_{i}\}^{k}_{i=1}, we denote \bigotimes^{k}_{i=1}A_{i}=A_{1}\otimes\dots\otimes A_{k}. We denote by \textnormal{sgn}(x):=x/|x| the sign function. The notation X_{n}\sim a_{n} states that X_{n}/a_{n} converges in distribution to some non-zero random variable X. A convenient property of this notation is that it satisfies X_{n}\cdot Y_{n}\sim a_{n}\cdot b_{n} when X_{n}\sim a_{n} and Y_{n}\sim b_{n}. Throughout the paper, we make use of sequential limits and write n_{k},\dots,n_{1}\to\infty to express that n_{1} tends to infinity, then n_{2}, and so on. For a given sequence of random variables \{X_{n}\}^{\infty}_{n=1}, we write X_{n}\stackrel{{\scriptstyle d}}{{\longrightarrow}}X (X_{n}\stackrel{{\scriptstyle p}}{{\longrightarrow}}X) when X_{n} converges in distribution (probability) to a random variable X.

D.1 Useful Lemmas

Lemma 1.

Let X_{n}\stackrel{{\scriptstyle d}}{{\longrightarrow}}X. Then, \textnormal{sgn}(X_{n})\stackrel{{\scriptstyle d}}{{\longrightarrow}}\textnormal{sgn}(X).

Proof.

We have:

\lim_{n\to\infty}\mathbb{P}[\textnormal{sgn}(X_{n})=1]=\lim_{n\to\infty}\mathbb{P}[X_{n}\geq 0]=\mathbb{P}[X\geq 0]=\mathbb{P}[\textnormal{sgn}(X)=1] (24)

Hence, \textnormal{sgn}(X_{n}) converges in distribution to \textnormal{sgn}(X). ∎

D.2 Main Technical Lemma

In this section, we prove Lem. 3, which is the main technical lemma that enables us to prove Thm. 1. Let f(x;w) be a neural network with H outputs \{f^{d}(x;w)\}^{H}_{d=1}. We would like to estimate the order of magnitude of the following expression:

\mathcal{T}^{\bm{l,i,d}}_{n,i,d}:=\left\langle\frac{\partial^{k}f^{d}(x_{i};w)}{\partial W^{l_{1}}\dots\partial W^{l_{k}}},\bigotimes^{k}_{t=1}\frac{\partial f^{d_{1}}(x_{i_{t}};w)}{\partial W^{l_{t}}}\right\rangle (25)

where \bm{d}=(d_{1},\dots,d_{k}), \bm{i}=(i_{1},\dots,i_{k}) and \bm{l}=(l_{1},\dots,l_{k}). For simplicity, when i_{1}=\dots=i_{k}=j, we denote \mathcal{T}^{\bm{l,d}}_{n,i,j,d}:=\mathcal{T}^{\bm{l,i,d}}_{n,i,d}, and \mathcal{T}^{\bm{l}}_{n,i,j,d}:=\mathcal{T}^{\bm{l,i,d}}_{n,i,d} when d_{1}=\dots=d_{k}=d as well.

To estimate the order of magnitude of the expression in Eq. 25, we provide an explicit expression for \frac{\partial^{k}f^{d}(x_{i};w)}{\partial W^{l_{1}}\dots\partial W^{l_{k}}}. First, we note that for any w, such that f^{d}(x_{i};w) is k times continuously differentiable at w, and for any set \bm{l}:=\{l_{1},\dots,l_{k}\}, we have:

\frac{\partial^{k}f^{d}(x_{i};w)}{\partial W^{l_{1}}\dots\partial W^{l_{k}}}=\frac{\partial^{k}f^{d}(x_{i};w)}{\partial W^{l^{\prime}_{1}}\dots\partial W^{l^{\prime}_{k}}} (26)

where the set \bm{l}^{\prime}:=\{l^{\prime}_{1},\dots,l^{\prime}_{k}\} is an ordered version of \bm{l}, i.e., the two sets consist of the same elements but l^{\prime}_{1}<\dots<l^{\prime}_{k}. In addition, we notice that for any multi-set \bm{l}, such that l_{i}=l_{j} for some i\neq j,

\frac{\partial^{k}f^{d}(x_{i};w)}{\partial W^{l_{1}}\dots\partial W^{l_{k}}}=0 (27)

since f^{d}(x_{i};w) is a neural network with a piece-wise linear activation function. Therefore, with no loss of generality, we consider \bm{l}=\{l_{1},\dots,l_{k}\}, such that l_{1}<\dots<l_{k}. It holds that:

\frac{\partial^{k}f^{d}(x_{i};w)}{\partial W^{l_{1}}\dots\partial W^{l_{k}}}=\frac{1}{\sqrt{n_{l_{1}-1}}}q^{l_{1}-1}_{i,d}\otimes\mathcal{A}^{l_{1}\rightarrow l_{2}}_{i,d} (28)

where \mathcal{A}^{l_{1}\rightarrow l_{2}}_{i,d} is a rank 2k-1 tensor, defined as follows:

\mathcal{A}^{l_{j}\rightarrow l_{j+1}}_{i,d}=\begin{cases}\frac{1}{\sqrt{n_{l_{j+1}-1}}}C^{l_{j}\rightarrow l_{j+1}}_{i,d}\otimes\mathcal{A}^{l_{j+1}\rightarrow l_{j+2}}_{i,d}&1\leq j<k-1\\ \frac{1}{\sqrt{n_{l_{k}-1}}}C^{l_{k-1}\rightarrow l_{k}}_{i,d}\otimes C^{l_{k}\rightarrow L}_{i,d}&j=k-1\\ \end{cases} (29)

where:

C^{l_{j}\rightarrow l_{j+1}}_{i,d}=\begin{cases}\sqrt{2}Z^{l_{j+1}-1}_{i,d}P_{i,d}^{l_{j}\rightarrow l_{j+1}-1}&l_{j+1}\neq L\\ P_{i,d}^{l_{j}\rightarrow L}&\textnormal{else}\end{cases} (30)

and:

P_{i}^{u\rightarrow v}=\prod_{l=u}^{v-1}\left(\sqrt{\frac{2}{n_{l}}}W^{l+1}Z^{l}_{i}\right)\textnormal{ and }Z^{l}_{i}=\textnormal{diag}(\dot{\sigma}(y^{l}(x_{i}))) (31)

The individual gradients can be expressed using:

\frac{\partial f^{d_{j}}_{w}(x_{i_{j}})}{\partial W^{l_{j}}}=\frac{q^{l_{j}-1}_{i_{j},d_{j}}\otimes C^{l_{j}\rightarrow L}_{i_{j},d_{j}}}{\sqrt{n_{l_{j}-1}}} (32)

Note that the following holds for any u<v<h\leq L:

C^{u\rightarrow h}_{i,d}=C^{v\rightarrow h}_{i,d}\frac{W^{v}}{\sqrt{n_{v-1}}}C^{u\rightarrow v}_{i,d}\textnormal{ and }C^{u\rightarrow L}_{i,d}=C^{v-1\rightarrow L}_{i,d}P^{u\rightarrow v-1}_{i,d} (33)

In the following, given the sets \bm{l}=\{l_{1},\dots,l_{k}\}, \bm{i}=\{i_{1},\dots,i_{k}\} and \bm{d}=\{d_{1},\dots,d_{k}\}, we derive the limit of \mathcal{T}^{\bm{l,i,d}}_{n,i,d} using elementary tensor algebra. By Eqs. 32 and 28, we see that:

\mathcal{T}^{\bm{l,i,d}}_{n,i,d}=\Big{\langle}\bigotimes^{k}_{t=1}\frac{\partial f^{d_{t}}(x_{i_{t}};w)}{\partial W^{l_{t}}},\frac{q^{l_{1}-1}_{i,d}}{\sqrt{n_{l_{1}-1}}}\otimes\frac{C^{l_{1}\rightarrow l_{2}}_{i,d}}{\sqrt{n_{l_{2}-1}}}\otimes...\otimes\frac{C^{l_{k-1}\rightarrow l_{k}}_{i,d}}{\sqrt{n_{l_{k}-1}}}\otimes C^{l_{k}\rightarrow L}_{i,d}\Big{\rangle} (34)
=\frac{1}{n_{l_{1}-1}}\left\langle q^{l_{1}-1}_{i,d},q^{l_{1}-1}_{i_{1},d_{1}}\right\rangle\cdot\left\langle C_{i_{k},d_{k}}^{l_{k}\rightarrow L},C_{i,d}^{l_{k}\rightarrow L}\right\rangle\prod_{j=1}^{k-1}\left\langle\frac{C_{i_{j},d_{j}}^{l_{j}\rightarrow L}\otimes q_{i_{j+1},d_{j+1}}^{l_{j+1}-1}}{n_{l_{j+1}-1}},C_{i,d}^{l_{j}\rightarrow l_{j+1}}\right\rangle

We recall the analysis of [41] showing that in the infinite width limit, with n=\min(n_{1},\dots,n_{L-1})\to\infty, every pre-activation y^{l}(x) of f(x;w) at hidden layer l\in[L] has all its coordinates tending to i.i.d. centered Gaussian processes with covariance \Sigma^{l}(x,x^{\prime}):\mathbb{R}^{n_{0}}\times\mathbb{R}^{n_{0}}\to\mathbb{R}, defined recursively as follows:

\Sigma^{0}(x,x^{\prime}) =x^{\top}x^{\prime}, (35)
\Lambda^{l}(x,x^{\prime}) =\begin{bmatrix}\Sigma^{l-1}(x,x)&\Sigma^{l-1}(x,x^{\prime})\\ \Sigma^{l-1}(x^{\prime},x)&\Sigma^{l-1}(x^{\prime},x^{\prime})\end{bmatrix}\in\mathbb{R}^{2\times 2},
\Sigma^{l}(x,x^{\prime}) =\mathbb{E}_{(u,v)\sim\mathcal{N}(0,\Lambda^{l}(x,x^{\prime}))}[\sigma(u)\sigma(v)]

In addition, we define the derivative covariance as follows:

\dot{\Sigma}^{l}(x,x^{\prime})=\mathbb{E}_{(u,v)\sim\mathcal{N}(0,\Lambda^{l}(x,x^{\prime}))}[\dot{\sigma}(u)\dot{\sigma}(v)] (36)

When considering x=x_{i} and x^{\prime}=x_{j} from the training set, we simply write \Sigma^{l}_{i,j}:=\Sigma^{l}(x_{i},x_{j}) and \dot{\Sigma}^{l}_{i,j}:=\dot{\Sigma}^{l}(x_{i},x_{j}).
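The recursion in Eqs. 35-36 can be evaluated directly; the sketch below (an illustration we add, with arbitrary inputs and a ReLU activation) estimates the Gaussian expectations by Monte Carlo and compares them against the standard arc-cosine closed forms for ReLU, which are not stated in the text and are used here only as a reference.

    import numpy as np

    def relu(t):
        return np.maximum(t, 0.0)

    def heav(t):
        return (t > 0).astype(float)

    def mc_expect(lam, fn, num=200_000, seed=0):
        # Monte Carlo estimate of E[fn(u) fn(v)] for (u, v) ~ N(0, lam), as in Eqs. 35-36
        rng = np.random.default_rng(seed)
        uv = rng.multivariate_normal(np.zeros(2), lam, size=num)
        return float(np.mean(fn(uv[:, 0]) * fn(uv[:, 1])))

    def lam_matrix(sxx, sxy, syy):
        return np.array([[sxx, sxy], [sxy, syy]])

    def relu_closed_form(lam):
        # arc-cosine reference for E[relu(u) relu(v)] under N(0, lam)
        a, b, c = lam[0, 0], lam[1, 1], lam[0, 1]
        theta = np.arccos(np.clip(c / np.sqrt(a * b), -1.0, 1.0))
        return np.sqrt(a * b) * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

    def heav_closed_form(lam):
        # reference for E[1(u>0) 1(v>0)], i.e. the ReLU derivative covariance of Eq. 36
        a, b, c = lam[0, 0], lam[1, 1], lam[0, 1]
        theta = np.arccos(np.clip(c / np.sqrt(a * b), -1.0, 1.0))
        return (np.pi - theta) / (2 * np.pi)

    x, xp = np.array([1.0, 0.0]), np.array([0.6, 0.8])
    sxx, sxy, syy = x @ x, x @ xp, xp @ xp               # Sigma^0 for the pairs (x,x), (x,x'), (x',x')
    for l in range(1, 4):
        lam = lam_matrix(sxx, sxy, syy)                  # Lambda^l, built from Sigma^{l-1}
        print("layer", l,
              "Sigma MC:", mc_expect(lam, relu), "closed form:", relu_closed_form(lam),
              "Sigma-dot MC:", mc_expect(lam, heav), "closed form:", heav_closed_form(lam))
        # advance the recursion: Sigma^l for all three input pairs
        sxx, syy, sxy = (relu_closed_form(lam_matrix(sxx, sxx, sxx)),
                         relu_closed_form(lam_matrix(syy, syy, syy)),
                         relu_closed_form(lam))

Each printed Monte Carlo estimate should agree with the corresponding closed form up to sampling error.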

Lemma 2.

The following holds:

  1.

    For n_{L-1},\dots,n_{1}\to\infty, we have: P_{i,d_{1}}^{u\rightarrow L}(P_{j,d_{2}}^{u\rightarrow L})^{\top}\stackrel{d}{\longrightarrow}\prod_{l=u}^{L-1}\dot{\Sigma}_{i,j}^{l}\delta_{d_{1}=d_{2}}.

  2.

    For n_{v},\dots,n_{1}\to\infty, we have: \frac{(q_{i}^{v})^{\top}q_{j}^{v}}{n_{v}}\stackrel{d}{\longrightarrow}\Sigma_{i,j}^{v}.

Here, \delta_{T} is an indicator that returns 1 if T is true and 0 otherwise.

Proof.

See [1].

Lemma 3.

Let k\geq 0 and sets \bm{l}=\{l_{1},\dots,l_{k}\}, \bm{i}=\{i_{1},\dots,i_{k}\} and \bm{d}=\{d_{1},\dots,d_{k}\}. We have:

n^{\max(k-1,0)}\cdot\mathcal{T}^{\bm{l,i,d}}_{n,i,d}\stackrel{d}{\longrightarrow}\begin{cases}\delta_{\bm{d}}\cdot\prod_{j=1}^{k-1}\mathcal{G}_{j}&k>1\\ \textnormal{const}&k=1\end{cases} (37)

as n\to\infty. Here, \mathcal{G}_{1},...,\mathcal{G}_{k-1} are centered Gaussian variables with finite, non-zero variances, and \delta_{\bm{d}}:=\delta(d_{1}=...=d_{k}=d).

Proof.

The case k=0 is trivial. Let k\geq 1. By Eq. 34, it holds that:

n^{k-1}\mathcal{T}^{\bm{l,i,d}}_{n,i,d} (38)
=n^{k-1}\frac{\Big{\langle}q^{l_{1}-1}_{i,d},q^{l_{1}-1}_{i_{1},d_{1}}\Big{\rangle}\Big{\langle}C_{i_{k},d_{k}}^{l_{k}\rightarrow L},C_{i,d}^{l_{k}\rightarrow L}\Big{\rangle}}{n}\cdot\prod_{j=1}^{k-1}\left\langle\frac{C_{i_{j},d_{j}}^{l_{j}\rightarrow L}\otimes q_{i_{j+1},d_{j+1}}^{l_{j+1}-1}}{n},C_{i,d}^{l_{j}\rightarrow l_{j+1}}\right\rangle
=\frac{\Big{\langle}q^{l_{1}-1}_{i,d},q^{l_{1}-1}_{i_{1},d_{1}}\Big{\rangle}\Big{\langle}C_{i_{k},d_{k}}^{l_{k}\rightarrow L},C_{i,d}^{l_{k}\rightarrow L}\Big{\rangle}}{n}\cdot\prod_{j=1}^{k-1}\left\langle C_{i_{j},d_{j}}^{l_{j}\rightarrow L}\otimes q_{i_{j+1},d_{j+1}}^{l_{j+1}-1},C_{i,d}^{l_{j}\rightarrow l_{j+1}}\right\rangle

Note that the intermediate activations do not depend on the index d_{j}, and so we remove the dependency on d_{j} in the relevant terms. Next, by applying Lem. 2,

\frac{\Big{\langle}q^{l_{1}-1}_{i},q^{l_{1}-1}_{i_{1}}\Big{\rangle}\Big{\langle}C_{i_{k},d_{k}}^{l_{k}\rightarrow L},C_{i,d}^{l_{k}\rightarrow L}\Big{\rangle}}{n}\stackrel{d}{\longrightarrow}\Sigma_{i,i_{1}}^{l_{1}-1}\left(\prod_{l=l_{k}}^{L-1}\dot{\Sigma}_{i,i_{k}}^{l}\right)\delta_{\bm{d}} (39)

Expanding the second term using Eq. 33:

\Big{\langle}C_{i_{j},d_{j}}^{l_{j}\rightarrow L}\otimes q_{i_{j+1}}^{l_{j+1}-1},C_{i,d}^{l_{j}\rightarrow l_{j+1}}\Big{\rangle} (40)
=C_{i_{j},d_{j}}^{l_{j}\rightarrow L}C_{i}^{l_{j}\rightarrow l_{j+1}}q_{i_{j+1}}^{l_{j+1}-1}
=C_{i_{j},d_{j}}^{l_{j+1}-1\rightarrow L}P_{i_{j}}^{l_{j}\rightarrow l_{j+1}-1}(P_{i}^{l_{j}\rightarrow l_{j+1}-1})^{\top}\sqrt{2}\cdot Z_{i}^{l_{j+1}-1}q_{i_{j+1}}^{l_{j+1}-1}
=\sqrt{2}\cdot\Big{\langle}C_{i_{j},d_{j}}^{l_{j+1}-1\rightarrow L}\otimes(Z_{i}^{l_{j+1}-1}q_{i_{j+1}}^{l_{j+1}-1}),P_{i_{j}}^{l_{j}\rightarrow l_{j+1}-1}(P_{i}^{l_{j}\rightarrow l_{j+1}-1})^{\top}\Big{\rangle}
=\sqrt{2}\cdot C_{i_{j},d_{j}}^{l_{j+1}-1\rightarrow L}P_{i_{j}}^{l_{j}\rightarrow l_{j+1}-1}(P_{i}^{l_{j}\rightarrow l_{j+1}-1})^{\top}Z_{i}^{l_{j+1}-1}q_{i_{j+1}}^{l_{j+1}-1}
=\xi_{j}.

The above expression is fully implementable in a Tensor Program (see [41, 40]), and approaches a GP as the width tends to infinity. In other words:

\xi_{j}\stackrel{d}{\longrightarrow}\hat{\mathcal{G}}_{j} (41)

and denoting \bm{\xi}=[\xi_{1},...,\xi_{k-1}] and \hat{\bm{\mathcal{G}}}=[\hat{\mathcal{G}}_{1},...,\hat{\mathcal{G}}_{k-1}], it holds by the multivariate Central Limit theorem that:

\bm{\xi}\stackrel{d}{\longrightarrow}\hat{\bm{\mathcal{G}}} (42)

Using the Mann-Wald theorem [24] (where we take the mapping to be the product of the entries of \bm{\xi}), we have that:

\prod_{j=1}^{k-1}\xi_{j}\stackrel{d}{\longrightarrow}\prod_{j=1}^{k-1}\hat{\mathcal{G}}_{j} (43)

Finally, by Slutsky’s theorem,

n^{k-1}\mathcal{T}^{\bm{l,i,d}}_{n,i,d}\stackrel{d}{\longrightarrow}\Sigma_{i,i_{1}}^{l_{1}-1}\left(\prod_{l=l_{k}}^{L-1}\dot{\Sigma}_{i,i_{k}}^{l}\right)\cdot\left(\prod^{k-1}_{j=1}\hat{\mathcal{G}}_{j}\right)\cdot\delta_{\bm{d}}. (44)

Assigning \mathcal{G}_{1}:=\Sigma_{i,i_{1}}^{l_{1}-1}\left(\prod_{l=l_{k}}^{L-1}\dot{\Sigma}_{i,i_{k}}^{l}\right)\hat{\mathcal{G}}_{1} and \mathcal{G}_{j}:=\hat{\mathcal{G}}_{j} for j>1 completes the proof. ∎

D.3 Proof of Thm. 1

Since we assume that g is a finite neural network, i.e., m_{l}<\infty for all l\in[H], throughout the proofs, with no loss of generality, we assume that m_{1}=\dots=m_{H}=1.

Lemma 4.

Let h(u;w)=g(z;f(x;w)) be a hypernetwork. We have:

\mathcal{K}^{(r)}_{i,j}=\sum_{\begin{subarray}{c}\alpha_{1}+\dots+\alpha_{H}=r\\ \alpha_{1},\dots,\alpha_{H}\geq 0\end{subarray}}\frac{r!}{\alpha_{1}!\cdots\alpha_{H}!}\cdot z_{i}\cdot\left[\prod^{H-1}_{d=1}\dot{\phi}(g^{d}_{i})\right]\cdot\prod^{H}_{d=1}\left\langle\nabla^{(\alpha_{d})}_{w}f^{d}_{i},(\nabla_{w}h_{j})^{\alpha_{d}}\right\rangle (45)
Proof.

By the higher order product rule and the fact that the second derivative of a piece-wise linear function is 0 everywhere:

\nabla^{(r)}_{w}h_{i}=\sum_{\begin{subarray}{c}\alpha_{1}+\dots+\alpha_{H}=r\\ \alpha_{1},\dots,\alpha_{H}\geq 0\end{subarray}}\frac{r!}{\alpha_{1}!\cdots\alpha_{H}!}\cdot z_{i}\cdot\nabla^{(\alpha_{H})}_{w}f^{H}_{i}\bigotimes^{H-1}_{d=1}D_{H-d} (46)

where

D_{d}:=\dot{\phi}(g^{d}_{i})\cdot\nabla^{(\alpha_{d})}_{w}f^{d}_{i} (47)

In addition, by elementary tensor algebra, we have:

\mathcal{K}^{(r)}_{i,j}=\langle\nabla^{(r)}_{w}h_{i},(\nabla_{w}h_{j})^{r}\rangle (48)
=\sum_{\begin{subarray}{c}\alpha_{1}+\dots+\alpha_{H}=r\\ \alpha_{1},\dots,\alpha_{H}\geq 0\end{subarray}}\frac{r!}{\alpha_{1}!\cdots\alpha_{H}!}z_{i}\cdot\left\langle\nabla^{(\alpha_{H})}_{w}f^{H}_{i}\cdot\bigotimes^{H-1}_{d=1}D_{H-d},(\nabla_{w}h_{j})^{r}\right\rangle
=\sum_{\begin{subarray}{c}\alpha_{1}+\dots+\alpha_{H}=r\\ \alpha_{1},\dots,\alpha_{H}\geq 0\end{subarray}}\frac{r!}{\alpha_{1}!\cdots\alpha_{H}!}z_{i}\cdot\left\langle\nabla^{(\alpha_{H})}_{w}f^{H}_{i},(\nabla_{w}h_{j})^{\alpha_{H}}\right\rangle\cdot\prod^{H-1}_{d=1}\left\langle\dot{\phi}(g^{H-d}_{i})\cdot\nabla^{(\alpha_{H-d})}_{w}f^{H-d}_{i},(\nabla_{w}h_{j})^{\alpha_{H-d}}\right\rangle
=\sum_{\begin{subarray}{c}\alpha_{1}+\dots+\alpha_{H}=r\\ \alpha_{1},\dots,\alpha_{H}\geq 0\end{subarray}}\frac{r!}{\alpha_{1}!\cdots\alpha_{H}!}\cdot z_{i}\cdot\left[\prod^{H-1}_{d=1}\dot{\phi}(g^{d}_{i})\right]\cdot\prod^{H}_{d=1}\left\langle\nabla^{(\alpha_{d})}_{w}f^{d}_{i},(\nabla_{w}h_{j})^{\alpha_{d}}\right\rangle
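The index set over which Eqs. 45-48 sum, together with the multinomial coefficients r!/(\alpha_{1}!\cdots\alpha_{H}!), can be enumerated in a few lines; the following snippet (ours, with arbitrary r and H) only illustrates this combinatorial structure.

    from itertools import product
    from math import factorial

    def multi_indices(r, H):
        # all (alpha_1, ..., alpha_H) with alpha_d >= 0 and sum = r,
        # paired with the multinomial coefficient r! / (alpha_1! ... alpha_H!)
        for alphas in product(range(r + 1), repeat=H):
            if sum(alphas) == r:
                coeff = factorial(r)
                for a in alphas:
                    coeff //= factorial(a)
                yield alphas, coeff

    r, H = 3, 2
    for alphas, coeff in multi_indices(r, H):
        print(alphas, coeff)
    # the coefficients sum to H**r, consistent with the multinomial theorem
    print(sum(c for _, c in multi_indices(r, H)), H ** r)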

Lemma 5.

Let h(u;w)=g(z;f(x;w)) be a hypernetwork. In addition, let,

\forall d\in[H]:\quad h^{d}_{j}:=a^{d-1}_{j}\prod^{H-d}_{t=1}f^{H-t+1}_{j}\cdot\dot{\phi}(g^{H-t}_{j}) (49)

We have:

\left\langle\nabla^{(\alpha_{d})}_{w}f^{d}_{i},(\nabla_{w}h_{j})^{\alpha_{d}}\right\rangle=\sum_{\bm{l}\in[L]^{\alpha_{d}}}\sum_{\bm{d}\in[H]^{\alpha_{d}}}\left(\prod^{\alpha_{d}}_{k=1}h^{d_{k}}_{j}\right)\cdot\mathcal{T}^{\bm{l,d}}_{n,i,j,d} (50)
Proof.

We have:

\left\langle\nabla^{(\alpha_{d})}_{w}f^{d}_{i},(\nabla_{w}h_{j})^{\alpha_{d}}\right\rangle=\sum_{\bm{l}\in[L]^{\alpha_{d}}}\left\langle\frac{\partial^{\alpha_{d}}f^{d}_{i}}{\partial W^{l_{1}}\dots\partial W^{l_{\alpha_{d}}}},\bigotimes^{\alpha_{d}}_{k=1}\frac{\partial h_{j}}{\partial W^{l_{k}}}\right\rangle (51)

By the product rule:

\frac{\partial h_{j}}{\partial W^{l_{k}}}=\sum^{H}_{d=1}\left[\prod^{H-d}_{t=1}f^{H-t+1}_{j}\cdot\dot{\phi}(g^{H-t}_{j})\right]\cdot\frac{\partial f^{d}_{j}}{\partial W^{l_{k}}}\cdot a^{d-1}_{j}=\sum^{H}_{d=1}h^{d}_{j}\cdot\frac{\partial f^{d}_{j}}{\partial W^{l_{k}}} (52)

Hence,

\bigotimes^{\alpha_{d}}_{k=1}\frac{\partial h_{j}}{\partial W^{l_{k}}}=\sum_{\bm{d}\in[H]^{\alpha_{d}}}\left(\prod^{\alpha_{d}}_{k=1}h^{d_{k}}_{j}\right)\bigotimes^{\alpha_{d}}_{k=1}\frac{\partial f^{d_{k}}_{j}}{\partial W^{l_{k}}} (53)

In particular,

\left\langle\nabla^{(\alpha_{d})}_{w}f^{d}_{i},(\nabla_{w}h_{j})^{\alpha_{d}}\right\rangle=\sum_{\bm{l}\in[L]^{\alpha_{d}}}\sum_{\bm{d}\in[H]^{\alpha_{d}}}\left(\prod^{\alpha_{d}}_{k=1}h^{d_{k}}_{j}\right)\cdot\mathcal{T}^{\bm{l,d}}_{n,i,j,d} (54)

See 1

Proof.

Throughout the proof, in order to derive certain limits of various sequences of random variables, we implicitly make use of the Mann-Wald theorem [24]. For simplicity, we often avoid explicitly stating when this theorem is applied. As a general note, the repeated argument is as follows: terms such as n^{\max(\alpha_{d}-1,0)}\cdot\mathcal{T}^{\bm{l,d}}_{n,i,j,d}, \mathcal{Q}^{\bm{d}}_{n,j}, g^{d}_{i}, etc. (see below) can be expressed as continuous mappings of jointly convergent random variables. Hence, they jointly converge, and continuous mappings of them converge as well.

By Lems. 4 and 5, we have:

\mathcal{K}^{(r)}_{i,j}=\sum_{\begin{subarray}{c}\alpha_{1}+\dots+\alpha_{H}=r\\ \alpha_{1},\dots,\alpha_{H}\geq 0\end{subarray}}\frac{r!}{\alpha_{1}!\cdots\alpha_{H}!}\cdot z_{i}\cdot\left[\prod^{H-1}_{d=1}\dot{\phi}(g^{d}_{i})\right]\cdot\prod^{H}_{d=1}\sum_{\bm{l}\in[L]^{\alpha_{d}}}\sum_{\bm{d}\in[H]^{\alpha_{d}}}\mathcal{Q}^{\bm{d}}_{n,j}\cdot\mathcal{T}^{\bm{l,d}}_{n,i,j,d} (55)

where \mathcal{Q}^{\bm{d}}_{n,j}:=\prod^{\alpha_{d}}_{k=1}h^{d_{k}}_{j}. By the Mann-Wald theorem [24], g^{d}_{i} converges to some random variable \mathcal{U}^{d}_{i}. If \dot{\phi} is a continuous function, then \dot{\phi}(g^{d}_{i}) converges to \dot{\phi}(\mathcal{U}^{d}_{i}). If \phi is the ReLU activation function, then, by Lem. 1, \dot{\phi}(g^{d}_{i})=\textnormal{sgn}(g^{d}_{i}) converges to \textnormal{sgn}(\mathcal{U}^{d}_{i}) in distribution. We notice that \mathcal{Q}^{\bm{d}}_{n,j} converges in distribution to some random variable \mathcal{Q}^{\bm{d}}_{j}.

The proof is divided into two cases: H=1 and H>1.

Case H=1:

First, we note that for H=1 and d\in[H] (i.e., d=1), we have:

h^{d}_{j}=a^{d-1}_{j}\cdot\prod^{H-d}_{t=1}f^{H-t+1}_{j}\cdot\dot{\phi}(g^{H-t}_{j})=a^{0}_{j}=z_{j} (56)

In addition, \prod^{H-1}_{d=1}\dot{\phi}(g^{d}_{i})=1 as it is an empty product. Therefore, we can rewrite:

\mathcal{K}^{(r)}_{i,j}=z_{i}\cdot z^{r}_{j}\sum_{\bm{l}\in[L]^{r}}\sum_{\bm{d}\in[H]^{r}}\mathcal{T}^{\bm{l,d}}_{n,i,j,d} (57)

By Lem. 3, for r=1, the above tends to a constant as n\to\infty. For r>1, n^{r-1}\cdot\mathcal{T}^{\bm{l,d}}_{n,i,j,d} converges in distribution to zero for all \bm{d}\neq(d,\dots,d), and converges to a non-constant random variable \mathcal{T}^{\bm{l}}_{i,j,d} otherwise. Hence, by the Mann-Wald theorem [24],

n^{r-1}\cdot\mathcal{K}^{(r)}_{i,j}\stackrel{d}{\longrightarrow}z_{i}\cdot z^{r}_{j}\sum_{\bm{l}\in[L]^{r}}\mathcal{T}^{\bm{l}}_{i,j,d} (58)

which is a non-zero random variable.

Case H>1:

By Lem. 3, n^{\alpha_{d}-1}\cdot\mathcal{T}^{\bm{l,d}}_{n,i,j,d} converges in distribution to zero for all \bm{d}\neq(d,\dots,d). Therefore, in these cases, by Slutsky's theorem, n^{\alpha_{d}-1}\cdot\mathcal{Q}^{\bm{d}}_{n,j}\cdot\mathcal{T}^{\bm{l,d}}_{n,i,j,d} converges to zero in distribution. On the other hand, for each \bm{l}\in[L]^{\alpha_{d}}, d\in[H] and \bm{d}=(d,\dots,d), by Lem. 3, we have:

n^{\alpha_{d}-1}\cdot\mathcal{Q}^{\bm{d}}_{n,j}\cdot\mathcal{T}^{\bm{l}}_{n,i,j,d}\stackrel{d}{\longrightarrow}\mathcal{Q}^{\bm{d}}_{j}\cdot\mathcal{T}^{\bm{l}}_{i,j,d} (59)

In particular,

n^{\max(\alpha_{d}-1,0)}\sum_{\bm{l}\in[L]^{\alpha_{d}}}\sum_{\bm{d}\in[H]^{\alpha_{d}}}\mathcal{Q}^{\bm{d}}_{n,j}\cdot\mathcal{T}^{\bm{l,d}}_{n,i,j,d}\stackrel{d}{\longrightarrow}\sum_{\bm{l}\in[L]^{\alpha_{d}}}\sum_{d\in[H]}\mathcal{Q}^{d}_{j}\cdot\mathcal{T}^{\bm{l}}_{i,j,d} (60)

Consider the case where r\geq H. In this case, consider any \alpha_{1},\dots,\alpha_{H}, such that there are t\geq 1 indices i\in[H] with \alpha_{i}=0. The following random variable converges in distribution:

X_{n}:=n^{r-(H-t)}\cdot\prod^{H}_{d=1}\sum_{\bm{l}\in[L]^{\alpha_{d}}}\sum_{\bm{d}\in[H]^{\alpha_{d}}}\mathcal{Q}^{\bm{d}}_{n,j}\cdot\mathcal{T}^{\bm{l,d}}_{n,i,j,d} (61)

Therefore, by Slutsky's theorem:

n^{r-H}\cdot\prod^{H}_{d=1}\sum_{\bm{l}\in[L]^{\alpha_{d}}}\sum_{\bm{d}\in[H]^{\alpha_{d}}}\mathcal{Q}^{\bm{d}}_{n,j}\cdot\mathcal{T}^{\bm{l,d}}_{n,i,j,d}=n^{-t}\cdot X_{n}\stackrel{d}{\longrightarrow}0 (62)

We have:

n^{r-H}\cdot\left\langle\nabla^{(r)}_{w}h_{i},(\nabla_{w}h_{j})^{r}\right\rangle (63)
=n^{r-H}\sum_{\begin{subarray}{c}\alpha_{1}+\dots+\alpha_{H}=r\\ \alpha_{1},\dots,\alpha_{H}\geq 0\end{subarray}}\frac{r!}{\alpha_{1}!\cdots\alpha_{H}!}\cdot z_{i}\cdot\left[\prod^{H-1}_{d=1}\dot{\phi}(g^{d}_{i})\right]\cdot\prod^{H}_{d=1}\sum_{\bm{l}\in[L]^{\alpha_{d}}}\sum_{\bm{d}\in[H]^{\alpha_{d}}}\left(\prod^{\alpha_{d}}_{k=1}h^{d_{k}}_{j}\right)\mathcal{T}^{\bm{l,d}}_{n,i,j,d}
=\sum_{\begin{subarray}{c}\alpha_{1}+\dots+\alpha_{H}=r\\ \alpha_{1},\dots,\alpha_{H}\geq 0\end{subarray}}\frac{r!}{\alpha_{1}!\cdots\alpha_{H}!}\cdot z_{i}\cdot\left[\prod^{H-1}_{d=1}\dot{\phi}(g^{d}_{i})\right]\cdot\prod^{H}_{d=1}n^{\alpha_{d}-1}\sum_{\bm{l}\in[L]^{\alpha_{d}}}\sum_{\bm{d}\in[H]^{\alpha_{d}}}\mathcal{Q}^{\bm{d}}_{n,j}\cdot\mathcal{T}^{\bm{l,d}}_{n,i,j,d}
\stackrel{d}{\longrightarrow}\sum_{\begin{subarray}{c}\alpha_{1}+\dots+\alpha_{H}=r\\ \alpha_{1},\dots,\alpha_{H}\geq 1\end{subarray}}\frac{r!}{\alpha_{1}!\cdots\alpha_{H}!}\cdot z_{i}\cdot\left[\prod^{H-1}_{d=1}\textnormal{sgn}(\mathcal{U}^{d}_{i})\right]\cdot\prod^{H}_{d=1}\sum_{\bm{l}\in[L]^{\alpha_{d}}}\mathcal{Q}^{d}_{j}\cdot\mathcal{T}^{\bm{l}}_{i,j,d}

which is a non-constant random variable.

Next, we consider the case when r\leq H. By Lem. 3, for any \alpha_{d}\geq 2, the term \mathcal{T}^{\bm{l,d}}_{n,i,j,d} tends to zero as n\to\infty. In addition, \mathcal{Q}^{\bm{d}}_{n,j} converges in distribution. Therefore, for any \alpha_{d}\geq 2, we have:

\sum_{\bm{l}\in[L]^{\alpha_{d}}}\sum_{\bm{d}\in[H]^{\alpha_{d}}}\mathcal{Q}^{\bm{d}}_{n,j}\cdot\mathcal{T}^{\bm{l,d}}_{n,i,j,d}\stackrel{d}{\longrightarrow}0 (64)

Hence, for any \alpha_{1},\dots,\alpha_{H}\geq 0, such that there is at least one \alpha_{d}\geq 2, we have:

\prod^{H}_{d=1}\sum_{\bm{l}\in[L]^{\alpha_{d}}}\sum_{\bm{d}\in[H]^{\alpha_{d}}}\mathcal{Q}^{\bm{d}}_{n,j}\cdot\mathcal{T}^{\bm{l,d}}_{n,i,j,d}\stackrel{d}{\longrightarrow}0 (65)

On the other hand, for any 0\leq\alpha_{1},\dots,\alpha_{H}\leq 1, the terms \{\mathcal{T}^{\bm{l,i,d}}_{n,i,d}\}, \{\dot{\phi}(g^{d}_{i})\} and \{\mathcal{Q}^{\bm{d}}_{n,j}\} converge jointly in distribution to some random variables \{\mathcal{T}^{\bm{l,i,d}}_{i,d}\}, \{\textnormal{sgn}(\mathcal{U}^{d}_{i})\} and \{\mathcal{Q}^{\bm{d}}_{j}\} as n\to\infty. Hence,

\langle\nabla^{(r)}_{w}h_{i},(\nabla_{w}h_{j})^{r}\rangle\stackrel{d}{\longrightarrow}\sum_{\begin{subarray}{c}\alpha_{1}+\dots+\alpha_{H}=r\\ 0\leq\alpha_{1},\dots,\alpha_{H}\leq 1\end{subarray}}r!\cdot z_{i}\cdot\left[\prod^{H-1}_{d=1}\textnormal{sgn}(\mathcal{U}^{d}_{i})\right]\cdot\prod^{H}_{d=1}\sum_{\bm{l}\in[L]^{\alpha_{d}}}\sum_{\bm{d}\in[H]^{\alpha_{d}}}\mathcal{Q}^{\bm{d}}_{j}\cdot\mathcal{T}^{\bm{l,d}}_{i,j,d} (66)

which is a non-constant random variable. ∎

D.4 Proofs of the Results in Sec. 4

See 2

Proof.

By [41], taking the width n=\min(n_{1},...,n_{L-1}) to infinity, the outputs V^{d}(x;w):=f^{d}(x;w) are governed by a centered Gaussian process, such that the entries V^{d}_{i,j}(x;w), given some input x, are independent and identically distributed. Moreover, it holds that:

\Big{(}V^{d}_{i,j}(x;w),V^{d}_{i,j}(x^{\prime};w)\Big{)}\sim\mathcal{N}\left(0,S^{L}(x,x^{\prime})\right). (67)

with S^{L}(x,x^{\prime}) as defined in Eq. 9. For the function h(u;w)=g(z;f(x;w)), it holds for the first layer:

g^{1}(z;f(x;w))=\sqrt{\frac{1}{m_{0}}}V^{1}(x;w)z (68)

After taking the limit n=\min(n_{1},...,n_{L-1}) to infinity, the implicit network g is fed with Gaussian distributed weights, and so g^{1}(z;f(x;w)) also converges to a Gaussian process, such that:

(g^{1}(z;f(x;w))_{i},g^{1}(z^{\prime};f(x^{\prime};w))_{i})\sim\mathcal{N}(0,\Lambda^{1}) (69)

where:

\Lambda^{1}=\frac{1}{m_{0}}\begin{pmatrix}S^{L}(x,x)z^{\top}z&S^{L}(x^{\prime},x)z^{\prime\top}z\\ S^{L}(x,x^{\prime})z^{\top}z^{\prime}&S^{L}(x^{\prime},x^{\prime})z^{\prime\top}z^{\prime}\end{pmatrix} (70)

In a similar fashion to the standard feed-forward case, the pre-activations g^{l}(z;f(x;w)) converge to Gaussian processes as we let m=\min(m_{1},...,m_{H-1}) tend to infinity, with a covariance defined recursively:

\Sigma^{l}(u,u^{\prime})=\sqrt{2}\operatorname*{\mathbb{E}}_{(u,v)\sim\mathcal{N}(0,\Lambda^{l})}[\sigma(u)\sigma(v)] (71)

where,

\Lambda^{l}=\begin{pmatrix}S^{L}(x,x)\cdot\Sigma^{l-1}(u,u)&S^{L}(x^{\prime},x)\cdot\Sigma^{l-1}(u^{\prime},u)\\ S^{L}(x,x^{\prime})\cdot\Sigma^{l-1}(u,u^{\prime})&S^{L}(x^{\prime},x^{\prime})\cdot\Sigma^{l-1}(u^{\prime},u^{\prime})\end{pmatrix} (72)

and

\Sigma^{0}(z,z^{\prime})=\frac{1}{m_{0}}z^{\top}z^{\prime} (73)

proving the claim. ∎
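As a sanity check of the first-layer statement in Eqs. 68-70 (our illustration; the 2x2 matrix standing in for S^{L} and all sizes are made up), one can sample the entries of V^{1}(x;w) and V^{1}(x^{\prime};w) as correlated Gaussian pairs and compare the empirical covariance of (g^{1}(z)_{i},g^{1}(z^{\prime})_{i}) with \Lambda^{1}.

    import numpy as np

    rng = np.random.default_rng(0)
    S = np.array([[1.0, 0.7],
                  [0.7, 1.2]])          # made-up values standing in for S^L(x,x), S^L(x,x'), S^L(x',x')
    m0, m1, trials = 5, 3, 100_000
    z, zp = rng.standard_normal(m0), rng.standard_normal(m0)

    Lchol = np.linalg.cholesky(S)
    # each entry of V^1 is an i.i.d. pair (V_ij(x), V_ij(x')) with covariance S
    pairs = rng.standard_normal((trials, m1, m0, 2)) @ Lchol.T
    V, Vp = pairs[..., 0], pairs[..., 1]
    g1 = V @ z / np.sqrt(m0)             # g^1(z; f(x; w)), as in Eq. 68
    g1p = Vp @ zp / np.sqrt(m0)

    empirical = np.array([[np.mean(g1 * g1),  np.mean(g1p * g1)],
                          [np.mean(g1 * g1p), np.mean(g1p * g1p)]])
    Lambda1 = (1 / m0) * np.array([[S[0, 0] * z @ z,  S[1, 0] * zp @ z],
                                   [S[0, 1] * z @ zp, S[1, 1] * zp @ zp]])
    print("empirical covariance:\n", empirical)
    print("Lambda^1 from Eq. 70:\n", Lambda1)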

See 1

Proof.

We prove that \Lambda^{l}(u,u^{\prime}) is a function of S^{0}(x,x^{\prime}) and \Sigma^{0}(u,u^{\prime}) by induction. First, we note that \Lambda^{1}(u,u^{\prime}) is a function of S^{L}(x,x^{\prime}) and \Sigma^{0}(u,u^{\prime}) by definition. By the recursive definition of S^{L}(x,x^{\prime}), it is a function of S^{0}(x,x^{\prime}). Therefore, \Lambda^{1}(u,u^{\prime}) can be represented as a function of S^{0}(x,x^{\prime}) and \Sigma^{0}(u,u^{\prime}). We assume by induction that \Lambda^{l}(u,u^{\prime}) is a function of S^{0}(x,x^{\prime}) and \Sigma^{0}(u,u^{\prime}), and we would like to show that \Lambda^{l+1}(u,u^{\prime}) is a function of S^{0}(x,x^{\prime}) and \Sigma^{0}(u,u^{\prime}). By definition, \Lambda^{l+1}(u,u^{\prime}) is a function of S^{L}(x,x^{\prime}) and \Sigma^{l}(u,u^{\prime}). In addition, \Sigma^{l}(u,u^{\prime}) is a function of \Lambda^{l}(u,u^{\prime}), and hence, by the induction hypothesis, a function of S^{0}(x,x^{\prime}) and \Sigma^{0}(u,u^{\prime}). Since S^{L}(x,x^{\prime}) is a function of S^{0}(x,x^{\prime}), we conclude that \Lambda^{l+1}(u,u^{\prime}) can be represented as a function of S^{0}(x,x^{\prime}) and \Sigma^{0}(u,u^{\prime}). ∎

See 2

Proof.

We note that:

\Sigma^{0}(p(z),p(z^{\prime}))=\frac{1}{k}p(z)^{\top}p(z^{\prime})=\frac{1}{k}\sum^{k}_{i=1}\cos(W^{1}_{i}z+b^{1}_{i})\cos(W^{1}_{i}z^{\prime}+b^{1}_{i}) (74)

By Thm. 1 in [28], we have:

\lim_{k\to\infty}\Sigma^{0}(p(z),p(z^{\prime}))=\exp[-\|z-z^{\prime}\|^{2}_{2}/2]/2 (75)

which is a function of \exp[\|z-z^{\prime}\|^{2}_{2}], as desired. ∎
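Numerically, Eq. 75 can be checked with the random-features construction of [28]; in the sketch below we assume, as in that construction, that the rows W^{1}_{i} are drawn from a standard Gaussian and the biases b^{1}_{i} uniformly from [0,2\pi] (these sampling distributions are an assumption on our part, since they are not restated above).

    import numpy as np

    def sigma0(z, zp, k, seed=0):
        # (1/k) * sum_i cos(W_i z + b_i) cos(W_i z' + b_i), as in Eq. 74
        rng = np.random.default_rng(seed)
        W = rng.standard_normal((k, z.shape[0]))   # assumed: rows W^1_i ~ N(0, I)
        b = rng.uniform(0.0, 2 * np.pi, size=k)    # assumed: b^1_i ~ Uniform[0, 2*pi]
        return float(np.mean(np.cos(W @ z + b) * np.cos(W @ zp + b)))

    z = np.array([0.3, -0.5, 0.1])
    zp = np.array([0.0, 0.4, -0.2])
    target = np.exp(-np.sum((z - zp) ** 2) / 2) / 2   # the limit in Eq. 75
    for k in [100, 10_000, 1_000_000]:
        print(k, sigma0(z, zp, k), "limit:", target)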

We make use of the following lemma in the proof of Thm. 3.

Lemma 6.

Recall the parametrization of the implicit network:

\begin{cases}g^{l}_{i}:=g^{l}(z_{i};v)=\sqrt{\frac{1}{m_{l-1}}}f^{l}(x_{i};w)\cdot a_{i}^{l-1}\\ a^{l}_{i}:=a^{l}(z_{i};v)=\sqrt{2}\cdot\sigma(g^{l}_{i})\end{cases}\textnormal{ and }a^{0}_{i}:=z_{i} (76)

For any pair u_{i}=(x_{i},z_{i}), we denote:

P_{i}^{l_{1}\rightarrow l_{2}}=\prod_{l=l_{1}}^{l_{2}-1}\left(\sqrt{\frac{2}{m_{l}}}V^{l+1}(x_{i};w)\cdot Z^{l}(z_{i})\right)\textnormal{ and }Z^{l}(z)=\textnormal{diag}(\dot{\sigma}(g^{l}(z))) (77)

It holds that:

  1.

    P_{i}^{l_{1}\rightarrow l_{2}}(P_{j}^{l_{1}\rightarrow l_{2}})^{\top}\stackrel{d}{\longrightarrow}\prod_{l=l_{1}}^{l_{2}-1}\dot{\Sigma}^{l}(u_{i},u_{j})I.

  2.

    \frac{\partial h(u_{i},w)}{\partial v}\cdot\frac{\partial^{\top}h(u_{j},w)}{\partial v}\stackrel{d}{\longrightarrow}\sum_{l=0}^{H-1}\left(\Sigma^{l}(u_{i},u_{j})\prod_{h=l+1}^{H-1}\dot{\Sigma}^{h}(u_{i},u_{j})\right).

where the limits are taken with respect to m,n\to\infty sequentially.

Proof.

We have:

P_{i}^{l_{1}\rightarrow l_{2}}(P_{j}^{l_{1}\rightarrow l_{2}})^{\top} (78)
=P_{i}^{l_{1}\rightarrow l_{2}-1}\frac{2}{m_{l_{2}-1}}V^{l_{2}}(x_{i};w)\cdot Z^{l_{2}-1}(z_{i})Z^{l_{2}-1}(z_{j})V^{l_{2}}(x_{j};w)^{\top}(P_{j}^{l_{1}\rightarrow l_{2}-1})^{\top}

Note that when m,n\to\infty sequentially, we have:

\frac{2}{m_{l_{2}-1}}V^{l_{2}}(x_{i};w)\cdot Z^{l_{2}-1}(z_{i})Z^{l_{2}-1}(z_{j})V^{l_{2}}(x_{j};w)^{\top} (79)
\stackrel{d}{\longrightarrow}\sqrt{2}\operatorname*{\mathbb{E}}_{(u,v)\sim\mathcal{N}(0,\Lambda^{l_{2}})}[\dot{\sigma}(u)\dot{\sigma}(v)]I=\dot{\Sigma}^{l_{2}}(u_{i},u_{j})I

Applying the above recursively proves the first claim. Using the first claim, along with the derivation of the neural tangent kernel (see [1]) proves the second claim. ∎
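The limit in the second item of Lem. 6 is a plain sum of products of the covariances \Sigma^{l} and \dot{\Sigma}^{h}; the helper below (a small utility we add, evaluated on made-up numbers) computes it from lists of these values.

    import numpy as np

    def theta_g(sigmas, sigma_dots):
        # sum_{l=0}^{H-1} Sigma^l(u_i, u_j) * prod_{h=l+1}^{H-1} Sigma_dot^h(u_i, u_j)
        # sigmas[l] holds Sigma^l for l = 0, ..., H-1; sigma_dots[h] holds Sigma_dot^h
        # (index 0 of sigma_dots is a placeholder and never used).
        H = len(sigmas)
        return sum(sigmas[l] * np.prod(sigma_dots[l + 1:H]) for l in range(H))

    # made-up covariance values for a primary network of depth H = 3
    print(theta_g([1.0, 0.62, 0.41], [1.0, 0.55, 0.48]))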

See 3

Proof.

Recall that v=\textnormal{vec}(f(x))=[\textnormal{vec}(V^{1}),...,\textnormal{vec}(V^{H})] is the concatenation of the generated weights into a single vector of length \sum_{l=0}^{H-1}m_{l}\cdot m_{l+1}. The components of the inner matrix \mathcal{K}^{f}(x,x^{\prime}) are given by:

\mathcal{K}^{f}(x,x^{\prime})(i,j)=\sum_{l=1}^{L}\left\langle\frac{\partial v_{i}(x)}{\partial W^{l}},\frac{\partial v_{j}(x^{\prime})}{\partial W^{l}}\right\rangle (80)

and it holds that in the infinite width limit, \mathcal{K}^{f}(x,x^{\prime}) is a diagonal matrix:

\mathcal{K}^{f}(x,x^{\prime})\stackrel{d}{\longrightarrow}\Theta^{f}(x,x^{\prime})\cdot I (81)

By letting the widths n and m tend to infinity sequentially, by Lem. 6, it follows that:

\frac{\partial h(u;w)}{\partial v}\cdot\frac{\partial^{\top}h(u^{\prime};w)}{\partial v}\stackrel{d}{\longrightarrow}\Theta^{g}(u,u^{\prime},S^{L}(x,x^{\prime})) (82)

Since \mathcal{K}^{f}(x,x^{\prime})=\frac{\partial f(x;w)}{\partial w}\cdot\frac{\partial^{\top}f(x^{\prime};w)}{\partial w} converges to the diagonal matrix \Theta^{f}(x,x^{\prime})\cdot I, the limit of \mathcal{K}^{h}(u,u^{\prime}) is given by:

\mathcal{K}^{h}(u,u^{\prime})=\frac{\partial g(z;f(x;w))}{\partial f(x;w)}\cdot\frac{\partial f(x;w)}{\partial w}\cdot\frac{\partial^{\top}f(x^{\prime};w)}{\partial w}\cdot\frac{\partial^{\top}g(z^{\prime};f(x^{\prime};w))}{\partial f(x^{\prime};w)} (83)
=\frac{\partial h(u;w)}{\partial v}\cdot\frac{\partial f(x;w)}{\partial w}\cdot\frac{\partial^{\top}f(x^{\prime};w)}{\partial w}\cdot\frac{\partial^{\top}h(u^{\prime};w)}{\partial v}
\stackrel{d}{\longrightarrow}\Theta^{f}(x,x^{\prime})\cdot\Theta^{g}(u,u^{\prime},S^{L}(x,x^{\prime}))

where we used the results of Lem. 6.
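In code, the factorized limit of Eq. 83 is just a product of two scalar kernels; the sketch below uses placeholder functions theta_f, s_l and theta_g_fn (hypothetical stand-ins for the actual NTK and GP limits, which this snippet does not compute) purely to show how the hyperkernel on pairs u=(x,z) would be assembled.

    import numpy as np

    # Placeholder kernels standing in for the limits Theta^f, S^L and Theta^g(., ., s).
    # These are NOT the recursions of the paper; they are hypothetical surrogates used
    # only to show how the factorized hyperkernel of Eq. 83 would be assembled.
    def theta_f(x, xp):
        return float(x @ xp)

    def s_l(x, xp):
        return float(x @ xp)

    def theta_g_fn(z, zp, s):
        return s * float(z @ zp)

    def hyperkernel(u, up):
        # K^h(u, u') -> Theta^f(x, x') * Theta^g(u, u', S^L(x, x'))
        (x, z), (xp, zp) = u, up
        return theta_f(x, xp) * theta_g_fn(z, zp, s_l(x, xp))

    x, xp = np.array([1.0, 0.2]), np.array([0.4, -0.3])
    z, zp = np.array([0.5, 0.5]), np.array([-0.1, 0.9])
    print(hyperkernel((x, z), (xp, zp)))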

Next, we would like to prove that \frac{\partial\mathcal{K}^{h}(u,u^{\prime})}{\partial t}\Big{|}_{t=0}=0. For this purpose, we write the derivative explicitly:

\frac{\partial\mathcal{K}^{h}(u,u^{\prime})}{\partial t}=\frac{\partial h(u;w)}{\partial w}\cdot\frac{\partial}{\partial t}\frac{\partial^{\top}h(u^{\prime};w)}{\partial w}+\frac{\partial}{\partial t}\frac{\partial h(u;w)}{\partial w}\cdot\frac{\partial^{\top}h(u^{\prime};w)}{\partial w} (84)

We notice that the two terms are the same up to interchanging the inputs u and u^{\prime}. Therefore, with no loss of generality, it suffices to prove the convergence of the second term. We have:

\frac{\partial}{\partial t}\frac{\partial h(u;w)}{\partial w}\cdot\frac{\partial^{\top}h(u^{\prime};w)}{\partial w} (85)
=\left[\frac{\partial}{\partial t}\left(\frac{\partial h(u;w)}{\partial f(x;w)}\cdot\frac{\partial f(x;w)}{\partial w}\right)\right]\cdot\frac{\partial^{\top}h(u^{\prime};w)}{\partial w}
=\left[\frac{\partial^{2}h(u;w)}{\partial f(x;w)\partial t}\cdot\frac{\partial f(x;w)}{\partial w}+\frac{\partial h(u;w)}{\partial f(x;w)}\cdot\frac{\partial^{2}f(x;w)}{\partial w\partial t}\right]\cdot\frac{\partial^{\top}h(u^{\prime};w)}{\partial w}
=\frac{\partial^{2}h(u;w)}{\partial f(x;w)\partial t}\cdot\frac{\partial f(x;w)}{\partial w}\cdot\frac{\partial^{\top}h(u^{\prime};w)}{\partial w}+\frac{\partial h(u;w)}{\partial f(x;w)}\cdot\frac{\partial^{2}f(x;w)}{\partial w\partial t}\cdot\frac{\partial^{\top}h(u^{\prime};w)}{\partial w}

We analyze each term separately.

Analyzing the first term

By substituting \frac{\partial}{\partial t}=-\mu\nabla_{w}c(w)\frac{\partial^{\top}}{\partial w}=-\mu\nabla_{w}c(w)\frac{\partial^{\top}f}{\partial w}\frac{\partial^{\top}}{\partial f}, we have:

\frac{\partial^{2}h(u;w)}{\partial f(x;w)\partial t}\cdot\frac{\partial f(x;w)}{\partial w}\cdot\frac{\partial^{\top}h(u^{\prime};w)}{\partial w} (86)
=-\mu\nabla_{w}c(w)\frac{\partial^{\top}f(x;w)}{\partial w}\cdot\frac{\partial^{2}h(u;w)}{\partial f(x;w)\partial f(x;w)}\cdot\frac{\partial f(x;w)}{\partial w}\cdot\frac{\partial^{\top}f(x^{\prime};w)}{\partial w}\cdot\frac{\partial^{\top}h(u^{\prime};w)}{\partial f(x^{\prime};w)}
=-\mu\nabla_{w}c(w)\frac{\partial^{\top}f(x;w)}{\partial w}\frac{\partial^{2}h(u;w)}{\partial f(x;w)\partial f(x;w)}\mathcal{K}^{f}(x,x^{\prime})\cdot\frac{\partial^{\top}h(u^{\prime};w)}{\partial f(x^{\prime};w)}
=-\mu\sum^{N}_{i=1}\frac{\partial\ell(h(u_{i};w),y_{i})}{\partial h(u_{i};w)}\cdot\frac{\partial h(u_{i};w)}{\partial f(x_{i};w)}\cdot\mathcal{K}^{f}(x,x_{i})\cdot\frac{\partial^{2}h(u;w)}{\partial f(x;w)\partial f(x;w)}\cdot\mathcal{K}^{f}(x,x^{\prime})\cdot\frac{\partial^{\top}h(u^{\prime};w)}{\partial f(x^{\prime};w)}

It then follows:

\lim_{n\to\infty}\frac{\partial^{2}h(u;w)}{\partial f(x;w)\partial t}\cdot\frac{\partial f(x;w)}{\partial w}\cdot\frac{\partial^{\top}h(u^{\prime};w)}{\partial w} (87)
=-\mu\sum^{N}_{i=1}\ell_{i}\cdot\Theta^{f}(x,x_{i})\cdot\Theta^{f}(x,x^{\prime})\lim_{n\to\infty}\frac{\partial h(u_{i};w)}{\partial f(x_{i};w)}\cdot\frac{\partial^{2}h(u;w)}{\partial f(x;w)\partial f(x;w)}\cdot\frac{\partial h(u^{\prime};w)}{\partial f(x^{\prime};w)}

where \ell_{i} denotes the limit of \frac{\partial\ell(h(u_{i};w),y_{i})}{\partial h(u_{i};w)}. We notice that:

\lim_{n\to\infty}\frac{\partial^{\top}h(u_{i};w)}{\partial f(x_{i};w)}\cdot\frac{\partial^{2}h(u;w)}{\partial f(x;w)\partial f(x;w)}\cdot\frac{\partial h(u^{\prime};w)}{\partial f(x^{\prime};w)} (88)
=\sum_{l_{1},l_{2}}\lim_{n\to\infty}\left\langle\frac{\partial^{2}h(u;w)}{\partial f^{l_{1}}(x;w)\partial f^{l_{2}}(x;w)},\frac{\partial h(u_{i};w)}{\partial f^{l_{1}}(x_{i};w)}\otimes\frac{\partial h(u^{\prime};w)}{\partial f^{l_{2}}(x^{\prime};w)}\right\rangle
:=\sum_{l_{1},l_{2}}\mathcal{T}^{l_{1},l_{2}}_{m}(u,u_{i},u^{\prime})

We recall that f^{l}(x;w) converges to a GP (as a function of x) as n\to\infty [19]. Therefore, the terms \mathcal{T}^{l_{1},l_{2}}_{m}(u,u_{i},u^{\prime}) are special cases of the terms \mathcal{T}^{\bm{l,i,d}}_{n,i,d} (see Eq. 25), with weights that are distributed according to a GP instead of a normal distribution. In this case, we have k=2 and d=d_{1}=\dots=d_{k}=1, the neural network f^{1} is replaced with h, and the weights W^{l} are translated into f^{l}(x;w). We recall that the proof of Lem. 3, showing that \mathcal{T}^{\bm{l,i,d}}_{n,i,d}=\mathcal{O}_{p}(1/n^{k-1}), is based solely on Lem. 2. Since Lem. 6 extends Lem. 2 to our case, the proof of Lem. 3 can be applied to show that \mathcal{T}^{l_{1},l_{2}}_{m}(u,u_{i},u^{\prime})\sim 1/m.

Analyzing the second term

We would like to show that for any m>0, we have:

\frac{\partial h(u;w)}{\partial f(x;w)}\cdot\frac{\partial^{2}f(x;w)}{\partial w\partial t}\cdot\frac{\partial^{\top}h(u^{\prime};w)}{\partial w}\stackrel{d}{\longrightarrow}0 (89)

as n\to\infty. Since \frac{\partial w}{\partial t}=-\mu\nabla_{w}c(w), we have:

\frac{\partial h(u;w)}{\partial f(x;w)}\cdot\frac{\partial^{2}f(x;w)}{\partial w\partial t}\cdot\frac{\partial^{\top}h(u^{\prime};w)}{\partial w} (90)
=-\mu\cdot\frac{\partial h(u;w)}{\partial f(x;w)}\cdot\nabla_{w}c(w)\cdot\frac{\partial^{2}f(x;w)}{\partial w^{2}}\cdot\frac{\partial^{\top}h(u^{\prime};w)}{\partial w}
=-\mu\cdot\frac{\partial h(u;w)}{\partial f(x;w)}\cdot\nabla_{w}c(w)\cdot\frac{\partial^{2}f(x;w)}{\partial w^{2}}\cdot\frac{\partial^{\top}f(x^{\prime};w)}{\partial w}\cdot\frac{\partial^{\top}h(u^{\prime};w)}{\partial f(x^{\prime};w)}

In addition, we have:

\nabla_{w}c(w)=\sum^{N}_{i=1}\frac{\partial\ell(h(u_{i};w),y_{i})}{\partial h(u_{i};w)}\cdot\frac{\partial h(u_{i};w)}{\partial w} (91)

We note that \frac{\partial\ell(h(u_{i};w),y_{i})}{\partial h(u_{i};w)} converges in distribution as m,n\to\infty. Therefore, we can simply analyze the convergence of:

\sum^{N}_{i=1}\frac{\partial h(u;w)}{\partial f(x;w)}\cdot\frac{\partial h(u_{i};w)}{\partial w}\cdot\frac{\partial^{2}f(x;w)}{\partial w^{2}}\cdot\frac{\partial^{\top}f(x^{\prime};w)}{\partial w}\cdot\frac{\partial^{\top}h(u^{\prime};w)}{\partial f(x^{\prime};w)} (92)

Since N is a constant, it is enough to show that each term converges to zero. We have:

\frac{\partial h(u;w)}{\partial f(x;w)}\cdot\frac{\partial h(u_{i};w)}{\partial w}\cdot\frac{\partial^{2}f(x;w)}{\partial w^{2}}\cdot\frac{\partial^{\top}f(x^{\prime};w)}{\partial w}\cdot\frac{\partial^{\top}h(u^{\prime};w)}{\partial f(x^{\prime};w)} (93)
=\frac{\partial h(u;w)}{\partial f(x;w)}\cdot\frac{\partial h(u_{i};w)}{\partial f(x_{i};w)}\cdot\frac{\partial f(x_{i};w)}{\partial w}\cdot\frac{\partial^{2}f(x;w)}{\partial w^{2}}\cdot\frac{\partial^{\top}f(x^{\prime};w)}{\partial w}\cdot\frac{\partial^{\top}h(u^{\prime};w)}{\partial f(x^{\prime};w)}
=\sum_{l,j,k}\frac{\partial h(u;w)}{\partial f(x;w)_{l}}\cdot\frac{\partial h(u_{i};w)}{\partial f(x_{i};w)_{j}}\cdot\frac{\partial f(x_{i};w)_{j}}{\partial w}\cdot\frac{\partial^{2}f(x;w)_{l}}{\partial w^{2}}\cdot\frac{\partial^{\top}f(x^{\prime};w)_{k}}{\partial w}\cdot\frac{\partial^{\top}h(u^{\prime};w)}{\partial f(x^{\prime};w)_{k}}

where f(x;w)_{j} is the j'th output of f over x. In addition, the summation is over the indices of the corresponding tensors. We note that for any m>0, the number of indices l,j,k is finite. We would like to show that each term in the sum tends to zero as n\to\infty. We can write:

\frac{\partial f(x_{i};w)_{j}}{\partial w}\cdot\frac{\partial^{2}f(x;w)_{l}}{\partial w^{2}}\cdot\frac{\partial^{\top}f(x^{\prime};w)_{k}}{\partial w}=\left\langle\frac{\partial^{2}f(x;w)_{l}}{\partial w^{2}},\frac{\partial f(x_{i};w)_{j}}{\partial w}\otimes\frac{\partial f(x^{\prime};w)_{k}}{\partial w}\right\rangle (94)

By Lem. 3, the term in Eq. 94 tends to zero as n\to\infty. In addition, it is easy to see that \frac{\partial h(u;w)}{\partial f(x;w)_{l}}, \frac{\partial h(u_{i};w)}{\partial f(x_{i};w)_{j}} and \frac{\partial^{\top}h(u^{\prime};w)}{\partial f(x^{\prime};w)_{k}} converge to some random variables. Therefore, for any fixed m>0, the above sum converges to zero as n\to\infty. ∎