On Infinite-Width Hypernetworks
Abstract
Hypernetworks are architectures that produce the weights of a task-specific primary network. A notable application of hypernetworks in the recent literature involves learning to output functional representations. In these scenarios, the hypernetwork learns a representation corresponding to the weights of a shallow MLP, which typically encodes shape or image information. While such representations have seen considerable success in practice, they lack the theoretical guarantees enjoyed by standard architectures in the wide regime. In this work, we study wide over-parameterized hypernetworks. We show that, unlike typical architectures, infinitely wide hypernetworks do not guarantee convergence to a global minimum under gradient descent. We further show that convexity can be achieved by increasing the dimensionality of the hypernetwork's output, so that it represents wide MLPs. In the dually infinite-width regime, we identify the functional priors of these architectures by deriving their corresponding GP and NTK kernels, the latter of which we refer to as the hyperkernel. As part of this study, we make a mathematical contribution by deriving tight bounds on high order Taylor expansion terms of standard fully connected ReLU networks.
1 Introduction
![Illustration of the hypernetwork framework](https://cdn.awesomepapers.org/papers/03d1bfcf-8e9f-45fc-80a0-e138fccf79f2/hypernetwork.png)
In this work, we analyze the training dynamics of over-parameterized meta networks, which are networks that output the weights of other networks, often referred to as hypernetworks. In the typical framework, a function h(x, z) involves two networks, f and g. The hypernetwork f takes the input x (typically an image) and returns the weights of the primary network g, which then takes the input z and returns the output of h.
The literature on hypernetworks is roughly divided into two main categories. In the functional representation literature [22, 33, 39, 30, 16], the input to the hypernetwork is typically an image. For shape reconstruction tasks, the primary network g represents the shape via a signed distance field, where the inputs z are coordinates in 3D space. In image completion tasks, the inputs to g are image coordinates, and the output is the corresponding pixel intensity. In these settings, f is typically a large network and g is typically a shallow fully connected network.
In the second category [4, 23, 36, 44], hypernetworks are typically used for hyper-parameter search, where the input x is treated as a hyperparameter descriptor and is optimized alongside the network's weights. In this paper, we consider models corresponding to the first group of methods.
Following a prominent thread in the recent literature, our study takes place in the regime of wide networks. [11] recently showed that, when the width of the network approaches infinity, the gradient-descent training dynamics of a fully connected network can be characterized by a kernel, called the Neural Tangent Kernel (or NTK for short). In other words, as the width of each layer approaches infinity, provided proper scaling and initialization of the weights, it holds that:

$$\left\langle \frac{\partial f(x;\theta)}{\partial \theta}, \frac{\partial f(x';\theta)}{\partial \theta} \right\rangle \longrightarrow \Theta(x, x') \qquad (1)$$

as the width n of f tends to infinity. Here, θ are the weights of the network f. As shown in [11], as the width tends to infinity, when minimizing the squared loss using gradient descent, the evolution through time of the function computed by the network follows the dynamics of kernel gradient descent with the kernel Θ. To prove this phenomenon, various papers [20, 3, 2] introduce a Taylor expansion of the network output around the point of initialization and consider its terms. It is shown that the first-order term is deterministic during the SGD optimization and the higher-order terms converge to zero as the width tends to infinity.
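To make Eq. 1 concrete, the following is a minimal sketch (not taken from the paper) of how the empirical NTK of a small fully connected network can be estimated with automatic differentiation; the architecture, width, and input dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(10, 512), nn.ReLU(), nn.Linear(512, 1))

def param_gradient(x):
    # Flattened gradient of the scalar output with respect to all parameters.
    out = net(x).squeeze()
    grads = torch.autograd.grad(out, list(net.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

x1, x2 = torch.randn(1, 10), torch.randn(1, 10)
# Empirical (finite-width) kernel value; Eq. 1 states that this inner product
# converges to a deterministic kernel Theta(x1, x2) as the width grows.
print(param_gradient(x1) @ param_gradient(x2))
```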
A natural question that arises when considering hypernetworks is whether a similar “wide” regime exists, where trained and untrained networks may be functionally approximated by kernels. If so, since this architecture involves two networks, the “wide” regime needs a more refined definition, taking into account both networks.
Our contributions:
1. We show that infinitely wide hypernetworks can induce highly non-convex training dynamics under gradient descent. The complexity of the optimization problem depends strongly on the architecture of the primary network g, which may considerably impair the trainability of the architecture if not defined appropriately.
2. However, when the widths of both the hypernetwork and the primary network tend to infinity, the optimization dynamics of the hypernetwork simplifies, and its neural tangent kernel (which we call the hyperkernel) has a well defined infinite-width limit governing the network evolution.
3. We verify our theory empirically and demonstrate the utility of this hyperkernel on several functional representation tasks. Consistent with prior observations on kernel methods, the hypernetwork-induced kernels also outperform a trained hypernetwork when the training data is small.
4. We make a technical contribution by deriving asymptotically tight bounds on high order Taylor expansion terms in ReLU MLPs. Our result partially settles a conjecture posed in [6] regarding the asymptotic behavior of general correlation functions.
1.1 Related Works
Hypernetworks
Hypernetworks, first introduced under this name in [8], are networks that generate the weights of a second, primary network that computes the actual task. However, the idea of having one network predict the weights of another was proposed earlier and has reemerged multiple times [15, 29, 13]. The tool can naturally be applied to image representation tasks. In [22], hypernetworks were applied to 3D shape reconstruction from a single image. In [33], hypernetworks were shown to be useful for learning shared image representations. Hypernetworks have also been shown to be effective in non-image domains. For instance, hypernetworks achieve state-of-the-art results on the task of decoding error correcting codes [25].
Several publications consider a different framework, in which the inputs of the hypernetwork are optimized alongside its weights. In this setting, hypernetworks were recently used for continual learning by [36]. Hypernetworks can be used efficiently for neural architecture search, as was demonstrated by [4, 44], where a feedforward regression (with the hypernetwork) replaces direct gradient-based learning of the weights of the primary network while its architecture is being explored. Lorraine et al. applied hypernetworks to hyperparameter selection [23].
Despite their success and increasing prominence, little theoretical work has been done to better understand hypernetworks and their behavior. A recent paper [12] studies the role of multiplicative interactions within a unifying framework that describes a range of classical and modern neural network architectural motifs, such as gating, attention layers, hypernetworks, and dynamic convolutions, amongst others. It is shown that standard neural networks are a strict subset of neural networks with multiplicative interactions. In [7], the authors theoretically study the modular properties of hypernetworks. In particular, they show that, compared to standard embedding methods, hypernetworks are exponentially more expressive when the primary network is of small complexity. In this work, we provide a complementary perspective and show that a shallow primary network is a requirement for successful training. [5] showed that applying standard initializations to a hypernetwork produces a sub-optimal initialization of the primary network; a principled technique for weight initialization in hypernetworks is then developed.
Gaussian Processes and Neural Tangent Kernel
The connection between infinitely wide neural networks, Gaussian processes, and kernel methods has been the focus of many recent papers [11, 19, 40, 42, 32, 31, 38, 37, 26]. Empirical studies have demonstrated the power of the CNTK (convolutional neural tangent kernel) on popular datasets, establishing new state-of-the-art results for kernel methods [1, 43]. [21] showed that for ReLU ResNets [10], NTK convergence can occur even when depth and width simultaneously tend to infinity, provided proper initialization. In this work, we extend the kernel analysis of networks to hypernetworks, and characterize the regime in which the kernels converge and the training dynamics simplify.
2 Setup
In this section, we introduce the setting of the analysis considered in this paper. We begin by defining fully connected neural networks and hypernetworks in the context of the NTK framework.
Neural networks In the NTK framework, a fully connected neural network, f, is defined in the following manner:

$$x^{0} = x, \qquad x^{l} = \sigma\!\left(\frac{1}{\sqrt{n_{l-1}}}\, W^{l} x^{l-1}\right) \;\; (1 \le l < L), \qquad f(x;\theta) = \frac{1}{\sqrt{n_{L-1}}}\, W^{L} x^{L-1}, \qquad (2)$$

where σ is the activation function of f. Throughout the paper, we specifically take σ to be a piece-wise linear function with a finite number of pieces (e.g., the ReLU activation and the Leaky ReLU activation). The weight matrices W^l are trainable variables, initialized independently according to a standard normal distribution, W^l_{ij} ∼ N(0, 1). The width of f is denoted by n. The parameters are aggregated as a long vector θ. The coefficients 1/√(n_{l-1}) serve to normalize the activations of each layer. This parametrization is nonstandard, and we will refer to it as the NTK parameterization. It has already been employed in several recent works [14, 35, 27]. For simplicity, in many cases, we will omit specifying the weights associated with our model.
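As an illustration, the following sketch implements the NTK parameterization above; the widths, depth, and exact placement of the 1/√(fan-in) scaling follow the standard NTK convention and should be read as assumptions rather than the paper's exact code.

```python
import math
import torch

def init_ntk_mlp(dims):
    # dims = [d_in, n_1, ..., n_L]; every entry of every weight matrix is N(0, 1).
    return [torch.randn(dims[i + 1], dims[i]) for i in range(len(dims) - 1)]

def ntk_mlp(x, weights, act=torch.relu):
    h = x
    for l, W in enumerate(weights):
        # The 1/sqrt(fan_in) factor normalizes the activations of each layer.
        h = (W @ h) / math.sqrt(W.shape[1])
        if l < len(weights) - 1:  # no nonlinearity after the output layer
            h = act(h)
    return h

weights = init_ntk_mlp([10, 256, 256, 1])
print(ntk_mlp(torch.randn(10), weights))
```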
Hypernetworks Given the input tuple (x, z), we consider models of the form h(x, z) = g(z; f(x)), where f and g are two neural network architectures of depths H and L, respectively. The function f, referred to as the hypernetwork, takes the input x and computes the weights of a second neural network g, referred to as the primary network, which is assumed to output a scalar. As before, the variable θ stands for the vector of trainable parameters (the weights of g are not trained directly and are given by f(x)).
We parameterize the primary network as follows:

$$z^{0} = z, \qquad z^{l} = \sigma_g\!\left(\frac{1}{\sqrt{m_{l-1}}}\, W^{l}(x)\, z^{l-1}\right) \;\; (1 \le l < L), \qquad g(z; f(x)) = \frac{1}{\sqrt{m_{L-1}}}\, W^{L}(x)\, z^{L-1}. \qquad (3)$$

Here, the weights of the primary network are given in concatenated vector form by the output of the hypernetwork f(x). The output dimension of the hypernetwork is therefore the total number of weights of g. We denote by W^l(x) the l'th output matrix of f. The width of g is denoted by m. The function σ_g is an element-wise continuously differentiable function or a piece-wise linear function with a finite number of pieces.
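The following is a minimal sketch of the model h(x, z) = g(z; f(x)): the hypernetwork maps x to a flat vector that is reshaped into the weight matrices of the primary network. All layer sizes are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

d_x, d_z, m = 784, 2, 64                 # input dims of f and g, width of g
g_shapes = [(m, d_z), (m, m), (1, m)]    # shapes of the generated W^1(x), W^2(x), W^3(x)
n_out = sum(r * c for r, c in g_shapes)  # output dimension of the hypernetwork f

f = nn.Sequential(nn.Linear(d_x, 256), nn.ReLU(), nn.Linear(256, n_out))

def h(x, z):
    w = f(x)                             # all weights of g, as one long vector
    out, offset = z, 0
    for i, (r, c) in enumerate(g_shapes):
        W = w[offset:offset + r * c].reshape(r, c)
        offset += r * c
        out = (W @ out) / math.sqrt(c)   # NTK-style 1/sqrt(fan_in) scaling
        if i < len(g_shapes) - 1:
            out = torch.relu(out)
    return out                           # scalar output of the primary network

print(h(torch.randn(d_x), torch.randn(d_z)))
```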
Optimization Let S = {(x_i, z_i, y_i)}_{i=1}^{N} be some dataset and let ℓ be the squared loss. For a given hypernetwork h, we are interested in selecting the parameters θ that minimize the empirical risk:

$$\min_{\theta} \; L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(h(x_i, z_i; \theta),\, y_i\big). \qquad (4)$$

For simplicity, oftentimes we will simply write h(x, z) or g(z), omitting the weights, depending on the context. In order to minimize the empirical error L(θ), we consider the SGD method with learning rate μ and steps of the form θ_{t+1} = θ_t − μ ∇_θ ℓ(h(x_i, z_i; θ_t), y_i), for some index i that is selected uniformly at random at the t'th iteration. A continuous version of the GD method is the gradient flow method, in which dθ_t/dt = −∇_θ L(θ_t). In recent works [14, 20, 1, 35], the optimization dynamics of the gradient method for standard fully-connected neural networks were analyzed as the network width tends to infinity. In our work, since hypernetworks consist of two interacting neural networks, there are multiple ways in which the size can tend to infinity. We consider two cases: (i) the width of f tends to infinity while that of g is fixed, and (ii) the widths of both f and g tend to infinity.
3 Dynamics of Hypernetworks
Infinitely wide f without infinitely wide g induces non-convex optimization
In the NTK literature, it is common to adopt a functional view of the network evolution by analyzing the dynamics of the output of the network along with the cost, typically a convex function, as a function of that output. In the hypernetwork case, this presents us with two possible viewpoints of the same optimization problem. On one hand, since only the hypernetwork contains trainable parameters, we can view the optimization of h under the loss ℓ as training of f under the composed loss ℓ ∘ g. The classical NTK theory would imply that f evolves linearly when its width tends to infinity, but because ℓ ∘ g is in general no longer convex, even when ℓ originally is, an infinitely wide f without an infinitely wide g does not guarantee convergence to a global optimum. In what follows, we make this point precise by characterising how nonlinear the dynamics becomes in terms of the depth of g.
After a single stochastic gradient descent step with learning rate μ, the parameters move from θ_t to θ_{t+1} = θ_t − μ ∇_θ ℓ_t, and the hypernetwork output for an example (x, z) is given by h(x, z; θ_{t+1}). Computing the Taylor approximation of h(x, z; ·) around θ_t at the point θ_{t+1}, it holds that:

$$h(x, z; \theta_{t+1}) = h(x, z; \theta_t) + \sum_{n \ge 1} \frac{\mu^{n}}{n!}\, T_n\big[(-\nabla_{\theta}\ell_t)^{\otimes n}\big], \qquad (5)$$

where T_n = ∇^n_θ h(x, z; θ_t) is the tensor that holds the n'th derivative of the output h. The terms T_n[(−∇_θ ℓ_t)^{⊗n}] are the multivariate extensions of the scalar Taylor expansion terms, and take the general form of correlation functions, as introduced in Appendix C. This equation holds for neural networks with smooth activation functions (including hypernetworks), and holds in approximation for piece-wise linear activation functions.
Previous works have shown that, if f is a wide fully connected network, the first order term (n = 1) converges to the NTK, while the higher order terms (n ≥ 2) are suppressed by the width [20, 6]. Hence, for large widths and small learning rates, these higher order terms vanish, and the loss surface appears deterministic and linear at initialization, and remains so during training.
However, the situation is more complex for hypernetworks. As shown in the following theorem, for an infinitely wide hypernetwork and a finite primary network, the behaviour depends on the depth and width of the generated primary network. Specifically, when the primary network is deep and narrow, the higher order terms in Eq. 5 may not vanish, and the training dynamics can be highly non-convex.
Theorem 1 (Higher order terms for hypernetworks).
Let h(x, z) = g(z; f(x)) for a hypernetwork f and a primary network g of depth L. Then, we have:
(6)
Thm. 1 illustrates the effect of the depth L of the primary network g on the evolution of the output h. The larger L is, the more non-linear the evolution is, even when f is infinitely wide. Indeed, we observe empirically that when f is wide and kept fixed, a deeper g incurs slower training and lower overall test performance, as illustrated in Fig. 2.
As a special case of this theorem, we can also derive the asymptotic behaviour of the higher order terms for a standard neural network f. This provides a tighter bound than the previously conjectured upper bound of [6]. The following remark is a consequence of this result and is validated in the supplementary material.
Remark 1.
The n'th order term of the Taylor expansion in Eq. 5 is of a smaller order than previously postulated. Therefore, for a wider range of learning rates, all of the high order terms tend to zero as the width tends to infinity. This is in contrast to the previous bound, which guarantees that all of the high order terms tend to zero only when the learning rate is kept constant.
4 Dually Infinite Hypernetworks
It has been shown by [11, 19] that NNGPs and neural tangent kernels fully characterise the training dynamics of infinitely wide networks. As a result, in various publications [21, 9], these kernels are treated as functional priors of neural networks. In the previous section, we have shown that the Taylor expansion of the hypernetwork output is non-linear when the size of the primary network is finite. In this section, we consider the case where both the hyper and primary networks are infinitely wide, with the intention of gaining insight into the functional prior of wide hypernetworks. For this purpose, we draw a formal correspondence between infinitely wide hypernetworks and GPs, and use this connection to derive the corresponding neural tangent kernel.
4.1 The NNGP kernel
Previous work has shown the equivalence between popular architectures and Gaussian processes when the width of the architecture tends to infinity. This equivalence has sparked renewed interest in kernel methods through the corresponding NNGP kernel and the Neural Tangent Kernel (NTK) induced by the architecture, which fully characterise the training dynamics of infinitely wide networks. This equivalence has recently been unified to encompass most architectures that use a pre-defined set of generic computational blocks [40, 41]. Hypernetworks represent a different class of neural networks, in which the parameters contain randomly initialized matrices, except for the last layer, whose parameters are aggregated as a rank-3 tensor, and all of the matrix/tensor dimensions tend to infinity. This means the results of [40, 41] do not apply to hypernetworks. Nevertheless, by considering sequential limit taking, where we take the limit of the width of f ahead of the width of g, we show that the output of h achieves a GP behaviour, essentially feeding g with Gaussian distributed weights with adaptive variances. A formal argument is presented in the following theorem.
Figure 1: Convergence of the empirical hyperkernel. (a) Variance of the kernel values as a function of the widths of f and g. (b) Kernel values and their variance for a fixed hypernetwork width and two widths of the primary network.
Theorem 2 (Hypernetworks as GPs).
Let h(x, z) = g(z; f(x)) be a hypernetwork and let (x, z) and (x', z') be a pair of inputs. Then, it holds for any unit in layer l of the primary network:
(7)
as the widths of f and g tend to infinity sequentially. Here, the limits are independent, centered Gaussian processes, defined by the following recursion:
(8)

(9)
where the remaining covariance term is defined recursively:
(10)
In other words, the NNGP kernel, governing the behaviour of wide untrained hypernetworks, is given by the Hadamard product of the GP kernels of f and g (see Eq. 8).
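For intuition, the sketch below computes the standard ReLU NNGP recursion for a fully connected network and combines the kernels of f and g by a Hadamard (entrywise) product, following the statement above; the normalization constant and the input scaling are assumptions, not the paper's exact recursion.

```python
import numpy as np

def relu_expectation(k11, k12, k22):
    # E[relu(u) relu(v)] for a centered Gaussian (u, v) with the given covariance.
    norm = np.sqrt(k11 * k22)
    cos_t = np.clip(k12 / norm, -1.0, 1.0)
    theta = np.arccos(cos_t)
    return norm * (np.sin(theta) + (np.pi - theta) * cos_t) / (2.0 * np.pi)

def nngp_relu(u, v, depth, c=2.0):
    # Standard NNGP recursion for a depth-`depth` ReLU MLP in NTK parameterization;
    # c = 2 keeps the variance roughly constant across layers.
    k11, k12, k22 = u @ u / len(u), u @ v / len(u), v @ v / len(v)
    for _ in range(depth - 1):
        k11, k12, k22 = (c * relu_expectation(k11, k11, k11),
                         c * relu_expectation(k11, k12, k22),
                         c * relu_expectation(k22, k22, k22))
    return k12

rng = np.random.default_rng(0)
x, x_p = rng.normal(size=784), rng.normal(size=784)   # inputs to f
z, z_p = rng.normal(size=2), rng.normal(size=2)       # inputs to g
k_f = nngp_relu(x, x_p, depth=4)                      # NNGP kernel of f
k_g = nngp_relu(z, z_p, depth=3)                      # NNGP kernel of g
k_h = k_f * k_g                                       # Hadamard (entrywise) product
print(k_h)
```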
As a consequence of the above theorem, we observe that the NNGP kernel of h at each layer can be written as a function of the NNGP kernel of f and of the inputs z, z'.
Corollary 1.
Let h(x, z) = g(z; f(x)) be a hypernetwork. For any layer l, there exists a function F_l, such that, for all pairs of inputs (x, z) and (x', z'), it holds that:
(11)
The factorization of the NNGP kernel into a function of the kernel of f and of the inputs z, z' provides a convenient way to explicitly encode useful invariances into the kernel.
As an example, in the following remark, we investigate the behaviour of the NNGP kernel of h when the inputs z are preprocessed with random Fourier features, as suggested by [28, 34].
Remark 2.
Let γ(z) = cos(Bz + b) be a Fourier features preprocessing, with random frequencies B and random biases b. Let h(x, z) = g(γ(z); f(x)) be a hypernetwork, with z preprocessed according to γ. Let (x, z) and (x', z') be two pairs of inputs. Then, the NNGP kernel of h is a function of x, x' and of the difference z − z'.
The above remark shows that for any given inputs, the NNGP kernel depends on z and z' only through their difference z − z', which has been shown to be especially useful in implicit neural representation [34].
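The following sketch shows one common form of random Fourier features preprocessing in the spirit of [28, 34]; the feature dimension and frequency scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, n_features, scale = 2, 256, 10.0
B = rng.normal(0.0, scale, size=(n_features, d_z))    # random frequencies
b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)    # random phases

def fourier_features(z):
    # gamma(z) = sqrt(2 / n) * cos(B z + b); inner products of these features depend
    # (approximately) only on z - z', which is the source of the shift invariance.
    return np.sqrt(2.0 / n_features) * np.cos(B @ z + b)

z1, z2 = rng.normal(size=d_z), rng.normal(size=d_z)
print(fourier_features(z1) @ fourier_features(z2))
```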
We next derive the corresponding neural tangent kernel of hypernetworks, referred to as hyperkernels.
Figure 2: Performance of various architectures on the MNIST and CIFAR10 rotation prediction tasks; panels (a-c) vary the depth and width of the primary network g.
4.2 The Hyperkernel
Recall the definition of the NTK as the infinite width limit of the Jacobian inner product, given by:

$$K\big((x, z), (x', z')\big) = \left\langle \frac{\partial h(x, z; \theta)}{\partial \theta}, \frac{\partial h(x', z'; \theta)}{\partial \theta} \right\rangle, \qquad (12)$$

where θ are the trainable parameters of f and (x, z), (x', z') are pairs of inputs. In the following theorem we show that this kernel converges in probability at initialization to a limiting kernel, which we call the hyperkernel, in the sequentially infinite width limit of f and g. Furthermore, we show that the hyperkernel decomposes into the Hadamard product of the kernels corresponding to f and g. In addition, we show that the derivative of the hyperkernel with respect to time tends to zero at initialization.
Theorem 3 (Hyperkernel decomposition and convergence at initialization).
Let h(x, z) = g(z; f(x)) be a hypernetwork. Then,
(13)
where:
(14)
such that:
(15)
moreover, if θ evolves according to gradient flow, we have:
(16)
where the limits are taken with respect to the widths of f and g sequentially.
As a consequence of Thm. 3, when applying a Fourier features preprocessing to z, one obtains that the hyperkernel becomes shift invariant in z.
Remark 3.
Let γ be as in Remark 2. Let h(x, z) = g(γ(z); f(x)) be a hypernetwork, where z is preprocessed according to γ. Let (x, z) and (x', z') be two pairs of inputs. Then, the hyperkernel is a function of x, x' and the difference z − z'.
Note that the first factor in the decomposition is the standard limiting NTK of f and depends only on the inputs x, x'. However, from Eq. 8, the second factor requires the computation of the NNGP kernel of f in advance. This form provides an intuitive factorization of the hyperkernel into a term which depends on the meta inputs x, x', and a term which can be thought of as a conditional term.
5 Experiments
Our experiments are divided into two main parts. In the first part, we validate the ideas presented in our theoretical analysis and study the effect of the width and depth of on the optimization of a hypernetwork. In the second part, we evaluate the performance of the NNGP and NTK kernels on image representation tasks. For further implementation details on all experiments see Appendix A.
|  | Representation |  |  |  | Inpainting |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| N | 50 | 100 | 200 | 500 | 50 | 100 | 200 | 500 |
| HK | 0.055 | 0.050 | 0.043 | 0.032 | 0.057 | 0.051 | 0.047 | 0.038 |
| NNGP | 0.051 | 0.045 | 0.037 | 0.026 | 0.054 | 0.047 | 0.043 | 0.034 |
| HN | 0.12 | 0.08 | 0.052 | 0.041 | 0.16 | 0.098 | 0.066 | 0.49 |

Table 1: MSE of the hyperkernel (HK), the NNGP kernel, and a trained hypernetwork (HN) on the image representation and inpainting tasks, for varying training set sizes.
5.1 Convergence of the Hyperkernel
We verified the results of Thm. 3 by constructing a simple hypernetwork in which both f and g are four-layer fully connected networks with ReLU activations. For the input of g, we used a fixed 2D vector z. The input x of f was varied along a one-parameter family of points. We then computed the empirical hyperkernel as follows, while varying the widths of both f and g:

$$\hat{K}\big((x, z), (x', z)\big) = \left\langle \frac{\partial h(x, z; \theta)}{\partial \theta}, \frac{\partial h(x', z; \theta)}{\partial \theta} \right\rangle. \qquad (17)$$
Results are presented in Fig. 1. As can be seen, convergence to a fixed kernel is only observed in the dually wide regime, as stated in Thm. 3.
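For reference, a possible way to estimate the empirical hyperkernel of Eq. 17 with automatic differentiation is sketched below; the widths and layer sizes are illustrative and do not reproduce the exact experimental configuration.

```python
import math
import torch
import torch.nn as nn

width, d_x, d_z, m = 256, 2, 2, 64
g_shapes = [(m, d_z), (m, m), (m, m), (1, m)]          # four primary-network layers
n_out = sum(r * c for r, c in g_shapes)
f = nn.Sequential(nn.Linear(d_x, width), nn.ReLU(),
                  nn.Linear(width, width), nn.ReLU(),
                  nn.Linear(width, width), nn.ReLU(),
                  nn.Linear(width, n_out))

def h(x, z):
    w, out, off = f(x), z, 0
    for i, (r, c) in enumerate(g_shapes):
        W = w[off:off + r * c].reshape(r, c)
        off += r * c
        out = (W @ out) / math.sqrt(c)
        if i < len(g_shapes) - 1:
            out = torch.relu(out)
    return out.squeeze()

def grad_wrt_f(x, z):
    grads = torch.autograd.grad(h(x, z), list(f.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

z = torch.randn(d_z)
x0, x1 = torch.randn(d_x), torch.randn(d_x)
# Empirical hyperkernel value of Eq. 17 for the pair ((x0, z), (x1, z)).
print((grad_wrt_f(x0, z) @ grad_wrt_f(x1, z)).item())
```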
5.2 Training Dynamics
We consider a rotation prediction task. In this task, the hypernetwork f is provided with a randomly rotated image and the primary network g is provided with another version of the same image, rotated by a random angle. The setting is cast as a classification task, where the goal is to predict the closest value to the rotation angle within a fixed discrete set of angles. We experimented with the MNIST [18] and CIFAR10 [17] datasets. For each dataset we took a small number of training samples only.
We investigate the effect of the depth and width of g on the training dynamics of a hypernetwork, and compare the performance of hypernetworks of various architectures. The architectures of the hypernetwork and the primary network are as follows. The hypernetwork, f, is a fully-connected ReLU neural network of fixed depth and width. The inputs of f are flattened vectors of dimension c·h·w, where c specifies the number of channels and h, w the height/width of each image (c = 1, h = w = 28 for MNIST and c = 3, h = w = 32 for CIFAR10). The primary network g is a fully-connected ReLU neural network whose depth is varied. Since the MNIST rotations dataset is simpler, we varied the width of g over several values, and for the CIFAR10 variation we selected a single width for g. The network outputs the class scores and is trained using the cross-entropy loss.
We trained the hypernetworks for 100 epochs, using the SGD method with batch size 100. For completeness, we conducted a sensitivity study on the learning rate for both datasets, to show that the reported behaviour is consistent across learning rates; see the appendix. In Fig. 2(a-c) we compare the performance of the various architectures on the MNIST and CIFAR10 rotation prediction tasks. The performance is computed as an average and standard deviation (error bars) over multiple runs. As can be seen, we observe a clear improvement in test performance as the width of g increases, especially at initialization. When comparing the three plots, we observe that when f is wide and kept fixed, a deeper g incurs slower training and lower overall test performance. This is aligned with the conclusions of Thm. 1.
5.3 Image representation and Inpainting
We compare the performance of a hypernetwork and kernel regression with the hyperkernel on two visual tasks: functional image representation and inpainting. In the MNIST image representation task, the goal of the hypernetwork is to represent an input image via the primary network g, which receives image pixel coordinates and outputs pixel values. In the inpainting task, the goal is the same, except that only half of the image is observed by f.
Problem Setup We cast these problems as a meta-learning problem, where f receives an image, and the goal of the primary network is then to learn a conditional mapping from pixel coordinates to pixel values for all the pixels in the image, with the MSE as the metric. Our training dataset then consists of samples (x_i, z_i, y_i), such that x_i are images, z_i are random pixel locations (i.e., tuples of row and column coordinates), and y_i is a label specifying the pixel value at the specified location (normalized between 0 and 1). In both experiments, training was done on randomly sampled training data of varying size N.
Evaluation We evaluate the performance of both training a hypernetwork and using kernel regression with the hyperkernel. For kernel regression, we use the following formula to infer the pixel value of a test point (x*, z*):

$$\hat{y}(x^{*}, z^{*}) = k(x^{*}, z^{*})^{\top} K^{-1} y, \qquad (18)$$

where K is the hyperkernel matrix evaluated on all of the training data, y is the vector of labels in the training dataset, and k(x*, z*) is the vector of hyperkernel values between the test point and each training point.
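A minimal sketch of the kernel regression predictor in Eq. 18 is given below; the small ridge term is an added assumption for numerical stability and is not part of the stated formula.

```python
import numpy as np

def kernel_regression(K_train, k_test, y_train, ridge=1e-6):
    # K_train: (N, N) hyperkernel matrix on the training pairs (x_i, z_i)
    # k_test:  (N,)   kernel values between the test point and each training pair
    # y_train: (N,)   pixel values of the training points
    N = K_train.shape[0]
    alpha = np.linalg.solve(K_train + ridge * np.eye(N), y_train)
    return k_test @ alpha

# Toy usage with a random positive semi-definite kernel matrix:
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 8))
K = A @ A.T
print(kernel_regression(K, K[0], rng.uniform(size=50)))
```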
In Tab. 1, we compare the results of the hyperkernel and the NNGP kernel with the corresponding hypernetwork. The reported numbers are averages over multiple runs. As can be seen, in the case of a small dataset, the kernels outperform the hypernetwork, and the NNGP kernel outperforms the rest.
6 Conclusions
In this paper, we apply the well-established large width analysis to hypernetwork-type models. For the class of models analyzed, we have shown that a wide hypernetwork must be coupled with a wide primary network in order to achieve simplified, convex training dynamics as in standard architectures.
The deeper g is, the more complicated the evolution is. In the dually infinite case, when the widths of both the hyper and primary networks tend to infinity, the optimization of the hypernetwork becomes convex and is governed by the proposed hyperkernel.
The analysis presented in this paper is limited to a specific type of hypernetwork, typically found in the functional neural representation literature, and we leave the extension of this work to additional types of hyper models to future work.
Some of the tools developed in this study also apply to regular NTKs. Specifically, [6] provide a conjecture, one of whose consequences is an upper bound on the high order Taylor expansion terms of wide networks. In Thm. 1 we prove that this hypothesized upper bound is increasingly loose as the order increases, and derive the exact asymptotic behaviour of these terms as a function of the width.
Broader Impact
This work improves our understanding and design of hypernetworks and hopefully will help us improve the transparency of machine learning involving them. Beyond that, this work falls under the category of basic research and does not seem to have particular societal or ethical implications.
Acknowledgements and Funding Disclosure
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant ERC CoG 725974). The contribution of Tomer Galanti is part of Ph.D. thesis research conducted at Tel Aviv University.
References
- [1] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019.
- [2] Yu Bai and Jason D. Lee. Beyond linearization: On quadratic and higher-order approximation of wide neural networks. In International Conference on Learning Representations, 2020.
- [3] Yunru Bai, Ben Krause, Haiquan Wang, Caiming Xiong, and Richard Socher. Taylorized training: Towards better approximation of neural network training at finite width. ArXiv, 2020.
- [4] Andrew Brock, Theo Lim, J.M. Ritchie, and Nick Weston. SMASH: One-shot model architecture search through hypernetworks. In International Conference on Learning Representations, 2018.
- [5] Oscar Chang, Lampros Flokas, and Hod Lipson. Principled weight initialization for hypernetworks. In International Conference on Learning Representations, 2020.
- [6] Ethan Dyer and Guy Gur-Ari. Asymptotics of wide networks from feynman diagrams. In International Conference on Learning Representations, 2020.
- [7] Tomer Galanti and Lior Wolf. On the modularity of hypernetworks. In Advances in Neural Information Processing Systems 33. Curran Associates, Inc., 2020.
- [8] David Ha, Andrew M. Dai, and Quoc V. Le. Hypernetworks. In International Conference on Learning Representations, 2016.
- [9] Boris Hanin and Mihai Nica. Finite depth and width corrections to the neural tangent kernel. In International Conference on Learning Representations, 2020.
- [10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- [11] Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2018.
- [12] Siddhant M. Jayakumar, Jacob Menick, Wojciech M. Czarnecki, Jonathan Schwarz, Jack Rae, Simon Osindero, Yee Whye Teh, Tim Harley, and Razvan Pascanu. Multiplicative interactions and where to find them. In International Conference on Learning Representations, 2020.
- [13] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc V Gool. Dynamic filter networks. In Advances in Neural Information Processing Systems 29. Curran Associates, Inc., 2016.
- [14] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
- [15] Benjamin Klein, Lior Wolf, and Yehuda Afek. A dynamic convolutional layer for short range weather prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
- [16] Sylwester Klocek, Lukasz Maziarka, Maciej Wolczyk, Jacek Tabor, Jakub Nowak, and Marek Śmieja. Hypernetwork functional image representation. Lecture Notes in Computer Science, 2019.
- [17] Alex Krizhevsky. Convolutional deep belief networks on cifar-10, 2010.
- [18] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.
- [19] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as gaussian processes. In International Conference on Learning Representations, 2018.
- [20] Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019.
- [21] Etai Littwin, Tomer Galanti, and Lior Wolf. On random kernels of residual architectures. Arxiv, 2020.
- [22] Gidi Littwin and Lior Wolf. Deep meta functionals for shape representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
- [23] Jonathan Lorraine and David Duvenaud. Stochastic hyperparameter optimization through hypernetworks, 2018.
- [24] Henry B. Mann and Abraham Wald. On stochastic limit and order relationships. Annals of Mathematical Statistics, 14(3):217–226, 09 1943.
- [25] Eliya Nachmani and Lior Wolf. Hyper-graph-network decoders for block codes. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019.
- [26] Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-dickstein. Bayesian deep convolutional networks with many channels are gaussian processes. In International Conference on Learning Representations, 2019.
- [27] Daniel Park, Jascha Sohl-Dickstein, Quoc Le, and Samuel Smith. The effect of network width on stochastic gradient descent and generalization: an empirical study. In Proceedings of Machine Learning Research, volume 97, pages 5042–5051. PMLR, 2019.
- [28] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20. Curran Associates, Inc., 2008.
- [29] G. Riegler, S. Schulter, M. Rüther, and H. Bischof. Conditioned regression models for non-blind single image super-resolution. In IEEE International Conference on Computer Vision (ICCV), pages 522–530, 2015.
- [30] M. Rotman and L. Wolf. Electric analog circuit design with hypernetworks and a differential simulator. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
- [31] Tim G. J. Rudner. On the connection between neural processes and gaussian processes with deep kernels. In Workshop on Bayesian Deep Learning (NeurIPS), 2018.
- [32] Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. ArXiv, 2016.
- [33] Vincent Sitzmann, Julien N. P. Martel, Alexander W. Bergman, David B. Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. Arxiv, 2020.
- [34] Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. Arxiv, 2020.
- [35] Twan van Laarhoven. L2 regularization versus batch and weight normalization. Arxiv, 2017.
- [36] Johannes von Oswald, Christian Henning, João Sacramento, and Benjamin F. Grewe. Continual learning with hypernetworks. In International Conference on Learning Representations, 2020.
- [37] Colin Wei, Jason D. Lee, Qiang Liu, and Tengyu Ma. On the margin theory of feedforward neural networks. ArXiv, 2018.
- [38] Blake Woodworth, Suriya Gunasekar, Jason D. Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro. Kernel and rich regimes in overparametrized models. In Proceedings of Machine Learning Research, volume 125, pages 3635–3673. PMLR, 2020.
- [39] Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. Pay less attention with lightweight and dynamic convolutions. In International Conference on Learning Representations, 2019.
- [40] Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. ArXiv, 2019.
- [41] Greg Yang. Tensor programs I: Wide feedforward or recurrent neural networks of any architecture are gaussian processes. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019.
- [42] Greg Yang and Samuel Schoenholz. Mean field residual networks: On the edge of chaos. In Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 2017.
- [43] Dingli Yu, Ruosong Wang, Zhiyuan Li, Wei Hu, Ruslan Salakhutdinov, Sanjeev Arora, and Simon S. Du. Enhanced convolutional neural tangent kernels, 2020.
- [44] Chris Zhang, Mengye Ren, and Raquel Urtasun. Graph hypernetworks for neural architecture search. In International Conference on Learning Representations, 2019.
Appendix A Implementation Details
A.1 Convergence of the Hyperkernel
In Fig. 1(a) (main text) we plot the variance of the kernel values in log-scale, as a function of the widths of both f and g. The variance was computed empirically over normally distributed samples. As can be seen, the variance of the kernel tends to zero only when both widths increase. In Fig. 1(b) (main text) we plot the value of the empirical kernel and its variance for a fixed hypernetwork width and two different widths of g. The x-axis specifies the input parameter and the y-axis specifies the value of the kernel. As can be seen, the expected value of the empirical kernel is equal to the width-limit (i.e., theoretical) kernel for both widths. In addition, convergence to the width-limit kernel is obtained only when the widths of both networks increase, highlighting the importance of wide architectures for both the hyper and implicit networks for stable training.
A.2 Image Completion and Inpainting
Architectures In both tasks, we used fully connected architectures, where f contains two hidden layers and g contains one hidden layer. The hyperkernel used corresponds to the infinite width limit of the same architecture. For the input of g, we used random Fourier features [34] of the pixel coordinates as inputs for both the hyperkernel and the hypernetwork. To ease the computational burden of computing the full kernel matrix when evaluating the hyperkernel, we compute smaller kernel matrices on subsets of the data, where each subset contains 1k input images and 20 random image coordinates per input, producing a kernel matrix of size 20,000 × 20,000. The final output prediction is then given by:
(19)
where the labels are those of the corresponding subset. For the hypernetwork evaluation, we used the same inputs to train the hypernetwork, with a batch size and learning rate that were found to produce the best results.
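Under the assumption that the final prediction averages the kernel-regression outputs of the individual subsets (one reading of the description above), a minimal sketch is:

```python
import numpy as np

def subset_prediction(subset_kernels, subset_test_vectors, subset_labels, ridge=1e-6):
    # subset_kernels[j]:      kernel matrix K_j computed on the j'th data subset
    # subset_test_vectors[j]: kernel values between the test point and the j'th subset
    # subset_labels[j]:       labels of the j'th subset
    preds = []
    for K, k, y in zip(subset_kernels, subset_test_vectors, subset_labels):
        alpha = np.linalg.solve(K + ridge * np.eye(K.shape[0]), y)
        preds.append(k @ alpha)
    return float(np.mean(preds))
```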
Appendix B Additional Experiments
B.1 Sensitivity Study
To further demonstrate the behavior reported in Fig. 2 (main text), we verified that it is consistent regardless of the value of the learning rate. We used the same architectures as in the rotation prediction experiments, i.e., f is a fully-connected ReLU neural network of fixed depth and width, and the depth of g is varied. We varied the learning rate over several values. For each value of the learning rate, we report the average performance (and standard deviation over runs) of the various architectures after a fixed number of training epochs.
As can be seen in Fig. 4, when f is wide and kept fixed, there is a clear improvement in test performance as the width of g increases, for every learning rate at which the networks provide non-trivial performance. When f is wide and kept fixed, a deeper g incurs slower training and lower overall test performance. We note that it might seem that the width of g does not affect the performance for certain learning rates in all settings except Figs. 4(c,f). Indeed, we can verify from Fig. 2 (main text) that the performance at the final epoch is similar for different widths. However, for earlier epochs, the performance improves for shallower and wider architectures.
Figure 4: Sensitivity study over learning rates. (a) g of depth 3, (b) g of depth 6, (c) g of depth 8, (d) g of depth 3, (e) g of depth 6, (f) g of depth 8.
B.2 Training wide networks with a large learning rate
Remark 1 (main text) states that one is able to train wide networks with a large learning rate. To validate this remark, we trained shallow networks of varying width with a large learning rate on MNIST. As can be seen in Fig. 5, training these networks is possible despite the very large learning rate. In fact, we observe that the accuracy and loss improve as we increase the width of the network.
Figure 5: Training shallow networks of varying width with a large learning rate on MNIST; panels (a) and (b) show the results for varying widths.
Appendix C Correlation Functions
Correlation functions are products of general high order tensors representing high order derivatives of a network's output with respect to its weights. In [6], a conjecture is posed on the order of magnitude of general correlation functions involving high order derivative tensors, which arise when analysing the dynamics of gradient descent. Roughly speaking, given inputs and the outputs of a neural network with normally distributed parameters, correlation functions take the form:
(20) |
where
(21) |
For instance, the following are two examples of correlation functions,
(22) |
Computing the expected value of these correlation functions involves keeping track of various moments of normally distributed weights along paths, as done in recent finite width correction works [9, 21]. [6] employ Feynman diagrams to efficiently compute the expected values (orders of magnitude) of general correlation functions, albeit at the cost of only being provably accurate for deep linear or shallow ReLU networks. In this work, we analyze the asymptotic behaviour of correlation functions of the form:
(23) | ||||
where the first factor is a rank-n tensor, representing the n'th derivative of the output, and the bracketed argument denotes outer products of the gradients for different examples. As we have shown in Sec. 3, terms of the form in Eq. 23 represent high order terms in the multivariate Taylor expansion of the outputs, and are, therefore, relevant for the full understanding of training dynamics. As a consequence of Thm. 1, we derive the exact asymptotic behaviour of such terms for vanilla neural networks as a function of the width of the network.
This result is a partial solution to the open problem suggested by [6]. In their paper, they conjecture the asymptotic behaviour of general correlation functions, and predict an upper bound on the asymptotic behaviour of terms of the form in Eq. 23. Our result therefore proves a stronger version of the conjecture, while giving the exact behaviour as a function of the width.
Appendix D Proofs of the Main Results
Terminology and Notations Throughout the appendix, we denote by and the outer and Hadamard products of the tensors and (resp.). When considering the outer products of a sequence of tensors , we denote, . We denote by the sign function. The notation states that converges in distribution to some non-zero random variable . A convenient property of this notation is that it satisfies: when and . Throughout the paper, we will make use of sequential limits and denote to express that tend to infinity, then , and so on. For a given sequence of random variable , we denote by (), when converges in distribution (probability) to a random variable .
D.1 Useful Lemmas
Lemma 1.
Let . Then, .
Proof.
We have:
(24) |
Hence, converges in distribution to . ∎
D.2 Main Technical Lemma
In this section, we prove Lem. 3, which is the main technical lemma that enables us to prove Thm. 1. Let f be a neural network as defined above. We would like to estimate the order of magnitude of the following expression:
(25) |
where , and . For simplicity, when, , we denote: and when as well.
To estimate the order of magnitude of the expression in Eq. 25, we provide an explicit expression for . First, we note that for any , such that, is times continuously differentiable at , for any set , we have:
(26) |
where the set is an ordered version of , i.e., the two sets consist of the same elements but . In addition, we notice that for any multi-set , such that, for some , then,
(27) |
since is a neural network with a piece-wise linear activation function. Therefore, with no loss of generality, we consider , such that, . It holds that:
(28) |
where is a tensor, defined as follows:
(29) |
where:
(30) |
and:
(31) |
The individual gradients can be expressed using:
(32) |
Note that the following holds for any :
(33) |
In the following, given the sets , and , we derive the limit of using elementary tensor algebra. By Eqs. 32 and 28, we see that:
(34) | ||||
We recall the analysis of [41], showing that in the infinite width limit, every pre-activation of the network at a hidden layer has all its coordinates tending to i.i.d. centered Gaussian processes, with covariance defined recursively as follows:
(35) | ||||
In addition, we define the derivative covariance as follows:
(36) |
when considering and from the training set, we simply write and .
Lemma 2.
The following holds:
-
1.
For , we have: .
-
2.
For , we have: .
Here, is an indicator that returns if is true and otherwise.
Proof.
See [1].
Lemma 3.
Let and sets , and . We have:
(37) |
as . Here, are centered Gaussian variables with finite, non-zero variances, and .
Proof.
The case is trivial. Let . By Eq. 34, it holds that:
(38) | ||||
Note that intermediate activations do not depend on the index , and so we remove the dependency on in the relevant terms. Next, by applying Lem. 2,
(39) |
Expanding the second term using Eq. 33:
(40) | ||||
The above expression is fully implementable in a Tensor Program (see [41, 40]), and approaches a GP as width tend to infinity. In other words:
(41) |
and denoting , and , it holds using the multivariate Central Limit theorem:
(42) |
Using the Mann-Wald theorem [24] (where we take the mapping as the product pooling of ), we have that:
(43) |
Finally, by Slutsky’s theorem,
(44) |
Assigning completes the proof. ∎
D.3 Proof of Thm. 1
Since we assume that is a finite neural network, i.e., for all , throughout the proofs with no loss of generality we assume that .
Lemma 4.
Let be a hypernetwork. We have:
(45) |
Proof.
By the higher order product rule and the fact that the second derivative of a piece-wise linear function is everywhere:
(46) |
where
(47) |
In addition, by elementary tensor algebra, we have:
(48) | ||||
∎
Lemma 5.
Let be a hypernetwork. In addition, let,
(49) |
We have:
(50) |
Proof.
We have:
(51) |
By the product rule:
(52) |
Hence,
(53) |
In particular,
(54) |
∎
See 1
Proof.
Throughout the proof, in order to derive certain limits of various sequences of random variables, we implicitly make use of the Mann-Wald theorem [24]. For simplicity, oftentimes, we will avoid explicitly stating when this theorem is applied. As a general note, the repeated argument is as follows: the terms appearing below can be expressed as continuous mappings of jointly convergent random variables. Hence, they jointly converge, and continuous mappings over them converge as well.
(55) |
where . By the Mann-Wald theorem [24] converges to some random variable . If is a continuous function, then converges to . If is the ReLU activation function, by Lem. 1, converges to in distribution. We notice that converges in distribution to some random variable .
The proof is divided into two cases: and .
Case :
First, we note that for and (i.e., ), we have:
(56) |
In addition, as it is an empty product. Therefore, we can rewrite:
(57) |
By Lem. 3, for , the above tends to a constant as . For , converges in distribution to zero for all and converges to a non-constant random variable otherwise. Hence, by the Mann-Wald theorem [24],
(58) |
which is a non-zero random variable.
Case :
By Lem. 3, converges in distribution to zero for all . Therefore, in these cases, by Slutsky’s theorem, converges to zero in distribution. On the other hand, for each , and , by Lem. 3, we have:
(59) |
In particular,
(60) |
Consider the case where . In this case, for any , such that, there are indices , such that, . The following random variable converges in distribution:
(61) |
Therefore, by Slutsky’s theorem:
(62) |
We have:
(63) | ||||
which is a non-constant random variable.
Next, we consider the case when . By Lem. 3, for any , the term tends to zero as . In addition, converges in distribution. Therefore, for any , we have:
(64) |
Hence, for any , such that, there is at least one , we have:
(65) |
On the other hand, for any , the terms , and converge jointly in distribution to some random variables , and as . Hence,
(66) |
which is a non-constant random variable. ∎
D.4 Proofs of the Results in Sec. 4
See 2
Proof.
By [41], taking the width to infinity, the outputs are governed by a centered Gaussian process, such that, the entries , given some input , are independent and identically distributed. Moreover, it holds that:
(67) |
with as defined in Eq. 9. For the function , it holds for the first layer:
(68) |
After taking the limit to infinity, the implicit network is fed with Gaussian distributed weights. And so also converges to a Gaussian process, such that:
(69) |
where:
(70) |
In a similar fashion to the standard feed forward case, the pre-activations converge to Gaussian processes as we let tend to infinity, with a covariance defined recursively:
(71) |
where,
(72) |
and
(73) |
proving the claim. ∎
See 1
Proof.
We prove that is a function of and by induction. First, we note that is a function of and by definition. By the recursive definition of , it is a function of . Therefore, can be simply represented as a function of and . We assume by induction that is a function of and . We would like to show that is a function of and . By definition, is a function of and . In addition, is a function of . Hence, by induction, is simply a function of and . Since is a function of , we conclude that one can represent as a function of and . ∎
See 2
Proof.
We make use of the following lemma in the proof of Thm. 3.
Lemma 6.
Recall the parametrization of the implicit network:
(76) |
For any pair , we denote:
(77) |
It holds that:
-
1.
.
-
2.
.
where the limits are taken with respect to sequentially.
Proof.
We have:
(78) | ||||
Note that it holds that when sequentially, we have:
(79) | ||||
Applying the above recursively proves the first claim. Using the first claim, along with the derivation of the neural tangent kernel (see [1]) proves the second claim. ∎
See 3
Proof.
Recalling that , concatenated into a single vector of length . The components of the inner matrix are given by:
(80) |
and it holds that in the infinite width limit, is a diagonal matrix:
(81) |
By letting the widths and tend to infinity consecutively, by Lem. 6, it follows that:
(82) |
Since converges to the diagonal matrix , the limit of is given by:
(83) | ||||
where we used the results of Lem. 6.
Next, we would like to prove that . For this purpose, we write the derivative explicitly:
(84) |
We notice that the two terms are the same up to changing between the inputs and . Therefore, with no loss of generality, we can simply prove the convergence of the second term. We have:
(85) | ||||
We analyze each term separately.
Analyzing the first term
By substituting , we have:
(86) | ||||
It then follows:
(87) | ||||
We notice that:
(88) | ||||
We recall that converges to a GP (as a function of ) as [19]. Therefore, are special cases of the terms (see Eq. 25) with weights that are distributed according to a GP instead of a normal distribution. In this case, we have: , , the neural network is replaced with , the weights are translated into . We recall that the proof of Lem. 3 showing that is simply based on Lem. 2. Since Lem. 6 extends Lem. 2 to our case, the proof of Lem. 3 can be applied to show that .
Analyzing the second term
We would like to show that for any , we have:
(89) |
as . Since , we have:
(90) | ||||
In addition, we have:
(91) |
We note that converges in distribution as . Therefore, we can simply analyze the convergence of:
(92) |
Since is a constant, it is enough to show that each term converges to zero. We have:
(93) | ||||
where is the ’th output of over . In addition, the summation is done over the indices of the corresponding tensors. We note that for any , the number of indices is finite. We would like to show that each term in the sum tends to zero as . We can write:
(94) |
By Lem. 3, the term in Eq. 94 tends to zero as . In addition, it is easy to see that , and converge to some random variables. Therefore, for any fixed , the above sum converges to zero as . ∎