On Infinite-Width Hypernetworks
Abstract
Hypernetworks are architectures that produce the weights of a task-specific primary network. A notable application of hypernetworks in the recent literature involves learning to output functional representations. In these scenarios, the hypernetwork learns a representation corresponding to the weights of a shallow MLP, which typically encodes shape or image information. While such representations have seen considerable success in practice, they lack the theoretical guarantees enjoyed by standard architectures in the wide regime. In this work, we study wide over-parameterized hypernetworks. We show that, unlike typical architectures, infinitely wide hypernetworks do not guarantee convergence to a global minimum under gradient descent. We further show that convexity can be achieved by increasing the dimensionality of the hypernetwork's output, so that it represents wide MLPs. In the dually infinite-width regime, we identify the functional priors of these architectures by deriving their corresponding GP and NTK kernels, the latter of which we refer to as the hyperkernel. As part of this study, we make a mathematical contribution by deriving tight bounds on high order Taylor expansion terms of standard fully connected ReLU networks.
1 Introduction
![Illustration of the hypernetwork framework](https://cdn.awesomepapers.org/papers/03d1bfcf-8e9f-45fc-80a0-e138fccf79f2/hypernetwork.png)
In this work, we analyze the training dynamics of over-parameterized meta networks, which are networks that output the weights of other networks, often referred to as hypernetworks. In the typical framework, a function h(x, z) involves two networks, f and g. The hypernetwork f takes the input x (typically an image) and returns the weights of the primary network g, which then takes the input z and returns the output of h.
The literature on hypernetworks is roughly divided into two main categories. In the functional representation literature [22, 33, 39, 30, 16], the input to the hypernetwork is typically an image. For shape reconstruction tasks, the primary network g represents the shape via a signed distance field, where the inputs z are coordinates in 3D space. In image completion tasks, the inputs to g are image coordinates, and the output is the corresponding pixel intensity. In these settings, f is typically a large network and g is typically a shallow fully connected network.
In the second category [4, 23, 36, 44], hypernetworks are typically used for hyper-parameter search, where the input x is treated as a hyperparameter descriptor and is optimized alongside the network's weights. In this paper, we consider models corresponding to the first group of methods.
Following a prominent thread in the recent literature, our study takes place in the regime of wide networks. [11] recently showed that, when the width of the network approaches infinity, the gradient-descent training dynamics of a fully connected network can be characterized by a kernel, called the Neural Tangent Kernel (or NTK for short). In other words, as the width of each layer approaches infinity, provided proper scaling and initialization of the weights, it holds that:

$$\left\langle \frac{\partial f(x;\theta)}{\partial \theta}, \frac{\partial f(x';\theta)}{\partial \theta} \right\rangle \longrightarrow \Theta(x, x') \qquad (1)$$

as the width n of f tends to infinity. Here, θ are the weights of the network f. As shown in [11], as the width tends to infinity, when minimizing the squared loss using gradient descent, the evolution through time of the function computed by the network follows the dynamics of kernel gradient descent with the kernel Θ. To prove this phenomenon, various papers [20, 3, 2] introduce a Taylor expansion of the network output around the point of initialization and consider its terms. It is shown that the first-order term is deterministic during the SGD optimization and the higher-order terms converge to zero as the width tends to infinity.
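To make Eq. 1 concrete, the following is a minimal sketch (not taken from the paper) of how the empirical NTK of a small fully connected network can be estimated with automatic differentiation; the architecture, width, and input dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(10, 512), nn.ReLU(), nn.Linear(512, 1))

def param_gradient(x):
    # Flattened gradient of the scalar output with respect to all parameters.
    out = net(x).squeeze()
    grads = torch.autograd.grad(out, list(net.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

x1, x2 = torch.randn(1, 10), torch.randn(1, 10)
# Empirical (finite-width) kernel value; Eq. 1 states that this inner product
# converges to a deterministic kernel Theta(x1, x2) as the width grows.
print(param_gradient(x1) @ param_gradient(x2))
```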
A natural question that arises when considering hypernetworks is whether a similar “wide” regime exists, where trained and untrained networks may be functionally approximated by kernels. If so, since this architecture involves two networks, the “wide” regime needs a more refined definition, taking into account both networks.
Our contributions:
1. We show that infinitely wide hypernetworks can induce highly non-convex training dynamics under gradient descent. The complexity of the optimization problem depends strongly on the architecture of the primary network g, which may considerably impair the trainability of the architecture if not defined appropriately.
2. However, when the widths of both the hypernetwork and the primary network tend to infinity, the optimization dynamics of the hypernetwork simplifies, and its neural tangent kernel (which we call the hyperkernel) has a well defined infinite-width limit governing the network evolution.
3. We verify our theory empirically and demonstrate the utility of this hyperkernel on several functional representation tasks. Consistent with prior observations on kernel methods, the hypernetwork-induced kernels also outperform a trained hypernetwork when the training data is small.
4. We make a technical contribution by deriving asymptotically tight bounds on high order Taylor expansion terms in ReLU MLPs. Our result partially settles a conjecture posed in [6] regarding the asymptotic behavior of general correlation functions.
1.1 Related Works
Hypernetworks
Hypernetworks, first introduced under this name in [8], are networks that generate the weights of a second, primary network that computes the actual task. However, the idea of having one network predict the weights of another was proposed earlier and has reemerged multiple times [15, 29, 13]. The tool can naturally be applied to image representation tasks. In [22], hypernetworks were applied to 3D shape reconstruction from a single image. In [33], hypernetworks were shown to be useful for learning shared image representations. Hypernetworks have also been shown to be effective in non-image domains. For instance, hypernetworks achieve state-of-the-art results on the task of decoding error correcting codes [25].
Several publications consider a different framework, in which the inputs of the hypernetwork are optimized alongside its weights. In this setting, hypernetworks were recently used for continual learning by [36]. Hypernetworks can be used efficiently for neural architecture search, as was demonstrated by [4, 44], where a feedforward regression (with the hypernetwork) replaces direct gradient-based learning of the weights of the primary network while its architecture is being explored. Lorraine et al. applied hypernetworks to hyperparameter selection [23].
Despite their success and increasing prominence, little theoretical work has been done to better understand hypernetworks and their behavior. A recent paper [12] studies the role of multiplicative interactions within a unifying framework that describes a range of classical and modern neural network architectural motifs, such as gating, attention layers, hypernetworks, and dynamic convolutions, amongst others. It is shown that standard neural networks are a strict subset of neural networks with multiplicative interactions. In [7], the authors theoretically study the modular properties of hypernetworks. In particular, they show that, compared to standard embedding methods, hypernetworks are exponentially more expressive when the primary network is of small complexity. In this work, we provide a complementary perspective and show that a shallow primary network is a requirement for successful training. [5] showed that applying standard initializations to a hypernetwork produces a sub-optimal initialization of the primary network; a principled technique for weight initialization in hypernetworks is then developed.
Gaussian Processes and Neural Tangent Kernel
The connection between infinitely wide neural networks, Gaussian processes, and kernel methods has been the focus of many recent papers [11, 19, 40, 42, 32, 31, 38, 37, 26]. Empirical studies have demonstrated the power of the CNTK (convolutional neural tangent kernel) on popular datasets, establishing new state-of-the-art results for kernel methods [1, 43]. [21] showed that for ReLU ResNets [10], NTK convergence can occur even when depth and width simultaneously tend to infinity, provided proper initialization. In this work, we extend the kernel analysis of networks to hypernetworks, and characterize the regime in which the kernels converge and the training dynamics simplify.
2 Setup
In this section, we introduce the setting of the analysis considered in this paper. We begin by defining fully connected neural networks and hypernetworks in the context of the NTK framework.
Neural networks In the NTK framework, a fully connected neural network, f, is defined in the following manner:

$$x^{0} = x, \qquad x^{l} = \sigma\!\left(\frac{1}{\sqrt{n_{l-1}}}\, W^{l} x^{l-1}\right) \;\; (1 \le l < L), \qquad f(x;\theta) = \frac{1}{\sqrt{n_{L-1}}}\, W^{L} x^{L-1}, \qquad (2)$$

where σ is the activation function of f. Throughout the paper, we specifically take σ to be a piece-wise linear function with a finite number of pieces (e.g., the ReLU activation and the Leaky ReLU activation). The weight matrices W^l are trainable variables, initialized independently according to a standard normal distribution, W^l_{ij} ∼ N(0, 1). The width of f is denoted by n. The parameters are aggregated as a long vector θ. The coefficients 1/√(n_{l-1}) serve to normalize the activations of each layer. This parametrization is nonstandard, and we will refer to it as the NTK parameterization. It has already been employed in several recent works [14, 35, 27]. For simplicity, in many cases, we will omit specifying the weights associated with our model.
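As an illustration, the following sketch implements the NTK parameterization above; the widths, depth, and exact placement of the 1/√(fan-in) scaling follow the standard NTK convention and should be read as assumptions rather than the paper's exact code.

```python
import math
import torch

def init_ntk_mlp(dims):
    # dims = [d_in, n_1, ..., n_L]; every entry of every weight matrix is N(0, 1).
    return [torch.randn(dims[i + 1], dims[i]) for i in range(len(dims) - 1)]

def ntk_mlp(x, weights, act=torch.relu):
    h = x
    for l, W in enumerate(weights):
        # The 1/sqrt(fan_in) factor normalizes the activations of each layer.
        h = (W @ h) / math.sqrt(W.shape[1])
        if l < len(weights) - 1:  # no nonlinearity after the output layer
            h = act(h)
    return h

weights = init_ntk_mlp([10, 256, 256, 1])
print(ntk_mlp(torch.randn(10), weights))
```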
Hypernetworks Given the input tuple (x, z), we consider models of the form h(x, z) = g(z; f(x)), where f and g are two neural network architectures of depths H and L, respectively. The function f, referred to as the hypernetwork, takes the input x and computes the weights of a second neural network g, referred to as the primary network, which is assumed to output a scalar. As before, the variable θ stands for the vector of trainable parameters (the weights of g are not trained directly and are given by f(x)).
We parameterize the primary network as follows:

$$z^{0} = z, \qquad z^{l} = \sigma_g\!\left(\frac{1}{\sqrt{m_{l-1}}}\, W^{l}(x)\, z^{l-1}\right) \;\; (1 \le l < L), \qquad g(z; f(x)) = \frac{1}{\sqrt{m_{L-1}}}\, W^{L}(x)\, z^{L-1}. \qquad (3)$$

Here, the weights of the primary network are given in concatenated vector form by the output of the hypernetwork f(x). The output dimension of the hypernetwork is therefore the total number of weights of g. We denote by W^l(x) the l'th output matrix of f. The width of g is denoted by m. The function σ_g is an element-wise continuously differentiable function or a piece-wise linear function with a finite number of pieces.
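The following is a minimal sketch of the model h(x, z) = g(z; f(x)): the hypernetwork maps x to a flat vector that is reshaped into the weight matrices of the primary network. All layer sizes are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

d_x, d_z, m = 784, 2, 64                 # input dims of f and g, width of g
g_shapes = [(m, d_z), (m, m), (1, m)]    # shapes of the generated W^1(x), W^2(x), W^3(x)
n_out = sum(r * c for r, c in g_shapes)  # output dimension of the hypernetwork f

f = nn.Sequential(nn.Linear(d_x, 256), nn.ReLU(), nn.Linear(256, n_out))

def h(x, z):
    w = f(x)                             # all weights of g, as one long vector
    out, offset = z, 0
    for i, (r, c) in enumerate(g_shapes):
        W = w[offset:offset + r * c].reshape(r, c)
        offset += r * c
        out = (W @ out) / math.sqrt(c)   # NTK-style 1/sqrt(fan_in) scaling
        if i < len(g_shapes) - 1:
            out = torch.relu(out)
    return out                           # scalar output of the primary network

print(h(torch.randn(d_x), torch.randn(d_z)))
```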
Optimization Let S = {(x_i, z_i, y_i)}_{i=1}^{N} be some dataset and let ℓ be the squared loss. For a given hypernetwork h, we are interested in selecting the parameters θ that minimize the empirical risk:

$$\min_{\theta} \; L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(h(x_i, z_i; \theta),\, y_i\big). \qquad (4)$$

For simplicity, oftentimes we will simply write h(x, z) or g(z), omitting the weights, depending on the context. In order to minimize the empirical error L(θ), we consider the SGD method with learning rate μ and steps of the form θ_{t+1} = θ_t − μ ∇_θ ℓ(h(x_i, z_i; θ_t), y_i), for some index i that is selected uniformly at random at the t'th iteration. A continuous version of the GD method is the gradient flow method, in which dθ_t/dt = −∇_θ L(θ_t). In recent works [14, 20, 1, 35], the optimization dynamics of the gradient method for standard fully-connected neural networks were analyzed as the network width tends to infinity. In our work, since hypernetworks consist of two interacting neural networks, there are multiple ways in which the size can tend to infinity. We consider two cases: (i) the width of f tends to infinity while that of g is fixed, and (ii) the widths of both f and g tend to infinity.
3 Dynamics of Hypernetworks
Infinitely wide f without infinitely wide g induces non-convex optimization
In the NTK literature, it is common to adopt a functional view of the network evolution by analyzing the dynamics of the output of the network along with the cost, typically a convex function, as a function of that output. In the hypernetwork case, this presents us with two possible viewpoints of the same optimization problem. On one hand, since only the hypernetwork contains trainable parameters, we can view the optimization of h under the loss ℓ as training of f under the composed loss ℓ ∘ g. The classical NTK theory would imply that f evolves linearly when its width tends to infinity, but because ℓ ∘ g is in general no longer convex, even when ℓ originally is, an infinitely wide f without an infinitely wide g does not guarantee convergence to a global optimum. In what follows, we make this point precise by characterising how nonlinear the dynamics becomes in terms of the depth of g.
After a single stochastic gradient descent step with learning rate μ, the parameters move from θ_t to θ_{t+1} = θ_t − μ ∇_θ ℓ_t, and the hypernetwork output for an example (x, z) is given by h(x, z; θ_{t+1}). Computing the Taylor approximation of h(x, z; ·) around θ_t at the point θ_{t+1}, it holds that:

$$h(x, z; \theta_{t+1}) = h(x, z; \theta_t) + \sum_{n \ge 1} \frac{\mu^{n}}{n!}\, T_n\big[(-\nabla_{\theta}\ell_t)^{\otimes n}\big], \qquad (5)$$

where T_n = ∇^n_θ h(x, z; θ_t) is the tensor that holds the n'th derivative of the output h. The terms T_n[(−∇_θ ℓ_t)^{⊗n}] are the multivariate extensions of the scalar Taylor expansion terms, and take the general form of correlation functions, as introduced in Appendix C. This equation holds for neural networks with smooth activation functions (including hypernetworks), and holds in approximation for piece-wise linear activation functions.
Previous works have shown that, if f is a wide fully connected network, the first order term (n = 1) converges to the NTK, while the higher order terms (n ≥ 2) are suppressed by the width [20, 6]. Hence, for large widths and small learning rates, these higher order terms vanish, and the loss surface appears deterministic and linear at initialization, and remains so during training.
However, the situation is more complex for hypernetworks. As shown in the following theorem, for an infinitely wide hypernetwork and a finite primary network, the behaviour depends on the depth and width of the generated primary network. Specifically, when the primary network is deep and narrow, the higher order terms in Eq. 5 may not vanish, and the training dynamics can be highly non-convex.
Theorem 1 (Higher order terms for hypernetworks).
Let h(x, z) = g(z; f(x)) for a hypernetwork f and a primary network g of depth L. Then, we have:
(6)
Thm. 1 illustrates the effect of the depth L of the primary network g on the evolution of the output h. The larger L is, the more non-linear the evolution is, even when f is infinitely wide. Indeed, we observe empirically that when f is wide and kept fixed, a deeper g incurs slower training and lower overall test performance, as illustrated in Fig. 2.
As a special case of this theorem, we can also derive the asymptotic behaviour of the higher order terms for a standard neural network f. This provides a tighter bound than the previously conjectured upper bound of [6]. The following remark is a consequence of this result and is validated in the supplementary material.
Remark 1.
The n'th order term of the Taylor expansion in Eq. 5 is of a smaller order than previously postulated. Therefore, for a wider range of learning rates, all of the high order terms tend to zero as the width tends to infinity. This is in contrast to the previous bound, which guarantees that all of the high order terms tend to zero only when the learning rate is kept constant.
4 Dually Infinite Hypernetworks
It has been shown by [11, 19] that NNGPs and neural tangent kernels fully characterise the training dynamics of infinitely wide networks. As a result, in various publications [21, 9], these kernels are treated as functional priors of neural networks. In the previous section, we have shown that the Taylor expansion of the hypernetwork output is non-linear when the size of the primary network is finite. In this section, we consider the case where both the hyper and primary networks are infinitely wide, with the intention of gaining insight into the functional prior of wide hypernetworks. For this purpose, we draw a formal correspondence between infinitely wide hypernetworks and GPs, and use this connection to derive the corresponding neural tangent kernel.
4.1 The NNGP kernel
Previous work has shown the equivalence between popular architectures and Gaussian processes when the width of the architecture tends to infinity. This equivalence has sparked renewed interest in kernel methods through the corresponding NNGP kernel and the Neural Tangent Kernel (NTK) induced by the architecture, which fully characterise the training dynamics of infinitely wide networks. This equivalence has recently been unified to encompass most architectures that use a pre-defined set of generic computational blocks [40, 41]. Hypernetworks represent a different class of neural networks, in which the parameters contain randomly initialized matrices, except for the last layer, whose parameters are aggregated as a rank-3 tensor, and all of the matrix/tensor dimensions tend to infinity. This means the results of [40, 41] do not apply to hypernetworks. Nevertheless, by considering sequential limit taking, where we take the limit of the width of f ahead of the width of g, we show that the output of h achieves a GP behaviour, essentially feeding g with Gaussian distributed weights with adaptive variances. A formal argument is presented in the following theorem.
Figure 1: Convergence of the empirical hyperkernel. (a) Variance of the kernel values as a function of the widths of f and g. (b) Kernel values and their variance for a fixed hypernetwork width and two widths of the primary network.
Theorem 2 (Hypernetworks as GPs).
Let h(x, z) = g(z; f(x)) be a hypernetwork and let (x, z) and (x', z') be a pair of inputs. Then, it holds for any unit in layer l of the primary network:
(7)
as the widths of f and g tend to infinity sequentially. Here, the limits are independent, centered Gaussian processes, defined by the following recursion:
(8)

(9)
where the remaining covariance term is defined recursively:
(10)
In other words, the NNGP kernel, governing the behaviour of wide untrained hypernetworks, is given by the Hadamard product of the GP kernels of f and g (see Eq. 8).
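For intuition, the sketch below computes the standard ReLU NNGP recursion for a fully connected network and combines the kernels of f and g by a Hadamard (entrywise) product, following the statement above; the normalization constant and the input scaling are assumptions, not the paper's exact recursion.

```python
import numpy as np

def relu_expectation(k11, k12, k22):
    # E[relu(u) relu(v)] for a centered Gaussian (u, v) with the given covariance.
    norm = np.sqrt(k11 * k22)
    cos_t = np.clip(k12 / norm, -1.0, 1.0)
    theta = np.arccos(cos_t)
    return norm * (np.sin(theta) + (np.pi - theta) * cos_t) / (2.0 * np.pi)

def nngp_relu(u, v, depth, c=2.0):
    # Standard NNGP recursion for a depth-`depth` ReLU MLP in NTK parameterization;
    # c = 2 keeps the variance roughly constant across layers.
    k11, k12, k22 = u @ u / len(u), u @ v / len(u), v @ v / len(v)
    for _ in range(depth - 1):
        k11, k12, k22 = (c * relu_expectation(k11, k11, k11),
                         c * relu_expectation(k11, k12, k22),
                         c * relu_expectation(k22, k22, k22))
    return k12

rng = np.random.default_rng(0)
x, x_p = rng.normal(size=784), rng.normal(size=784)   # inputs to f
z, z_p = rng.normal(size=2), rng.normal(size=2)       # inputs to g
k_f = nngp_relu(x, x_p, depth=4)                      # NNGP kernel of f
k_g = nngp_relu(z, z_p, depth=3)                      # NNGP kernel of g
k_h = k_f * k_g                                       # Hadamard (entrywise) product
print(k_h)
```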
As a consequence of the above theorem, we observe that the NNGP kernel of h at each layer can be written as a function of the NNGP kernel of f and of the inputs z, z'.
Corollary 1.
Let h(x, z) = g(z; f(x)) be a hypernetwork. For any layer l, there exists a function F_l, such that, for all pairs of inputs (x, z) and (x', z'), it holds that:
(11)
The factorization of the NNGP kernel into a function of the kernel of f and of the inputs z, z' provides a convenient way to explicitly encode useful invariances into the kernel.
As an example, in the following remark, we investigate the behaviour of the NNGP kernel of h when the inputs z are preprocessed with random Fourier features, as suggested by [28, 34].
Remark 2.
Let γ(z) = cos(Bz + b) be a Fourier features preprocessing, with random frequencies B and random biases b. Let h(x, z) = g(γ(z); f(x)) be a hypernetwork, with z preprocessed according to γ. Let (x, z) and (x', z') be two pairs of inputs. Then, the NNGP kernel of h is a function of x, x' and of the difference z − z'.
The above remark shows that for any given inputs, the NNGP kernel depends on z and z' only through their difference z − z', which has been shown to be especially useful in implicit neural representation [34].
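The following sketch shows one common form of random Fourier features preprocessing in the spirit of [28, 34]; the feature dimension and frequency scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, n_features, scale = 2, 256, 10.0
B = rng.normal(0.0, scale, size=(n_features, d_z))    # random frequencies
b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)    # random phases

def fourier_features(z):
    # gamma(z) = sqrt(2 / n) * cos(B z + b); inner products of these features depend
    # (approximately) only on z - z', which is the source of the shift invariance.
    return np.sqrt(2.0 / n_features) * np.cos(B @ z + b)

z1, z2 = rng.normal(size=d_z), rng.normal(size=d_z)
print(fourier_features(z1) @ fourier_features(z2))
```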
We next derive the corresponding neural tangent kernel of hypernetworks, referred to as hyperkernels.
Figure 2: Performance of various architectures on the MNIST and CIFAR10 rotation prediction tasks; panels (a-c) vary the depth and width of the primary network g.
4.2 The Hyperkernel
Recall the definition of the NTK as the infinite width limit of the Jacobian inner product, given by:

$$K\big((x, z), (x', z')\big) = \left\langle \frac{\partial h(x, z; \theta)}{\partial \theta}, \frac{\partial h(x', z'; \theta)}{\partial \theta} \right\rangle, \qquad (12)$$

where θ are the trainable parameters of f and (x, z), (x', z') are pairs of inputs. In the following theorem we show that this kernel converges in probability at initialization to a limiting kernel, which we call the hyperkernel, in the sequentially infinite width limit of f and g. Furthermore, we show that the hyperkernel decomposes into the Hadamard product of the kernels corresponding to f and g. In addition, we show that the derivative of the hyperkernel with respect to time tends to zero at initialization.
Theorem 3 (Hyperkernel decomposition and convergence at initialization).
Let h(x, z) = g(z; f(x)) be a hypernetwork. Then,
(13)
where:
(14)
such that:
(15)
moreover, if θ evolves according to gradient flow, we have:
(16)
where the limits are taken with respect to the widths of f and g sequentially.
As a consequence of Thm. 3, when applying a Fourier features preprocessing to z, one obtains that the hyperkernel becomes shift invariant in z.
Remark 3.
Let γ be as in Remark 2. Let h(x, z) = g(γ(z); f(x)) be a hypernetwork, where z is preprocessed according to γ. Let (x, z) and (x', z') be two pairs of inputs. Then, the hyperkernel is a function of x, x' and the difference z − z'.
Note that the first factor in the decomposition is the standard limiting NTK of f and depends only on the inputs x, x'. However, from Eq. 8, the second factor requires the computation of the NNGP kernel of f in advance. This form provides an intuitive factorization of the hyperkernel into a term which depends on the meta inputs x, x', and a term which can be thought of as a conditional term.
5 Experiments
Our experiments are divided into two main parts. In the first part, we validate the ideas presented in our theoretical analysis and study the effect of the width and depth of on the optimization of a hypernetwork. In the second part, we evaluate the performance of the NNGP and NTK kernels on image representation tasks. For further implementation details on all experiments see Appendix A.
|  | Representation |  |  |  | Inpainting |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| N | 50 | 100 | 200 | 500 | 50 | 100 | 200 | 500 |
| HK | 0.055 | 0.050 | 0.043 | 0.032 | 0.057 | 0.051 | 0.047 | 0.038 |
| NNGP | 0.051 | 0.045 | 0.037 | 0.026 | 0.054 | 0.047 | 0.043 | 0.034 |
| HN | 0.12 | 0.08 | 0.052 | 0.041 | 0.16 | 0.098 | 0.066 | 0.49 |

Table 1: MSE of the hyperkernel (HK), the NNGP kernel, and a trained hypernetwork (HN) on the image representation and inpainting tasks, for varying training set sizes.
5.1 Convergence of the Hyperkernel
We verified the results of Thm. 3 by constructing a simple hypernetwork in which both f and g are four-layer fully connected networks with ReLU activations. For the input of g, we used a fixed 2D vector z. The input x of f was varied along a one-parameter family of points. We then computed the empirical hyperkernel as follows, while varying the widths of both f and g:

$$\hat{K}\big((x, z), (x', z)\big) = \left\langle \frac{\partial h(x, z; \theta)}{\partial \theta}, \frac{\partial h(x', z; \theta)}{\partial \theta} \right\rangle. \qquad (17)$$
Results are presented in Fig. 1. As can be seen, convergence to a fixed kernel is only observed in the dually wide regime, as stated in Thm. 3.
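For reference, a possible way to estimate the empirical hyperkernel of Eq. 17 with automatic differentiation is sketched below; the widths and layer sizes are illustrative and do not reproduce the exact experimental configuration.

```python
import math
import torch
import torch.nn as nn

width, d_x, d_z, m = 256, 2, 2, 64
g_shapes = [(m, d_z), (m, m), (m, m), (1, m)]          # four primary-network layers
n_out = sum(r * c for r, c in g_shapes)
f = nn.Sequential(nn.Linear(d_x, width), nn.ReLU(),
                  nn.Linear(width, width), nn.ReLU(),
                  nn.Linear(width, width), nn.ReLU(),
                  nn.Linear(width, n_out))

def h(x, z):
    w, out, off = f(x), z, 0
    for i, (r, c) in enumerate(g_shapes):
        W = w[off:off + r * c].reshape(r, c)
        off += r * c
        out = (W @ out) / math.sqrt(c)
        if i < len(g_shapes) - 1:
            out = torch.relu(out)
    return out.squeeze()

def grad_wrt_f(x, z):
    grads = torch.autograd.grad(h(x, z), list(f.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

z = torch.randn(d_z)
x0, x1 = torch.randn(d_x), torch.randn(d_x)
# Empirical hyperkernel value of Eq. 17 for the pair ((x0, z), (x1, z)).
print((grad_wrt_f(x0, z) @ grad_wrt_f(x1, z)).item())
```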
5.2 Training Dynamics
We consider a rotation prediction task. In this task, the hypernetwork f is provided with a randomly rotated image and the primary network g is provided with another version of the same image, rotated by a random angle. The setting is cast as a classification task, where the goal is to predict the closest value to the rotation angle within a fixed discrete set of angles. We experimented with the MNIST [18] and CIFAR10 [17] datasets. For each dataset we took a small number of training samples only.
We investigate the effect of the depth and width of g on the training dynamics of a hypernetwork, and compare the performance of hypernetworks of various architectures. The architectures of the hypernetwork and the primary network are as follows. The hypernetwork, f, is a fully-connected ReLU neural network of fixed depth and width. The inputs of f are flattened vectors of dimension c·h·w, where c specifies the number of channels and h, w the height/width of each image (c = 1, h = w = 28 for MNIST and c = 3, h = w = 32 for CIFAR10). The primary network g is a fully-connected ReLU neural network whose depth is varied. Since the MNIST rotations dataset is simpler, we varied the width of g over several values, and for the CIFAR10 variation we selected a single width for g. The network outputs the class scores and is trained using the cross-entropy loss.
We trained the hypernetworks for 100 epochs, using the SGD method with batch size 100. For completeness, we conducted a sensitivity study on the learning rate for both datasets, to show that the reported behaviour is consistent across learning rates; see the appendix. In Fig. 2(a-c) we compare the performance of the various architectures on the MNIST and CIFAR10 rotation prediction tasks. The performance is computed as an average and standard deviation (error bars) over multiple runs. As can be seen, we observe a clear improvement in test performance as the width of g increases, especially at initialization. When comparing the three plots, we observe that when f is wide and kept fixed, a deeper g incurs slower training and lower overall test performance. This is aligned with the conclusions of Thm. 1.
5.3 Image representation and Inpainting
We compare the performance of a hypernetwork and kernel regression with the hyperkernel on two visual tasks: functional image representation and inpainting. In the MNIST image representation task, the goal of the hypernetwork is to represent an input image via the primary network g, which receives image pixel coordinates and outputs pixel values. In the inpainting task, the goal is the same, except that only half of the image is observed by f.
Problem Setup We cast these problems as a meta-learning problem, where f receives an image, and the goal of the primary network is then to learn a conditional mapping from pixel coordinates to pixel values for all the pixels in the image, with the MSE as the metric. Our training dataset then consists of samples (x_i, z_i, y_i), such that x_i are images, z_i are random pixel locations (i.e., tuples of row and column coordinates), and y_i is a label specifying the pixel value at the specified location (normalized between 0 and 1). In both experiments, training was done on randomly sampled training data of varying size N.
Evaluation We evaluate the performance of both training a hypernetwork and using kernel regression with the hyperkernel. For kernel regression, we use the following formula to infer the pixel value of a test point (x*, z*):

$$\hat{y}(x^{*}, z^{*}) = k(x^{*}, z^{*})^{\top} K^{-1} y, \qquad (18)$$

where K is the hyperkernel matrix evaluated on all of the training data, y is the vector of labels in the training dataset, and k(x*, z*) is the vector of hyperkernel values between the test point and each training point.
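A minimal sketch of the kernel regression predictor in Eq. 18 is given below; the small ridge term is an added assumption for numerical stability and is not part of the stated formula.

```python
import numpy as np

def kernel_regression(K_train, k_test, y_train, ridge=1e-6):
    # K_train: (N, N) hyperkernel matrix on the training pairs (x_i, z_i)
    # k_test:  (N,)   kernel values between the test point and each training pair
    # y_train: (N,)   pixel values of the training points
    N = K_train.shape[0]
    alpha = np.linalg.solve(K_train + ridge * np.eye(N), y_train)
    return k_test @ alpha

# Toy usage with a random positive semi-definite kernel matrix:
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 8))
K = A @ A.T
print(kernel_regression(K, K[0], rng.uniform(size=50)))
```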
In Tab. 1, we compare the results of the hyperkernel and the NNGP kernel with the corresponding hypernetwork. The reported numbers are averages over multiple runs. As can be seen, in the case of a small dataset, the kernels outperform the hypernetwork, and the NNGP kernel outperforms the rest.
6 Conclusions
In this paper, we apply the well-established large width analysis to hypernetwork-type models. For the class of models analyzed, we have shown that a wide hypernetwork must be coupled with a wide primary network in order to achieve simplified, convex training dynamics as in standard architectures.
The deeper g is, the more complicated the evolution is. In the dually infinite case, when the widths of both the hyper and primary networks tend to infinity, the optimization of the hypernetwork becomes convex and is governed by the proposed hyperkernel.
The analysis presented in this paper is limited to a specific type of hypernetwork, typically found in the functional neural representation literature, and we leave the extension of this work to additional types of hyper models to future work.
Some of the tools developed in this study also apply to regular NTKs. Specifically, [6] provide a conjecture, one of whose consequences is an upper bound on the high order Taylor expansion terms of wide networks. In Thm. 1 we prove that this hypothesized upper bound is increasingly loose as the order increases, and derive the exact asymptotic behaviour of these terms as a function of the width.
Broader Impact
This work improves our understanding and design of hypernetworks and hopefully will help us improve the transparency of machine learning involving them. Beyond that, this work falls under the category of basic research and does not seem to have particular societal or ethical implications.
Acknowledgements and Funding Disclosure
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant ERC CoG 725974). The contribution of Tomer Galanti is part of Ph.D. thesis research conducted at Tel Aviv University.
References
- [1] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019.
- [2] Yu Bai and Jason D. Lee. Beyond linearization: On quadratic and higher-order approximation of wide neural networks. In International Conference on Learning Representations, 2020.
- [3] Yunru Bai, Ben Krause, Haiquan Wang, Caiming Xiong, and Richard Socher. Taylorized training: Towards better approximation of neural network training at finite width. ArXiv, 2020.
- [4] Andrew Brock, Theo Lim, J.M. Ritchie, and Nick Weston. SMASH: One-shot model architecture search through hypernetworks. In International Conference on Learning Representations, 2018.
- [5] Oscar Chang, Lampros Flokas, and Hod Lipson. Principled weight initialization for hypernetworks. In International Conference on Learning Representations, 2020.
- [6] Ethan Dyer and Guy Gur-Ari. Asymptotics of wide networks from feynman diagrams. In International Conference on Learning Representations, 2020.
- [7] Tomer Galanti and Lior Wolf. On the modularity of hypernetworks. In Advances in Neural Information Processing Systems 33. Curran Associates, Inc., 2020.
- [8] David Ha, Andrew M. Dai, and Quoc V. Le. Hypernetworks. In International Conference on Learning Representations, 2016.
- [9] Boris Hanin and Mihai Nica. Finite depth and width corrections to the neural tangent kernel. In International Conference on Learning Representations, 2020.
- [10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- [11] Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2018.
- [12] Siddhant M. Jayakumar, Jacob Menick, Wojciech M. Czarnecki, Jonathan Schwarz, Jack Rae, Simon Osindero, Yee Whye Teh, Tim Harley, and Razvan Pascanu. Multiplicative interactions and where to find them. In International Conference on Learning Representations, 2020.
- [13] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc V Gool. Dynamic filter networks. In Advances in Neural Information Processing Systems 29. Curran Associates, Inc., 2016.
- [14] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
- [15] Benjamin Klein, Lior Wolf, and Yehuda Afek. A dynamic convolutional layer for short range weather prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
- [16] Sylwester Klocek, Lukasz Maziarka, Maciej Wolczyk, Jacek Tabor, Jakub Nowak, and Marek Śmieja. Hypernetwork functional image representation. Lecture Notes in Computer Science, 2019.
- [17] Alex Krizhevsky. Convolutional deep belief networks on cifar-10, 2010.
- [18] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.
- [19] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as gaussian processes. In International Conference on Learning Representations, 2018.
- [20] Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019.
- [21] Etai Littwin, Tomer Galanti, and Lior Wolf. On random kernels of residual architectures. Arxiv, 2020.
- [22] Gidi Littwin and Lior Wolf. Deep meta functionals for shape representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
- [23] Jonathan Lorraine and David Duvenaud. Stochastic hyperparameter optimization through hypernetworks, 2018.
- [24] Henry B. Mann and Abraham Wald. On stochastic limit and order relationships. Annals of Mathematical Statistics, 14(3):217–226, 09 1943.
- [25] Eliya Nachmani and Lior Wolf. Hyper-graph-network decoders for block codes. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019.
- [26] Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-dickstein. Bayesian deep convolutional networks with many channels are gaussian processes. In International Conference on Learning Representations, 2019.
- [27] Daniel Park, Jascha Sohl-Dickstein, Quoc Le, and Samuel Smith. The effect of network width on stochastic gradient descent and generalization: an empirical study. In Proceedings of Machine Learning Research, volume 97, pages 5042–5051. PMLR, 2019.
- [28] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20. Curran Associates, Inc., 2008.
- [29] G. Riegler, S. Schulter, M. Rüther, and H. Bischof. Conditioned regression models for non-blind single image super-resolution. In IEEE International Conference on Computer Vision (ICCV), pages 522–530, 2015.
- [30] M. Rotman and L. Wolf. Electric analog circuit design with hypernetworks and a differential simulator. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
- [31] Tim G. J. Rudner. On the connection between neural processes and gaussian processes with deep kernels. In Workshop on Bayesian Deep Learning (NeurIPS), 2018.
- [32] Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. ArXiv, 2016.
- [33] Vincent Sitzmann, Julien N. P. Martel, Alexander W. Bergman, David B. Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. Arxiv, 2020.
- [34] Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. Arxiv, 2020.
- [35] Twan van Laarhoven. L2 regularization versus batch and weight normalization. Arxiv, 2017.
- [36] Johannes von Oswald, Christian Henning, João Sacramento, and Benjamin F. Grewe. Continual learning with hypernetworks. In International Conference on Learning Representations, 2020.
- [37] Colin Wei, Jason D. Lee, Qiang Liu, and Tengyu Ma. On the margin theory of feedforward neural networks. ArXiv, 2018.
- [38] Blake Woodworth, Suriya Gunasekar, Jason D. Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro. Kernel and rich regimes in overparametrized models. In Proceedings of Machine Learning Research, volume 125, pages 3635–3673. PMLR, 2020.
- [39] Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. Pay less attention with lightweight and dynamic convolutions. In International Conference on Learning Representations, 2019.
- [40] Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. ArXiv, 2019.
- [41] Greg Yang. Tensor programs I: Wide feedforward or recurrent neural networks of any architecture are gaussian processes. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019.
- [42] Greg Yang and Samuel Schoenholz. Mean field residual networks: On the edge of chaos. In Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 2017.
- [43] Dingli Yu, Ruosong Wang, Zhiyuan Li, Wei Hu, Ruslan Salakhutdinov, Sanjeev Arora, and Simon S. Du. Enhanced convolutional neural tangent kernels, 2020.
- [44] Chris Zhang, Mengye Ren, and Raquel Urtasun. Graph hypernetworks for neural architecture search. In International Conference on Learning Representations, 2019.
Appendix A Implementation Details
A.1 Convergence of the Hyperkernel
In Fig. 1(a) (main text) we plot the variance of the kernel values in log-scale, as a function of the widths of both f and g. The variance was computed empirically over normally distributed samples. As can be seen, the variance of the kernel tends to zero only when both widths increase. In Fig. 1(b) (main text) we plot the value of the empirical kernel and its variance for a fixed hypernetwork width and two different widths of g. The x-axis specifies the input parameter and the y-axis specifies the value of the kernel. As can be seen, the expected value of the empirical kernel is equal to the width-limit (i.e., theoretical) kernel for both widths. In addition, convergence to the width-limit kernel is obtained only when the widths of both networks increase, highlighting the importance of wide architectures for both the hyper and implicit networks for stable training.
A.2 Image Completion and Inpainting
Architectures In both tasks, we used fully connected architectures, where f contains two hidden layers and g contains one hidden layer. The hyperkernel used corresponds to the infinite width limit of the same architecture. For the input of g, we used random Fourier features [34] of the pixel coordinates as inputs for both the hyperkernel and the hypernetwork. To ease the computational burden of computing the full kernel matrix when evaluating the hyperkernel, we compute smaller kernel matrices on subsets of the data, where each subset contains 1k input images and 20 random image coordinates per input, producing a kernel matrix of size 20,000 × 20,000. The final output prediction is then given by:
(19)
where the labels are those of the corresponding subset. For the hypernetwork evaluation, we used the same inputs to train the hypernetwork, with a batch size and learning rate that were found to produce the best results.
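Under the assumption that the final prediction averages the kernel-regression outputs of the individual subsets (one reading of the description above), a minimal sketch is:

```python
import numpy as np

def subset_prediction(subset_kernels, subset_test_vectors, subset_labels, ridge=1e-6):
    # subset_kernels[j]:      kernel matrix K_j computed on the j'th data subset
    # subset_test_vectors[j]: kernel values between the test point and the j'th subset
    # subset_labels[j]:       labels of the j'th subset
    preds = []
    for K, k, y in zip(subset_kernels, subset_test_vectors, subset_labels):
        alpha = np.linalg.solve(K + ridge * np.eye(K.shape[0]), y)
        preds.append(k @ alpha)
    return float(np.mean(preds))
```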
Appendix B Additional Experiments
B.1 Sensitivity Study
To further demonstrate the behavior reported in Fig. 2 (main text), we verified that it is consistent regardless of the value of the learning rate. We used the same architectures as in the rotation prediction experiments, i.e., f is a fully-connected ReLU neural network of fixed depth and width, and the depth of g is varied. We varied the learning rate over several values. For each value of the learning rate, we report the average performance (and standard deviation over runs) of the various architectures after a fixed number of training epochs.
As can be seen in Fig. 4, when f is wide and kept fixed, there is a clear improvement in test performance as the width of g increases, for every learning rate at which the networks provide non-trivial performance. When f is wide and kept fixed, a deeper g incurs slower training and lower overall test performance. We note that it might seem that the width of g does not affect the performance for certain learning rates in all settings except Figs. 4(c,f). Indeed, we can verify from Fig. 2 (main text) that the performance at the final epoch is similar for different widths. However, for earlier epochs, the performance improves for shallower and wider architectures.
Figure 4: Sensitivity study over learning rates. (a) g of depth 3, (b) g of depth 6, (c) g of depth 8, (d) g of depth 3, (e) g of depth 6, (f) g of depth 8.
B.2 Training wide networks with a large learning rate
Remark 1 (main text) states that one is able to train wide networks with a large learning rate. To validate this remark, we trained shallow networks of varying width with a large learning rate on MNIST. As can be seen in Fig. 5, training these networks is possible despite the very large learning rate. In fact, we observe that the accuracy and loss improve as we increase the width of the network.
Figure 5: Training shallow networks of varying width with a large learning rate on MNIST; panels (a) and (b) show the results for varying widths.
Appendix C Correlation Functions
Correlation functions are products of general high order tensors representing high order derivatives of a network's output with respect to its weights. In [6], a conjecture is posed on the order of magnitude of general correlation functions involving high order derivative tensors, which arise when analysing the dynamics of gradient descent. Roughly speaking, given inputs and the outputs of a neural network with normally distributed parameters, correlation functions take the form:
(20) |
where
(21) |
For instance, the following are two examples of correlation functions,
(22) |
Computing the expected value of these correlation functions involves keeping track of various moments of normally distributed weights along paths, as done in recent finite width correction works [9, 21]. [6] employ Feynman diagrams to efficiently compute the expected values (orders of magnitude) of general correlation functions, albeit at the cost of only being provably accurate for deep linear or shallow ReLU networks. In this work, we analyze the asymptotic behaviour of correlation functions of the form:
(23) | ||||
where the first factor is a rank-n tensor, representing the n'th derivative of the output, and the bracketed argument denotes outer products of the gradients for different examples. As we have shown in Sec. 3, terms of the form in Eq. 23 represent high order terms in the multivariate Taylor expansion of the outputs, and are, therefore, relevant for the full understanding of training dynamics. As a consequence of Thm. 1, we derive the exact asymptotic behaviour of such terms for vanilla neural networks as a function of the width of the network.
This result is a partial solution to the open problem suggested by [6]. In their paper, they conjecture the asymptotic behaviour of general correlation functions, and predict an upper bound on the asymptotic behaviour of terms of the form in Eq. 23. Our result therefore proves a stronger version of the conjecture, while giving the exact behaviour as a function of the width.
Appendix D Proofs of the Main Results
Terminology and Notations Throughout the appendix, we denote by and the outer and Hadamard products of the tensors and (resp.). When considering the outer products of a sequence of tensors , we denote, . We denote by the sign function. The notation states that converges in distribution to some non-zero random variable . A convenient property of this notation is that it satisfies: when and . Throughout the paper, we will make use of sequential limits and denote to express that tend to infinity, then , and so on. For a given sequence of random variable , we denote by (), when converges in distribution (probability) to a random variable .
D.1 Useful Lemmas
Lemma 1.
Let . Then, .
Proof.
We have:
(24) |
Hence, converges in distribution to . ∎
D.2 Main Technical Lemma
In this section, we prove Lem. 3, which is the main technical lemma that enables us to prove Thm. 1. Let f be a neural network as defined above. We would like to estimate the order of magnitude of the following expression:
(25) |
where , and . For simplicity, when, , we denote: and when as well.
To estimate the order of magnitude of the expression in Eq. 25, we provide an explicit expression for . First, we note that for any , such that, is times continuously differentiable at , for any set , we have:
(26) |
where the set is an ordered version of , i.e., the two sets consist of the same elements but . In addition, we notice that for any multi-set , such that, for some , then,
(27) |
since is a neural network with a piece-wise linear activation function. Therefore, with no loss of generality, we consider , such that, . It holds that:
(28) |
where is a tensor, defined as follows:
(29) |
where:
(30) |
and:
(31) |
The individual gradients can be expressed using:
(32) |
Note that the following holds for any :
(33) |
In the following, given the sets , and , we derive the limit of using elementary tensor algebra. By Eqs. 32 and 28, we see that:
(34) | ||||
We recall the analysis of [41], showing that in the infinite width limit, every pre-activation of the network at a hidden layer has all its coordinates tending to i.i.d. centered Gaussian processes, with covariance defined recursively as follows:
(35) | ||||
In addition, we define the derivative covariance as follows:
(36) |
when considering and from the training set, we simply write and .
Lemma 2.
The following holds:
-
1.
For , we have: .
-
2.
For , we have: .
Here, is an indicator that returns if is true and otherwise.
Proof.
See [1].
Lemma 3.
Let and sets , and . We have:
(37) |
as . Here, are centered Gaussian variables with finite, non-zero variances, and .
Proof.
The case is trivial. Let . By Eq. 34, it holds that:
(38) | ||||
Note that intermediate activations do not depend on the index , and so we remove the dependency on in the relevant terms. Next, by applying Lem. 2,
(39) |
Expanding the second term using Eq. 33:
(40) | ||||
The above expression is fully implementable in a Tensor Program (see [41, 40]), and approaches a GP as width tend to infinity. In other words:
(41) |
and denoting , and , it holds using the multivariate Central Limit theorem:
(42) |
Using the Mann-Wald theorem [24] (where we take the mapping as the product pooling of ), we have that:
(43) |
Finally, by Slutsky’s theorem,
(44) |
Assigning completes the proof. ∎
D.3 Proof of Thm. 1
Since we assume that is a finite neural network, i.e., for all , throughout the proofs with no loss of generality we assume that .
Lemma 4.
Let be a hypernetwork. We have:
(45) |
Proof.
By the higher order product rule and the fact that the second derivative of a piece-wise linear function is everywhere:
(46) |
where
(47) |
In addition, by elementary tensor algebra, we have:
(48) | ||||
∎
Lemma 5.
Let be a hypernetwork. In addition, let,
(49) |
We have:
(50) |
Proof.
We have:
(51) |
By the product rule:
(52) |
Hence,
(53) |
In particular,
(54) |
∎
See 1
Proof.
Throughout the proof, in order to derive certain limits of various sequences of random variables, we implicitly make use of the Mann-Wald theorem [24]. For simplicity, oftentimes, we will avoid explicitly stating when this theorem is applied. As a general note, the repeated argument is as follows: the terms appearing below can be expressed as continuous mappings of jointly convergent random variables. Hence, they jointly converge, and continuous mappings over them converge as well.
(55) |
where . By the Mann-Wald theorem [24] converges to some random variable . If is a continuous function, then converges to . If is the ReLU activation function, by Lem. 1, converges to in distribution. We notice that converges in distribution to some random variable .
The proof is divided into two cases: and .
Case :
First, we note that for and (i.e., ), we have:
(56) |
In addition, as it is an empty product. Therefore, we can rewrite:
(57) |
By Lem. 3, for , the above tends to a constant as . For , converges in distribution to zero for all and converges to a non-constant random variable otherwise. Hence, by the Mann-Wald theorem [24],
(58) |
which is a non-zero random variable.
Case :
By Lem. 3, converges in distribution to zero for all . Therefore, in these cases, by Slutsky’s theorem, converges to zero in distribution. On the other hand, for each , and , by Lem. 3, we have:
(59) |
In particular,
(60) |
Consider the case where . In this case, for any , such that, there are indices , such that, . The following random variable converges in distribution:
(61) |
Therefore, by Slutsky’s theorem:
(62) |
We have:
(63) | ||||
which is a non-constant random variable.
Next, we consider the case when . By Lem. 3, for any , the term tends to zero as . In addition, converges in distribution. Therefore, for any , we have:
(64) |
Hence, for any , such that, there is at least one , we have:
(65) |
On the other hand, for any , the terms , and converge jointly in distribution to some random variables , and as . Hence,
(66) |
which is a non-constant random variable. ∎
D.4 Proofs of the Results in Sec. 4
See 2
Proof.
By [41], taking the width to infinity, the outputs are governed by a centered Gaussian process, such that, the entries , given some input , are independent and identically distributed. Moreover, it holds that:
(67) |
with as defined in Eq. 9. For the function , it holds for the first layer:
(68) |
After taking the limit to infinity, the implicit network is fed with Gaussian distributed weights. And so also converges to a Gaussian process, such that:
(69) |
where:
(70) |
In a similar fashion to the standard feed forward case, the pre-activations converge to Gaussian processes as we let tend to infinity, with a covariance defined recursively:
(71) |
where,
(72) |
and
(73) |
proving the claim. ∎
See 1
Proof.
We prove that is a function of and by induction. First, we note that is a function of and by definition. By the recursive definition of , it is a function of . Therefore, can be simply represented as a function of and . We assume by induction that is a function of and . We would like to show that is a function of and . By definition, is a function of and . In addition, is a function of . Hence, by induction, is simply a function of and . Since is a function of , we conclude that one can represent as a function of and . ∎
See 2
Proof.
We make use of the following lemma in the proof of Thm. 3.
Lemma 6.
Recall the parametrization of the implicit network:
(76) |
For any pair , we denote:
(77) |
It holds that:
-
1.
.
-
2.
.
where the limits are taken with respect to sequentially.
Proof.
We have:
(78) | ||||
Note that it holds that when sequentially, we have:
(79) | ||||
Applying the above recursively proves the first claim. Using the first claim, along with the derivation of the neural tangent kernel (see [1]) proves the second claim. ∎
See 3
Proof.
Recalling that , concatenated into a single vector of length . The components of the inner matrix are given by:
(80) |
and it holds that in the infinite width limit, is a diagonal matrix:
(81) |
By letting the widths and tend to infinity consecutively, by Lem. 6, it follows that:
(82) |
Since converges to the diagonal matrix , the limit of is given by:
(83) | ||||
where we used the results of Lem. 6.
Next, we would like to prove that . For this purpose, we write the derivative explicitly:
(84) |
We notice that the two terms are the same up to changing between the inputs and . Therefore, with no loss of generality, we can simply prove the convergence of the second term. We have:
(85) | ||||
We analyze each term separately.
Analyzing the first term
By substituting , we have:
(86) | ||||
It then follows:
(87) | ||||
We notice that:
(88) | ||||
We recall that converges to a GP (as a function of ) as [19]. Therefore, are special cases of the terms (see Eq. 25) with weights that are distributed according to a GP instead of a normal distribution. In this case, we have: , , the neural network is replaced with , the weights are translated into . We recall that the proof of Lem. 3 showing that is simply based on Lem. 2. Since Lem. 6 extends Lem. 2 to our case, the proof of Lem. 3 can be applied to show that .
Analyzing the second term
We would like to show that for any , we have:
(89) |
as . Since , we have:
(90) | ||||
In addition, we have:
(91) |
We note that converges in distribution as . Therefore, we can simply analyze the convergence of:
(92) |
Since is a constant, it is enough to show that each term converges to zero. We have:
(93) | ||||
where is the ’th output of over . In addition, the summation is done over the indices of the corresponding tensors. We note that for any , the number of indices is finite. We would like to show that each term in the sum tends to zero as . We can write:
(94) |
By Lem. 3, the term in Eq. 94 tends to zero as . In addition, it is easy to see that , and converge to some random variables. Therefore, for any fixed , the above sum converges to zero as . ∎