
Why Quantization Improves Generalization: NTK of Binary Weight Neural Networks

Kaiqi Zhang, Ming Yin, Yu-Xiang Wang
Abstract

Quantized neural networks have drawn much attention because they reduce space and computational complexity during inference. Moreover, there is folklore that quantization acts as an implicit regularizer and thus can improve the generalizability of neural networks, yet no existing work formalizes this interesting belief. In this paper, we take the binary weights in a neural network as random variables under stochastic rounding, and study the distribution propagation over the layers of the neural network. We propose a quasi neural network to approximate this distribution propagation: a neural network with continuous parameters and a smooth activation function. We derive the neural tangent kernel (NTK) for this quasi neural network, and show that the eigenvalues of the NTK decay at an approximately exponential rate, comparable to that of a Gaussian kernel with randomized scale. This in turn indicates that the Reproducing Kernel Hilbert Space (RKHS) of a binary weight neural network covers a strict subset of the functions covered by its real-weight counterpart. Experiments verify that the quasi neural network we propose approximates the binary weight neural network well. Furthermore, the binary weight neural network attains a lower generalization gap than the real-weight neural network, mirroring the difference between the Gaussian and Laplace kernels.

1 Introduction

It has been found that by quantizing the parameters of a neural network, the memory footprint and computing cost can be greatly decreased with little to no loss in accuracy (Gupta et al., 2015). Furthermore, Hubara et al. (2016); Courbariaux et al. (2015) argued that quantization serves as an implicit regularizer and thus should improve the generalizability of a neural network compared to its full-precision version. However, to the best of our knowledge, there has been no formal theoretical investigation of this statement.

Empirical results show that traditional statistical learning techniques based on uniform convergence (e.g., VC-dimension (Blumer et al., 1989)) do not satisfactorily explain the generalization ability of neural networks. Zhang et al. (2016) showed that neural networks can perfectly fit the training data even if the labels are random, yet generalize well when the labels are not. This suggests that the capacity of a neural network depends not only on the model, but also on the dataset. Recent studies (He and Tao, 2020) sought to understand this empirical performance from a number of angles, including modeling stochastic gradient descent (SGD) with stochastic differential equations (SDE) (Weinan et al., 2019), studying the geometric structure of the loss surface (He et al., 2020), and overparameterization, a particular asymptotic regime in which the number of parameters of the neural network tends to infinity (Li et al., 2018; Choromanska et al., 2015; Allen-Zhu et al., 2018; Arora et al., 2019a). Recently, it was proven that the training process of a neural network in the overparameterized regime corresponds to kernel regression with the Neural Tangent Kernel (NTK) (Jacot et al., 2018). A line of work (Bach, 2017; Bietti and Mairal, 2019; Geifman et al., 2020; Chen and Xu, 2020) further studied Mercer's decomposition of the NTK and proved that it is similar to a Laplacian kernel in terms of its eigenvalues.

In this paper, we propose modeling a two-layer binary weight neural network using a model with continuous parameters. Specifically, we assume the binary weights are drawn from a Bernoulli distribution whose parameters (the means of the weights) are trainable. We propose a quasi neural network, which has the same structure as a vanilla neural network but a different activation function, and prove that one can analytically approximate the expected output of the binary weight neural network with this quasi neural network. Using this model, our main contributions are as follows:

  • Under the overparameterized regime, we prove that the gradient computed by the BinaryConnect algorithm is approximately an unbiased estimator of the gradient of the quasi neural network, so the quasi neural network can model the training dynamics of the binary weight neural network.

  • We study the NTK of two-layer binary weight neural networks through the “quasi neural network”, and show that the eigenvalues of this kernel decay at an exponential rate, in contrast with the polynomial rate of a ReLU neural network (Chen and Xu, 2020; Geifman et al., 2020). We reveal the similarity between the Reproducing Kernel Hilbert Space (RKHS) of this kernel and that of the Gaussian kernel, and show that it is a strict subset of the RKHS of the NTK of a ReLU neural network. This indicates that the model capacity of a binary weight neural network is smaller than that of its real-weight counterpart, and explains the higher training error and lower generalization gap observed empirically.

2 Related work

Quantized neural networks.

There is a large body of work focusing on training neural networks with quantized weights (Marchesi et al., 1993; Hubara et al., 2017; Gupta et al., 2015; Liang et al., 2021; Chu et al., 2021), including radically quantizing the weights to binary (Courbariaux et al., 2016; Rastegari et al., 2016) or ternary (Alemdar et al., 2017) values, often at a mild cost in predictive accuracy. Despite all these empirical works, the theoretical analysis of quantized neural networks and their convergence is not well studied. Many researchers believe that quantization adds noise to the model, which serves as an implicit regularizer and makes neural networks generalize better (Hubara et al., 2016; Courbariaux et al., 2015), but this statement is intuitive and, to the best of our knowledge, has never been formally proved. One may argue that binary weight neural networks have a smaller parameter space than their real-weight counterparts, yet Ding et al. (2018) showed that a quantized ReLU neural network with enough parameters can approximate any ReLU neural network with arbitrary precision. These seemingly contradictory results motivate us to find another way to explain the stronger generalization ability observed empirically.

Theory of deep learning and NTK.

A notable recent technique in the theory of neural networks is the neural tangent kernel (NTK) (Jacot et al., 2018). It draws a connection between an over-parameterized neural network and kernel learning, making it possible to study the generalization of overparameterized neural networks with more mature theoretical tools from kernel learning (Bordelon et al., 2020; Simon et al., 2021).

The expressive power of kernel learning is determined by the RKHS of the kernel, and much research has been devoted to identifying it. Bach (2017); Bietti and Mairal (2019) studied the spectral properties of the NTK of a two-layer neural network without bias. Geifman et al. (2020) further studied the NTK with bias and showed that the RKHS of two-layer neural networks contains the same set of functions as the RKHS of the Laplacian kernel. Chen and Xu (2020) extended this result to neural networks with an arbitrary number of layers. All these works concern neural networks with real weights; to the best of our knowledge, we are the first to study the NTK and generalization of binary weight neural networks.

3 Preliminary

3.1 Neural tangent kernel

It has been found that an overparameterized neural network has many local minima, and most of them are almost as good as the global minima (Soudry and Carmon, 2016). As a result, during training, the model parameters often do not need to move far from the initialization point before reaching a local minimum (Du et al., 2018, 2019; Li and Liang, 2018); this phenomenon is known as lazy training (Chizat et al., 2018). It allows one to approximate a neural network with a model that is nonlinear in its input but linear in its parameters. Using the connection between feature maps and kernel learning, the optimization problem then reduces to a kernel optimization problem, as detailed below.

Denote $\Theta$ as the collection of all the parameters in a neural network $f_{\Theta}$ before an iteration, and $\Theta^{+}$ as the parameters after this iteration. Let $in$ denote a fixed distribution in the input space; in this paper, it is the discrete distribution induced by the training dataset. Let $\mathrm{loss}(\cdot)$ be the differentiable loss function (the label is omitted) and let $\eta$ be the step size. Using a Taylor expansion, for any test point $x'$, the first-order update rule of gradient descent can be written as

$$\begin{aligned}
\Theta^{+}-\Theta &= -\eta\,\mathbb{E}_{x\sim in}\left[\nabla_{\Theta}\mathrm{loss}(f_{\Theta}(x))\right] \\
&= -\eta\,\mathbb{E}_{x\sim in}\left[\nabla_{\Theta}f_{\Theta}(x)\,\mathrm{loss}'(f_{\Theta}(x))\right], \\
f_{\Theta^{+}}(x')-f_{\Theta}(x') &\approx -\eta\,\nabla_{\Theta}f_{\Theta}(x')\cdot\mathbb{E}_{x\sim in}\left[\nabla_{\Theta}f_{\Theta}(x)\,\mathrm{loss}'(f_{\Theta}(x))\right] \\
&= -\eta\,\mathbb{E}_{x\sim in}\left[\nabla_{\Theta}f_{\Theta}(x')\cdot\nabla_{\Theta}f_{\Theta}(x)\,\mathrm{loss}'(f_{\Theta}(x))\right] \\
&:= -\eta\,\mathbb{E}_{x\sim in}\left[\mathcal{K}(x,x')\,\mathrm{loss}'(f_{\Theta}(x))\right].
\end{aligned}$$

This indicates that the learning dynamics of an overparameterized neural network are equivalent to kernel learning with the kernel defined as

$$\mathcal{K}(x,x')=\nabla_{\Theta}^{\top}f_{\Theta}(x)\cdot\nabla_{\Theta}f_{\Theta}(x'),
$$

which is called the neural tangent kernel (NTK). As the width of the hidden layers in this neural network tends to infinity, this kernel converges to its expectation over $\Theta$ (Jacot et al., 2018).
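As a concrete illustration, the kernel above can be computed empirically for a toy two-layer ReLU network by taking the inner product of parameter gradients at two inputs. This is a minimal sketch with hand-coded gradients and the usual $1/\sqrt{\text{width}}$ scaling; the widths, dimensions, and initialization are illustrative choices, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 4096                       # input dimension, hidden width
W1 = rng.standard_normal((m, d))     # first-layer weights
w2 = rng.standard_normal(m)          # second-layer weights

def grad_f(x):
    """Gradient of f(x) = w2 . relu(W1 x / sqrt(d)) / sqrt(m) w.r.t. (W1, w2)."""
    pre = W1 @ x / np.sqrt(d)
    ind = (pre > 0).astype(float)
    g_w2 = np.maximum(pre, 0.0) / np.sqrt(m)        # df/dw2
    g_W1 = np.outer(w2 * ind, x) / np.sqrt(m * d)   # df/dW1
    return np.concatenate([g_W1.ravel(), g_w2])

def ntk(x, xp):
    """Empirical NTK: inner product of parameter gradients at x and x'."""
    return grad_f(x) @ grad_f(xp)

x = np.array([1.0, 0.0, 0.0])
xp = np.array([0.0, 1.0, 0.0])
print(ntk(x, x), ntk(x, xp))   # at large width, close to the limiting kernel values
```

Because the kernel is a plain inner product of gradient features, it is symmetric and positive semi-definite by construction; at width 4096 the diagonal entries already concentrate near their infinite-width limit.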

3.2 Exponential kernel

A common class of kernel functions used in machine learning is the exponential kernel, which is a radial basis function kernel with the general form

$$\mathcal{K}(x,x')=\exp\left(-(c\|x-x'\|)^{\gamma}\right),
$$

where $c>0$ and $\gamma\geq 1$ are constants. When $\gamma=1$, this kernel is known as the Laplacian kernel, and when $\gamma=2$, it is known as the Gaussian kernel.

According to the Moore-Aronszajn theorem, each symmetric positive definite kernel uniquely induces a Reproducing Kernel Hilbert Space (RKHS). The RKHS determines which functions can be learned using a kernel. It has been found that the RKHS of the NTK of a ReLU neural network is the same as that of the Laplacian kernel (Geifman et al., 2020; Chen and Xu, 2020), and that the empirical performance of a neural network is close to that of kernelized linear classifiers with exponential kernels on many datasets (Geifman et al., 2020).
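The two special cases of the exponential kernel are easy to evaluate directly; the small sketch below simply implements the formula above (the function name and sample inputs are ours):

```python
import numpy as np

def exp_kernel(x, xp, c=1.0, gamma=2.0):
    """Exponential kernel K(x, x') = exp(-(c ||x - x'||)^gamma)."""
    return np.exp(-(c * np.linalg.norm(np.asarray(x) - np.asarray(xp))) ** gamma)

x, xp = np.array([1.0, 0.0]), np.array([0.0, 1.0])
laplace = exp_kernel(x, xp, gamma=1.0)   # exp(-sqrt(2)), Laplacian kernel
gauss = exp_kernel(x, xp, gamma=2.0)     # exp(-2), Gaussian kernel
print(laplace, gauss)
```

Note that for the same scale $c$, the Gaussian kernel decays faster with distance than the Laplacian kernel, which is the tail behavior underlying the RKHS comparison in this paper.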

3.3 Training neural networks with quantized weights

Among the various methods to train a neural network with quantized weights, BinaryConnect (BC) (Courbariaux et al., 2015) is often one of the most efficient and accurate. The key idea is to introduce a real-valued buffer $\theta$ and use it to accumulate the gradients. The weights are quantized just before the forward and backward propagation, which benefits from the reduced computing complexity. The update rule in each iteration is

$$\theta^{+}\leftarrow\theta-\eta\frac{\partial\tilde{f}_{w}(x)}{\partial w},\qquad w^{+}\leftarrow \mathrm{Quantize}(\theta^{+}),\tag{1}$$

where $\mathrm{Quantize}(\cdot):\mathbb{R}\rightarrow\{-1,1\}$ denotes the quantization function, which will be discussed in Section 4.2, $w$ denotes the binary (or quantized) weights, $\theta$ and $\theta^{+}$ denote the real-valued buffer before and after an iteration respectively, $\eta$ is the learning rate, and $f_{w}(\cdot)$ denotes the neural network with parameters $w$. Here the gradients are computed by treating $w$ as if it were real-valued. The detailed algorithm can be found in Section A.
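A minimal sketch of one such iteration, using stochastic rounding (Section 4.2) as the quantizer. The toy gradient here is a placeholder for a backpropagated gradient evaluated at the binary weights, and the clip to $[-1,1]$ reflects keeping the buffer in range:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(theta):
    """Stochastic rounding to {-1, +1} with P(w = +1) = (theta + 1) / 2."""
    return np.where(rng.random(theta.shape) < (theta + 1) / 2, 1.0, -1.0)

def bc_step(theta, grad_at_w, eta=0.1):
    """One BinaryConnect iteration: accumulate the gradient (evaluated at the
    binary weights) into the real-valued buffer, then re-quantize."""
    theta_new = np.clip(theta - eta * grad_at_w, -1.0, 1.0)  # keep buffer in [-1, 1]
    return theta_new, quantize(theta_new)

theta = rng.uniform(-1, 1, size=5)
w = quantize(theta)
grad = 2.0 * w                    # placeholder gradient evaluated at the binary weights
theta, w = bc_step(theta, grad)
print(theta, w)
```

The important design point is that the gradient is computed at the binary weights $w$ but applied to the real-valued buffer $\theta$, so small gradient signals accumulate across iterations instead of being destroyed by rounding.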

4 Approximation of binary weight neural network

4.1 Notations

In this paper, we use $w_{\ell,ij}$ to denote the binary weights in the $\ell$-th layer, $\theta_{\ell,ij}$ to denote their real-valued counterparts, and $b_{\ell,i}$ to denote the (real-valued) biases. $\Theta$ is the collection of all the real-valued model parameters, which will be specified in Section 4.2. The number of neurons in the $\ell$-th hidden layer is $d_{\ell}$; the input to the $\ell$-th linear layer is $\boldsymbol{x}_{\ell}$ and its output is $\boldsymbol{y}_{\ell}$; $d$ denotes the number of input features. Besides, we use $\boldsymbol{x}$ to denote the input to the neural network, $y$ its output, and $z$ the label.

We focus on the mean and variance under the randomness of stochastic rounding. Denote

$$\begin{aligned} \mu_{\ell,i} &:= \mathbb{E}[x_{\ell,i}|\boldsymbol{x},\Theta], & \sigma_{\ell,i}^{2} &:= \mathrm{Var}[x_{\ell,i}|\boldsymbol{x},\Theta], \\ \nu_{\ell,i} &:= \mathbb{E}[y_{\ell,i}|\boldsymbol{x},\Theta], & \varsigma_{\ell,i}^{2} &:= \mathrm{Var}[y_{\ell,i}|\boldsymbol{x},\Theta], & \bar{y} &:= \mathbb{E}[y|\Theta]. \end{aligned}$$

We use $\psi(x)=\max(x,0)$ to denote the ReLU activation function, and $in$ to denote the (discrete) distribution of the training dataset. $\mathbb{E}_{in}[\cdot]:=\mathbb{E}_{(\boldsymbol{x},z)\sim in}[\cdot]$ denotes the expectation over the training dataset, i.e., the sample average. We use bold symbols to denote collections of parameters or variables: $\boldsymbol{w}_{2}=[w_{2,j}]$, $\boldsymbol{b}_{2}=[b_{2,j}]$, $\boldsymbol{\nu}_{1}=[\nu_{1,j}]$, $\boldsymbol{\theta}_{1}=[\theta_{1,ij}]$, $i\in[d_{1}]$, $j\in[d_{2}]$.

4.2 Problem statement

(a) The binary weight neural network we focus on. (b) The quasi neural network.

In this work, we target stochastic quantization (Dong et al., 2017), which often yields higher accuracy empirically than deterministic rounding (Courbariaux et al., 2015). It also creates a smooth connection between the binary weights in a neural network and its real-valued parameters.

Let $w_{\ell,ij}=\mathrm{Quantize}(\theta_{\ell,ij})$, $\theta_{\ell,ij}\in[-1,1]$, be the binary weights produced by the stochastic quantization function, which follow the distribution:

$$w_{\ell,ij}=\begin{cases}+1, & \text{with probability } p_{\ell,ij}=\frac{\theta_{\ell,ij}+1}{2},\\ -1, & \text{with probability } 1-p_{\ell,ij}.\end{cases}\tag{2}$$

This relationship leads to $\mathbb{E}[w_{\ell,ij}|\theta_{\ell,ij}]=\theta_{\ell,ij}$.
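This unbiasedness is easy to verify numerically; the sketch below draws many samples of $w$ under (2) for a fixed $\theta$ (the value 0.4 is arbitrary) and checks that the sample mean approaches $\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.4
n = 200_000
# Samples from (2): w = +1 with probability (theta + 1) / 2, else -1.
w = np.where(rng.random(n) < (theta + 1) / 2, 1.0, -1.0)
print(w.mean())   # approaches theta = 0.4 as n grows
```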

We focus on a ReLU neural network with one hidden layer and two fully connected layers, which was also studied, without quantization, in Bach (2017); Bietti and Mairal (2019). Besides, we add a linear layer (the “additional layer”) in front of this neural network to project the input into an infinite-dimensional space. We randomly initialize the weights in this layer and leave them fixed (not trainable) throughout the training process. Furthermore, we quantize the weights in the first fully connected layer $w_{1,ij}$ and add a real-valued buffer $\theta_{1,ij}$ which determines the distribution of $w_{1,ij}$ as in (2), and leave the second layer unquantized. It is common practice to leave the last layer unquantized, because this often leads to better empirical performance. If the second layer were quantized as well, the main results of this paper would not change; this can be checked by extending Lemma 4 to the second layer.

Remark 1.

In many real applications, e.g. computer vision, the dimension of the data is often very large ($\approx 10^{3}$) while the data lie in a lower-dimensional linear subspace, so we can take the raw input in these applications as the output of the additional layer; the neural network in this case is a two-layer network whose first layer is quantized.

The set of all real-valued parameters is $\Theta=\{\theta_{1,ij},\,w_{2,j},\,b_{\ell,i}\}$. The neural network can be expressed as

$$\begin{aligned} x_{1,i} &= \frac{1}{\sqrt{d}}\sum_{k=1}^{d}w_{0,ki}x_{k}+b_{0,i}, \ \forall i\in[d_{1}]; & y_{1,j} &= \sqrt{\frac{c}{d_{1}}}\sum_{i=1}^{d_{1}}w_{1,ij}x_{1,i}+b_{1,j}, \ \forall j\in[d_{2}]; \\ x_{2,j} &= \psi(y_{1,j}), \ \forall j\in[d_{2}]; & y &= \frac{1}{\sqrt{d_{2}}}\sum_{j=1}^{d_{2}}w_{2,j}x_{2,j}+b_{2}. \end{aligned}$$

We follow the typical setting of NTK papers (Geifman et al., 2020) in initializing the parameters, except for the quantized parameters; for those, we only need to specify the real-valued buffer of the first-layer weights, $\theta_{1,ij}$.

Assumption 1.

We randomly initialize the weights in the “additional layer” and the second linear layer independently as $w_{0,ki},w_{2,j}\sim\mathcal{N}(0,1)$, and initialize all biases to 0. The real-valued buffers of the weights are initialized i.i.d. with zero mean, variance $\mathrm{Var}[\theta]$, and bounded in $[-1,1]$.

Remark 2.

Our theory applies to any initial distribution of $\theta_{1,ij}$ as long as it satisfies the constraints above. One simple example is the uniform distribution on $[-1,1]$, which has variance $\mathrm{Var}[\theta]=1/3$.

4.3 Quasi neural network

Given a fixed input and real-valued model parameters $\Theta$, under the randomness of stochastic rounding, the output of the binary weight neural network is a random variable. Furthermore, as the width of the neural network tends to infinity, the output of a linear layer tends to a Gaussian distribution by the central limit theorem (CLT). We propose a method to determine the distribution of the output from the model parameters. Specifically, we give closed-form equations for the mean and variance of the output of each layer, $\mu_{\ell},\sigma_{\ell},\nu_{\ell},\varsigma_{\ell}$, and then marginalize over the random initialization of $\Theta$ to further simplify these equations. We prove that $\varsigma_{\ell}$ converges to a constant almost surely by the law of large numbers (LLN), and simplify the expression by replacing it with that constant. This allows us to compute $\mu_{\ell},\nu_{\ell}$ for a given $\Theta$ using a neural-network-style function, which we call the quasi neural network:

$$\begin{aligned} x_{1,i} &= \frac{1}{\sqrt{d}}\sum_{k=1}^{d}w_{0,ki}x_{k}+b_{0,i}, \ \forall i\in[d_{1}]; & \nu_{1,j} &= \sqrt{\frac{c}{d_{1}}}\sum_{i=1}^{d_{1}}\theta_{1,ij}x_{1,i}+\beta b_{1,j}, \ \forall j\in[d_{2}]; \\ \mu_{2,j} &= \tilde{\psi}(\nu_{1,j}), \ \forall j\in[d_{2}]; & \bar{y} &= \frac{1}{\sqrt{d_{2}}}\sum_{j=1}^{d_{2}}w_{2,j}\mu_{2,j}+\beta b_{2}. \end{aligned}\tag{3}$$

In Section 4.3.1, we study the distribution of the output of each layer in the binary weight neural network (BWNN) conditioned on the set of real-valued parameters $\Theta$. In Section 4.3.2, we prove that the conditional variance of the output of the first linear layer converges almost surely to a constant that does not depend on the data (input). This simplifies the expression computed in Section 4.3.1 to the form of the quasi neural network (3), and also gives a closed-form expression for $\tilde{\psi}(\cdot)$ in (3). In Section 4.3.3, we prove that conditioned on the set of real-valued parameters, the expectation of the gradients of the BWNN equals the gradient of the quasi neural network in the overparameterization limit. This indicates that the training dynamics of the BWNN at initialization are the same as those of training the quasi neural network directly. The training dynamics beyond initialization are discussed in Section 4.3.4. Before jumping to the proofs, we make the following assumptions:

Assumption 2.

After training the binary weight neural network as in (1), all the real-valued weights $\theta_{\ell,ij}$ stay in the range $[-1,1]$.

Based on this assumption, we can ignore the constraint $\theta_{\ell,ij}\in[-1,1]$, and projected gradient descent reduces to gradient descent. Because of the lazy-training property of overparameterized neural networks, the model parameters $\theta_{\ell,ij}$ stay close to the initialization point during training, so this assumption can be satisfied by initializing $\theta_{\ell,ij}$ with smaller absolute values and/or applying weight decay during training. On the other hand, a common trick in quantized neural networks is to change the quantization level gradually during training to avoid (or reduce) overflow. With this trick, Assumption 2 is often naturally satisfied, but it introduces the quantization level as a trainable parameter.

Assumption 3.

The Euclidean norm of every input is 1:

$$\|\boldsymbol{x}\|_{2}=1.$$

This is a common assumption in studying NTK (Bach, 2017; Bietti and Mairal, 2019), and can be satisfied by normalizing the input data.

4.3.1 Conditioned distribution of the outputs of each layer

First, we recognize that since the model parameters $\Theta$ are initialized randomly, there are “bad” initializations that would break our analysis; for example, all $\theta_{1,ij}$ being initialized to 1 (or $-1$) even though they are drawn from a uniform distribution. Fortunately, as the widths $d_{1},d_{2}$ grow to infinity, the probability of such “bad” initializations is 0. We formalize this statement below.

Definition 4.

“Good Initialization”. We call a set of parameters a “Good Initialization”, $\Theta\in\mathcal{G}$, if it satisfies:

  • $\forall k,k'\in[d]$, $\displaystyle\lim_{d_{1}\rightarrow\infty}\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}w_{0,ki}w_{0,k'i}=\delta_{k,k'}$;

  • $\forall k,k',k''\in[d]$, $\displaystyle\lim_{d_{1}\rightarrow\infty}\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}|w_{0,ki}w_{0,k'i}w_{0,k''i}|\leq\sqrt{\frac{8}{\pi}}$;

  • $\forall k,k'\in[d]$ and $\forall j\in[d_{2}]$ for some finite $d_{2}$, $\displaystyle\lim_{d_{1}\rightarrow\infty}\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}w_{0,ki}w_{0,k'i}\theta_{1,ij}^{2}=\mathrm{Var}[\theta]\,\delta_{k,k'}$;

where $\delta_{k,k'}=1$ if $k=k'$ and $\delta_{k,k'}=0$ otherwise.
Proposition 3.

Under the assumption that all the parameters are initialized as in Assumption 1, the probability of a “Good Initialization” is 1:

$$P_{\Theta}(\Theta\in\mathcal{G})=1.$$
Lemma 4.

In the limit $d_{1}\rightarrow\infty$, for any given $\boldsymbol{x}$ and conditioned on any fixed $\Theta\in\mathcal{G}$, for any $j$, the distribution of $y_{1,j}$ converges to a Gaussian distribution with mean $\nu_{1,j}$ and variance $\varsigma_{1,j}^{2}$, given by:

$$y_{1,j}|\Theta \rightarrow \mathcal{N}(\nu_{1,j},\varsigma_{1,j}^{2}), \qquad \nu_{1,j}=\sqrt{\frac{c}{d_{1}}}\sum_{i=1}^{d_{1}}\theta_{1,ij}x_{1,i}+b_{1,j}, \qquad \varsigma_{1,j}^{2}=\frac{c}{d_{1}}\sum_{i=1}^{d_{1}}(1-\theta_{1,ij}^{2})x_{1,i}^{2}.\tag{4}$$

This lemma can be proved using the Lyapunov central limit theorem and linearity of expectation. See Section B.2 for the details.

Lemma 5.

Assume that the input to the ReLU layer $y_{1,j}$ follows a Gaussian distribution with mean $\nu_{1,j}$ and variance $\varsigma_{1,j}^{2}$ conditioned on $\Theta$:

$$y_{1,j}|\Theta\sim\mathcal{N}(\nu_{1,j},\varsigma_{1,j}^{2}).$$

Denote

$$g_{j}=\varphi\left(\frac{\nu_{1,j}}{\varsigma_{1,j}}\right),\qquad s_{j}=\Phi\left(\frac{\nu_{1,j}}{\varsigma_{1,j}}\right),\tag{5}$$

where $\varphi(x)$ denotes the standard Gaussian density and $\Phi(x)$ its integral:

$$\varphi(x)=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}x^{2}\right),\qquad \Phi(x)=\int_{-\infty}^{x}\varphi(y)\,dy.$$

Then the output $x_{2,j}$ has mean $\mu_{2,j}$ and variance $\sigma_{2,j}^{2}$ conditioned on $\Theta$, with

$$\begin{aligned}\mu_{2,j}&:=\mathbb{E}[x_{2,j}|\Theta]=g_{j}\varsigma_{1,j}+s_{j}\nu_{1,j},\\ \sigma_{2,j}^{2}&:=\mathrm{Var}[x_{2,j}|\Theta]=(\varsigma_{1,j}^{2}+\nu_{1,j}^{2})s_{j}+\nu_{1,j}\varsigma_{1,j}g_{j}-\mu_{2,j}^{2}.\end{aligned}\tag{6}$$

From Lemma 4 we know that in the limit $d_{1}\rightarrow\infty$, conditioned on $\Theta$ and $\boldsymbol{x}$, for any $j$, $y_{1,j}$ converges to a Gaussian distribution. By the continuous mapping theorem, the distribution of $x_{2,j}$ converges to that described in Lemma 5, so its mean $\mu_{2,j}$ and variance $\sigma_{2,j}^{2}$ converge to those computed in Lemma 5.

Equations (4) and (6) provide a method to calculate the mean and variance of the output conditioned on the input and the real-valued model parameters, and allow us to give a closed-form expression for the quasi neural network. We simplify this expression in Section 4.3.2.
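The closed-form mean in (6) can be checked by Monte Carlo; the sketch below compares $\mathbb{E}[\psi(y)]$ for $y\sim\mathcal{N}(\nu,\varsigma^{2})$ against $g\varsigma+s\nu$ (the particular $\nu$, $\varsigma$ values are arbitrary):

```python
import numpy as np
from math import erf, sqrt, pi, exp

rng = np.random.default_rng(0)
nu, sigma = 0.3, 0.8                      # illustrative mean and std
y = rng.normal(nu, sigma, size=1_000_000)
mc_mean = np.maximum(y, 0.0).mean()       # Monte Carlo estimate of E[relu(y)]

g = exp(-0.5 * (nu / sigma) ** 2) / sqrt(2 * pi)   # phi(nu/sigma), as in (5)
s = 0.5 * (1 + erf(nu / sigma / sqrt(2)))          # Phi(nu/sigma), as in (5)
closed_form = g * sigma + s * nu                   # mean from eq. (6)
print(mc_mean, closed_form)   # agree up to Monte Carlo error
```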

4.3.2 Convergence of conditioned variance

In this part, we assume that the model parameters form a “Good Initialization”, which holds almost surely in the limit $d_{1}\rightarrow\infty$ as proven in Proposition 3, and study the distribution of $\nu_{1,j}$ and $\varsigma_{1,j}$.

Theorem 6.

If the parameters form a “Good Initialization”, $\Theta\in\mathcal{G}$, then in the limit $d_{1}\rightarrow\infty$, for any finite $d_{2}$, the $\nu_{1,j}$ converge to Gaussian distributions that are independent of each other, and $\varsigma_{1,j}^{2}$ converges a.s. to

$$\tilde{\varsigma}_{1}^{2}=\frac{c}{d}(1-\mathrm{Var}[\theta]).$$
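A quick numerical check of this limit, with $\theta$ uniform on $[-1,1]$ (so $\mathrm{Var}[\theta]=1/3$) and a unit-norm input as in Assumption 3; the sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d1, c = 10, 100_000, 1.0
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                    # Assumption 3: ||x|| = 1
w0 = rng.standard_normal((d, d1))         # fixed additional layer
theta = rng.uniform(-1, 1, d1)            # Var[theta] = 1/3

x1 = w0.T @ x / np.sqrt(d)
var_emp = c / d1 * ((1 - theta**2) @ x1**2)   # conditional variance from (4)
var_lim = c / d * (1 - 1.0 / 3.0)             # the claimed limit
print(var_emp, var_lim)   # close for large d1
```

Note that the limit depends only on $d$, $c$, and $\mathrm{Var}[\theta]$, not on which unit-norm input was chosen, which is exactly why the variance can be treated as a data-independent constant.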

With this approximation, we can replace the variance $\varsigma_{1,j}$ in equation (6) with $\tilde{\varsigma}_{1}$, leaving the mean of the output of the linear layer as the only variable in the quasi neural network. A formal proof can be found in Section B.4. Note that the propagation function of the linear layer (the first equation in (4)) is linear in $x$ and $\theta$. This motivates us to compute $\bar{y}$ using the neural-network-like function given in (3), where $\tilde{\psi}(\cdot)$ is

$$\tilde{\psi}(\nu_{1,j})=\mathbb{E}[\psi(y_{1,j})|\nu_{1,j}]=\tilde{\varsigma}_{1}\,\varphi\left(\frac{\nu_{1,j}}{\tilde{\varsigma}_{1}}\right)+\nu_{1,j}\,\Phi\left(\frac{\nu_{1,j}}{\tilde{\varsigma}_{1}}\right).\tag{7}$$

This equation gives a closed-form connection between the mean output of the neural network, $\bar{y}$, and the real-valued model parameters $\Theta$, and allows us to apply existing tools for analyzing real-weight neural networks to binary weight neural networks. Its derivative in the sense of calculus is:

$$\tilde{\psi}'(\nu_{1,j})=\Phi\left(\frac{\nu_{1,j}}{\tilde{\varsigma}_{1}}\right).\tag{8}$$

The proof of derivative can be found in Section B.5.
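Both (7) and (8) are easy to sanity-check numerically. The sketch below treats $\tilde{\varsigma}_{1}$ as a fixed constant `s` and compares a central finite difference of $\tilde{\psi}$ against $\Phi(\nu/\tilde{\varsigma}_{1})$:

```python
import numpy as np
from math import erf, sqrt, pi

def Phi(t):
    """Standard Gaussian CDF."""
    return 0.5 * (1 + erf(t / sqrt(2)))

def quasi_relu(v, s):
    """psi~(v) = s * phi(v/s) + v * Phi(v/s), eq. (7), with s playing sigma~_1."""
    return s * np.exp(-0.5 * (v / s) ** 2) / sqrt(2 * pi) + v * Phi(v / s)

v, s, h = 0.2, 0.5, 1e-6
numeric = (quasi_relu(v + h, s) - quasi_relu(v - h, s)) / (2 * h)
analytic = Phi(v / s)                      # eq. (8)
print(numeric, analytic)   # match: the phi terms cancel, leaving only Phi
```

Intuitively, $\tilde{\psi}$ is a smoothed ReLU whose derivative is a sigmoid-shaped Gaussian CDF; as $\tilde{\varsigma}_{1}\rightarrow 0$ it recovers the hard ReLU.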

4.3.3 Gradient of quasi neural network

In this part, we compute the gradients with respect to the binary weights as in the BinaryConnect algorithm, and make sense of the derivative in (8) by proving that it is the expectation of the gradients under the randomness of stochastic rounding.

Theorem 7.

In the limit $d_{2}\rightarrow\infty, d_{1}\rightarrow\infty$, the expectation of the gradients of the output with respect to the weights, computed by sampling the quantized weights, equals the gradients of the “quasi neural network” defined in (3):

$$\frac{\partial\bar{y}}{\partial\theta_{1,ij}}=\mathbb{E}\left[\frac{\partial y}{\partial w_{1,ij}}\,\middle|\,\Theta\right],\qquad \frac{\partial\bar{y}}{\partial b_{1,j}}=\mathbb{E}\left[\frac{\partial y}{\partial b_{1,j}}\,\middle|\,\Theta\right],\qquad \frac{\partial\bar{y}}{\partial w_{2,j}}=\mathbb{E}\left[\frac{\partial y}{\partial w_{2,j}}\,\middle|\,\Theta\right].$$
Theorem 8.

For the MSE loss, $\mathrm{loss}(y)=\frac{1}{2}(y-z)^{2}$, where $z$ is the ground-truth label, the gradients of the loss converge in the limit $d_{2}\rightarrow\infty, d_{1}\rightarrow\infty$:

$$\begin{aligned}\frac{\partial\mathrm{loss}(\bar{y})}{\partial\theta_{1,ij}}&=\mathbb{E}\left[\frac{\partial\mathrm{loss}(y)}{\partial w_{1,ij}}\,\middle|\,\Theta\right], & \frac{\partial\mathrm{loss}(\bar{y})}{\partial b_{1,j}}&=\mathbb{E}\left[\frac{\partial\mathrm{loss}(y)}{\partial b_{1,j}}\,\middle|\,\Theta\right],\\ \frac{\partial\mathrm{loss}(\bar{y})}{\partial w_{2,j}}&=\mathbb{E}\left[\frac{\partial\mathrm{loss}(y)}{\partial w_{2,j}}\,\middle|\,\Theta\right].\end{aligned}$$

In other words, in this overparameterized limit the BinaryConnect algorithm provides an unbiased estimator of the gradients of the quasi neural network.
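Since $\partial y/\partial w_{2,j}=x_{2,j}/\sqrt{d_{2}}$ and $\partial\bar{y}/\partial w_{2,j}=\mu_{2,j}/\sqrt{d_{2}}$ by (3), the $w_{2}$ case of Theorem 7 reduces to $\mathbb{E}[x_{2,j}|\Theta]=\tilde{\psi}(\nu_{1,j})$, which can be checked by Monte Carlo over the stochastic rounding. The sketch below uses a finite width, so agreement is only approximate; $x_{1}$ and $\theta$ are arbitrary fixed vectors.

```python
import numpy as np
from math import erf, sqrt, pi

rng = np.random.default_rng(1)
d1, c, n_mc = 500, 1.0, 10_000
x1 = rng.standard_normal(d1)                 # a fixed first-layer activation (illustrative)
theta = rng.uniform(-1, 1, d1)               # real-valued buffer for one hidden unit

nu = sqrt(c / d1) * (theta @ x1)             # mean of y1_j, eq. (4)
s = sqrt(c / d1 * ((1 - theta**2) @ x1**2))  # std of y1_j, eq. (4)

# Monte Carlo: sample binary weights, propagate, average the ReLU output.
w = np.where(rng.random((n_mc, d1)) < (theta + 1) / 2, 1.0, -1.0)
mc = np.maximum(sqrt(c / d1) * (w @ x1), 0.0).mean()

# Quasi neural network prediction: eq. (7) with the per-unit variance from (4).
quasi = s * np.exp(-0.5 * (nu / s) ** 2) / sqrt(2 * pi) \
        + nu * 0.5 * (1 + erf(nu / s / sqrt(2)))
print(mc, quasi)   # approximately equal at finite width
```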

Theorems 6 and 8 imply that for an infinitely wide neural network, the BinaryConnect algorithm is equivalent to directly training the quasi neural network with stochastic gradient descent (SGD). Furthermore, this identifies the gradient flow of the BinaryConnect algorithm and allows us to study this training process with the neural tangent kernel (NTK).

4.3.4 Asymptotics during training

So far we have studied the distribution of the output at initialization. To study the dynamics of the binary weight neural network during training, one needs to extend these results to the parameters at any time during training, $\Theta(t), t\in[0,T]$. Fortunately, motivated by Jacot et al. (2018), we can prove that as $d_{1},d_{2}\rightarrow\infty$, the model parameters $\Theta(t)$ stay asymptotically close to the initialization for any finite $T$ (so-called “lazy training”), so the above results apply to the entire training process.

Lemma 9.

For all $T$ such that $\int_{0}^{T}\|\bar{y}(t)-z\|_{in}\,dt$ stays stochastically bounded, as $d_{2}\rightarrow\infty, d_{1}\rightarrow\infty$, the quantities $\|\boldsymbol{w}_{2}(T)-\boldsymbol{w}_{2}(0)\|$, $\|\boldsymbol{b}_{1}(T)-\boldsymbol{b}_{1}(0)\|$, and $\|\boldsymbol{\theta}_{1}(T)-\boldsymbol{\theta}_{1}(0)\|_{F}$ are all stochastically bounded, and $\|\boldsymbol{\nu}_{1}(t)-\boldsymbol{\nu}_{1}(0)\|$ and $\int_{0}^{T}\big\|\frac{\partial\boldsymbol{\nu}_{1}(t)}{\partial t}\big\|\,dt$ are stochastically bounded for all $\boldsymbol{x}$.

The proof can be found in Section B.8. Note that $\|\boldsymbol{w}_{2}\|=O(\sqrt{d_{2}})$ and $\|\boldsymbol{\theta}_{1}\|_{F}=O(\sqrt{d_{1}d_{2}})$ at initialization; this result indicates that as $d_{2}\rightarrow\infty$, the change in the parameters is much smaller than their initialization scale, the so-called “lazy training”. Making use of this result, we obtain the following:

Lemma 10.

Under the conditions of Lemma 9, Lyapunov's condition holds for all $T$, so $y_{1,j}$ converges to a Gaussian distribution conditioned on the model parameters $\Theta(T)$. Furthermore, $\varsigma_{1,j}(T)\rightarrow\varsigma_{1,j}(0)$, which equals $\tilde{\varsigma}_{1}$ almost surely.

The proof can be found in Section B.9. This result shows that the analysis in Section 4.3 applies to the entire training process, and allows us to study the dynamics of a binary weight neural network using the quasi neural network.

5 Capacity of Binary Weight Neural Network

As found in Jacot et al. (2018), the dynamics of an overparameterized neural network trained with SGD are equivalent to kernel gradient descent, where the kernel is the NTK. As a result, the effective capacity of a neural network is characterized by the RKHS of its NTK. In the following, we study the NTK of the binary weight neural network using the approximation above, and compare it with the Gaussian kernel.

5.1 NTK of three-layer binary weight neural networks

We study the NTK of the binary weight neural network via the "quasi neural network", with the kernel defined as

$$\mathcal{K}_{BWNN}(x,x^{\prime})=\sum_{i=1,j=1}^{d_{1},d_{2}}\frac{\partial\bar{y}}{\partial\theta_{1,ij}}\frac{\partial\bar{y}^{\prime}}{\partial\theta_{1,ij}}+\sum_{j=1}^{d_{2}}\frac{\partial\bar{y}}{\partial b_{1,j}}\frac{\partial\bar{y}^{\prime}}{\partial b_{1,j}}+\sum_{j=1}^{d_{2}}\frac{\partial\bar{y}}{\partial w_{2,j}}\frac{\partial\bar{y}^{\prime}}{\partial w_{2,j}},\qquad(9)$$

where $\Theta:=\{\theta_{1,ij},b_{1,j},w_{2,j}\}$ denotes all the trainable parameters. We omit the terms related to $b_{2}$ (which is a constant) in this equation.
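As a concrete sanity check, the gradient inner products in (9) can be evaluated in closed form for a finite-width two-layer network. The sketch below is a toy version under stated assumptions of ours: tanh stands in for the quasi network's smooth activation, the parameters are standard Gaussian, and the widths and inputs are arbitrary choices, so it illustrates the structure of (9) rather than the paper's exact model.

```python
import numpy as np

def ntk_quasi(x, xp, width=5000, seed=0):
    """Empirical NTK in the spirit of Eq. (9) for a two-layer network
    y = (1/sqrt(width)) * sum_j w2_j * tanh(<theta_j, x>/sqrt(d) + b1_j),
    summing gradient inner products over theta_1, b_1 and w_2."""
    d = len(x)
    rng = np.random.default_rng(seed)
    theta1 = rng.standard_normal((d, width))   # first-layer weights
    b1 = rng.standard_normal(width)
    w2 = rng.standard_normal(width)
    act = np.tanh
    dact = lambda z: 1.0 - np.tanh(z) ** 2     # derivative of tanh

    z = x @ theta1 / np.sqrt(d) + b1           # pre-activations for x
    zp = xp @ theta1 / np.sqrt(d) + b1         # pre-activations for x'
    k_w2 = act(z) * act(zp)                    # dy/dw2 terms
    k_b1 = w2 ** 2 * dact(z) * dact(zp)        # dy/db1 terms
    k_theta1 = k_b1 * (x @ xp) / d             # dy/dtheta1 terms
    return np.sum(k_w2 + k_b1 + k_theta1) / width  # 1/sqrt(width) output scaling

x, xp = np.array([1.0, 0.0]), np.array([0.6, 0.8])  # unit-norm inputs
```

At large width, repeated calls with different seeds return nearly identical values, reflecting the convergence of the empirical kernel to its deterministic infinite-width limit.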

We first prove that the change of the kernel asymptotically converges to 0 during the training process.

Theorem 11.

Under the conditions of Lemma 9, $\mathcal{K}(x,x^{\prime})(T)\rightarrow\mathcal{K}(x,x^{\prime})(0)$ at rate $1/\sqrt{d_{2}}$ for any $x,x^{\prime}$.

The proof can be found in Section C.3. Using Assumption 3, we confine the input to the hypersphere $\mathbb{S}^{d-1}=\{x\in\mathbb{R}^{d}:\|x\|_{2}=1\}$. One can easily verify that this kernel is positive definite, so we can apply Mercer's decomposition (Minh et al., 2006) to it.

To find the basis functions and eigenvalues of this kernel, we apply the spherical harmonics decomposition, which is common in the study of NTKs (Bach, 2017; Bietti and Mairal, 2019):

$$\mathcal{K}_{BWNN}(x,x^{\prime})=\sum_{k=1}^{\infty}u_{k}\sum_{j=1}^{N(d,k)}Y_{k,j}(x)Y_{k,j}(x^{\prime}),\qquad(10)$$

where $d$ denotes the dimension of $x$ and $x^{\prime}$, and $Y_{k,j}$ denotes the spherical harmonics of order $k$. This suggests that the NTK of a binary weight neural network and the exponential kernel can be spanned by the same set of basis functions. The key question is the decay rate of $u_{k}$ with $k$.

Theorem 12.

The NTK of a binary weight neural network can be decomposed as in (10). If $k\gg d$, then

$$\mathrm{Poly}_{1}(k)\,C^{-k}\leq u_{k}\leq\mathrm{Poly}_{2}(k)\,C^{-k},\qquad(11)$$

where $\mathrm{Poly}_{1}(k)$ and $\mathrm{Poly}_{2}(k)$ denote polynomials in $k$, and $C$ is a constant.

In contrast, Geifman et al. (2020) show that for the NTK in the continuous space, it holds that

$$C_{1}k^{-d}\leq u_{k}\leq C_{2}k^{-d},$$

with constants $C_{1}$ and $C_{2}$. Because its decay rate is slower than that of the binary weight neural network, its RKHS covers a strict superset of functions (Geifman et al., 2020).

Proof Sketch:  We first compute the NTK of the quasi neural network, which depends on the distribution of $\mu_{1,j}$. As shown in Theorem 6, $\mu_{1,j}$ converges to a Gaussian distribution in the limit of an infinitely wide neural network. To find the joint distribution of $\mu_{1,j}$ and $\mu^{\prime}_{1,j}$ for two arbitrary inputs $x,x^{\prime}$, we combine the first linear layer in the quasi neural network with the "additional layer" in front of it (the first two equations in (3)). This allows us to reparameterize $\mu_{1,j}$ as

$$\mu_{1,j}=\langle w_{j},x\rangle,$$

where $w_{j}\sim\mathcal{N}\big(0,\frac{c\,\mathrm{Var}[\theta]}{d}I\big)$ denotes the fused weight. A key component in computing the NTK has the form

$$\mathbb{E}[\psi(\mu_{1})\psi(\mu_{1}^{\prime})]=\mathbb{E}[\psi(\langle w,x\rangle)\psi(\langle w,x^{\prime}\rangle)]=\mathbb{E}_{\|w\|}\big[\mathbb{E}[\psi(\langle w,x\rangle)\psi(\langle w,x^{\prime}\rangle)\,\big|\,\|w\|]\big].$$

The second equality comes from the law of total expectation; $\|\cdot\|$ denotes the 2-norm. The inner expectation is equivalent to an integration over a sphere and can be computed by applying the spherical harmonics decomposition to $\psi(\cdot)$. The squared norm of the fused weight, $\|w\|^{2}$, follows a (scaled) chi-square distribution, and we use its moment generating function to finish the computation.
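The last step above relies on the squared norm of a Gaussian vector being chi-square distributed, whose moment generating function is available in closed form. A quick Monte Carlo check of this fact (the dimension and scale below are arbitrary choices of ours):

```python
import numpy as np

# If w ~ N(0, sigma^2 I_d), then ||w||^2 / sigma^2 is chi-square with
# d degrees of freedom: mean d and variance 2d. Monte Carlo check:
rng = np.random.default_rng(0)
d, sigma, n = 16, 0.5, 200_000
w = sigma * rng.standard_normal((n, d))
sq = (w ** 2).sum(axis=1) / sigma ** 2   # samples of ||w||^2 / sigma^2
mean_est, var_est = sq.mean(), sq.var()  # should be close to d and 2d
```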

5.2 Comparison with Gaussian Kernel

Even if the input $x$ to a neural network is constrained to the unit sphere, the first linear layer (together with the additional linear layer in front of it) projects it to the entire $\mathbb{R}^{d}$ space with a Gaussian distribution. To simulate this, we define a kernel by randomizing the scale of $x$ and $x^{\prime}$ before feeding them into a Gaussian kernel:

$$\mathcal{K}_{RGauss}(x,x^{\prime})=\mathbb{E}_{\kappa}[\mathcal{K}_{Gauss}(\kappa x,\kappa x^{\prime})],$$

where $\mathcal{K}_{Gauss}(x,x^{\prime})=\exp\left(-\frac{\|x-x^{\prime}\|^{2}}{\xi^{2}}\right)$ is a Gaussian kernel and $\kappa\sim\chi_{d}$ follows a chi distribution with $d$ degrees of freedom. This scaling factor maps a random vector uniformly distributed on the unit sphere to a Gaussian-distributed one. The corresponding eigenvalues satisfy

$$A_{1}C^{-k}\leq u_{k}\leq A_{2}C^{-k},\qquad(12)$$

where $A_{1},A_{2},C$ are constants that depend on $\xi$. The dominant terms in both (11) and (12) have an exponential decay rate $C^{-k}$, which suggests the similarity between the NTK of a binary weight neural network and the Gaussian kernel. In comparison, Bietti and Mairal (2019) and Geifman et al. (2020) showed that the eigenvalues of the NTK decay at rate $k^{-d}$, which is slower than that of a binary weight neural network or the Gaussian kernel. Furthermore, Aronszajn's inclusion theorem suggests $\mathcal{H}_{\mathcal{K}_{BWNN}}\subset\mathcal{H}_{\mathcal{K}_{NN}}$, where $\mathcal{K}_{NN}$ denotes the NTK of a real-valued weight neural network. In other words, the expressive power of a binary weight neural network is weaker than that of its real-valued counterpart in the infinite-width limit. Binary weight neural networks are less vulnerable to noise thanks to this smaller expressive power, at the expense of failing to learn some "high frequency" components of the target function. This explains why binary weight neural networks often achieve lower training accuracy but a smaller generalization gap compared with real-valued weight neural networks.
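The randomized-scale kernel $\mathcal{K}_{RGauss}$ is straightforward to estimate by Monte Carlo, and the chi-square moment generating function $\mathbb{E}[e^{-t\kappa^{2}}]=(1+2t)^{-d/2}$ gives a closed form to compare against. A sketch (the function name, bandwidth, and sample size are our own choices):

```python
import numpy as np

def k_rgauss(x, xp, xi=2.0, n=200_000, seed=0):
    """Monte Carlo estimate of K_RGauss(x, x') = E_kappa[exp(-kappa^2 ||x-x'||^2 / xi^2)]
    with kappa ~ chi_d. The chi-square MGF gives the closed form
    (1 + 2 ||x-x'||^2 / xi^2)^(-d/2) to check against."""
    d = len(x)
    rng = np.random.default_rng(seed)
    kappa2 = (rng.standard_normal((n, d)) ** 2).sum(axis=1)  # kappa^2 ~ chi-square_d
    return np.mean(np.exp(-kappa2 * np.sum((x - xp) ** 2) / xi ** 2))
```

For example, with $d=3$, $\xi=2$ and orthogonal unit inputs, $t=\|x-x^{\prime}\|^{2}/\xi^{2}=0.5$ and the estimate should be close to $(1+2\cdot 0.5)^{-3/2}$.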

6 Numerical result

6.1 Quasi neural network

Figure 2: Approximation of the quasi neural network. (a)(b): histogram of the output under fixed model parameters (blue), fitted with a Gaussian distribution (red), before (a) and after (b) training. (c)(d): $\mathbb{E}(y|\Theta)$ computed from the quasi neural network (horizontal axis) and by Monte Carlo (vertical axis). The red line shows $y=x$.

In this part, we empirically verify the quasi neural network approximation by comparing the inference result of the quasi neural network with that obtained by Monte Carlo. The architecture is the same as that mentioned in Section 4.2, with 1600 hidden neurons. We train this neural network on the MNIST dataset (LeCun et al., 1998) by directly applying gradient descent to the quasi neural network. To reduce overflow, we add a weight decay of 0.001 during training. Figure 2(a)(b) shows the histogram of the output under stochastic rounding before and after training. We arbitrarily choose one input sample from the testing set and draw 1000 samples under different stochastic roundings. This result supports our statement that the distribution of the pre-activation (output of the linear layer) conditioned on the real-valued model parameters converges to a Gaussian distribution. Figure 2(c)(d) compares the mean of the output under the quasi neural network approximation (horizontal axis) with that computed by Monte Carlo (vertical axis). This alignment further supports our method of approximating a binary weight neural network with a quasi neural network.
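The Monte Carlo comparison above can be reproduced in miniature: sample binary weights by stochastic rounding and compare the empirical mean and variance of a pre-activation with the quasi-network formulas of Lemma 4. The sketch below is ours, with a random vector standing in for the hidden representation $x_{1}$ and the bias term dropped; the dimensions are illustrative.

```python
import numpy as np

# Stochastic rounding: P[w = +1] = (1 + theta)/2, so E[w] = theta and
# Var[w] = 1 - theta^2. Compare the sampled pre-activation with the
# quasi-network mean/variance formulas.
rng = np.random.default_rng(0)
d1, c, n_mc = 1600, 2.0, 5000
theta = rng.uniform(-1, 1, d1)        # real-valued latent weights in [-1, 1]
x1 = rng.standard_normal(d1)          # stand-in for the hidden representation x_1

w = np.where(rng.random((n_mc, d1)) < (1 + theta) / 2, 1.0, -1.0)  # binary samples
y = np.sqrt(c / d1) * w @ x1          # n_mc sampled pre-activations

nu = np.sqrt(c / d1) * theta @ x1                       # quasi-network mean
var = (c / d1) * np.sum((1 - theta ** 2) * x1 ** 2)     # quasi-network variance
```

The sample mean and variance of `y` should match `nu` and `var` up to Monte Carlo error, and by the central limit theorem the histogram of `y` is approximately Gaussian.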

6.2 Generalization gap

6.2.1 Toy dataset

We compare the performance of neural networks with and without binary weights, and of kernel methods, using the same set of 90 small-scale UCI datasets (each with fewer than 5000 data points) as in Geifman et al. (2020); Arora et al. (2019b). We report the training accuracy and testing accuracy of both the vanilla neural network (NN) and the binary weight neural network (BWNN) in Figure 3. To further illustrate the difference, we list the paired t-test results of the neural network (NN) against the binary weight neural network (BWNN), and of the Gaussian kernel (Gaussian) against the Laplace kernel (Laplace), in Table 1. In this table, t-stats and p-val denote the t-statistic and the two-sided p-value of the paired t-test between two classifiers, and $<$ and $>$ denote the percentage of datasets on which the first classifier gets a lower or higher testing accuracy or generalization gap (training accuracy $-$ testing accuracy), respectively.

As can be seen from the results, although the Laplacian kernel gets higher training accuracy than the Gaussian kernel, its testing accuracy is almost the same. In other words, the Gaussian kernel has a smaller generalization gap than the Laplacian kernel, which can also be observed in Table 1. Similarly, a real-valued neural network gets higher training accuracy than a binary weight neural network but similar testing accuracy.

Figure 3: Accuracy and generalization gap on the selected 90 UCI datasets. Each line shows a classifier's accuracy metric, sorted from lowest to highest, against the percentile of datasets.
Table 1: Pairwise performance comparison on selected 90 UCI datasets.
Classifier         | Testing: t-stats  p-val   <       >       | Training-Testing: t-stats  p-val  <       >
NN-BWNN            | 0.7471            0.4569  53.33%  41.11%  | 4.034                      0.000  26.67%  67.77%
Laplace-Gaussian   | 0.4274            0.6701  51.11%  33.33%  | 3.280                      0.001  37.78%  53.33%

6.2.2 MNIST-like dataset

Figure 4: Training/testing error rate and loss of neural networks with/without binary weight. (a) Training and testing error rate. (b) Training and testing loss. (c) Testing error rate - Training error rate. (d) Testing loss - Training loss.

We compare the performance of neural networks with binary weights (Binary) against their counterparts with real-valued weights (Real). We treat the number of training samples as a parameter by randomly subsampling the training set, and use the original test set for testing. The experiments are repeated 10 times, and the mean and standard deviation are shown in Figure 4. On the MNIST dataset (LeCun et al., 1998), the performance of neural networks with and without quantization is similar. This is because MNIST is simpler and less vulnerable to overfitting. On the other hand, on the FashionMNIST dataset (Xiao et al., 2017), the generalization gap with quantized weights is much smaller than without, which matches our prediction.

7 Discussion

In this paper, we propose a quasi neural network to approximate the binary weight neural network. The parameter space of the quasi neural network is continuous, and its gradient can be approximated using the BinaryConnect algorithm. We study the expressive power of the binary weight neural network via the RKHS of its NTK and show its similarity to the Gaussian kernel. We empirically verify that quantizing the weights can reduce the generalization gap, similar to the Gaussian kernel versus the Laplacian kernel. This result can be easily generalized to neural networks with other weight quantization methods, e.g., using more bits. Yet there are several questions to be answered by future work:

  1. 1.

    In this work, we only quantize the weights, while much empirical work quantizes both the weights and the activations. Can a similar technique be used to study the expressive power of such neural networks?

  2. 2.

    We study the NTK of a two-layer neural network with one additional linear layer in front of it, where only the weights in the first layer are quantized. It remains to be answered whether a multi-layer neural network allows a similar approximation, and whether using more layers can increase its expressive power.

8 Acknowledgement

The authors would like to thank Chunfeng Cui for helpful discussion about the idea of quasi neural network and effort in proofreading the paper, as well as Yao Xuan for helpful discussion about kernel and RKHS.

References

  • Alemdar et al. (2017) H. Alemdar, V. Leroy, A. Prost-Boucle, and F. Pétrot. Ternary neural networks for resource-efficient ai applications. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 2547–2554. IEEE, 2017.
  • Allen-Zhu et al. (2018) Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018.
  • Arora et al. (2019a) S. Arora, S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pages 322–332. PMLR, 2019a.
  • Arora et al. (2019b) S. Arora, S. S. Du, Z. Li, R. Salakhutdinov, R. Wang, and D. Yu. Harnessing the power of infinitely wide deep nets on small-data tasks. arXiv preprint arXiv:1910.01663, 2019b.
  • Bach (2017) F. Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629–681, 2017.
  • Bietti and Mairal (2019) A. Bietti and J. Mairal. On the inductive bias of neural tangent kernels. In Advances in Neural Information Processing Systems, pages 12893–12904, 2019.
  • Blumer et al. (1989) A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the vapnik-chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989.
  • Bordelon et al. (2020) B. Bordelon, A. Canatar, and C. Pehlevan. Spectrum dependent learning curves in kernel regression and wide neural networks. arXiv preprint arXiv:2002.02561, 2020.
  • Chen and Xu (2020) L. Chen and S. Xu. Deep neural tangent kernel and laplace kernel have the same rkhs. arXiv preprint arXiv:2009.10683, 2020.
  • Chizat et al. (2018) L. Chizat, E. Oyallon, and F. Bach. On lazy training in differentiable programming. arXiv preprint arXiv:1812.07956, 2018.
  • Choromanska et al. (2015) A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surfaces of multilayer networks. In Artificial intelligence and statistics, pages 192–204. PMLR, 2015.
  • Chu et al. (2021) T. Chu, Q. Luo, J. Yang, and X. Huang. Mixed-precision quantized neural networks with progressively decreasing bitwidth. Pattern Recognition, 111:107647, 2021.
  • Courbariaux et al. (2015) M. Courbariaux, Y. Bengio, and J.-P. David. Binaryconnect: Training deep neural networks with binary weights during propagations. Advances in neural information processing systems, 28:3123–3131, 2015.
  • Courbariaux et al. (2016) M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
  • Ding et al. (2018) Y. Ding, J. Liu, J. Xiong, and Y. Shi. On the universal approximability and complexity bounds of quantized relu neural networks. arXiv preprint arXiv:1802.03646, 2018.
  • Dong et al. (2017) Y. Dong, R. Ni, J. Li, Y. Chen, J. Zhu, and H. Su. Learning accurate low-bit deep neural networks with stochastic quantization. arXiv preprint arXiv:1708.01001, 2017.
  • Du et al. (2019) S. Du, J. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pages 1675–1685. PMLR, 2019.
  • Du et al. (2018) S. S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2018.
  • Geifman et al. (2020) A. Geifman, A. Yadav, Y. Kasten, M. Galun, D. Jacobs, and R. Basri. On the similarity between the laplace and neural tangent kernels. arXiv preprint arXiv:2007.01580, 2020.
  • Gupta et al. (2015) S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, pages 1737–1746, 2015.
  • He and Tao (2020) F. He and D. Tao. Recent advances in deep learning theory. arXiv preprint arXiv:2012.10931, 2020.
  • He et al. (2020) F. He, B. Wang, and D. Tao. Piecewise linear activations substantially shape the loss surfaces of neural networks. arXiv preprint arXiv:2003.12236, 2020.
  • Hubara et al. (2016) I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks. In Proceedings of the 30th international conference on neural information processing systems, pages 4114–4122. Citeseer, 2016.
  • Hubara et al. (2017) I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869–6898, 2017.
  • Jacot et al. (2018) A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.
  • LeCun et al. (1998) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Li et al. (2018) D. Li, T. Ding, and R. Sun. Over-parameterized deep neural networks have no strict local minima for any continuous activations. arXiv preprint arXiv:1812.11039, 2018.
  • Li and Liang (2018) Y. Li and Y. Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. arXiv preprint arXiv:1808.01204, 2018.
  • Liang et al. (2021) T. Liang, J. Glossner, L. Wang, S. Shi, and X. Zhang. Pruning and quantization for deep neural network acceleration: A survey. Neurocomputing, 461:370–403, 2021.
  • Marchesi et al. (1993) M. Marchesi, G. Orlandi, F. Piazza, and A. Uncini. Fast neural networks without multipliers. IEEE transactions on Neural Networks, 4(1):53–62, 1993.
  • Minh et al. (2006) H. Q. Minh, P. Niyogi, and Y. Yao. Mercer’s theorem, feature maps, and smoothing. In International Conference on Computational Learning Theory, pages 154–168. Springer, 2006.
  • Rastegari et al. (2016) M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European conference on computer vision, pages 525–542. Springer, 2016.
  • Simon et al. (2021) J. B. Simon, M. Dickens, and M. R. DeWeese. Neural tangent kernel eigenvalues accurately predict generalization. arXiv preprint arXiv:2110.03922, 2021.
  • Soudry and Carmon (2016) D. Soudry and Y. Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.
  • Weinan et al. (2019) E. Weinan, J. Han, and Q. Li. A mean-field optimal control formulation of deep learning. Research in the Mathematical Sciences, 6(1):1–41, 2019.
  • Xiao et al. (2017) H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017.
  • Zhang et al. (2016) C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Appendix A BinaryConnect Algorithm

In this section, we briefly review the BinaryConnect algorithm [Courbariaux et al., 2015], which is the key algorithm we are studying in this paper.

Data: Training data $\{\boldsymbol{x}_{i},z_{i}\}$, (randomly) initialized model parameters $\theta^{(0)}$, learning rate $\eta$.
for $t$ in $[0\dots T]$ do
       Select a minibatch $\{\boldsymbol{x}_{b},z_{b}\}$ from the training data;
       Quantization: $w^{(t)}\leftarrow\mathrm{Quantize}(\theta^{(t)})$;
       Forward propagation: compute $\{y_{b}\}$ using $\{\boldsymbol{x}_{b}\}$ and $w^{(t)}$;
       Backward propagation: compute $\frac{\partial\mathrm{loss}(y_{b},z_{b})}{\partial w^{(t)}}$ using $\{y_{b}\},\{z_{b}\}$ and $w^{(t)}$;
       Update parameters: $\theta^{(t+1)}\leftarrow\theta^{(t)}-\eta\frac{\partial\mathrm{loss}(y_{b},z_{b})}{\partial w^{(t)}}$.
end for
Algorithm 1 Training binary weight neural network with BinaryConnect algorithm and SGD.
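A minimal runnable sketch of Algorithm 1, for a linear model with squared loss rather than the paper's networks (the data, model, and hyperparameters are illustrative choices of ours): the forward and backward passes use the quantized weights $w^{(t)}$, while the real-valued latent parameters $\theta^{(t)}$ accumulate the updates and are clipped to $[-1,1]$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, batch, eta, steps = 20, 256, 32, 0.05, 1000
X = rng.standard_normal((n, d))
w_true = np.where(rng.random(d) < 0.5, 1.0, -1.0)   # binary "teacher" weights
z = X @ w_true                                       # regression targets

def quantize(theta, rng):
    """Stochastic rounding to {-1, +1} with E[Quantize(theta)] = theta."""
    return np.where(rng.random(theta.shape) < (1 + theta) / 2, 1.0, -1.0)

theta = np.zeros(d)                                  # latent real-valued weights
for _ in range(steps):
    idx = rng.choice(n, batch, replace=False)        # select a minibatch
    w = quantize(theta, rng)                         # quantization step
    y = X[idx] @ w                                   # forward with binary weights
    grad = X[idx].T @ (y - z[idx]) / batch           # backward w.r.t. binary weights
    theta = np.clip(theta - eta * grad, -1.0, 1.0)   # update the latent weights

w_hat = np.where(theta >= 0, 1.0, -1.0)              # deterministic rounding at inference
```

After training, deterministic rounding of the latent weights typically recovers the binary teacher in this toy setting, giving near-zero squared loss.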

Appendix B Gaussian approximation in quantized neural network

B.1 Proof of Proposition 3

For the first statement, observe that fixing $k,k^{\prime}$ and taking $i$ as the variable, the terms $w_{0,ki}w_{0,k^{\prime}i}$ are independent from each other. Furthermore, $\mathbb{E}[w_{0,ki}w_{0,k^{\prime}i}]=\delta_{k,k^{\prime}}$. The first statement then follows from the strong law of large numbers (SLLN). The third statement can be proved similarly, observing that $\mathbb{E}[w_{0,ki}w_{0,k^{\prime}i}\theta_{1,ij}^{2}]=\delta_{k,k^{\prime}}\mathrm{Var}[\theta]$.

To prove the second statement, note that since the geometric mean is no larger than the cubic mean,

$$|w_{0,ki}||w_{0,k^{\prime}i}||w_{0,k^{\prime\prime}i}|\leq\frac{1}{3}\big(|w_{0,ki}|^{3}+|w_{0,k^{\prime}i}|^{3}+|w_{0,k^{\prime\prime}i}|^{3}\big).$$

Since $w_{0,ki}\sim\mathcal{N}(0,1)$, the expectation of the right-hand side equals $\sqrt{\frac{8}{\pi}}$. We apply the SLLN again to finish the proof.

B.2 Proof of Lemma 4

We first compute the conditional mean $\nu_{1,j}$ and variance $\varsigma_{1,j}$. Notice that conditioned on $\Theta$, $x_{1}$ is deterministic:

$$\begin{aligned}
\nu_{1,j}&=\mathbb{E}_{w_{1}}\left[y_{1,j}\,\middle|\,\Theta\right]\\
&=\sqrt{\frac{c}{d_{1}}}\sum_{i=1}^{d_{1}}\mathbb{E}_{w_{1}}[w_{1,ij}]x_{1,i}+\beta b_{1,j}\\
&=\sqrt{\frac{c}{d_{1}}}\sum_{i=1}^{d_{1}}\theta_{1,ij}x_{1,i}+\beta b_{1,j},\\
\varsigma_{1,j}&=\mathrm{Var}_{w_{1}}\left[y_{1,j}\,\middle|\,\Theta\right]\\
&=\mathbb{E}_{w_{1}}\left[y_{1,j}^{2}\,\middle|\,\Theta\right]-\mathbb{E}_{w_{1}}\left[y_{1,j}\,\middle|\,\Theta\right]^{2}\\
&=\frac{c}{d_{1}}\sum_{i=1}^{d_{1}}\sum_{i^{\prime}=1}^{d_{1}}\mathbb{E}_{w_{1}}\left[w_{1,ij}w_{1,i^{\prime}j}\,\middle|\,\Theta\right]x_{1,i}x_{1,i^{\prime}}+2\beta b_{1,j}\sqrt{\frac{c}{d_{1}}}\sum_{i=1}^{d_{1}}\mathbb{E}\left[w_{1,ij}\,\middle|\,\Theta\right]x_{1,i}\\
&\quad-\frac{c}{d_{1}}\sum_{i=1}^{d_{1}}\sum_{i^{\prime}=1}^{d_{1}}\theta_{1,ij}\theta_{1,i^{\prime}j}x_{1,i}x_{1,i^{\prime}}-2\beta b_{1,j}\sqrt{\frac{c}{d_{1}}}\sum_{i=1}^{d_{1}}\theta_{1,ij}x_{1,i}\\
&=\frac{c}{d_{1}}\sum_{i=1}^{d_{1}}\sum_{i^{\prime}=1}^{d_{1}}\big(\mathbb{E}[w_{1,ij}w_{1,i^{\prime}j}|\Theta]-\theta_{1,ij}\theta_{1,i^{\prime}j}\big)x_{1,i}x_{1,i^{\prime}}\\
&=\frac{c}{d_{1}}\sum_{i=1}^{d_{1}}\big(\mathbb{E}[w_{1,ij}^{2}|\Theta]-\theta_{1,ij}^{2}\big)x_{1,i}^{2}\\
&=\frac{c}{d_{1}}\sum_{i=1}^{d_{1}}(1-\theta_{1,ij}^{2})x_{1,i}^{2}.
\end{aligned}$$

The second-to-last equality is because $\mathbb{E}_{w_{1}}[w_{1,ij}w_{1,i^{\prime}j}|\Theta]=\mathbb{E}_{w_{1}}[w_{1,ij}|\Theta]\,\mathbb{E}[w_{1,i^{\prime}j}|\Theta]=\theta_{1,ij}\theta_{1,i^{\prime}j}$ when $i\neq i^{\prime}$.

Next, we prove that $y_{1,j}$ converges to a Gaussian distribution conditioned on $\Theta\in\mathcal{G}$ by verifying Lyapunov's condition. Note that for any $j\in[d_{2}]$,

$$y_{1,j}=\sqrt{\frac{c}{d_{1}}}\sum_{i=1}^{d_{1}}w_{1,ij}x_{1,i}+b_{1,j}.$$

Define $X_{i}=w_{1,ij}x_{1,i}$. As mentioned above, its mean and variance (conditioned on $\Theta$) are

$$\mathbb{E}_{w_{1}}[X_{i}|\Theta]=\theta_{1,ij}x_{1,i},\qquad\mathrm{Var}_{w_{1}}[X_{i}|\Theta]=\mathbb{E}_{w_{1}}[X_{i}^{2}|\Theta]-\mathbb{E}_{w_{1}}[X_{i}|\Theta]^{2}=(1-\theta_{1,ij}^{2})x_{1,i}^{2}.$$

Since $\Theta\in\mathcal{G}$, for all $j\in[d_{2}]$ with $d_{2}$ finite,

$$\begin{aligned}
\lim_{d_{1}\rightarrow\infty}\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}\mathrm{Var}_{w_{1}}[X_{i}|\Theta]&=\lim_{d_{1}\rightarrow\infty}\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}(1-\theta_{1,ij}^{2})x_{1,i}^{2}\qquad(13)\\
&=\lim_{d_{1}\rightarrow\infty}\frac{1}{dd_{1}}\sum_{i=1}^{d_{1}}(1-\theta_{1,ij}^{2})\Big(\sum_{k=1}^{d}w_{0,ki}x_{k}\Big)^{2}\\
&=\lim_{d_{1}\rightarrow\infty}\frac{1}{dd_{1}}\sum_{i=1}^{d_{1}}\sum_{k,k^{\prime}=1}^{d}(1-\theta_{1,ij}^{2})w_{0,ki}w_{0,k^{\prime}i}x_{k}x_{k^{\prime}}\\
&=\sum_{k,k^{\prime}=1}^{d}(1-\mathrm{Var}[\theta])\delta_{k,k^{\prime}}x_{k}x_{k^{\prime}}=1-\mathrm{Var}[\theta].
\end{aligned}$$

The fourth equality comes from the definition of $\mathcal{G}$, and the fifth is because $\|x\|_{2}=1$. The third-order absolute moment can be bounded by

$$\begin{aligned}
&\lim_{d_{1}\rightarrow\infty}\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}\mathbb{E}_{w_{1}}\big[|X_{i}-\mathbb{E}_{w_{1}}[X_{i}|\Theta]|^{3}\,\big|\,\Theta\big]\qquad(14)\\
&=\lim_{d_{1}\rightarrow\infty}\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}\mathbb{E}_{w_{1}}\big[|(w_{1,ij}-\theta_{1,ij})x_{1,i}|^{3}\,\big|\,\Theta\big]\\
&\leq\lim_{d_{1}\rightarrow\infty}\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}\mathbb{E}_{w_{1}}\big[8|x_{1,i}|^{3}\,\big|\,\Theta\big]\\
&=\lim_{d_{1}\rightarrow\infty}\frac{8}{d_{1}}\sum_{i=1}^{d_{1}}\Big|\sum_{k=1}^{d}w_{0,ki}x_{k}\Big|^{3}\\
&\leq\lim_{d_{1}\rightarrow\infty}\frac{8}{d_{1}}\sum_{i=1}^{d_{1}}\Big(\sum_{k=1}^{d}|w_{0,ki}x_{k}|\Big)^{3}\\
&=\lim_{d_{1}\rightarrow\infty}\frac{8}{d_{1}}\sum_{i=1}^{d_{1}}\sum_{k,k^{\prime},k^{\prime\prime}=1}^{d}|w_{0,ki}w_{0,k^{\prime}i}w_{0,k^{\prime\prime}i}||x_{k}x_{k^{\prime}}x_{k^{\prime\prime}}|\\
&\leq 8\sqrt{\frac{8}{\pi}}d^{3}.
\end{aligned}$$

The last inequality comes from the definition of "Good Initialization": for all $\Theta\in\mathcal{G}$,

$$\lim_{d_{1}\rightarrow\infty}\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}|w_{0,ki}w_{0,k^{\prime}i}w_{0,k^{\prime\prime}i}|\leq\sqrt{\frac{8}{\pi}},$$

and because $\|x\|_{2}=1$, $|x_{k}|\leq 1$ for all $k\in[d]$. Note that using the strong law of large numbers, one can prove that the third-order absolute moment converges almost surely to a constant that does not depend on $d$. On the other hand, we are proving an upper bound for all $\Theta\in\mathcal{G}$, which is stronger than almost sure convergence.

$$\begin{aligned}
&\lim_{d_{1}\rightarrow\infty}\frac{\sum_{i=1}^{d_{1}}\mathbb{E}_{w_{1}}[|X_{i}-\mathbb{E}_{w_{1}}[X_{i}|\Theta]|^{3}|\Theta]}{\big(\sum_{i=1}^{d_{1}}\mathrm{Var}_{w_{1}}[X_{i}|\Theta]\big)^{3/2}}\\
&=\lim_{d_{1}\rightarrow\infty}\frac{\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}\mathbb{E}_{w_{1}}[|X_{i}-\mathbb{E}_{w_{1}}[X_{i}|\Theta]|^{3}|\Theta]}{d_{1}^{1/2}\big(\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}\mathrm{Var}_{w_{1}}[X_{i}|\Theta]\big)^{3/2}}\\
&=\frac{\lim_{d_{1}\rightarrow\infty}\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}\mathbb{E}_{w_{1}}[|X_{i}-\mathbb{E}_{w_{1}}[X_{i}|\Theta]|^{3}|\Theta]}{\lim_{d_{1}\rightarrow\infty}d_{1}^{1/2}\big(\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}\mathrm{Var}_{w_{1}}[X_{i}|\Theta]\big)^{3/2}}\\
&\leq\frac{8\sqrt{\frac{8}{\pi}}d^{3}}{\lim_{d_{1}\rightarrow\infty}d_{1}^{1/2}(1-\mathrm{Var}[\theta])^{3/2}}=0.
\end{aligned}$$

This verifies Lyapunov's condition for all "Good Initializations", so conditioned on $\Theta\in\mathcal{G}$, $y_{1,j}$ converges to a Gaussian distribution.

B.3 Proof of Lemma 5

To compute $\mathbb{E}_{w_{1}}[x_{2,j}|\Theta]$ and $\mathrm{Var}_{w_{1}}[x_{2,j}|\Theta]$, we first compute $\mathbb{E}_{w_{1}}[\psi(y_{1,j})|\Theta]$ and $\mathbb{E}_{w_{1}}[\psi(y_{1,j})^{2}|\Theta]$. Recall that $\psi(x)=x\mathbf{1}(x\geq 0)$:

$$\begin{aligned}
\mathbb{E}_{w_{1}}[\psi(y_{1,j})|\Theta]&=\int_{0}^{\infty}x\frac{1}{\sqrt{2\pi}\varsigma_{1,j}}\exp\left(-\frac{1}{2}\frac{(x-\nu_{1,j})^{2}}{\varsigma_{1,j}^{2}}\right)dx\\
&=\int_{-\frac{\nu_{1,j}}{\varsigma_{1,j}}}^{\infty}(\varsigma_{1,j}y+\nu_{1,j})\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^{2}\right)dy\\
&=\varsigma_{1,j}\int_{-\frac{\nu_{1,j}}{\varsigma_{1,j}}}^{\infty}\frac{1}{\sqrt{2\pi}}y\exp\left(-\frac{1}{2}y^{2}\right)dy+\nu_{1,j}\int_{-\frac{\nu_{1,j}}{\varsigma_{1,j}}}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^{2}\right)dy,\\
\mathbb{E}_{w_{1}}[\psi(y_{1,j})^{2}|\Theta]&=\int_{0}^{\infty}x^{2}\frac{1}{\sqrt{2\pi}\varsigma_{1,j}}\exp\left(-\frac{1}{2}\frac{(x-\nu_{1,j})^{2}}{\varsigma_{1,j}^{2}}\right)dx\\
&=\int_{-\frac{\nu_{1,j}}{\varsigma_{1,j}}}^{\infty}(\varsigma_{1,j}y+\nu_{1,j})^{2}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^{2}\right)dy\\
&=\varsigma_{1,j}^{2}\int_{-\frac{\nu_{1,j}}{\varsigma_{1,j}}}^{\infty}y^{2}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^{2}\right)dy+2\varsigma_{1,j}\nu_{1,j}\int_{-\frac{\nu_{1,j}}{\varsigma_{1,j}}}^{\infty}y\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^{2}\right)dy\\
&\quad+\nu_{1,j}^{2}\int_{-\frac{\nu_{1,j}}{\varsigma_{1,j}}}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^{2}\right)dy.
\end{aligned}$$

We only need to compute

$$\int_{-\frac{\nu_{1,j}}{\varsigma_{1,j}}}^{\infty}\frac{1}{\sqrt{2\pi}}y^{\alpha}\exp\left(-\frac{1}{2}y^{2}\right)dy$$

for $\alpha=0,1,2$. When $\alpha=0$, this is the integral of a Gaussian function, which has no closed-form antiderivative. For the sake of simplicity, we denote it as $s_{1,j}$:

$$s_{1,j}=\int_{-\frac{\nu_{1,j}}{\varsigma_{1,j}}}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^{2}\right)dy:=\Phi\left(\frac{\nu_{1,j}}{\varsigma_{1,j}}\right).$$

When \alpha=1, the integral can be solved by a simple change of variable, and we denote it by g_{1,j}:

g1,j\displaystyle g_{1,j} =ν1,jς1,jy12πexp(12y2)𝑑y=12πexp(12(ν1,jς1,j)2).\displaystyle=\int_{-\frac{\nu_{1,j}}{\varsigma_{1,j}}}^{\infty}y\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^{2}\right)dy=\sqrt{\frac{1}{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{\nu_{1,j}}{\varsigma_{1,j}}\right)^{2}\right).

When α=2\alpha=2, we can do integration by parts and express it using s1,js_{1,j} and g1,jg_{1,j}:

ν1,jς1,jy212πexp(12y2)𝑑y\displaystyle\int_{-\frac{\nu_{1,j}}{\varsigma_{1,j}}}^{\infty}y^{2}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^{2}\right)dy
\displaystyle\quad=-\int_{-\frac{\nu_{1,j}}{\varsigma_{1,j}}}^{\infty}y\,d\left(\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^{2}\right)\right)
=ν1,jς1,j12πexp(12y2)𝑑yν1,jς1,j12πexp(12(ν1,jς1,j)2)\displaystyle\quad=\int_{-\frac{\nu_{1,j}}{\varsigma_{1,j}}}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^{2}\right)dy-\frac{\nu_{1,j}}{\varsigma_{1,j}}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{\nu_{1,j}}{\varsigma_{1,j}}\right)^{2}\right)
=s1,jν1,jς1,jg1,j.\displaystyle\quad=s_{1,j}-\frac{\nu_{1,j}}{\varsigma_{1,j}}g_{1,j}.

Using the definition of mean and variance,

\mu_{2,j}=\mathbb{E}_{w_{1}}[\psi(y_{1,j})|\Theta],\qquad\sigma_{2,j}^{2}=\mathbb{E}_{w_{1}}[\psi(y_{1,j})^{2}|\Theta]-\mathbb{E}_{w_{1}}[\psi(y_{1,j})|\Theta]^{2},

we arrive at the result.
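As a numerical sanity check (not part of the derivation), the closed-form moments obtained above, namely \mathbb{E}[\psi(y)]=\varsigma g+\nu s and \mathbb{E}[\psi(y)^{2}]=(\nu^{2}+\varsigma^{2})s+\nu\varsigma g for y\sim\mathcal{N}(\nu,\varsigma^{2}) with s=\Phi(\nu/\varsigma), g=\varphi(\nu/\varsigma), can be compared against Monte Carlo estimates; the constants \nu,\varsigma below are illustrative:

```python
import math
import random

def phi(x):  # standard normal pdf
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def Phi(x):  # standard normal cdf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

def relu_moments(nu, sigma):
    """Closed-form mean and second moment of ReLU(y), y ~ N(nu, sigma^2)."""
    s = Phi(nu / sigma)   # the alpha = 0 integral
    g = phi(nu / sigma)   # the alpha = 1 integral
    mean = sigma * g + nu * s
    second = (nu * nu + sigma * sigma) * s + nu * sigma * g
    return mean, second

random.seed(0)
nu, sigma = 0.3, 0.7  # illustrative values
samples = [max(0.0, random.gauss(nu, sigma)) for _ in range(200_000)]
mc_mean = sum(samples) / len(samples)
mc_second = sum(v * v for v in samples) / len(samples)
mean, second = relu_moments(nu, sigma)
print(mc_mean, mean, mc_second, second)
```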

B.4 Proof of Theorem 6

In this part, we take \Theta=\{w_{0},\theta_{1},w_{2},b_{0},b_{1},b_{2}\} as the random variables and view the conditional means and variances derived above, \mu_{1},\sigma_{1},\nu_{1},\varsigma_{1}, as functions of \Theta. From Eq. (4), as d_{1}\rightarrow\infty, \nu_{1} tends to an iid Gaussian process, and its covariance converges almost surely to its expectation. We therefore focus on computing the expectation of the covariance. For any j\neq j^{\prime}, we take the expectation over the random initialization of \Theta:

𝔼Θ[ν1,jν1,j]\displaystyle\mathbb{E}_{\Theta}[\nu_{1,j}\nu_{1,j^{\prime}}] (15)
\displaystyle=\mathbb{E}_{\Theta}\left[\frac{c}{d_{1}}\sum_{i=1}^{d_{1}}\sum_{i^{\prime}=1}^{d_{1}}\theta_{1,ij}\theta_{1,i^{\prime}j^{\prime}}x_{1,i}x_{1,i^{\prime}}+\beta\sqrt{\frac{c}{d_{1}}}\sum_{i=1}^{d_{1}}(\theta_{1,ij}x_{1,i}b_{j^{\prime}}+\theta_{1,ij^{\prime}}x_{1,i}b_{j})+\beta^{2}b_{j}b_{j^{\prime}}\right]
=cd1i=1d1i=1d1𝔼Θ[θ1,ij]𝔼Θ[θ1,ij]𝔼Θ[x1,ix1,i]+β2𝔼Θ[bjbj]\displaystyle=\frac{c}{d_{1}}\sum_{i=1}^{d_{1}}\sum_{i^{\prime}=1}^{d_{1}}\mathbb{E}_{\Theta}[\theta_{1,ij}]\mathbb{E}_{\Theta}[\theta_{1,i^{\prime}j^{\prime}}]\mathbb{E}_{\Theta}[x_{1,i}x_{1,i^{\prime}}]+\beta^{2}\mathbb{E}_{\Theta}[b_{j}b_{j^{\prime}}]
\displaystyle\quad+\sqrt{\frac{c}{d_{1}}}\beta\sum_{i=1}^{d_{1}}(\mathbb{E}_{\Theta}[\theta_{1,ij}]\mathbb{E}_{\Theta}[x_{1,i}]\mathbb{E}_{\Theta}[b_{j^{\prime}}]+\mathbb{E}_{\Theta}[\theta_{1,ij^{\prime}}]\mathbb{E}_{\Theta}[x_{1,i}]\mathbb{E}_{\Theta}[b_{j}])
=0\displaystyle=0

which indicates that \nu_{1,j} and \nu_{1,j^{\prime}} are uncorrelated; since they are jointly Gaussian in the limit, they are independent.

The computation of \varsigma_{1,j} was already done implicitly in Section B.2; we write it out explicitly here. From (4), in the limit d_{1}\rightarrow\infty,

ς1,j2\displaystyle\varsigma_{1,j}^{2} =cd1i=1d1(1θ1,ij2)x1,i2\displaystyle=\frac{c}{d_{1}}\sum_{i=1}^{d_{1}}(1-\theta_{1,ij}^{2})x_{1,i}^{2}
=cdd1i=1d1(1θ1,ij2)k,k=1dw0,kiw0,kixkxk\displaystyle=\frac{c}{dd_{1}}\sum_{i=1}^{d_{1}}(1-\theta_{1,ij}^{2})\sum_{k,k^{\prime}=1}^{d}w_{0,ki}w_{0,k^{\prime}i}x_{k}x_{k^{\prime}}
=cdk,k=1dxkxk1d1i=1d1w0,kiw0,ki(1θ1,ij2)\displaystyle=\frac{c}{d}\sum_{k,k^{\prime}=1}^{d}x_{k}x_{k^{\prime}}\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}w_{0,ki}w_{0,k^{\prime}i}(1-\theta_{1,ij}^{2})
=cdk,k=1dxkxkδk,k(1Var[θ])\displaystyle=\frac{c}{d}\sum_{k,k^{\prime}=1}^{d}x_{k}x_{k^{\prime}}\delta_{k,k^{\prime}}(1-\mathrm{Var}[\theta])
=cd(1Var[θ])x22\displaystyle=\frac{c}{d}(1-\mathrm{Var}[\theta])\|x\|_{2}^{2}
=cd(1Var[θ])\displaystyle=\frac{c}{d}(1-\mathrm{Var}[\theta])

The fourth line comes from the definition of \mathcal{G}, and the last line uses \|x\|_{2}=1.
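This limit is easy to check empirically. A minimal Monte Carlo sketch, which assumes (for illustration only) a unit-norm input x, iid standard Gaussian first-layer weights w_{0}, and \theta drawn uniformly from [-1,1] so that \mathrm{Var}[\theta]=1/3:

```python
import math
import random

random.seed(1)
d, d1, c = 10, 50_000, 2.0  # illustrative sizes; c is the scaling constant

# fixed input x on the unit sphere
x = [random.gauss(0, 1) for _ in range(d)]
nrm = math.sqrt(sum(v * v for v in x))
x = [v / nrm for v in x]

# first-layer features x1_i = sqrt(1/d) * sum_k w0_ki x_k with w0 ~ N(0, 1)
x1 = [sum(random.gauss(0, 1) * xk for xk in x) / math.sqrt(d) for _ in range(d1)]
theta = [random.uniform(-1, 1) for _ in range(d1)]  # Var[theta] = 1/3

lhs = (c / d1) * sum((1 - t * t) * v * v for t, v in zip(theta, x1))
rhs = (c / d) * (1 - 1.0 / 3.0)
print(lhs, rhs)
```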

B.5 Derivative of activation function in quasi neural network

Let

\displaystyle\tilde{\psi}(x)=\tilde{\varsigma}\varphi\left(\frac{x}{\tilde{\varsigma}}\right)+x\Phi\left(\frac{x}{\tilde{\varsigma}}\right),

Its derivative is

ψ~(x)\displaystyle\tilde{\psi}^{\prime}(x) =φ(xς~)+Φ(xς~)+xς~Φ(xς~)\displaystyle=\varphi^{\prime}\left(\frac{x}{\tilde{\varsigma}}\right)+\Phi\left(\frac{x}{\tilde{\varsigma}}\right)+\frac{x}{\tilde{\varsigma}}\Phi^{\prime}\left(\frac{x}{\tilde{\varsigma}}\right)
=xς~φ(xς~)+Φ(xς~)+xς~φ(xς~)\displaystyle=-\frac{x}{\tilde{\varsigma}}\varphi\left(\frac{x}{\tilde{\varsigma}}\right)+\Phi\left(\frac{x}{\tilde{\varsigma}}\right)+\frac{x}{\tilde{\varsigma}}\varphi\left(\frac{x}{\tilde{\varsigma}}\right)
=Φ(xς~)\displaystyle=\Phi\left(\frac{x}{\tilde{\varsigma}}\right)

The second line is because

Φ(x)\displaystyle\Phi^{\prime}(x) =φ(x),\displaystyle=\varphi(x),
φ(x)\displaystyle\varphi^{\prime}(x) =ddx12πexp(12x2)\displaystyle=\frac{d}{dx}\sqrt{\frac{1}{2\pi}}\exp\left(-\frac{1}{2}x^{2}\right)
=x12πexp(12x2)\displaystyle=-x\sqrt{\frac{1}{2\pi}}\exp\left(-\frac{1}{2}x^{2}\right)
=xφ(x).\displaystyle=-x\varphi(x).
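The identity \tilde{\psi}^{\prime}(x)=\Phi(x/\tilde{\varsigma}) can be confirmed with a quick finite-difference check (the value of \tilde{\varsigma} below is an arbitrary illustrative choice):

```python
import math

def Phi(x):  # standard normal cdf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

def varphi(x):  # standard normal pdf
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def psi_tilde(x, vs):
    """Activation of the quasi neural network: vs*varphi(x/vs) + x*Phi(x/vs)."""
    return vs * varphi(x / vs) + x * Phi(x / vs)

vs, h = 0.5, 1e-5
# central differences vs. the claimed derivative Phi(x / vs)
max_err = max(
    abs((psi_tilde(x + h, vs) - psi_tilde(x - h, vs)) / (2 * h) - Phi(x / vs))
    for x in [-2.0, -0.7, -0.1, 0.0, 0.4, 1.5]
)
print(max_err)
```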

B.6 Proof of Theorem 7

To make the proof more general, we make \varsigma_{1,j} a parameter of the activation function of the quasi neural network, written \tilde{\psi}(\cdot;\varsigma_{1,j}). To get the derivative with respect to \theta_{1,ij}, we first compute the derivative with respect to \nu_{1,j}:

\displaystyle\frac{\partial\bar{y}}{\partial\nu_{1,j}}=\frac{\partial\bar{y}}{\partial\mu_{2,j}}\frac{\partial\mu_{2,j}}{\partial\nu_{1,j}}=\sqrt{\frac{c}{d_{2}}}w_{2,j}\tilde{\psi}^{\prime}(\nu_{1,j};\varsigma_{1,j})

then apply chain rule:

y¯w2,j\displaystyle\frac{\partial\bar{y}}{\partial w_{2,j}} =cd2μ2,j,\displaystyle=\sqrt{\frac{c}{d_{2}}}\mu_{2,j}, (16)
y¯b1,j\displaystyle\frac{\partial\bar{y}}{\partial b_{1,j}} =y¯ν1,jν1,jb1,j=cd2βw2,jψ~(ν1,j;ς1,j),\displaystyle=\frac{\partial\bar{y}}{\partial\nu_{1,j}}\frac{\partial\nu_{1,j}}{\partial b_{1,j}}=\sqrt{\frac{c}{d_{2}}}\beta w_{2,j}\tilde{\psi}^{\prime}(\nu_{1,j};\varsigma_{1,j}), (17)
y¯θ1,ij\displaystyle\frac{\partial\bar{y}}{\partial\theta_{1,ij}} =y¯ν1,jν1,jθ1,ij=cd1d2w2,jx1,iψ~(ν1,j;ς1,j).\displaystyle=\frac{\partial\bar{y}}{\partial\nu_{1,j}}\frac{\partial\nu_{1,j}}{\partial\theta_{1,ij}}=\sqrt{\frac{c}{d_{1}d_{2}}}w_{2,j}x_{1,i}\tilde{\psi}^{\prime}(\nu_{1,j};\varsigma_{1,j}). (18)

On the other hand, let us first write down the gradients with respect to the weights w_{ij} in the quantized neural network and take their expectations conditioned on \Theta:

𝔼w1[yw2,j|Θ]\displaystyle\mathbb{E}_{w_{1}}\left[\frac{\partial y}{\partial w_{2,j}}\middle|\Theta\right] =cd2𝔼w1[x2,j|Θ],\displaystyle=\sqrt{\frac{c}{d_{2}}}\mathbb{E}_{w_{1}}[x_{2,j}|\Theta], (19)
\displaystyle\mathbb{E}_{w_{1}}\left[\frac{\partial y}{\partial b_{1,j}}\middle|\Theta\right]=\mathbb{E}_{w_{1}}\left[\frac{\partial y}{\partial y_{1,j}}\frac{\partial y_{1,j}}{\partial b_{1,j}}\middle|\Theta\right]=\sqrt{\frac{c}{d_{2}}}\beta w_{2,j}\mathbb{E}_{w_{1}}[\psi^{\prime}(y_{1,j})|\Theta], (20)
𝔼w1[yw1,ij|Θ]\displaystyle\mathbb{E}_{w_{1}}\left[\frac{\partial y}{\partial w_{1,ij}}\middle|\Theta\right] =𝔼w1[yy1,jy1,jw1,ij|Θ]=cd1d2w2,jx1,i𝔼w1[ψ(y1,j)|Θ],\displaystyle=\mathbb{E}_{w_{1}}\left[\frac{\partial y}{\partial y_{1,j}}\frac{\partial y_{1,j}}{\partial w_{1,ij}}\middle|\Theta\right]=\sqrt{\frac{c}{d_{1}d_{2}}}w_{2,j}x_{1,i}\mathbb{E}_{w_{1}}[\psi^{\prime}(y_{1,j})|\Theta], (21)

Comparing (16) with (19), one easily finds that they are equal, since by definition \mu_{2,j}=\mathbb{E}_{w_{1}}[x_{2,j}|\Theta]. On the other hand, from (7), by the continuous mapping theorem,

ψ~(ν1,j;ς1,j)=Φ(ν1,jς1,j)=P[y1,j0]=𝔼w1[ψ(y1,j)|Θ],\displaystyle\tilde{\psi}^{\prime}(\nu_{1,j};\varsigma_{1,j})=\Phi\left(\frac{\nu_{1,j}}{\varsigma_{1,j}}\right)=P[y_{1,j}\geq 0]=\mathbb{E}_{w_{1}}[\psi^{\prime}(y_{1,j})|\Theta],

which shows (17) equals (20) and (18) equals (21).
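The key identity in this step, \mathbb{E}_{w_{1}}[\psi^{\prime}(y_{1,j})|\Theta]=P[y_{1,j}\geq 0]=\Phi(\nu_{1,j}/\varsigma_{1,j}), is easy to verify by sampling (the constants below are illustrative):

```python
import math
import random

def Phi(x):  # standard normal cdf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

random.seed(2)
nu, varsigma, n = -0.4, 0.9, 200_000  # illustrative values
# psi'(y) = 1{y >= 0} for the ReLU psi, so its conditional mean is P[y >= 0]
mc = sum(1 for _ in range(n) if random.gauss(nu, varsigma) >= 0) / n
exact = Phi(nu / varsigma)
print(mc, exact)
```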

B.7 Proof of Theorem 8

Observe that conditioned on Θ\Theta, y1,jy_{1,j} depends only on {w1,ij,i[d1]}\{w_{1,ij},i\in[d_{1}]\}, and that {w1,ij,i[d1]}{w1,ij,i[d1]}=\{w_{1,ij},i\in[d_{1}]\}\cap\{w_{1,ij^{\prime}},i\in[d_{1}]\}=\emptyset for jjj\neq j^{\prime}. Because of that, y1,jy_{1,j} are independent of each other. Similarly, x2,jx_{2,j} are independent of each other conditioned on Θ\Theta. For MSE loss,

\displaystyle\mathrm{loss}(y)=\frac{1}{2}(y-z)^{2},\qquad\frac{d\,\mathrm{loss}(y)}{dy}=y-z,

According to the chain rule

loss(y¯)θ=loss(y¯)y¯y¯θ=(y¯z)y¯θ,\displaystyle\frac{\partial\mathrm{loss}(\bar{y})}{\partial\theta}=\frac{\partial\mathrm{loss}(\bar{y})}{\partial\bar{y}}\frac{\partial\bar{y}}{\partial\theta}=(\bar{y}-z)\frac{\partial\bar{y}}{\partial\theta}, (22)

for any θ{θ1,ij,b1,j,w2,j}\theta\in\{\theta_{1,ij},b_{1,j},w_{2,j}\}, which leads to

loss(y¯)w2,j\displaystyle\frac{\partial\mathrm{loss}(\bar{y})}{\partial w_{2,j}} =cd2(y¯z)μ2,j,\displaystyle=\sqrt{\frac{c}{d_{2}}}(\bar{y}-z)\mu_{2,j}, (23)
\displaystyle\frac{\partial\mathrm{loss}(\bar{y})}{\partial b_{1,j}}=(\bar{y}-z)\frac{\partial\bar{y}}{\partial\nu_{1,j}}\frac{\partial\nu_{1,j}}{\partial b_{1,j}}=\sqrt{\frac{c}{d_{2}}}\beta w_{2,j}(\bar{y}-z)\tilde{\psi}^{\prime}(\nu_{1,j};\varsigma_{1,j}), (24)
\displaystyle\frac{\partial\mathrm{loss}(\bar{y})}{\partial\theta_{1,ij}}=(\bar{y}-z)\frac{\partial\bar{y}}{\partial\nu_{1,j}}\frac{\partial\nu_{1,j}}{\partial\theta_{1,ij}}=\sqrt{\frac{c}{d_{1}d_{2}}}w_{2,j}x_{1,i}(\bar{y}-z)\tilde{\psi}^{\prime}(\nu_{1,j};\varsigma_{1,j}). (25)

On the other hand, in the original binary weight neural network,

𝔼w1[loss(y)w2,j|Θ]\displaystyle\mathbb{E}_{w_{1}}\left[\frac{\partial\mathrm{loss}(y)}{\partial w_{2,j}}\middle|\Theta\right] =cd2𝔼w1[(yz)x2,j|Θ],\displaystyle=\sqrt{\frac{c}{d_{2}}}\mathbb{E}_{w_{1}}\left[(y-z)x_{2,j}|\Theta\right], (26)
𝔼w1[loss(y)b1,j|Θ]\displaystyle\mathbb{E}_{w_{1}}\left[\frac{\partial\mathrm{loss}(y)}{\partial b_{1,j}}\middle|\Theta\right] =cd2βw2,j𝔼w1[(yz)ψ(y1,j)|Θ],\displaystyle=\sqrt{\frac{c}{d_{2}}}\beta w_{2,j}\mathbb{E}_{w_{1}}[(y-z)\psi^{\prime}(y_{1,j})|\Theta], (27)
𝔼w1[loss(y)w1,ij|Θ]\displaystyle\mathbb{E}_{w_{1}}\left[\frac{\partial\mathrm{loss}(y)}{\partial w_{1,ij}}\middle|\Theta\right] =cd1d2w2,jx1,i𝔼w1[(yz)ψ(y1,j)|Θ],\displaystyle=\sqrt{\frac{c}{d_{1}d_{2}}}w_{2,j}x_{1,i}\mathbb{E}_{w_{1}}[(y-z)\psi^{\prime}(y_{1,j})|\Theta], (28)

Note that y is not independent of x_{2,j} or \psi^{\prime}(y_{1,j}), which is the main challenge of the proof. To deal with this, we bound the difference between (23)-(25) and (26)-(28), which requires bounding their covariances.

\displaystyle\mathbb{E}_{w_{1}}[x_{2,j}y|\Theta]=\sqrt{\frac{c}{d_{2}}}\mathbb{E}_{w_{1}}\left[x_{2,j}\sum_{j^{\prime}=1}^{d_{2}}w_{2,j^{\prime}}x_{2,j^{\prime}}\middle|\Theta\right] (29)
=cd2(𝔼w1[x2,j2w2,j|Θ]+jj𝔼w1[x2,jx2,jw2,j|Θ])\displaystyle=\sqrt{\frac{c}{d_{2}}}\left(\mathbb{E}_{w_{1}}\left[x_{2,j}^{2}w_{2,j}\middle|\Theta\right]+\sum_{j^{\prime}\neq j}\mathbb{E}_{w_{1}}\left[x_{2,j}x_{2,j^{\prime}}w_{2,j^{\prime}}\middle|\Theta\right]\right)
=cd2((μ2,j2+σ2,j2)w2,j+jjμ2,jμ2,jw2,j)\displaystyle=\sqrt{\frac{c}{d_{2}}}\left((\mu_{2,j}^{2}+\sigma_{2,j}^{2})w_{2,j}+\sum_{j^{\prime}\neq j}\mu_{2,j}\mu_{2,j^{\prime}}w_{2,j^{\prime}}\right)
=cd2(σ2,j2w2,j+j=1d2μ2,jμ2,jw2,j)\displaystyle=\sqrt{\frac{c}{d_{2}}}\left(\sigma_{2,j}^{2}w_{2,j}+\sum_{j^{\prime}=1}^{d_{2}}\mu_{2,j}\mu_{2,j^{\prime}}w_{2,j^{\prime}}\right)

The second term equals \mathbb{E}_{w_{1}}[x_{2,j}|\Theta]\mathbb{E}_{w_{1}}[y|\Theta] and the first term converges to 0 as d_{2}\rightarrow\infty. Plugging this into (23) and (26), we see that these two terms are equal in the limit d_{2}\rightarrow\infty.

Similarly,

𝔼w1[ψ(y1,j)y|Θ]\displaystyle\mathbb{E}_{w_{1}}[\psi^{\prime}(y_{1,j})y|\Theta] (30)
\displaystyle=\mathbb{E}_{w_{1}}\left[\sqrt{\frac{c}{d_{2}}}\psi^{\prime}(y_{1,j})\sum_{j^{\prime}=1}^{d_{2}}w_{2,j^{\prime}}\psi(y_{1,j^{\prime}})\middle|\Theta\right]
=cd2(𝔼w1[ψ(y1,j)ψ(y1,j)w2,j|Θ]+jj𝔼w1[ψ(y1,j)ψ(y1,j)w2,j|Θ])\displaystyle=\sqrt{\frac{c}{d_{2}}}\left(\mathbb{E}_{w_{1}}\left[\psi(y_{1,j})\psi^{\prime}(y_{1,j})w_{2,j}\middle|\Theta\right]+\sum_{j^{\prime}\neq j}\mathbb{E}_{w_{1}}\left[\psi(y_{1,j^{\prime}})\psi^{\prime}(y_{1,j})w_{2,j^{\prime}}\middle|\Theta\right]\right)
\displaystyle=\sqrt{\frac{c}{d_{2}}}\left(\mathbb{E}_{w_{1}}\left[\psi(y_{1,j})\psi^{\prime}(y_{1,j})\middle|\Theta\right]w_{2,j}+\sum_{j^{\prime}\neq j}\mathbb{E}_{w_{1}}[\psi^{\prime}(y_{1,j})|\Theta]\mu_{2,j^{\prime}}w_{2,j^{\prime}}\right)
=cd2((1𝔼w1[ψ(y1,j)|Θ])μ2,jw2,j+j=1d2𝔼w1[ψ(y1,j)|Θ]μ2,jw2,j)\displaystyle=\sqrt{\frac{c}{d_{2}}}\left((1-\mathbb{E}_{w_{1}}[\psi^{\prime}(y_{1,j})|\Theta])\mu_{2,j}w_{2,j}+\sum_{j^{\prime}=1}^{d_{2}}\mathbb{E}_{w_{1}}[\psi^{\prime}(y_{1,j})|\Theta]\mu_{2,j^{\prime}}w_{2,j^{\prime}}\right)

The second term equals \mathbb{E}[\psi^{\prime}(y_{1,j})|\Theta]\mathbb{E}[y|\Theta] and the first term converges to 0 as d_{2}\rightarrow\infty. Plugging this into (24), (25), (27), and (28) finishes the proof.

B.8 Proof of Theorem 9

In this part, we denote \dot{a}:=\frac{\partial a}{\partial t} for a\in\{w_{\ell},\theta_{\ell},b_{\ell}\}, and express each time-dependent variable as a function of time t. We define an inner product under the distribution of the training dataset

𝒂,𝒃in=𝔼in[a(x)b(x)],\langle\boldsymbol{a},\boldsymbol{b}\rangle_{in}=\mathbb{E}_{in}[{a}(x){b}(x)],

and the corresponding norm

𝒂in=𝒂,𝒂in=𝔼in[a(x)2].\|\boldsymbol{a}\|_{in}=\sqrt{\langle\boldsymbol{a},\boldsymbol{a}\rangle_{in}}=\sqrt{\mathbb{E}_{in}[{a}(x)^{2}]}.

If \boldsymbol{a}(x) is a vector, \|\boldsymbol{a}\|_{in}:=\sqrt{\mathbb{E}_{in}[\|\boldsymbol{a}(x)\|^{2}]}. Note that this inner product and norm define a Hilbert space (not to be confused with the RKHS induced by a kernel), so by the Cauchy-Schwarz inequality,

|𝒂,𝒃in|𝒂in𝒃in,𝒂,𝒃.\displaystyle|\langle\boldsymbol{a},\boldsymbol{b}\rangle_{in}|\leq\|\boldsymbol{a}\|_{in}\|\boldsymbol{b}\|_{in},\forall\boldsymbol{a},\boldsymbol{b}.

As shown in Section 4.3.3, in the limit d_{1}\rightarrow\infty, the dynamics of training this neural network with gradient descent can be written as:

w˙2,j(t)\displaystyle\dot{w}_{2,j}(t) =cd2𝔼in[(y¯(t)z)μ2,j(t)]\displaystyle=\sqrt{\frac{c}{d_{2}}}\mathbb{E}_{in}[(\bar{y}(t)-z)\mu_{2,j}(t)]
b˙1,j(t)\displaystyle\dot{b}_{1,j}(t) =cd2𝔼in[βw2,j(t)(y¯(t)z)ψ~(ν1,j(t),ς1,j(t))],\displaystyle=\sqrt{\frac{c}{d_{2}}}\mathbb{E}_{in}[\beta w_{2,j}(t)(\bar{y}(t)-z)\tilde{\psi}^{\prime}(\nu_{1,j}(t),\varsigma_{1,j}(t))],
θ˙1,ij(t)\displaystyle\dot{\theta}_{1,ij}(t) =cd1d2𝔼in[w2,j(t)x1,i(y¯(t)z)ψ~(ν1,j(t);ς1,j(t))]\displaystyle=\sqrt{\frac{c}{d_{1}d_{2}}}\mathbb{E}_{in}[w_{2,j}(t)x_{1,i}(\bar{y}(t)-z)\tilde{\psi}^{\prime}(\nu_{1,j}(t);\varsigma_{1,j}(t))]

where the dot denotes the derivative with respect to t. Note that the activation function \tilde{\psi}(\cdot;\varsigma_{1,j}(t)) depends on \varsigma_{1,j}, which makes it time-dependent. One can further write down the dynamics of \nu_{1,j}(t) as

ν˙1,j(t)\displaystyle\dot{\nu}_{1,j}(t) =1d1i=1d1θ˙1,ij(t)x1,i(t)+b˙1,j(t)\displaystyle=\sqrt{\frac{1}{d_{1}}}\sum_{i=1}^{d_{1}}\dot{\theta}_{1,ij}(t)x_{1,i}(t)+\dot{b}_{1,j}(t)

Rewriting these differential equations in matrix form:

𝒘˙2(t)\displaystyle\dot{\boldsymbol{w}}_{2}(t) =cd2𝔼in[(y¯(t)z)𝝁2(t)]\displaystyle=\sqrt{\frac{c}{d_{2}}}\mathbb{E}_{in}[(\bar{y}(t)-z)\boldsymbol{\mu}_{2}(t)]
𝒃˙1(t)\displaystyle\dot{\boldsymbol{b}}_{1}(t) =βcd2𝔼in[(y¯(t)z)(ψ~(𝝂1(t))𝒘2(t))],\displaystyle=\beta\sqrt{\frac{c}{d_{2}}}\mathbb{E}_{in}[(\bar{y}(t)-z)(\tilde{\psi}^{\prime}(\boldsymbol{\nu}_{1}(t))\circ\boldsymbol{w}_{2}(t))],
𝜽1˙(t)\displaystyle\dot{\boldsymbol{\theta}_{1}}(t) =cd1d2𝔼in[(y¯(t)z)𝒙1(ψ~(𝝂1(t))𝒘2(t))],\displaystyle=\sqrt{\frac{c}{d_{1}d_{2}}}\mathbb{E}_{in}\big{[}(\bar{y}(t)-z)\boldsymbol{x}_{1}\otimes\big{(}\tilde{\psi}^{\prime}(\boldsymbol{\nu}_{1}(t))\circ\boldsymbol{w}_{2}(t)\big{)}\big{]},
𝝂˙1(t)\displaystyle\dot{\boldsymbol{\nu}}_{1}(t) =1d1𝜽˙1𝒙1+𝒃˙1\displaystyle=\sqrt{\frac{1}{d_{1}}}\dot{\boldsymbol{\theta}}_{1}\boldsymbol{x}_{1}+\dot{\boldsymbol{b}}_{1}

where \circ denotes elementwise product and \otimes denotes outer product. Here we slightly abuse the notation ψ~()\tilde{\psi}(\cdot), which represents elementwise operation when applied to a vector. Their norm are bounded by

t𝒘2(t)𝒘2(0)\displaystyle\frac{\partial}{\partial t}\|{\boldsymbol{w}}_{2}(t)-{\boldsymbol{w}}_{2}(0)\| cd2𝔼in[(y¯(t)z)𝝁2(t)]=cd2y¯(t)z,𝝁2(t)in\displaystyle\leq\sqrt{\frac{c}{d_{2}}}\mathbb{E}_{in}[(\bar{y}(t)-z)\|\boldsymbol{\mu}_{2}(t)\|]=\sqrt{\frac{c}{d_{2}}}\langle\bar{y}(t)-z,\boldsymbol{\mu}_{2}(t)\rangle_{in} (31)
cd2y¯(t)zin𝝁2(t)incd2y¯(t)zin𝝂1(t)in\displaystyle\leq\sqrt{\frac{c}{d_{2}}}\|\bar{y}(t)-z\|_{in}\|\boldsymbol{\mu}_{2}(t)\|_{in}\leq\sqrt{\frac{c}{d_{2}}}\|\bar{y}(t)-z\|_{in}\|\boldsymbol{\nu}_{1}(t)\|_{in}
t𝒃1(t)𝒃1(0)\displaystyle\frac{\partial}{\partial t}\|{\boldsymbol{b}}_{1}(t)-{\boldsymbol{b}}_{1}(0)\| βcd2𝔼in[(y¯(t)z)ψ~(𝝂1(t))𝒘2(t)]\displaystyle\leq\beta\sqrt{\frac{c}{d_{2}}}\mathbb{E}_{in}[(\bar{y}(t)-z)\|\tilde{\psi}^{\prime}(\boldsymbol{\nu}_{1}(t))\circ\boldsymbol{w}_{2}(t)\|] (32)
βcd2𝔼in[(y¯(t)z)𝒘2(t)]=βcd2y¯(t)zin𝒘2(t),\displaystyle\leq\beta\sqrt{\frac{c}{d_{2}}}\mathbb{E}_{in}[(\bar{y}(t)-z)\|\boldsymbol{w}_{2}(t)\|]=\beta\sqrt{\frac{c}{d_{2}}}\|\bar{y}(t)-z\|_{in}\|\boldsymbol{w}_{2}(t)\|,
t𝜽1(t)𝜽1(0)F\displaystyle\frac{\partial}{\partial t}\|{\boldsymbol{\theta}_{1}}(t)-{\boldsymbol{\theta}_{1}}(0)\|_{F} cd1d2𝔼in[(y¯(t)z)𝒙1(ψ~(𝝂1(t))𝒘2(t))F]\displaystyle\leq\sqrt{\frac{c}{d_{1}d_{2}}}\mathbb{E}_{in}\big{[}(\bar{y}(t)-z)\|\boldsymbol{x}_{1}\otimes\big{(}\tilde{\psi}^{\prime}(\boldsymbol{\nu}_{1}(t))\circ\boldsymbol{w}_{2}(t)\big{)}\|_{F}\big{]} (33)
cd1d2𝔼in[(y¯(t)z)𝒙1𝒘2(t)]\displaystyle\leq\sqrt{\frac{c}{d_{1}d_{2}}}\mathbb{E}_{in}\big{[}(\bar{y}(t)-z)\|\boldsymbol{x}_{1}\|\|\boldsymbol{w}_{2}(t)\|\big{]}
cd1d2y¯(t)zin𝒙1in𝒘2(t)\displaystyle\leq\sqrt{\frac{c}{d_{1}d_{2}}}\|\bar{y}(t)-z\|_{in}\|\boldsymbol{x}_{1}\|_{in}\|\boldsymbol{w}_{2}(t)\|
=cd2y¯(t)zin𝒘2(t),\displaystyle=\sqrt{\frac{c}{d_{2}}}\|\bar{y}(t)-z\|_{in}\|\boldsymbol{w}_{2}(t)\|,
\displaystyle\forall\boldsymbol{x}_{1},\quad\frac{\partial}{\partial t}\|{\boldsymbol{\nu}}_{1}(t)-{\boldsymbol{\nu}}_{1}(0)\|\leq\Big{\|}\frac{\partial{\boldsymbol{\nu}}_{1}(t)}{\partial t}\Big{\|} (34)
\displaystyle\leq\sqrt{\frac{1}{d_{1}}}\|\dot{\boldsymbol{\theta}}_{1}(t)\|_{op}\|\boldsymbol{x}_{1}\|+\|\dot{\boldsymbol{b}}_{1}(t)\|
cd12d2y¯(t)zin𝒘2(t)𝒙1in𝒙1\displaystyle\leq\sqrt{\frac{c}{d_{1}^{2}d_{2}}}\|\bar{y}(t)-z\|_{in}\|\boldsymbol{w}_{2}(t)\|\|\boldsymbol{x}_{1}\|_{in}\|\boldsymbol{x}_{1}\|
+βcd2y¯(t)zin𝒘2(t)\displaystyle\qquad+\beta\sqrt{\frac{c}{d_{2}}}\|\bar{y}(t)-z\|_{in}\|\boldsymbol{w}_{2}(t)\|
=(1+β)cd2y¯(t)zin𝒘2(t),\displaystyle=(1+\beta)\sqrt{\frac{c}{d_{2}}}\|\bar{y}(t)-z\|_{in}\|\boldsymbol{w}_{2}(t)\|,
t𝝂1(t)𝝂1(0)in\displaystyle\frac{\partial}{\partial t}\|{\boldsymbol{\nu}}_{1}(t)-{\boldsymbol{\nu}}_{1}(0)\|_{in} (1+β)cd2y¯(t)zin𝒘2(t).\displaystyle\leq(1+\beta)\sqrt{\frac{c}{d_{2}}}\|\bar{y}(t)-z\|_{in}\|\boldsymbol{w}_{2}(t)\|.

Here we make use of the facts that \tilde{\psi}^{\prime}(x)\leq 1 and \tilde{\psi}(x)\leq x regardless of the value of \varsigma_{1,j}(t), that \lim_{d_{1}\rightarrow\infty}\|\boldsymbol{x}_{1}\|_{in}/\sqrt{d_{1}}=1 as long as \Theta\in\mathcal{G}, and that w_{0} is not updated during training. In the last equation, we use \dot{\boldsymbol{\theta}}_{1}=\frac{\partial}{\partial t}(\boldsymbol{\theta}_{1}(t)-\boldsymbol{\theta}_{1}(0)) and \dot{\boldsymbol{b}}_{1}=\frac{\partial}{\partial t}(\boldsymbol{b}_{1}(t)-\boldsymbol{b}_{1}(0)).

Define A(t)=cd21+β(𝒘2(t)𝒘2(0)+𝒘2(0))+cd2(𝝂1(t)𝝂1(0)in+𝝂1(0)in)A(t)=\sqrt{\frac{c}{d_{2}}}\sqrt{1+\beta}(\|{\boldsymbol{w}}_{2}(t)-{\boldsymbol{w}}_{2}(0)\|+\|{\boldsymbol{w}}_{2}(0)\|)+\sqrt{\frac{c}{d_{2}}}(\|{\boldsymbol{\nu}}_{1}(t)-{\boldsymbol{\nu}}_{1}(0)\|_{in}+\|{\boldsymbol{\nu}}_{1}(0)\|_{in}), then

A˙(t)\displaystyle\dot{A}(t) 1+βcd2y¯(t)zin𝝂1(t)in+(1+β)cd2y¯(t)zin𝒘2(t)\displaystyle\leq\sqrt{1+\beta}\sqrt{\frac{c}{d_{2}}}\|\bar{y}(t)-z\|_{in}\|\boldsymbol{\nu}_{1}(t)\|_{in}+(1+\beta)\sqrt{\frac{c}{d_{2}}}\|\bar{y}(t)-z\|_{in}\|\boldsymbol{w}_{2}(t)\|
1+βA(t)\displaystyle\leq\sqrt{1+\beta}A(t)

Observe that A(0)A(0) is stochastically bounded. Using Grönwall’s Lemma, for any finite TT:

A(T)A(0)exp(t=0T1+β𝑑t)=A(0)exp(1+βT)\displaystyle A(T)\leq A(0)\exp\Big{(}\int_{t=0}^{T}\sqrt{1+\beta}dt\Big{)}=A(0)\exp(\sqrt{1+\beta}T)

so A(T)A(T) is stochastically bounded for all finite TT as d2d_{2}\rightarrow\infty. Furthermore,

cd2𝒘2(T)cd2(𝒘2(T)𝒘2(0)+𝒘2(0))\displaystyle\sqrt{\frac{c}{d_{2}}}\|\boldsymbol{w}_{2}(T)\|\leq\sqrt{\frac{c}{d_{2}}}(\|\boldsymbol{w}_{2}(T)-\boldsymbol{w}_{2}(0)\|+\|\boldsymbol{w}_{2}(0)\|)

which is also stochastically bounded. Integrating (31)-(34) from 0 to TT finishes the proof.
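The Grönwall step is the crux of this proof: whenever \dot{A}(t)\leq cA(t) with c=\sqrt{1+\beta}, the bound A(T)\leq A(0)\exp(cT) holds. A small Euler-integration sketch (with \beta=0.5 and an arbitrary rate bounded by c, both illustrative) shows the bound in action:

```python
import math

# Euler-integrate A'(t) = r(t) A(t) with a rate r(t) <= c, then compare with
# Gronwall's bound A(T) <= A(0) exp(c T).
beta = 0.5                      # illustrative
c = math.sqrt(1 + beta)
A0, T, n = 1.0, 2.0, 100_000
dt = T / n
A, t = A0, 0.0
for _ in range(n):
    r = c * math.sin(t) ** 2    # any nonnegative rate bounded by c
    A += dt * r * A
    t += dt
bound = A0 * math.exp(c * T)
print(A, bound)
```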

B.9 Proof of Theorem 10

From (4), it is straightforward to obtain the dynamics of \varsigma_{1}:

ς1,j2(t)t\displaystyle\frac{\partial\varsigma_{1,j}^{2}(t)}{\partial t} =2cd1i=1d1θ1,ij(t)θ˙1,ij(t)x1,i2\displaystyle=-\frac{2c}{d_{1}}\sum_{i=1}^{d_{1}}\theta_{1,ij}(t)\dot{\theta}_{1,ij}(t)x_{1,i}^{2}
|ς1,j2(T)ς1,j2(0)|\displaystyle|\varsigma_{1,j}^{2}(T)-\varsigma_{1,j}^{2}(0)| 2cd1i=1d1x1,i2t=0T|θ1,ij(t)||θ˙1,ij(t)|𝑑t\displaystyle\leq\frac{2c}{d_{1}}\sum_{i=1}^{d_{1}}x_{1,i}^{2}\int_{t=0}^{T}|\theta_{1,ij}(t)||\dot{\theta}_{1,ij}(t)|dt
2cd1i=1d1x1,i2t=0T|θ˙1,ij(t)|𝑑t\displaystyle\leq\frac{2c}{d_{1}}\sum_{i=1}^{d_{1}}x_{1,i}^{2}\int_{t=0}^{T}|\dot{\theta}_{1,ij}(t)|dt
2cd1cd1d2i=1d1x1,i2t=0T𝔼in|w2,j(t)x1,i(y¯(t)z)ψ~(ν1,j(t);ς1,j(t))|𝑑t\displaystyle\leq\frac{2c}{d_{1}}\sqrt{\frac{c}{d_{1}d_{2}}}\sum_{i=1}^{d_{1}}x_{1,i}^{2}\int_{t=0}^{T}\mathbb{E}_{in}\big{|}w_{2,j}(t)x_{1,i}(\bar{y}(t)-z)\tilde{\psi}^{\prime}(\nu_{1,j}(t);\varsigma_{1,j}(t))\big{|}dt
2cd1cd1d2i=1d1x1,i2t=0T|w2,j(t)|𝔼in|x1,i(y¯(t)z)|𝑑t\displaystyle\leq\frac{2c}{d_{1}}\sqrt{\frac{c}{d_{1}d_{2}}}\sum_{i=1}^{d_{1}}x_{1,i}^{2}\int_{t=0}^{T}|w_{2,j}(t)|\mathbb{E}_{in}\big{|}x_{1,i}(\bar{y}(t)-z)\big{|}dt
2cd1cd1d2i=1d1x1,i2t=0T|w2,j(t)|x1,iiny¯(t)zin𝑑t\displaystyle\leq\frac{2c}{d_{1}}\sqrt{\frac{c}{d_{1}d_{2}}}\sum_{i=1}^{d_{1}}x_{1,i}^{2}\int_{t=0}^{T}|w_{2,j}(t)|\|x_{1,i}\|_{in}\|\bar{y}(t)-z\|_{in}dt
2cd13/2i=1d1x1,i2x1,iint=0TC(t)y¯(t)zin𝑑t\displaystyle\leq\frac{2c}{d_{1}^{3/2}}\sum_{i=1}^{d_{1}}x_{1,i}^{2}\|x_{1,i}\|_{in}\int_{t=0}^{T}C(t)\|\bar{y}(t)-z\|_{in}dt
2cd13/2i=1d1x1,i2x1,iinmaxt[0,T]C(t)t=0Ty¯(t)zin𝑑ta.s.\displaystyle\leq\frac{2c}{d_{1}^{3/2}}\sum_{i=1}^{d_{1}}x_{1,i}^{2}\|x_{1,i}\|_{in}\max_{t\in[0,T]}C(t)\int_{t=0}^{T}\|\bar{y}(t)-z\|_{in}dt\quad a.s.

Here we assume that \sqrt{\frac{c}{d_{2}}}\|\boldsymbol{w}_{2}(t)\| is stochastically bounded by C(t). Since C(t) is finite for all t\in[0,T], it is easy to check that the term after the \max operator is stochastically bounded. The remaining task is to bound the term before the \max operator. From standard Gaussian process analysis, x_{1,i} follows a Gaussian distribution. By the law of large numbers (LLN), as d_{1}\rightarrow\infty,

1d1i=1d1x1,i2x1,iin=𝔼[x1,i2x1,iin]\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}x_{1,i}^{2}\|x_{1,i}\|_{in}=\mathbb{E}[x_{1,i}^{2}\|x_{1,i}\|_{in}]

almost surely, where the expectation is taken over w1w_{1}, and this limit is also bounded. Because of that, as d1,d2d_{1},d_{2}\rightarrow\infty, the difference |ς1,j2(T)ς1,j2(0)||\varsigma_{1,j}^{2}(T)-\varsigma_{1,j}^{2}(0)| converges to 0 at rate 1d2\frac{1}{\sqrt{d_{2}}}.

Notice that, from the third line on, the proof of Lyapunov's condition does not depend on the time T. Since \varsigma_{1,j}(T) stochastically converges to \varsigma_{1,j}(0) for all finite T, Lyapunov's condition holds for all T, and thus x_{2,j} always converges to a Gaussian distribution conditioned on the model parameters.

Appendix C NTK of neural networks with quantized weights

C.1 Spherical harmonics

This subsection briefly reviews the relevant concepts and properties of spherical harmonics. Most of this subsection follows Bach [2017, Appendix D.1] and Bietti and Mairal [2019, Appendix C.1].

According to Mercer’s theorem, any positive definite kernel can be decomposed as

\mathcal{K}(x,x^{\prime})=\sum_{i}\lambda_{i}\Phi_{i}(x)\Phi_{i}(x^{\prime}),

where the \Phi_{i}(\cdot) are called the feature maps. Furthermore, any zonal kernel on the unit sphere, i.e., any kernel satisfying \mathcal{K}(x,x^{\prime})=\mathcal{K}(x^{T}x^{\prime}) for all x,x^{\prime}\in\mathbb{R}^{d} with \|x\|_{2}=\|x^{\prime}\|_{2}=1, including exponential kernels and the NTK, can be decomposed using spherical harmonics (equation (10)):

𝒦(x,x)=k=1λkj=1N(d,k)Yk,j(x)Yk,j(x).\mathcal{K}(x,x^{\prime})=\sum_{k=1}^{\infty}\lambda_{k}\sum_{j=1}^{N(d,k)}Y_{k,j}(x)Y_{k,j}(x^{\prime}).

Legendre polynomials. We have the addition formula

j=1N(d,k)Yk,j(x)Yk,j(x)=N(d,k)Pk(xTx),\sum_{j=1}^{N(d,k)}Y_{k,j}(x)Y_{k,j}(x^{\prime})=N(d,k)P_{k}(x^{T}x^{\prime}),

where

N(d,k)=(2k+d2)(k+d3)!k!(d2)!.N(d,k)=\frac{(2k+d-2)(k+d-3)!}{k!(d-2)!}.
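The multiplicity N(d,k) is straightforward to tabulate (valid for d\geq 3); in particular, for d=3 it recovers the familiar 2k+1 spherical harmonics of degree k:

```python
from math import factorial

def N(d, k):
    """Number of linearly independent degree-k spherical harmonics on S^{d-1}."""
    return (2 * k + d - 2) * factorial(k + d - 3) // (factorial(k) * factorial(d - 2))

print([N(3, k) for k in range(6)], N(4, 2))
```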

The polynomial P_{k} is the k-th Legendre polynomial in d dimensions, also known as the Gegenbauer polynomial:

Pk(t)=(12)kΓ(d12)Γ(k+d12)(1t2)(3d)/2(ddt)k(1t2)k+(d3)/2.P_{k}(t)=\left(-\frac{1}{2}\right)^{k}\frac{\Gamma\left(\frac{d-1}{2}\right)}{\Gamma\left(k+\frac{d-1}{2}\right)}(1-t^{2})^{(3-d)/2}\left(\frac{d}{dt}\right)^{k}(1-t^{2})^{k+(d-3)/2}.

It is even (resp. odd) when k is even (resp. odd). Furthermore, these polynomials satisfy the orthogonality property

\int_{-1}^{1}P_{k}(t)P_{j}(t)(1-t^{2})^{(d-3)/2}dt=\delta_{jk}\frac{w_{d-1}}{w_{d-2}}\frac{1}{N(d,k)},

where

w_{d-1}=\frac{2\pi^{d/2}}{\Gamma(d/2)}

denotes the surface area of the sphere \mathbb{S}^{d-1} in d dimensions, and this leads to the integration property

\int P_{j}(\langle w,x\rangle)P_{k}(\langle w,y\rangle)d\tau(w)=\frac{\delta_{jk}}{N(d,k)}P_{k}(\langle x,y\rangle)

for any x,y𝕊d1x,y\in\mathbb{S}^{d-1}. τ(w)\tau(w) is the uniform measure on the sphere.
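The orthogonality and normalization above can be verified numerically. The sketch below builds P_{k} from the standard three-term recurrence (k+d-2)P_{k+1}(t)=(2k+d-2)tP_{k}(t)-kP_{k-1}(t) (with P_{0}=1, P_{1}(t)=t) and integrates with Simpson's rule; the dimension d=5 and the degrees are illustrative choices:

```python
import math

def legendre_d(k, t, d):
    """k-th Legendre (Gegenbauer) polynomial in dimension d, normalized so P_k(1) = 1."""
    if k == 0:
        return 1.0
    p_prev, p = 1.0, t
    for m in range(1, k):
        p_prev, p = p, ((2 * m + d - 2) * t * p - m * p_prev) / (m + d - 2)
    return p

def N(d, k):
    return (2 * k + d - 2) * math.factorial(k + d - 3) // (math.factorial(k) * math.factorial(d - 2))

def inner(j, k, d, n=4000):
    """Simpson quadrature of P_j(t) P_k(t) (1-t^2)^((d-3)/2) over [-1, 1]."""
    h = 2.0 / n
    total = 0.0
    for i in range(n + 1):
        t = -1.0 + i * h
        w = 1.0 if i in (0, n) else (4.0 if i % 2 else 2.0)
        total += w * legendre_d(j, t, d) * legendre_d(k, t, d) * (1 - t * t) ** ((d - 3) / 2)
    return total * h / 3.0

d = 5
ratio = math.sqrt(math.pi) * math.gamma((d - 1) / 2) / math.gamma(d / 2)  # w_{d-1} / w_{d-2}
print(inner(2, 3, d), inner(3, 3, d), ratio / N(d, 3))
```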

C.2 NTK of quasi neural network

We start the proof of Theorem 12 with the following lemma:

Lemma 13.

The NTK of a binary weight neural network can be simplified as

𝒦(x,x)=(cdx,x+β2)Σ(0)+Σ(1),\displaystyle\mathcal{K}(x,x^{\prime})=\left(\frac{c}{d}\langle x,x^{\prime}\rangle+\beta^{2}\right)\Sigma^{(0)}+\Sigma^{(1)}, (35)
Σ(0)=𝔼[ψ~(μ)ψ~(μ)],Σ(1)=𝔼[ψ~(μ)ψ~(μ)],\displaystyle\Sigma^{(0)}=\mathbb{E}\left[\tilde{\psi}^{\prime}\left(\mu\right)\tilde{\psi}^{\prime}\left(\mu^{\prime}\right)\right],\quad\Sigma^{(1)}=\mathbb{E}\left[\tilde{\psi}\left(\mu\right)\tilde{\psi}\left(\mu^{\prime}\right)\right],

where \mu,\mu^{\prime} are the pre-activations of the second layer, with [\mu,\mu^{\prime}]\sim\mathcal{N}(0,\Sigma) and

\displaystyle\Sigma=\frac{c}{d}\mathrm{Var}[\theta]\left[\begin{matrix}1&x^{T}x^{\prime}\\ x^{T}x^{\prime}&1\end{matrix}\right].

Proof.
𝒦(x,x)\displaystyle\mathcal{K}(x,x^{\prime}) =i=1,j=1d1,d2y¯θ1,ijy¯θ1,ij+j=1d2y¯b1,jy¯b1,j+j=1d2y¯w2,jy¯w2,j\displaystyle=\sum_{i=1,j=1}^{d_{1},d_{2}}\frac{\partial\bar{y}}{\partial\theta_{1,ij}}\frac{\partial\bar{y}^{\prime}}{\partial\theta_{1,ij}}+\sum_{j=1}^{d_{2}}\frac{\partial\bar{y}}{\partial b_{1,j}}\frac{\partial\bar{y}^{\prime}}{\partial b_{1,j}}+\sum_{j=1}^{d_{2}}\frac{\partial\bar{y}}{\partial w_{2,j}}\frac{\partial\bar{y}^{\prime}}{\partial w_{2,j}}
\displaystyle=\frac{c}{d_{1}d_{2}}\sum_{i=1,j=1}^{d_{1},d_{2}}x_{1,i}x^{\prime}_{1,i}w_{2,j}^{2}\tilde{\psi}^{\prime}(\nu_{1,j})\tilde{\psi}^{\prime}(\nu^{\prime}_{1,j})
\displaystyle\quad+\frac{\beta^{2}}{d_{2}}\sum_{j=1}^{d_{2}}w_{2,j}^{2}\tilde{\psi}^{\prime}(\nu_{1,j})\tilde{\psi}^{\prime}(\nu^{\prime}_{1,j})+\frac{1}{d_{2}}\sum_{j=1}^{d_{2}}\tilde{\psi}(\nu_{1,j})\tilde{\psi}(\nu^{\prime}_{1,j})
=cd1d2i=1d1x1,ix1,ij=1d2w2,j2ψ~(ν1,j)ψ~(ν1,j)\displaystyle=\frac{c}{d_{1}d_{2}}\sum_{i=1}^{d_{1}}x_{1,i}x^{\prime}_{1,i}\sum_{j=1}^{d_{2}}w_{2,j}^{2}\tilde{\psi}^{\prime}(\nu_{1,j})\tilde{\psi}^{\prime}(\nu^{\prime}_{1,j})
+β2d2j=1d2w2,j2ψ~(ν1,j)ψ~(ν1,j)+1d2j=1d2ψ~(ν1,j)ψ~(ν1,j)\displaystyle\quad+\frac{\beta^{2}}{d_{2}}\sum_{j=1}^{d_{2}}w_{2,j}^{2}\tilde{\psi}^{\prime}(\nu_{1,j})\tilde{\psi}^{\prime}(\nu^{\prime}_{1,j})+\frac{1}{d_{2}}\sum_{j=1}^{d_{2}}\tilde{\psi}(\nu_{1,j})\tilde{\psi}(\nu^{\prime}_{1,j})
=(cdx,x+β2)𝔼[ψ~(ν)ψ~(ν)]+𝔼[ψ~(ν)ψ~(ν)]a.s.\displaystyle=(\frac{c}{d}\langle x,x^{\prime}\rangle+\beta^{2})\mathbb{E}[\tilde{\psi}^{\prime}(\nu)\tilde{\psi}^{\prime}(\nu^{\prime})]+\mathbb{E}[\tilde{\psi}(\nu)\tilde{\psi}(\nu^{\prime})]\quad a.s.

where (\nu,\nu^{\prime}) has the same distribution as (\nu_{1,j},\nu^{\prime}_{1,j}) for any j. We make use of the fact that \mathbb{E}[w_{2,j}^{2}]=1, and by the central limit theorem, x_{1,i},x^{\prime}_{1,i} and \mu_{1,j},\mu_{1,j}^{\prime} converge to a joint Gaussian distribution for any fixed x,x^{\prime} as d_{1}\rightarrow\infty:

𝔼[x1,ix1,i]\displaystyle\mathbb{E}[x_{1,i}x^{\prime}_{1,i}] =1d𝔼[k=1dwkixkk=1dwkixk]\displaystyle=\frac{1}{d}\mathbb{E}[\sum_{k=1}^{d}w_{ki}x_{k}\sum_{k^{\prime}=1}^{d}w_{k^{\prime}i}x^{\prime}_{k^{\prime}}]
=1d𝔼[k=1dwki2xkxk]\displaystyle=\frac{1}{d}\mathbb{E}[\sum_{k=1}^{d}w_{ki}^{2}x_{k}x^{\prime}_{k}]
=1dx,x\displaystyle=\frac{1}{d}\langle x,x^{\prime}\rangle

Similarly,

\displaystyle\mathbb{E}[\mu_{1,j}^{2}]=\frac{c}{d_{1}}\sum_{i=1}^{d_{1}}\mathbb{E}[\theta_{1,ij}^{2}]\mathbb{E}[x_{1,i}^{2}]=\frac{c}{d}\mathrm{Var}[\theta]
\displaystyle\mathbb{E}[\mu_{1,j}\mu_{1,j}^{\prime}]=\frac{c}{d_{1}}\sum_{i=1}^{d_{1}}\mathbb{E}[\theta_{1,ij}^{2}]\mathbb{E}[x_{1,i}x^{\prime}_{1,i}]=\frac{c}{d}\mathrm{Var}[\theta]\langle x,x^{\prime}\rangle
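A sanity check on the lemma (illustrative, not part of the proof): \tilde{\psi}^{\prime}(x)=\Phi(x/\tilde{\varsigma}) tends to the step function \mathbb{1}\{x>0\} as \tilde{\varsigma}\rightarrow 0, so \Sigma^{(0)} should approach the standard Gaussian orthant probability P[\mu>0,\mu^{\prime}>0]=1/4+\arcsin(\rho)/(2\pi), where \rho=x^{T}x^{\prime} is the correlation; this is a standard fact about bivariate Gaussians, and a Monte Carlo estimate confirms it:

```python
import math
import random

def Phi(x):  # standard normal cdf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

random.seed(3)
rho, varsigma, n = 0.6, 1e-3, 200_000  # small varsigma approximates the step function
acc = 0.0
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    mu = z1
    mu_p = rho * z1 + math.sqrt(1 - rho * rho) * z2   # Corr(mu, mu') = rho
    acc += Phi(mu / varsigma) * Phi(mu_p / varsigma)  # ~ 1{mu>0} * 1{mu'>0}
sigma0 = acc / n
closed = 0.25 + math.asin(rho) / (2 * math.pi)        # orthant probability
print(sigma0, closed)
```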

C.3 Proof of Theorem 11

Recall that, as proved in Theorem 10, \varsigma_{1,j}(T)\rightarrow\varsigma_{1,j}(0) for any T satisfying a mild condition, and \varsigma_{1,j}(0) is nonzero almost surely. Making use of the fact that \tilde{\psi}(\cdot;\varsigma) is continuous with respect to \varsigma and that its first and second order derivatives are stochastically bounded, the change of the kernel \mathcal{K} induced by \varsigma_{1,j} converges to 0 as d_{1},d_{2}\rightarrow\infty. This reduces the quasi neural network to a standard neural network with activation function \tilde{\psi}(\cdot), which is twice differentiable with a bounded second order derivative. From Theorem 2 in Jacot et al. [2018], the kernel during training converges to the one at initialization. For the reader's convenience, we restate the proof below. In the limit d_{2}\rightarrow\infty,d_{1}\rightarrow\infty,

𝒦(x,x)(t)𝒦(x,x)(0)\displaystyle\mathcal{K}(x,x^{\prime})(t)-\mathcal{K}(x,x^{\prime})(0)
\displaystyle=\frac{(c/d)\langle x,x^{\prime}\rangle+\beta^{2}}{d_{2}}\sum_{j=1}^{d_{2}}\big{(}w_{2,j}^{2}(t)\tilde{\psi}^{\prime}(\nu_{1,j}(t))\tilde{\psi}^{\prime}(\nu^{\prime}_{1,j}(t))-w_{2,j}^{2}(0)\tilde{\psi}^{\prime}(\nu_{1,j}(0))\tilde{\psi}^{\prime}(\nu^{\prime}_{1,j}(0))\big{)}
+1d2j=1d2(ψ~(ν1,j(t))ψ~(ν1,j(t))ψ~(ν1,j(0))ψ~(ν1,j(0)))\displaystyle\quad+\frac{1}{d_{2}}\sum_{j=1}^{d_{2}}\big{(}\tilde{\psi}(\nu_{1,j}(t))\tilde{\psi}(\nu^{\prime}_{1,j}(t))-\tilde{\psi}(\nu_{1,j}(0))\tilde{\psi}(\nu^{\prime}_{1,j}(0))\big{)}
=cx,x+β2d2(j=1d2(w2,j2(t)w2,j2(0))ψ~(ν1,j(t))ψ~(ν1,j(t))\displaystyle=\frac{c\langle x,x^{\prime}\rangle+\beta^{2}}{d_{2}}\Bigg{(}\sum_{j=1}^{d_{2}}\big{(}w_{2,j}^{2}(t)-w_{2,j}^{2}(0)\big{)}\tilde{\psi}^{\prime}(\nu_{1,j}(t))\tilde{\psi}^{\prime}(\nu^{\prime}_{1,j}(t))
+j=1d2w2,j2(0)(ψ~(ν1,j(t))ψ~(ν1,j(0)))ψ~(ν1,j(t))\displaystyle\quad+\sum_{j=1}^{d_{2}}w_{2,j}^{2}(0)\big{(}\tilde{\psi}^{\prime}(\nu_{1,j}(t))-\tilde{\psi}^{\prime}(\nu_{1,j}(0))\big{)}\tilde{\psi}^{\prime}(\nu^{\prime}_{1,j}(t))
+j=1d2w2,j2(0)ψ~(ν1,j(0))(ψ~(ν1,j(t))ψ~(ν1,j(0)))),\displaystyle\quad+\sum_{j=1}^{d_{2}}w_{2,j}^{2}(0)\tilde{\psi}^{\prime}(\nu_{1,j}(0))\bigl{(}\tilde{\psi}^{\prime}(\nu^{\prime}_{1,j}(t))-\tilde{\psi}^{\prime}(\nu^{\prime}_{1,j}(0))\bigr{)}\Bigg{)},
|𝒦(x,x)(t)𝒦(x,x)(0)|\displaystyle|\mathcal{K}(x,x^{\prime})(t)-\mathcal{K}(x,x^{\prime})(0)| |cx,x+β2d2|\displaystyle\leq\Bigg{|}\frac{c\langle x,x^{\prime}\rangle+\beta^{2}}{d_{2}}\Bigg{|}
(j=1d2|w2,j(t)w2,j(0)||w2,j(t)+w2,j(0)||ψ~(ν1,j(t))ψ~(ν1,j(t))|\displaystyle\Bigg{(}\sum_{j=1}^{d_{2}}|w_{2,j}(t)-w_{2,j}(0)||w_{2,j}(t)+w_{2,j}(0)||\tilde{\psi}^{\prime}(\nu_{1,j}(t))\tilde{\psi}^{\prime}(\nu^{\prime}_{1,j}(t))|
+j=1d2w2,j2(0)ψ~(ν1,j(t))|ψ~(ν1,j(t))ψ~(ν1,j(0))|\displaystyle\quad+\sum_{j=1}^{d_{2}}w_{2,j}^{2}(0)\tilde{\psi}^{\prime}(\nu^{\prime}_{1,j}(t))\big{|}\tilde{\psi}^{\prime}(\nu_{1,j}(t))-\tilde{\psi}^{\prime}(\nu_{1,j}(0))\big{|}
+j=1d2w2,j2(0)ψ~(ν1,j(0))|ψ~(ν1,j(t))ψ~(ν1,j(0))|)\displaystyle\quad+\sum_{j=1}^{d_{2}}w_{2,j}^{2}(0)\tilde{\psi}^{\prime}(\nu_{1,j}(0))\bigl{|}\tilde{\psi}^{\prime}(\nu^{\prime}_{1,j}(t))-\tilde{\psi}^{\prime}(\nu^{\prime}_{1,j}(0))\bigr{|}\Bigg{)}
+1d2j=1d2(ψ~(ν1,j(t))(ψ~(ν1,j(t))ψ~(ν1,j(0))))\displaystyle\quad+\frac{1}{d_{2}}\sum_{j=1}^{d_{2}}\big{(}\tilde{\psi}(\nu_{1,j}(t))(\tilde{\psi}(\nu^{\prime}_{1,j}(t))\tilde{\psi}(\nu^{\prime}_{1,j}(0)))\big{)}
+1d2j=1d2(ψ~(ν1,j(0))(ψ~(ν1,j(t))ψ~(ν1,j(0))))\displaystyle\quad+\frac{1}{d_{2}}\sum_{j=1}^{d_{2}}\big{(}\tilde{\psi}(\nu^{\prime}_{1,j}(0))(\tilde{\psi}(\nu_{1,j}(t))-\tilde{\psi}(\nu_{1,j}(0)))\big{)}

From Theorem 10, and observing that $\tilde{\psi}^{\prime}(x),\tilde{\psi}^{\prime\prime}(x)$ are bounded by constants, one can verify that each summation term is stochastically bounded by $\sqrt{d_{2}}$, so as $d_{2}\rightarrow\infty$, $\mathcal{K}(t)-\mathcal{K}(0)$ converges to 0 at rate $1/\sqrt{d_{2}}$.

C.4 Spherical harmonics decomposition to activation function

Following Bach [2017], we start by studying the decomposition of the activation function in the quasi neural network (7) and its gradient (8): for arbitrary fixed $c>0$ and $-1\leq t\leq 1$, we can decompose equations (7) and (8) as

\displaystyle\tilde{\sigma}(ct)=\sum_{k=0}^{\infty}\lambda_{k}N(d,k)P_{k}(t), \quad (36)
\displaystyle\tilde{\sigma}^{\prime}(ct)=\sum_{k=0}^{\infty}\lambda^{\prime}_{k}N(d,k)P_{k}(t), \quad (37)

where $P_{k}$ is the $k$-th Legendre polynomial in dimension $d$.
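As a quick sanity check on this decomposition, the coefficients $\lambda^{\prime}_{k}$ can be approximated numerically by integrating $\tilde{\sigma}^{\prime}(t)=\Phi(\hat{c}t)$ against the dimension-$d$ Legendre (Gegenbauer) polynomials under the weight $(1-t^{2})^{(d-2)/2}$ used below; the even-degree coefficients with $k\geq 2$ vanish up to quadrature error. A minimal sketch, where the choices $d=4$ and $\hat{c}=1$ are arbitrary and the normalization constants are omitted:

```python
import math

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def legendre_d(k, d, t):
    """Degree-k Legendre polynomial in dimension d: Gegenbauer C_k^(alpha)
    with alpha = (d-1)/2 (matching the weight (1-t^2)^((d-2)/2)),
    normalized so that P_k(1) = 1."""
    alpha = (d - 1) / 2.0
    if k == 0:
        return 1.0
    prev, cur = 1.0, 2.0 * alpha * t   # C_0(t), C_1(t)
    prev1, cur1 = 1.0, 2.0 * alpha     # C_0(1), C_1(1), for normalization
    for n in range(2, k + 1):
        prev, cur = cur, (2.0 * (n - 1 + alpha) * t * cur - (n - 2 + 2.0 * alpha) * prev) / n
        prev1, cur1 = cur1, (2.0 * (n - 1 + alpha) * cur1 - (n - 2 + 2.0 * alpha) * prev1) / n
    return cur / cur1

def lam_prime(k, d=4, chat=1.0, n=20001):
    """Trapezoid-rule estimate of int_{-1}^{1} Phi(chat*t) P_k(t) (1-t^2)^((d-2)/2) dt,
    proportional to lambda'_k (constant factors dropped)."""
    h = 2.0 / (n - 1)
    total = 0.0
    for i in range(n):
        t = -1.0 + i * h
        f = Phi(chat * t) * legendre_d(k, d, t) * (1.0 - t * t) ** ((d - 2) / 2.0)
        total += f * (h / 2.0 if i in (0, n - 1) else h)
    return total
```

In this sketch `lam_prime(2)` and `lam_prime(4)` vanish up to quadrature error while `lam_prime(1)` does not, matching the parity claims of Lemma 14 below (the $k=0$ coefficient is nonzero because of the constant $1/2$ in $\Phi$).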

Lemma 14.

The decomposition of the activation function in the quasi neural network (36) satisfies

  1. $\lambda_{k}=0$ if $k$ is odd,

  2. $\lambda_{k}>0$ if $k$ is even,

  3. $\lambda_{k}\asymp\mathrm{Poly}(k)(C/\sqrt{k})^{k}$ as $k\rightarrow\infty$ when $k$ is even, where $\mathrm{Poly}(k)$ denotes a polynomial of $k$, and $C$ is a constant.

Its gradient (37) satisfies

  1. $\lambda^{\prime}_{k}=0$ if $k$ is even,

  2. $\lambda^{\prime}_{k}>0$ if $k$ is odd,

  3. $\lambda^{\prime}_{k}\asymp\mathrm{Poly}(k)(C/\sqrt{k})^{k}$ as $k\rightarrow\infty$ when $k$ is odd, where $\mathrm{Poly}(k)$ denotes a polynomial of $k$, and $C$ is a constant.

Proof.

We start with the derivative of the activation function in the quasi neural network:

\tilde{\sigma}^{\prime}(t)=\Phi(\hat{c}t),\quad -1\leq t\leq 1,

where $\hat{c}$ is a constant. We introduce auxiliary parameters $x,w\in\mathbb{R}^{d}$ with $\|x\|_{2}=\|w\|_{2}=1$ and let $t=w^{T}x$. By the Cauchy–Schwarz inequality, $-1\leq w^{T}x\leq 1$. Following Bach [2017], we have the following decomposition of $\tilde{\sigma}^{\prime}(w^{T}x)$:

\tilde{\sigma}^{\prime}(w^{T}x)=\sum_{k=1}^{\infty}\lambda_{k}^{\prime}N(d,k)P_{k}(w^{T}x),

where $N(d,k)$ and $P_{k}(\cdot)$ are defined in Section C.1, and $\lambda_{k}^{\prime}$ can be computed by

\displaystyle\lambda^{\prime}_{k}=\frac{w_{d-1}}{w_{d}}\int_{-1}^{1}\tilde{\sigma}^{\prime}(t)P_{k}(t)(1-t^{2})^{(d-2)/2}dt
\displaystyle=\left(-\frac{1}{2}\right)^{k}\frac{\Gamma((d-1)/2)}{\Gamma(k+(d-1)/2)}\frac{w_{d-1}}{w_{d}}\int_{-1}^{1}\tilde{\sigma}^{\prime}(t)\left(\frac{d}{dt}\right)^{k}(1-t^{2})^{k+(d-3)/2}dt.

To evaluate this integral, we apply a Taylor expansion to $\tilde{\sigma}^{\prime}(\cdot)$:

\tilde{\sigma}^{\prime}(t)=\frac{1}{2}+\frac{1}{\sqrt{2\pi}}\sum_{n=0}^{\infty}\frac{(-1)^{n}\hat{c}^{2n+1}}{2^{n}n!(2n+1)}t^{2n+1}. \quad (38)

We first study the following polynomial integral:

\int_{-1}^{1}t^{\alpha}\left(\frac{d}{dt}\right)^{k}(1-t^{2})^{k+(d-3)/2}dt.

When $\alpha<k$, this integral equals 0 because $P_{k}$ is orthogonal to all polynomials of degree less than $k$. If $(\alpha-k)\bmod 2\neq 0$, this integral is 0 because the integrand is an odd function. For $\alpha\geq k$ and $k\equiv\alpha\pmod 2$ ($k$ is odd), using successive integration by parts,

\displaystyle\int_{-1}^{1}t^{\alpha}\left(\frac{d}{dt}\right)^{k}(1-t^{2})^{k+(d-3)/2}dt=(-1)^{k}\frac{\alpha!}{(\alpha-k)!}\int_{-1}^{1}t^{\alpha-k}(1-t^{2})^{k+(d-3)/2}dt \quad (39)
\displaystyle=(-1)^{k}\frac{\alpha!}{(\alpha-k)!}\int_{-\pi/2}^{\pi/2}\sin^{\alpha-k}(x)\cos^{2k+(d-2)}(x)dx
\displaystyle=(-1)^{k}C_{d}\frac{\alpha!(2k+d-3)!!}{(\alpha-k)!!(\alpha+k+d-2)!!},

where $C_{d}$ is a constant that depends only on $d\bmod 2$.

Combining (38) and (39), we have $\lambda^{\prime}_{k}=0$ when $k$ is even and $k\neq 0$. When $k$ is odd,

\lambda_{k}^{\prime}=\left(-\frac{1}{2}\right)^{k}\frac{\Gamma((d-1)/2)}{\Gamma(k+(d-1)/2)}\frac{w_{d-1}}{w_{d}}\frac{C_{d}}{\sqrt{2\pi}}\sum_{\alpha=k:2}^{\infty}\hat{c}^{\alpha}(-1)^{(\alpha-1)/2}\frac{(\alpha-2)!!(2k+d-3)!!}{(\alpha-k)!!(\alpha+k+d-2)!!}.

Following Bach [2017] and Geifman et al. [2020], we treat $d$ as a constant and let $k$ go to infinity. Letting $\beta=(\alpha-k)/2\geq 0$, we have

\displaystyle\lambda_{k}^{\prime}=(-1)^{(k+1)/2}\left(\frac{1}{2}\right)^{k}\frac{\Gamma((d-1)/2)}{\Gamma(k+(d-1)/2)}\frac{w_{d-1}}{w_{d}}\frac{C_{d}}{\sqrt{2\pi}}\sum_{\beta=0}^{\infty}\frac{(-1)^{\beta}\hat{c}^{2\beta+k}(2\beta+k-2)!!(2k+d-3)!!}{(2\beta)!!(2\beta+2k+d-2)!!}
\displaystyle\asymp(-1)^{(k+1)/2}\left(\frac{1}{2}\right)^{k}\frac{\Gamma((d-1)/2)}{\Gamma(k+(d-1)/2)}\frac{w_{d-1}}{w_{d}}\frac{C_{d}}{\sqrt{2\pi}}\sum_{\beta=0}^{\infty}\frac{(-1)^{\beta}\hat{c}^{2\beta+k}\Gamma(\beta+k/2)\Gamma(k+(d-1)/2)}{\beta!\Gamma(\beta+k+d/2)2^{\beta-k/2}}
\displaystyle:=(-1)^{(k+1)/2}\left(\frac{1}{2}\right)^{k}\frac{\Gamma((d-1)/2)}{\Gamma(k+(d-1)/2)}\frac{w_{d-1}}{w_{d}}\frac{C_{d}}{\sqrt{2\pi}}\sum_{\beta=0}^{\infty}g(\beta,k),

where $\asymp$ means the ratio converges to a constant that does not depend on $k$ or $\beta$ as $k\rightarrow\infty$. Here we introduced the function $g(\beta,k)$ for brevity; it satisfies

\frac{g(\beta,k)}{g(\beta-1,k)}=-\frac{\hat{c}^{2}(2\beta+k-2)}{2\beta(2\beta+2k+d-2)},

which indicates that $g(\beta,k)$ decays at a factorial rate when $\beta>\hat{c}^{2}/2$. If $k\gg\hat{c}^{2}/2$, the $\beta\ll k$ regime dominates the summation.
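The successive-term ratio implied by the double-factorial form of $g(\beta,k)$ can be verified directly; a short numerical sketch (the values of $k$, $d$, and $\hat{c}$ are arbitrary, and the sign $(-1)^{\beta}$ is dropped):

```python
from math import prod

def double_fact(n):
    """n!! with 0!! = (-1)!! = 1."""
    return prod(range(n, 0, -2)) if n > 0 else 1

def g_abs(beta, k, d, chat):
    """|g(beta, k)| in its double-factorial form."""
    return (chat ** (2 * beta + k)
            * double_fact(2 * beta + k - 2) * double_fact(2 * k + d - 3)
            / (double_fact(2 * beta) * double_fact(2 * beta + 2 * k + d - 2)))

def ratio(beta, k, d, chat):
    """Claimed |g(beta,k) / g(beta-1,k)| = chat^2 (2b+k-2) / (2b (2b+2k+d-2))."""
    return chat ** 2 * (2 * beta + k - 2) / (2 * beta * (2 * beta + 2 * k + d - 2))
```

For $\beta>\hat{c}^{2}/2$ the ratio falls below a constant times $1/\beta$, which is the factorial decay used in the argument above.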

Using Stirling’s approximation, one can easily prove

\Gamma(k+x)\asymp\Gamma(k)k^{x}.

When $k\gg d$,

\displaystyle g(\beta,k)=\frac{(-1)^{\beta}\hat{c}^{2\beta+k}\Gamma(\beta+k/2)\Gamma(k+(d-1)/2)}{\beta!\Gamma(\beta+k+d/2)2^{\beta-k/2}}
\displaystyle\asymp\left(-\frac{1}{4}\right)^{\beta}\hat{c}^{2\beta+k}\Gamma(k+(d-1)/2)\frac{2^{k/2}\Gamma(k/2)}{\Gamma(k)k^{d/2}\beta!}
\displaystyle=\hat{c}^{k}\Gamma(k+(d-1)/2)\frac{2^{k/2}\Gamma(k/2)}{\Gamma(k)k^{d/2}}\left(-\frac{\hat{c}^{2}}{4}\right)^{\beta}\frac{1}{\beta!}.

This splits $g(\beta,k)$ into two parts: the first depends only on $k$ and the second depends only on $\beta$. Summing the second part over $\beta$ yields

\sum_{\beta=0}^{\infty}\left(-\frac{\hat{c}^{2}}{4}\right)^{\beta}\frac{1}{\beta!}=\exp\left(-\frac{\hat{c}^{2}}{4}\right).

Using Stirling’s approximation

\Gamma(x+1)\asymp\sqrt{2\pi x}(x/e)^{x},

this leads to the expression for $\lambda_{k}^{\prime}$:

\displaystyle\lambda_{k}^{\prime}\asymp(-1)^{(k+1)/2}\left(\frac{1}{2}\right)^{k}\frac{\Gamma((d-1)/2)}{\Gamma(k+(d-1)/2)}\hat{c}^{k}\Gamma(k+(d-1)/2)\frac{2^{k/2}\Gamma(k/2)}{\Gamma(k)k^{d/2}}\exp\left(-\frac{\hat{c}^{2}}{4}\right)
\displaystyle\asymp(-1)^{(k+1)/2}\left(\frac{\hat{c}}{2}\right)^{k}\frac{2^{k/2}\Gamma(k/2)}{\Gamma(k)k^{d/2}}\exp\left(-\frac{\hat{c}^{2}}{4}\right)
\displaystyle\asymp(-1)^{(k+1)/2}\left(\frac{\hat{c}}{2}\sqrt{\frac{e}{k}}\right)^{k}k^{-d/2}\exp\left(-\frac{\hat{c}^{2}}{4}\right).

Similarly, the activation function of the quasi neural network has the Taylor expansion

\displaystyle\tilde{\sigma}(t)=\varsigma_{\ell}\varphi\left(\hat{c}t\right)+t\Phi\left(\hat{c}t\right)
\displaystyle=\frac{t}{2}+\sum_{n=0}^{\infty}\frac{(-1)^{n}\hat{c}^{2n+1}}{2^{n+1}(n+1)!(2n+1)}t^{2n+2}.

So $\lambda_{k}=0$ when $k$ is odd, and when $k$ is even:

\lambda_{k}=(-1)^{k/2}\left(\frac{1}{2}\right)^{k}\frac{\Gamma((d-1)/2)}{\Gamma(k+(d-1)/2)}\frac{w_{d-1}}{w_{d}}\frac{C_{d}}{\sqrt{2\pi}}\sum_{\beta=0}^{\infty}\frac{(-1)^{\beta}\hat{c}^{2\beta+k}(2\beta+k-2)!!(2k+d-3)!!}{(2\beta)!!(2\beta+2k+d-2)!!}.

Furthermore, when $k\gg d$,

\lambda_{k}\asymp(-1)^{k/2}k^{-\frac{d}{2}}\left(\frac{\hat{c}}{2}\sqrt{\frac{e}{k}}\right)^{k}\exp\left(-\frac{\hat{c}^{2}}{4}\right).

C.5 Computing covariance matrix

In this part, we prove Theorem 12 by computing $\Sigma^{(0)}$ and $\Sigma^{(1)}$.

Theorem 12. The NTK of a binary weight neural network can be decomposed using equation (10). If $k\gg d$, then

\mathrm{Poly}_{1}(k)(C)^{-k}\leq u_{k}\leq\mathrm{Poly}_{2}(k)(C)^{-k}

where $\mathrm{Poly}_{1}(k)$ and $\mathrm{Poly}_{2}(k)$ denote polynomials of $k$, and $C$ is a constant.

We make use of the results in Section C.4. Recalling that $\lambda_{k},\lambda^{\prime}_{k}$ depend on $\hat{c}$, we make this dependence explicit by writing $\lambda_{k}(\hat{c}),\lambda^{\prime}_{k}(\hat{c})$. We introduce an auxiliary parameter $w\sim\mathcal{N}(0,I)$ and denote $\tilde{c}=\sqrt{\frac{c\mathrm{Var}[\Theta]}{d\tilde{\varsigma}^{2}}}=\sqrt{\frac{\mathrm{Var}[\theta]}{1-\mathrm{Var}[\theta]}}$ and $\tilde{w}=w/\|w\|_{2}$. Then the decomposition of the kernel (10) can be computed by

\displaystyle\Sigma^{(1)}=\mathbb{E}_{\theta}\left[\tilde{\sigma}\left(\mu\right)\tilde{\sigma}\left(\mu\right)\right]
\displaystyle=\mathbb{E}_{w\sim\mathcal{N}(0,I)}\left[\tilde{\sigma}(\tilde{c}\langle w,x\rangle)\tilde{\sigma}(\tilde{c}\langle w,x^{\prime}\rangle)\right]
\displaystyle=\mathbb{E}_{\|w\|}\int\tilde{\sigma}(\tilde{c}\langle\tilde{w},x\rangle)\tilde{\sigma}(\tilde{c}\langle\tilde{w},x^{\prime}\rangle)d\tau(\tilde{w})
\displaystyle=\mathbb{E}_{\|w\|}\sum_{k=0}^{\infty}(\lambda_{k}(\tilde{c}\|w\|))^{2}N(p,k)P_{k}(\langle x,x^{\prime}\rangle),
\displaystyle\Sigma^{(0)}=\mathbb{E}_{\theta}\left[\tilde{\sigma}^{\prime}\left(\mu\right)\tilde{\sigma}^{\prime}\left(\mu\right)\right]
\displaystyle=\mathbb{E}_{\|w\|}\sum_{k=0}^{\infty}(\lambda_{k}^{\prime}(\tilde{c}\|w\|))^{2}N(p,k)P_{k}(\langle x,x^{\prime}\rangle).

We first compute $\Sigma^{(0)}$. According to Lemma 16 in Bietti and Mairal [2019],

u_{0,k}=\mathbb{E}_{w\sim\mathcal{N}(0,I)}[{\lambda_{k}^{\prime}}^{2}]=\mathbb{E}_{\|w\|}[{\lambda_{k}^{\prime}}^{2}].

Recall that

\lambda^{\prime}_{k}(\tilde{c}\|w\|)\asymp(-1)^{k/2}k^{-d/2}\left(\frac{\tilde{c}\|w\|}{2}\sqrt{\frac{e}{k}}\right)^{k}\exp\left(-\frac{\tilde{c}^{2}\|w\|^{2}}{4}\right),
\displaystyle u_{0,k}=\mathbb{E}_{\|w\|}(\lambda_{k}^{\prime}(\tilde{c}\|w\|))^{2}
\displaystyle\asymp\mathbb{E}_{\|w\|}k^{-d}\left(\frac{\tilde{c}^{2}\|w\|^{2}e}{4k}\right)^{k}\exp\left(-\frac{\tilde{c}^{2}\|w\|^{2}}{2}\right)
\displaystyle=k^{-d}\left(k/e\right)^{-k}\mathbb{E}_{\|w\|}\left(\frac{\tilde{c}^{2}\|w\|^{2}}{4}\right)^{k}\exp\left(-\frac{\tilde{c}^{2}\|w\|^{2}}{2}\right).

Because $w\sim\mathcal{N}(0,I)$, $\|w\|_{2}^{2}$ follows a chi-square distribution, whose moment generating function is

M_{X}(t)=\mathbb{E}[\exp(t\|w\|^{2})]=(1-2t)^{-d/2}.

Its $k$-th order derivative is

M_{X}^{(k)}=\mathbb{E}[\|w\|^{2k}\exp(t\|w\|^{2})]=\frac{(d+2k-2)!!}{(d-2)!!}(1-2t)^{-\frac{d}{2}-k}.

Setting $t=-\tilde{c}^{2}/2$, we get

\displaystyle\mathbb{E}\left[\|w\|^{2k}\exp\left(-\frac{\tilde{c}^{2}\|w\|^{2}}{2}\right)\right]=\frac{(d+2k-2)!!}{(1+\tilde{c}^{2})^{d/2+k}(d-2)!!}
\displaystyle\asymp 2^{k}\frac{\Gamma(k+d/2)}{\Gamma(d/2)}(1+\tilde{c}^{2})^{-k-d/2}
\displaystyle\asymp\left(\frac{2k}{(1+\tilde{c}^{2})e}\right)^{d/2+k}\sqrt{\frac{1}{k}}.

so

\displaystyle u_{0,k}\asymp\left(\frac{\tilde{c}}{2}\right)^{2k}k^{-d}\left(k/e\right)^{-k}\left(\frac{2k}{(1+\tilde{c}^{2})e}\right)^{d/2+k}
\displaystyle\asymp k^{-(d-1)/2}\left(\frac{\tilde{c}^{2}}{2(1+\tilde{c}^{2})}\right)^{k}

when $k$ is odd, and 0 when $k$ is even.

Similarly,

\displaystyle u_{1,k}\asymp\left(\frac{\tilde{c}}{2}\right)^{2k}k^{-d}\left(k/e\right)^{-k}\left(\frac{2k}{(1+\tilde{c}^{2})e}\right)^{d/2+k}
\displaystyle\asymp k^{-(d-1)/2}\left(\frac{\tilde{c}^{2}}{2(1+\tilde{c}^{2})}\right)^{k}

when $k$ is even, and 0 when $k$ is odd.
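The chi-square moment identity used above, $\mathbb{E}[\|w\|^{2k}\exp(-\tilde{c}^{2}\|w\|^{2}/2)]=(d+2k-2)!!/((1+\tilde{c}^{2})^{d/2+k}(d-2)!!)$, can be spot-checked by Monte Carlo; a quick sketch, where the choices $d=4$, $k=2$, $\tilde{c}=0.5$ and the sample size are arbitrary:

```python
import math
import random

def double_fact(n):
    """n!! with 0!! = (-1)!! = 1."""
    out = 1
    while n > 1:
        out *= n
        n -= 2
    return out

def closed_form(d, k, c):
    """(d+2k-2)!! / ((1+c^2)^(d/2+k) * (d-2)!!)."""
    return double_fact(d + 2 * k - 2) / ((1.0 + c * c) ** (d / 2.0 + k) * double_fact(d - 2))

def monte_carlo(d, k, c, n=200000, seed=0):
    """Empirical estimate of E[||w||^(2k) * exp(-c^2 ||w||^2 / 2)] for w ~ N(0, I_d)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        sq = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))
        total += sq ** k * math.exp(-0.5 * c * c * sq)
    return total / n
```

With these settings the two quantities agree to within Monte Carlo error of a fraction of a percent.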

Finally, using the recurrence relation

tP_{k}(t)=\frac{k}{2k+d-3}P_{k-1}(t)+\frac{k+d-3}{2k+d-3}P_{k+1}(t)

and substituting them into (35) finishes the proof.

C.6 Gaussian kernel

\displaystyle\mathcal{K}_{RGauss}(x,x^{\prime})=\mathbb{E}[\mathcal{K}_{Gauss}(\kappa x,\kappa x^{\prime})]
\displaystyle=\mathbb{E}\left[\exp\left(-\frac{\kappa^{2}\|x-x^{\prime}\|^{2}}{\xi^{2}}\right)\right]
\displaystyle=\mathbb{E}\left[\exp\left(-\frac{\|x-x^{\prime}\|^{2}}{(\xi/\kappa)^{2}}\right)\right].

This indicates that this kernel can be decomposed using spherical harmonics (10), and when $k\gg d$, the coefficients satisfy

\displaystyle u_{k}=\mathbb{E}\left[\exp\left(-\frac{2\kappa^{2}}{\xi^{2}}\right)\left(\frac{\xi}{\kappa}\right)^{d-2}I_{k+d/2-1}\left(\frac{2\kappa^{2}}{\xi^{2}}\right)\Gamma\left(\frac{d}{2}\right)\right]
\displaystyle\asymp\mathbb{E}\left[\exp\left(-\frac{2\kappa^{2}}{\xi^{2}}\right)\Gamma\left(\frac{d}{2}\right)\sum_{j=0}^{\infty}\frac{1}{j!\Gamma(k+d/2+j)}\left(\frac{\kappa^{2}}{\xi^{2}}\right)^{k+2j}\right]
\displaystyle=\sum_{j=0}^{\infty}\frac{\Gamma(d/2)}{j!\Gamma(k+d/2+j)}\mathbb{E}\left[\left(\frac{\kappa^{2}}{\xi^{2}}\right)^{k+2j}\exp\left(-\frac{2\kappa^{2}}{\xi^{2}}\right)\right]
\displaystyle=\sum_{j=0}^{\infty}\frac{\Gamma(d/2)}{j!\Gamma(k+d/2+j)}\frac{\Gamma(k+2j+d/2)}{\Gamma(d/2)}\left(\frac{2}{\xi^{2}}\right)^{k+2j}\left(\frac{1}{1+4/\xi^{2}}\right)^{k+2j+d/2}
\displaystyle\asymp\left(\frac{2}{\xi^{2}}\right)^{k}\left(1+\frac{4}{\xi^{2}}\right)^{-k-d/2}\sum_{j=0}^{\infty}\frac{1}{j!}\left(\frac{k(2/\xi^{2})^{2}}{(1+4/\xi^{2})^{2}}\right)^{j}
\displaystyle\asymp\left(\frac{2}{4+\xi^{2}}\right)^{k}\exp\left(\left(\frac{2}{4+\xi^{2}}\right)^{2}k\right).

Note that $\frac{2}{4+\xi^{2}}\exp\left(\left(\frac{2}{4+\xi^{2}}\right)^{2}\right)$ is always smaller than 1, so $u_{k}$ is always decreasing in $k$.
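The claim follows because $x=2/(4+\xi^{2})\in(0,1/2]$ and $xe^{x^{2}}$ is increasing in $x$, so the factor is at most $\frac{1}{2}e^{1/4}\approx 0.642<1$; a one-line numerical check (the grid of $\xi$ values is arbitrary):

```python
import math

def decay_factor(xi):
    """The factor 2/(4+xi^2) * exp((2/(4+xi^2))^2) governing the decay of u_k."""
    x = 2.0 / (4.0 + xi * xi)
    return x * math.exp(x * x)

# The factor is largest at xi = 0, where it equals 0.5 * exp(0.25) < 1,
# and decreases as xi grows.
factors = [decay_factor(i / 10.0) for i in range(0, 101)]
```
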

Appendix D Additional information about numerical result

D.1 Toy dataset

In the neural network (NN) experiment, we used three layers with the first layer fixed. The number of hidden neurons is 512. In the binary weight neural network (BWNN) experiment, the setup is the same as NN except that the second layer is binary. We used the BinaryConnect method with stochastic rounding. We used gradient descent with the learning rate searched over $\{10^{-3},10^{-2},10^{-1}\}$. For the Laplacian kernel and Gaussian kernel, we searched the kernel bandwidth from $2^{-2}\mu$ to $2^{2}\mu$ by powers of 2, where $\mu$ is the median of the pairwise distances. The SVM cost parameter is searched from $10^{-2}$ to $10^{4}$ by powers of 2.
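The bandwidth grid built from the median heuristic can be sketched as follows (the toy data here is random and purely illustrative, not the UCI datasets used in the experiments):

```python
import itertools
import math
import random

random.seed(0)
# toy stand-in for a dataset: 20 points in 5 dimensions
X = [[random.gauss(0.0, 1.0) for _ in range(5)] for _ in range(20)]

# median of pairwise Euclidean distances (the "median heuristic")
dists = sorted(math.dist(a, b) for a, b in itertools.combinations(X, 2))
mu = dists[len(dists) // 2]

# bandwidth grid from 2^-2 * mu to 2^2 * mu by powers of 2
bandwidths = [mu * 2.0 ** p for p in range(-2, 3)]
```

Each bandwidth in the grid would then be tried with the kernel SVM, keeping the best by validation accuracy.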

More results are listed in Table 2. Accuracies are shown in the format mean $\pm$ std. P90 and P95 denote the percentage of datasets on which a model achieves at least 90% and 95% of the highest accuracy, respectively.

Table 2: More results in the UCI dataset experiment.
Classifier   Training                             Testing
             Accuracy        P90      P95         Accuracy        P90      P95
NN           96.19±8.03%     96.67%   91.11%      77.62±16.10%    73.33%   56.67%
BWNN         93.55±10.39%    84.44%   76.67%      77.83±16.57%    77.78%   54.44%
Laplacian    93.52±9.65%     85.56%   76.67%      81.62±14.72%    97.78%   91.11%
Gaussian     91.08±10.63%    76.67%   58.89%      81.40±14.85%    95.56%   87.78%

D.2 MNIST-like dataset

Similar to the toy dataset experiment, we used a three-layer neural network with the first layer fixed, and only quantized the second layer. The number of neurons in the hidden layer is 2048. The batch size is 100, and the Adam optimizer with learning rate $10^{-3}$ is used.
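The stochastic rounding step used by BinaryConnect rounds a clipped real weight to $\pm 1$ with probability proportional to its distance from the endpoints, which makes the binary weight an unbiased estimate of the real one. A minimal sketch of that rounding step (not the full training loop used in the experiments):

```python
import random

def stochastic_binarize(w, rng):
    """BinaryConnect-style stochastic rounding: clip w to [-1, 1], then round
    to +1 with probability (1 + w) / 2 and to -1 otherwise, so E[w_hat] = w."""
    w = max(-1.0, min(1.0, w))
    return 1.0 if rng.random() < (1.0 + w) / 2.0 else -1.0

rng = random.Random(0)
w = 0.3
samples = [stochastic_binarize(w, rng) for _ in range(100000)]
mean = sum(samples) / len(samples)  # close to 0.3 by unbiasedness
```

This unbiasedness is exactly what lets the paper treat the binary weights as random variables whose distribution propagates through the network.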