Why Quantization Improves Generalization: NTK of Binary Weight Neural Networks

Kaiqi Zhang, Ming Yin, Yu-Xiang Wang

Abstract

Quantized neural networks have drawn a lot of attention as they reduce the space and computational complexity during the inference. Moreover, there has been folklore that quantization acts as an implicit regularizer and thus can improve the generalizability of neural networks, yet no existing work formalizes this interesting folklore. In this paper, we take the binary weights in a neural network as random variables under stochastic rounding, and study the distribution propagation over different layers in the neural network. We propose a quasi neural network to approximate the distribution propagation, which is a neural network with continuous parameters and smooth activation function. We derive the neural tangent kernel (NTK) for this quasi neural network, and show that the eigenvalue of NTK decays at approximately exponential rate, which is comparable to that of Gaussian kernel with randomized scale. This in turn indicates that the Reproducing Kernel Hilbert Space (RKHS) of a binary weight neural network covers a strict subset of functions compared with the one with real value weights. We use experiments to verify that the quasi neural network we proposed can well approximate binary weight neural network. Furthermore, binary weight neural network gives a lower generalization gap compared with real value weight neural network, which is similar to the difference between Gaussian kernel and Laplace kernel.

1 Introduction

It has been found that by quantizing the parameters in a neural network, the memory footprint and computing cost can be greatly decreased with little to no loss in accuracy (Gupta et al., 2015). Furthermore, Hubara et al. (2016); Courbariaux et al. (2015) argued that quantization serves as an implicit regularizer and thus should increase the generalizability of neural network comparing to its full precision version. However, there is no formal theoretical investigation of this statement to the best of our knowledge.

Empirical results show that the traditional statistical learning techniques based on uniform convergence (e.g., VC-dimension (Blumer et al., 1989)) do not satisfactorily explain the generalization ability of neural networks. Zhang et al. (2016) showed that neural networks can perfectly fit the training data even if the labels are random, yet it generalized well when the data are not random. This seems to suggest that the model capacity of a neural network depends on not only the model, but also the dataset. Recent studies (He and Tao, 2020) managed to understand the empirical performance in a number of different aspects, including modeling stochastic gradient (SGD) with stochastic differential equation (SDE) (Weinan et al., 2019), studying the geometric structure of loss surface (He et al., 2020), and overparameterization – a particular asymptotic behavior when the number of parameters of the neural network tends to infinity (Li et al., 2018; Choromanska et al., 2015; Allen-Zhu et al., 2018; Arora et al., 2019a). Recently, it was proven that the training process of neural network in the overparameterized regime corresponds to kernel regression with Neural Tangent Kernel (NTK) (Jacot et al., 2018). A line of work (Bach, 2017; Bietti and Mairal, 2019; Geifman et al., 2020; Chen and Xu, 2020) further studied Mercer’s decomposition of NTK and proved that it is similar to a Laplacian kernel in terms of the eigenvalues.

In this paper, we propose modeling a two-layer binary weight neural network using a model with continuous parameters. Specifically, we assume the binary weights are drawn from the Bernoulli distribution where the parameters of the distribution (or the mean of the weights) are trainable parameters. We propose a quasi neural network, which has the same structure as a vanilla neural network but has a different activation function, and prove one can analytically approximate the expectation of output of this binary weight neural network with this quasi neural network. Using this model, our main contributions are as follows:

•

Under the overparameterized regime, we prove that the gradient computed from BinaryConnect algorithm is approximately an unbiased estimator of the gradient of the quasi neural network, hence such a quasi neural network can model the training dynamic of binary weight neural network.
•

We study the NTK of two-layer binary weight neural networks by studying the “quasi neural network”, and show that the eigenvalue of this kernel decays at an exponential rate, in contrast with the polynomial rate in a ReLU neural network Chen and Xu (2020); Geifman et al. (2020). We reveal the similarity between the Reproducing kernel Hilbert space (RKHS) of this kernel with Gaussian kernel, and it is a strict subset of function as the RKHS of NTK in a ReLU neural network. This indicates that the model capacity of binary weight neural network is smaller than that with real weights, and explains higher training error and lower generalization gap observed empirically.

2 Related work

Quantized neural networks.

There is a large body of work that focuses on training neural networks with quantized weights (Marchesi et al., 1993; Hubara et al., 2017; Gupta et al., 2015; Liang et al., 2021; Chu et al., 2021), including considering radically quantizing the weights to binary (Courbariaux et al., 2016; Rastegari et al., 2016) or ternary (Alemdar et al., 2017) values, which often comes at a mild cost on the model’s predictive accuracy. Despite all these empirical works, the theoretical analysis of quantized neural networks and their convergence is not well studied. Many researchers believed that quantization adds noise to the model, which serves as an implicit regularizer and makes neural networks generalize better (Hubara et al., 2016; Courbariaux et al., 2015), but this statement is instinctive and has never been formally proved to the best of our knowledge. One may argue that binary weight neural networks have a smaller parameter space than its real weight counterpart, yet Ding et al. (2018) showed that a quantized ReLU neural network with enough parameters can approximate any ReLU neural network with arbitrary precision. These seemingly controversy results motivate us to find another way to explain the stronger generalization ability that is observed empirically.

Theory of deep learning and NTK.

A notable recent technique in developing the theory of neural networks is the neural tangent kernel (NTK) Jacot et al. (2018). It draws the connection between an over-parameterized neural network and the kernel learning. This makes it possible to study the generalization of overparameterized neural network using more mature theoretical tools from kernel learning Bordelon et al. (2020); Simon et al. (2021).

The expressive power of kernel learning is determined by the RKHS of the kernel. Many researches have been done to identify the RKHS. Bach (2017); Bietti and Mairal (2019) studied the spectral properties of NTK of a two-layer neural network without bias. Geifman et al. (2020) further studied the NTK with bias and showed that the RKHS of two layer neural networks contains the same set of functions as RKHS of the Laplacian kernel. Chen and Xu (2020) expanded this result to arbitrary layer neural networks and showed that RKHS of arbitrary layer neural network is equivalent to Laplacian kernel. All these works are based on neural networks with real weights, and to the best of our knowledge, we are the first to study the NTK and generalization of binary weight neural networks.

3 Preliminary

3.1 Neural tangent kernel

It has been found that an overparameterized neural network has many local minima. Furthermore, most of the local minima are almost as good as the global minima (Soudry and Carmon, 2016). As a result, in the training process, the model parameters often do not need to move far away from the initialization point before reaching a local minimum (Du et al., 2018, 2019; Li and Liang, 2018). This phenomenon is also known as lazy training (Chizat et al., 2018). This allows one to approximate a neural network with a model that is nonlinear in its input and linear in its parameters. Using the connection between feature map and kernel learning, the optimization problem reduces to kernel optimization problem. More detailed explanation can be found below:

Denote $\Theta$ as the collection of all the parameters in a neural network $f_{\Theta}$ before an iteration, and $\Theta^{+}$ as the parameters after this iteration. Let $in$ denote fixed distribution in the input space. In this paper, it is a discrete distribution induced by the training dataset. Using Taylor expansion, for any testing data $\tilde{x}$ , let the stepsize be $\eta$ , the first-order update rule of gradient descent can be written as ( $l_{\text{loss}}(\cdot)$ be the differentiable loss function and the label is omitted)

	$\displaystyle\Theta^{+}-\Theta$	$\displaystyle=\eta\mathbb{E}_{x\sim in}\left[\nabla_{\Theta}\mathrm{loss}(f_{\Theta}(x))\right]$
		$\displaystyle=\eta\mathbb{E}_{x\sim in}\left[\nabla_{\Theta}f_{\Theta}(x)\ \mathrm{loss}^{\prime}(f_{\Theta}(x))\right]$
	$\displaystyle f_{\Theta^{+}}(x^{\prime})-f_{\Theta}(x^{\prime})$	$\displaystyle=\eta\nabla_{\Theta}f_{\Theta}(x^{\prime})\cdot\mathbb{E}_{x\sim in}\left[\nabla_{\Theta}f_{\Theta}(x)\ \mathrm{loss}^{\prime}(f_{\Theta}(x))\right]$
		$\displaystyle=\eta\mathbb{E}_{x\sim in}\left[\nabla_{\Theta}f_{\Theta}(x^{\prime})\cdot\nabla_{\Theta}f_{\Theta}(x)\ \mathrm{loss}^{\prime}(f_{\Theta}(x))\right]$
		$\displaystyle:=\eta\mathbb{E}_{x\sim in}\left[\mathcal{K}(x,x^{\prime})\ \mathrm{loss}^{\prime}(f_{\Theta}(x))\right].$

This indicates that the learning dynamics of overparameterized neural network is equivalent to kernel learning with kernel defined as

\mathcal{K}(x,x^{\prime})=\nabla_{\Theta}^{\top}f_{\Theta}(x)\cdot\nabla_{\Theta}f_{\Theta}(x^{\prime}).

which is called the neural tangent kernel (NTK). As the width of the hidden layers in this neural network tends to infinity, this kernel convergences to its expectation over $\Theta$ (Jacot et al., 2018).

3.2 Exponential kernel

A common class of kernel functions used in machine learning is the exponential kernel, which is a radial basis function kernel with the general form

\mathcal{K}(x,x^{\prime})=\exp(-(c\|x-x^{\prime}\|)^{\gamma}),

where $c>0$ and $\gamma\geq 1$ are constants. When $\gamma=1$ , this kernel is known as the Laplacian kernel, and when $\gamma=2$ , it is known as the Gaussian kernel.

According to Moore-Aronszajn theorem, each symmetric positive definite kernel uniquely induces a Reproducing kernel Hilbert space (RKHS). RKHS determines the functions that can be learned using a kernel. It has been found that the RKHS of NTK in a ReLU neural network is the same as Laplacian kernel (Geifman et al., 2020; Chen and Xu, 2020), and the empirical performance of a neural network is close to that of kernelized linear classifiers with exponential kernels in many datasets (Geifman et al., 2020).

3.3 Training neural networks with quantized weights

Among various methods to train a neural network, BinaryConnect (BC) (Courbariaux et al., 2015) is often one of the most efficient and accurate method. The key idea is to introduce a real-valued buffer $\theta$ and use it to accumulate the gradients. The weights will be quantized just before forward and backward propagation, which can benefit from the reduced computing complexity. The update rule in each iteration is

\theta^{+}\leftarrow\theta-\eta\frac{\partial\tilde{f}_{w}(x)}{\partial w},\quad w^{+}\leftarrow Quantize(\theta^{+}),

(1)

where $Quantize(\cdot):\mathbb{R}\rightarrow\{-1,1\}$ denotes the quantization function which will be discussed in Section 4.2, $w$ denotes the binary (or quantized) weights, $\theta$ and $\theta^{+}$ denote the real valued buffer before and after an iteration respectively, $\eta$ is the learning rate, and $f_{w}(\cdot)$ denotes the neural network with parameter $w$ . Here the gradients are computed by taking $w$ as if they are real numbers. The detailed algorithm can be founded in Section A.

4 Approximation of binary weight neural network

4.1 Notations

In this paper, we use $w_{\ell,ij}$ to denote the binary weights in the $\ell$ -th layer, $\theta_{\ell,ij}$ to denote its real-valued counterpart, and $b_{\ell,i}$ to denote the (real valued) bias. $\Theta$ is the collection of all the real-valued model parameters which will be specified in Section 4.2. The number of neurons in the $\ell$ -th hidden layer is $d_{\ell}$ , the input to the $\ell$ -th linear layer is $\boldsymbol{x}_{\ell}$ and the output is $\boldsymbol{y}_{\ell}$ . $d$ denote the number of input features. Besides, we use $\boldsymbol{x}$ to denote the input to this neural network, $y$ to denote the output and $z$ to denote the label.

We focus on the mean and variance under the randomness of stochastic rounding. Denote

	$\displaystyle\mu_{\ell,i}$	$\displaystyle:=\mathbb{E}[x_{\ell,i}\|\boldsymbol{x},\Theta],$	$\displaystyle\sigma_{\ell,i}^{2}$	$\displaystyle:=\mathrm{Var}[x_{\ell,i}\|\boldsymbol{x},\Theta],$
	$\displaystyle\nu_{\ell,i}$	$\displaystyle:=\mathbb{E}[y_{\ell,i}\|\boldsymbol{x},\Theta],$	$\displaystyle\varsigma_{i,\ell}^{2}$	$\displaystyle:=\mathrm{Var}[y_{\ell,i}\|\boldsymbol{x},\Theta],$	$\displaystyle\bar{y}$	$\displaystyle:=\mathbb{E}[y\|\Theta].$

We use $\psi(x)=\max(x,0)$ to denote ReLU activation function, and $in$ to denote the (discrete) distribution of training dataset. $\mathbb{E}_{in}[\cdot]:=\mathbb{E}_{(\boldsymbol{x},z)\sim in}[\cdot]$ denotes the expectation over training dataset, or “sample average”. We use bold symbol to denote a collection of parameters or variables $\boldsymbol{w}_{2}=[w_{2,j}],\boldsymbol{b}_{2}=[b_{2,j}],\boldsymbol{\nu}_{1}=[\nu_{1,j}],\boldsymbol{\theta}_{1}=[\theta_{1,ij}],i\in[d_{1}],j\in[d_{2}]$ .

4.2 Problem statement

(a) Binary weight neural network we focus on.

(b) Quasi neural network.

In this work, we target on stochastic quantization (Dong et al., 2017), which often yields higher accuracy empirically compared with deterministic rounding (Courbariaux et al., 2015). This also creates a smooth connection between the binary weights in a neural network and its real-valued parameters.

Let $w_{\ell,ij}=Quantize(\theta_{\ell,ij}),\theta_{\ell,ij}\in[-1,1]$ be the binary weights from stochastic quantization function, which satisfy Bernoulli distribution:

w_{\ell,ij}=\left\{\begin{array}[]{cl}+1,&\textrm{ with probability }p_{\ell,ij}=\frac{\theta_{\ell,ij}+1}{2},\\ -1,&\textrm{ with probability }1-p_{\ell,ij}.\end{array}\right.

(2)

This relationship leads to $\mathbb{E}[w_{\ell,ij}|\theta_{\ell,ij}]=\theta_{\ell,ij}$ .

We focus on a ReLU neural network with one hidden layer and two fully connect layers, which was also studied in Bach (2017); Bietti and Mairal (2019) except quantization. Besides, we add a linear layer (“additional layer”) in front of this neural network to project the input to an infinite dimension space. We randomly initialize the weights in this layer and leave it fixed (not trainable) throughout the training process. Furthermore, we quantize the weights in the first fully connect layer $w_{1,ij}$ and add a real-valued buffer $\theta_{1,ij}$ which determines the distribution of $w_{1,ij}$ as in (2), and leave the second layers not quantized. It is a common practice to leave the last layer not quantized, because this often leads to better empirical performance. If the second layer is quantized as well, the main result of this paper will not be changed. This can be easily checked by extending Lemma 4 into the second layer.

Remark 1.

In many real applications, e.g. computer vision, the dimension of data are often very large ( $\approx 10^{3}$ ) while they are laying in the lower dimension linear subspace, so we can take the raw input in these applications as the output of the additional layer, and the NN in this case is a two-layer NN where the first layer is quantized.

The set of all the real-valued parameters is $\Theta=\{\theta_{\ell_{1},ij},w_{\ell_{2},ij},b_{\ell,i}\}$ . The neural network can be expressed as

	$\displaystyle x_{1,i}$	$\displaystyle=\frac{1}{\sqrt{d}}\sum_{k=1}^{d}w_{0,ki}x_{k}+b_{0,i},\forall i\in[d_{1}];$	$\displaystyle y_{1,j}$	$\displaystyle=\sqrt{\frac{c}{d_{1}}}\sum_{i=1}^{d_{1}}w_{1,ij}x_{1,i}+b_{1,j},\forall j\in[d_{2}];$
	$\displaystyle x_{2,j}$	$\displaystyle=\psi(y_{1,j}),\forall j\in[d_{2}];$	$\displaystyle y$	$\displaystyle=\frac{1}{\sqrt{d_{2}}}\sum_{j=1}^{d_{2}}w_{2,j}x_{2,j}+b_{2}.$

We follow the typical setting of NTK papers Geifman et al. (2020) in initializing the parameters except the quantized parameters. As for the quantized parameters, we only need to specify the real-valued buffer of the weights in the first layer $\theta_{1,ij}$ .

Assumption 1.

We randomly initialize the weights in the “additional layer” and second linear layer independently as $w_{0,ki},w_{2,j}\sim\mathcal{N}(0,1)$ , and initialize all the biases to 0. The real-valued buffer of the weights are initialized independently identical with zero mean, variance $\mathrm{Var}[\theta]$ and bounded in $[-1,1]$ .

Remark 2.

Our theory applies to any initial distribution of $\theta_{1,ij}$ as long as it satisfies the constraint above. One simple example is the uniform distribution in $[-1,1]$ , which has variance $\mathrm{Var}[\theta]=1/3$ .

4.3 Quasi neural network

Given a fixed input and real-value model parameters $\Theta$ , under the randomness of stochastic rounding, the output of this binary weight neural network is a random variable. Furthermore, as the width of this neural network tends to infinity, the output of a linear layer tends to Gaussian distribution according to central limit theorem (CLT). We propose a method to determine the distribution of output and using the model parameters. Specifically, we give a closed form equation to compute the mean and variance of output of all the layers $\mu_{\ell},\sigma_{\ell},\nu_{\ell},\varsigma_{\ell}$ , and then marginalize over random initialization of $\Theta$ to further simplify this equation. We prove that $\varsigma_{\ell}$ converges to a constant almost surely using the law of large number (LLN), and simplify the expression by replacing them with the constant. This allows us to compute $\mu_{\ell},\nu_{\ell}$ using a neural-network-style function for given $\Theta$ . We call this function quasi neural network, which is given below:

	$\displaystyle x_{1,i}$	$\displaystyle=\frac{1}{\sqrt{d}}\sum_{k=1}^{d}w_{0,ki}x_{k}+b_{0,i},\forall i\in[d_{1}];$	$\displaystyle\nu_{1,j}$	$\displaystyle=\sqrt{\frac{c}{d_{1}}}\sum_{i=1}^{d_{1}}\theta_{1,ij}x_{1,j}+\beta b_{1,i},\forall j\in[d_{2}];$		(3)
	$\displaystyle\mu_{2,j}$	$\displaystyle=\tilde{\psi}(\nu_{1,j}),\forall j\in[d_{2}];$	$\displaystyle\bar{y}$	$\displaystyle=\frac{1}{\sqrt{d_{2}}}\sum_{j=1}^{d_{2}}w_{2,j}\mu_{2,j}+\beta b_{2}.$		(3)

In Section 4.3.1, we study the distribution of the output of each layer in a binary weight neural network (BWNN) conditioned on the set of real-valued parameter $\Theta$ . In Section 4.3.2, we prove that the conditioned variance of the output of the first linear layer studied above converges almost surely to a constant which does not depend on the data (input). This simplifies the expression computed in Section 4.3.1 to the form of quasi neural network (3), and also give a closed-form expression to $\tilde{\psi}(\cdot)$ in (3). In Section 4.3.3, we prove that conditioned on the set of real-valued parameter, the expectation of the gradients of BWNN equals the gradient of quasi neural network on the overparameterization limit. This indicates that the training dynamics of BWNN at initialization is the same as training the quasi neural network directly. The training dynamics beyond initialization are discussed in Section 4.3.4. Before jumping to the proof, we make the following assumptions:

Assumption 2.

After training the binary weight neural network as in (1), all the real-valued weights $\theta_{\ell,ij}$ stay in the range $[-1,1]$ .

Based on this assumption, we can ignore constraints that $\theta_{\ell,ij}\in[-1,1]$ and the projected gradient descent reduces to gradient descent. Because of the lazy training property of the overparameterized neural network, the model parameters $\theta_{\ell,ij}$ stay close to the initialization point during the training process, so this assumption can be satisfied by initializing $\theta_{\ell,ij}$ with smaller absolute value and/or applying weight decay during training. On the other hand, a common trick in a quantized neural network is to change the quantization level gradually during the training process to avoid (or reduce) overflow. With this trick, Assumption 2 are often naturally satisfied, but it introduces the quantization level as a trainable parameter.

Assumption 3.

The Euclidean norm of the input is 1:

\|\boldsymbol{x}\|_{2}=1,\forall\boldsymbol{x}\in\mathbb{R}^{d}.

This is a common assumption in studying NTK (Bach, 2017; Bietti and Mairal, 2019), and can be satisfied by normalizing the input data.

4.3.1 Conditioned distribution of the outputs of each layer

First we recognize that as the model parameters $\Theta$ are initialized randomly, there are “bad” initialization that will mess up our analysis. For example, all of $\theta_{1,ij}$ are initialized to 1 (or -1) while they are drawn from a uniform distribution. Fortunately, as the width $d_{1},d_{2}$ grows to infinite, the probability of getting into these “bad” initialization is 0. We make this statement formal in the following part.

Definition 4.

“Good Initialization”. We call the set of parameters is a “Good Initialization” $\Theta\in\mathcal{G}$ if it satisfies:

•

$\displaystyle\forall k,k^{\prime}\in[d],\lim_{d_{1}\rightarrow\infty}\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}w_{0,ki}w_{0,k^{\prime}i}=\delta_{k,k^{\prime}},$
•

$\displaystyle\forall k,k^{\prime},k^{\prime\prime}\in[d],\lim_{d_{1}\rightarrow\infty}\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}|w_{0,ki}w_{0,k^{\prime}i}w_{0,k^{\prime}i}|\leq\sqrt{\frac{8}{\pi}},$
•

$\forall k,k^{\prime}\in[d],\forall j\in[d_{2}]$ for some finite $d_{2}$ , $\displaystyle\lim_{d_{1}\rightarrow\infty}\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}w_{0,ki}w_{0,k^{\prime}i}\theta_{1,ij}^{2}=\mathrm{Var}[\theta]\delta_{k,k^{\prime}}$ ,

where

	$\displaystyle 1\quad k=k^{\prime}$
	$\displaystyle 0\quad k\neq k^{\prime}.$

Proposition 3.

Under the assumption that all the parameters are initialized as in 1, the probability of getting “Good Initialization” is 1:

P_{\Theta}(\Theta\in\mathcal{G})=1.

Lemma 4.

On the limit $d_{1}\rightarrow\infty$ , and any given $x$ conditioned on any fixed $\Theta\in\mathcal{G}$ , for any $j$ , the distribution of $y_{1,j}$ converge to Gaussian distribution with mean $\nu_{1,j}$ and variance $\varsigma_{1,j}^{2}$ which can be computed by:

\displaystyle y_{1,j}|\Theta

\displaystyle\rightarrow\mathcal{N}(\nu_{1,j},\varsigma_{1,j}^{2}),

\displaystyle\nu_{1,j}

\displaystyle=\sqrt{\frac{c}{d_{1}}}\sum_{i=1}^{d_{1}}\theta_{1,ij}x_{1,i}+b_{1,j},

\displaystyle\varsigma_{1,j}^{2}

\displaystyle=\frac{c}{d_{1}}\sum_{i=1}^{d_{1}}(1-\theta_{1,ij}^{2})x_{1,i}^{2}

(4)

This lemma can be proved by Lyapunov central limit theorem and sum of expectation. See Section B.2 for the details.

Lemma 5.

Assume that the input to a ReLU layer $y_{1,i}$ satisfy Gaussian distribution with mean $\nu_{1,i}$ and variance $\varsigma_{1,i}^{2}$ conditioned on $\Theta$ ,

y_{1,j}|\Theta\sim\mathcal{N}(\nu_{1,j},\varsigma_{1,j}^{2}).

Denote

g_{j}=\varphi\left(\frac{\nu_{j}}{\varsigma_{j}}\right),s_{j}=\Phi\left(\frac{\nu_{j}}{\varsigma_{j}}\right),

(5)

where $\varphi(x)$ denotes standard Gaussian function and $\Phi(x)$ denotes its integration:

\displaystyle\varphi(x)

\displaystyle=\sqrt{\frac{1}{2\pi}}\exp\left(-\frac{1}{2}x^{2}\right),

\displaystyle\Phi(x)

\displaystyle=\int_{-\infty}^{x}\varphi(y)dy.

Then the output $x_{2,j}$ has mean $\mu_{2,i}$ and variance $\sigma_{2,i}^{2}$ conditioned on $\Theta$ , with

	$\displaystyle\mu_{2,j}$	$\displaystyle:=\mathbb{E}[x_{2,j}\|\Theta]=g_{j}\varsigma_{1,j}+s_{j}\nu_{1,j},$		(6)
	$\displaystyle\sigma_{2,i}^{2}$	$\displaystyle:=\mathrm{Var}[x_{2,j}\|\Theta]=(\varsigma_{1,i}^{2}+\nu_{1,i}^{2})s_{i}+\nu_{1,i}\sigma_{1,i}g_{1,i}-\nu_{1,j}^{2}.$		(6)

From 4 we know that on the limit $d_{1}\rightarrow\infty$ , conditioned on $\Theta$ and $\boldsymbol{x}$ , for any $j$ , $y_{1,j}$ converge to Gaussian distribution. From continuous mapping theorem, the distribution of $x_{2,j}$ converge to that shown in 5 so its mean $\mu_{2,j}$ and variance $\sigma_{2,j}$ converge to that computed in 5.

Equations (4) and (6) provide a method to calculate the mean and variance of output conditioned on the input and real-valued model parameters and allow us to provide a closed-form equation of quasi neural network. We will simplify this equation in Section 4.3.2.

4.3.2 Convergence of conditioned variance

In this part, we assume that the model parameters satisfy “Good Initialization”, which is almost surely on the limit $d_{1}\rightarrow\infty$ as is proven in 3, and study the distribution of $\nu_{1,j}$ and $\varsigma_{1,j}$ .

Theorem 6.

If the parameters are “Good Initialization” $\Theta\in\mathcal{G}$ , on the limit $d_{1}\rightarrow\infty$ , for any finite $d_{2}$ , $\nu_{1,j}$ converges to Gaussian distribution which are independent of each other, and $\varsigma_{1,j}^{2}$ converges a.s. to

\tilde{\varsigma}_{1}^{2}=\frac{c}{d}(1-\mathrm{Var}[\theta]).

With this approximation, we can replace the variance $\varsigma_{1,i}$ in equation (6) with $\tilde{\varsigma}_{1}$ and leave the mean of output in the linear layer as the only variable in the quasi neural network. Formal proof can be found in Section B.4. Note the propagation function in the linear layer (the first equation in (4)) is also a linear function in $x$ and $\theta$ . This motivates us to compute $\bar{y}$ using a neural network-like function as is given in (3), where $\tilde{\psi}(\cdot)$ is

\tilde{\psi}(\nu_{1,j})=\mathbb{E}[\psi(y_{1,j})|\nu_{1,j}]=\tilde{\varsigma}_{1}\phi\left(\frac{\nu_{1,i}}{\tilde{\varsigma}_{1}}\right)+\nu_{1,j}\Phi\left(\frac{\nu_{1,j}}{\tilde{\varsigma}_{1}}\right).

(7)

This equation gives a closed-form connection between the mean of output of neural network $\bar{y}$ and the real-valued model parameter $\Theta$ , and allows up to apply existing tools for analyzing neural networks with real-valued weight to analysis binary weight neural network. Its derivative in the sense of Calculus is:

\tilde{\psi}^{\prime}(\nu_{1,j})=\Phi\left(\frac{\nu_{1,j}}{\tilde{\varsigma}_{1}}\right).

(8)

The proof of derivative can be found in Section B.5.

4.3.3 Gradient of quasi neural network

In this part, we compute the gradients using binary weights as in BinaryConnect Algorithm, and make sense of the gradient in (8) by proving that it is the expectation of gradients under the randomness of stochastic rounding.

Theorem 7.

The expectation of gradients to output with respect to weights computed by sampling the quantized weights equals the gradients of “quasi neural network” defined above in (3) on the limit $d_{2}\rightarrow\infty,d_{1}\rightarrow\infty$ ,

\displaystyle\frac{\partial\bar{y}}{\partial\theta_{1,ij}}

\displaystyle=\mathbb{E}\left[\frac{\partial y}{\partial w_{1,ij}}\middle|\Theta\right],

\displaystyle\frac{\partial\bar{y}}{\partial b_{1,j}}

\displaystyle=\mathbb{E}\left[\frac{\partial y}{\partial b_{1,j}}\middle|\Theta\right],

\displaystyle\frac{\partial\bar{y}}{\partial w_{2,j}}

\displaystyle=\mathbb{E}\left[\frac{\partial y}{\partial w_{2,j}}\middle|\Theta\right].

Theorem 8.

For MSE loss, $\mathrm{loss}(y)=\frac{1}{2}(y-z)^{2},$ where $z$ is the ground-truth label, the gradient of the loss converges on the limit $d_{2}\rightarrow\infty,d_{1}\rightarrow\infty$ ,

	$\displaystyle\frac{\partial\mathrm{loss}(\bar{y})}{\partial\theta_{1,ij}}$	$\displaystyle=\mathbb{E}\left[\frac{\partial\mathrm{loss}(y)}{\partial w_{1,ij}}\middle\|\Theta\right],$	$\displaystyle\frac{\partial\mathrm{loss}(\bar{y})}{\partial b_{1,j}}$	$\displaystyle=\mathbb{E}\left[\frac{\partial\mathrm{loss}(y)}{\partial b_{1,j}}\middle\|\Theta\right],$
	$\displaystyle\frac{\partial\mathrm{loss}(\bar{y})}{\partial w_{2,j}}$	$\displaystyle=\mathbb{E}\left[\frac{\partial\mathrm{loss}(y)}{\partial w_{2,j}}\middle\|\Theta\right].$

In other words, the BinaryConnect algorithm provides an unbiased estimator to the gradients for the quasi neural network on this limit of overparameterization.

Theorem 6 and Theorem 8 conclude that for an infinite wide neural network, the BinaryConnect algorithm is equivalent to training quasi neural network with stochastic gradient descent (SGD) directly. Furthermore, this points out the gradient flow of BinaryConnect algorithm and allows us to study this training process with neural tangent kernel (NTK).

4.3.4 Asymptotics during training

So far we have studied the distribution of output during initialization. To study the dynamic of binary weight neural network during training, one need to extend these results to any parameter during training $\Theta(t),t\in[0,T]$ . Fortunately, motivated by Jacot et al. (2018), we can prove that as $d_{1},d_{2}\rightarrow\infty$ , the model parameters $\Theta(t)$ stays asymptotically close to initialization for any finite $T$ , so-called “lazy training”, so the above results apply to the entire training process.

Lemma 9.

For all $T$ such that $\int_{t=0}^{T}\|\bar{y}(t)-z\|_{in}dt$ stays stochastically bounded, as $d_{2}\rightarrow\infty,d_{1}\rightarrow\infty$ , $\|{\boldsymbol{w}}_{2}(T)-{\boldsymbol{w}}_{2}(0)\|,\|{\boldsymbol{b}}_{1}(T)-{\boldsymbol{b}}_{1}(0)\|,\|{\boldsymbol{\theta}_{1}}(T)-{\boldsymbol{\theta}_{1}}(0)\|_{F}$ are all stochastically bounded, $\|{\boldsymbol{\nu}}_{1}(t)-{\boldsymbol{\nu}}_{1}(0)\|$ and $\int_{t=0}^{T}\big{\|}\frac{\partial{\boldsymbol{\nu}}_{1}(t)}{\partial t}\big{\|}dt$ is stochastically bounded for all $x$ .

The proof can be found in Section B.8. Note that $\|\boldsymbol{w}_{2}\|=O(\sqrt{d_{2}}),\|\boldsymbol{\theta}_{1}\|_{F}=O(\sqrt{d_{1}d_{2}})$ , this results indicates that as $d_{2}\rightarrow\infty$ , the varying of the parameter is much smaller than the initialization, or so-called “lazy training”. Making use of this result, we further get the follow result:

Lemma 10.

Under the condition of 9, Lyapunov’s condition holds for all $T$ so $y_{1,j}$ converges to Gaussian distribution conditioned on the model parameters $\Theta(T)$ . Furthermore, $\varsigma_{1,j}(T)\rightarrow\varsigma_{1,t}(0)$ , which equals $\tilde{\varsigma}_{1}$ almost surely.

The proof can be found in Section B.9. This result shows that the analysis in Section 4.3 applies to the entire training process, and allows us to study the dynamics of binary weight neural network using quasi neural network.

5 Capacity of Binary Weight Neural Network

As has been found in Jacot et al. (2018), the dynamics of an overparameterized neural network trained with SGD is equivalent to kernel gradient descent where the kernel is NTK. As a result, the effective capacity of a neural network is equivalent to the RKHS of its NTK. In the following part, we will study the NTK of binary weight neural network using the approximation above, and compare it with Gaussian kernel.

5.1 NTK of three-layer binary weight neural networks

We consider the NTK binary weight neural network by studying this “quasi neural network” defined as

\displaystyle\mathcal{K}_{BWNN}(x,x^{\prime})

\displaystyle=\sum_{i=1,j=1}^{d_{1},d_{2}}\frac{\partial\bar{y}}{\partial\theta_{1,ij}}\frac{\partial\bar{y}^{\prime}}{\partial\theta_{1,ij}}+\sum_{j=1}^{d_{2}}\frac{\partial\bar{y}}{\partial b_{1,j}}\frac{\partial\bar{y}^{\prime}}{\partial b_{1,j}}+\sum_{j=1}^{d_{2}}\frac{\partial\bar{y}}{\partial w_{2,j}}\frac{\partial\bar{y}^{\prime}}{\partial w_{2,j}},

(9)

where $\Theta:=\{w_{1,ij},b_{1,j},b_{2,j}\}$ denotes all the trainable parameters. We omitted the terms related to $b_{2}$ (which is a constant) in this equation.

First prove that the change of kernel asymptotically converges to 0 during training process.

Theorem 11.

Under the condition of 9, $\mathcal{K}(x,x^{\prime})(T)\rightarrow\mathcal{K}(x,x^{\prime})(0)$ at rate $1/\sqrt{d_{2}}$ for any $x,x^{\prime}$ .

The proof can be found in Section C.3. Using Assumption 3, we confine the input on the hypersphere $\mathbb{S}^{d-1}=\{x\in\mathbb{R}^{d}:\|x\|_{2}=1\}$ . One can easily tell that it is positive definite, so we can apply Mercer’s decomposition Minh et al. (2006) to it.

To find the basis and eigenvalues to this kernel, we apply spherical harmonics decomposition to this kernel, which is common among studying of NTK (Bach, 2017; Bietti and Mairal, 2019):

\mathcal{K}_{BWNN}(x,x^{\prime})=\sum_{k=1}^{\infty}u_{k}\sum_{j=1}^{N(d,k)}Y_{k,j}(x)Y_{k,j}(x^{\prime}),

(10)

where $d$ denotes the dimension of $x$ and $x^{\prime}$ , $Y_{k,j}$ denotes the spherical harmonics of order $k$ . This suggests that NTK of binary weight neural network and exponential kernel can be spanned by the same set of basis function. The key question is the decay rate of $u_{k}$ with $k$ .

Theorem 12.

NTK of a binary weight neural network can be decomposed using (10). If $k\gg d$ , then

\mathrm{Poly}_{1}(k)C^{-k}\leq u_{k}\leq\mathrm{Poly}_{2}(k)C^{-k}.

(11)

where $\mathrm{Poly}_{1}(k)$ and $\mathrm{Poly}_{2}(k)$ denote polynomials of $k$ , and $C$ is a constant.

In contrast, Geifman et al. (2020) shows that for NTK in the continuous space, it holds that

C_{1}k^{-d}\leq u_{k}\leq C_{2}k^{-d},

with constants $C_{1}$ and $C_{2}$ . Because its decay rate is slower than that of the binary weight neural network, its RKHS covers a strict superset of functions (Geifman et al., 2020).

Proof Sketch: We first compute NTK of quasi neural network, which depends on the distribution of $\mu_{1,j}$ . As is shown in Theorem 6, $\mu_{1,j}$ converge to Gaussian distribution on the limit of infinite wide neural network. To find the joint distribution of $\mu_{1,j}$ and $\mu^{\prime}_{1,j}$ given arbitrary two inputs $x,x^{\prime}$ , we combine the first linear layer in the quasi neural network with the “additional layer” in front of it (the first two equations in (3)). This allows up to reparameterize $\mu_{1,j}$ as

\mu_{1,j}=\langle w_{j},x\rangle,

where $w_{j}\sim\mathcal{N}(0,\frac{c\mathrm{Var}[\theta]}{d}I)$ denotes the fused weight. A key component in computing the NTK has the form

\mathbb{E}[\psi(\mu_{1})\psi(\mu_{1}^{\prime})]=\mathbb{E}[\psi(\langle w,x\rangle)\psi^{\prime}(\langle w,x\rangle)]=\mathbb{E}_{\|w\|}\mathbb{E}[\psi(\langle w,x\rangle)\psi^{\prime}(\langle w,x\rangle)|\|w\|].

The second equation comes from the law of total expectation. We use 2-norm in this expression. The inner expectation is equivalent to integration on a sphere, and can be computed by applying sphere harmonics decomposition to $\psi(\cdot)$ . The squared norm of the fused weight $\|w\|^{2}$ satisfy Chi-distribution, and we use momentum generating function to finish computing.

5.2 Comparison with Gaussian Kernel

Even if the input to a neural network $x$ is constrained on a unit sphere, the first linear layer (together with the additional linear layer in front of it) will project it to the entire $\mathbb{R}^{d}$ space with Gaussian distribution. In order to simulate that, we define a kernel by randoming the scale of $x$ and $x^{\prime}$ beforing taking them into a Gaussian kernel.

\displaystyle\mathcal{K}_{RGauss}(x,x^{\prime})=\mathbb{E}_{\kappa}[\mathcal{K}_{Gauss}(\kappa x,\kappa x^{\prime})],

where $\mathcal{K}_{Gauss}(x,x^{\prime})=\exp\left(-\frac{\|x-x^{\prime}\|^{2}}{\xi^{2}}\right)$ is a Gaussian kernel, $\kappa\sim\chi_{d}$ satisfy Chi distribution with $d$ degrees of freedom. This scaling factor projects a random vector uniformly distributed on a unit sphere to Gaussian distributed. The corresponding eigenvalue satisfy

\displaystyle A_{1}C^{-k}\leq u_{k}\leq A_{2}C^{-k},

(12)

where $A_{1},A_{2},C$ are constants that depend on $\xi$ . The dominated term in both (11) and (12) have an exponential decay rate $C^{-k}$ , which suggests the similarity between NTK of binary weight neural network and Gaussian kernel. In comparison, Bietti and Mairal (2019); Geifman et al. (2020) showed that the eigenvalue of NTK decay at rate $k^{-d}$ , which is slower that binary weight neural network or Gaussian kernel. Furthermore, Aronszajn’s inclusion theorem suggests $\mathcal{H}_{\mathcal{K}_{BWNN}}\subset\mathcal{H}_{\mathcal{K}_{NN}}$ , where $\mathcal{K}_{NN}$ denotes the NTK of real-valued weight neural network. In other words, the expressive power of binary weight neural network is weaker than its real valued counterpart on the limit that the width goes to infinity. Binary weight neural networks are less venerable to noise thanks to the smaller expressive power at the expense of failing to learn some “high frequency” components in the target function. This explains that binary weight neural network often achieve lower training accuracy and smaller generalization gap compared with real weight neural network.

6 Numerical result

6.1 Quasi neural network

Refer to caption — Figure 2: Approximation of quasi neural network. (a)(b): before (a) and after (b) training, histogram of output under fixed model parameter (blue), and fitted with Gaussian distribution (red). (c)(d): $\mathbb{E}(y|\Theta)$ computed from quasi neural network (horizontal axis) and by Monte Carlo (Vertical axis). The red line shows $y=x$ .

In this part, we empirically verify the approximation of quasi neural network by comparing the inference result of quasi neural network with that achieved by Monte Carlo. The architecture is the same as that mentioned in Section 4.2, with 1600 hidden neurons. We train this neural network on MNIST dataset (LeCun et al., 1998) by directly applyng gradient descent to the quasi neural network. To reduce overflow, we add weight decay of 0.001 during training. Figure 2(a)(b) shows the histogram of output under stochastic rounding before and after training. We arbitrarily choose one input sample from the testing set and get 1000 samples under different stochastic rounding. This result supports our statement that the distribution of pre-activation (output of linear layer) conditioned on real-valued model parameters converge to Gaussian distribution. Figure 2(c)(d) compares the mean of output by quasi neural network approximation (horizontal axis) with that computed using Monte Carlo (vertical axis). These alignments further supports our method of approximating binary weight neural network with quasi neural network.

6.2 Generalization gap

6.2.1 Toy dataset

We compare the performance of the neural network with/without binary weight and kernel learning using the same set of 90 small scale UCI datasets with less than 5000 data points as in Geifman et al. (2020); Arora et al. (2019b). We report the training accuracy and testing accuracy of both vanilla neural network (NN) and binary weight neural network (BWNN) in Figure 3. To further illustrate the difference, we list the paired T-test result of neural network (NN) against binary weight neural network (BWNN), and Gaussian kernel (Gaussian) against Laplace kernel (Laplace) using in Table 1. In this table, t-stats and p-val denotes the t-statistic and two-sided p-value of the paired t-test between two classifiers, and $<$ and $>$ denotes the percentage of dataset that the first classifier gets lower or higher testing accuracy or generalization bound (training accuracy - testing accuracy), respectively.

As can be seen from the results, although the Laplacian kernel gets higher training accuracy than the Gaussian kernel, its testing accuracy is almost the same as the latter one. In other words, the former has smaller generalization gap than the latter which can also be observed in Table 1. Similarly, a neural network gets higher training accuracy than a binary weight neural network but gets similar testing accuracy.

Table 1: Pairwise performance comparison on selected 90 UCI datasets.

Classifier	Testing				Training-Testing
Classifier	t-stats	p-val	$<$	$>$	t-stats	p-val	$<$	$>$
NN-BWNN	0.7471	0.4569	53.33%	41.11%	4.034	0.000	26.67%	67.77%
Laplace-Gaussian	0.4274	0.6701	51.11%	33.33%	3.280	0.001	37.78%	53.33%

6.2.2 MNIST-like dataset

We compare the performance of neural networks with binary weights (Binary) with its counterpart with real value weights (Real). We take the number of training samples as a parameter by random sampling the training set and use the original test set for testing. The experiments are repeated 10 times and the mean and standard derivation is shown in Figure 4. In the MNIST dataset, the performance of neural networks with or without quantization is similar. This is because MNIST (LeCun et al., 1998) is simpler and less vulnerable to overfitting. On the other hand, the generalization gap with weight quantized is much smaller than without it in FashionMNIST (Xiao et al., 2017) dataset, which matches our prediction.

7 Discussion

In this paper, we propose a quasi neural network to approximate the binary weight neural network. The parameter space of quasi neural network is continuous, and its gradient can be approximated using the BinaryConnect algorithm. We study the expressive power of the binary weight neural network by studying the RKHS of its NTK and showed its similarity with the Gaussian kernel. We empirically verify that quantizing the weights can reduce the generalization gap, similar to Gaussian Kernel versus the Laplacian kernel. This result can be easily generalized to a neural network with other weight quantization methods, i.e. using more bits. Yet there are several questions to be answered by future work:

1.

In this work, we only quantize the weights, while much empirical work to quantize both the weights and the activations has been done. Can we use a similar technique to study the expressive power of that neural network?
2.

We study the NTK of a two-layer neural network with one additional linear liner in front of it, and only the weights in the first layer are quantized. It remains to be answered whether a multi-layer neural network allow similar approximation, and whether using more layers can increase its expressive power.

8 Acknowledgement

The authors would like to thank Chunfeng Cui for helpful discussion about the idea of quasi neural network and effort in proofreading the paper, as well as Yao Xuan for helpful discussion about kernel and RKHS.

References

Alemdar et al. (2017) H. Alemdar, V. Leroy, A. Prost-Boucle, and F. Pétrot. Ternary neural networks for resource-efficient ai applications. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 2547–2554. IEEE, 2017.
Allen-Zhu et al. (2018) Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018.
Arora et al. (2019a) S. Arora, S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pages 322–332. PMLR, 2019a.
Arora et al. (2019b) S. Arora, S. S. Du, Z. Li, R. Salakhutdinov, R. Wang, and D. Yu. Harnessing the power of infinitely wide deep nets on small-data tasks. arXiv preprint arXiv:1910.01663, 2019b.
Bach (2017) F. Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629–681, 2017.
Bietti and Mairal (2019) A. Bietti and J. Mairal. On the inductive bias of neural tangent kernels. In Advances in Neural Information Processing Systems, pages 12893–12904, 2019.
Blumer et al. (1989) A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the vapnik-chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989.
Bordelon et al. (2020) B. Bordelon, A. Canatar, and C. Pehlevan. Spectrum dependent learning curves in kernel regression and wide neural networks. arXiv preprint arXiv:2002.02561, 2020.
Chen and Xu (2020) L. Chen and S. Xu. Deep neural tangent kernel and laplace kernel have the same rkhs. arXiv preprint arXiv:2009.10683, 2020.
Chizat et al. (2018) L. Chizat, E. Oyallon, and F. Bach. On lazy training in differentiable programming. arXiv preprint arXiv:1812.07956, 2018.
Choromanska et al. (2015) A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surfaces of multilayer networks. In Artificial intelligence and statistics, pages 192–204. PMLR, 2015.
Chu et al. (2021) T. Chu, Q. Luo, J. Yang, and X. Huang. Mixed-precision quantized neural networks with progressively decreasing bitwidth. Pattern Recognition, 111:107647, 2021.
Courbariaux et al. (2015) M. Courbariaux, Y. Bengio, and J.-P. David. Binaryconnect: Training deep neural networks with binary weights during propagations. Advances in neural information processing systems, 28:3123–3131, 2015.
Courbariaux et al. (2016) M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to+ 1 or-1. arXiv preprint arXiv:1602.02830, 2016.
Ding et al. (2018) Y. Ding, J. Liu, J. Xiong, and Y. Shi. On the universal approximability and complexity bounds of quantized relu neural networks. arXiv preprint arXiv:1802.03646, 2018.
Dong et al. (2017) Y. Dong, R. Ni, J. Li, Y. Chen, J. Zhu, and H. Su. Learning accurate low-bit deep neural networks with stochastic quantization. arXiv preprint arXiv:1708.01001, 2017.
Du et al. (2019) S. Du, J. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pages 1675–1685. PMLR, 2019.
Du et al. (2018) S. S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2018.
Geifman et al. (2020) A. Geifman, A. Yadav, Y. Kasten, M. Galun, D. Jacobs, and R. Basri. On the similarity between the laplace and neural tangent kernels. arXiv preprint arXiv:2007.01580, 2020.
Gupta et al. (2015) S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, pages 1737–1746, 2015.
He and Tao (2020) F. He and D. Tao. Recent advances in deep learning theory. arXiv preprint arXiv:2012.10931, 2020.
He et al. (2020) F. He, B. Wang, and D. Tao. Piecewise linear activations substantially shape the loss surfaces of neural networks. arXiv preprint arXiv:2003.12236, 2020.
Hubara et al. (2016) I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks. In Proceedings of the 30th international conference on neural information processing systems, pages 4114–4122. Citeseer, 2016.
Hubara et al. (2017) I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869–6898, 2017.
Jacot et al. (2018) A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.
LeCun et al. (1998) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Li et al. (2018) D. Li, T. Ding, and R. Sun. Over-parameterized deep neural networks have no strict local minima for any continuous activations. arXiv preprint arXiv:1812.11039, 2018.
Li and Liang (2018) Y. Li and Y. Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. arXiv preprint arXiv:1808.01204, 2018.
Liang et al. (2021) T. Liang, J. Glossner, L. Wang, S. Shi, and X. Zhang. Pruning and quantization for deep neural network acceleration: A survey. Neurocomputing, 461:370–403, 2021.
Marchesi et al. (1993) M. Marchesi, G. Orlandi, F. Piazza, and A. Uncini. Fast neural networks without multipliers. IEEE transactions on Neural Networks, 4(1):53–62, 1993.
Minh et al. (2006) H. Q. Minh, P. Niyogi, and Y. Yao. Mercer’s theorem, feature maps, and smoothing. In International Conference on Computational Learning Theory, pages 154–168. Springer, 2006.
Rastegari et al. (2016) M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European conference on computer vision, pages 525–542. Springer, 2016.
Simon et al. (2021) J. B. Simon, M. Dickens, and M. R. DeWeese. Neural tangent kernel eigenvalues accurately predict generalization. arXiv preprint arXiv:2110.03922, 2021.
Soudry and Carmon (2016) D. Soudry and Y. Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.
Weinan et al. (2019) E. Weinan, J. Han, and Q. Li. A mean-field optimal control formulation of deep learning. Research in the Mathematical Sciences, 6(1):1–41, 2019.
Xiao et al. (2017) H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017.
Zhang et al. (2016) C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Appendix A BinaryConnect Algorithm

In this section, we briefly review the BinaryConnect algorithm [Courbariaux et al., 2015], which is the key algorithm we are studying in this paper.

Data: Training data

\{\boldsymbol{x}_{i},z_{i}\}

, (Randomly) initialized model parameters

\theta^{(0)}

,learning rate

\eta

for $t$ in $[0\dots T]$ do

Select a minibatch

\{\boldsymbol{x}_{b},z_{b}\}

from training data;

Quantization:

w^{(t)}\leftarrow Quantize(\theta^{(t)})

;

Forward propagation: compute

\{y_{b}\}

using

\{\boldsymbol{x}_{b}\}

and

w^{(t)}

;

Backward propagation: compute

\frac{\partial\mathrm{loss}(y_{b},z_{b})}{\partial w^{(t)}}

using

\{y_{b}\},\{z_{b}\}

and

w^{(t)}

;

accumulate gradients:

\theta^{(t+1)}\leftarrow\theta^{(t)}+\eta\frac{\partial\mathrm{loss}(y_{b},z_{b})}{\partial w^{(t)}}

end for

Algorithm 1 Training binary weight neural network with BinaryConnect algorithm and SGD.

Appendix B Gaussian approximation in quantized neural network

B.1 Proof of Proposition 3

For the first statement, observe that fixing $k,k^{\prime}$ and taking $i$ as the variable, $w_{0,ki}w_{0,k^{\prime}i}$ are independent from each other. Furthermore, $\mathbb{E}[w_{0,ki}w_{0,k^{\prime}i}]=\delta_{k,k^{\prime}}$ . From the strong law of large number (SLLN), the first statement is proved. The third statement can be proved similarly, observing that $\mathbb{E}[w_{0,ki}w_{0,k^{\prime}i}\theta_{1,ij}^{2}]=\delta_{k,k^{\prime}}\mathrm{Var}[\theta]$ .

To prove the second statement, because geometric mean is no larger than cubic mean,

|w_{0,ki}||w_{0,k^{\prime}i}||w_{0,k^{\prime}i}|\leq\frac{1}{3}(|w_{0,ki}|^{3}+|w_{0,k^{\prime}i}|^{3}+|w_{0,k^{\prime}i}|^{3}),

Since $w_{0,ki}\sim\mathcal{N}(0,1)$ , the expectation of the right hand side equals $\sqrt{\frac{8}{\pi}}$ . We apply SLLN again to finish the proof.

B.2 Proof of Lemma 4

We first compute the conditioned mean and variance $\nu_{1,j}$ and $\varsigma_{1,j}$ . Notice that conditioned on $\Theta$ , $x_{1}$ is deterministic,

	$\displaystyle\nu_{1,j}$	$\displaystyle=\mathbb{E}_{w_{1}}\left[y_{1,j}\middle\|\Theta\right]$
		$\displaystyle=\sqrt{\frac{c}{d_{1}}}\sum_{i=1}^{d_{1}}\mathbb{E}_{w_{1}}[w_{1,ij}]x_{1,i}+\beta b_{1,j}$
		$\displaystyle=\sqrt{\frac{c}{d_{1}}}\sum_{i=1}^{d_{1}}\theta_{1,ij}x_{1,i}+\beta b_{1,j}$
	$\displaystyle\varsigma_{1,j}$	$\displaystyle=\mathrm{Var}_{w_{1}}\left[y_{1,j}\middle\|\Theta\right]$
		$\displaystyle=\mathbb{E}_{w_{1}}\left[y_{1,j}^{2}\middle\|\Theta\right]-\mathbb{E}_{w_{1}}\left[y_{1,j}\middle\|\Theta\right]^{2}$
		$\displaystyle=\frac{c}{d_{1}}\sum_{i=1}^{d_{1}}\sum_{i^{\prime}=1}^{d_{1}}\mathbb{E}_{w_{1}}\left[w_{1,ij}w_{1,i^{\prime}j}\middle\|\Theta\right]x_{1,i}x_{1,i^{\prime}}+2\beta b_{1,j}\sqrt{\frac{c}{d_{1}}}\sum_{i=1}^{d_{1}}\mathbb{E}\left[w_{1,ij}\middle\|\Theta\right]x_{1,i}$
		$\displaystyle\quad-\frac{c}{d_{1}}\sum_{i=1}^{d_{1}}\sum_{i^{\prime}=1}^{d_{1}}\theta_{1,ij}\theta_{1,i^{\prime}j}x_{1,i}x_{1,i^{\prime}}-2\beta b_{1,j}\sqrt{\frac{c}{d_{1}}}\sum_{i=1}^{d_{1}}\theta_{1,ij}x_{1,i}$
		$\displaystyle=\frac{c}{d_{1}}\sum_{i=1}^{d_{1}}\sum_{i^{\prime}=1}^{d_{1}}(\mathbb{E}[w_{1,ij}w_{1,i^{\prime}j}\|\Theta]-\theta_{1,ij}\theta_{1,i^{\prime}j})x_{1,i}^{2}$
		$\displaystyle=\frac{c}{d_{1}}\sum_{i=1}^{d_{1}}(\mathbb{E}[w_{1,ij}^{2}\|\Theta]-\theta_{1,ij}^{2})x_{1,i}^{2}$
		$\displaystyle=\frac{c}{d_{1}}\sum_{i=1}^{d_{1}}(1-\theta_{1,ij}^{2})x_{1,i}^{2}.$

The second line is because $\mathbb{E}_{w_{1}}[w_{1,ij}w_{1,i^{\prime}j}|\Theta]=\mathbb{E}_{w_{1}}[w_{1,ij}|\Theta]\mathbb{E}[w_{1,i^{\prime}j}|\Theta]=\theta_{1,ij}\theta_{1,i^{\prime}j}$ when $i\neq i^{\prime}$ .

Next, we need to prove $y_{1,i}$ converge to Gaussian distribution conditioned on $\Theta\in\mathcal{G}$ by verifying Lyapunov’s condition. Note that for any $j\in[d_{2}]$ ,

y_{1,j}=\sqrt{\frac{c}{d_{1}}}\sum_{i=1}^{d_{1}}w_{1,ij}x_{1,i}+b_{1,j}

Define $X_{i}=w_{1,ij}x_{1,i}$ . As mentioned above, its mean and variance (conditioned on $\Theta$ ) is

\displaystyle\mathbb{E}_{w_{1}}[X_{i}|\Theta]

\displaystyle=\theta_{1,ij}x_{1,i},

\displaystyle\mathrm{Var}_{w_{1}}[X_{i}|\Theta]

\displaystyle=\mathbb{E}_{w_{1}}[X_{i}^{2}|\Theta]-\mathbb{E}_{w_{1}}[X_{i}|\Theta]^{2}=(1-\theta_{1,ij}^{2})x_{1,i}^{2}

Since $\Theta\in\mathcal{G}$ , $\forall j\in[d_{2}]$ for some finite $d_{2}$ ,

$\displaystyle\lim_{d_{1}\rightarrow\infty}\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}\mathrm{Var}_{w_{1}}[X_{i}\|\Theta]$	$\displaystyle=\lim_{d_{1}\rightarrow\infty}\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}(1-\theta_{1,ij}^{2})x_{1,i}^{2}$	(13)
	$\displaystyle=\lim_{d_{1}\rightarrow\infty}\frac{1}{dd_{1}}\sum_{i=1}^{d_{1}}(1-\theta_{1,ij}^{2})\Big{(}\sum_{k=1}^{d}w_{0,ki}x_{k}\Big{)}^{2}$
	$\displaystyle=\lim_{d_{1}\rightarrow\infty}\frac{1}{dd_{1}}\sum_{i=1}^{d_{1}}\sum_{k,k^{\prime}=1}^{d}(1-\theta_{1,ij}^{2})w_{0,ki}w_{0,k^{\prime}i}x_{k}x_{k^{\prime}}$
	$\displaystyle=\lim_{d_{1}\rightarrow\infty}\sum_{k,k^{\prime}=1}^{d}(1-\mathrm{Var}[\theta])\delta_{k,k^{\prime}}x_{k}x_{k^{\prime}}=1-\mathrm{Var}[\theta]$

The fourth equality comes from the definition of $\mathcal{G}$ , and the fifth equality is because $\|x\|_{2}=1$ . The third order absolute momentum can be bounded by

		$\displaystyle\lim_{d_{1}\rightarrow\infty}\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}\mathbb{E}_{w_{1}}\big{[}\|X_{i}-\mathbb{E}_{w_{1}}[X_{i}\|\Theta]\|^{3}\big{\|}\Theta\big{]}$		(14)
		$\displaystyle=\lim_{d_{1}\rightarrow\infty}\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}\mathbb{E}_{w_{1}}\big{[}\|(w_{1,ij}-\theta_{1,ij})x_{1,i}\|^{3}\big{\|}\Theta\big{]}$
		$\displaystyle\leq\lim_{d_{1}\rightarrow\infty}\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}\mathbb{E}_{w_{1}}\big{[}8\|x_{1,i}\|^{3}\big{\|}\Theta\big{]}$
		$\displaystyle=\lim_{d_{1}\rightarrow\infty}\frac{8}{d_{1}}\sum_{i=1}^{d_{1}}\Big{\|}\sum_{k=1}^{d}w_{0,ki}x_{k}\Big{\|}^{3}$
		$\displaystyle\leq\lim_{d_{1}\rightarrow\infty}\frac{8}{d_{1}}\sum_{i=1}^{d_{1}}\Bigg{(}\sum_{k=1}^{d}\|w_{0,ki}x_{k}\|\Bigg{)}^{3}$
		$\displaystyle=\lim_{d_{1}\rightarrow\infty}\frac{8}{d_{1}}\sum_{i=1}^{d_{1}}\sum_{k,k^{\prime},k^{\prime\prime}=1}^{d}\|w_{0,ki}w_{0,k^{\prime}i}w_{0,k^{\prime\prime}i}\|\|x_{k}x_{k^{\prime}}x_{k^{\prime\prime}}\|$
		$\displaystyle\leq 8\sqrt{\frac{8}{\pi}}d^{3}$

The last inequality comes from the definition of “Good Initialization”: for all $\Theta\in\mathcal{G}$ ,

\lim_{d_{1}\rightarrow\infty}\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}w_{0,ki}w_{0,k^{\prime}i}w_{0,k^{\prime}i}\leq\sqrt{\frac{8}{\pi}},

and because $\|x\|_{2}=1,|x_{k}|\leq 1$ for all $k\in[d]$ . Note that using the strong law of large number, one can prove that the third order absolute momentum converges almost surely to a constant that doesn’t depend on $d$ . On the other hand, we are proving a upper bound for all $\Theta\in\mathcal{G}$ which is stronger than almost surely converge.

		$\displaystyle\lim_{d_{1}\rightarrow\infty}\frac{\sum_{i=1}^{d_{1}}\mathbb{E}_{w_{1}}[\|X_{i}-\mathbb{E}_{w_{1}}[X_{i}\|\Theta]\|^{3}\|\Theta]}{(\sum_{i=1}^{d_{1}}\mathrm{Var}_{w_{1}}[X_{i}\|\Theta])^{3/2}}$
		$\displaystyle=\lim_{d_{1}\rightarrow\infty}\frac{\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}\mathbb{E}_{w_{1}}[\|X_{i}-\mathbb{E}_{w_{1}}[X_{i}\|\Theta]\|^{3}\|\Theta]}{{d_{1}^{1/2}}(\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}\mathrm{Var}_{w_{1}}[X_{i}\|\Theta])^{3/2}}$
		$\displaystyle=\frac{\lim_{d_{1}\rightarrow\infty}\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}\mathbb{E}_{w_{1}}[\|X_{i}-\mathbb{E}_{w_{1}}[X_{i}\|\Theta]\|^{3}\|\Theta]}{\lim_{d_{1}\rightarrow\infty}{d_{1}^{1/2}}(\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}\mathrm{Var}_{w_{1}}[X_{i}\|\Theta])^{3/2}}$
		$\displaystyle\leq\frac{8\sqrt{\frac{8}{\pi}}d^{3}}{\lim_{d_{1}\rightarrow\infty}d_{1}^{1/2}(1-\mathrm{Var}[\theta])}$
		$\displaystyle=0$

This proves that Lyapunov’s condition for all “Good Initialization”, so conditioned on $\Theta\in\mathcal{G}$ , $y_{1,j}$ converges to Gaussian distribution.

B.3 Proof of Lemma 5

To compute $\mathbb{E}_{w_{1}}[x_{2,j}|\Theta]$ and $\mathrm{Var}_{w_{1}}[x_{2,j}|\Theta]$ , we first compute $\mathbb{E}_{w_{1}}[\psi(y_{1,j})|\Theta]$ and $\mathbb{E}_{w_{1}}[\psi(y_{1,j})^{2}|\Theta]$ . Recall $\psi(x)=x\mathbf{1}(x\geq 0)$ ,

	$\displaystyle\mathbb{E}_{w_{1}}[\psi(y_{1,j})\|\Theta]$	$\displaystyle=\int_{0}^{\infty}x\frac{1}{\sqrt{2\pi}\varsigma_{1,j}}\exp\left(-\frac{1}{2}\frac{(x-\nu_{1,j})^{2}}{\varsigma_{1,j}^{2}}\right)dx$
		$\displaystyle=\int_{-\frac{\nu_{1,j}}{\varsigma_{1,j}}}^{\infty}(\varsigma_{1,j}y+\nu_{1,j})\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^{2}\right)dy$
		$\displaystyle=\varsigma_{1,j}\int_{-\frac{\nu_{1,j}}{\varsigma_{1,j}}}^{\infty}\frac{1}{\sqrt{2\pi}}y\exp\left(-\frac{1}{2}y^{2}\right)dy+\mu_{\ell,i}\int_{-\frac{\nu_{1,j}}{\varsigma_{1,j}}}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^{2}\right)dy,$

	$\displaystyle\mathbb{E}_{w_{1}}[\psi(y_{1,j})^{2}\|\Theta]$	$\displaystyle=\int_{0}^{\infty}x^{2}\frac{1}{\sqrt{2\pi}\varsigma_{1,j}}\exp\left(-\frac{1}{2}\frac{(x-\nu_{1,j})^{2}}{\varsigma_{1,j}^{2}}\right)dx$
		$\displaystyle=\int_{-\frac{\nu_{1,j}}{\varsigma_{1,j}}}^{\infty}(\varsigma_{1,j}y+\nu_{1,j})^{2}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^{2}\right)dy$
		$\displaystyle=\varsigma_{1,j}^{2}\int_{-\frac{\nu_{1,j}}{\varsigma_{1,j}}}^{\infty}y^{2}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}\hat{y}^{2}\right)dy+2\varsigma_{1,j}\nu_{1,j}\int_{-\frac{\nu_{1,j}}{\varsigma_{1,j}}}^{\infty}y\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^{2}\right)dy$
		$\displaystyle\quad+\nu_{1,j}^{2}\int_{-\frac{\nu_{1,j}}{\varsigma_{1,j}}}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^{2}\right)dy.$

We only need to compute

\int_{-\frac{\nu_{1,j}}{\varsigma_{1,j}}}^{\infty}\frac{1}{\sqrt{2\pi}}y^{\alpha}\exp\left(-\frac{1}{2}y^{2}\right)dy.

For $\alpha=0,1,2$ . When $\alpha=0$ , this is integration to Gaussian function, and it is known that there’s no analytically function to express that. For sack of simplicity, define it as $s_{1,j}$

\displaystyle s_{1,j}

\displaystyle=\int_{\frac{\nu_{1,j}}{\varsigma_{1,j}}}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^{2}\right)dy:=\Phi(\frac{\nu_{1,j}}{\varsigma_{1,j}}).

When $\alpha=1$ , this integration can be simply solved by change of the variable and we denote it as $g_{1,j}$ :

\displaystyle g_{1,j}

\displaystyle=\int_{-\frac{\nu_{1,j}}{\varsigma_{1,j}}}^{\infty}y\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^{2}\right)dy=\sqrt{\frac{1}{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{\nu_{1,j}}{\varsigma_{1,j}}\right)^{2}\right).

When $\alpha=2$ , we can do integration by parts and express it using $s_{1,j}$ and $g_{1,j}$ :

		$\displaystyle\int_{-\frac{\nu_{1,j}}{\varsigma_{1,j}}}^{\infty}y^{2}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^{2}\right)dy$
		$\displaystyle\quad=-\int_{-\frac{\nu_{1,j}}{\varsigma_{1,j}}}^{\infty}y\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^{2}\right)d\frac{1}{2}y^{2}$
		$\displaystyle\quad=\int_{-\frac{\nu_{1,j}}{\varsigma_{1,j}}}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^{2}\right)dy-\frac{\nu_{1,j}}{\varsigma_{1,j}}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{\nu_{1,j}}{\varsigma_{1,j}}\right)^{2}\right)$
		$\displaystyle\quad=s_{1,j}-\frac{\nu_{1,j}}{\varsigma_{1,j}}g_{1,j}.$

Using the definition of mean and variance,

\mu_{2,j}=\mathbb{E}_{w_{1}}[\psi(y_{1,j})|\Theta],\sigma_{2,i}^{2}=\mathbb{E}_{w_{1}}[\psi(y_{1,j})^{2}|\Theta]-\mathbb{E}_{w_{1}}[\psi(y_{1,j})|\Theta]^{2},

we can come to the result.

B.4 Proof of Theorem 6

In this part, we take $\Theta=\{w_{0},\theta_{1},w_{2},b_{0},b_{1},b_{2}\}$ as the random variables and conditioned mean and variance derived above $\mu_{1},\sigma_{1},\nu_{1},\varsigma_{1}$ as functions to $\Theta$ . From Eq. (4), as $d_{1}\rightarrow\infty$ , $v_{1}$ tend to iid Gaussian processes, and there covariance converges almost surely to its expectation. We then focus on computing the expectation of covariance. For any $j\neq j^{\prime}$ , we take the expectation over random initialization of $\Theta$ :

		$\displaystyle\mathbb{E}_{\Theta}[\nu_{1,j}\nu_{1,j^{\prime}}]$		(15)
		$\displaystyle=\mathbb{E}_{\Theta}\left[\frac{c}{d_{1}}\sum_{i=1}^{d_{1}}\sum_{i^{\prime}=1}^{d_{1}}\theta_{1,ij}\theta_{1,i^{\prime}j^{\prime}}x_{1,i}x_{1,i^{\prime}}+\beta\sqrt{\frac{c}{d_{1}}}\sum_{i=1}^{d_{1}}(\theta_{1,ij}x_{1,i}b_{j}+\theta_{1,ij^{\prime}}x_{1,i}b_{j^{\prime}})+\beta^{2}b_{j}b_{j^{\prime}}\right]$
		$\displaystyle=\frac{c}{d_{1}}\sum_{i=1}^{d_{1}}\sum_{i^{\prime}=1}^{d_{1}}\mathbb{E}_{\Theta}[\theta_{1,ij}]\mathbb{E}_{\Theta}[\theta_{1,i^{\prime}j^{\prime}}]\mathbb{E}_{\Theta}[x_{1,i}x_{1,i^{\prime}}]+\beta^{2}\mathbb{E}_{\Theta}[b_{j}b_{j^{\prime}}]$
		$\displaystyle\quad+\sqrt{\frac{c}{d_{1}}}\beta\sum_{i=1}^{d_{1}}(\mathbb{E}_{\Theta}[\theta_{1,ij}]\mathbb{E}_{\Theta}[x_{1,i}]\mathbb{E}_{\Theta}[b_{j}]+\mathbb{E}_{\Theta}[\theta_{1,ij^{\prime}}]\mathbb{E}_{\Theta}[x_{1,i}]\mathbb{E}_{\Theta}[b_{j^{\prime}}])$
		$\displaystyle=0$

which indicates that they are independent.

Computation of $\varsigma_{1,j}$ was already finished implicitly in Section B.2. We write it explicitly here. From (4), on the limit $d_{1}\rightarrow\infty$ ,

	$\displaystyle\varsigma_{1,j}^{2}$	$\displaystyle=\frac{c}{d_{1}}\sum_{i=1}^{d_{1}}(1-\theta_{1,ij}^{2})x_{1,i}^{2}$
		$\displaystyle=\frac{c}{dd_{1}}\sum_{i=1}^{d_{1}}(1-\theta_{1,ij}^{2})\sum_{k,k^{\prime}=1}^{d}w_{0,ki}w_{0,k^{\prime}i}x_{k}x_{k^{\prime}}$
		$\displaystyle=\frac{c}{d}\sum_{k,k^{\prime}=1}^{d}x_{k}x_{k^{\prime}}\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}w_{0,ki}w_{0,k^{\prime}i}(1-\theta_{1,ij}^{2})$
		$\displaystyle=\frac{c}{d}\sum_{k,k^{\prime}=1}^{d}x_{k}x_{k^{\prime}}\delta_{k,k^{\prime}}(1-\mathrm{Var}[\theta])$
		$\displaystyle=\frac{c}{d}(1-\mathrm{Var}[\theta])\\|x\\|_{2}^{2}$
		$\displaystyle=\frac{c}{d}(1-\mathrm{Var}[\theta])$

The fourth line comes from the definition of $\mathcal{G}$ .

B.5 Derivative of activation function in quasi neural network

Let

\displaystyle\tilde{\psi}(x)=\tilde{\varsigma}_{1}\phi\left(\frac{x}{\tilde{\varsigma}}\right)+x\Phi\left(\frac{x}{\tilde{\varsigma}}\right),

Its derivative is

	$\displaystyle\tilde{\psi}^{\prime}(x)$	$\displaystyle=\varphi^{\prime}\left(\frac{x}{\tilde{\varsigma}}\right)+\Phi\left(\frac{x}{\tilde{\varsigma}}\right)+\frac{x}{\tilde{\varsigma}}\Phi^{\prime}\left(\frac{x}{\tilde{\varsigma}}\right)$
		$\displaystyle=-\frac{x}{\tilde{\varsigma}}\varphi\left(\frac{x}{\tilde{\varsigma}}\right)+\Phi\left(\frac{x}{\tilde{\varsigma}}\right)+\frac{x}{\tilde{\varsigma}}\varphi\left(\frac{x}{\tilde{\varsigma}}\right)$
		$\displaystyle=\Phi\left(\frac{x}{\tilde{\varsigma}}\right)$

The second line is because

	$\displaystyle\Phi^{\prime}(x)$	$\displaystyle=\varphi(x),$
	$\displaystyle\varphi^{\prime}(x)$	$\displaystyle=\frac{d}{dx}\sqrt{\frac{1}{2\pi}}\exp\left(-\frac{1}{2}x^{2}\right)$
		$\displaystyle=-x\sqrt{\frac{1}{2\pi}}\exp\left(-\frac{1}{2}x^{2}\right)$
		$\displaystyle=-x\varphi(x).$

B.6 Proof of Theorem 7

To make the proof more general, we make $\varsigma_{1,j}$ a parameter of the activation function in quasi neural network as $\tilde{\psi}(\cdot;\varsigma_{1,j})$ . To get the derivative with respect to $\theta_{1,ij}$ , we first get the derivative with respect to $\nu_{1,j}$ .

\displaystyle\frac{\partial\bar{y}}{\partial\nu_{1,j}}

\displaystyle=\frac{\partial\bar{y}}{\partial\mu_{2,j}}\frac{\partial\mu_{2,j}}{\partial\nu_{1,j}}=\sqrt{\frac{1}{d_{2}}}w_{2,j}\tilde{\psi}^{\prime}(\nu_{1,j};\varsigma_{1,j})

then apply chain rule:

$\displaystyle\frac{\partial\bar{y}}{\partial w_{2,j}}$	$\displaystyle=\sqrt{\frac{c}{d_{2}}}\mu_{2,j},$	(16)
$\displaystyle\frac{\partial\bar{y}}{\partial b_{1,j}}$	$\displaystyle=\frac{\partial\bar{y}}{\partial\nu_{1,j}}\frac{\partial\nu_{1,j}}{\partial b_{1,j}}=\sqrt{\frac{c}{d_{2}}}\beta w_{2,j}\tilde{\psi}^{\prime}(\nu_{1,j};\varsigma_{1,j}),$	(17)
$\displaystyle\frac{\partial\bar{y}}{\partial\theta_{1,ij}}$	$\displaystyle=\frac{\partial\bar{y}}{\partial\nu_{1,j}}\frac{\partial\nu_{1,j}}{\partial\theta_{1,ij}}=\sqrt{\frac{c}{d_{1}d_{2}}}w_{2,j}x_{1,i}\tilde{\psi}^{\prime}(\nu_{1,j};\varsigma_{1,j}).$	(18)

On the other hand, let’s first write down the gradient with respect to weights $w_{ij}$ in quantized neural network and take their expectation conditioned on $\Theta$ :

$\displaystyle\mathbb{E}_{w_{1}}\left[\frac{\partial y}{\partial w_{2,j}}\middle\|\Theta\right]$	$\displaystyle=\sqrt{\frac{c}{d_{2}}}\mathbb{E}_{w_{1}}[x_{2,j}\|\Theta],$	(19)
$\displaystyle\mathbb{E}_{w_{1}}\left[\frac{\partial y}{\partial b_{2,j}}\middle\|\Theta\right]$	$\displaystyle=\mathbb{E}_{w_{1}}\left[\frac{\partial\bar{y}}{\partial y_{1,j}}\frac{\partial y_{1,j}}{\partial b_{2,j}}\middle\|\Theta\right]=\sqrt{\frac{c}{d_{2}}}\beta w_{2,j}\mathbb{E}_{w_{1}}[\psi^{\prime}(y_{1,j})\|\Theta],$	(20)
$\displaystyle\mathbb{E}_{w_{1}}\left[\frac{\partial y}{\partial w_{1,ij}}\middle\|\Theta\right]$	$\displaystyle=\mathbb{E}_{w_{1}}\left[\frac{\partial y}{\partial y_{1,j}}\frac{\partial y_{1,j}}{\partial w_{1,ij}}\middle\|\Theta\right]=\sqrt{\frac{c}{d_{1}d_{2}}}w_{2,j}x_{1,i}\mathbb{E}_{w_{1}}[\psi^{\prime}(y_{1,j})\|\Theta],$	(21)

Comparing (16) with (19), one can easily find they are equal since by definition $\mu_{2,j}=\mathbb{E}_{w_{1}}[x_{2,j}|\Theta]$ . On the other hand, from (7), one can tell using continuous mapping theorem that

\displaystyle\tilde{\psi}^{\prime}(\nu_{1,j};\varsigma_{1,j})=\Phi\left(\frac{\nu_{1,j}}{\varsigma_{1,j}}\right)=P[y_{1,j}\geq 0]=\mathbb{E}_{w_{1}}[\psi^{\prime}(y_{1,j})|\Theta],

which shows (17) equals (20) and (18) equals (21).

B.7 Proof of Theorem 8

Observe that conditioned on $\Theta$ , $y_{1,j}$ depends only on $\{w_{1,ij},i\in[d_{1}]\}$ , and that $\{w_{1,ij},i\in[d_{1}]\}\cap\{w_{1,ij^{\prime}},i\in[d_{1}]\}=\emptyset$ for $j\neq j^{\prime}$ . Because of that, $y_{1,j}$ are independent of each other. Similarly, $x_{2,j}$ are independent of each other conditioned on $\Theta$ . For MSE loss,

\displaystyle\mathrm{loss}(y)=\frac{1}{2}(y-z)^{2},\frac{dl(y)}{dy}=y-z,

According to the chain rule

\displaystyle\frac{\partial\mathrm{loss}(\bar{y})}{\partial\theta}=\frac{\partial\mathrm{loss}(\bar{y})}{\partial\bar{y}}\frac{\partial\bar{y}}{\partial\theta}=(\bar{y}-z)\frac{\partial\bar{y}}{\partial\theta},

(22)

for any $\theta\in\{\theta_{1,ij},b_{1,j},w_{2,j}\}$ , which leads to

$\displaystyle\frac{\partial\mathrm{loss}(\bar{y})}{\partial w_{2,j}}$	$\displaystyle=\sqrt{\frac{c}{d_{2}}}(\bar{y}-z)\mu_{2,j},$	(23)
$\displaystyle\frac{\partial\mathrm{loss}(\bar{y})}{\partial b_{1,j}}$	$\displaystyle=\frac{\partial\bar{y}}{\partial\nu_{1,j}}\frac{\partial\nu_{1,j}}{\partial b_{1,j}}=\sqrt{\frac{c}{d_{2}}}\beta w_{2,j}(\bar{y}-z)\tilde{\psi}^{\prime}(\nu_{1,j};\varsigma_{1,j}),$	(24)
$\displaystyle\frac{\partial\mathrm{loss}(\bar{y})}{\partial\theta_{1,ij}}$	$\displaystyle=\frac{\partial\bar{y}}{\partial\nu_{1,j}}\frac{\partial\nu_{1,j}}{\partial\theta_{1,ij}}=\sqrt{\frac{c}{d_{1}d_{2}}}w_{2,j}x_{1,i}(\bar{y}-z)\tilde{\psi}^{\prime}(\nu_{1,j};\varsigma_{1,j}).$	(25)

On the other hand, in the original binary weight neural network,

$\displaystyle\mathbb{E}_{w_{1}}\left[\frac{\partial\mathrm{loss}(y)}{\partial w_{2,j}}\middle\|\Theta\right]$	$\displaystyle=\sqrt{\frac{c}{d_{2}}}\mathbb{E}_{w_{1}}\left[(y-z)x_{2,j}\|\Theta\right],$	(26)
$\displaystyle\mathbb{E}_{w_{1}}\left[\frac{\partial\mathrm{loss}(y)}{\partial b_{1,j}}\middle\|\Theta\right]$	$\displaystyle=\sqrt{\frac{c}{d_{2}}}\beta w_{2,j}\mathbb{E}_{w_{1}}[(y-z)\psi^{\prime}(y_{1,j})\|\Theta],$	(27)
$\displaystyle\mathbb{E}_{w_{1}}\left[\frac{\partial\mathrm{loss}(y)}{\partial w_{1,ij}}\middle\|\Theta\right]$	$\displaystyle=\sqrt{\frac{c}{d_{1}d_{2}}}w_{2,j}x_{1,i}\mathbb{E}_{w_{1}}[(y-z)\psi^{\prime}(y_{1,j})\|\Theta],$	(28)

Note that $y$ is not independent form $x_{2,j}$ or $\psi^{\prime}(y_{1,j})$ , which is the main challenge of the proof. To deal with this problem, we bound the difference between (23)-(25) and (26)-(28), which requires bounding their covariance.

$\displaystyle\mathbb{E}_{w_{1}}[x_{2,j}y\|\Theta]$	$\displaystyle=\sqrt{\frac{c}{d_{2}}}\mathbb{E}_{w_{1}}\left[x_{2,j}\sum_{j=1}^{d_{2}}w_{2,j}x_{2,j}\middle\|\Theta\right]$	(29)
	$\displaystyle=\sqrt{\frac{c}{d_{2}}}\left(\mathbb{E}_{w_{1}}\left[x_{2,j}^{2}w_{2,j}\middle\|\Theta\right]+\sum_{j^{\prime}\neq j}\mathbb{E}_{w_{1}}\left[x_{2,j}x_{2,j^{\prime}}w_{2,j^{\prime}}\middle\|\Theta\right]\right)$
	$\displaystyle=\sqrt{\frac{c}{d_{2}}}\left((\mu_{2,j}^{2}+\sigma_{2,j}^{2})w_{2,j}+\sum_{j^{\prime}\neq j}\mu_{2,j}\mu_{2,j^{\prime}}w_{2,j^{\prime}}\right)$
	$\displaystyle=\sqrt{\frac{c}{d_{2}}}\left(\sigma_{2,j}^{2}w_{2,j}+\sum_{j^{\prime}=1}^{d_{2}}\mu_{2,j}\mu_{2,j^{\prime}}w_{2,j^{\prime}}\right)$

The second term equals $\mathbb{E}_{w_{1}}[x_{2,j}|\Theta]\mathbb{E}_{w_{1}}[y|\Theta]$ and the first term converges to 0 when $d_{2}\rightarrow\infty$ . Taking it into (23) and (26), we can see that these two terms are equal on the limit $d_{1}\rightarrow\infty$ .

Similarly,

		$\displaystyle\mathbb{E}_{w_{1}}[\psi^{\prime}(y_{1,j})y\|\Theta]$		(30)
		$\displaystyle=\mathbb{E}_{w_{1}}\left[\sqrt{\frac{c}{d_{2}}}\psi^{\prime}(y_{1,j})\sum_{j=1}^{d_{2}}w_{2,j}\psi(y_{1,j})\middle\|\Theta\right]$
		$\displaystyle=\sqrt{\frac{c}{d_{2}}}\left(\mathbb{E}_{w_{1}}\left[\psi(y_{1,j})\psi^{\prime}(y_{1,j})w_{2,j}\middle\|\Theta\right]+\sum_{j^{\prime}\neq j}\mathbb{E}_{w_{1}}\left[\psi(y_{1,j^{\prime}})\psi^{\prime}(y_{1,j})w_{2,j^{\prime}}\middle\|\Theta\right]\right)$
		$\displaystyle=\sqrt{\frac{c}{d_{2}}}\left(\mathbb{E}_{w_{1}}\left[\psi(y_{1,j})w_{2,j}\middle\|\Theta\right]w_{2,j}+\sum_{j^{\prime}\neq j}\mathbb{E}_{w_{1}}[\psi^{\prime}(y_{1,j})\|\Theta]\mu_{2,j^{\prime}}w_{2,j^{\prime}}\right)$
		$\displaystyle=\sqrt{\frac{c}{d_{2}}}\left((1-\mathbb{E}_{w_{1}}[\psi^{\prime}(y_{1,j})\|\Theta])\mu_{2,j}w_{2,j}+\sum_{j^{\prime}=1}^{d_{2}}\mathbb{E}_{w_{1}}[\psi^{\prime}(y_{1,j})\|\Theta]\mu_{2,j^{\prime}}w_{2,j^{\prime}}\right)$

The second term equals $\mathbb{E}[\psi^{\prime}(y_{2,j})|\Theta]\mathbb{E}[y|\Theta]$ and the first term converges to 0 when $d_{3}\rightarrow\infty$ . Taking it into (24)(25)(27)(28) finishes the proof.

B.8 Proof of 9

In this part, we denote $\dot{a}:=\frac{\partial a}{\partial t}$ for $a\in\{w_{\ell},\theta_{\ell},b_{\ell}\}$ , and express each time-depent variable as a function of time $t$ . We define an inner product under the distribution of training dataset

\langle\boldsymbol{a},\boldsymbol{b}\rangle_{in}=\mathbb{E}_{in}[{a}(x){b}(x)],

and the corresponding norm

\|\boldsymbol{a}\|_{in}=\sqrt{\langle\boldsymbol{a},\boldsymbol{a}\rangle_{in}}=\sqrt{\mathbb{E}_{in}[{a}(x)^{2}]}.

If $\boldsymbol{a}(x)$ is a vector, $\|\boldsymbol{a}\|_{in}:=\sqrt{\mathbb{E}_{in}[\|\boldsymbol{a}(x)\|^{2}]}$ . Note this inner product and norm define a Hilbert space (not to be confused with the RKHS induced by a kernel), so by Cauchy-Schwarz inequality,

\displaystyle|\langle\boldsymbol{a},\boldsymbol{b}\rangle_{in}|\leq\|\boldsymbol{a}\|_{in}\|\boldsymbol{b}\|_{in},\forall\boldsymbol{a},\boldsymbol{b}.

As is shown in 4.3.3, on the limit $d_{1},\rightarrow\infty$ , the dynamics of training this neural network using gradient descent can be written as:

	$\displaystyle\dot{w}_{2,j}(t)$	$\displaystyle=\sqrt{\frac{c}{d_{2}}}\mathbb{E}_{in}[(\bar{y}(t)-z)\mu_{2,j}(t)]$
	$\displaystyle\dot{b}_{1,j}(t)$	$\displaystyle=\sqrt{\frac{c}{d_{2}}}\mathbb{E}_{in}[\beta w_{2,j}(t)(\bar{y}(t)-z)\tilde{\psi}^{\prime}(\nu_{1,j}(t),\varsigma_{1,j}(t))],$
	$\displaystyle\dot{\theta}_{1,ij}(t)$	$\displaystyle=\sqrt{\frac{c}{d_{1}d_{2}}}\mathbb{E}_{in}[w_{2,j}(t)x_{1,i}(\bar{y}(t)-z)\tilde{\psi}^{\prime}(\nu_{1,j}(t);\varsigma_{1,j}(t))]$

where dot denotes the derivative with respect to $t$ . Note the activation function $\tilde{\psi}(\cdot;\varsigma_{1,j}(t))$ depends on $\varsigma_{1,j}$ , which makes it time dependent. One can further write down the dynamics of $\nu_{1,j}(t)$ as

\displaystyle\dot{\nu}_{1,j}(t)

\displaystyle=\sqrt{\frac{1}{d_{1}}}\sum_{i=1}^{d_{1}}\dot{\theta}_{1,ij}(t)x_{1,i}(t)+\dot{b}_{1,j}(t)

Rewrite these two differential equations in matrix form:

	$\displaystyle\dot{\boldsymbol{w}}_{2}(t)$	$\displaystyle=\sqrt{\frac{c}{d_{2}}}\mathbb{E}_{in}[(\bar{y}(t)-z)\boldsymbol{\mu}_{2}(t)]$
	$\displaystyle\dot{\boldsymbol{b}}_{1}(t)$	$\displaystyle=\beta\sqrt{\frac{c}{d_{2}}}\mathbb{E}_{in}[(\bar{y}(t)-z)(\tilde{\psi}^{\prime}(\boldsymbol{\nu}_{1}(t))\circ\boldsymbol{w}_{2}(t))],$
	$\displaystyle\dot{\boldsymbol{\theta}_{1}}(t)$	$\displaystyle=\sqrt{\frac{c}{d_{1}d_{2}}}\mathbb{E}_{in}\big{[}(\bar{y}(t)-z)\boldsymbol{x}_{1}\otimes\big{(}\tilde{\psi}^{\prime}(\boldsymbol{\nu}_{1}(t))\circ\boldsymbol{w}_{2}(t)\big{)}\big{]},$
	$\displaystyle\dot{\boldsymbol{\nu}}_{1}(t)$	$\displaystyle=\sqrt{\frac{1}{d_{1}}}\dot{\boldsymbol{\theta}}_{1}\boldsymbol{x}_{1}+\dot{\boldsymbol{b}}_{1}$

where $\circ$ denotes elementwise product and $\otimes$ denotes outer product. Here we slightly abuse the notation $\tilde{\psi}(\cdot)$ , which represents elementwise operation when applied to a vector. Their norm are bounded by

	$\displaystyle\frac{\partial}{\partial t}\\|{\boldsymbol{w}}_{2}(t)-{\boldsymbol{w}}_{2}(0)\\|$	$\displaystyle\leq\sqrt{\frac{c}{d_{2}}}\mathbb{E}_{in}[(\bar{y}(t)-z)\\|\boldsymbol{\mu}_{2}(t)\\|]=\sqrt{\frac{c}{d_{2}}}\langle\bar{y}(t)-z,\boldsymbol{\mu}_{2}(t)\rangle_{in}$		(31)
		$\displaystyle\leq\sqrt{\frac{c}{d_{2}}}\\|\bar{y}(t)-z\\|_{in}\\|\boldsymbol{\mu}_{2}(t)\\|_{in}\leq\sqrt{\frac{c}{d_{2}}}\\|\bar{y}(t)-z\\|_{in}\\|\boldsymbol{\nu}_{1}(t)\\|_{in}$		(31)

	$\displaystyle\frac{\partial}{\partial t}\\|{\boldsymbol{b}}_{1}(t)-{\boldsymbol{b}}_{1}(0)\\|$	$\displaystyle\leq\beta\sqrt{\frac{c}{d_{2}}}\mathbb{E}_{in}[(\bar{y}(t)-z)\\|\tilde{\psi}^{\prime}(\boldsymbol{\nu}_{1}(t))\circ\boldsymbol{w}_{2}(t)\\|]$		(32)
		$\displaystyle\leq\beta\sqrt{\frac{c}{d_{2}}}\mathbb{E}_{in}[(\bar{y}(t)-z)\\|\boldsymbol{w}_{2}(t)\\|]=\beta\sqrt{\frac{c}{d_{2}}}\\|\bar{y}(t)-z\\|_{in}\\|\boldsymbol{w}_{2}(t)\\|,$		(32)

$\displaystyle\frac{\partial}{\partial t}\\|{\boldsymbol{\theta}_{1}}(t)-{\boldsymbol{\theta}_{1}}(0)\\|_{F}$	$\displaystyle\leq\sqrt{\frac{c}{d_{1}d_{2}}}\mathbb{E}_{in}\big{[}(\bar{y}(t)-z)\\|\boldsymbol{x}_{1}\otimes\big{(}\tilde{\psi}^{\prime}(\boldsymbol{\nu}_{1}(t))\circ\boldsymbol{w}_{2}(t)\big{)}\\|_{F}\big{]}$	(33)
	$\displaystyle\leq\sqrt{\frac{c}{d_{1}d_{2}}}\mathbb{E}_{in}\big{[}(\bar{y}(t)-z)\\|\boldsymbol{x}_{1}\\|\\|\boldsymbol{w}_{2}(t)\\|\big{]}$
	$\displaystyle\leq\sqrt{\frac{c}{d_{1}d_{2}}}\\|\bar{y}(t)-z\\|_{in}\\|\boldsymbol{x}_{1}\\|_{in}\\|\boldsymbol{w}_{2}(t)\\|$
	$\displaystyle=\sqrt{\frac{c}{d_{2}}}\\|\bar{y}(t)-z\\|_{in}\\|\boldsymbol{w}_{2}(t)\\|,$

$\displaystyle\forall\boldsymbol{x}_{1},\quad\frac{\partial}{\partial t}\\|{\boldsymbol{\nu}}_{1}(t)-{\boldsymbol{\nu}}_{1}(0)\\|$	$\displaystyle\leq\int_{t=0}^{T}\Big{\\|}\frac{\partial{\boldsymbol{\nu}}_{1}(t)}{\partial t}\Big{\\|}dt$	(34)
	$\displaystyle\leq\sqrt{\frac{1}{d_{1}}}\frac{\partial}{\partial t}\\|\boldsymbol{\theta}_{1}(t)-\boldsymbol{\theta}_{1}(0)\\|_{op}\\|\boldsymbol{x}_{1}\\|+\frac{\partial}{\partial t}\\|{\boldsymbol{b}}_{1}(t)-{\boldsymbol{b}}_{1}(0)\\|$
	$\displaystyle\leq\sqrt{\frac{c}{d_{1}^{2}d_{2}}}\\|\bar{y}(t)-z\\|_{in}\\|\boldsymbol{w}_{2}(t)\\|\\|\boldsymbol{x}_{1}\\|_{in}\\|\boldsymbol{x}_{1}\\|$
	$\displaystyle\qquad+\beta\sqrt{\frac{c}{d_{2}}}\\|\bar{y}(t)-z\\|_{in}\\|\boldsymbol{w}_{2}(t)\\|$
	$\displaystyle=(1+\beta)\sqrt{\frac{c}{d_{2}}}\\|\bar{y}(t)-z\\|_{in}\\|\boldsymbol{w}_{2}(t)\\|,$
$\displaystyle\frac{\partial}{\partial t}\\|{\boldsymbol{\nu}}_{1}(t)-{\boldsymbol{\nu}}_{1}(0)\\|_{in}$	$\displaystyle\leq(1+\beta)\sqrt{\frac{c}{d_{2}}}\\|\bar{y}(t)-z\\|_{in}\\|\boldsymbol{w}_{2}(t)\\|.$

Here we make use of the fact that $\tilde{\phi}^{\prime}(x)\leq 1$ , $\tilde{\phi}(x)\leq x$ regardless of the value of $\varsigma_{1,j}(t)$ , that $\lim_{d_{1}\rightarrow\infty}\|\boldsymbol{x}_{1}\|_{in}/\sqrt{d_{1}}=1$ as long as $\Theta\in\mathcal{G}$ , and that $w_{0}$ is not updated during training. In the last equation, we make use of $\dot{\boldsymbol{\theta}}_{1}=\frac{\partial}{\partial t}(\boldsymbol{\theta}_{1}(t)-\boldsymbol{\theta}_{1}(0)),\dot{\boldsymbol{b}}_{1}=\frac{\partial}{\partial t}(\boldsymbol{b}_{1}(t)-\boldsymbol{b}_{1}(0)).$

Define $A(t)=\sqrt{\frac{c}{d_{2}}}\sqrt{1+\beta}(\|{\boldsymbol{w}}_{2}(t)-{\boldsymbol{w}}_{2}(0)\|+\|{\boldsymbol{w}}_{2}(0)\|)+\sqrt{\frac{c}{d_{2}}}(\|{\boldsymbol{\nu}}_{1}(t)-{\boldsymbol{\nu}}_{1}(0)\|_{in}+\|{\boldsymbol{\nu}}_{1}(0)\|_{in})$ , then

	$\displaystyle\dot{A}(t)$	$\displaystyle\leq\sqrt{1+\beta}\sqrt{\frac{c}{d_{2}}}\\|\bar{y}(t)-z\\|_{in}\\|\boldsymbol{\nu}_{1}(t)\\|_{in}+(1+\beta)\sqrt{\frac{c}{d_{2}}}\\|\bar{y}(t)-z\\|_{in}\\|\boldsymbol{w}_{2}(t)\\|$
		$\displaystyle\leq\sqrt{1+\beta}A(t)$

Observe that $A(0)$ is stochastically bounded. Using Grönwall’s Lemma, for any finite $T$ :

\displaystyle A(T)\leq A(0)\exp\Big{(}\int_{t=0}^{T}\sqrt{1+\beta}dt\Big{)}=A(0)\exp(\sqrt{1+\beta}T)

so $A(T)$ is stochastically bounded for all finite $T$ as $d_{2}\rightarrow\infty$ . Furthermore,

\displaystyle\sqrt{\frac{c}{d_{2}}}\|\boldsymbol{w}_{2}(T)\|\leq\sqrt{\frac{c}{d_{2}}}(\|\boldsymbol{w}_{2}(T)-\boldsymbol{w}_{2}(0)\|+\|\boldsymbol{w}_{2}(0)\|)

which is also stochastically bounded. Integrating (31)-(34) from 0 to $T$ finishes the proof.

B.9 Proof of 10

From (4), it’s easy to get the dynamics of $\varsigma_{1}$ :

	$\displaystyle\frac{\partial\varsigma_{1,j}^{2}(t)}{\partial t}$	$\displaystyle=-\frac{2c}{d_{1}}\sum_{i=1}^{d_{1}}\theta_{1,ij}(t)\dot{\theta}_{1,ij}(t)x_{1,i}^{2}$
	$\displaystyle\|\varsigma_{1,j}^{2}(T)-\varsigma_{1,j}^{2}(0)\|$	$\displaystyle\leq\frac{2c}{d_{1}}\sum_{i=1}^{d_{1}}x_{1,i}^{2}\int_{t=0}^{T}\|\theta_{1,ij}(t)\|\|\dot{\theta}_{1,ij}(t)\|dt$
		$\displaystyle\leq\frac{2c}{d_{1}}\sum_{i=1}^{d_{1}}x_{1,i}^{2}\int_{t=0}^{T}\|\dot{\theta}_{1,ij}(t)\|dt$
		$\displaystyle\leq\frac{2c}{d_{1}}\sqrt{\frac{c}{d_{1}d_{2}}}\sum_{i=1}^{d_{1}}x_{1,i}^{2}\int_{t=0}^{T}\mathbb{E}_{in}\big{\|}w_{2,j}(t)x_{1,i}(\bar{y}(t)-z)\tilde{\psi}^{\prime}(\nu_{1,j}(t);\varsigma_{1,j}(t))\big{\|}dt$
		$\displaystyle\leq\frac{2c}{d_{1}}\sqrt{\frac{c}{d_{1}d_{2}}}\sum_{i=1}^{d_{1}}x_{1,i}^{2}\int_{t=0}^{T}\|w_{2,j}(t)\|\mathbb{E}_{in}\big{\|}x_{1,i}(\bar{y}(t)-z)\big{\|}dt$
		$\displaystyle\leq\frac{2c}{d_{1}}\sqrt{\frac{c}{d_{1}d_{2}}}\sum_{i=1}^{d_{1}}x_{1,i}^{2}\int_{t=0}^{T}\|w_{2,j}(t)\|\\|x_{1,i}\\|_{in}\\|\bar{y}(t)-z\\|_{in}dt$
		$\displaystyle\leq\frac{2c}{d_{1}^{3/2}}\sum_{i=1}^{d_{1}}x_{1,i}^{2}\\|x_{1,i}\\|_{in}\int_{t=0}^{T}C(t)\\|\bar{y}(t)-z\\|_{in}dt$
		$\displaystyle\leq\frac{2c}{d_{1}^{3/2}}\sum_{i=1}^{d_{1}}x_{1,i}^{2}\\|x_{1,i}\\|_{in}\max_{t\in[0,T]}C(t)\int_{t=0}^{T}\\|\bar{y}(t)-z\\|_{in}dt\quad a.s.$

Here we assume that $\sqrt{\frac{c}{d_{2}}}\|\boldsymbol{w}_{2}(t)\|$ is stochastically bounded by $C(t)$ . Since $C(t)$ is finite for all $t\in[0,T]$ , it’s easy to check the term after $\max$ operator is stochastically bounded. The remaining task is to bound term before $\max$ operator. From standard Gaussian process analysis, $x_{1,i}$ satisfy Gaussian distribution. From the law of large number (LLN), as $d_{1}\rightarrow\infty$ ,

\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}x_{1,i}^{2}\|x_{1,i}\|_{in}=\mathbb{E}[x_{1,i}^{2}\|x_{1,i}\|_{in}]

almost surely, where the expectation is taken over $w_{1}$ , and this limit is also bounded. Because of that, as $d_{1},d_{2}\rightarrow\infty$ , the difference $|\varsigma_{1,j}^{2}(T)-\varsigma_{1,j}^{2}(0)|$ converges to 0 at rate $\frac{1}{\sqrt{d_{2}}}$ .

Notice that the proof of Lyapunov’s condition (LABEL:eq:Lyapunov) doesn’t depend on time $T$ from the third line. Since $\varsigma_{1,j}(T)$ stochastically converges to $\varsigma_{1,j}(0)$ for all finite $T$ , Lyapunov’s condition holds for all $T$ thus $x_{2,j}$ always converges to Gaussian distribution conditioned on model parameter.

Appendix C NTK of neural networks with quantized weights

C.1 Spherical harmonics

This subsection briefly reviews the relevant concepts and properties of spherical harmonics. Most part of this subsection comes from Bach [2017, appendix Section D.1.] and Bietti and Mairal [2019, appendix Section C.1.]

According to Mercer’s theorem, any positive definite kernel can be decomposed as

\mathcal{K}(x,x^{\prime})=\sum_{i}\lambda_{i}\Phi(x)\Phi(x^{\prime}),

where $\Phi(\cdot)$ is called the feature map. Furthermore, any zonal kernel on the unit sphere, i.e., $\mathcal{K}(x,x^{\prime})=\mathcal{K}(x^{T}x^{\prime})$ for any $x,x^{\prime}\in\mathbb{R}^{d},\|x\|_{2}=\|x^{\prime}\|_{2}=1$ , including exponential kernels and NTK, can be decomposed using spherical harmonics (equation (10)):

\mathcal{K}(x,x^{\prime})=\sum_{k=1}^{\infty}\lambda_{k}\sum_{j=1}^{N(d,k)}Y_{k,j}(x)Y_{k,j}(x^{\prime}).

Legendre polynomial. We have the additional formula

\sum_{j=1}^{N(d,k)}Y_{k,j}(x)Y_{k,j}(x^{\prime})=N(d,k)P_{k}(x^{T}x^{\prime}),

where

N(d,k)=\frac{(2k+d-2)(k+d-3)!}{k!(d-2)!}.

The polynomial $P_{k}$ is the $k$ -th Legendre polynomial in $d$ dimension, also known as Gegenbauer polynomials:

P_{k}(t)=\left(-\frac{1}{2}\right)^{k}\frac{\Gamma\left(\frac{d-1}{2}\right)}{\Gamma\left(k+\frac{d-1}{2}\right)}(1-t^{2})^{(3-d)/2}\left(\frac{d}{dt}\right)^{k}(1-t^{2})^{k+(d-3)/2}.

It is even (resp. odd) when $k$ is odd (reps. even). Furthermore, they have the orthogonal property

\int_{-1}^{1}P_{k}(t)P_{j}(t)(1-t^{2})^{(d-3)/2}dt=\delta_{ij}\frac{w_{d-1}}{w_{d-2}}\frac{1}{N(d,k)},

where

w_{d-1}=\frac{2\pi^{d-2}}{\Gamma(d/2)}

denotes the surface of sphere $\mathbb{S}^{d-1}$ in $d$ dimension, and this leads to the integration property

\int P_{j}(\langle w,x\rangle)P_{k}(\langle w,x\rangle)d\tau(w)=\frac{\delta_{jk}}{N(p,k)}P_{k}(\langle x,y\rangle)

for any $x,y\in\mathbb{S}^{d-1}$ . $\tau(w)$ is the uniform measure on the sphere.

C.2 NTK of quasi neural network

We start the proof of the Theorem 12 by the following lemmas:

Lemma 13.

The NTK of a binary weight neural network can be simplified as

		$\displaystyle\mathcal{K}(x,x^{\prime})=\left(\frac{c}{d}\langle x,x^{\prime}\rangle+\beta^{2}\right)\Sigma^{(0)}+\Sigma^{(1)},$		(35)
		$\displaystyle\Sigma^{(0)}=\mathbb{E}\left[\tilde{\psi}^{\prime}\left(\mu\right)\tilde{\psi}^{\prime}\left(\mu^{\prime}\right)\right],\quad\Sigma^{(1)}=\mathbb{E}\left[\tilde{\psi}\left(\mu\right)\tilde{\psi}\left(\mu^{\prime}\right)\right],$		(35)

where $[\mu,\mu^{\prime}]\sim\mathcal{N}(0,\Sigma)$ ,

\displaystyle\Sigma=\mathbb{E}[x_{1,i}x^{\prime}_{1,i}]=\frac{c}{d}\mathrm{Var}[\theta]\left[\begin{matrix}1&x^{T}x^{\prime}\\ x^{T}x^{\prime}&1\end{matrix}\right]

are the pre-activation of the second layer.

Proof.

	$\displaystyle\mathcal{K}(x,x^{\prime})$	$\displaystyle=\sum_{i=1,j=1}^{d_{1},d_{2}}\frac{\partial\bar{y}}{\partial\theta_{1,ij}}\frac{\partial\bar{y}^{\prime}}{\partial\theta_{1,ij}}+\sum_{j=1}^{d_{2}}\frac{\partial\bar{y}}{\partial b_{1,j}}\frac{\partial\bar{y}^{\prime}}{\partial b_{1,j}}+\sum_{j=1}^{d_{2}}\frac{\partial\bar{y}}{\partial w_{2,j}}\frac{\partial\bar{y}^{\prime}}{\partial w_{2,j}}$
		$\displaystyle=\frac{c}{d_{1}d_{2}}\sum_{i=1,j=1}^{d_{1},d_{2}}x_{1,i}x^{\prime}_{1,i}w_{2,j}^{2}\tilde{\psi}^{\prime}(\nu_{1,j})\tilde{\psi}^{\prime}_{2}(\nu^{\prime}_{1,j})$
		$\displaystyle\quad+\frac{\beta^{2}}{d_{2}}\sum_{j=1}^{d_{2}}\tilde{\psi}^{\prime}(\nu_{1,j})\tilde{\psi}^{\prime}(\nu^{\prime}_{1,j})+\frac{1}{d_{2}}\sum_{j=1}^{d_{2}}\tilde{\psi}(\nu_{1,j})\tilde{\psi}(\nu^{\prime}_{1,j})$
		$\displaystyle=\frac{c}{d_{1}d_{2}}\sum_{i=1}^{d_{1}}x_{1,i}x^{\prime}_{1,i}\sum_{j=1}^{d_{2}}w_{2,j}^{2}\tilde{\psi}^{\prime}(\nu_{1,j})\tilde{\psi}^{\prime}(\nu^{\prime}_{1,j})$
		$\displaystyle\quad+\frac{\beta^{2}}{d_{2}}\sum_{j=1}^{d_{2}}w_{2,j}^{2}\tilde{\psi}^{\prime}(\nu_{1,j})\tilde{\psi}^{\prime}(\nu^{\prime}_{1,j})+\frac{1}{d_{2}}\sum_{j=1}^{d_{2}}\tilde{\psi}(\nu_{1,j})\tilde{\psi}(\nu^{\prime}_{1,j})$
		$\displaystyle=(\frac{c}{d}\langle x,x^{\prime}\rangle+\beta^{2})\mathbb{E}[\tilde{\psi}^{\prime}(\nu)\tilde{\psi}^{\prime}(\nu^{\prime})]+\mathbb{E}[\tilde{\psi}(\nu)\tilde{\psi}(\nu^{\prime})]\quad a.s.$

where $(\nu,\nu^{\prime})$ has the same distribution as $(\nu_{2,j},\nu^{\prime}_{2,j})$ for any $j$ . We make use of the fact $\mathbb{E}[w_{2,j}^{2}]=1$ , and from central limit theorem, $x_{1,i},x^{\prime}_{1,i}$ and $\mu_{1,i},\mu_{1,i}^{\prime}$ converge to joint Gaussian distribution for any fixed $x,x^{\prime}$ as $d_{1}\rightarrow\infty$

	$\displaystyle\mathbb{E}[x_{1,i}x^{\prime}_{1,i}]$	$\displaystyle=\frac{1}{d}\mathbb{E}[\sum_{k=1}^{d}w_{ki}x_{k}\sum_{k^{\prime}=1}^{d}w_{k^{\prime}i}x^{\prime}_{k^{\prime}}]$
		$\displaystyle=\frac{1}{d}\mathbb{E}[\sum_{k=1}^{d}w_{ki}^{2}x_{k}x^{\prime}_{k}]$
		$\displaystyle=\frac{1}{d}\langle x,x^{\prime}\rangle$

Similarly,

	$\displaystyle\mathbb{E}[\mu_{1,i}^{2}]$	$\displaystyle=\frac{c}{d_{1}}\sum_{i=1}^{d_{1}}\mathbb{E}[\theta_{1,ij}^{2}]\mathbb{E}[x_{1,j}^{2}]=\frac{c}{d}\mathrm{Var}[\theta]$
	$\displaystyle\mathbb{E}[\mu_{1,i}\mu_{1,i}^{\prime}]$	$\displaystyle=\frac{c}{d_{1}}\sum_{i=1}^{d_{1}}\mathbb{E}[\theta_{1,ij}^{2}]\mathbb{E}[x_{1,j}x^{\prime}_{1,j}]=\frac{c}{d}\mathrm{Var}[\theta]\langle x,x^{\prime}\rangle$

∎

C.3 Proof of Theorem 11

Remind that as is proved in Theorem 10, $\varsigma_{1,j}(T)\rightarrow\varsigma_{1,t}(0)$ for any $T$ satisfying a mild condition, and $\varsigma_{1,t}(0)$ is nonzero almost surely. Making use the fact that $\tilde{\psi}(\cdot;\varsigma)$ is continuous with respect to $\varsigma$ , and its first and second order derivative is stochastically bounded, the change of kernel $\mathcal{K}$ induced by $\varsigma_{1,j}$ converges to 0 as $d_{1},d_{2}\rightarrow\infty$ . This reduces to this quasi neural network to a standard neural network with activation function $\tilde{\psi}(\cdot)$ , which is twice differentiable and has bounded second order derivative. From Theorem 2 in Jacot et al. [2018], the kernel during training converges to the one during initialization. For the ease of the readers, we restate the proof below. On the limit $d_{2}\rightarrow\infty,d_{1}\rightarrow\infty$ ,

		$\displaystyle\mathcal{K}(x,x^{\prime})(t)-\mathcal{K}(x,x^{\prime})(0)$
		$\displaystyle=\frac{c\langle x,x^{\prime}\rangle+\beta^{2}}{d_{2}}\sum_{j=1}^{d_{2}}\big{(}w_{2,j}^{2}(t)\tilde{\psi}^{\prime}(\nu_{1,j}(t))\tilde{\psi}^{\prime}(\nu^{\prime}_{1,j}(t))-w_{2,j}^{2}(0)\tilde{\psi}^{\prime}(\nu_{1,j}(0))\tilde{\psi}^{\prime}(\nu^{\prime}_{1,j}(0))\big{)}$
		$\displaystyle\quad+\frac{1}{d_{2}}\sum_{j=1}^{d_{2}}\big{(}\tilde{\psi}(\nu_{1,j}(t))\tilde{\psi}(\nu^{\prime}_{1,j}(t))-\tilde{\psi}(\nu_{1,j}(0))\tilde{\psi}(\nu^{\prime}_{1,j}(0))\big{)}$
		$\displaystyle=\frac{c\langle x,x^{\prime}\rangle+\beta^{2}}{d_{2}}\Bigg{(}\sum_{j=1}^{d_{2}}\big{(}w_{2,j}^{2}(t)-w_{2,j}^{2}(0)\big{)}\tilde{\psi}^{\prime}(\nu_{1,j}(t))\tilde{\psi}^{\prime}(\nu^{\prime}_{1,j}(t))$
		$\displaystyle\quad+\sum_{j=1}^{d_{2}}w_{2,j}^{2}(0)\big{(}\tilde{\psi}^{\prime}(\nu_{1,j}(t))-\tilde{\psi}^{\prime}(\nu_{1,j}(0))\big{)}\tilde{\psi}^{\prime}(\nu^{\prime}_{1,j}(t))$
		$\displaystyle\quad+\sum_{j=1}^{d_{2}}w_{2,j}^{2}(0)\tilde{\psi}^{\prime}(\nu_{1,j}(0))\bigl{(}\tilde{\psi}^{\prime}(\nu^{\prime}_{1,j}(t))-\tilde{\psi}^{\prime}(\nu^{\prime}_{1,j}(0))\bigr{)}\Bigg{)},$

	$\displaystyle\|\mathcal{K}(x,x^{\prime})(t)-\mathcal{K}(x,x^{\prime})(0)\|$	$\displaystyle\leq\Bigg{\|}\frac{c\langle x,x^{\prime}\rangle+\beta^{2}}{d_{2}}\Bigg{\|}$
		$\displaystyle\Bigg{(}\sum_{j=1}^{d_{2}}\|w_{2,j}(t)-w_{2,j}(0)\|\|w_{2,j}(t)+w_{2,j}(0)\|\|\tilde{\psi}^{\prime}(\nu_{1,j}(t))\tilde{\psi}^{\prime}(\nu^{\prime}_{1,j}(t))\|$
		$\displaystyle\quad+\sum_{j=1}^{d_{2}}w_{2,j}^{2}(0)\tilde{\psi}^{\prime}(\nu^{\prime}_{1,j}(t))\big{\|}\tilde{\psi}^{\prime}(\nu_{1,j}(t))-\tilde{\psi}^{\prime}(\nu_{1,j}(0))\big{\|}$
		$\displaystyle\quad+\sum_{j=1}^{d_{2}}w_{2,j}^{2}(0)\tilde{\psi}^{\prime}(\nu_{1,j}(0))\bigl{\|}\tilde{\psi}^{\prime}(\nu^{\prime}_{1,j}(t))-\tilde{\psi}^{\prime}(\nu^{\prime}_{1,j}(0))\bigr{\|}\Bigg{)}$
		$\displaystyle\quad+\frac{1}{d_{2}}\sum_{j=1}^{d_{2}}\big{(}\tilde{\psi}(\nu_{1,j}(t))(\tilde{\psi}(\nu^{\prime}_{1,j}(t))\tilde{\psi}(\nu^{\prime}_{1,j}(0)))\big{)}$
		$\displaystyle\quad+\frac{1}{d_{2}}\sum_{j=1}^{d_{2}}\big{(}\tilde{\psi}(\nu^{\prime}_{1,j}(0))(\tilde{\psi}(\nu_{1,j}(t))-\tilde{\psi}(\nu_{1,j}(0)))\big{)}$

From Theorem 10, and observing that $\tilde{\psi}^{\prime}(x),\tilde{\psi}^{\prime\prime}(x)$ are bounded by constants, one can verify that each summation term is stochastically bounded by $\sqrt{d_{2}}$ , so as $d_{2}\rightarrow\infty$ , $\mathcal{K}(t)-\mathcal{K}(0)$ converges to 0 at rate $\sqrt{d_{2}}$ .

C.4 Spherical harmonics decomposition to activation function

Following Bach [2017], we start by studying the decomposition of action in quasi neural network (7) and its gradients (8): for arbitrary fixed $c>0$ , $-1\leq t\leq 1$ , we can decompose equation (7) and (8) as

	$\displaystyle\tilde{\sigma}(ct)$	$\displaystyle=\sum_{k=0}^{\infty}\lambda_{k}N(d,k)P_{k}(t),$		(36)
	$\displaystyle\tilde{\sigma}^{\prime}(ct)$	$\displaystyle=\sum_{k=0}^{\infty}\lambda^{\prime}_{k}N(d,k)P_{k}(t),$		(37)

where $P_{k}$ is the $k$ -th Legendre polynomial in dimension $d$ .

Lemma 14.

The decomposition of activation function in the quasi neural network (36) satisfies

1.

$\lambda_{k}=0$ if $k$ is odd,
2.

$\lambda_{k}>0$ if $k$ is even,
3.

$\lambda_{k}\asymp\mathrm{Poly}(k)(C/\sqrt{k})^{-k}$ as $k\rightarrow\infty$ when $k$ is even, where $\mathrm{Poly}(k)$ denotes a polynomial of $k$ , and $C$ is a constant.

Its gradient (37) satisfies

1.

$\lambda^{\prime}_{k}=0$ if $k$ is even,
2.

$\lambda^{\prime}_{k}>0$ if $k$ is odd,
3.

$\lambda^{\prime}_{k}\asymp\mathrm{Poly}(k)(C/\sqrt{k})^{-k}$ as $k\rightarrow\infty$ when $k$ is odd, where $\mathrm{Poly}(k)$ denotes a polynomial of $k$ , and $C$ is a constant.

Proof.

Let’s start with the derivative of activation function in quasi neural network:

\tilde{\sigma}^{\prime}(t)=\Phi(\hat{c}t),-1\leq t\leq 1,

where $\hat{c}$ is a constant. We introduce the auxiliary parameters $x,w\in\mathbb{R}^{d}$ s.t. $\|x\|_{2}=\|w\|_{2}=1$ and let $t=w^{T}x$ By Cauchy–Schwarz inequality, $-1\leq w^{T}x\leq 1$ . Following Bach [2017], we have the following decomposition to $\tilde{\sigma}^{\prime}(w^{T}x)$ :

\tilde{\sigma}^{\prime}(w^{T}x)=\sum_{k=1}^{\infty}\lambda_{k}^{\prime}N(d,k)P_{k}(w^{T}x),

where $N(d,k)$ and $P_{k}(\cdot)$ are defined in section C.1, $\lambda_{k}^{\prime}$ can be computed by

	$\displaystyle\lambda^{\prime}_{k}$	$\displaystyle=\frac{w_{d-1}}{w_{d}}\int_{-1}^{1}\tilde{\sigma}^{\prime}(t)P_{k}(t)(1-t^{2})^{(d-2)/2}dt$
		$\displaystyle=\left(-\frac{1}{2}\right)^{k}\frac{\Gamma((d-1)/2)}{\Gamma(k+(d-1)/2)}\frac{w_{d-1}}{w_{d}}\int_{-1}^{1}\tilde{\sigma}^{\prime}(t)\left(\frac{d}{dt}\right)^{k}(1-t^{2})^{k+(d-3)/2}dt.$

To solve this itegration, we can apply Taylor decomposition to $\tilde{\sigma}^{\prime}(\cdot)$ :

\tilde{\sigma}^{\prime}(ct)=\frac{1}{2}+\frac{1}{\sqrt{2\pi}}\sum_{n=0}^{\infty}\frac{(-1)^{n}\hat{c}^{2n+1}}{2^{n}n!(2n+1)}t^{2n+1}.

(38)

We will study the following polynomial integration first

\int_{-1}^{1}t^{\alpha}\left(\frac{d}{dt}\right)^{k}(1-t^{2})^{k+(d-3)/2}dt.

When $\alpha<k$ , this integration equals 0 as $P_{k}$ is orthogonal to all polynomials of degree less than $k$ . If $(\alpha-k)\mod 2\neq 0$ , this integration is 0 because the function to be integrated is an odd function. For $\alpha\geq k$ and $k\equiv\alpha\mod 2$ ( $k$ is odd), using successive integration by parts,

$\displaystyle\int_{-1}^{1}t^{\alpha}\left(\frac{d}{dt}\right)^{k}(1-t^{2})^{k+(d-3)/2}dt$	$\displaystyle=(-1)^{k}\frac{\alpha!}{(\alpha-k)!}\int_{-1}^{1}t^{\alpha-k}(1-t^{2})^{k+(d-3)/2}dt$	(39)
	$\displaystyle=(-1)^{k}\frac{\alpha!}{(\alpha-k)!}\int_{-\pi/2}^{\pi/2}\sin^{\alpha-k}(x)\cos^{2k+(d-2)}(x)dx$
	$\displaystyle=(-1)^{k}C_{d}\frac{\alpha!(2k+d-3)!!}{(\alpha-k)!!(\alpha+k+d-2)!!},$

where $C_{d}$ is a constant that depends only on $d$ mod 2.

Combining (38) and (39), we have $\lambda_{k}=0$ when $k$ is even and $k\neq 0$ . When $k$ is odd,

\displaystyle\lambda_{k}^{\prime}=

\displaystyle\left(-\frac{1}{2}\right)^{k}\frac{\Gamma((d-1)/2)}{\Gamma(k+(d-1)/2)}\frac{w_{d-1}}{w_{d}}\frac{C_{d}}{\sqrt{2\pi}}\sum_{\alpha=k:2}^{\infty}\hat{c}^{\alpha}(-1)^{(\alpha-1)/2}\frac{(\alpha-2)!!(2k+d-3)!!}{(\alpha-k)!!(\alpha+k+d-2)!!}.

Following Bach [2017], Geifman et al. [2020] we take $d$ as a constant and take $k$ to infinity. Let $\beta=(\alpha-k)/2\geq 0$ we have

	$\displaystyle\lambda_{k}^{\prime}$	$\displaystyle=(-1)^{(k+1)/2}\left(\frac{1}{2}\right)^{k}\frac{\Gamma((d-1)/2)}{\Gamma(k+(d-1)/2)}\frac{w_{d-1}}{w_{d}}\frac{C_{d}}{\sqrt{2\pi}}\sum_{\beta=0}^{\infty}\frac{(-1)^{\beta}\hat{c}^{2\beta+k}(2\beta+k-2)!!(2k+d-3)!!}{(2\beta)!!(2\beta+2k+d-2)!!}$
		$\displaystyle\asymp(-1)^{(k+1)/2}\left(\frac{1}{2}\right)^{k}\frac{\Gamma((d-1)/2)}{\Gamma(k+(d-1)/2)}\frac{w_{d-1}}{w_{d}}\frac{C_{d}}{\sqrt{2\pi}}\sum_{\beta=0}^{\infty}\frac{(-1)^{\beta}\hat{c}^{2\beta+k}\Gamma(\beta+k/2)\Gamma(k+(d-1)/2)}{\beta!\Gamma(\beta+k+d/2)2^{\beta-k/2}}$
		$\displaystyle:=(-1)^{(k+1)/2}\left(\frac{1}{2}\right)^{k}\frac{\Gamma((d-1)/2)}{\Gamma(k+(d-1)/2)}\frac{w_{d-1}}{w_{d}}\frac{C_{d}}{\sqrt{2\pi}}\sum_{\beta=0}^{\infty}g(\beta,k).$

where $\asymp$ means the radio converge to a constant which doesn’t depend on $k$ or $\beta$ as $k\rightarrow\infty$ . Here we introduced the function $g(\beta,k)$ for simplification, and it satisfies

\displaystyle\frac{g(\beta,k)}{g(\beta-1,k)}=-\frac{\hat{c}^{2}(2\beta+k-3)}{2\beta(2\beta+2k+d-2)},

which indicates that $g(\beta,k)$ decays at factorial rate when $\beta>\hat{c}^{2}/2$ . If $k\gg\hat{c}^{2}/2$ , $\beta\ll k$ regime dominates the summation.

Using Stirling’s approximation, one can easily prove

\displaystyle\Gamma(k+x)\asymp\Gamma(k)k^{x}

When $k\gg d$ ,

	$\displaystyle g(\beta,k)$	$\displaystyle=\frac{(-1)^{\beta}\hat{c}^{2\beta+k}\Gamma(\beta+k/2)\Gamma(k+(d-1)/2)}{\beta!\Gamma(\beta+k+d/2)2^{\beta-k/2}}$
		$\displaystyle\asymp\left(-\frac{1}{4}\right)^{\beta}\hat{c}^{2\beta+k}\Gamma(k+(d-1)/2)\frac{2^{k/2}\Gamma(k/2)}{\Gamma(k)k^{d/2}\beta!}$
		$\displaystyle=\hat{c}^{k}\Gamma(k+(d-1)/2)\frac{2^{k/2}\Gamma(k/2)}{\Gamma(k)k^{d/2}}\left(-\frac{\hat{c}^{2}}{4}\right)^{\beta}\frac{1}{\beta!}$

This splits $g(\beta,k)$ into two parts: the first part depends only on $k$ and the rest part only depends on $\beta$ . The summation of the second part over $\beta$ yields

\displaystyle\sum_{\beta=0}^{\infty}\left(-\frac{\hat{c}^{2}}{4}\right)^{\beta}\frac{1}{\beta!}=\exp(-\frac{\hat{c}^{2}}{4}),

Using Stirling’s approximation

\gamma(x+1)\asymp\sqrt{2\pi x}(x/e)^{x},

this leads to the expression for $\lambda_{k}$ :

	$\displaystyle\lambda_{k}^{\prime}$	$\displaystyle\asymp(-1)^{(k+1)/2}\left(\frac{1}{2}\right)^{k}\frac{\Gamma((d-1)/2)}{\Gamma(k+(d-1)/2)}\hat{c}^{k}\Gamma(k+(d-1)/2)\frac{2^{k/2}\Gamma(k/2)}{\Gamma(k)k^{d/2}}\exp(-\frac{\hat{c}^{2}}{4})$
		$\displaystyle\asymp(-1)^{(k+1)/2}\left(\frac{\hat{c}}{2}\right)^{k}\frac{2^{k/2}\Gamma(k/2)}{\Gamma(k)k^{d/2}}\exp(-\frac{\hat{c}^{2}}{4})$
		$\displaystyle\asymp(-1)^{(k+1)/2}\left(\frac{\hat{c}}{2}\sqrt{\frac{e}{k}}\right)^{k}k^{-d/2}\exp(-\frac{\hat{c}^{2}}{4})$

Similarly, the activation function of quasi neural network has the Tayler expansion

	$\displaystyle\tilde{\sigma}(x)$	$\displaystyle=\varsigma_{\ell}\varphi\left(\hat{c}t\right)+x\Phi\left(\hat{c}t\right)$
		$\displaystyle=\frac{t}{2}+\sum_{n=0}^{\infty}\frac{(-1)^{n}\hat{c}^{2n+1}}{2^{n+1}(n+1)!(2n+1)}t^{2n+2}.$

So $\lambda_{k}=0$ when $k$ is odd, and when $k$ is even:

\displaystyle\lambda_{k}

\displaystyle=(-1)^{(k+1)/2}\left(\frac{1}{2}\right)^{k}\frac{\Gamma((d-1)/2)}{\Gamma(k+(d-1)/2)}\frac{w_{d-1}}{w_{d}}\frac{C_{d}}{\sqrt{2\pi}}\sum_{\beta=0}^{\infty}\frac{(-1)^{\beta}\hat{c}^{2\beta+k}(2\beta+k-2)!!(2k+d-3)!!}{(2\beta)!!(2\beta+2k+d-2)!!}

Furthermore, when $k\gg d$ ,

\lambda_{k}\asymp(-1)^{k/2}{k^{-\frac{d}{2}}}\left(\frac{\hat{c}}{2}\sqrt{\frac{e}{k}}\right)^{k}\exp\left(-\frac{\hat{c}^{2}}{4}\right),

∎

C.5 Computing covariance matrix

In this part, we prove Theorem 12 by computing $\Sigma^{(0)}$ and $\Sigma^{(1)}$ .

Theorem 12 NTK of a binary weight neural network can be decomposed using equation (10). If $k\gg d$ , then

\mathrm{Poly}_{1}(k)(C)^{-k}\leq u_{k}\leq\mathrm{Poly}_{2}(k)(C)^{-k}

where $\mathrm{Poly}_{1}(k)$ and $\mathrm{Poly}_{2}(k)$ denote polynomials of $k$ , and $C$ is a constant.

We make use of the results in Section C.4, and remind that $\lambda_{k},\lambda^{\prime}_{k}$ depends on $\hat{c}$ , we make this explicit as $\lambda_{k}(\hat{c}),\lambda^{\prime}_{k}(\hat{c})$ . We introduce an auxiliary parameter $w\sim\mathcal{N}(0,I)$ , and denote $\tilde{c}=\sqrt{\frac{c\mathrm{Var}[\Theta]}{d\ \tilde{\varsigma}^{2}}}=\sqrt{\frac{\mathrm{Var}[\theta]}{1-\mathrm{Var}[\theta]}},\tilde{w}=w/\|w\|_{2}$ , then the decomposition of kernel (10) can be computed by

	$\displaystyle\Sigma^{(1)}$	$\displaystyle=\mathbb{E}_{\theta}\left[\tilde{\sigma}\left(\mu\right)\tilde{\sigma}\left(\mu\right)\right]$
		$\displaystyle=\mathbb{E}_{w\sim\mathcal{N}(0,I)}\left[\tilde{\sigma}(\tilde{c}\langle w,x\rangle)\tilde{\sigma}(\tilde{c}\langle w,x^{\prime}\rangle)\right]$
		$\displaystyle=\mathbb{E}_{\\|w\\|}\int\tilde{\sigma}(\tilde{c}\langle\tilde{w},x\rangle)\tilde{\sigma}(\tilde{c}\langle\tilde{w},x^{\prime}\rangle)d\tau(\tilde{w})$
		$\displaystyle=\mathbb{E}_{\\|w\\|}\sum_{k=0}^{\infty}(\lambda_{k}(\tilde{c}\\|w\\|))^{2}N(p,k)P_{k}(\langle x,x^{\prime}\rangle),$
	$\displaystyle\Sigma^{(0)}$	$\displaystyle=\mathbb{E}_{\theta}\left[\tilde{\sigma}^{\prime}\left(\mu\right)\tilde{\sigma}^{\prime}\left(\mu\right)\right]$
		$\displaystyle=\mathbb{E}_{\\|w\\|}\sum_{k=0}^{\infty}{(\lambda_{k}^{\prime}(\tilde{c}\\|w\\|)})^{2}N(p,k)P_{k}(\langle x,x^{\prime}\rangle).$

First compute $\Sigma^{(0)}$ . According to Lemma 16 in Bietti and Mairal [2019],

u_{0,k}=\mathbb{E}_{w\sim\mathcal{N}(0,I)}[{\lambda_{k}^{\prime}}^{2}]=\mathbb{E}_{\|w\|}[{\lambda_{k}^{\prime}}^{2}].

Remind that

\lambda^{\prime}_{k}(\tilde{c}\|w\|)\asymp(-1)^{k/2}k^{-d/2}\left(\frac{\tilde{c}\|w\|}{2}\sqrt{\frac{e}{k}}\right)^{k}\exp\left(-\frac{\tilde{c}^{2}\|w\|^{2}}{4}\right).

	$\displaystyle u_{0,k}$	$\displaystyle=\mathbb{E}_{\\|w\\|}{(\lambda_{k}^{\prime}(\tilde{c}\\|w\\|)})^{2}$
		$\displaystyle\asymp\mathbb{E}_{\\|w\\|}k^{-d}\left(\frac{\tilde{c}^{2}\\|w\\|^{2}e}{4k}\right)^{k}\exp\left(-\frac{\tilde{c}^{2}\\|w\\|^{2}}{2}\right)$
		$\displaystyle=k^{-d}\left(k/e\right)^{-k}\mathbb{E}_{\tilde{c}\\|w\\|}\left(\frac{\tilde{c}^{2}\\|w\\|^{2}}{4}\right)^{k}\exp\left(-\frac{\tilde{c}^{2}\\|w\\|^{2}}{2}\right)$

Because $w\sim\mathcal{N}(0,1)$ , $\|w\|_{2}^{2}$ satisfy Chi-square distribution, and its momentum generating function is

\displaystyle M_{X}(t)

\displaystyle=\mathbb{E}[\exp(t\|w\|^{2})]=(1-2t)^{-d/2}

It’s $k$ -th order derivative is

\displaystyle M_{X}^{(k)}

\displaystyle=\mathbb{E}[\|w\|^{2k}\exp(t\|w\|^{2})]=\frac{(d+2k-2)!!}{(d-2)!!}(1-2t)^{-\frac{d}{2}-k}

Let $t=-\tilde{c}^{2}/2$ , we get

	$\displaystyle\mathbb{E}\left[\\|w\\|^{2k}\exp\left(-\frac{\tilde{c}^{2}\\|w\\|^{2}}{2}\right)\right]$	$\displaystyle=\frac{(d+2k-2)!!}{(1+\tilde{c}^{2})^{d/2+k}(d-2)!!}$
		$\displaystyle\asymp 2^{k}\frac{\Gamma(k+d/2)}{\Gamma(d/2)}(1+\tilde{c}^{2})^{-k-d/2}$
		$\displaystyle\asymp\left(\frac{2k}{(1+\tilde{c}^{2})e}\right)^{d/2+k}\sqrt{\frac{1}{k}}$

	$\displaystyle u_{0,k}$	$\displaystyle\asymp\left(\frac{\tilde{c}}{2}\right)^{2k}k^{-d}\left(k/e\right)^{-k}\left(\frac{2k}{(1+\tilde{c}^{2})e}\right)^{d/2+k}$
		$\displaystyle\asymp k^{-(d-1)/2}\left(\frac{\tilde{c}^{2}}{2(1+\tilde{c}^{2})}\right)^{k}$

when $k$ is odd, and 0 when $k$ is even.

Similarly,

	$\displaystyle u_{1,k}$	$\displaystyle\asymp\left(\frac{\tilde{c}}{2}\right)^{2k}k^{-d}\left(k/e\right)^{-k}\left(\frac{2k}{(1+\tilde{c}^{2})e}\right)^{d/2+k}$
		$\displaystyle\asymp k^{-(d-1)/2}\left(\frac{\tilde{c}^{2}}{2(1+\tilde{c}^{2})}\right)^{k}$

when $k$ is even, and 0 when $k$ is odd.

Finally, using the recurrence relation

\displaystyle tP_{k}(t)=\frac{k}{2k+d-3}P_{k-1}(t)+\frac{k+d-3}{2k+d-3}P_{k+1}(t)

taking them into (35) finishes the proof.

C.6 Gaussian kernel

	$\displaystyle\mathcal{K}_{RGauss}(x,x^{\prime})$	$\displaystyle=\mathbb{E}[\mathcal{K}_{Gauss}(\kappa x,\kappa x^{\prime})]$
		$\displaystyle=\mathbb{E}\left[\exp\left(-\frac{\kappa^{2}\\|x-x^{\prime}\\|}{\xi^{2}}\right)\right]$
		$\displaystyle=\mathbb{E}\left[\exp\left(-\frac{\\|x-x^{\prime}\\|}{(\xi/\kappa)^{2}}\right)\right]$

This indicates that this kernel can be decomposed using spherical harmonics (10), and when $k\gg d$ , the coefficient

	$\displaystyle u_{k}$	$\displaystyle=\mathbb{E}\left[\exp\left(-\frac{2\kappa^{2}}{\xi^{2}}\right)\left(\frac{\xi}{\kappa}\right)^{d-2}I_{k+d/2-1}\left(\frac{2\kappa^{2}}{\xi^{2}}\right)\Gamma\left(\frac{d}{2}\right)\right]$
		$\displaystyle\asymp\mathbb{E}\left[\exp\left(-\frac{2\kappa^{2}}{\xi^{2}}\right)\Gamma\left(\frac{d}{2}\right)\sum_{j=0}^{\infty}\frac{1}{j!\Gamma(k+d/2+j)}\left(\frac{\kappa^{2}}{\xi^{2}}\right)^{k+2j}\right]$
		$\displaystyle=\sum_{j=0}^{\infty}\frac{\Gamma(d/2)}{j!\Gamma(k+d/2+j)}\mathbb{E}\left[\left(\frac{\kappa^{2}}{\xi^{2}}\right)^{k+2j}\exp\left(-\frac{2\kappa^{2}}{\xi^{2}}\right)\right]$
		$\displaystyle=\sum_{j=0}^{\infty}\frac{\Gamma(d/2)}{j!\Gamma(k+d/2+j)}\frac{\Gamma(k+2j+d/2)}{\Gamma(d/2)}\left(\frac{2}{\xi^{2}}\right)^{k+2j}\left(\frac{1}{1+4/\xi^{2}}\right)^{(k+2j+d/2)}$
		$\displaystyle\asymp\left(\frac{2}{\xi^{2}}\right)^{k}\left(1+\frac{4}{\xi^{2}}\right)^{(-k-d/2)}\sum_{j=0}^{\infty}\frac{1}{j!}\left(\frac{k(2/\xi^{2})^{2}}{(1+4/\xi^{2})^{2}}\right)^{j}$
		$\displaystyle\asymp\left(\frac{2}{4+\xi^{2}}\right)^{k}\exp\left(\left(\frac{2}{4+\xi^{2}}\right)^{2}k\right).$

Note that $\frac{2}{4+\xi^{2}}\exp\left(\left(\frac{2}{4+\xi^{2}}\right)^{2}\right)$ is always smaller than 1 so $u_{k}$ is always decreasing with $k$ .

Appendix D Additional information about numerical result

D.1 Toy dataset

In neural networks (NN) experiment, we used three layers with the first layer fixed. The number of hidden neural is 512. In neural network with binary weights (BWNN) experiment, the setup is the same as NN except the second layer is Binary. We used BinaryConnect method with stochastic rounding. We used gradient descent with learning rate searched from $10^{-3},10^{-2},10^{-1}$ . For Laplacian kernel and Gaussian kernel, we searched kernel bandwidth from $2^{-2}\mu$ to $2^{2}\mu$ by power of 2, and $\mu$ is the medium of pairwise distance. The SVM cost value parameter is from $10^{-2}$ to $10^{4}$ by power of 2.

More results are listed in Table 2. Accuracy are shown in the format of mean $\pm$ std. P90 and P95 denotes the percentage of dataset that a model achieves at least 90% and 95% of the highest accuracy, respectively.

Table 2: More results in UCI dataset experiment.

Classifier	Training			Testing
Classifier	Accuracy	P90	P95	Accuracy	P90	P95
NN	96.19 $\pm$ 8.03%	96.67%	91.11%	77.62 $\pm$ 16.10%	73.33%	56.67%
BWNN	93.55 $\pm$ 10.39%	84.44%	76.67%	77.83 $\pm$ 16.57%	77.78%	54.44%
Laplacian	93.52 $\pm$ 9.65%	85.56%	76.67%	81.62 $\pm$ 14.72%	97.78%	91.11%
Gaussian	91.08 $\pm$ 10.63%	76.67%	58.89%	81.40 $\pm$ 14.85%	95.56%	87.78%

D.2 MNIST-like dataset

Similar to the toy dataset experiment, we used three layer neural networks with the first layer fixed, and only quantize the second layer. The number of neurons in the hidden layer is 2048. The batchsize if 100 and ADAM optimizer with learning rate $10^{-3}$ is used.

		$\displaystyle\lim_{d_{1}\rightarrow\infty}\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}\mathbb{E}_{w_{1}}\big{[}\|X_{i}-\mathbb{E}_{w_{1}}[X_{i}\|\Theta]\|^{3}\big{\|}\Theta\big{]}$		(14)
		$\displaystyle=\lim_{d_{1}\rightarrow\infty}\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}\mathbb{E}_{w_{1}}\big{[}\|(w_{1,ij}-\theta_{1,ij})x_{1,i}\|^{3}\big{\|}\Theta\big{]}$
		$\displaystyle\leq\lim_{d_{1}\rightarrow\infty}\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}\mathbb{E}_{w_{1}}\big{[}8\|x_{1,i}\|^{3}\big{\|}\Theta\big{]}$
		$\displaystyle=\lim_{d_{1}\rightarrow\infty}\frac{8}{d_{1}}\sum_{i=1}^{d_{1}}\Big{\|}\sum_{k=1}^{d}w_{0,ki}x_{k}\Big{\|}^{3}$
		$\displaystyle\leq\lim_{d_{1}\rightarrow\infty}\frac{8}{d_{1}}\sum_{i=1}^{d_{1}}\Bigg{(}\sum_{k=1}^{d}\|w_{0,ki}x_{k}\|\Bigg{)}^{3}$
		$\displaystyle=\lim_{d_{1}\rightarrow\infty}\frac{8}{d_{1}}\sum_{i=1}^{d_{1}}\sum_{k,k^{\prime},k^{\prime\prime}=1}^{d}\|w_{0,ki}w_{0,k^{\prime}i}w_{0,k^{\prime\prime}i}\|\|x_{k}x_{k^{\prime}}x_{k^{\prime\prime}}\|$
		$\displaystyle\leq 8\sqrt{\frac{8}{\pi}}d^{3}$

		$\displaystyle\lim_{d_{1}\rightarrow\infty}\frac{\sum_{i=1}^{d_{1}}\mathbb{E}_{w_{1}}[\|X_{i}-\mathbb{E}_{w_{1}}[X_{i}\|\Theta]\|^{3}\|\Theta]}{(\sum_{i=1}^{d_{1}}\mathrm{Var}_{w_{1}}[X_{i}\|\Theta])^{3/2}}$
		$\displaystyle=\lim_{d_{1}\rightarrow\infty}\frac{\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}\mathbb{E}_{w_{1}}[\|X_{i}-\mathbb{E}_{w_{1}}[X_{i}\|\Theta]\|^{3}\|\Theta]}{{d_{1}^{1/2}}(\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}\mathrm{Var}_{w_{1}}[X_{i}\|\Theta])^{3/2}}$
		$\displaystyle=\frac{\lim_{d_{1}\rightarrow\infty}\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}\mathbb{E}_{w_{1}}[\|X_{i}-\mathbb{E}_{w_{1}}[X_{i}\|\Theta]\|^{3}\|\Theta]}{\lim_{d_{1}\rightarrow\infty}{d_{1}^{1/2}}(\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}\mathrm{Var}_{w_{1}}[X_{i}\|\Theta])^{3/2}}$
		$\displaystyle\leq\frac{8\sqrt{\frac{8}{\pi}}d^{3}}{\lim_{d_{1}\rightarrow\infty}d_{1}^{1/2}(1-\mathrm{Var}[\theta])}$
		$\displaystyle=0$

$\displaystyle\mathbb{E}_{w_{1}}\left[\frac{\partial y}{\partial w_{2,j}}\middle\|\Theta\right]$	$\displaystyle=\sqrt{\frac{c}{d_{2}}}\mathbb{E}_{w_{1}}[x_{2,j}\|\Theta],$	(19)
$\displaystyle\mathbb{E}_{w_{1}}\left[\frac{\partial y}{\partial b_{2,j}}\middle\|\Theta\right]$	$\displaystyle=\mathbb{E}_{w_{1}}\left[\frac{\partial\bar{y}}{\partial y_{1,j}}\frac{\partial y_{1,j}}{\partial b_{2,j}}\middle\|\Theta\right]=\sqrt{\frac{c}{d_{2}}}\beta w_{2,j}\mathbb{E}_{w_{1}}[\psi^{\prime}(y_{1,j})\|\Theta],$	(20)
$\displaystyle\mathbb{E}_{w_{1}}\left[\frac{\partial y}{\partial w_{1,ij}}\middle\|\Theta\right]$	$\displaystyle=\mathbb{E}_{w_{1}}\left[\frac{\partial y}{\partial y_{1,j}}\frac{\partial y_{1,j}}{\partial w_{1,ij}}\middle\|\Theta\right]=\sqrt{\frac{c}{d_{1}d_{2}}}w_{2,j}x_{1,i}\mathbb{E}_{w_{1}}[\psi^{\prime}(y_{1,j})\|\Theta],$	(21)

$\displaystyle\mathbb{E}_{w_{1}}\left[\frac{\partial\mathrm{loss}(y)}{\partial w_{2,j}}\middle\|\Theta\right]$	$\displaystyle=\sqrt{\frac{c}{d_{2}}}\mathbb{E}_{w_{1}}\left[(y-z)x_{2,j}\|\Theta\right],$	(26)
$\displaystyle\mathbb{E}_{w_{1}}\left[\frac{\partial\mathrm{loss}(y)}{\partial b_{1,j}}\middle\|\Theta\right]$	$\displaystyle=\sqrt{\frac{c}{d_{2}}}\beta w_{2,j}\mathbb{E}_{w_{1}}[(y-z)\psi^{\prime}(y_{1,j})\|\Theta],$	(27)
$\displaystyle\mathbb{E}_{w_{1}}\left[\frac{\partial\mathrm{loss}(y)}{\partial w_{1,ij}}\middle\|\Theta\right]$	$\displaystyle=\sqrt{\frac{c}{d_{1}d_{2}}}w_{2,j}x_{1,i}\mathbb{E}_{w_{1}}[(y-z)\psi^{\prime}(y_{1,j})\|\Theta],$	(28)

		$\displaystyle\mathbb{E}_{w_{1}}[\psi^{\prime}(y_{1,j})y\|\Theta]$		(30)
		$\displaystyle=\mathbb{E}_{w_{1}}\left[\sqrt{\frac{c}{d_{2}}}\psi^{\prime}(y_{1,j})\sum_{j=1}^{d_{2}}w_{2,j}\psi(y_{1,j})\middle\|\Theta\right]$
		$\displaystyle=\sqrt{\frac{c}{d_{2}}}\left(\mathbb{E}_{w_{1}}\left[\psi(y_{1,j})\psi^{\prime}(y_{1,j})w_{2,j}\middle\|\Theta\right]+\sum_{j^{\prime}\neq j}\mathbb{E}_{w_{1}}\left[\psi(y_{1,j^{\prime}})\psi^{\prime}(y_{1,j})w_{2,j^{\prime}}\middle\|\Theta\right]\right)$
		$\displaystyle=\sqrt{\frac{c}{d_{2}}}\left(\mathbb{E}_{w_{1}}\left[\psi(y_{1,j})w_{2,j}\middle\|\Theta\right]w_{2,j}+\sum_{j^{\prime}\neq j}\mathbb{E}_{w_{1}}[\psi^{\prime}(y_{1,j})\|\Theta]\mu_{2,j^{\prime}}w_{2,j^{\prime}}\right)$
		$\displaystyle=\sqrt{\frac{c}{d_{2}}}\left((1-\mathbb{E}_{w_{1}}[\psi^{\prime}(y_{1,j})\|\Theta])\mu_{2,j}w_{2,j}+\sum_{j^{\prime}=1}^{d_{2}}\mathbb{E}_{w_{1}}[\psi^{\prime}(y_{1,j})\|\Theta]\mu_{2,j^{\prime}}w_{2,j^{\prime}}\right)$