Why Quantization Improves Generalization: NTK of Binary Weight Neural Networks
Abstract
Quantized neural networks have drawn a lot of attention because they reduce the space and computational complexity of inference. Moreover, there has been folklore that quantization acts as an implicit regularizer and thus can improve the generalizability of neural networks, yet no existing work formalizes this interesting folklore. In this paper, we take the binary weights in a neural network as random variables under stochastic rounding, and study the distribution propagation over the different layers of the neural network. We propose a quasi neural network to approximate this distribution propagation, which is a neural network with continuous parameters and a smooth activation function. We derive the neural tangent kernel (NTK) for this quasi neural network, and show that its eigenvalues decay at an approximately exponential rate, comparable to that of a Gaussian kernel with randomized scale. This in turn indicates that the reproducing kernel Hilbert space (RKHS) of a binary weight neural network covers a strict subset of the functions covered by the one with real-valued weights. We use experiments to verify that the quasi neural network we propose can closely approximate the binary weight neural network. Furthermore, the binary weight neural network gives a lower generalization gap than the real-valued weight neural network, analogous to the difference between the Gaussian kernel and the Laplacian kernel.
1 Introduction
It has been found that by quantizing the parameters of a neural network, the memory footprint and computing cost can be greatly decreased with little to no loss in accuracy (Gupta et al., 2015). Furthermore, Hubara et al. (2016); Courbariaux et al. (2015) argued that quantization serves as an implicit regularizer and thus should increase the generalizability of a neural network compared to its full-precision version. However, there is no formal theoretical investigation of this statement to the best of our knowledge.
Empirical results show that traditional statistical learning techniques based on uniform convergence (e.g., VC-dimension (Blumer et al., 1989)) do not satisfactorily explain the generalization ability of neural networks. Zhang et al. (2016) showed that neural networks can perfectly fit the training data even if the labels are random, yet generalize well when the labels are not random. This seems to suggest that the model capacity of a neural network depends not only on the model, but also on the dataset. Recent studies (He and Tao, 2020) managed to understand the empirical performance from a number of different angles, including modeling stochastic gradient descent (SGD) with stochastic differential equations (SDE) (Weinan et al., 2019), studying the geometric structure of the loss surface (He et al., 2020), and overparameterization, a particular asymptotic behavior when the number of parameters of the neural network tends to infinity (Li et al., 2018; Choromanska et al., 2015; Allen-Zhu et al., 2018; Arora et al., 2019a). Recently, it was proven that the training process of a neural network in the overparameterized regime corresponds to kernel regression with the neural tangent kernel (NTK) (Jacot et al., 2018). A line of work (Bach, 2017; Bietti and Mairal, 2019; Geifman et al., 2020; Chen and Xu, 2020) further studied Mercer's decomposition of the NTK and proved that it is similar to a Laplacian kernel in terms of its eigenvalues.
In this paper, we propose modeling a two-layer binary weight neural network using a model with continuous parameters. Specifically, we assume the binary weights are drawn from a Bernoulli distribution whose parameters (the means of the weights) are trainable. We propose a quasi neural network, which has the same structure as a vanilla neural network but a different activation function, and prove that one can analytically approximate the expected output of the binary weight neural network with this quasi neural network. Using this model, our main contributions are as follows:
• Under the overparameterized regime, we prove that the gradient computed by the BinaryConnect algorithm is approximately an unbiased estimator of the gradient of the quasi neural network; hence the quasi neural network can model the training dynamics of a binary weight neural network.
• We study the NTK of two-layer binary weight neural networks through the quasi neural network, and show that the eigenvalues of this kernel decay at an exponential rate, in contrast with the polynomial rate of a ReLU neural network (Chen and Xu, 2020; Geifman et al., 2020). We reveal the similarity between the reproducing kernel Hilbert space (RKHS) of this kernel and that of the Gaussian kernel, and show that it is a strict subset of the RKHS of the NTK of a ReLU neural network. This indicates that the model capacity of a binary weight neural network is smaller than that of its real-weight counterpart, and explains the higher training error and lower generalization gap observed empirically.
2 Related work
Quantized neural networks.
There is a large body of work that focuses on training neural networks with quantized weights (Marchesi et al., 1993; Hubara et al., 2017; Gupta et al., 2015; Liang et al., 2021; Chu et al., 2021), including radically quantizing the weights to binary (Courbariaux et al., 2016; Rastegari et al., 2016) or ternary (Alemdar et al., 2017) values, which often comes at a mild cost in the model's predictive accuracy. Despite all these empirical works, the theoretical analysis of quantized neural networks and their convergence is not well studied. Many researchers believe that quantization adds noise to the model, which serves as an implicit regularizer and makes neural networks generalize better (Hubara et al., 2016; Courbariaux et al., 2015), but this statement is intuitive and has never been formally proved to the best of our knowledge. One may argue that binary weight neural networks have a smaller parameter space than their real-weight counterparts, yet Ding et al. (2018) showed that a quantized ReLU neural network with enough parameters can approximate any ReLU neural network with arbitrary precision. These seemingly contradictory results motivate us to find another way to explain the stronger generalization ability observed empirically.
Theory of deep learning and NTK.
A notable recent technique in developing the theory of neural networks is the neural tangent kernel (NTK) (Jacot et al., 2018). It draws a connection between an over-parameterized neural network and kernel learning. This makes it possible to study the generalization of overparameterized neural networks using more mature theoretical tools from kernel learning (Bordelon et al., 2020; Simon et al., 2021).
The expressive power of kernel learning is determined by the RKHS of the kernel. Much research has been done to identify this RKHS. Bach (2017); Bietti and Mairal (2019) studied the spectral properties of the NTK of a two-layer neural network without bias. Geifman et al. (2020) further studied the NTK with bias and showed that the RKHS of two-layer neural networks contains the same set of functions as the RKHS of the Laplacian kernel. Chen and Xu (2020) extended this result to neural networks with an arbitrary number of layers and showed that their RKHS is equivalent to that of the Laplacian kernel. All these works consider neural networks with real weights; to the best of our knowledge, we are the first to study the NTK and generalization of binary weight neural networks.
3 Preliminary
3.1 Neural tangent kernel
It has been found that an overparameterized neural network has many local minima, and that most of them are almost as good as the global minimum (Soudry and Carmon, 2016). As a result, in the training process, the model parameters often do not need to move far from the initialization point before reaching a local minimum (Du et al., 2018, 2019; Li and Liang, 2018). This phenomenon is also known as lazy training (Chizat et al., 2018). It allows one to approximate a neural network with a model that is nonlinear in its input and linear in its parameters. Using the connection between feature maps and kernel learning, the optimization problem reduces to a kernel optimization problem. A more detailed explanation is given below.
Denote by $\theta$ the collection of all the parameters in a neural network before an iteration, and by $\theta'$ the parameters after this iteration. Let $\mathcal{D}$ denote a fixed distribution over the input space; in this paper, it is the discrete distribution induced by the training dataset. Using a Taylor expansion, for any testing data $x$, let the stepsize be $\eta$; the first-order update rule of gradient descent can be written as ($\ell$ is the differentiable loss function, and the label is omitted)
$$f(x;\theta') \approx f(x;\theta) - \eta\, \mathbb{E}_{x'\sim\mathcal{D}}\big[\,\ell'(f(x';\theta))\,\langle \nabla_\theta f(x;\theta),\, \nabla_\theta f(x';\theta)\rangle\,\big].$$
This indicates that the learning dynamics of an overparameterized neural network are equivalent to kernel learning with the kernel defined as
$$K(x, x') = \langle \nabla_\theta f(x;\theta),\, \nabla_\theta f(x';\theta)\rangle,$$
which is called the neural tangent kernel (NTK). As the width of the hidden layers in this neural network tends to infinity, this kernel converges to its expectation over the random initialization of $\theta$ (Jacot et al., 2018).
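As a concrete illustration (not from the paper; the architecture and data below are arbitrary), the empirical NTK can be formed as the Gram matrix of parameter gradients using automatic differentiation:

```python
# A sketch of the empirical NTK of a small real-valued ReLU network (arbitrary sizes),
# computed as the Gram matrix of parameter gradients via PyTorch autograd.
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(
    torch.nn.Linear(8, 512), torch.nn.ReLU(), torch.nn.Linear(512, 1)
)
params = [p for p in net.parameters() if p.requires_grad]

def grad_vector(x):
    """Flattened gradient of the scalar output f(x; theta) w.r.t. all parameters."""
    out = net(x.unsqueeze(0)).squeeze()
    grads = torch.autograd.grad(out, params)
    return torch.cat([g.reshape(-1) for g in grads])

X = torch.nn.functional.normalize(torch.randn(16, 8), dim=1)  # inputs on the unit sphere
G = torch.stack([grad_vector(x) for x in X])                  # one gradient row per sample
ntk = G @ G.T                  # K(x_i, x_j) = <grad f(x_i), grad f(x_j)>
print(torch.linalg.eigvalsh(ntk)[-3:])  # a few leading eigenvalues of the empirical NTK
```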
3.2 Exponential kernel
A common class of kernel functions used in machine learning is the exponential kernel, a radial basis function kernel of the general form
$$K(x, x') = \exp\!\big(-\|x - x'\|_2^{\gamma} / \sigma\big),$$
where $\sigma > 0$ and $\gamma$ are constants. When $\gamma = 1$, this kernel is known as the Laplacian kernel, and when $\gamma = 2$, it is known as the Gaussian kernel.
According to the Moore-Aronszajn theorem, every symmetric positive definite kernel uniquely induces a reproducing kernel Hilbert space (RKHS). The RKHS determines the functions that can be learned using a kernel. It has been found that the RKHS of the NTK of a ReLU neural network is the same as that of the Laplacian kernel (Geifman et al., 2020; Chen and Xu, 2020), and the empirical performance of a neural network is close to that of kernelized linear classifiers with exponential kernels on many datasets (Geifman et al., 2020).
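For reference, a minimal sketch of the exponential kernel family, with `sigma` and `gamma` playing the role of the constants above (the helper name is my own):

```python
# A sketch of the exponential kernel family: gamma = 1 gives the Laplacian kernel,
# gamma = 2 gives the Gaussian kernel (sigma is the bandwidth constant).
import numpy as np

def exp_kernel(X, Y, sigma=1.0, gamma=2.0):
    """K(x, y) = exp(-||x - y||^gamma / sigma) for all pairs of rows of X and Y."""
    dist = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    return np.exp(-(dist ** gamma) / sigma)

X = np.random.default_rng(0).standard_normal((5, 3))
K_laplace = exp_kernel(X, X, gamma=1.0)
K_gauss = exp_kernel(X, X, gamma=2.0)
```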
3.3 Training neural networks with quantized weights
Among the various methods to train a neural network with quantized weights, BinaryConnect (BC) (Courbariaux et al., 2015) is often one of the most efficient and accurate. The key idea is to introduce a real-valued buffer and use it to accumulate the gradients. The weights are quantized just before forward and backward propagation, which benefits from the reduced computing complexity. The update rule in each iteration is
$$W_b = Q(W_r), \qquad W_r' = W_r - \eta\, \nabla_{W_b}\, \ell\big(f(x; W_b)\big), \tag{1}$$
where $Q(\cdot)$ denotes the quantization function discussed in Section 4.2, $W_b$ denotes the binary (or quantized) weights, $W_r$ and $W_r'$ denote the real-valued buffer before and after the iteration respectively, $\eta$ is the learning rate, and $f(\cdot\,; W_b)$ denotes the neural network with parameters $W_b$. Here the gradients are computed by treating $W_b$ as if they were real numbers. The detailed algorithm can be found in Section A.
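A minimal sketch of one BinaryConnect update on a toy linear model with MSE loss is shown below; the function names and the one-layer setup are my own, and the stochastic quantizer assumes the Bernoulli form introduced in Section 4.2:

```python
# A sketch of one BinaryConnect update (my own toy setup, not the paper's code):
# quantize the real-valued buffer, backpropagate through the binary weights,
# and accumulate the gradient into the buffer.
import torch

def stochastic_binarize(w_r):
    """Sample w_b in {-1, +1} with P(w_b = 1) = (1 + w_r) / 2, assuming w_r in [-1, 1]."""
    prob_one = (w_r.clamp(-1, 1) + 1) / 2
    rand = torch.rand_like(w_r)
    return torch.where(rand < prob_one, torch.ones_like(w_r), -torch.ones_like(w_r))

def binaryconnect_step(w_r, x, y, lr=0.1):
    w_b = stochastic_binarize(w_r).requires_grad_(True)  # forward/backward use binary weights
    loss = ((x @ w_b - y) ** 2).mean()                   # toy linear model with MSE loss
    loss.backward()
    with torch.no_grad():
        w_r -= lr * w_b.grad                             # accumulate into the real buffer
        w_r.clamp_(-1, 1)                                # keep the buffer in the valid range
    return loss.item()

w_r = torch.zeros(4)                                     # real-valued buffer
x, y = torch.randn(32, 4), torch.randn(32)
for _ in range(100):
    binaryconnect_step(w_r, x, y)
```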
4 Approximation of binary weight neural network
4.1 Notations
In this paper, we use to denote the binary weights in the -th layer, to denote its real-valued counterpart, and to denote the (real valued) bias. is the collection of all the real-valued model parameters which will be specified in Section 4.2. The number of neurons in the -th hidden layer is , the input to the -th linear layer is and the output is . denote the number of input features. Besides, we use to denote the input to this neural network, to denote the output and to denote the label.
We focus on the mean and variance under the randomness of stochastic rounding. Denote
We use to denote ReLU activation function, and to denote the (discrete) distribution of training dataset. denotes the expectation over training dataset, or “sample average”. We use bold symbol to denote a collection of parameters or variables .
4.2 Problem statement
In this work, we focus on stochastic quantization (Dong et al., 2017), which often yields higher accuracy empirically than deterministic rounding (Courbariaux et al., 2015). It also creates a smooth connection between the binary weights of a neural network and its real-valued parameters.
Let $w_b = Q(w_r)$ be a binary weight produced by the stochastic quantization function, which follows a Bernoulli distribution:
$$\mathbb{P}(w_b = 1) = \frac{1 + w_r}{2}, \qquad \mathbb{P}(w_b = -1) = \frac{1 - w_r}{2}. \tag{2}$$
This relationship leads to $\mathbb{E}[w_b \mid w_r] = w_r$.
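A quick Monte Carlo check (my own illustration) that the stochastic quantization in (2) is unbiased:

```python
# A Monte Carlo check that stochastic quantization as in (2) is unbiased:
# averaging many binary samples recovers the real-valued weight.
import numpy as np

rng = np.random.default_rng(0)
w_r = 0.3  # a real-valued weight in [-1, 1]
samples = rng.choice([1.0, -1.0], size=100_000, p=[(1 + w_r) / 2, (1 - w_r) / 2])
print(samples.mean())  # approximately 0.3, i.e. E[w_b] = w_r
```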
We focus on a ReLU neural network with one hidden layer and two fully connected layers, which was also studied in Bach (2017); Bietti and Mairal (2019), except for the quantization. In addition, we add a linear layer (the "additional layer") in front of this neural network to project the input into an infinite-dimensional space. We randomly initialize the weights in this layer and leave them fixed (not trainable) throughout the training process. Furthermore, we quantize the weights in the first fully connected layer, add a real-valued buffer which determines the distribution of the binary weights as in (2), and leave the second layer unquantized. It is common practice to leave the last layer unquantized, because this often leads to better empirical performance. If the second layer were quantized as well, the main result of this paper would not change; this can be checked by extending Lemma 4 to the second layer.
Remark 1.
In many real applications, e.g., computer vision, the dimension of the data is often very large (), while the data lie in a lower-dimensional linear subspace, so we can take the raw input in these applications as the output of the additional layer; the NN in this case is a two-layer NN whose first layer is quantized.
The set of all the real-valued parameters is . The neural network can be expressed as
We follow the typical setting of NTK papers (Geifman et al., 2020) in initializing all the parameters except the quantized ones. As for the quantized parameters, we only need to specify the real-valued buffer of the weights in the first layer .
Assumption 1.
We randomly initialize the weights in the "additional layer" and the second linear layer independently as , and initialize all the biases to 0. The real-valued buffers of the weights are initialized i.i.d. with zero mean, variance , and bounded in .
Remark 2.
Our theory applies to any initial distribution of as long as it satisfies the constraint above. One simple example is the uniform distribution in , which has variance .
4.3 Quasi neural network
Given a fixed input and real-valued model parameters , under the randomness of stochastic rounding, the output of this binary weight neural network is a random variable. Furthermore, as the width of this neural network tends to infinity, the output of a linear layer tends to a Gaussian distribution according to the central limit theorem (CLT). We propose a method to determine the distribution of the outputs and using the model parameters. Specifically, we give a closed-form equation to compute the mean and variance of the output of every layer , and then marginalize over the random initialization of to further simplify this equation. We prove that converges to a constant almost surely using the law of large numbers (LLN), and simplify the expression by replacing it with that constant. This allows us to compute using a neural-network-style function for a given . We call this function the quasi neural network, which is given below:
(3)
In Section 4.3.1, we study the distribution of the output of each layer in a binary weight neural network (BWNN) conditioned on the set of real-valued parameters . In Section 4.3.2, we prove that the conditional variance of the output of the first linear layer studied above converges almost surely to a constant which does not depend on the data (input). This simplifies the expression computed in Section 4.3.1 to the form of the quasi neural network (3), and also gives a closed-form expression for in (3). In Section 4.3.3, we prove that, conditioned on the set of real-valued parameters, the expectation of the gradients of the BWNN equals the gradient of the quasi neural network in the overparameterization limit. This indicates that the training dynamics of the BWNN at initialization are the same as those of training the quasi neural network directly. The training dynamics beyond initialization are discussed in Section 4.3.4. Before jumping to the proofs, we make the following assumptions:
Assumption 2.
After training the binary weight neural network as in (1), all the real-valued weights stay in the range .
Based on this assumption, we can ignore the constraint that and the projected gradient descent reduces to plain gradient descent. Because of the lazy training property of the overparameterized neural network, the model parameters stay close to the initialization point during the training process, so this assumption can be satisfied by initializing with smaller absolute values and/or applying weight decay during training. On the other hand, a common trick in quantized neural networks is to change the quantization level gradually during the training process to avoid (or reduce) overflow. With this trick, Assumption 2 is often naturally satisfied, but it introduces the quantization level as a trainable parameter.
Assumption 3.
The Euclidean norm of the input is 1: $\|x\|_2 = 1$.
This is a common assumption in studying NTK (Bach, 2017; Bietti and Mairal, 2019), and can be satisfied by normalizing the input data.
4.3.1 Conditioned distribution of the outputs of each layer
First we recognize that, as the model parameters are initialized randomly, there are "bad" initializations that would break our analysis, for example, all of being initialized to 1 (or -1) even though they are drawn from a uniform distribution. Fortunately, as the width grows to infinity, the probability of such a "bad" initialization is 0. We make this statement formal in the following part.
Definition 4.
"Good Initialization". We call a set of parameters a "Good Initialization" if it satisfies:
•
•
•
for some finite , ,
where
Proposition 3.
Under the assumption that all the parameters are initialized as in Assumption 1, the probability of getting a "Good Initialization" is 1:
Lemma 4.
In the limit , for any given conditioned on any fixed , and for any , the distribution of converges to a Gaussian distribution with mean and variance which can be computed by:
(4)
This lemma can be proved by the Lyapunov central limit theorem and linearity of expectation. See Section B.2 for the details.
Lemma 5.
Assume that the input to a ReLU layer satisfies a Gaussian distribution with mean and variance conditioned on ,
Denote
(5)
where denotes the standard Gaussian density function and denotes its integral:
Then the output has mean and variance conditioned on , with
(6)
From Lemma 4 we know that in the limit , conditioned on and , for any , converges to a Gaussian distribution. By the continuous mapping theorem, the distribution of converges to that shown in Lemma 5, so its mean and variance converge to those computed in Lemma 5.
Equations (4) and (6) provide a method to calculate the mean and variance of the output conditioned on the input and the real-valued model parameters, and allow us to provide a closed-form equation for the quasi neural network. We will simplify this equation in Section 4.3.2.
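As a concrete illustration of this propagation step, the sketch below uses the standard Gaussian-moment formulas for the ReLU (my own rendering; the paper's exact normalization and symbols may differ) and verifies them against Monte Carlo sampling:

```python
# Mean/variance propagation of a Gaussian pre-activation z ~ N(mu, var) through ReLU,
# using the standard Gaussian-moment formulas (the paper's normalization may differ),
# checked against Monte Carlo sampling.
import numpy as np
from scipy.stats import norm

def relu_moments(mu, var):
    """Mean and variance of max(z, 0) for z ~ N(mu, var)."""
    s = np.sqrt(var)
    mean = mu * norm.cdf(mu / s) + s * norm.pdf(mu / s)
    second = (mu ** 2 + var) * norm.cdf(mu / s) + mu * s * norm.pdf(mu / s)
    return mean, second - mean ** 2

rng = np.random.default_rng(0)
z = rng.normal(0.5, 1.2, size=1_000_000)
r = np.maximum(z, 0.0)
print(relu_moments(0.5, 1.2 ** 2))  # closed form
print(r.mean(), r.var())            # Monte Carlo estimate, should be close
```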
4.3.2 Convergence of conditioned variance
In this part, we assume that the model parameters form a "Good Initialization", which holds almost surely in the limit as proven in Proposition 3, and study the distribution of and .
Theorem 6.
If the parameters form a "Good Initialization", then in the limit , for any finite , converges to a Gaussian distribution, and these are independent of each other; converges a.s. to
With this approximation, we can replace the variance in equation (6) with and leave the mean of the output of the linear layer as the only variable in the quasi neural network. A formal proof can be found in Section B.4. Note that the propagation function of the linear layer (the first equation in (4)) is also a linear function of and . This motivates us to compute using a neural-network-like function as given in (3), where is
(7)
This equation gives a closed-form connection between the mean of the output of the neural network and the real-valued model parameters , and allows us to apply existing tools for analyzing neural networks with real-valued weights to the analysis of binary weight neural networks. Its derivative, in the sense of calculus, is:
(8)
The proof of this derivative can be found in Section B.5.
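The following sketch illustrates the quasi activation under my assumption that it takes the Gaussian-smoothed-ReLU form of Lemma 5 evaluated at a constant variance (the placeholder `G` below), together with a finite-difference check that its derivative is the Gaussian CDF, matching (8):

```python
# The assumed quasi activation: the Gaussian-smoothed ReLU of Lemma 5 evaluated at a
# constant variance G (placeholder value), and a finite-difference check that its
# derivative is the Gaussian CDF, as in (8).
import numpy as np
from scipy.stats import norm

G = 0.5  # placeholder for the constant limiting variance from Theorem 6

def quasi_activation(m, G=G):
    s = np.sqrt(G)
    return m * norm.cdf(m / s) + s * norm.pdf(m / s)

def quasi_activation_grad(m, G=G):
    return norm.cdf(m / np.sqrt(G))

m = np.linspace(-3.0, 3.0, 7)
eps = 1e-5
fd = (quasi_activation(m + eps) - quasi_activation(m - eps)) / (2 * eps)
print(np.max(np.abs(fd - quasi_activation_grad(m))))  # ~1e-10: derivative is Phi(m / sqrt(G))
```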
4.3.3 Gradient of quasi neural network
In this part, we compute the gradients using binary weights as in the BinaryConnect algorithm, and make sense of the gradient in (8) by proving that it is the expectation of the gradients under the randomness of stochastic rounding.
Theorem 7.
The expectation of the gradient of the output with respect to the weights, computed by sampling the quantized weights, equals the gradient of the "quasi neural network" defined in (3) in the limit ,
Theorem 8.
For MSE loss, where is the ground-truth label, the gradient of the loss converges in the limit ,
In other words, the BinaryConnect algorithm provides an unbiased estimator of the gradient of the quasi neural network in this limit of overparameterization.
Theorem 6 and Theorem 8 imply that, for an infinitely wide neural network, the BinaryConnect algorithm is equivalent to training the quasi neural network directly with stochastic gradient descent (SGD). Furthermore, this characterizes the gradient flow of the BinaryConnect algorithm and allows us to study the training process with the neural tangent kernel (NTK).
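The mechanism behind Theorem 7 can be sanity-checked numerically: under stochastic rounding, the expected ReLU gate that enters the BinaryConnect gradient should match the Gaussian CDF appearing in (8). The sketch below is my own illustration and assumes a 1/sqrt(d) scaling of the first layer:

```python
# Sanity check of the mechanism behind Theorem 7 (my own illustration): under stochastic
# rounding, the expected ReLU gate E[1{z_j > 0}] entering the BinaryConnect gradient is
# close to Phi(mu_j / s_j), the derivative (8) used by the quasi network.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
d = 64
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
w_r = rng.uniform(-1, 1, size=(5, d))               # five example hidden units
mu = w_r @ x / np.sqrt(d)                           # mean of the pre-activation
s = np.sqrt((1 - w_r ** 2) @ (x ** 2) / d)          # its std under stochastic rounding

w_b = np.where(rng.random((20_000, 5, d)) < (1 + w_r) / 2, 1.0, -1.0)
z = np.einsum('kjd,d->kj', w_b, x) / np.sqrt(d)
print((z > 0).mean(axis=0))   # Monte Carlo estimate of P(z_j > 0)
print(norm.cdf(mu / s))       # Gaussian-CDF prediction from the quasi network gradient
```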
4.3.4 Asymptotics during training
So far we have studied the distribution of the output at initialization. To study the dynamics of the binary weight neural network during training, one needs to extend these results to the parameters encountered at any point during training. Fortunately, motivated by Jacot et al. (2018), we can prove that as , the model parameters stay asymptotically close to initialization for any finite , the so-called "lazy training", so the above results apply to the entire training process.
Lemma 9.
For all such that stays stochastically bounded, as , are all stochastically bounded, and is stochastically bounded for all .
The proof can be found in Section B.8. Note that ; this result indicates that as , the variation of the parameters is much smaller than the initialization, the so-called "lazy training". Making use of this result, we further obtain the following:
Lemma 10.
Under the condition of Lemma 9, Lyapunov's condition holds for all , so converges to a Gaussian distribution conditioned on the model parameters . Furthermore, , which equals almost surely.
The proof can be found in Section B.9. This result shows that the analysis in Section 4.3 applies to the entire training process, and allows us to study the dynamics of binary weight neural network using quasi neural network.
5 Capacity of Binary Weight Neural Network
As has been found in Jacot et al. (2018), the dynamics of an overparameterized neural network trained with SGD are equivalent to kernel gradient descent where the kernel is the NTK. As a result, the effective capacity of a neural network is characterized by the RKHS of its NTK. In the following, we study the NTK of the binary weight neural network using the approximation above and compare it with the Gaussian kernel.
5.1 NTK of three-layer binary weight neural networks
We consider the NTK of the binary weight neural network by studying the "quasi neural network" defined as
(9)
where denotes all the trainable parameters. We omitted the terms related to (which is a constant) in this equation.
We first prove that the change of the kernel asymptotically converges to 0 during the training process.
Theorem 11.
Under the condition of Lemma 9, at rate for any .
The proof can be found in Section C.3. Using Assumption 3, we confine the input to the hypersphere . One can easily verify that the kernel is positive definite, so we can apply Mercer's decomposition (Minh et al., 2006) to it.
To find the basis functions and eigenvalues of this kernel, we apply a spherical harmonics decomposition, which is common in the study of NTK (Bach, 2017; Bietti and Mairal, 2019):
$$K(x, \tilde{x}) = \sum_{k=0}^{\infty} \mu_k \sum_{j=1}^{N(d,k)} Y_{k,j}(x)\, Y_{k,j}(\tilde{x}), \tag{10}$$
where $d$ denotes the dimension of $x$ and $\tilde{x}$, $Y_{k,j}$ denotes the spherical harmonics of order $k$, and $N(d,k)$ is the number of harmonics of that order. This suggests that the NTK of the binary weight neural network and the exponential kernel can be spanned by the same set of basis functions. The key question is the decay rate of $\mu_k$ with $k$.
Theorem 12.
NTK of a binary weight neural network can be decomposed using (10). If , then
(11)
where and denote polynomials of , and is a constant.
In contrast, Geifman et al. (2020) shows that for NTK in the continuous space, it holds that
with constants and . Because its decay rate is slower than that of the binary weight neural network, its RKHS covers a strict superset of functions (Geifman et al., 2020).
Proof Sketch: We first compute the NTK of the quasi neural network, which depends on the distribution of . As shown in Theorem 6, converges to a Gaussian distribution in the limit of an infinitely wide neural network. To find the joint distribution of and given two arbitrary inputs , we combine the first linear layer in the quasi neural network with the "additional layer" in front of it (the first two equations in (3)). This allows us to reparameterize as
where denotes the fused weight. A key component in computing the NTK has the form
The second equation comes from the law of total expectation. We use the 2-norm in this expression. The inner expectation is equivalent to an integration on a sphere, and can be computed by applying a spherical harmonics decomposition to . The squared norm of the fused weight follows a Chi-square distribution, and we use its moment generating function to finish the computation.
5.2 Comparison with Gaussian Kernel
Even if the input to a neural network is constrained to the unit sphere, the first linear layer (together with the additional linear layer in front of it) will project it to the entire space with a Gaussian distribution. In order to simulate this, we define a kernel by randomizing the scale of and before feeding them into a Gaussian kernel.
where is a Gaussian kernel and follows a Chi distribution with degrees of freedom. This scaling factor maps a random vector uniformly distributed on the unit sphere to a Gaussian-distributed one. The corresponding eigenvalues satisfy
(12)
where are constants that depend on . The dominant term in both (11) and (12) has an exponential decay rate , which suggests the similarity between the NTK of the binary weight neural network and the Gaussian kernel. In comparison, Bietti and Mairal (2019); Geifman et al. (2020) showed that the eigenvalues of the NTK decay at rate , which is slower than that of the binary weight neural network or the Gaussian kernel. Furthermore, Aronszajn's inclusion theorem implies , where denotes the NTK of the real-valued weight neural network. In other words, the expressive power of the binary weight neural network is weaker than that of its real-valued counterpart in the limit that the width goes to infinity. Binary weight neural networks are less vulnerable to noise thanks to this smaller expressive power, at the expense of failing to learn some "high frequency" components of the target function. This explains why binary weight neural networks often achieve lower training accuracy and a smaller generalization gap compared with real-weight neural networks.
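The contrast in decay rates can also be observed numerically; the sketch below (my own illustration, not the paper's experiment) compares the Gram-matrix spectra of a Gaussian and a Laplacian kernel on points drawn uniformly from the unit sphere:

```python
# Gram-matrix spectra of the Gaussian and Laplacian kernels on points drawn uniformly
# from the unit sphere: the Gaussian spectrum drops off much faster, mirroring the
# exponential vs. polynomial eigenvalue decay discussed above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # points on the unit sphere
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
eig_gauss = np.linalg.eigvalsh(np.exp(-D ** 2))[::-1] / 500
eig_laplace = np.linalg.eigvalsh(np.exp(-D))[::-1] / 500
print(eig_gauss[:10])    # decays quickly
print(eig_laplace[:10])  # decays more slowly
```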
6 Numerical result
6.1 Quasi neural network
In this part, we empirically verify the approximation of the quasi neural network by comparing its inference results with those obtained by Monte Carlo. The architecture is the same as that described in Section 4.2, with 1600 hidden neurons. We train this neural network on the MNIST dataset (LeCun et al., 1998) by directly applying gradient descent to the quasi neural network. To reduce overflow, we add weight decay of 0.001 during training. Figure 2(a)(b) shows the histogram of the output under stochastic rounding before and after training. We arbitrarily choose one input sample from the testing set and draw 1000 samples under different stochastic roundings. This result supports our statement that the distribution of the pre-activation (output of the linear layer) conditioned on the real-valued model parameters converges to a Gaussian distribution. Figure 2(c)(d) compares the mean output computed by the quasi neural network approximation (horizontal axis) with that computed by Monte Carlo (vertical axis). The alignment further supports our method of approximating the binary weight neural network with the quasi neural network.
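For reference, a small-scale version of this check can be scripted as follows; the architecture, the 1/sqrt(d) scaling, and the per-weight variance Var[w_b] = 1 − w_r² are my own assumptions rather than the paper's exact configuration:

```python
# A small-scale version of the Section 6.1 check: the mean output of a binary weight
# network under stochastic rounding (Monte Carlo) vs. the quasi network prediction using
# the Gaussian-smoothed ReLU. Architecture, scaling, and Var[w_b] = 1 - w_r^2 are my
# own assumptions, not the paper's exact configuration.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
d, n_hidden = 32, 1600
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                                 # input on the unit sphere
w_r = rng.uniform(-1, 1, size=(n_hidden, d))           # real-valued buffer of layer 1
a = rng.standard_normal(n_hidden) / np.sqrt(n_hidden)  # real-valued second layer

def sample_output():
    w_b = np.where(rng.random(w_r.shape) < (1 + w_r) / 2, 1.0, -1.0)
    return a @ np.maximum(w_b @ x / np.sqrt(d), 0.0)

mc_mean = np.mean([sample_output() for _ in range(2000)])

mu = w_r @ x / np.sqrt(d)                              # conditional mean of pre-activations
s = np.sqrt((1 - w_r ** 2) @ (x ** 2) / d)             # conditional std under rounding
quasi_mean = a @ (mu * norm.cdf(mu / s) + s * norm.pdf(mu / s))
print(mc_mean, quasi_mean)                             # should be close
```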
6.2 Generalization gap
6.2.1 Toy dataset
We compare the performance of the neural network with and without binary weights and of kernel learning on the same set of 90 small-scale UCI datasets with less than 5000 data points as in Geifman et al. (2020); Arora et al. (2019b). We report the training accuracy and testing accuracy of both the vanilla neural network (NN) and the binary weight neural network (BWNN) in Figure 3. To further illustrate the difference, we list the paired t-test results of the neural network (NN) against the binary weight neural network (BWNN), and the Gaussian kernel (Gaussian) against the Laplacian kernel (Laplace), in Table 1. In this table, t-stats and p-val denote the t-statistic and two-sided p-value of the paired t-test between the two classifiers, and the "lower"/"higher" columns denote the percentage of datasets on which the first classifier gets lower or higher testing accuracy or generalization gap (training accuracy − testing accuracy), respectively.
As can be seen from the results, although the Laplacian kernel gets higher training accuracy than the Gaussian kernel, its testing accuracy is almost the same as that of the latter. In other words, the Gaussian kernel has a smaller generalization gap than the Laplacian kernel, which can also be observed in Table 1. Similarly, the vanilla neural network gets higher training accuracy than the binary weight neural network but similar testing accuracy.
| Classifier | Testing t-stats | Testing p-val | Testing lower | Testing higher | Training−Testing t-stats | Training−Testing p-val | Training−Testing lower | Training−Testing higher |
|---|---|---|---|---|---|---|---|---|
| NN-BWNN | 0.7471 | 0.4569 | 53.33% | 41.11% | 4.034 | 0.000 | 26.67% | 67.77% |
| Laplace-Gaussian | 0.4274 | 0.6701 | 51.11% | 33.33% | 3.280 | 0.001 | 37.78% | 53.33% |
6.2.2 MNIST-like dataset
We compare the performance of neural networks with binary weights (Binary) against their counterparts with real-valued weights (Real). We treat the number of training samples as a parameter by randomly subsampling the training set, and use the original test set for testing. The experiments are repeated 10 times and the mean and standard deviation are shown in Figure 4. On the MNIST dataset, the performance of the neural network with and without quantization is similar. This is because MNIST (LeCun et al., 1998) is simpler and less vulnerable to overfitting. On the other hand, the generalization gap with quantized weights is much smaller than without them on the FashionMNIST (Xiao et al., 2017) dataset, which matches our prediction.
7 Discussion
In this paper, we propose a quasi neural network to approximate the binary weight neural network. The parameter space of the quasi neural network is continuous, and its gradient can be approximated using the BinaryConnect algorithm. We study the expressive power of the binary weight neural network by studying the RKHS of its NTK and show its similarity with the Gaussian kernel. We empirically verify that quantizing the weights can reduce the generalization gap, similar to the Gaussian kernel versus the Laplacian kernel. This result can be easily generalized to neural networks with other weight quantization methods, e.g., using more bits. Yet there are several questions to be answered by future work:
1. In this work, we only quantize the weights, while much empirical work quantizes both the weights and the activations. Can a similar technique be used to study the expressive power of such neural networks?
2. We study the NTK of a two-layer neural network with one additional linear layer in front of it, where only the weights in the first layer are quantized. It remains to be answered whether a multi-layer neural network admits a similar approximation, and whether using more layers can increase its expressive power.
8 Acknowledgement
The authors would like to thank Chunfeng Cui for helpful discussion about the idea of quasi neural network and effort in proofreading the paper, as well as Yao Xuan for helpful discussion about kernel and RKHS.
References
- Alemdar et al. (2017) H. Alemdar, V. Leroy, A. Prost-Boucle, and F. Pétrot. Ternary neural networks for resource-efficient AI applications. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 2547–2554. IEEE, 2017.
- Allen-Zhu et al. (2018) Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018.
- Arora et al. (2019a) S. Arora, S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pages 322–332. PMLR, 2019a.
- Arora et al. (2019b) S. Arora, S. S. Du, Z. Li, R. Salakhutdinov, R. Wang, and D. Yu. Harnessing the power of infinitely wide deep nets on small-data tasks. arXiv preprint arXiv:1910.01663, 2019b.
- Bach (2017) F. Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629–681, 2017.
- Bietti and Mairal (2019) A. Bietti and J. Mairal. On the inductive bias of neural tangent kernels. In Advances in Neural Information Processing Systems, pages 12893–12904, 2019.
- Blumer et al. (1989) A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989.
- Bordelon et al. (2020) B. Bordelon, A. Canatar, and C. Pehlevan. Spectrum dependent learning curves in kernel regression and wide neural networks. arXiv preprint arXiv:2002.02561, 2020.
- Chen and Xu (2020) L. Chen and S. Xu. Deep neural tangent kernel and Laplace kernel have the same RKHS. arXiv preprint arXiv:2009.10683, 2020.
- Chizat et al. (2018) L. Chizat, E. Oyallon, and F. Bach. On lazy training in differentiable programming. arXiv preprint arXiv:1812.07956, 2018.
- Choromanska et al. (2015) A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surfaces of multilayer networks. In Artificial intelligence and statistics, pages 192–204. PMLR, 2015.
- Chu et al. (2021) T. Chu, Q. Luo, J. Yang, and X. Huang. Mixed-precision quantized neural networks with progressively decreasing bitwidth. Pattern Recognition, 111:107647, 2021.
- Courbariaux et al. (2015) M. Courbariaux, Y. Bengio, and J.-P. David. BinaryConnect: Training deep neural networks with binary weights during propagations. Advances in neural information processing systems, 28:3123–3131, 2015.
- Courbariaux et al. (2016) M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
- Ding et al. (2018) Y. Ding, J. Liu, J. Xiong, and Y. Shi. On the universal approximability and complexity bounds of quantized ReLU neural networks. arXiv preprint arXiv:1802.03646, 2018.
- Dong et al. (2017) Y. Dong, R. Ni, J. Li, Y. Chen, J. Zhu, and H. Su. Learning accurate low-bit deep neural networks with stochastic quantization. arXiv preprint arXiv:1708.01001, 2017.
- Du et al. (2019) S. Du, J. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pages 1675–1685. PMLR, 2019.
- Du et al. (2018) S. S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2018.
- Geifman et al. (2020) A. Geifman, A. Yadav, Y. Kasten, M. Galun, D. Jacobs, and R. Basri. On the similarity between the Laplace and neural tangent kernels. arXiv preprint arXiv:2007.01580, 2020.
- Gupta et al. (2015) S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, pages 1737–1746, 2015.
- He and Tao (2020) F. He and D. Tao. Recent advances in deep learning theory. arXiv preprint arXiv:2012.10931, 2020.
- He et al. (2020) F. He, B. Wang, and D. Tao. Piecewise linear activations substantially shape the loss surfaces of neural networks. arXiv preprint arXiv:2003.12236, 2020.
- Hubara et al. (2016) I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks. In Proceedings of the 30th international conference on neural information processing systems, pages 4114–4122. Citeseer, 2016.
- Hubara et al. (2017) I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869–6898, 2017.
- Jacot et al. (2018) A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.
- LeCun et al. (1998) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Li et al. (2018) D. Li, T. Ding, and R. Sun. Over-parameterized deep neural networks have no strict local minima for any continuous activations. arXiv preprint arXiv:1812.11039, 2018.
- Li and Liang (2018) Y. Li and Y. Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. arXiv preprint arXiv:1808.01204, 2018.
- Liang et al. (2021) T. Liang, J. Glossner, L. Wang, S. Shi, and X. Zhang. Pruning and quantization for deep neural network acceleration: A survey. Neurocomputing, 461:370–403, 2021.
- Marchesi et al. (1993) M. Marchesi, G. Orlandi, F. Piazza, and A. Uncini. Fast neural networks without multipliers. IEEE transactions on Neural Networks, 4(1):53–62, 1993.
- Minh et al. (2006) H. Q. Minh, P. Niyogi, and Y. Yao. Mercer’s theorem, feature maps, and smoothing. In International Conference on Computational Learning Theory, pages 154–168. Springer, 2006.
- Rastegari et al. (2016) M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European conference on computer vision, pages 525–542. Springer, 2016.
- Simon et al. (2021) J. B. Simon, M. Dickens, and M. R. DeWeese. Neural tangent kernel eigenvalues accurately predict generalization. arXiv preprint arXiv:2110.03922, 2021.
- Soudry and Carmon (2016) D. Soudry and Y. Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.
- Weinan et al. (2019) E. Weinan, J. Han, and Q. Li. A mean-field optimal control formulation of deep learning. Research in the Mathematical Sciences, 6(1):1–41, 2019.
- Xiao et al. (2017) H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.
- Zhang et al. (2016) C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
Appendix A BinaryConnect Algorithm
In this section, we briefly review the BinaryConnect algorithm [Courbariaux et al., 2015], which is the key algorithm we are studying in this paper.
Appendix B Gaussian approximation in quantized neural network
B.1 Proof of Proposition 3
For the first statement, observe that fixing and taking as the variable, are independent of each other. Furthermore, . From the strong law of large numbers (SLLN), the first statement is proved. The third statement can be proved similarly, observing that .
To prove the second statement, because the geometric mean is no larger than the cubic mean,
Since , the expectation of the right-hand side equals . We apply the SLLN again to finish the proof.
B.2 Proof of Lemma 4
We first compute the conditioned mean and variance and . Notice that conditioned on , is deterministic,
The second line is because when .
Next, we need to prove that converges to a Gaussian distribution conditioned on by verifying Lyapunov's condition. Note that for any ,
Define . As mentioned above, its mean and variance (conditioned on ) are
Since , for some finite ,
(13)
The fourth equality comes from the definition of , and the fifth equality is because . The third order absolute moment can be bounded by
(14)
The last inequality comes from the definition of “Good Initialization”: for all ,
and because for all . Note that using the strong law of large numbers, one can prove that the third order absolute moment converges almost surely to a constant that does not depend on . On the other hand, we prove an upper bound for all , which is stronger than almost sure convergence.
This proves Lyapunov's condition for all "Good Initialization", so conditioned on , converges to a Gaussian distribution.
B.3 Proof of Lemma 5
To compute and , we first compute and . Recall ,
We only need to compute
For . When , this is an integral of a Gaussian function, and it is known that there is no analytic expression for it. For the sake of simplicity, we define it as
When , this integral can be solved by a simple change of variable, and we denote it as :
When , we can do integration by parts and express it using and :
Using the definition of mean and variance,
we arrive at the result.
B.4 Proof of Theorem 6
In this part, we take as the random variables and the conditional mean and variance derived above as functions of . From Eq. (4), as , tend to i.i.d. Gaussian processes, and their covariance converges almost surely to its expectation. We then focus on computing the expectation of the covariance. For any , we take the expectation over the random initialization of :
(15)
which indicates that they are independent.
B.5 Derivative of activation function in quasi neural network
Let
Its derivative is
The second line is because
B.6 Proof of Theorem 7
To make the proof more general, we treat as a parameter of the activation function in the quasi neural network, written as . To get the derivative with respect to , we first get the derivative with respect to ,
then apply the chain rule:
(16)
(17)
(18)
On the other hand, let us first write down the gradients with respect to the weights in the quantized neural network and take their expectation conditioned on :
(19)
(20)
(21)
B.7 Proof of Theorem 8
Observe that conditioned on , depends only on , and that for . Because of that, are independent of each other. Similarly, are independent of each other conditioned on . For MSE loss,
According to the chain rule
(22)
for any , which leads to
(23)
(24)
(25)
On the other hand, in the original binary weight neural network,
(26)
(27)
(28)
Note that is not independent from or , which is the main challenge of the proof. To deal with this problem, we bound the difference between (23)-(25) and (26)-(28), which requires bounding their covariance.
(29)
The second term equals and the first term converges to 0 when . Substituting into (23) and (26), we can see that these two terms are equal in the limit .
B.8 Proof of Lemma 9
In this part, we denote for , and express each time-dependent variable as a function of time . We define an inner product under the distribution of the training dataset
and the corresponding norm
If is a vector, . Note this inner product and norm define a Hilbert space (not to be confused with the RKHS induced by a kernel), so by Cauchy-Schwarz inequality,
As is shown in 4.3.3, on the limit , the dynamics of training this neural network using gradient descent can be written as:
where dot denotes the derivative with respect to . Note the activation function depends on , which makes it time dependent. One can further write down the dynamics of as
Rewrite these two differential equations in matrix form:
where denotes elementwise product and denotes outer product. Here we slightly abuse the notation , which represents an elementwise operation when applied to a vector. Their norms are bounded by
(31)
(32)
(33)
(34)
Here we make use of the fact that , regardless of the value of , that as long as , and that is not updated during training. In the last equation, we make use of
Define , then
B.9 Proof of Lemma 10
From (4), it’s easy to get the dynamics of :
Here we assume that is stochastically bounded by . Since is finite for all , it is easy to check that the term after the operator is stochastically bounded. The remaining task is to bound the term before the operator. From standard Gaussian process analysis, satisfies a Gaussian distribution. From the law of large numbers (LLN), as ,
almost surely, where the expectation is taken over , and this limit is also bounded. Because of that, as , the difference converges to 0 at rate .
Notice that the proof of Lyapunov's condition in Section B.2 does not depend on time from the third line onward. Since stochastically converges to for all finite , Lyapunov's condition holds for all , thus always converges to a Gaussian distribution conditioned on the model parameters.
Appendix C NTK of neural networks with quantized weights
C.1 Spherical harmonics
This subsection briefly reviews the relevant concepts and properties of spherical harmonics. Most of this subsection comes from Bach [2017, Appendix Section D.1] and Bietti and Mairal [2019, Appendix Section C.1].
According to Mercer’s theorem, any positive definite kernel can be decomposed as
where is called the feature map. Furthermore, any zonal kernel on the unit sphere, i.e., for any , including exponential kernels and NTK, can be decomposed using spherical harmonics (equation (10)):
Legendre polynomial. We have the addition formula
where
The polynomial is the -th Legendre polynomial in dimension , also known as the Gegenbauer polynomial:
It is even (resp. odd) when is odd (resp. even). Furthermore, they have the orthogonality property
where
denotes the surface of sphere in dimension, and this leads to the integration property
for any . is the uniform measure on the sphere.
C.2 NTK of quasi neural network
We start the proof of Theorem 12 with the following lemma:
Lemma 13.
The NTK of a binary weight neural network can be simplified as
(35)
where ,
are the pre-activations of the second layer.
Proof.
where has the same distribution as for any . We make use of the fact that , and from the central limit theorem, and converge to a joint Gaussian distribution for any fixed as
Similarly,
∎
C.3 Proof of Theorem 11
Recall that, as proved in Lemma 10, for any satisfying a mild condition, and is nonzero almost surely. Making use of the fact that is continuous with respect to , and that its first and second order derivatives are stochastically bounded, the change of kernel induced by converges to 0 as . This reduces the quasi neural network to a standard neural network with activation function , which is twice differentiable and has bounded second order derivative. From Theorem 2 in Jacot et al. [2018], the kernel during training converges to the one at initialization. For ease of reading, we restate the proof below. In the limit ,
From Lemma 10, and observing that are bounded by constants, one can verify that each summation term is stochastically bounded by , so as , converges to 0 at rate .
C.4 Spherical harmonics decomposition to activation function
Following Bach [2017], we start by studying the decomposition of the activation function in the quasi neural network (7) and its gradient (8): for arbitrary fixed , , we can decompose equations (7) and (8) as
(36)
(37)
where is the -th Legendre polynomial in dimension .
Lemma 14.
The decomposition of activation function in the quasi neural network (36) satisfies
1. if is odd,
2. if is even,
3. as when is even, where denotes a polynomial of , and is a constant.
Its gradient (37) satisfies
1. if is even,
2. if is odd,
3. as when is odd, where denotes a polynomial of , and is a constant.
Proof.
Let’s start with the derivative of activation function in quasi neural network:
where is a constant. We introduce the auxiliary parameters s.t. and let . By the Cauchy–Schwarz inequality, . Following Bach [2017], we have the following decomposition of :
where and are defined in section C.1, can be computed by
To solve this integral, we can apply a Taylor expansion to :
(38)
We first study the following polynomial integral:
When , this integral equals 0 as is orthogonal to all polynomials of degree less than . If , this integral is 0 because the integrand is an odd function. For and ( is odd), using successive integration by parts,
(39)
where is a constant that depends only on mod 2.
Following Bach [2017] and Geifman et al. [2020], we take as a constant and take to infinity. Letting , we have
where means the ratio converges to a constant which does not depend on or as . Here we introduced the function for simplification, and it satisfies
which indicates that decays at a factorial rate when . If , the regime dominates the summation.
Using Stirling’s approximation, one can easily prove
When ,
This splits into two parts: the first part depends only on and the remaining part depends only on . The summation of the second part over yields
Using Stirling’s approximation
this leads to the expression for :
Similarly, the activation function of the quasi neural network has the Taylor expansion
So when is odd, and when is even:
Furthermore, when ,
∎
C.5 Computing covariance matrix
In this part, we prove Theorem 12 by computing and .
Theorem 12 NTK of a binary weight neural network can be decomposed using equation (10). If , then
where and denote polynomials of , and is a constant.
We make use of the results in Section C.4. Recalling that depends on , we make this dependence explicit as . We introduce an auxiliary parameter and denote ; then the decomposition of the kernel (10) can be computed by
First compute . According to Lemma 16 in Bietti and Mairal [2019],
Recall that
Because , follows a Chi-square distribution, and its moment generating function is
Its -th order derivative is
Let , we get
so
when is odd, and 0 when is even.
Similarly,
when is even, and 0 when is odd.
C.6 Gaussian kernel
This indicates that this kernel can be decomposed using spherical harmonics (10), and when , the coefficient
Note that is always smaller than 1 so is always decreasing with .
Appendix D Additional information about numerical result
D.1 Toy dataset
In the neural network (NN) experiment, we used three layers with the first layer fixed. The number of hidden neurons is 512. In the binary weight neural network (BWNN) experiment, the setup is the same as NN except that the second layer is binary. We used the BinaryConnect method with stochastic rounding. We used gradient descent with the learning rate searched from . For the Laplacian kernel and the Gaussian kernel, we searched the kernel bandwidth from to by powers of 2, where is the median of pairwise distances. The SVM cost parameter is searched from to by powers of 2.
More results are listed in Table 2. Accuracies are shown in the format mean ± std. P90 and P95 denote the percentage of datasets on which a model achieves at least 90% and 95% of the highest accuracy, respectively.
| Classifier | Training Accuracy | Training P90 | Training P95 | Testing Accuracy | Testing P90 | Testing P95 |
|---|---|---|---|---|---|---|
| NN | 96.19 ± 8.03% | 96.67% | 91.11% | 77.62 ± 16.10% | 73.33% | 56.67% |
| BWNN | 93.55 ± 10.39% | 84.44% | 76.67% | 77.83 ± 16.57% | 77.78% | 54.44% |
| Laplacian | 93.52 ± 9.65% | 85.56% | 76.67% | 81.62 ± 14.72% | 97.78% | 91.11% |
| Gaussian | 91.08 ± 10.63% | 76.67% | 58.89% | 81.40 ± 14.85% | 95.56% | 87.78% |
D.2 MNIST-like dataset
Similar to the toy dataset experiment, we used a three-layer neural network with the first layer fixed, and only quantized the second layer. The number of neurons in the hidden layer is 2048. The batch size is 100 and the Adam optimizer with learning rate is used.