Optimal Quantization for Batch Normalization in Neural Network Deployments and Beyond
Abstract
Quantized Neural Networks (QNNs) use low bit-width fixed-point numbers to represent weight parameters and activations, and are widely used in real-world applications because they save computation resources and yield reproducible results. Batch Normalization (BN) poses a challenge for QNNs because it requires floating-point arithmetic for reciprocal operations, and previous QNNs either compute BN at high precision or replace it with heuristic variants. In this work, we propose a novel method to quantize BN by converting an affine transformation of two floating-point numbers into a fixed-point operation with a shared quantized scale, which is friendly to hardware acceleration and model deployment. We confirm that our method preserves the same outputs through rigorous theoretical and numerical analysis. The accuracy and efficiency of our quantization method are verified by layer-level experiments on the CIFAR and ImageNet datasets. We also believe that our method is potentially useful in other problems involving quantization.
1 Introduction
Deep neural networks have achieved great success in many applications, such as computer vision [20], speech recognition [10] and natural language processing [1]. In computer vision, the most widely used architecture, the Convolutional Neural Network (CNN), has demonstrated state-of-the-art results in many tasks. However, CNN-based recognition systems need large amounts of memory and computational power, and training may take weeks on a modern multi-GPU server for large datasets such as ImageNet [5]. Hence, these models are often unsuitable for smaller devices like embedded electronics, and there is a pressing demand for techniques that reduce model size, speed up inference and lower power consumption.
Accordingly, a large body of literature has focused on reducing model size through quantization [22, 4, 21], low-rank matrix factorization [6, 16], architecture pruning [7, 8], etc. Quantization is one of the simpler ways to reduce model complexity: it lowers the precision requirements for weights and activations and speeds up computation.
Methods for quantizing gradients, weights and activations [22, 4, 13, 33, 24, 2] have achieved performance much closer to that of full-precision networks. However, even after efficient quantized neural network (QNN) training, floating-point operators remain in model inference, particularly due to the float-type numbers in Batch Normalization (BN).
Since deep networks are hard to train, the BN operator [15] is widely used. Implementing the conventional BN requires considerable computation of the mean and variance, involving sums of squares, square-root and reciprocal operations. These operations require float-type numbers for high precision. Previous attempts to use low-precision networks either avoid BN layers [30] or keep them in full precision [33]. In addition, Hubara et al. [13] proposed shift-based BN, replacing almost all multiplications with power-of-2 approximations and shift operations, and Banner et al. [2] devised Range BN, which normalizes the variance according to the range of the input distribution. These modifications of BN reduce computational cost but are heuristic and confirmed only by experiments. Moreover, modern hardware has lower deployment cost with fixed-point arithmetic, and some hardware only supports fixed-point operations [14]. Hence, previous QNN training methods with float-type BN layers cannot be deployed on purely fixed-point hardware.
In this paper we develop a direct way to handle the floating-point numbers in BN after obtaining a benchmark QNN with existing training methods. We view the conventional BN operator, combined with the quantization operators used in the feedforward step, as an affine transformation with only two floating-point numbers. To eliminate the floating-point numbers in BN, we start by considering an exact substitution of these two floating-point numbers with as few fixed-point numbers as possible. We propose to quantize the floating-point numbers with a shared quantized scale, and mathematically show that all floating-point operations can be converted to fixed-point ones in an idealized case. We also derive lower and upper bounds on the least possible quantized scale in our scheme. Given the proposed quantized scale, we describe a few search procedures to decide the remaining integers. Our method, intended for model deployment, also preserves the precision of QNNs in experiments and can be seen as a complement to modern QNN training.
Our main contributions are summarized as follows:
- We propose a new method for quantizing BN in model deployment by converting the affine transformation with two floating-point numbers into a fixed-point operation with a shared quantized scale.
- We give theoretical guarantees for our quantization method, including the existence of the quantized scale and the magnitude of the optimal quantized bit-width. In addition, we exactly search the least possible quantized scale for each quantized bit-width to support our transformation numerically.
- We evaluate our method on the CIFAR and ImageNet datasets with benchmark quantized neural networks, showing little performance degradation, if any.
The remainder of our paper is organized as follows. In Section 2, we formulate the problem and present the motivation for our transformation. In Section 3, we establish an equivalent form of the problem in Section 2 and derive upper and lower bounds of satisfying solutions through several properties. We then give a direct search algorithm in Section 4 for finding the accurate value of the solution and verifying the results of Section 3. We also briefly discuss the precision loss of practical hardware implementations and validate the method numerically on common quantized neural networks in Section 5. Finally, we conclude in Section 6.
2 Problem Formulation
Quantization mainly focuses on compressing parameters in neural networks, such as weights in convolutional and fully-connected layers, activation outputs of hidden layers, and parameters in BN. A common design for QNNs is uniform quantization [32, 18], which distributes the quantized values evenly among the possible values, that is,
where the quantized scale, a positive integer, measures the smallest gap between quantized values, and the floor function rounds down onto the quantization grid. The input is restricted to a predefined interval due to limited computing power. In particular, for k-bit uniform quantization, the scale is chosen for friendly hardware support, and the input interval is chosen differently for non-symmetric and symmetric quantizers.
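Since the inline formula was lost in extraction, the following is only a minimal sketch of a standard k-bit uniform quantizer of the kind described above (the exact formula and rounding convention in the paper are assumptions; the text states a floor-based quantizer):

```python
import numpy as np

def uniform_quantize(x, k):
    """Standard k-bit uniform quantizer sketch: clip x into [0, 1] and map it
    onto the grid {0, 1/t, ..., 1} with quantized scale t = 2**k - 1.
    The floor-based convention follows the description in the text; the exact
    formula in the paper was not recoverable."""
    t = (1 << k) - 1                  # quantized scale for k-bit quantization
    x = np.clip(x, 0.0, 1.0)          # restrict input to the predefined interval
    return np.floor(x * t) / t        # evenly spaced quantized values
```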
The weights to be uniformly quantized may first be mapped into the required input range through a normalization operator [13]
or through a mapping based on the hyperbolic tangent function [33]
Here the maximum is taken over all weights in the same layer.
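As an illustration of the tanh-based mapping cited above, here is a minimal sketch following the well-known DoReFa-Net formulation (written from the cited paper's formulation, not recovered from this paper's dropped equation):

```python
import numpy as np

def normalize_weights_tanh(w):
    """Map weights into [0, 1] before uniform quantization, following the
    tanh-based scheme of DoReFa-Net [33]: tanh squashes the weights, and the
    maximum absolute value (taken over all weights in the same layer)
    rescales them. Sketch of the cited formulation."""
    t = np.tanh(w)
    return t / (2.0 * np.max(np.abs(t))) + 0.5
```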
Activations are more difficult to quantize than weights because quantizer functions make the optimization nondifferentiable [3, 29]. Several works [21, 24, 33] have addressed this problem by allocating bit-widths across layers or by using a continuous approximation of the quantizer function in the backpropagation step.
A common quantization approach for activations [23, 13, 33] quantizes the output of the previous layer after clipping it into a limited interval, that is,
where the clip bounds are usually integers and the quantizer function is the one defined earlier. For the commonly used ReLU function, the upper clip bound is typically 1, or 6 (ReLU6) [12, 26], or some other specific positive integer.
Since weights and activations take the form of an integer numerator over a constant denominator determined by the quantization bits, the feed-forward computation can be executed entirely with fixed-point operations, ignoring the fixed denominator, which enables hardware acceleration.
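To make this observation concrete, here is a small sketch (names and shapes are illustrative) of how a dot product between quantized weights and activations can be carried out purely in integers, deferring the constant denominator:

```python
import numpy as np

def fixed_point_dot(w_int, x_int, t_w, t_a):
    """If weights are w_int / t_w and activations are x_int / t_a with integer
    numerators and constant denominators, the dot product can be accumulated
    entirely in integers; the fixed denominator t_w * t_a is applied once at
    the end (or absorbed into the next operator, as the paper does with BN)."""
    acc = int(np.dot(w_int.astype(np.int64), x_int.astype(np.int64)))
    return acc, t_w * t_a   # the real-valued result equals acc / (t_w * t_a)
```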
Benchmark models employ BN for deep neural network training. The conventional BN operator consists of two affine transformations with four parameters:
where the mean and standard deviation estimators of the layer output are updated according to training batch statistics. Besides the conventional BN, several variants use a similar transformation with a different way of computing the statistics, such as the norm-based BN of [11], Range BN [2], which uses the maximum gap within the features, and Group Normalization [31] and Instance Normalization [28], which compute the statistics over different parts of the features. In any case, all parameters are fixed during inference and deployment.
We emphasize that the floating-point numbers come from the division operation in BN and its variants: the running mean of the batch features and the affine parameters can be kept quantized (the latter are updated with quantized gradients), but the reciprocal of the standard deviation cannot.
Previous state-of-the-art networks such as ResNet [9] and VGG [27] employ BN after a convolutional or fully-connected layer, followed by an activation function. Let ℤ denote the set of integers. Suppose we use fixed scales to quantize activations and weights. Then, from the quantizer functions, the outputs of previous layers and the quantized weights are integer numerators over fixed denominators. Therefore, the outputs of a convolutional or fully-connected layer also take the form of an integer numerator over a fixed denominator, where the numerator is determined by the computation of the layer and the bias term, if it exists, may also be quantized. We record this output as an integer over a fixed denominator, and we do not distinguish the specific positions of parameters when there is no ambiguity.
After the quantized convolutional or fully-connected layer come BN and the quantized activation, so an element of the next layer's output is
(1)
Here we exchange the order of the multiplication and the rounding operation, which is valid because the involved factors are integers, giving the second equality. We also simplify the two-affine operator of BN into a one-affine operator and rename some terms as follows:
Generally, floating-point operations cause a loss of precision, so this modification may introduce some error due to limited computation precision. We briefly examine this in the experiments of a later section.
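A sketch of the folding just described, under the assumption that the layer output has the form X / denom with integer X, that inference-time BN is a single affine map with slope gamma/sigma, and that the following activation quantizer has t_out levels; the exact grouping of constants into the two floats of Eq. (1) could not be recovered from the extracted text:

```python
def bn_to_affine(gamma, beta, mu, sigma, denom, t_out, eps=1e-5):
    """Fold inference-time BN into one affine map and compose it with the
    integer-numerator input x = X / denom and a t_out-level quantizer, so the
    quantized output numerator becomes clip(floor(a * X + b), 0, T) with the
    two floats a, b below (an assumed reconstruction of Eq. (1))."""
    g = gamma / (sigma + eps)      # BN scale; sigma assumed to be the std estimator
    h = beta - g * mu              # BN shift after folding the two affine steps
    a = t_out * g / denom          # float slope applied to the integer accumulator X
    b = t_out * h                  # float intercept
    return a, b
```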
Without loss of generality, we assume the clip range comes from the ReLU function; other cases can be obtained by shifting the clip range accordingly.
A natural attempt is to make all operators involve only fixed-point numbers, which means quantizing the two floating-point numbers to integers. A simple way is to round the two floating-point numbers to integers directly, which hardly works because the input variable may vary over a wide range. A weaker requirement is to quantize them in the same way as the quantizer functions above do, namely to approximate each of them by a rational number:
However, this would introduce two more denominators, and hardware support could not be fixed in advance before either of them is decided. Another option is to let the surrogates of the two floating-point numbers share the same quantized scale, that is, the same prefixed denominator for all potential floating-point values, which is consistent with the previous scheme, because
where we set the integer numerators and the shared denominator accordingly. Then both surrogates share the same prefixed quantized scale.
Thus we want to replace the floating-point numbers with integers over a shared scale for hardware-friendly execution, where the shared scale should not change as the floating-point numbers vary, so as to support a fixed hardware design.
Problem 1
Given the clip bound, the problem is to find some quantized scale (as small as possible) such that, for every pair of floating-point parameters, there exist integers such that, for all integer inputs,
(2)
In particular, we can also examine whether the floating-point numbers can be converted directly to integers when the candidate scale is already satisfactory.
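The following sketch spells out the condition in Eq. (2) as we read it from the surrounding text (floor, clip to [0, T], integer inputs x, integers c, d and shared scale p); it checks the equality on a finite set of inputs only:

```python
import math

def check_equivalence(a, b, c, d, p, T, xs):
    """Finite check of the assumed form of Eq. (2):
        clip(floor(a*x + b), 0, T) == clip(floor((c*x + d) / p), 0, T)
    for every integer x in xs. The paper's analysis covers all integers;
    this is only a numerical spot check."""
    clip = lambda v: max(0, min(T, v))
    return all(
        clip(math.floor(a * x + b)) == clip((c * x + d) // p)  # // is an exact integer floor
        for x in xs
    )
```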
Having obtained the quantized scale, we also need to decide the remaining quantized integers based on it, which is the second problem we need to tackle.
Problem 2
Given the floating-point pair and a suitable scale that solves Problem 1, the problem is to find the integers such that, for all integer inputs,
In particular, in the conventional k-bit quantization setting, we would set the clip bound accordingly for some small k and report the least number of bits of a satisfying scale.
3 Theoretical Analysis
In this section, we analyze Problem 1 and Problem 2 through several propositions, while leaving details to Appendix B.
3.1 Analysis of Problem 1
For Problem 1, a direct way to deal with the arbitrariness of the parameters is to specify all clipped intervals as the input varies.
Proposition 1
Problem 1 is equivalent to finding a scale such that, for every parameter pair, we can obtain integers s.t.
(3)
Here the ceiling function rounds up to the nearest integer. We first analyze Eq. (3) without the extra constraint and defer that constraint to Appendix E, as it has little influence in practice.
According to Proposition 1, we only need to analyze the associated breakpoint sequence (we rename variables for notational simplicity). It seems hard to give an analytic formula for the least satisfying scale, but we only need a rough understanding of how small it can be for future hardware support. Let the minimal scale that satisfies all sequences of the given length be our quantity of interest. First, we need to understand roughly whether our transformation is reasonable at all. The following proposition guarantees the existence of this minimal scale and gives its magnitude with respect to the clip bound.
Proposition 2
, exists. Moreover, if , then as long as , we can find satisfying the requirements. Hence, .
The key insight behind the existence result is that when the scale is large enough, there are more choices to search over empirically, since Eq. (3) imposes only finitely many constraints. The proof technique for the above proposition is simply to eliminate variables using the conditions for all possible sequences and to apply the discreteness of the integers. The proof is given in Appendix B.2.
As for the lower bound, we can take some specific sequences and see how large the scale must be. The idea behind choosing these sequences is to make the maximum gap between consecutive items small, so that the existence of a valid scale can be checked quickly.
Proposition 3
If , we have . Furthermore, if , then .
Remark 1
Remark 2
Besides, some special input ranges clearly decrease the required magnitude of the scale, such as the case shown in Appendix G.2.
3.2 Analysis of Problem 2
Intuitively, given and leaving out floor function and clip function, we have
Hence, we can take the integers obtained by scaling the two floating-point numbers by the shared scale and rounding, i.e., their floors or ceilings. However, after a subsequent search over all possibilities, we find that the nearest integer is not always a valid choice (see Proposition 8 and Appendix B.8). The gap between a suitable choice and the intuitive approximation depends on the specific parameters; nevertheless, the intuitive choice is correct most of the time.
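A one-line sketch of the intuitive choice discussed above (rounding the two floats times the shared scale); per Proposition 8 this is usually, but not always, a valid pair, so a local search around it may still be needed:

```python
def intuitive_cd(a, b, p):
    """Intuitive candidates for the integers in Problem 2: round a*p and b*p.
    Proposition 8 shows the nearest integer is not always correct, so treat
    these as starting points for the conservative search of Proposition 4."""
    return round(a * p), round(b * p)
```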
To determine the integers given a suitable scale, based on the previous understanding, we offer a conservative way to obtain all appropriate choices, as stated in Proposition 4 and Algorithm 5 (see Appendix F).
Proposition 4
Given the parameters and a proper scale, a pair of integers satisfies Problem 2 if and only if there exists a value such that the following conditions are met:
(4)
where .
Moreover, Proposition 8 shows that the intuitive choice is only partially correct, and Proposition 9 gives a loose bound on the possible ranges of the integers (see Appendix A). In addition, for a general quantizer range instead of the one in Problem 1, a similar analysis applies, as shown in Appendix G.1.
Table 1: Least quantized scale for common activation bit-widths.

| Clip bound (bit-width) | Least scale | Clip bound (bit-width) | Least scale |
|---|---|---|---|
| 3 (2-bit) | 2 | 7 (3-bit) | 9 |
| 15 (4-bit) | 51 | 31 (5-bit) | 289 |
| 63 (6-bit) | 1459 | 127 (7-bit) | 6499 |
| 255 (8-bit) | 28323 | | |
4 Numerical Computation
The previous section treated the problem theoretically. In this section, we present how to compute the accurate least scale to evaluate the theoretical analysis.
For convenience, by Proposition 5 it suffices to consider a restricted range of the parameters. Then , , , for .
Proposition 5
Considering and is enough for finding .
As a safe, naive approach, we save all possible sequences and then check, for every candidate scale, whether corresponding integers exist. We generate the sequences recursively based on the following result and the scope of the parameters.
Proposition 6
Suppose is a satisfied sequence of length . Then is a satisfied sequence of length . In addition, if is a satisfied sequence of length , there must be a sequence of length and with the same previous terms.
Now we turn to finding the least scale exactly. On account of Proposition 3, we start the search from the lower bound when the clip bound is large. We use brute force and follow the proposition below to search for a scale that satisfies all possible sequences.
Proposition 7
Given a sequence , if
(5)
then there exist parameters that can generate the sequence above.
The whole process is shown below:
1. Produce candidate sequences. Note that from Proposition 6 we can obtain all of them recursively.
2. Record all satisfied sequences. Given a candidate sequence, we use Proposition 7 to confirm that there exist parameters which can reproduce it.
3. Based on the obtained sequences, check whether the given scale is satisfactory through Proposition 4.
Due to the large number of possible sequences, we suggest searching over a small part of all sequences at a time and sliding this window successively until all sequences are covered. The main pseudo-algorithm and the auxiliary pseudo-algorithms for searching the exact least scale are shown in Appendix F.
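For illustration, a brute-force sketch of the overall search (not the exact breakpoint-sequence enumeration of Propositions 6 and 7): it scans scales upward and, for each float pair on a user-supplied grid, looks for integers near the intuitive rounding that reproduce the clipped affine map on a finite input range. Grids, ranges and the search radius are illustrative assumptions.

```python
import math

def find_min_p(T, ab_grid, xs=range(-100, 101), p_max=2000, radius=3):
    """Brute-force illustration of the search in this section: return the
    smallest shared scale p such that every float pair (a, b) in ab_grid
    admits integers (c, d) with
        clip(floor(a*x + b), 0, T) == clip(floor((c*x + d) / p), 0, T)
    for all x in the finite range xs. The paper instead enumerates breakpoint
    sequences (Props. 6-7); this truncated version is an illustration only."""
    clip = lambda v: max(0, min(T, v))

    def exists_cd(a, b, p):
        c0, d0 = round(a * p), round(b * p)   # intuitive candidates (cf. Prop. 8)
        for c in range(c0 - radius, c0 + radius + 1):
            for d in range(d0 - radius, d0 + radius + 1):
                if all(clip(math.floor(a * x + b)) == clip((c * x + d) // p) for x in xs):
                    return True
        return False

    for p in range(1, p_max + 1):
        if all(exists_cd(a, b, p) for a, b in ab_grid):
            return p
    return None
```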
4.1 Experimental Results
Table 1 lists the least scales corresponding to common activation bit-widths; more values are given in Appendix C. To guard against numerical errors, we also re-examine a wide range of inputs with sufficiently large bounds so as to cover every possible input.
We also plot the growth trend of the least scale versus the clip bound in Figure 1. Since saving all sequences is time- and space-consuming, we only search over a limited range. The magnitudes depicted in the figure are consistent with Propositions 2 and 3, showing the roughly quadratic growth of the least scale.
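As a quick sanity check of the trend, the ratios of the searched least scales in Table 1 to the square of the clip bound can be printed directly (data taken from Table 1):

```python
# Least scales from Table 1, keyed by clip bound T, to eyeball the growth trend.
table1 = {3: 2, 7: 9, 15: 51, 31: 289, 63: 1459, 127: 6499, 255: 28323}
for T, p in table1.items():
    print(f"T = {T:3d}   least scale = {p:6d}   scale / T^2 = {p / T**2:.3f}")
```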
In practice, there is no strict need to find the exact minimum: we only need some proper scale that converts the float-type parameters in BN and remains hardware friendly, so the exact minimum is not necessarily the best choice.
5 Experiments
In this section we assess the effect of our BN quantization method on previously trained QNNs on ImageNet [25] (we also report results on CIFAR-10 [19] in Appendix H). Our method is applied to model deployment at the layer level and fits various quantized neural networks using the conventional BN or its variants. Note that we aim at verifying the accuracy variation induced by our hardware-friendly modification, rather than obtaining high accuracy as previous quantization training methods do.
There are two main concerns to verify. First, the derivation in Eq. (1) absorbs the parameters of BN into the neighboring convolutional or fully-connected layer, which may cause precision degradation; in the experimental tables, the BN-absorbed setting refers to absorbing the BN parameters into such layers through Eq. (1) based on the quantized training models. Second, because the original theoretical analysis is based on real numbers rather than floating-point numbers, we should check whether the theoretical results still hold under floating-point arithmetic. Although the numerical experiments in Section 4 have already examined this, we display the final accuracy to support our method again. The fixed-point setting in the tables uses our layerwise fixed-point replacements from Problem 1.
We use DoReFa-Net [33], an efficient QNN training method, to train quantized models for verification, and report the accuracy obtained with DoReFa-Net quantization as the quantized baseline.
DoReFa-Net uses low-bitwidth weights and activations together with low-bitwidth parameter gradients to accelerate both training and inference. To obtain more suitable baselines, we only quantize weights and activations while using full-precision gradients in training, and we do not quantize the final fully-connected layer, for better training performance. We write kWmA for k-bit quantized weights and m-bit quantized activations, and examine 2W4A, 4W4A and 8W8A in the subsequent experimental tables. We choose the quantized scales for 4-bit and 8-bit quantized activations according to the discussion in Appendix D.
5.1 Quantization for ImageNet Classification
The next set of experiments studies our method using VGG16 [27] and ResNet18 [9] on the ImageNet dataset. We adopt data augmentation including random cropping, horizontal flipping, lighting and color jittering, and train with the Adam optimizer using a step-wise learning rate schedule. After obtaining the trained QNN, we convert the final model into the BN-absorbed and fixed-point modes mentioned above. The main results are shown in Table 2.
5.2 Accuracy and Complexity Analysis
From Table 2, the QNN training method obtains comparable performance even with low-bit quantization. Moreover, once a QNN has been trained, we absorb the only remaining floating-point numbers, those in BN, by our conversion. From the experimental results, the BN-absorbed model differs only slightly from the original quantized model, which means the one-affine transformation in Eq. (1) brings only a tiny disturbance. We also observe that our substitution from Problem 1 yields identical results for the BN-absorbed and fixed-point models. In principle, the two could differ when operations exceed the computer precision. We prefer the fixed-point version because of its entirely fixed-point implementation and, at most, slight performance degradation.
Additionally, practical applications of QNNs mainly use few bits (up to 8 bits, to our knowledge) for quantization. From Proposition 4 (or Proposition 9 in Appendix A), a single conversion in the worst case requires a bounded search, but time can be saved by searching around the intuitive rounding, and the model only needs to be converted once with few BN parameters.
Table 2: Top-1 / Top-5 accuracy (%) on ImageNet.

| Model | Setting | 2W4A Top-1 | 2W4A Top-5 | 4W4A Top-1 | 4W4A Top-5 | 8W8A Top-1 | 8W8A Top-5 |
|---|---|---|---|---|---|---|---|
| VGG16 | quantized baseline | 69.46 | 88.84 | 70.62 | 89.46 | 70.83 | 89.59 |
| VGG16 | BN absorbed | 69.43 | 88.88 | 70.52 | 89.45 | 70.81 | 89.58 |
| VGG16 | fixed-point replacement | 69.43 | 88.88 | 70.52 | 89.45 | 70.81 | 89.58 |
| ResNet18 | quantized baseline | 65.94 | 86.54 | 68.15 | 88.09 | 68.67 | 88.18 |
| ResNet18 | BN absorbed | 65.99 | 86.55 | 68.15 | 88.11 | 68.66 | 88.17 |
| ResNet18 | fixed-point replacement | 65.99 | 86.55 | 68.15 | 88.11 | 68.66 | 88.17 |
6 Conclusion
In this paper we have investigated the problem of quantizing the floating-point numbers in BN, combined with quantized convolutional or fully-connected weights and activations. Noting that the conventional BN consists of two affine transformations, we have collapsed all the floating-point numbers into a single affine operator with only two floating-point values. We have accordingly proposed to quantize each floating-point value in the converted one-affine operator with a shared quantized scale. We have shown that the bit-width of the required quantized scale is roughly twice the activation bit-width in theory, and we have given numerical schemes for computing the corresponding substitution. Our approach matches the inference results obtained at high precision. The recommended strategy enables efficient model deployment with entirely fixed-point operators.
It is worth emphasizing that our quantization scheme is suitable for other affine operators, which are also common in deep NNs. Beyond BN and the NN context, we believe that our quantization scheme has potential applications in other problems that involve quantization.
Broader Impact
a) & b) If the proposed quantization method proves useful in numerous real applications, it will have a positive impact on model compression, and future QNN designs may pay more attention to entirely fixed-point networks with BN. c) The proposed method accounts for all possible outputs and shows convincing results in theory, though it may fail when the fixed-point range of the hardware does not support the converted values, which we seldom see in practice. Besides, floating-point operations are likely to introduce precision error, which the first conversion step of our method may encounter. d) Our method is data independent.
References
- Bahdanau et al. [2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
- Banner et al. [2018] Ron Banner, Itay Hubara, Elad Hoffer, and Daniel Soudry. Scalable methods for 8-bit training of neural networks. In Advances in Neural Information Processing Systems, pages 5145–5153, 2018.
- Cai et al. [2017] Zhaowei Cai, Xiaodong He, Jian Sun, and Nuno Vasconcelos. Deep learning with low precision by half-wave gaussian quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5918–5926, 2017.
- Courbariaux et al. [2015] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pages 3123–3131, 2015.
- Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
- Denton et al. [2014] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in neural information processing systems, pages 1269–1277, 2014.
- Han et al. [2015a] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015a.
- Han et al. [2015b] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015b.
- He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- Hinton et al. [2012] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Brian Kingsbury, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal processing magazine, 29, 2012.
- Hoffer et al. [2018] Elad Hoffer, Ron Banner, Itay Golan, and Daniel Soudry. Norm matters: efficient and accurate normalization schemes in deep networks. In Advances in Neural Information Processing Systems, pages 2160–2170, 2018.
- Howard et al. [2017] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- Hubara et al. [2017] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869–6898, 2017.
- Ignatov et al. [2018] Andrey Ignatov, Radu Timofte, William Chou, Ke Wang, Max Wu, Tim Hartley, and Luc Van Gool. Ai benchmark: Running deep neural networks on android smartphones. In Proceedings of the European Conference on Computer Vision (ECCV), pages 0–0, 2018.
- Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- Jaderberg et al. [2014] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
- Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Krishnamoorthi [2018] Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018.
- Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
- Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- Lin et al. [2016] Darryl Lin, Sachin Talathi, and Sreekanth Annapureddy. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pages 2849–2858, 2016.
- Lin et al. [2015] Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. Neural networks with few multiplications. arXiv preprint arXiv:1510.03009, 2015.
- Mishra et al. [2017] Asit Mishra, Eriko Nurvitadhi, Jeffrey J Cook, and Debbie Marr. Wrpn: wide reduced-precision networks. arXiv preprint arXiv:1709.01134, 2017.
- Rastegari et al. [2016] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
- Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
- Sandler et al. [2018] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018.
- Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Ulyanov et al. [2016] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
- Vanhoucke et al. [2011] Vincent Vanhoucke, Andrew Senior, and Mark Z Mao. Improving the speed of neural networks on cpus. 2011.
- Wu et al. [2018] Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi. Training and inference with integers in deep neural networks. arXiv preprint arXiv:1802.04680, 2018.
- Wu and He [2018] Yuxin Wu and Kaiming He. Group normalization. In The European Conference on Computer Vision (ECCV), September 2018.
- Zhou et al. [2017] Shu-Chang Zhou, Yu-Zhi Wang, He Wen, Qin-Yao He, and Yu-Heng Zou. Balanced quantization: An effective and efficient approach to quantized neural networks. Journal of Computer Science and Technology, 32(4):667–682, 2017.
- Zhou et al. [2016] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
Appendix A Additional Propositions
Proposition 8
Proposition 9
Appendix B Proof of Propositions
The following proofs use the properties of the ceiling function stated below.
Lemma 1 (The properties of ceiling function)
B.1 Proof of Proposition 1
Proof: Without loss of generality, we may assume , otherwise we can reverse . Then we can see as , and must have the same sign, so while we leave the case in Appendix E.
Then, according to the properties of the floor function, the original problem is equivalent to the following situations:
(6)
We can see ,
Hence .
B.2 Proof of Proposition 2
Proof: It is obvious that if , then .
When , we only need to consider the problem in Proposition 1, that is, ,
Set ,
(7)
Then
Pay attention to , so that
which means
Analyzing the cases of and , we obtain
,
(8)
It follows from Lemma 1 in Appendix B that , ,
(9)
, ,
Hence,
Therefore .
We make
(10)
Then always exists because the scope of is larger than one and includes at least one integer.
A naive idea is that when ,
When and , we have that
So if , we are able to find given any . Therefore, exists.
Before the remaining proof, we need lemma below.
Lemma 2
Suppose and , then and
where is the fractional part of .
Now we turn to the left part of Proposition 2.
Using Proposition 5, we only need consider . Let us take more precise analysis from Eq. (10). Suppose
(11)
When and ,
So we only need to check .
1. If , .
2. If , .
Set . Use Proposition 5, then .
- If , then , or , then .
Hence
No solution!
- If , then , or , then .
Hence
No solution!
Therefore,
3. If , .
Set , which lead to
(13)
So we can see
Next we will prove that satisfy the follow cases.
Case1. If or , then
Case2. If , then
From Case1, we may assume . Meanwhile take Eq. (13) into the condition , we get , and
Since , therefore
(14)
Hence
(15)
(16)
- If or , by the previous fact that and Eq. (13), we can get the range of :
We set
then
The last inequality comes from .
– When ,
If ,
Then the range of is less than , and the fractional part of the left endpoint cannot be zero by Lemma 2. So , else , which means , reducing to Case 2.
Then . As , so . Due to , hence
Therefore , but this time
No solution!
Therefore , so , reducing to Case 2.
– Similarly, we can obtain the same results when with
From previous discussion, we obtain
Hence, exists from Eq. (10) when and .
B.3 Proof of Proposition 3
Proof: Consider through special choices of when .
- , , with sequence ,
- , , with sequence ,
- , , with sequence .
The corresponding satisfied is . From Eq. (8), the corresponding results are
Therefore,
(17)
Then
(18)
(19)
(20)
From Eq. (18) and Eq. (20), we have , so . Substituting this into Eq. (18) and Eq. (20), then
Then
If , from
So either or can be satisfied.
So when , then . Hence,
Applying the same idea further, we can obtain that when ,
B.4 Proof of Proposition 4
B.5 Proof of Proposition 5
Proof: For the convenience of analysis, we can assume and as long as we subtract an integer from both sides:
where is the integer satisfies .
, this is obvious by the assumption and .
From Lemma 1, or , so the sequence is non-decreasing and .
B.6 Proof of Proposition 6
Proof:
First, since is a satisfied sequence of length , there exist parameters satisfying the conditions, so is a satisfied sequence of length when we choose the same parameters.
Second, when is a satisfied sequence of length , there exist parameters satisfying the conditions, so is a satisfied sequence of length with the same previous terms.
B.7 Proof of Proposition 7
Note that
Then
So exists as long as
which means that
Therefore, exists if and only if
(27)
B.8 Proof of Proposition 8
Proof: Once we know exist, then from Eq. (9), and Lemma 1, ,
(28)
Since there exist parameters satisfying the requirements, there exists an integer in the interval
From Eq. (28), this quantity lies in the interval, so at least one of its two neighboring integers satisfies Eq. (2).
Similarly, from Eq. (7)
Since
Hence
Since there exist parameters satisfying the requirements, there exists an integer in the interval
(29)
From Eq. (29), this quantity lies in the interval, so at least one of its two neighboring integers satisfies Eq. (2).
However, the resulting pair of nearest integers may still fail to satisfy Problem 2.
For example, take , all possible sequences , while .
B.9 Proof of Proposition 9
Appendix C Searched
In this section, we list more least scales for various clip bounds in Table 3. Since searching for the accurate value is time-consuming, we only compute the large-bit quantized scales for selected values when the clip bound is large.
| Clip bound | Least scale | Clip bound | Least scale | Clip bound | Least scale | Clip bound | Least scale | Clip bound | Least scale |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 14 | 46 | 27 | 221 | 40 | 529 | 53 | 1013 |
| 2 | 1 | 15 | 51 | 28 | 232 | 41 | 545 | 54 | 1036 |
| 3 | 2 | 16 | 67 | 29 | 265 | 42 | 596 | 55 | 1059 |
| 4 | 3 | 17 | 73 | 30 | 277 | 43 | 613 | 56 | 1129 |
| 5 | 5 | 18 | 79 | 31 | 289 | 44 | 667 | 57 | 1153 |
| 6 | 7 | 19 | 99 | 32 | 326 | 45 | 685 | 58 | 1226 |
| 7 | 9 | 20 | 106 | 33 | 339 | 46 | 742 | 59 | 1251 |
| 8 | 11 | 21 | 113 | 34 | 379 | 47 | 761 | 60 | 1327 |
| 9 | 13 | 22 | 137 | 35 | 393 | 48 | 781 | 61 | 1353 |
| 10 | 22 | 23 | 145 | 36 | 407 | 49 | 841 | 62 | 1379 |
| 11 | 25 | 24 | 172 | 37 | 451 | 50 | 862 | 63 | 1459 |
| 12 | 29 | 25 | 181 | 38 | 466 | 51 | 925 | 127 | 6499 |
| 13 | 41 | 26 | 191 | 39 | 513 | 52 | 947 | 255 | 28323 |
Appendix D Satisfied is Not Sequential
In this section, we emphasize that not all scales satisfy Problem 1. We show all satisfying scales for Problem 1 for the chosen clip bound: since, from Proposition 2, all sufficiently large scales satisfy Problem 1, and Appendix C gives the minimum, we list the satisfying scales in Table 4.
For the other chosen clip bound, all sufficiently large scales again satisfy Problem 1, and we also list the satisfying scales in Table 4.
From Table 4, some numbers between the minimum and our upper bound are missing. Moreover, hardware may prefer the scale to be a power of 2, so we recommend choosing a satisfying power-of-2 scale for each activation bit-width (for the lowest bit-width, a small scale is already enough).
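A small sketch of the hardware-oriented selection just recommended: starting from the searched minimum, walk the powers of two upward and keep the first one that a user-supplied verification (e.g. the condition of Proposition 4, or a numerical check such as check_equivalence above) accepts. The helper name and the callback are hypothetical, not from the paper.

```python
def smallest_power_of_two_scale(p_min, is_satisfied, p_cap=1 << 31):
    """Return the smallest power-of-two scale >= p_min that passes the given
    check, or None if the cap is reached. Appendix D notes that satisfying
    scales are not sequential, so each candidate must be checked rather than
    assumed to work."""
    p = 1
    while p < p_min:
        p *= 2
    while p <= p_cap:
        if is_satisfied(p):
            return p
        p *= 2
    return None
```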
Appendix E Constraint of T
From Problem 1 and Proposition 1, a constraint involving the clip bound is required, but we did not consider it in the main body of the paper. In this section, we discuss the case in which it is active.
Based on Eq. (3) when , then , , which means there exists , if we assume . Therefore , that is . Additionally, when ; when , showing that
We propose to use a sign function to quantize in this case.
where and we need to check .
Since the potential scope of this case is particularly limited when the clip bound is large, we seldom encounter this situation; even when it happens, we can solve the problem by replacing the original formulation with a simple binary function.
Appendix F Auxiliary Algorithms
Appendix G Other Cases
G.1 General Quantizer Range
We have already mentioned in Section 2 that other integer clip ranges can be used. Since the previous discussion focuses only on the non-symmetric quantizer, we briefly show in this section how to handle the symmetric quantizer and more general cases.
Problem 3
Given the clip range, the problem is to find some quantized scale (as small as possible) such that, for every pair of floating-point parameters, there exist integers such that, for all integer inputs,
A simple way to solve Problem 3 is shifting the range of clip function.
Hence we can apply the conclusions of Problem 1 with the parameters shifted accordingly, as follows:
However, due to limited numerical precision, our searching method already introduces some error, so we prefer to apply our derivation to Problem 3 directly.
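As an illustration of the shifting reduction mentioned above, here is a sketch that maps a general clip range [lo, hi] back to the standard range [0, T] of Problem 1, valid when lo and hi are integers (so the shift commutes with the floor):

```python
def shift_to_standard(a, b, lo, hi):
    """Reduce clip(floor(a*x + b), lo, hi) to the Problem 1 form: since lo is
    an integer, clip(floor(a*x + b), lo, hi)
        = lo + clip(floor(a*x + (b - lo)), 0, hi - lo),
    so the Problem 1 machinery can be reused with the intercept shifted and
    clip bound T = hi - lo. Sketch under the integer-bound assumption."""
    T = hi - lo
    return a, b - lo, T
```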
Accordingly, we obtain a simple version of Problem 3.
Proposition 10
Problem 3 is equivalent to finding a scale such that, for every parameter pair, we can obtain integers s.t.
(30)
From the previous analysis, Problem 3 and Problem 1 are equivalent, meaning that the searched scale already satisfies Problem 3. Based on Proposition 4, to determine the integers given a suitable scale, we also offer a conservative way to obtain all appropriate choices as follows.
Proposition 11
Given the parameters and a proper scale, a pair of integers satisfies Problem 3 if and only if there exists a value such that the following conditions are met:
(31)
where .
In other words, our proposed method and results extend straightforwardly to a general quantizer range.
G.2 Special Input Range
Before training, we have little knowledge of the final BN parameters and hidden-layer outputs, hence we consider all possible values the input can take. In fact, due to the quantized activation and weight functions in previous layers, the input varies over a finite range, but it is hard to extract effective information because the parameters are combined together, so we still consider all integers as inputs.
Besides, some network structures may lead to a special input range that reduces the required magnitude of the scale. For example, if the input is always even, the problem changes into
Under slight transformation,
Hence we can first apply our searching method to the transformed problem, obtaining a satisfying solution of Problem 3. Finally, we still obtain a valid scale, but a smaller one suffices. For instance, for low-bit quantization we may choose a power-of-2 scale for friendly hardware support; when the input is always even, a smaller power of 2 can be used instead.
Generally, when the input range has a fixed structure with varying offsets, we can screen out scales that already satisfy Problem 3. In conclusion, special structure in the input may allow an even smaller scale than the one we searched.
Appendix H Quantization for CIFAR-10 Classification
We evaluate our method using VGG11 [27] with BN on the CIFAR-10 dataset. We adopt data augmentation including horizontal flipping, random cropping, random rotation and random scaling, and train with the Adam [17] optimizer using a cyclic learning rate schedule and batch size 256. The main results are shown in Table 5 and agree with the results on ImageNet.
Table 5: Accuracy (%) on CIFAR-10.

| Model | Setting | 2W4A | 4W4A | 8W8A |
|---|---|---|---|---|
| VGG11 | quantized baseline | 90.60 | 91.75 | 91.74 |
| VGG11 | BN absorbed | 90.40 | 91.71 | 91.67 |
| VGG11 | fixed-point replacement | 90.40 | 91.71 | 91.67 |