
A Learning Framework for n-bit Quantized Neural Networks toward FPGAs

Jun Chen, Liang Liu, Yong Liu, and Xianfang Zeng are with the Institute of Cyber-Systems and Control, Zhejiang University, Hangzhou, China, 310027 (e-mail: [email protected]). Yong Liu is the corresponding author.
Abstract

The quantized neural network (QNN) is an efficient approach for network compression and can be widely used in the implementation of FPGAs. This paper proposes a novel learning framework for n-bit QNNs, whose weights are constrained to powers of two. To solve the gradient vanishing problem, we propose a reconstructed gradient function for QNNs in the back-propagation algorithm that can directly obtain the real gradient rather than estimating an approximate gradient of the expected loss. We also propose a novel QNN structure named n-BQ-NN, which uses shift operations to replace multiply operations and is more suitable for inference on FPGAs. Furthermore, we design a shift vector processing element (SVPE) array to replace all 16-bit multiplications with SHIFT operations in the convolution operation on FPGAs. We also carry out comparative experiments to evaluate our framework. The experimental results show that the quantized models of ResNet, DenseNet and AlexNet obtained through our learning framework achieve almost the same accuracies as the original full-precision models. Moreover, when our learning framework is used to train our n-BQ-NN from scratch, it achieves state-of-the-art results compared with typical low-precision QNNs. Experiments on the Xilinx ZCU102 platform show that our n-BQ-NN with our SVPE executes inference 2.9 times faster than with the vector processing element (VPE). Since the SHIFT operation in our SVPE array does not consume Digital Signal Processing (DSP) resources on FPGAs, the experiments also show that the SVPE array reduces average energy consumption to 68.7% of the 16-bit VPE array.

Index Terms:
Deep learning, quantized neural network (QNN), deep compression, FPGA

I Introduction

Deep convolutional neural networks (CNNs) have become the dominant Artificial Intelligence (AI) approach for a variety of computer vision tasks such as image classification [1, 2, 3], face recognition [4, 5], semantic segmentation [6, 7] and object detection [8, 9]. The significant accuracy improvement of CNNs comes at the cost of huge computational complexity, resource, and power consumption, as it requires a comprehensive evaluation of all the scopes within the feature maps [10, 11]. For example, the AlexNet model is over 200 MB, and the VGG-16 model is over 500 MB [10]. To cope with such overwhelming resource and computation pressure, hardware accelerators such as GPUs, FPGAs, and ASICs have been applied to accelerate CNNs. Among these accelerators, FPGAs have emerged as one of the popular solutions when considering both reprogrammability and energy efficiency.

Implementing full-precision CNNs on FPGAs is not an efficient practice due to limited resources and bandwidth. A QNN is therefore a good choice for FPGA implementation, since it simultaneously balances computational efficiency, resources, and classification accuracy in inference. In general, QNNs can be obtained in two ways: 1) An estimator is used to estimate the gradient of the expected loss to solve the problem of gradient vanishing, so that QNNs can be trained from scratch with the help of this estimator. 2) Fine-tuning on a pretrained full-precision model yields QNNs while bypassing the problem of gradient vanishing. Although the first method estimates a gradient, which makes it possible to train QNNs from scratch, the gradient of the expected loss obtained by estimators contains a noise source compared to the real gradient, which causes a gap in classification accuracy between QNNs and full-precision CNNs. The second method fine-tunes QNNs on a pretrained full-precision model, which addresses classification accuracy better, but a challenging factor is that the structure of the QNN is limited by the original structure of the pretrained CNN model and cannot be flexibly adjusted. Due to the constraints of computational resources and computational efficiency on FPGAs, it is inevitable to adjust the network structure for the hardware environment. In order to transform different CNNs into QNNs that can run efficiently on FPGAs, a general learning framework that solves the above two challenges is essential [12, 13, 14, 15].

In this paper, we propose a novel learning framework for n-bit QNNs, whose weights are constrained to powers of two (\pm 2^{-0},\pm 2^{-1},\cdots,0). We introduce a reconstructed gradient function into the back-propagation algorithm that can directly obtain the real gradient, rather than the estimated gradient given by estimators. Thus the QNNs trained by our framework are more accurate. At the same time, QNNs whose structure has been adjusted can continue to be fine-tuned with our framework. The learning framework is applied to train our proposed n-BQ-NN, which is suitable for efficient implementation on FPGAs. We also evaluate the effectiveness of our approach on state-of-the-art networks such as ResNet [16], DenseNet [17] and AlexNet [1]. The main contributions of this article are summarized as follows:

  1. We propose a novel learning framework for n-bit QNNs. In this framework, we propose a reconstructed gradient function in the back-propagation algorithm, which overcomes the gradient vanishing problem when training QNNs and calculates the accurate gradient compared with estimator-based approaches. We achieve state-of-the-art results compared with typical low-precision QNNs.

  2. We propose a highly efficient QNN structure called n-BQ-NN for FPGAs. Our proposed architecture, which consists entirely of convolutional layers and implements a uniform convolution kernel, can maximize resource utilization and improve parallel computational efficiency on FPGAs while preserving the accuracy of QNNs.

  3. We propose a novel shift vector processing element (SVPE) array for FPGAs, which replaces multiplication with the SHIFT operation when calculating convolutions on FPGAs. The computational efficiency of our SVPE array is 2.9 times higher than that of the VPE array for the same network structure on FPGAs.

The rest of this paper is organized as follows: Section II summarizes related prior work on QNNs and FPGAs. Our learning framework is presented in Section III. In Section IV, we demonstrate the effectiveness of our learning framework via comparative experiments. We theoretically analyse and practically test the computational efficiency of our n-BQ-NN using our quantization method in Section V. The conclusion is given in Section VI.

II Related work

II-A Learning for QNNs

Since the model capacity is too large, it is necessary to reduce it in order to run CNNs on FPGAs, which is consistent with the purpose of deep compression. In general, deep compression can be divided into three categories, i.e., pruning, Huffman coding, and quantization. The pruning method simplifies the deep neural network by cutting off network connections with small weights in a normally trained network [18] [19] [20]. The Huffman coding method is an optimal code used for lossless data compression [21], which uses entropy to encode source symbols by variable-length codewords. Han et al. [20] show that 20% - 30% of the network storage can be saved after Huffman coding the non-uniformly distributed values. When it comes to running compressed networks on FPGAs, the network after pruning has an asymmetric structure, which is unsuitable for hardware implementation, and Huffman coding can only be regarded as a post-compression step combined with the other two compression methods, so most hardware accelerators focus on the quantization method.

Quantization-based methods normally employ low-precision weights, varying from 1 bit to 5 bits, to represent CNNs [15, 22, 23, 24, 25, 26, 27, 28]. Some studies train QNNs from scratch by estimating the gradient of the expected loss based on the Straight-Through Estimator [15, 23, 22, 27]. For example, Courbariaux et al. [15] train a classification neural network from scratch with 1-bit weights and activations, which can run 7 times faster than the CNN. Choi et al. [23] propose a neural quantization scheme called Parameter Clipping Activation, which uses a parameter to find the optimal quantization scale for arbitrary bit-width activations. Choi et al. [22] introduce a novel technique called Statistics-Aware Weight Binning, which finds the optimal scaling factor based on the statistical characteristics of the weight distribution to minimize the quantization error. The QNNs trained by the above quantization methods only accelerate inference; Zhou et al. [27] propose DoReFa-Net, which can accelerate both training and inference with low bit-width weights, activations and gradients respectively. However, these estimator-based methods introduce noise compared to the real gradient. Thus these QNNs cannot achieve an ideal classification accuracy, especially on multi-class datasets such as CIFAR-100.

Some other quantization methods are dedicated to designing special strategies to fine-tune QNNs, which do not rely on the back-propagation algorithm and can bypass the problem of gradient vanishing [29, 26, 28]. They can achieve much better accuracy as they are independent of estimators. For example, Park et al. [29] propose a precision highway that provides an end-to-end high-precision information flow for ultra-low-precision computation. This linear weight quantization method is based on the assumption that the weight distribution is the Laplace distribution. Recently, Zhou et al. [26] propose an incremental network quantization method, which converts a pretrained full-precision CNN model into a low-precision model whose weights are constrained to powers of two or zero. Studies have shown that there is little loss in classification accuracy when using 2-5 bit low-precision weights [27, 26]. However, these quantization methods depend on the pretrained network structure rather than the back-propagation algorithm, which makes it difficult to satisfy the network-structure-optimization requirements imposed by hardware limitations.

II-B CNNs Implemented by FPGAs

Considering inference, CNNs have a highly hierarchical structure of multiple feature maps, which exposes a large amount of parallelism and makes CNNs very suitable for FPGA implementation. This structure builds on the accumulation of a huge number of convolutions, which consumes a huge number of floating-point resources on FPGAs. In addition, the structure of CNNs often contains many convolutional layers, so the convolution module with different parameters needs to be executed iteratively during inference. Frequent data caching and parameter loading are limited by the bandwidth. Therefore, in many studies, the hardware structures of CNNs are designed mainly around the two bottlenecks of floating-point resources and bandwidth [12, 17, 16, 13].

In terms of optimizing floating-point resources, Lu et al. [30] design a fast Winograd algorithm, which decreases the use of floating-point resources on FPGAs and reduces the complexity of convolution dramatically. They also give a formula for estimating computational efficiency, which demonstrates that the fast Winograd algorithm is more efficient than the conventional convolution algorithm due to the use of fewer floating-point resources on FPGAs. Meloni et al. [13] present an accelerator configuration for CNNs that reaches more than 97% DSP resource utilization at 150 MHz operating frequency with 16-bit precision, and they show that floating-point resource utilization is highest when executing 3×3 filters on FPGAs.

Other studies have focused on optimizing the data scheduling structure to reduce the impact of the bandwidth. For example, Sankaradas et al. [14] implement a vector processing element (VPE) array coprocessor, which can accelerate the CNNs by optimizing the cache between distributed off-chip memory banks and on-chip computing elements on FPGAs. Peemen et al. [12] show that their scheduler prefers to use only convolutional layers without fully connected layers on FPGAs, which can maximize the efficiency of on-chip memories by reducing the impact of the bandwidth bottleneck.

The crucial issue with the above methods is that they usually only consider a bottleneck at a single level and fail to coordinate these two constraints to improve the computational efficiency of the hardware accelerators. In this paper, we reduce the impact of both constraints by introducing QNNs onto FPGAs, which provides a new way to deal with the above two bottlenecks. Since the weights in our QNNs are quantized to powers of two, the quantized weights directly reduce the bandwidth required to load the weights. In addition, the use of quantized weights translates multiplications into shifts in the convolution module, which greatly reduces the use of floating-point resources.

III n-BQ-NN

III-A Fundamental Idea of Our n-BQ-NN

Figure 1: For any full-precision weight distribution (indicated by the blue curve) trained by a CNN model, non-uniform sampling can be used to approximate the full-precision distribution, which is represented by the red curve.

The main idea of our n-BQ-NN is based on Fig.1, which shows that the information loss caused by the power-of-two quantization method can be interpreted as the sampling loss caused by non-uniform sampling. In fact, the weights of CNNs with large absolute values dominate the overall classification accuracy of the networks, although these large weights only account for a small fraction of all the weights [11, 31]. For an arbitrary probability density function of the weights in a neural network, denoted as ϕ(x), we use the blue curve in Fig.1 to represent ϕ(x), which satisfies \int_{-1}^{1}\phi(x)dx=1. In this way, we can calculate the sampling loss Φ, represented as the area between the two distributions (red and blue curves) in Fig.1. The sampling loss Φ(n) is given by the following recursive formula, where n is the quantized bit width of the weights.

\left\{\begin{array}{ll}\Phi(1)=1-\phi(2^{-1}) & n=1\\ \Phi(n)=\Phi(n-1)+2^{1-n}\phi(2^{1-n})-2^{1-n}\phi(2^{-n}) & n>1\end{array}\right. (1)

It can be seen from the above formula that the sampling loss always decreases as the quantized bit width of the weights increases, which indicates that the sampling loss is negatively related to the quantized bit width. However, the bit width is limited and needs to be reduced as much as possible in QNNs. Thus, finding the best balance between quantized bit width and sampling loss is the key to balancing the performance, speed, and resources of QNNs. We define \mathcal{L}(n)=2^{1-n}[\phi(2^{1-n})-\phi(2^{-n})] as the variation between two consecutive sampling losses, Φ(n) and Φ(n-1), from Eq.1. Then we can prove that \mathcal{L}(4) approaches zero for our power-of-two quantization method, which is ensured by Theorem 1. Therefore, continuing to increase the quantized bit width n beyond 3 does not help to decrease the sampling loss.

Theorem 1

0<\left|\mathcal{L}(4)\right|<7.8\times 10^{-3}

Proof:

We use the Taylor expansion with Peano remainder to represent the probability density function ϕ(x),

\begin{split}&\phi(2^{-n})=\phi(0)-\ln 2\,\phi^{\prime}(0)n2^{-n}+o(n2^{-n})\\ &\phi(2^{1-n})=\phi(0)-\ln 2\,\phi^{\prime}(0)n2^{1-n}+o(n2^{1-n})\end{split} (2)

Substituting Eq.2 into \mathcal{L}(n), we get

\begin{split}\mathcal{L}(n)&=2^{1-n}[\phi(2^{1-n})-\phi(2^{-n})]\\ &=2^{1-n}[\ln 2\,\phi^{\prime}(0)n2^{-n}-\ln 2\,\phi^{\prime}(0)n2^{1-n}-o(n2^{-n})+o(n2^{1-n})]\\ &\approx-\ln 2\,\phi^{\prime}(0)n2^{1-2n}\end{split} (3)

Since 0<\Phi(n)<1, we deduce that 0<\left|\mathcal{L}(2)\right|<1 and 0<\ln 2\,\phi^{\prime}(0)<\frac{1}{4}. Finally, we get 0<\left|\mathcal{L}(4)\right|<\frac{1}{128} by substituting the range of \ln 2\,\phi^{\prime}(0) into \mathcal{L}(4) according to Eq.3. ∎
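To make the recursion of Eq.1 concrete, the short script below evaluates Φ(n) and the increment ℒ(n) for an example weight density; the truncated Gaussian density and its width are our own illustrative choices, not taken from the paper.

import math

SIGMA = 0.5  # width of the illustrative density; an assumption, not from the paper

def phi(x):
    """Symmetric truncated Gaussian density on [-1, 1], normalized to integrate to 1."""
    c = 1.0 / (SIGMA * math.sqrt(2.0 * math.pi))
    mass = math.erf(1.0 / (SIGMA * math.sqrt(2.0)))  # Gaussian mass inside [-1, 1]
    return c * math.exp(-x * x / (2.0 * SIGMA ** 2)) / mass

def sampling_loss(n_max):
    """Evaluate Phi(n) of Eq. 1 and the increment L(n) for n = 1 .. n_max."""
    phi_n = 1.0 - phi(2.0 ** -1)                     # Phi(1)
    print(f"n=1: Phi(n)={phi_n:.5f}")
    for n in range(2, n_max + 1):
        inc = 2.0 ** (1 - n) * (phi(2.0 ** (1 - n)) - phi(2.0 ** -n))  # L(n)
        phi_n += inc
        print(f"n={n}: L(n)={inc:+.5f}, Phi(n)={phi_n:.5f}")

sampling_loss(6)  # for this density the increments shrink rapidly; |L(4)| is already well below 1e-2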

From the hardware perspective, the resource consumption of the SHIFT operation is much less than that of multiplication, so our intention is to use the SHIFT operation instead of multiplication. Considering that a shift-right operation would make the weights exceed the constraint range of (-1, 1), all SHIFT operations are shift-left, and every quantized weight is chosen from the entries (\pm 2^{-0},\pm 2^{-1},\cdots,\pm 2^{-i},0), where \pm 2^{-i} indicates that its multiplication can be calculated by <<i and 0 indicates that no operation is required. Our n-BQ-NN quantizes the weights to these entries, which are encoded in n bits and suitable for hardware computation. Under such circumstances, the staircase function \operatorname{staircase}(W) can be used to describe our n-bit quantized weights as Eq.4 (typically, n is greater than 1, and \operatorname{staircase}(W) degrades to \operatorname{sign}(W) if n is equal to 1), where W are the full-precision weights.

\operatorname{staircase}(W)=\left\{\begin{array}{lc}2^{-i}\operatorname{sign}(W) & \Delta_{i+1}\leq|W|<\Delta_{i}\\ 0 & |W|<\Delta_{r}\end{array}\right. (4)

Here i is taken from r-1 to 0 in turn, where r=2^{n-1}-1 and \operatorname{sign}(W) is the sign function:

\operatorname{sign}(W)=\left\{\begin{array}{lc}+1 & W\geq 0\\ -1 & W<0\end{array}\right. (5)
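The staircase quantizer of Eq.4 can be sketched in a few lines of numpy. The code below is an illustrative implementation that assumes the thresholds are Δ_i = 2^{-i} (the paper leaves their exact values to the posterior-distribution adjustment of Section III-C), so it should be read as a sketch rather than the released implementation.

import numpy as np

def staircase(w, n_bits=3):
    """Quantize full-precision weights in (-1, 1) to {0, ±2^0, ±2^-1, ..., ±2^-(r-1)}."""
    r = 2 ** (n_bits - 1) - 1                      # number of non-zero magnitude levels
    out = np.zeros_like(w)
    for i in range(r - 1, -1, -1):                 # i runs from r-1 down to 0, as in Eq. 4
        delta_i = 2.0 ** -i                        # assumed upper threshold Delta_i
        delta_ip1 = 2.0 ** -(i + 1)                # assumed lower threshold Delta_{i+1}
        mask = (np.abs(w) >= delta_ip1) & (np.abs(w) < delta_i)
        out[mask] = 2.0 ** -i * np.sign(w[mask])
    # weights with |w| < Delta_r stay at zero
    return out

w = np.array([-0.9, -0.4, -0.05, 0.03, 0.2, 0.7])
print(staircase(w, n_bits=3))                      # -> [-1.  -0.5  0.  0.  0.25  1.]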

III-B Gradient Computation in n-BQ-NN

In order to facilitate the following discussion, we first define some variables: W^{l}_{jk} represents the weight that connects the k-th neuron of the (l-1)-th layer to the j-th neuron of the l-th layer, b^{l}_{j} represents the bias of the j-th neuron of the l-th layer, z^{l}_{j} represents the input of the j-th neuron of the l-th layer (z^{l}_{j}=\sum_{k}W^{l}_{jk}a^{l-1}_{k}+b^{l}_{j}), a^{l}_{j} represents the output of the j-th neuron of the l-th layer (a^{l}_{j}=\theta(z^{l}_{j})), and θ is the activation function.

We also have to introduce an extra quantized weight so that we can train our n-BQ-NN, which is defined as follows,

\hat{W}^{l}_{jk}=\operatorname{staircase}(W^{l}_{jk}) (6)

The cost function over a mini-batch of m samples in our n-BQ-NN is

C=\frac{1}{2m}\sum_{x}\left\|y(x)-a^{L}(x)\right\|^{2} (7)

where x is the input sample, y is the actual classification, a^{L} is the prediction output, and L is the maximum number of layers in the network.

By defining \mathcal{T}_{j}^{l}\equiv\frac{\partial C}{\partial z_{j}^{l}} as the error produced by the j-th neuron of the l-th layer, we can use the back-propagation algorithm to calculate the gradient and update the parameters according to the following three steps (⊙ represents the Hadamard product, i.e., the element-wise product between matrices or vectors).

  - Calculate the error of the last layer of the network.

    \mathcal{T}_{j}^{L}=\frac{\partial C}{\partial z_{j}^{L}}=\frac{\partial C}{\partial a_{j}^{L}}\cdot\frac{\partial a_{j}^{L}}{\partial z_{j}^{L}},\qquad \bm{\mathcal{T}}^{L}=\frac{\partial C}{\partial\textbf{a}^{L}}\odot\frac{\partial\textbf{a}^{L}}{\partial\textbf{z}^{L}}=\nabla_{\textbf{a}}C\odot\theta^{\prime}\left(\textbf{z}^{L}\right) (8)

  - Calculate the error of each layer of the network from back to front.

    \begin{split}\mathcal{T}_{j}^{l}=\frac{\partial C}{\partial z_{j}^{l}}&=\sum_{k}\frac{\partial C}{\partial z_{k}^{l+1}}\cdot\frac{\partial z_{k}^{l+1}}{\partial a_{j}^{l}}\cdot\frac{\partial a_{j}^{l}}{\partial z_{j}^{l}}\\ &=\sum_{k}\mathcal{T}_{k}^{l+1}\cdot\frac{\partial\left(\hat{W}_{kj}^{l+1}a_{j}^{l}+b_{k}^{l+1}\right)}{\partial a_{j}^{l}}\cdot\theta^{\prime}\left(z_{j}^{l}\right)\\ &=\sum_{k}\mathcal{T}_{k}^{l+1}\cdot\hat{W}_{kj}^{l+1}\cdot\theta^{\prime}\left(z_{j}^{l}\right),\qquad \bm{\mathcal{T}}^{l}=\left(\left(\hat{\textbf{W}}^{l+1}\right)^{T}\bm{\mathcal{T}}^{l+1}\right)\odot\theta^{\prime}\left(\textbf{z}^{l}\right)\end{split} (9)

  - Calculate the gradients of the weight and bias respectively.

    g^{b}=\frac{\partial C}{\partial b_{j}^{l}}=\frac{\partial C}{\partial z_{j}^{l}}\cdot\frac{\partial z_{j}^{l}}{\partial b_{j}^{l}}=\mathcal{T}_{j}^{l}\cdot\frac{\partial\left(\hat{W}^{l}_{jk}a_{k}^{l-1}+b_{j}^{l}\right)}{\partial b_{j}^{l}}=\mathcal{T}_{j}^{l} (10)

    \begin{split}g^{W}=\frac{\partial C}{\partial W_{jk}^{l}}&=\frac{\partial C}{\partial z_{j}^{l}}\cdot\frac{\partial z_{j}^{l}}{\partial W_{jk}^{l}}=\mathcal{T}_{j}^{l}\cdot\frac{\partial\left(\hat{W}_{jk}^{l}a_{k}^{l-1}+b_{j}^{l}\right)}{\partial W_{jk}^{l}}\\ &=\mathcal{T}_{j}^{l}\cdot\frac{\partial\left(\hat{W}_{jk}^{l}a_{k}^{l-1}+b_{j}^{l}\right)}{\partial\hat{W}_{jk}^{l}}\cdot\frac{\partial\hat{W}_{jk}^{l}}{\partial W_{jk}^{l}}=\mathcal{T}_{j}^{l}\cdot a^{l-1}_{k}\cdot\frac{\partial\hat{W}_{jk}^{l}}{\partial W_{jk}^{l}}\end{split} (11)
Figure 2: The red dotted line represents the staircase function, and the solid blue line represents our reconstructed function.

In the above derivation of the entire back-propagation, all steps are well-defined except the weight gradient in the last step. Based on Eq.11, the gradient of the weight is calculated as follows,

g^{W}=\mathcal{T}_{j}^{l}\cdot a^{l-1}_{k}\cdot\frac{\partial\hat{W}_{jk}^{l}}{\partial W_{jk}^{l}}=0 (12)

where \frac{\partial\hat{W}_{jk}^{l}}{\partial W_{jk}^{l}} is exactly \operatorname{staircase}^{\prime}(W_{jk}^{l}), the derivative of \operatorname{staircase}(W_{jk}^{l}). This derivative satisfies the conditions of the Dirac delta function δ(x). According to the properties of δ(x), \frac{\partial\hat{W}_{jk}^{l}}{\partial W_{jk}^{l}} can be calculated as follows,

\frac{\partial\hat{W}_{jk}^{l}}{\partial W_{jk}^{l}}=\delta(W_{jk}^{l})=0 (13)

Substituting \frac{\partial\hat{W}_{jk}^{l}}{\partial W_{jk}^{l}}=0 into Eq.12, we find that the model cannot be trained by the back-propagation algorithm due to gradient vanishing.

To resolve the above problem, we reconstruct the quantized weight function as Eq.14 to ensure that the weights can be updated by the back-propagation algorithm, as shown by the solid blue line in Fig.2, where α is an adjustable parameter in the range (0, 1).

\tilde{W}^{l}_{jk}=(1-\alpha)\hat{W}^{l}_{jk}+\alpha W^{l}_{jk} (14)

By substituting Eq.14 into Eq.11, we can recalculate the gradient of the weight, with \frac{\partial\tilde{W}_{jk}^{l}}{\partial W_{jk}^{l}}=\alpha+(1-\alpha)\delta(W^{l}_{jk})=\alpha,

g^{W}=\mathcal{T}_{j}^{l}\cdot a^{l-1}_{k}\cdot\frac{\partial\tilde{W}_{jk}^{l}}{\partial W_{jk}^{l}}=\alpha\mathcal{T}_{j}^{l}\cdot a^{l-1}_{k} (15)
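Using the staircase sketch from Section III-A, the reconstructed weight of Eq.14 and the resulting gradient of Eq.15 amount to only a few lines; the layer shapes, learning rate and SGD step below are illustrative assumptions.

import numpy as np

def forward_weights(w, alpha=0.1, n_bits=3):
    """W_tilde = (1 - alpha) * staircase(W) + alpha * W (Eq. 14), used in the forward pass."""
    return (1.0 - alpha) * staircase(w, n_bits) + alpha * w

def weight_gradient(err, a_prev, alpha=0.1):
    """g_W = alpha * (T^l outer a^{l-1}) (Eq. 15): the usual gradient scaled by alpha."""
    return alpha * np.outer(err, a_prev)

# toy layer with 2 outputs and 3 inputs
w = np.random.uniform(-1, 1, size=(2, 3))
a_prev = np.array([0.3, -0.1, 0.8])
z = forward_weights(w) @ a_prev                 # forward pass uses the reconstructed weights
err = np.array([0.05, -0.02])                   # error T^l arriving from back-propagation
w -= 0.1 * weight_gradient(err, a_prev)         # SGD step applied to the full-precision weights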

At this point, we have reconstructed the quantized weight function as Eq.14 to solve the gradient vanishing problem, but the weights can no longer be quantized to the entries (\pm 2^{-0},\pm 2^{-1},\cdots,0) directly as in Eq.6. However, we can prove that the reconstructed quantized weight function approximates these entries after several iterations, which is ensured by Theorem 2.

Assumption. Since the algorithm is iterated, our problem needs to be discussed within the framework of a series. We regard W^{l}_{jk} as an iterate x_{n}, \tilde{W}^{l}_{jk} is equivalent to x_{n+1}, and the value of a_{j} is chosen from \hat{W}^{l}_{jk}=\operatorname{staircase}(W^{l}_{jk})\in(\pm 2^{-0},\pm 2^{-1},\cdots,\pm 2^{-i},0).

Theorem 2

In the framework of the series, \tilde{W}^{l}_{jk} approaches \hat{W}^{l}_{jk} when the number of iterations n is sufficiently large.

Proof:

Based on Eq.14, the general terms of the series x from 1 to n are written as follows,

\left\{\begin{array}{cc}x_{2}-\alpha x_{1}=(1-\alpha)a_{j} & (1)\\ \vdots & \vdots\\ x_{n}-\alpha x_{n-1}=(1-\alpha)a_{j} & (n-1)\\ x_{n+1}-\alpha x_{n}=(1-\alpha)a_{j} & (n)\end{array}\right. (16)

Computing \alpha^{(n-1)}\times(1)+\cdots+\alpha\times(n-1)+(n), we get the following equation,

x_{n+1}-\alpha^{n}x_{1}=(1-\alpha)a_{j}(1+\alpha+\cdots+\alpha^{n-1})=a_{j}(1-\alpha^{n}) (17)
Figure 3: Overview of our trained quantization procedure.

As the number of iterations increases, x_{n+1} approaches a_{j}. With the guarantee of Theorem 2, the above equation can be rewritten as \tilde{W}^{l}_{jk}=\operatorname{staircase}(W^{l}_{jk}) (namely, x_{n+1}=a_{j}) when the number of iterations is large enough (n\to\infty) and α is in the range (0, 1). In the actual implementation, it is only necessary to iterate for several steps during the training process, after which the network is completely quantized as in Eq.6. ∎

The design of α in our reconstructed quantized weight function takes three aspects into consideration. First, the designed function must satisfy Theorem 2. Second, the ratio 1-α : α between quantized weights and full-precision weights can be used to adjust the information ratio between quantized and full-precision weights in the training process. Third, α is the slope of our reconstructed function, shown as the solid blue line in Fig.2, which can be used to change the gradient descent rate of back-propagation based on Eq.15 during training.
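The geometric convergence claimed by Theorem 2 is easy to check numerically; the script below iterates x_{n+1} = (1-α)a_j + αx_n for an arbitrary starting point and target level (both values are made up for illustration).

alpha, a_j, x = 0.3, 0.25, 0.9          # illustrative choices of alpha, target level, start point
for step in range(1, 21):
    x = (1.0 - alpha) * a_j + alpha * x  # one application of Eq. 14 with the level held fixed
    if step % 5 == 0:
        print(f"step {step:2d}: x = {x:.6f}")   # x approaches a_j = 0.25 geometrically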

III-C Posterior-Distribution Adjustment

For network initialization, the initialization modes MSRA and Xavier [32], which adjust the variance based on the number of inputs, converge more easily than the traditional Gaussian initialization with fixed variance in DNNs. Inspired by this fact, we suspect that adjusting the distribution of quantized weights may make it easier to train our n-BQ-NN. Here, we consider that full-precision networks converge more easily than quantized networks; thus, we prefer to keep the distribution of the quantized weights consistent with that of the full-precision weights. Comparing the probability density function before quantization ϕ(x) (with expectation E(x) and variance Var(x)) and the quantized weight function \operatorname{staircase}(x) (as in Eq.4), we make their expectations and variances equal so that their distributions are consistent, as follows,

\left\{\begin{array}{l}E(x)=\int_{-1}^{1}x\operatorname{staircase}(x)dx\\ Var(x)=\int_{-1}^{1}(x-E(x))^{2}\operatorname{staircase}(x)dx\end{array}\right. (18)

The original full-precision probability density function ϕ(x) and the values of the quantized weight function \operatorname{staircase}(x) are fixed, so we can only adjust the value range (thresholds) of \operatorname{staircase}(x) to meet Eq.18.
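One possible way to realize this adjustment is sketched below: the quantization levels stay fixed at powers of two, and a single threshold-scaling parameter t (so that Δ_i = t·2^{-i}) is searched so that the mean and variance of the quantized weights match those of the full-precision weights. The one-parameter threshold family and the grid search are our assumptions, not the paper's exact procedure.

import numpy as np

def staircase_t(w, t, n_bits=3):
    """Staircase quantizer with thresholds Delta_i = t * 2^{-i}; t = 1 matches our Eq. 4 sketch."""
    r = 2 ** (n_bits - 1) - 1
    out = np.zeros_like(w)
    for i in range(r - 1, -1, -1):
        lo, hi = t * 2.0 ** -(i + 1), t * 2.0 ** -i
        mask = (np.abs(w) >= lo) & (np.abs(w) < hi)
        out[mask] = 2.0 ** -i * np.sign(w[mask])
    out[np.abs(w) >= t] = np.sign(w[np.abs(w) >= t])   # catch-all for the largest bin
    return out

def moment_mismatch(w, t):
    q = staircase_t(w, t)
    return (np.mean(q) - np.mean(w)) ** 2 + (np.var(q) - np.var(w)) ** 2

w = np.random.normal(0.0, 0.2, size=10_000).clip(-0.99, 0.99)   # toy full-precision weights
best_t = min(np.linspace(0.5, 2.0, 151), key=lambda t: moment_mismatch(w, t))
print(best_t, np.var(w), np.var(staircase_t(w, best_t)))        # variances now roughly agree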

III-D The Training Algorithm for n-BQ-NN

In the actual training algorithm for n-BQ-NN, Batch Normalization (BN [31]) is added to our n-BQ-NN because it reduces the overall impact of the weight scale and accelerates training. Thus, in this subsection we derive the back-propagation algorithm for n-BQ-NN with BN and give the training algorithm.

First, we define four variables of BN, where σ represents the variance of all samples in a batch, μ represents the sample mean, and γ, β are the scale and shift coefficients. Due to the existence of BN, the bias term can be ignored, so the input of the neuron is re-expressed as z^{l}_{j}=\sum_{k}W^{l}_{jk}a^{l-1}_{k}, the normalized input of the neuron is \hat{z}_{j}=\gamma\frac{z_{j}-\mu}{\sigma}+\beta, and the output of the neuron is a_{j}=\theta(\hat{z}_{j}). Then, based on the discussion in Section III-B, we calculate the error and the gradient according to the following three steps.

  - Compute the mean and variance of the samples and calculate their gradients.

    \begin{split}\mu=\frac{1}{m}\sum_{j=1}^{m}z_{j},\qquad\sigma^{2}=\frac{1}{m}\sum_{j=1}^{m}(z_{j}-\mu)^{2}\end{split} (19)

    \frac{\partial C}{\partial\sigma^{2}}=\sum_{k}\frac{\partial C}{\partial a_{k}}\frac{\partial a_{k}}{\partial\hat{z}_{k}}\frac{\partial\hat{z}_{k}}{\partial\sigma^{2}}=-\frac{1}{2}\gamma\sigma^{-3}\sum_{k}\frac{\partial C}{\partial a_{k}}\theta^{\prime}\left(z_{k}\right)\left(z_{k}-\mu\right) (20)

    \frac{\partial C}{\partial\mu}=\sum_{k}\frac{\partial C}{\partial a_{k}}\frac{\partial a_{k}}{\partial\hat{z}_{k}}\frac{\partial\hat{z}_{k}}{\partial\mu}+\frac{\partial C}{\partial\sigma^{2}}\frac{\partial\sigma^{2}}{\partial\mu}=-\frac{\gamma}{\sigma}\sum_{k}\frac{\partial C}{\partial a_{k}}\theta^{\prime}\left(z_{k}\right)-\frac{2}{m}\frac{\partial C}{\partial\sigma^{2}}\sum_{k}\left(z_{k}-\mu\right) (21)

  - Calculate the error of the network.

    \begin{split}\mathcal{T}_{j}=\frac{\partial C}{\partial z_{j}}&=\frac{\partial C}{\partial a_{j}}\frac{\partial a_{j}}{\partial\hat{z}_{j}}\frac{\partial\hat{z}_{j}}{\partial z_{j}}+\frac{\partial C}{\partial\sigma^{2}}\frac{\partial\sigma^{2}}{\partial z_{j}}+\frac{\partial C}{\partial\mu}\frac{\partial\mu}{\partial z_{j}}\\ &=\frac{\gamma}{\sigma}\frac{\partial C}{\partial a_{j}}\theta^{\prime}\left(\hat{z}_{j}\right)+\frac{2}{m}\sum_{k}\left(z_{k}-\mu\right)\frac{\partial C}{\partial\sigma^{2}}+\frac{1}{m}\frac{\partial C}{\partial\mu}\end{split} (22)

  - Calculate the gradients of the weight, γ, and β respectively.

    g^{W}=\mathcal{T}_{j}\cdot a_{k}\cdot\frac{\partial\tilde{W}_{jk}}{\partial W_{jk}^{l}}=\alpha\mathcal{T}_{j}\cdot a_{k} (23)

    g^{\gamma}=\sum_{k}\frac{\partial C}{\partial a_{k}}\frac{\partial a_{k}}{\partial\hat{z}_{k}}\frac{\partial\hat{z}_{k}}{\partial\gamma}=\sum_{k}\frac{\partial C}{\partial a_{k}}\theta^{\prime}\left(\hat{z}_{k}\right)\frac{z_{k}-\mu}{\sigma} (24)

    g^{\beta}=\sum_{k}\frac{\partial C}{\partial a_{k}}\frac{\partial a_{k}}{\partial\hat{z}_{k}}\frac{\partial\hat{z}_{k}}{\partial\beta}=\sum_{k}\frac{\partial C}{\partial a_{k}}\theta^{\prime}\left(\hat{z}_{k}\right) (25)

With the foundation of the above formulas, we can state our training algorithm for n-BQ-NN, as given in Algorithm 1. This algorithm covers two learning modes: training from scratch and fine-tuning on a pretrained model, where the first mode means the weights are randomly initialized and the second mode means the weights are initialized from a pretrained full-precision network model. The overall quantization process is illustrated in Fig.3. The code for the training algorithm is available at https://github.com/papcjy/n-BQ-NN.

Require: a minibatch of outputs and targets (a^{L}, y), learning rate η, previous weights W^{k}, previous BN parameters (γ^{k}, β^{k}), and a constant α.
Ensure: the updated weights (W^{k})^{*} and updated BN parameters ((γ^{k})^{*}, (β^{k})^{*})
{1. Computing the parameter gradients:}
{1.1 Forward propagation:}
for k = 1 to L do
       \tilde{W}^{k}\leftarrow(1-\alpha)\operatorname{staircase}(W^{k})+\alpha W^{k}
      z^{k}\leftarrow a^{k-1}\tilde{W}^{k}
      \hat{z}^{k}\leftarrow\operatorname{BatchNorm}(z^{k},\gamma^{k},\beta^{k})
      a^{k}\leftarrow\theta(\hat{z}^{k})
end for
{1.2 Backward propagation:}
Compute g^{a^{L}}=\frac{\partial C}{\partial a^{L}} based on a^{L} and y.
for k = L to 1 do
       (g^{\gamma^{k}},g^{\beta^{k}})\leftarrow\operatorname{BackBatchNorm}(g^{a^{k}},z^{k},\gamma^{k},\beta^{k})
      \mathcal{T}^{k}\leftarrow g^{a^{k}}\theta^{\prime}(\hat{z}^{k})
      g^{a^{k-1}}\leftarrow\mathcal{T}^{k}\tilde{W}^{k}
      g^{W^{k}}\leftarrow\alpha{(\mathcal{T}^{k})}^{T}a^{k-1}
end for
{2. Updating the parameters:}
for k = 1 to L do
       ((\gamma^{k})^{*},(\beta^{k})^{*})\leftarrow\operatorname{Update}(\gamma^{k},\beta^{k},\eta,g^{\gamma^{k}},g^{\beta^{k}})
      (W^{k})^{*}\leftarrow\operatorname{Update}(W^{k},g^{W^{k}},\eta)
end for
Algorithm 1: Training algorithm for n-BQ-NN with BN. C is the cost function for the mini-batch, θ is the activation function, and L is the number of layers. The function staircase(·) specifies how to quantize the weights. BatchNorm() specifies how to batch-normalize the inputs. BackBatchNorm() specifies how to back-propagate through the BN. Update() specifies how to update the parameters when their gradients are known, using either SGD or ADAM.
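To connect Algorithm 1 with concrete code, the sketch below performs one forward/backward pass for a single fully-connected layer using the staircase helper from Section III-A; batch normalization is omitted for brevity, the activation is a ReLU, and the shapes, learning rate and plain SGD update are illustrative assumptions rather than the released implementation.

import numpy as np

def theta(x):          return np.maximum(x, 0.0)           # ReLU activation
def theta_prime(x):    return (x > 0.0).astype(x.dtype)

def train_step(W, a_prev, grad_a, alpha=0.1, lr=0.01, n_bits=3):
    # 1.1 forward propagation with the reconstructed weights of Eq. 14
    W_tilde = (1.0 - alpha) * staircase(W, n_bits) + alpha * W
    z = a_prev @ W_tilde                       # (batch, out); BN is skipped in this sketch
    a = theta(z)
    # 1.2 backward propagation (grad_a stands for dC/da of this layer)
    T = grad_a * theta_prime(z)                # error of this layer
    grad_a_prev = T @ W_tilde.T                # error propagated to the previous layer
    g_W = alpha * (a_prev.T @ T)               # Eq. 15, summed over the batch
    # 2. parameter update (plain SGD, averaged over the batch)
    W_new = W - lr * g_W / a_prev.shape[0]
    return W_new, a, grad_a_prev

W = np.random.uniform(-1, 1, size=(8, 4))
a_prev = np.random.randn(32, 8)
grad_a = np.random.randn(32, 4)
W, a, g_prev = train_step(W, a_prev, grad_a)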

III-E Activation Quantization in n-BQ-NN

The above discussion is all about the quantization of weights. Considering the integrity of our n-BQ-NN and the needs of the subsequent ablation experiments, we discuss the quantization of activations in this subsection. Returning to Section III-B, in the case of quantized activations the output of the j-th neuron of the l-th layer can be rewritten as,

\hat{a}^{l}_{j}=\operatorname{staircase}(z^{l}_{j}) (26)

where \operatorname{staircase}(\cdot) serves as the activation function.

At this point, we encounter the same problem: the error of the network \mathcal{T}_{j}^{l} becomes zero due to the term \frac{\partial\hat{a}_{j}^{l}}{\partial z_{j}^{l}} when Eq.26 is substituted into Eq.8 and Eq.9.

By considering the expectation of \frac{\partial C}{\partial z_{j}^{L}}, the error of the network reappears, which is guaranteed by Theorem 3.

Theorem 3

Let us define C=C(\hat{a}_{j},\epsilon_{j}), where \hat{a}_{j} follows Eq.26 and is chosen from (\pm 2^{-0},\pm 2^{-1},\cdots,0). Then we get the expression

\mathbb{E}_{\epsilon_{j}}\left[\frac{\partial C}{\partial z_{j}}\right]=\lambda\frac{\partial C}{\partial\hat{a}_{j}},\quad\text{if }|z_{j}|\leq 1, (27)

where ϵ_j is the noise source that influences z_j, \mathbb{E}_{\epsilon_{j}}[\cdot] denotes the expectation over ϵ_j, and λ is a constant.

Proof:

\begin{split}\mathbb{E}_{\epsilon_{j}}\left[\frac{\partial}{\partial z_{j}}C\right]&=\frac{\partial}{\partial z_{j}}\mathbb{E}_{\epsilon_{j}}[C]\\ &=\frac{\partial}{\partial z_{j}}\Big[\sum_{i}C(\hat{a}_{j}=+2^{i})P(z_{j}>\epsilon_{j}|z_{j})+\sum_{i}C(\hat{a}_{j}=-2^{i})(1-P(z_{j}>\epsilon_{j}|z_{j}))\Big]\\ &=\frac{\partial P(z_{j}>\epsilon_{j}|z_{j})}{\partial z_{j}}\Big[\sum_{i}C(\hat{a}_{j}=+2^{i})-\sum_{i}C(\hat{a}_{j}=-2^{i})\Big]\end{split} (28)

For C(\hat{a}_{j}=\pm 2^{i}), we can approximate it using the Taylor expansion.

\begin{split}&C(\hat{a}_{j}=+2^{i})=C(\hat{a}_{j}=0)+\frac{\partial C}{\partial\hat{a}_{j}}\bigg{|}_{\hat{a}_{j}=0}2^{i}+\frac{\partial^{2}C}{\partial\hat{a}_{j}^{2}}\bigg{|}_{\hat{a}_{j}=0}2^{2i}+O\left(\frac{\partial^{3}C}{\partial\hat{a}_{j}^{3}}\bigg{|}_{\hat{a}_{j}=0}2^{3i}\right)\\ &C(\hat{a}_{j}=-2^{i})=C(\hat{a}_{j}=0)-\frac{\partial C}{\partial\hat{a}_{j}}\bigg{|}_{\hat{a}_{j}=0}2^{i}+\frac{\partial^{2}C}{\partial\hat{a}_{j}^{2}}\bigg{|}_{\hat{a}_{j}=0}2^{2i}+O\left(\frac{\partial^{3}C}{\partial\hat{a}_{j}^{3}}\bigg{|}_{\hat{a}_{j}=0}2^{3i}\right)\end{split} (29)

For \frac{\partial P(z_{j}>\epsilon_{j}|z_{j})}{\partial z_{j}}, we split it into two parts.

\begin{split}\frac{\partial P(z_{j}>\epsilon_{j}|z_{j})}{\partial z_{j}}&=\underset{|z_{j}|>1}{\frac{\partial P(z_{j}>\epsilon_{j}|z_{j})}{\partial z_{j}}}+\underset{|z_{j}|\leq 1}{\frac{\partial P(z_{j}>\epsilon_{j}|z_{j})}{\partial z_{j}}}\\ &=\frac{\partial\int_{-1}^{1}\frac{1}{2}\,d\epsilon_{j}}{\partial z_{j}}+\frac{\partial\int_{-z_{j}}^{z_{j}}\frac{1}{2}\,d\epsilon_{j}}{\partial z_{j}}=\mathbf{1}_{|z_{j}|\leq 1}\end{split} (30)

Combining Eq.29 and Eq.30, Eq.28 can be rewritten as

\mathbb{E}_{\epsilon_{j}}\left[\frac{\partial C}{\partial z_{j}}\right]=\mathbf{1}_{|z_{j}|\leq 1}\left(2\sum_{i}2^{2i}\frac{\partial C}{\partial\hat{a}_{j}}\bigg{|}_{\hat{a}_{j}=0}\right) (31)

Letting 2\sum_{i}2^{2i}=\lambda, we obtain

\mathbb{E}_{\epsilon_{j}}\left[\frac{\partial C}{\partial z_{j}}\right]=\lambda\frac{\partial C}{\partial\hat{a}_{j}}\mathbf{1}_{|z_{j}|\leq 1} (32)

Under Theorem 3, we can re-express the error of the network and quantize the activations in our n-BQ-NN by rewriting Eq.8 and Eq.9 as follows,

\mathcal{T}_{j}^{L}=\frac{\partial C}{\partial z_{j}^{L}}=\lambda\frac{\partial C}{\partial\hat{a}_{j}^{L}}\mathbf{1}_{|z_{j}|\leq 1} (33)

\begin{split}\mathcal{T}_{j}^{l}=\frac{\partial C}{\partial z_{j}^{l}}=\lambda\frac{\partial C}{\partial\hat{a}_{j}^{l}}\mathbf{1}_{|z_{j}|\leq 1}&=\lambda\sum_{k}\frac{\partial C}{\partial z_{k}^{l+1}}\cdot\frac{\partial z_{k}^{l+1}}{\partial\hat{a}_{j}^{l}}\\ &=\lambda\sum_{k}\mathcal{T}_{k}^{l+1}\cdot\frac{\partial\left(\hat{W}_{kj}^{l+1}\hat{a}_{j}^{l}+b_{k}^{l+1}\right)}{\partial\hat{a}_{j}^{l}}\\ &=\lambda\sum_{k}\mathcal{T}_{k}^{l+1}\cdot\hat{W}_{kj}^{l+1}\mathbf{1}_{|z_{j}|\leq 1}\end{split} (34)
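A minimal sketch of this activation quantization in code: the forward pass applies the staircase function of Eq.26, and the backward pass follows Eqs.33-34 by multiplying the incoming gradient by λ·1_{|z|≤1}. Here λ is treated as a tunable constant and staircase is the helper sketched in Section III-A; both choices are ours.

import numpy as np

def act_forward(z, n_bits=3):
    """Quantized activation a_hat = staircase(z) of Eq. 26."""
    return staircase(z, n_bits)

def act_backward(grad_a_hat, z, lam=1.0):
    """Expected-gradient rule of Eqs. 33-34: pass the gradient through, scaled by lam,
    only where |z| <= 1; lam stands for the constant lambda of Theorem 3."""
    return lam * grad_a_hat * (np.abs(z) <= 1.0)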

IV Experiment

TABLE I: The outline of the proposed n-BQ-NN network architecture. Taking the CIFAR datasets as an example, the initial input size of the network is 32×32×3. The quantized convolution (conv quantized) contains three calculation steps: \hat{W}=\operatorname{staircase}(W), \operatorname{net}=\operatorname{conv2d}(\hat{W},x), and \operatorname{net}=\operatorname{BatchNorm}(\operatorname{net}), where the weights involved in the convolution calculation are quantized weights chosen from the entries (\pm 2^{-0},\pm 2^{-1},\cdots,0). In the convolution calculation, the multiplications are replaced by SHIFT operations during inference, because the weights are powers of 2.
type patch size/stride output size
conv quantized 3×3/1 32×32×128
conv quantized 3×3/1 32×32×128
conv quantized 3×3/1 32×32×128
pool 2×2/2 16×16×128
conv quantized 3×3/1 16×16×256
conv quantized 3×3/1 16×16×256
conv quantized 3×3/1 16×16×256
pool 2×2/2 8×8×256
conv quantized 3×3/1 8×8×512
conv quantized 1×1/1 8×8×1024
conv quantized 1×1/1 8×8×10 (100)
pool 8×8 1×1×10 (100)
softmax classifier 1×1×10 (100)

In our experiments, we use three network structures: ResNet, DenseNet and AlexNet. The network structure of our n-BQ-NN (where n can take the values 1, 2, 3, 4, 5) is similar to the architecture of All-CNN [33], which consists solely of convolution layers and Network in Network blocks [34]. Table I details the parameter settings and our network architecture, and a compact layer-specification view of the same architecture is sketched after this paragraph. In the following experiments, our training algorithm is used to train the model from scratch or fine-tune the full-precision model on five benchmark datasets: MNIST, SVHN, CIFAR-10, CIFAR-100 and ImageNet. We organize our experiments along 4 dimensions: classification accuracy compared with low-precision QNNs, quantization errors of our training method, compression ratio on different datasets, and convergence speed compared with BNN.
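For reference, the Table I architecture can be written down as a plain layer-specification list; the tuple format (type, kernel size, stride, output channels) is our own convention for illustration, not the paper's code.

# "conv_q" denotes a quantized convolution whose weights go through staircase()
# before conv2d and BatchNorm, exactly as described in the Table I caption.
N_BQ_NN_CIFAR = [
    ("conv_q", 3, 1, 128), ("conv_q", 3, 1, 128), ("conv_q", 3, 1, 128),
    ("pool",   2, 2, 128),
    ("conv_q", 3, 1, 256), ("conv_q", 3, 1, 256), ("conv_q", 3, 1, 256),
    ("pool",   2, 2, 256),
    ("conv_q", 3, 1, 512), ("conv_q", 1, 1, 1024), ("conv_q", 1, 1, 10),  # 100 for CIFAR-100
    ("pool",   8, 8, 10),          # global average pooling down to 1x1x10
    ("softmax", None, None, 10),
]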

IV-A MNIST

The MNIST dataset [35] consists of handwritten digit images with 32×32 pixels, organized into 10 classes (0 to 9). The training and test sets contain 60,000 and 10,000 images respectively. We use this dataset without data augmentation [36].

IV-B CIFAR

The two CIFAR datasets [37] consist of natural color images with 32×32 pixels, with 50,000 training and 10,000 test images each, and we hold out 5,000 training images as a validation set. CIFAR-10 (C10) consists of images organized into 10 classes and CIFAR-100 (C100) into 100 classes. We adopt a standard data augmentation scheme (random corner cropping and random flipping) that is widely used for these two datasets [38, 39, 40, 34, 41, 36, 33, 38]. We normalize the images using the channel means and standard deviations in preprocessing.

IV-C SVHN

The SVHN dataset [42] consists of 32×32 color images of house numbers collected by Google Street View, organized into 10 classes (0 to 9). There are 73,257 images in the training set, 531,131 additional training images, and 26,032 images in the test set. We divide the pixel values by 255.0 so that they lie in the [0,1] range, as in [43]. Following common practice, we do not preprocess the dataset and use no data augmentation [44, 39, 34, 36, 45].

IV-D Experiment Results

IV-D1 n-bit

TABLE II: Our n-BQ-ResNet generates extremely low-precision models with very similar accuracy compared with the full-precision ResNet-110 model on CIFAR-10.
Model Bit-width Test error
ResNet-110 ref 16 6.61%
n-BQ-ResNet 5 7.04%
n-BQ-ResNet 4 7.07%
n-BQ-ResNet 3 7.15%
n-BQ-ResNet 2 8.76%
n-BQ-ResNet 1 10.52%

As in the theoretical analysis of Section III, different quantized bit widths bring different sampling losses, and a larger bit width means a smaller sampling loss. Thus, in this experiment, we evaluate the test error rates of our n-BQ-ResNet, which is fine-tuned from the full-precision ResNet-110, when n takes different values on CIFAR-10. The experimental results in Table II are consistent with Eq.1. Therefore, 3-bit is the better choice, because \mathcal{L}(4) is close to 0 (Eq.3) when considering both the sampling loss and the conciseness of the weight representation. This result is also experimentally confirmed by the work in [26]. Thus, our n-BQ-NN is instantiated as T-BQ-NN with n=3 in the subsequent experiments.

IV-D2 Accuracy and Capacity

TABLE III: Error rates on the CIFAR-10 and CIFAR-100 datasets with standard data augmentation (translation and mirroring), and on the MNIST and SVHN datasets without data augmentation. The overall best results are in bold. "*" indicates results run by our implementation; the remaining results are cited directly from the paper listed at the front of the row.
Test error
Method Depth Params CIFAR-10 CIFAR-100 SVHN MNIST
Network in Network [34] 9 1.9M 8.81% 35.68% 2.35% 0.53%
All-CNN [33] 9 1.4M 7.25% 33.71% 3.17% 0.63%
Highway Network [38] 19 2.3M 7.72% 32.39% 2.61%* 0.67%
BNN [15] 9 1.7M 11.40% 42.13%* 2.80% 0.96%
Round Quantization 9 1.2M 85.88% 98.90% 83.72% 80.55%
T-BQ-NN 9 1.2M 7.59% 28.90% 2.29% 0.50%
TABLE IV: Fine-tuning ResNet and DenseNet with our training algorithm on CIFAR-10(100) and SVHN. The results on C10 and C100 are with data augmentation and the results on SVHN are without data augmentation.
Network Depth Bit-width Params Test error(%)
CIFAR-10 ResNet 110 16 1.7M 6.61
T-BQ-ResNet 110 3 0.3M 7.15 (+0.54)
DenseNet 100 16 0.8M 4.51
T-BQ-DenseNet 100 3 0.15M 5.31 (+0.80)
CIFAR-100 ResNet 110 16 1.7M 35.87
T-BQ-ResNet 110 3 0.3M 37.56 (+1.69)
DenseNet 100 16 0.8M 22.27
T-BQ-DenseNet 100 3 0.15M 24.10 (+1.83)
SVHN ResNet 110 16 1.7M 3.13
T-BQ-ResNet 110 3 0.3M 3.25 (+0.12)
DenseNet 100 16 0.8M 1.76
T-BQ-DenseNet 100 3 0.15M 2.10 (+0.34)
TABLE V: Deep compression method for T-BQ-ResNet and T-BQ-DenseNet. P: Pruning, Q: Quantization, H: Huffman coding.
Method Encoding bit-width Compression ratio
T-BQ-ResNet on C10 (P+Q) 3 49×
T-BQ-ResNet on C10 (P+Q+H) 2.6 57×
T-BQ-ResNet on C100 (P+Q) 3 25×
T-BQ-ResNet on C100 (P+Q+H) 2.8 27×
T-BQ-ResNet on SVHN (P+Q) 3 24×
T-BQ-ResNet on SVHN (P+Q+H) 2.8 26×
T-BQ-DenseNet on C10 (P+Q) 3 38×
T-BQ-DenseNet on C10 (P+Q+H) 2.5 46×
T-BQ-DenseNet on C100 (P+Q) 3 15×
T-BQ-DenseNet on C100 (P+Q+H) 2.8 16×
T-BQ-DenseNet on SVHN (P+Q) 3 133×
T-BQ-DenseNet on SVHN (P+Q+H) 2.1 190×
TABLE VI: Comparison of classification accuracy on the ImageNet test set with different bit widths of weights and activations. Single-crop top-1 and top-5 accuracy are given based on AlexNet. Note that the gray results are implemented by our n-BQ-NN, where the training method for 1-bit activations is introduced in Section III-E; the other results are reported by [46]. We quantize the same layers of AlexNet to low precision, as BNN [15], BC [47], TWN [48] and DoReFa-Net [27] do.
n-bit Weights n-bit Activations Inference Operation AlexNet Top-1 Accuracy AlexNet Top-5 Accuracy
1 1 XNOR 0.279 (BNN) 0.504 (BNN)
1 1 XNOR 0.348 0.601
1 32 (float) XNOR ADDER 0.368 (BC) 0.620 (BC)
1 16 (float) XNOR ADDER 0.486 0.734
2 32 (float) XNOR ADDER 0.529 (TWN) 0.766 (TWN)
2 16 (float) XNOR ADDER 0.536 0.777
3 16 (float) SHIFT ADDER 0.560 0.795
8 (float) 8 (float) MAC 0.530 (DoReFa-Net) 0.768 (DoReFa-Net)
32 (float) 32 (float) MAC 0.566 0.802

As hardware devices require a relatively simple architecture and a small number of layers, we have selected networks suited for hardware implementation for our comparative experiments. For example, BNN, with binary weights and activations, replaces most multiplications with 1-bit XNOR operations. Network in Network utilizes global average pooling over feature maps in the classification layer, which is less prone to overfitting than fully connected layers. All-CNN achieves a new architecture that consists solely of convolution layers, replacing max-pooling with a convolution layer without loss in accuracy on several benchmarks. Highway Network allows unimpeded information flow across many layers by using adaptive gating units to regulate the information flow.

Overall, T-BQ-NN performs better than most other network structures, even though these network structures have never been quantized except BNN. In this experiment, T-BQ-NN is trained by our training algorithm from scratch due to the lack of a pretrained model, with a mini-batch size of 50 and a weight decay of 0.0001. Its test error rates of 7.59% on CIFAR-10, 28.90% on CIFAR-100, 2.29% on SVHN, and 0.50% on MNIST are lower than the test error rates achieved by Network in Network, Highway Network, and BNN. In particular, T-BQ-NN makes up for the classification accuracy of BNN on CIFAR-100 to some extent. T-BQ-NN gives the best result on all listed datasets except CIFAR-10, where All-CNN is best; all results are shown in Table III.

Our model capacity is even more encouraging: the number of parameters of T-BQ-NN is significantly lower than that of the other network structures. In particular, T-BQ-NN has only 1.2M parameters, which is even lower than the 1.7M of BNN, as shown in Table III.

IV-D3 Extension

Figure 4: Comparison of the T-BQ-NN and BNN top-1 error rates on the CIFAR-10, CIFAR-100, and SVHN validation sets. The curves, from top to bottom at 900 iterations, represent T-BQ-NN on C100, BNN on C100, T-BQ-NN on C10, BNN on C10, T-BQ-NN on SVHN, and BNN on SVHN, respectively. Note that all results were run by ourselves.

One positive property of our training algorithm is that it is universal. We extend our training method to better and deeper architectures such as ResNet [16] and DenseNet [17], not just the plain CNN above. In this experiment, T-BQ-ResNet and T-BQ-DenseNet are 3-bit models that are fine-tuned by our training algorithm from the full-precision ResNet and DenseNet models.

For T-BQ-ResNet, all multiplications are converted to SHIFT and ADDER operations using 3-bit weights in all convolutional layers and shortcut connections. We use a momentum of 0.9 and a weight decay of 0.0001 [49, 44], and adopt the weight initialization and BN of [31, 32] without dropout [50]. This model is trained with a mini-batch size of 128 and a learning rate of 0.1, divided by 10 at 32k and 38k iterations, and training terminates at 64k iterations. We achieve test error rates of 7.15% on C10, 37.56% on C100, and 3.25% on SVHN using T-BQ-ResNet, a rise of only 0.54% on C10, 1.69% on C100, and 0.12% on SVHN compared with ResNet, according to Table IV.

The T-BQ-DenseNet model consists of bottleneck layers, denoted BN-ReLU-Conv(1×1)-BN-ReLU-Conv(3×3), and transition layers, denoted BN-ReLU-Conv(1×1)-averpool(2×2), both of which contain 1×1 convolutions. We use a weight decay of 0.0001 and a momentum of 0.9 [51], and adopt the weight initialization and BN without dropout. This model is trained with an initial learning rate of 0.1, divided by 10 at 50% and 75% of the total number of training epochs, using a batch size of 64 for 300 epochs on CIFAR and 40 epochs on SVHN. Comparing DenseNet and T-BQ-DenseNet, the increase in error is 0.80% (from 4.51% to 5.31%) on C10, 1.83% (from 22.27% to 24.10%) on C100, and 0.34% (from 1.76% to 2.10%) on SVHN, as shown in Table IV.

We attribute this primarily to the roughly 5× reduction in the number of parameters, from 0.8M to 0.15M for T-BQ-DenseNet and from 1.7M to 0.3M for T-BQ-ResNet, as shown in Table IV. Furthermore, a hybrid network compression solution combining three different techniques, namely network pruning [19], quantization and Huffman coding, is tested for T-BQ-ResNet and T-BQ-DenseNet at the same classification accuracy. Compared with the original full-precision ResNet-110 model, we achieve compression ratios of 57× on C10, 27× on C100 and 26× on SVHN for T-BQ-ResNet. For T-BQ-DenseNet, the compression ratios are 46× on C10, 16× on C100 and 190× on SVHN, as shown in Table V.

IV-D4 Convergence Speed

In this experiment, we train our T-BQ-NN and BNN from scratch on C10, C100, and SVHN. The results in Fig.4 indicate that T-BQ-NN not only achieves better classification accuracy than BNN but also converges much faster. We only compare our method with BNN, because the weights of the other network models in Table III are full-precision and those models are not quantized. We use the same conditions, including learning rate, batch size, and iterations, to test the error rates of BNN and T-BQ-NN. Compared with BNN, T-BQ-NN reaches its best test error with the number of epochs dropping from 500 to 150 on C10, from 1000 to 100 on MNIST, and from 1000 to 180 on C100. As a result, T-BQ-NN can be trained much more easily and quickly than BNN.

This result may be due to the fact that the straight-through estimator used by BNN contains noise, which causes unexpected deviations, while our training algorithm is based entirely on back-propagation without the effect of noise, and our weight representation is richer.

IV-E ImageNet

We further evaluate our n-BQ-NN on the ILSVRC2012 [11] image classification dataset, which consists of 1.2 million high-resolution natural images, with a validation set containing 50k images. These images are organized into 1000 object categories for training and are resized to 224×224 pixels before being fed into the network. In the following experiments, we report single-crop evaluation results using top-1 and top-5 accuracy.

AlexNet: This CNN architecture is the first that was shown to be successful on the ImageNet classification task, and consists of 5 convolutional layers and 2 fully-connected layers [1]. We use AlexNet coupled with BN, which contains a total of 61M parameters.

In training, images are randomly resized to 256×256 pixels, and then a random 224×224 crop is selected for training. We train our n-BQ-NN for 50 epochs with a batch size of 128/16 based on AlexNet/VGG-16. We use the ADAM optimizer with a learning rate of 1e-4. We replace the Local Contrast Renormalization layer with a Batch Normalization layer. At inference, we use the 224×224 center crop for forward propagation.

The ablation experiments are listed in Table VI. The baseline AlexNet model scores 56.6% top-1 accuracy and 80.2% top-5 accuracy, as reported in [52]. For the ablation studies, we strictly control the consistency of variables, including network structure, bit width and quantized layers; the only difference is the quantization method. In the experiments "1-1" vs. "1-1", "1-16" vs. "1-32" and "2-16" vs. "2-32", our n-BQ-NN achieves 6.9%, 11.8% and 0.7% accuracy improvements respectively. For "3-16" vs. "32-32", our n-BQ-NN only reduces the accuracy by 0.6%.

V Acceleration on FPGA

Figure 5: High-level block diagram of our system.

Figure 6: The overall parallel architecture with clusters of shift vector processing elements (SVPEs).

Figure 7: An SVPE array implementing the primitive 2D convolver unit. The double orange rectangles represent on-chip memory banks that buffer the weights, the single orange rectangles represent registers, the sky-blue circles represent shift operations replacing the multipliers of the VPE array, and the deep-green circles represent adders.

We evaluate our n-BQ-NN on an FPGA platform, the Xilinx ZCU102, which consists of an UltraScale FPGA, quad ARM Cortex-A53 processors and 500 MB DDR3. To measure the performance of our n-BQ-NN running on the FPGA, we obtain run-time, resource utilization, and power data by simulating and testing on Vivado 2017 at an operating frequency of 200 MHz. Our n-BQ-NN implementation involves a few design parameters: the parallelization degree (P_{n} and P_{m}), filter size (k), input feature map width (W), input feature map height (H), number of input feature channels (M), and number of output feature channels (N).

V-A Coprocessor Architecture

Fig.5 shows the block diagram of our system. In the CNN calculation process, the host computer hands off the weights and images to the coprocessor (ZCU102) and collects the predicted classification results. The transmission mode between the host computer and the ARM CPU can be switched between PCI and UDP. Before the computation, the host computer is responsible for feeding the images and reducing precision. In addition, the ARM CPU completes the calculation of the fully-connected layers, which are not suitable for FPGA parallel acceleration, while the FPGA accelerates the calculation of the convolutional layers.

We build the coprocessor with a parallel architecture, as shown in Fig. 6. The critical part of the coprocessor is the SVPE cluster interface, which has M SVPE clusters, where each SVPE cluster consists of N SVPE arrays of size k×k. The adders compute partial sums of convolutions while the SVPE arrays compute the convolutions themselves. The fetch unit is programmed to fetch images and weights from the ARM-based processing system (PS), and the load/store unit loads or stores intermediate calculation results. The AXI-HP port receives or sends the data, and the AXI-GP port receives or sends the network structure information and control signals. A key point to note is that 16-bit computational precision is used on the data buses to save bandwidth.

The basic design follows the architecture of n-BQ-NN, which converts all 16-bit weights to n-bit weights to reduce memory usage and increase the parallelization degree. On FPGAs, the shortage of DSPs is a major factor limiting the parallelization degree, which directly affects the ability to accelerate CNN computation, because the multiplications in the convolution calculation require DSPs. Instead, we implement the multiplication with a SHIFT operation that consumes Look-Up-Table (LUT) resources on FPGAs, and LUTs are more abundant than DSPs [30]. In general, our n-BQ-NN consists of 16-bit activations and n-bit weights.
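To make the SHIFT-for-multiply substitution concrete, the following C sketch illustrates the idea under an assumed weight encoding (a sign plus a signed exponent, so that w = ±2^e); the names pow2_weight_t and shift_mul are illustrative only and do not correspond to the actual hardware description.

```c
#include <stdint.h>

/* Assumed encoding of an n-bit power-of-two weight: a sign and a signed
 * exponent e, i.e. w = sign * 2^e.  The real on-chip encoding is not
 * specified here; this only illustrates replacing a multiply by a SHIFT. */
typedef struct {
    int8_t sign;  /* +1 or -1                 */
    int8_t exp;   /* exponent e, so |w| = 2^e */
} pow2_weight_t;

/* 16-bit activation times a power-of-two weight: an arithmetic shift
 * (left for e >= 0, right for e < 0) plus a sign fix-up -- no multiplier,
 * and hence no DSP, is needed. */
static inline int32_t shift_mul(int16_t act, pow2_weight_t w)
{
    int32_t v = (int32_t)act;
    int32_t shifted = (w.exp >= 0) ? (v << w.exp) : (v >> -w.exp);
    return (w.sign >= 0) ? shifted : -shifted;
}
```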

Our architecture for convolution computation is characterized by several key attributes compared with VPEs [14]. First, we organize the architecture as arrays of SVPEs, where an SVPE array is an array of 2D convolvers, each of which consists of k^2 connected SHIFT and ADDER units instead of Multiply-Accumulate (MAC) units, as shown in Fig. 7. The weights and feature maps are loaded into each PE alternately through the AXI-HP port. Before each calculation, the weights are buffered in the specified areas (the double orange rectangles in Fig. 7), and the pipelined calculation then starts once the feature maps are enabled. Modeling the SVPE and VPE arrays, we compare their resource consumption, parallelization degree and power on FPGAs, as shown in Table VII; our SVPE array consumes 3.81 W on average over the different values of n, which is less than the 5.53 W of the VPE array. Second, we reduce the off-chip memory bandwidth demand and improve the data movement between the SVPE clusters and the off-chip memory by using n-bit weights. Third, all convolvers are homogeneous when k is fixed as our primitive operator. We evaluate the improvement in the computational efficiency of n-BQ-NN in hardware in the following section.
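As a software analogue of one 2D convolver in the SVPE array (a sketch only, reusing the hypothetical shift_mul helper above), a single output pixel is produced by accumulating k^2 shifted activations, mirroring the k^2 SHIFT and ADDER units of Fig. 7; the hardware pipelines this computation across the feature map rather than looping.

```c
/* One output pixel of a k x k 2D convolver built from SHIFT and ADDER
 * units.  `window` holds the k x k patch of 16-bit activations currently
 * in the line-buffer registers, and `weights` the buffered power-of-two
 * weights for one filter. */
static int32_t svpe_conv2d_pixel(const int16_t *window,
                                 const pow2_weight_t *weights,
                                 int k)
{
    int32_t acc = 0;
    for (int i = 0; i < k * k; ++i)
        acc += shift_mul(window[i], weights[i]);  /* shift + add, no DSP multiply */
    return acc;
}
```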

V-B Computational Efficiency

TABLE VII: Comparison of VPE and SVPE array resource consumption, parallelization degree and power.
  Array                               SVPE                           VPE
  Precision (n)               1      2      3      4      5          16
  Power (W)   Signal        1.94   2.31   2.61   2.53   2.13        4.88
              Logic         1.03   1.33   1.55   1.63   1.62        0.25
              DSPs          0.08   0.09   0.09   0.08   0.06        0.40
              Total         3.05   3.73   4.25   4.24   3.81        5.53
  Used        LUTs           353    280    307    334    346          41
  resource    FFs            220    226    232    238    244         213
              DSPs             3      3      3      3      3          12
  Parallelization degree (P_n, P_m)       (8, 32)                 (4, 16)

Since the filter size (3×3) is fixed for our n-BQ-NN, resource utilization can be maximized. We can predict the performance of n-BQ-NN on FPGAs by developing an analytical model; in the following, we rely on it to compare the computational efficiency of a traditional implementation and of n-BQ-NN on FPGAs.

On the hardware, the MAC units, adders and multipliers consume DSPs. In fact, the number of DSPs depends only on the filter size and the parallelization degree [30, 12], as follows:

DSP = (k^2 + k) × P_n × P_m    (35)

We must balance the memory bandwidth between on-chip and off-chip memory and ensure that the transmission speed is greater than or equal to the computation speed in order to utilize the resources efficiently. The time to process the input data in the line buffer on the FPGA is,

T_compute = H × W × (M / P_m) × (N / P_n) × (1 / Freq)    (36)

where Freq is the operating frequency of the FPGA. Correspondingly, the time to transfer the required data (the weights and the buffered input rows), which proceeds in parallel with the computation, is,

T_transfer = (M × N × k^2 + k × W × M) / Bandwidth    (37)

We require that T_transfer ≤ T_compute. Therefore, the minimum bandwidth requirement is,

Bandwidth_min = (P_m × P_n / min(N, M)) × b_compute × Freq    (38)

where b_compute is the bit-width of the computation; we evaluate the hardware acceleration performance with a 16-bit width. We define T_init as the time to load the first k rows of the input image and the required filter weights into on-chip memory, as follows,

T_init = (M × N × k^2 × b_weight) / Bandwidth + (W × M × k × b_compute) / Bandwidth    (39)

where b_weight is the bit-width of the weights. The total number of operations is,

OPs = H × W × M × N × k^2 × 2    (40)

Then the total processing time of the convolution is,

T_total = T_compute + T_init    (41)

Finally, we can compare the computational efficiency of different models by defining the effective convolution performance as,

Perf_eff = OPs / T_total    (42)

We obtain the computational efficiency Perf_eff(n) corresponding to different weight bit widths, where n = b_weight denotes the weight bit width of our n-BQ-NN:

Perf_eff(n) = (32 × Freq × P_m × P_n × H × W × N × k^2) / (16 × H × W × N + n × M × N × k^2 + 16 × W × M × k)    (43)

Now, given a convolutional layer represented by (W, H, M, N), we obtain the computational efficiency from the design parameters (k, P_m, P_n).
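For concreteness, the following C sketch transcribes the closed form of Eq. 43 (the function name perf_eff and its argument order are our own choices, and the bandwidth is assumed to be provisioned at the minimum of Eq. 38 with b_compute = 16, as in the text):

```c
/* Effective convolution performance of Eq. (43), in operations per second,
 * for a layer (W, H, M, N) with filter size k, parallelization degree
 * (Pm, Pn), n-bit weights and 16-bit activations.  Bandwidth is assumed
 * to be provisioned at the minimum of Eq. (38) with b_compute = 16. */
static double perf_eff(double W, double H, double M, double N,
                       double k, double Pm, double Pn,
                       double n, double freq)
{
    double numerator   = 32.0 * freq * Pm * Pn * H * W * N * k * k;
    double denominator = 16.0 * H * W * N + n * M * N * k * k + 16.0 * W * M * k;
    return numerator / denominator;
}
```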

The main factor restricting the computational efficiency of CNNs on FPGAs is the parallelization degree, which is directly related to the number of DSPs when the bandwidth is provisioned reasonably. To speed up CNN inference on FPGAs, we replace the traditional VPE cluster with our SVPE cluster by converting the multiplications into SHIFT and ADDER operations. Since we no longer use multiplication, the number of DSPs is reduced as follows:

DSP = k × P_n × P_m    (44)

Since the SVPE array consumes far fewer DSPs than the VPE array, as seen by comparing Eq. 35 with Eq. 44, n-BQ-NN with the SVPE array can achieve a higher parallelization degree than CNNs with the VPE array for the same number of consumed DSPs. Based on the maximum of 2520 DSPs in Table VIII and a balanced memory bandwidth, we can design maximum parallelization degrees of (P_m = 32, P_n = 8) and (P_m = 16, P_n = 4) on the SVPE and VPE arrays, respectively, with 3×3 filters. Thanks to the SVPE array, the parallelization degree increases by 4 times, which greatly improves the computational efficiency when the total number of consumed DSPs is 768. Supposing a color image of 32 (W) × 32 (H) × 3 (M) pixels, a filter size k of 3 and a DDR bit width N of 128, the computational efficiency of our n-BQ-NN using the SVPE array improves by about 4.1 times compared with a traditional network using the VPE array, on the basis of Eq. 43.
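Plugging the numbers from this paragraph into the sketch above (a usage example only, continuing the perf_eff function defined after Eq. 43 and taking the 200 MHz operating frequency, which cancels in the ratio) reproduces the quoted factor of about 4.1:

```c
#include <stdio.h>

int main(void)
{
    const double freq = 200e6;                 /* 200 MHz operating frequency     */
    double svpe = perf_eff(32, 32, 3, 128, 3,  /* W, H, M, N, k                   */
                           32, 8, 3, freq);    /* Pm = 32, Pn = 8, 3-bit weights  */
    double vpe  = perf_eff(32, 32, 3, 128, 3,
                           16, 4, 16, freq);   /* Pm = 16, Pn = 4, 16-bit weights */
    printf("SVPE/VPE efficiency ratio: %.1f\n", svpe / vpe);  /* prints about 4.1 */
    return 0;
}
```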

V-C Performance on FPGA

Figure 8: A universal SVPE array supporting AlexNet with multiple kernel sizes (3, 5, and 11), where the yellow blocks are Shift-Accumulator units.

We evaluate our SVPE cluster implementation using AlexNet, where our n-BQ-NN contains 16-bit activations and 3-bit weights. Table VIII gives the evaluation results together with a comparison against state-of-the-art FPGA accelerators, where GOP indicates the unit for the number of operations. From the hardware perspective, we prefer to use the computational efficiency to describe the performance of the algorithm: because the total number of operations for computing a network is fixed, the execution time (s) is obtained by dividing the total operations (GOP) by the computational efficiency (GOP/s). Similar to the structure of Fig. 7, a universal SVPE array designed for the largest filter size of AlexNet is shown in Fig. 8. This experiment uses the universal SVPE array, which fits all filter sizes after a slight adjustment. Our array is designed to be reused when computing the convolutions of different layers, but the 11×11 filter of AlexNet is only used in the first convolutional layer, so for most layers the array utilization is extremely low. This also confirms the necessity of designing the network architecture with a unified 3×3 filter to improve resource utilization, as in Table I.

Compared to prior works [53, 30], we improve the average CNN performance to 957.4 GOP/s, where the work [30] is implemented with the Winograd algorithm. The baseline implements the same hardware architecture as our implementation; the only difference is that it uses a VPE cluster, because its weights and activations are both 16-bit. The computational efficiency of our implementation improves by 2.9 times compared with the baseline, which is slightly less than the 4.1 times predicted by the theoretical calculations in Section V-B. Our implementation also improves the energy efficiency to 48.9 GOP/s/W. The better energy and resource efficiency come from the novel SVPE structure.

TABLE VIII: Performance comparison for AlexNet
                                          [53]           Baseline       [30]           Our Impl.
  Precision                               32-bit fixed   16-bit fixed   16-bit fixed   16-bit fixed
  Device                                  VX485T         ZCU102         ZCU102         ZCU102
  Freq (MHz)                              100            200            200            200
  Logic cells (K)                         485.7          600            600            600
  DSPs                                    2800           2520           2520           2520
  BRAM (Kb)                               2060×18        1824×18        1824×18        1824×18
  conv1 (GOP/s)                           27.5           227.5          409.6          410.5
  conv2 (GOP/s)                           83.8           535.8          1355.6         1744.3
  conv3 (GOP/s)                           78.8           655.9          1535.7         1680.7
  conv4 (GOP/s)                           77.9           634.4          1361.7         1739.4
  conv5 (GOP/s)                           77.6           559.5          1285.7         1456.1
  CNN average (GOP/s)                     61.6           332.2          854.6          957.4
  Power (W)                               18.6           28.7           23.6           19.6
  DSP efficiency (GOP/s/DSP)              0.022          0.131          0.339          0.381
  Logic cell efficiency (GOP/s/K cells)   0.127          0.553          1.424          1.596
  Energy efficiency (GOP/s/W)             3.31           11.57          36.2           48.85
  DSP utilization                         80%            30%            63%            30%
  LUT utilization                         61%            48%            39%            73%
  FF utilization                          34%            42%            33%            68%
  BRAM utilization                        50%            50%            43%            83%

VI Conclusion and future work

In this paper, we present a novel learning framework to quantize full-precision CNN models into low-precision QNN models whose weights are constrained to powers of two. We solve the problem of gradient vanishing by adding a reconstructed gradient function to the back-propagation algorithm. To satisfy the network-structure-optimization requirements imposed by hardware limitations, we propose n-BQ-NN, a novel QNN structure that replaces multiplication with the SHIFT operation and is more suitable for inference on FPGAs. Furthermore, we design the SVPE array to replace all 16-bit multiplications with SHIFT operations in the convolution operation on FPGAs. To demonstrate the validity of our learning framework, we conduct experiments showing that the quantized ResNet, DenseNet and AlexNet models obtained through our learning framework achieve almost the same accuracies as the original full-precision models. Moreover, when our learning framework is used to train our n-BQ-NN from scratch, it achieves nearly state-of-the-art results compared with typical low-precision QNNs. We also evaluate computational efficiency and energy consumption by implementing our QNN models on the Xilinx ZCU102 platform. In our hardware experiments, our n-BQ-NN with the SVPE array executes 2.9 times faster than with the VPE array in inference, and the use of the SVPE array also reduces the average energy consumption to 68.7% of that of the 16-bit VPE array. Our future work will explore how to further decrease the accumulated quantization errors when our learning framework is applied to different CNN structures.

References

  • [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in International Conference on Neural Information Processing Systems, pp. 1097–1105, 2012.
  • [2] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.
  • [3] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2015.
  • [4] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to human-level performance in face verification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1701–1708, 2014.
  • [5] Y. Sun, X. Wang, and X. Tang, “Deep learning face representation from predicting 10,000 classes,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1891–1898, 2014.
  • [6] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440, 2015.
  • [7] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected crfs,” arXiv preprint arXiv:1412.7062, 2014.
  • [8] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, pp. 1440–1448, 2015.
  • [9] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, pp. 91–99, 2015.
  • [10] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
  • [11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255.   Ieee, 2009.
  • [12] M. Peemen, A. A. Setio, B. Mesman, H. Corporaal et al., “Memory-centric accelerator design for convolutional neural networks.” in ICCD, vol. 2013, pp. 13–19, 2013.
  • [13] P. Meloni, G. Deriu, F. Conti, I. Loi, L. Raffo, and L. Benini, “A high-efficiency runtime reconfigurable ip for cnn acceleration on a mid-range all-programmable soc,” in 2016 International Conference on ReConFigurable Computing and FPGAs, pp. 1–8.   IEEE, 2016.
  • [14] M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic, E. Cosatto, and H. P. Graf, “A massively parallel coprocessor for convolutional neural networks,” in Application-specific Systems, Architectures and Processors, 2009. ASAP 2009. 20th IEEE International Conference on, pp. 53–60.   IEEE, 2009.
  • [15] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1,” 2016.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
  • [17] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks.” in CVPR, vol. 1, no. 2, p. 3, 2017.
  • [18] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in neural information processing systems, pp. 1135–1143, 2015.
  • [19] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient convnets,” arXiv preprint arXiv:1608.08710, 2016.
  • [20] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
  • [21] J. Van Leeuwen, “On the construction of huffman trees.” in ICALP, pp. 382–410, 1976.
  • [22] J. Choi, P. I.-J. Chuang, Z. Wang, S. Venkataramani, V. Srinivasan, and K. Gopalakrishnan, “Bridging the accuracy gap for 2-bit quantized neural networks (qnn),” arXiv preprint arXiv:1807.06964, 2018.
  • [23] J. Choi, Z. Wang, S. Venkataramani, P. I.-J. Chuang, V. Srinivasan, and K. Gopalakrishnan, “Pact: Parameterized clipping activation for quantized neural networks,” arXiv preprint arXiv:1805.06085, 2018.
  • [24] Z. Liu, B. Wu, W. Luo, X. Yang, W. Liu, and K.-T. Cheng, “Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 722–737, 2018.
  • [25] B. Zhuang, C. Shen, M. Tan, L. Liu, and I. Reid, “Towards effective low-bitwidth convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7920–7928, 2018.
  • [26] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, “Incremental network quantization: Towards lossless cnns with low-precision weights,” arXiv preprint arXiv:1702.03044, 2017.
  • [27] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv preprint arXiv:1606.06160, 2016.
  • [28] C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained ternary quantization,” arXiv preprint arXiv:1612.01064, 2016.
  • [29] E. Park, D. Kim, S. Yoo, and P. Vajda, “Precision highway for ultra low-precision quantization,” arXiv preprint arXiv:1812.09818, 2018.
  • [30] L. Lu, Y. Liang, Q. Xiao, and S. Yan, “Evaluating fast algorithms for convolutional neural networks on fpgas,” in Field-Programmable Custom Computing Machines (FCCM), 2017 IEEE 25th Annual International Symposium on, pp. 101–108.   IEEE, 2017.
  • [31] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
  • [32] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, pp. 1026–1034, 2015.
  • [33] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for simplicity: The all convolutional net,” arXiv preprint arXiv:1412.6806, 2014.
  • [34] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
  • [35] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [36] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeply-supervised nets,” in Artificial Intelligence and Statistics, pp. 562–570, 2015.
  • [37] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
  • [38] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep networks,” in Advances in neural information processing systems, pp. 2377–2385, 2015.
  • [39] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, “Deep networks with stochastic depth,” in European Conference on Computer Vision, pp. 646–661.   Springer, 2016.
  • [40] G. Larsson, M. Maire, and G. Shakhnarovich, “Fractalnet: Ultra-deep neural networks without residuals,” arXiv preprint arXiv:1605.07648, 2016.
  • [41] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: Hints for thin deep nets,” arXiv preprint arXiv:1412.6550, 2014.
  • [42] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in NIPS workshop on deep learning and unsupervised feature learning, vol. 2011, no. 2, p. 5, 2011.
  • [43] S. Zagoruyko and N. Komodakis, “Wide residual networks,” arXiv preprint arXiv:1605.07146, 2016.
  • [44] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout networks,” arXiv preprint arXiv:1302.4389, 2013.
  • [45] P. Sermanet, S. Chintala, and Y. LeCun, “Convolutional neural networks applied to house numbers digit classification,” in Pattern Recognition, 2012 21st International Conference on, pp. 3288–3291.   IEEE, 2012.
  • [46] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient processing of deep neural networks: A tutorial and survey,” Proc. IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.
  • [47] M. Courbariaux, Y. Bengio, and J.-P. David, “Binaryconnect: Training deep neural networks with binary weights during propagations,” in Advances in neural information processing systems, pp. 3123–3131, 2015.
  • [48] F. Li, B. Zhang, and B. Liu, “Ternary weight networks,” arXiv preprint arXiv:1605.04711, 2016.
  • [49] S. Gross and M. Wilber, “Training and investigating residual nets,” Facebook AI Research, CA. [Online]. Available: http://torch.ch/blog/2016/02/04/resnets.html, 2016.
  • [50] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [51] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in International conference on machine learning, pp. 1139–1147, 2013.
  • [52] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural networks,” in European Conference on Computer Vision, pp. 525–542.   Springer, 2016.
  • [53] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing fpga-based accelerator design for deep convolutional neural networks,” in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 161–170.   ACM, 2015.