
A Learning Framework for n-bit Quantized Neural Networks toward FPGAs

Jun Chen, Liang Liu, Yong Liu, and Xianfang Zeng are with the Institute of Cyber-Systems and Control, Zhejiang University, Hangzhou, China, 310027 (e-mail: [email protected]). Yong Liu is the corresponding author.
Abstract

The quantized neural network (QNN) is an efficient approach for network compression and can be widely used in the implementation of FPGAs. This paper proposes a novel learning framework for n-bit QNNs, whose weights are constrained to powers of two. To solve the gradient vanishing problem, we propose a reconstructed gradient function for QNNs in the back-propagation algorithm that can directly obtain the real gradient rather than estimating an approximate gradient of the expected loss. We also propose a novel QNN structure named n-BQ-NN, which uses shift operations to replace multiply operations and is more suitable for inference on FPGAs. Furthermore, we design a shift vector processing element (SVPE) array to replace all 16-bit multiplications with SHIFT operations in the convolution operation on FPGAs. We also carry out comparative experiments to evaluate our framework. The experimental results show that the quantized models of ResNet, DenseNet and AlexNet obtained through our learning framework achieve almost the same accuracies as the original full-precision models. Moreover, when our learning framework is used to train our n-BQ-NN from scratch, it achieves state-of-the-art results compared with typical low-precision QNNs. Experiments on the Xilinx ZCU102 platform show that our n-BQ-NN with our SVPE executes inference 2.9 times faster than with the vector processing element (VPE). Since the SHIFT operation in our SVPE array does not consume Digital Signal Processing (DSP) resources on FPGAs, the experiments also show that the SVPE array reduces average energy consumption to 68.7% of the 16-bit VPE array.

Index Terms:
Deep learning, quantized neural network (QNN), deep compression, FPGA

I Introduction

Deep convolutional neural networks (CNNs) have become the dominant Artificial Intelligence (AI) approach for a variety of computer vision tasks such as image classification [1, 2, 3], face recognition [4, 5], semantic segmentation [6, 7] and object detection [8, 9]. The significant accuracy improvement of CNNs comes at the cost of huge computational complexity, resource, and power consumption, as it requires a comprehensive evaluation of all the scopes within the feature maps [10, 11]. For example, the AlexNet model is over 200 MB, and the VGG-16 model is over 500 MB [10]. To cope with such overwhelming resource and computation pressure, hardware accelerators such as GPUs, FPGAs, and ASICs have been applied to accelerate CNNs. Among these accelerators, FPGAs have emerged as one of the popular solutions when considering both reprogrammability and energy efficiency.

Implementing full-precision CNNs on FPGAs is not an efficient practice due to limited resources and bandwidth. A QNN is therefore a good choice for FPGA implementation, since it simultaneously balances computational efficiency, resources, and classification accuracy in inference. In general, QNNs can be obtained in two ways: 1) An estimator is used to estimate the gradient of the expected loss to solve the problem of gradient vanishing, so that QNNs can be trained from scratch with the help of this estimator. 2) Fine-tuning on a pretrained full-precision model yields QNNs while bypassing the problem of gradient vanishing. Although the first method estimates a gradient, which makes it possible to train QNNs from scratch, the gradient of the expected loss obtained by estimators contains a noise source compared to the real gradient, which causes a gap in classification accuracy between QNNs and full-precision CNNs. The second method fine-tunes QNNs on a pretrained full-precision model, which addresses classification accuracy better, but a challenging factor is that the structure of the QNN is limited by the original structure of the pretrained CNN model and cannot be flexibly adjusted. Due to the constraints of computational resources and computational efficiency on FPGAs, it is inevitable to adjust the network structure for the hardware environment. In order to transform different CNNs into QNNs that can run efficiently on FPGAs, a general learning framework that solves the above two challenges is essential [12, 13, 14, 15].

In this paper, we propose a novel learning framework for n-bit QNNs, whose weights are constrained to powers of two (\pm 2^{-0},\pm 2^{-1},\cdots,0). We introduce a reconstructed gradient function into the back-propagation algorithm that can directly obtain the real gradient, rather than the estimated gradient given by estimators. Thus the QNNs trained by our framework are more accurate. At the same time, QNNs whose structure has been adjusted can continue to be fine-tuned with our framework. The learning framework is applied to train our proposed n-BQ-NN, which is suitable for efficient implementation on FPGAs. We also evaluate the effectiveness of our approach on state-of-the-art networks such as ResNet [16], DenseNet [17] and AlexNet [1]. The main contributions of this article are summarized as follows:

  1. We propose a novel learning framework for n-bit QNNs. In this framework, we propose a reconstructed gradient function in the back-propagation algorithm, which overcomes the gradient vanishing problem when training QNNs and calculates the accurate gradient compared with estimator-based approaches. We achieve state-of-the-art results compared with typical low-precision QNNs.

  2. We propose a highly efficient QNN structure called n-BQ-NN for FPGAs. Our proposed architecture, which consists entirely of convolutional layers and implements a uniform convolution kernel, can maximize resource utilization and improve parallel computational efficiency on FPGAs while preserving the accuracy of QNNs.

  3. We propose a novel shift vector processing element (SVPE) array for FPGAs, which replaces multiplication with the SHIFT operation when calculating convolutions on FPGAs. The computational efficiency of our SVPE array is 2.9 times higher than that of the VPE array for the same network structure on FPGAs.

The rest of this paper is organized as follows: Section II summarizes related prior work on QNNs and FPGAs. Our learning framework is presented in Section III. In Section IV, we demonstrate the effectiveness of our learning framework via comparative experiments. We theoretically analyse and practically test the computational efficiency of our n-BQ-NN using our quantization method in Section V. The conclusion is given in Section VI.

II Related work

II-A Learning for QNNs

Since the model capacity is too large, it is necessary to reduce it in order to run CNNs on FPGAs, which is consistent with the purpose of deep compression. In general, deep compression can be divided into three categories, i.e., pruning, Huffman coding, and quantization. The pruning method simplifies the deep neural network by cutting off network connections with small weights in a normally trained network [18] [19] [20]. The Huffman coding method is an optimal code used for lossless data compression [21], which uses entropy to encode source symbols by variable-length codewords. Han et al. [20] show that 20% - 30% of the network storage can be saved after Huffman coding the non-uniformly distributed values. When it comes to running compressed networks on FPGAs, the network after pruning has an asymmetric structure, which is unsuitable for hardware implementation, and Huffman coding can only be regarded as a post-compression step combined with the other two compression methods, so most hardware accelerators focus on the quantization method.

Quantization-based methods normally employ low-precision weights, varying from 1 bit to 5 bits, to represent CNNs [15, 22, 23, 24, 25, 26, 27, 28]. Some studies train QNNs from scratch by estimating the gradient of the expected loss based on the Straight-Through Estimator [15, 23, 22, 27]. For example, Courbariaux et al. [15] train a classification neural network from scratch with 1-bit weights and activations, which can run 7 times faster than the CNN. Choi et al. [23] propose a neural quantization scheme called Parameter Clipping Activation, which uses a parameter to find the optimal quantization scale for arbitrary bit-width activations. Choi et al. [22] introduce a novel technique called Statistics-Aware Weight Binning, which finds the optimal scaling factor based on the statistical characteristics of the weight distribution to minimize the quantization error. The QNNs trained by the above quantization methods only accelerate inference; Zhou et al. [27] propose DoReFa-Net, which can accelerate both training and inference with low bit-width weights, activations and gradients respectively. However, these estimator-based methods introduce noise compared to the real gradient. Thus these QNNs cannot achieve an ideal classification accuracy, especially on multi-class datasets such as CIFAR-100.

Some other quantization methods are dedicated to designing special strategies to fine-tune QNNs, which do not rely on the back-propagation algorithm and can bypass the problem of gradient vanishing [29, 26, 28]. They can achieve much better accuracy as they are independent of estimators. For example, Park et al. [29] propose a precision highway that provides an end-to-end high-precision information flow for ultra-low-precision computation. This linear weight quantization method is based on the assumption that the weight distribution is the Laplace distribution. Recently, Zhou et al. [26] propose an incremental network quantization method, which converts a pretrained full-precision CNN model into a low-precision model whose weights are constrained to powers of two or zero. Studies have shown that there is little loss in classification accuracy when using 2-5 bit low-precision weights [27, 26]. However, these quantization methods depend on the pretrained network structure rather than the back-propagation algorithm, which makes it difficult to satisfy the network-structure-optimization requirements imposed by hardware limitations.

II-B CNNs Implemented by FPGAs

Considering inference, CNNs have a highly hierarchical structure of multiple feature maps, which exposes a large amount of parallelism and makes CNNs very suitable for FPGA implementation. This structure builds on the accumulation of a huge number of convolutions, which consumes a huge number of floating-point resources on FPGAs. In addition, the structure of CNNs often contains many convolutional layers, so the convolution module with different parameters needs to be executed iteratively during inference. Frequent data caching and parameter loading are limited by the bandwidth. Therefore, in many studies, the hardware structures of CNNs are designed mainly around the two bottlenecks of floating-point resources and bandwidth [12, 17, 16, 13].

In terms of optimizing floating-point resources, Lu et al. [30] design a fast Winograd algorithm, which decreases the use of floating-point resources on FPGAs and reduces the complexity of convolution dramatically. They also give a formula for estimating computational efficiency, which demonstrates that the fast Winograd algorithm is more efficient than the conventional convolution algorithm due to the use of fewer floating-point resources on FPGAs. Meloni et al. [13] present an accelerator configuration for CNNs that reaches more than 97% DSP resource utilization at 150 MHz operating frequency with 16-bit precision, and they show that floating-point resource utilization is highest when executing 3×3 filters on FPGAs.

Other studies have focused on optimizing the data scheduling structure to reduce the impact of the bandwidth. For example, Sankaradas et al. [14] implement a vector processing element (VPE) array coprocessor, which can accelerate the CNNs by optimizing the cache between distributed off-chip memory banks and on-chip computing elements on FPGAs. Peemen et al. [12] show that their scheduler prefers to use only convolutional layers without fully connected layers on FPGAs, which can maximize the efficiency of on-chip memories by reducing the impact of the bandwidth bottleneck.

The crucial issue with the above methods is that they usually only consider a bottleneck at a single level and fail to coordinate these two constraints to improve the computational efficiency of the hardware accelerators. In this paper, we reduce the impact of both constraints by introducing QNNs onto FPGAs, which provides a new way to deal with the above two bottlenecks. Since the weights in our QNNs are quantized to powers of two, the quantized weights directly reduce the bandwidth required to load the weights. In addition, the use of quantized weights translates multiplications into shifts in the convolution module, which greatly reduces the use of floating-point resources.

III n-BQ-NN

III-A Fundamental Idea of Our n-BQ-NN

Figure 1: For any full-precision weight distribution (indicated by the blue curve) trained by a CNN model, non-uniform sampling can be used to approximate the full-precision distribution, which is represented by the red curve.

The main idea of our n-BQ-NN is based on Fig.1, which shows that the information loss caused by the power-of-two quantization method can be interpreted as the sampling loss caused by non-uniform sampling. In fact, the weights of CNNs with large absolute values dominate the overall classification accuracy of the networks, although these large weights only account for a small fraction of all the weights [11, 31]. For an arbitrary probability density function of the weights in a neural network, denoted as ϕ(x), we use the blue curve in Fig.1 to represent ϕ(x), which satisfies \int_{-1}^{1}\phi(x)dx=1. In this way, we can calculate the sampling loss Φ, represented as the area between the two distributions (red and blue curves) in Fig.1. The sampling loss Φ(n) is given by the following recursive formula, where n is the quantized bit width of the weights.

\left\{\begin{array}{ll}\Phi(1)=1-\phi(2^{-1}) & n=1\\ \Phi(n)=\Phi(n-1)+2^{1-n}\phi(2^{1-n})-2^{1-n}\phi(2^{-n}) & n>1\end{array}\right. (1)

It can be seen from the above formula that the sampling loss always decreases as the quantized bit width of the weights increases, which indicates that the sampling loss is negatively related to the quantized bit width. However, the bit width is limited and needs to be reduced as much as possible in QNNs. Thus, finding the best balance between quantized bit width and sampling loss is the key to balancing the performance, speed, and resources of QNNs. We define \mathcal{L}(n)=2^{1-n}[\phi(2^{1-n})-\phi(2^{-n})] as the variation between two consecutive sampling losses, Φ(n) and Φ(n-1), from Eq.1. Then we can prove that \mathcal{L}(4) approaches zero for our power-of-two quantization method, which is ensured by Theorem 1. Therefore, continuing to increase the quantized bit width n beyond 3 does not help to decrease the sampling loss.

Theorem 1

0<\left|\mathcal{L}(4)\right|<7.8\times 10^{-3}

Proof:

We use the Taylor expansion with Peano remainder to represent the probability density function ϕ(x),

\begin{split}&\phi(2^{-n})=\phi(0)-\ln 2\,\phi^{\prime}(0)n2^{-n}+o(n2^{-n})\\ &\phi(2^{1-n})=\phi(0)-\ln 2\,\phi^{\prime}(0)n2^{1-n}+o(n2^{1-n})\end{split} (2)

Substituting Eq.2 into \mathcal{L}(n), we get

\begin{split}\mathcal{L}(n)&=2^{1-n}[\phi(2^{1-n})-\phi(2^{-n})]\\ &=2^{1-n}[\ln 2\,\phi^{\prime}(0)n2^{-n}-\ln 2\,\phi^{\prime}(0)n2^{1-n}-o(n2^{-n})+o(n2^{1-n})]\\ &\approx-\ln 2\,\phi^{\prime}(0)n2^{1-2n}\end{split} (3)

Since 0<\Phi(n)<1, we deduce that 0<\left|\mathcal{L}(2)\right|<1 and 0<\ln 2\,\phi^{\prime}(0)<\frac{1}{4}. Finally, we get 0<\left|\mathcal{L}(4)\right|<\frac{1}{128} by substituting the range of \ln 2\,\phi^{\prime}(0) into \mathcal{L}(4) according to Eq.3. ∎
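To make the recursion of Eq.1 concrete, the short script below evaluates Φ(n) and the increment ℒ(n) for an example weight density; the truncated Gaussian density and its width are our own illustrative choices, not taken from the paper.

import math

SIGMA = 0.5  # width of the illustrative density; an assumption, not from the paper

def phi(x):
    """Symmetric truncated Gaussian density on [-1, 1], normalized to integrate to 1."""
    c = 1.0 / (SIGMA * math.sqrt(2.0 * math.pi))
    mass = math.erf(1.0 / (SIGMA * math.sqrt(2.0)))  # Gaussian mass inside [-1, 1]
    return c * math.exp(-x * x / (2.0 * SIGMA ** 2)) / mass

def sampling_loss(n_max):
    """Evaluate Phi(n) of Eq. 1 and the increment L(n) for n = 1 .. n_max."""
    phi_n = 1.0 - phi(2.0 ** -1)                     # Phi(1)
    print(f"n=1: Phi(n)={phi_n:.5f}")
    for n in range(2, n_max + 1):
        inc = 2.0 ** (1 - n) * (phi(2.0 ** (1 - n)) - phi(2.0 ** -n))  # L(n)
        phi_n += inc
        print(f"n={n}: L(n)={inc:+.5f}, Phi(n)={phi_n:.5f}")

sampling_loss(6)  # for this density the increments shrink rapidly; |L(4)| is already well below 1e-2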

From the hardware perspective, the resource consumption of the SHIFT operation is much less than that of multiplication, so our intention is to use the SHIFT operation instead of multiplication. Considering that a shift-right operation would make the weights exceed the constraint range of (-1, 1), all SHIFT operations are shift-left, and every quantized weight is chosen from the entries (\pm 2^{-0},\pm 2^{-1},\cdots,\pm 2^{-i},0), where \pm 2^{-i} indicates that its multiplication can be calculated by <<i and 0 indicates that no operation is required. Our n-BQ-NN quantizes the weights to these entries, which are encoded in n bits and suitable for hardware computation. Under such circumstances, the staircase function \operatorname{staircase}(W) can be used to describe our n-bit quantized weights as Eq.4 (typically, n is greater than 1, and \operatorname{staircase}(W) degrades to \operatorname{sign}(W) if n is equal to 1), where W are the full-precision weights.

\operatorname{staircase}(W)=\left\{\begin{array}{lc}2^{-i}\operatorname{sign}(W) & \Delta_{i+1}\leq|W|<\Delta_{i}\\ 0 & |W|<\Delta_{r}\end{array}\right. (4)

Here i is taken from r-1 to 0 in turn, where r=2^{n-1}-1 and \operatorname{sign}(W) is the sign function:

\operatorname{sign}(W)=\left\{\begin{array}{lc}+1 & W\geq 0\\ -1 & W<0\end{array}\right. (5)
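The staircase quantizer of Eq.4 can be sketched in a few lines of numpy. The code below is an illustrative implementation that assumes the thresholds are Δ_i = 2^{-i} (the paper leaves their exact values to the posterior-distribution adjustment of Section III-C), so it should be read as a sketch rather than the released implementation.

import numpy as np

def staircase(w, n_bits=3):
    """Quantize full-precision weights in (-1, 1) to {0, ±2^0, ±2^-1, ..., ±2^-(r-1)}."""
    r = 2 ** (n_bits - 1) - 1                      # number of non-zero magnitude levels
    out = np.zeros_like(w)
    for i in range(r - 1, -1, -1):                 # i runs from r-1 down to 0, as in Eq. 4
        delta_i = 2.0 ** -i                        # assumed upper threshold Delta_i
        delta_ip1 = 2.0 ** -(i + 1)                # assumed lower threshold Delta_{i+1}
        mask = (np.abs(w) >= delta_ip1) & (np.abs(w) < delta_i)
        out[mask] = 2.0 ** -i * np.sign(w[mask])
    # weights with |w| < Delta_r stay at zero
    return out

w = np.array([-0.9, -0.4, -0.05, 0.03, 0.2, 0.7])
print(staircase(w, n_bits=3))                      # -> [-1.  -0.5  0.  0.  0.25  1.]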

III-B Gradient Computation in n-BQ-NN

In order to facilitate the following discussion, we first define some variables: W^{l}_{jk} represents the weight that connects the k-th neuron of the (l-1)-th layer to the j-th neuron of the l-th layer, b^{l}_{j} represents the bias of the j-th neuron of the l-th layer, z^{l}_{j} represents the input of the j-th neuron of the l-th layer (z^{l}_{j}=\sum_{k}W^{l}_{jk}a^{l-1}_{k}+b^{l}_{j}), a^{l}_{j} represents the output of the j-th neuron of the l-th layer (a^{l}_{j}=\theta(z^{l}_{j})), and θ is the activation function.

We also have to introduce an extra quantized weight so that we can train our n-BQ-NN, which is defined as follows,

\hat{W}^{l}_{jk}=\operatorname{staircase}(W^{l}_{jk}) (6)

The cost function over a mini-batch of m samples in our n-BQ-NN is

C=\frac{1}{2m}\sum_{x}\left\|y(x)-a^{L}(x)\right\|^{2} (7)

where x is the input sample, y is the actual classification, a^{L} is the prediction output, and L is the maximum number of layers in the network.

By defining \mathcal{T}_{j}^{l}\equiv\frac{\partial C}{\partial z_{j}^{l}} as the error produced by the j-th neuron of the l-th layer, we can use the back-propagation algorithm to calculate the gradient and update the parameters according to the following three steps (⊙ represents the Hadamard product, i.e., the element-wise product between matrices or vectors).

  - Calculate the error of the last layer of the network.

    \mathcal{T}_{j}^{L}=\frac{\partial C}{\partial z_{j}^{L}}=\frac{\partial C}{\partial a_{j}^{L}}\cdot\frac{\partial a_{j}^{L}}{\partial z_{j}^{L}},\qquad \bm{\mathcal{T}}^{L}=\frac{\partial C}{\partial\textbf{a}^{L}}\odot\frac{\partial\textbf{a}^{L}}{\partial\textbf{z}^{L}}=\nabla_{\textbf{a}}C\odot\theta^{\prime}\left(\textbf{z}^{L}\right) (8)

  - Calculate the error of each layer of the network from back to front.

    \begin{split}\mathcal{T}_{j}^{l}=\frac{\partial C}{\partial z_{j}^{l}}&=\sum_{k}\frac{\partial C}{\partial z_{k}^{l+1}}\cdot\frac{\partial z_{k}^{l+1}}{\partial a_{j}^{l}}\cdot\frac{\partial a_{j}^{l}}{\partial z_{j}^{l}}\\ &=\sum_{k}\mathcal{T}_{k}^{l+1}\cdot\frac{\partial\left(\hat{W}_{kj}^{l+1}a_{j}^{l}+b_{k}^{l+1}\right)}{\partial a_{j}^{l}}\cdot\theta^{\prime}\left(z_{j}^{l}\right)\\ &=\sum_{k}\mathcal{T}_{k}^{l+1}\cdot\hat{W}_{kj}^{l+1}\cdot\theta^{\prime}\left(z_{j}^{l}\right),\qquad \bm{\mathcal{T}}^{l}=\left(\left(\hat{\textbf{W}}^{l+1}\right)^{T}\bm{\mathcal{T}}^{l+1}\right)\odot\theta^{\prime}\left(\textbf{z}^{l}\right)\end{split} (9)

  - Calculate the gradients of the weight and bias respectively.

    g^{b}=\frac{\partial C}{\partial b_{j}^{l}}=\frac{\partial C}{\partial z_{j}^{l}}\cdot\frac{\partial z_{j}^{l}}{\partial b_{j}^{l}}=\mathcal{T}_{j}^{l}\cdot\frac{\partial\left(\hat{W}^{l}_{jk}a_{k}^{l-1}+b_{j}^{l}\right)}{\partial b_{j}^{l}}=\mathcal{T}_{j}^{l} (10)

    \begin{split}g^{W}=\frac{\partial C}{\partial W_{jk}^{l}}&=\frac{\partial C}{\partial z_{j}^{l}}\cdot\frac{\partial z_{j}^{l}}{\partial W_{jk}^{l}}=\mathcal{T}_{j}^{l}\cdot\frac{\partial\left(\hat{W}_{jk}^{l}a_{k}^{l-1}+b_{j}^{l}\right)}{\partial W_{jk}^{l}}\\ &=\mathcal{T}_{j}^{l}\cdot\frac{\partial\left(\hat{W}_{jk}^{l}a_{k}^{l-1}+b_{j}^{l}\right)}{\partial\hat{W}_{jk}^{l}}\cdot\frac{\partial\hat{W}_{jk}^{l}}{\partial W_{jk}^{l}}=\mathcal{T}_{j}^{l}\cdot a^{l-1}_{k}\cdot\frac{\partial\hat{W}_{jk}^{l}}{\partial W_{jk}^{l}}\end{split} (11)
Figure 2: The red dotted line represents the staircase function, and the solid blue line represents our reconstructed function.

In the above derivation of the entire back-propagation, all steps are well-defined except the weight gradient in the last step. Based on Eq.11, the gradient of the weight is calculated as follows,

g^{W}=\mathcal{T}_{j}^{l}\cdot a^{l-1}_{k}\cdot\frac{\partial\hat{W}_{jk}^{l}}{\partial W_{jk}^{l}}=0 (12)

where \frac{\partial\hat{W}_{jk}^{l}}{\partial W_{jk}^{l}} is exactly \operatorname{staircase}^{\prime}(W_{jk}^{l}), the derivative of \operatorname{staircase}(W_{jk}^{l}). This derivative satisfies the conditions of the Dirac delta function δ(x). According to the properties of δ(x), \frac{\partial\hat{W}_{jk}^{l}}{\partial W_{jk}^{l}} can be calculated as follows,

\frac{\partial\hat{W}_{jk}^{l}}{\partial W_{jk}^{l}}=\delta(W_{jk}^{l})=0 (13)

Substituting \frac{\partial\hat{W}_{jk}^{l}}{\partial W_{jk}^{l}}=0 into Eq.12, we find that the model cannot be trained by the back-propagation algorithm due to gradient vanishing.

To resolve the above problem, we reconstruct the quantized weight function as Eq.14 to ensure that the weights can be updated by the back-propagation algorithm, as shown by the solid blue line in Fig.2, where α is an adjustable parameter in the range (0, 1).

\tilde{W}^{l}_{jk}=(1-\alpha)\hat{W}^{l}_{jk}+\alpha W^{l}_{jk} (14)

By substituting Eq.14 into Eq.11, we can recalculate the gradient of the weight, with \frac{\partial\tilde{W}_{jk}^{l}}{\partial W_{jk}^{l}}=\alpha+(1-\alpha)\delta(W^{l}_{jk})=\alpha,

g^{W}=\mathcal{T}_{j}^{l}\cdot a^{l-1}_{k}\cdot\frac{\partial\tilde{W}_{jk}^{l}}{\partial W_{jk}^{l}}=\alpha\mathcal{T}_{j}^{l}\cdot a^{l-1}_{k} (15)
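Using the staircase sketch from Section III-A, the reconstructed weight of Eq.14 and the resulting gradient of Eq.15 amount to only a few lines; the layer shapes, learning rate and SGD step below are illustrative assumptions.

import numpy as np

def forward_weights(w, alpha=0.1, n_bits=3):
    """W_tilde = (1 - alpha) * staircase(W) + alpha * W (Eq. 14), used in the forward pass."""
    return (1.0 - alpha) * staircase(w, n_bits) + alpha * w

def weight_gradient(err, a_prev, alpha=0.1):
    """g_W = alpha * (T^l outer a^{l-1}) (Eq. 15): the usual gradient scaled by alpha."""
    return alpha * np.outer(err, a_prev)

# toy layer with 2 outputs and 3 inputs
w = np.random.uniform(-1, 1, size=(2, 3))
a_prev = np.array([0.3, -0.1, 0.8])
z = forward_weights(w) @ a_prev                 # forward pass uses the reconstructed weights
err = np.array([0.05, -0.02])                   # error T^l arriving from back-propagation
w -= 0.1 * weight_gradient(err, a_prev)         # SGD step applied to the full-precision weights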

At this point, we have reconstructed the quantized weight function as Eq.14 to solve the gradient vanishing problem, but the weights can no longer be quantized to the entries (\pm 2^{-0},\pm 2^{-1},\cdots,0) directly as in Eq.6. However, we can prove that the reconstructed quantized weight function approximates these entries after several iterations, which is ensured by Theorem 2.

Assumption. Since the algorithm is iterated, our problem needs to be discussed within the framework of a series. We regard W^{l}_{jk} as an iterate x_{n}, \tilde{W}^{l}_{jk} is equivalent to x_{n+1}, and the value of a_{j} is chosen from \hat{W}^{l}_{jk}=\operatorname{staircase}(W^{l}_{jk})\in(\pm 2^{-0},\pm 2^{-1},\cdots,\pm 2^{-i},0).

Theorem 2

In the framework of the series, \tilde{W}^{l}_{jk} approaches \hat{W}^{l}_{jk} when the number of iterations n is sufficiently large.

Proof:

Based on Eq.14, the general terms of the series x from 1 to n are written as follows,

\left\{\begin{array}{cc}x_{2}-\alpha x_{1}=(1-\alpha)a_{j} & (1)\\ \vdots & \vdots\\ x_{n}-\alpha x_{n-1}=(1-\alpha)a_{j} & (n-1)\\ x_{n+1}-\alpha x_{n}=(1-\alpha)a_{j} & (n)\end{array}\right. (16)

Computing \alpha^{(n-1)}\times(1)+\cdots+\alpha\times(n-1)+(n), we get the following equation,

x_{n+1}-\alpha^{n}x_{1}=(1-\alpha)a_{j}(1+\alpha+\cdots+\alpha^{n-1})=a_{j}(1-\alpha^{n}) (17)
Figure 3: Overview of our trained quantization procedure.

As the number of iterations increases, x_{n+1} approaches a_{j}. With the guarantee of Theorem 2, the above equation can be rewritten as \tilde{W}^{l}_{jk}=\operatorname{staircase}(W^{l}_{jk}) (namely, x_{n+1}=a_{j}) when the number of iterations is large enough (n\to\infty) and α is in the range (0, 1). In the actual implementation, it is only necessary to iterate for several steps during the training process, after which the network is completely quantized as in Eq.6. ∎

The design of α in our reconstructed quantized weight function takes three aspects into consideration. First, the designed function must satisfy Theorem 2. Second, the ratio 1-α : α between quantized weights and full-precision weights can be used to adjust the information ratio between quantized and full-precision weights in the training process. Third, α is the slope of our reconstructed function, shown as the solid blue line in Fig.2, which can be used to change the gradient descent rate of back-propagation based on Eq.15 during training.
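The geometric convergence claimed by Theorem 2 is easy to check numerically; the script below iterates x_{n+1} = (1-α)a_j + αx_n for an arbitrary starting point and target level (both values are made up for illustration).

alpha, a_j, x = 0.3, 0.25, 0.9          # illustrative choices of alpha, target level, start point
for step in range(1, 21):
    x = (1.0 - alpha) * a_j + alpha * x  # one application of Eq. 14 with the level held fixed
    if step % 5 == 0:
        print(f"step {step:2d}: x = {x:.6f}")   # x approaches a_j = 0.25 geometrically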

III-C Posterior-Distribution Adjustment

For network initialization, the initialization modes MSRA and Xavier [32], which adjust the variance based on the number of inputs, converge more easily than the traditional Gaussian initialization with fixed variance in DNNs. Inspired by this fact, we suspect that adjusting the distribution of quantized weights may make it easier to train our n-BQ-NN. Here, we consider that full-precision networks converge more easily than quantized networks; thus, we prefer to keep the distribution of the quantized weights consistent with that of the full-precision weights. Comparing the probability density function before quantization ϕ(x) (with expectation E(x) and variance Var(x)) and the quantized weight function \operatorname{staircase}(x) (as in Eq.4), we make their expectations and variances equal so that their distributions are consistent, as follows,

\left\{\begin{array}{l}E(x)=\int_{-1}^{1}x\operatorname{staircase}(x)dx\\ Var(x)=\int_{-1}^{1}(x-E(x))^{2}\operatorname{staircase}(x)dx\end{array}\right. (18)

The original full-precision probability density function ϕ(x) and the values of the quantized weight function \operatorname{staircase}(x) are fixed, so we can only adjust the value range (thresholds) of \operatorname{staircase}(x) to meet Eq.18.
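One possible way to realize this adjustment is sketched below: the quantization levels stay fixed at powers of two, and a single threshold-scaling parameter t (so that Δ_i = t·2^{-i}) is searched so that the mean and variance of the quantized weights match those of the full-precision weights. The one-parameter threshold family and the grid search are our assumptions, not the paper's exact procedure.

import numpy as np

def staircase_t(w, t, n_bits=3):
    """Staircase quantizer with thresholds Delta_i = t * 2^{-i}; t = 1 matches our Eq. 4 sketch."""
    r = 2 ** (n_bits - 1) - 1
    out = np.zeros_like(w)
    for i in range(r - 1, -1, -1):
        lo, hi = t * 2.0 ** -(i + 1), t * 2.0 ** -i
        mask = (np.abs(w) >= lo) & (np.abs(w) < hi)
        out[mask] = 2.0 ** -i * np.sign(w[mask])
    out[np.abs(w) >= t] = np.sign(w[np.abs(w) >= t])   # catch-all for the largest bin
    return out

def moment_mismatch(w, t):
    q = staircase_t(w, t)
    return (np.mean(q) - np.mean(w)) ** 2 + (np.var(q) - np.var(w)) ** 2

w = np.random.normal(0.0, 0.2, size=10_000).clip(-0.99, 0.99)   # toy full-precision weights
best_t = min(np.linspace(0.5, 2.0, 151), key=lambda t: moment_mismatch(w, t))
print(best_t, np.var(w), np.var(staircase_t(w, best_t)))        # variances now roughly agree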

III-D The Training Algorithm for n-BQ-NN

In the actual training algorithm for n-BQ-NN, Batch Normalization (BN [31]) is added to our n-BQ-NN because it reduces the overall impact of the weight scale and accelerates training. Thus, in this subsection we derive the back-propagation algorithm for n-BQ-NN with BN and give the training algorithm.

First, we define four variables of BN, where σ represents the variance of all samples in a batch, μ represents the sample mean, and γ, β are the scale and shift coefficients. Due to the existence of BN, the bias term can be ignored, so the input of the neuron is re-expressed as z^{l}_{j}=\sum_{k}W^{l}_{jk}a^{l-1}_{k}, the normalized input of the neuron is \hat{z}_{j}=\gamma\frac{z_{j}-\mu}{\sigma}+\beta, and the output of the neuron is a_{j}=\theta(\hat{z}_{j}). Then, based on the discussion in Section III-B, we calculate the error and the gradient according to the following three steps.

  - Compute the mean and variance of the samples and calculate their gradients.

    \begin{split}\mu=\frac{1}{m}\sum_{j=1}^{m}z_{j},\qquad\sigma^{2}=\frac{1}{m}\sum_{j=1}^{m}(z_{j}-\mu)^{2}\end{split} (19)

    \frac{\partial C}{\partial\sigma^{2}}=\sum_{k}\frac{\partial C}{\partial a_{k}}\frac{\partial a_{k}}{\partial\hat{z}_{k}}\frac{\partial\hat{z}_{k}}{\partial\sigma^{2}}=-\frac{1}{2}\gamma\sigma^{-3}\sum_{k}\frac{\partial C}{\partial a_{k}}\theta^{\prime}\left(z_{k}\right)\left(z_{k}-\mu\right) (20)

    \frac{\partial C}{\partial\mu}=\sum_{k}\frac{\partial C}{\partial a_{k}}\frac{\partial a_{k}}{\partial\hat{z}_{k}}\frac{\partial\hat{z}_{k}}{\partial\mu}+\frac{\partial C}{\partial\sigma^{2}}\frac{\partial\sigma^{2}}{\partial\mu}=-\frac{\gamma}{\sigma}\sum_{k}\frac{\partial C}{\partial a_{k}}\theta^{\prime}\left(z_{k}\right)-\frac{2}{m}\frac{\partial C}{\partial\sigma^{2}}\sum_{k}\left(z_{k}-\mu\right) (21)

  - Calculate the error of the network.

    \begin{split}\mathcal{T}_{j}=\frac{\partial C}{\partial z_{j}}&=\frac{\partial C}{\partial a_{j}}\frac{\partial a_{j}}{\partial\hat{z}_{j}}\frac{\partial\hat{z}_{j}}{\partial z_{j}}+\frac{\partial C}{\partial\sigma^{2}}\frac{\partial\sigma^{2}}{\partial z_{j}}+\frac{\partial C}{\partial\mu}\frac{\partial\mu}{\partial z_{j}}\\ &=\frac{\gamma}{\sigma}\frac{\partial C}{\partial a_{j}}\theta^{\prime}\left(\hat{z}_{j}\right)+\frac{2}{m}\sum_{k}\left(z_{k}-\mu\right)\frac{\partial C}{\partial\sigma^{2}}+\frac{1}{m}\frac{\partial C}{\partial\mu}\end{split} (22)

  - Calculate the gradients of the weight, γ, and β respectively.

    g^{W}=\mathcal{T}_{j}\cdot a_{k}\cdot\frac{\partial\tilde{W}_{jk}}{\partial W_{jk}^{l}}=\alpha\mathcal{T}_{j}\cdot a_{k} (23)

    g^{\gamma}=\sum_{k}\frac{\partial C}{\partial a_{k}}\frac{\partial a_{k}}{\partial\hat{z}_{k}}\frac{\partial\hat{z}_{k}}{\partial\gamma}=\sum_{k}\frac{\partial C}{\partial a_{k}}\theta^{\prime}\left(\hat{z}_{k}\right)\frac{z_{k}-\mu}{\sigma} (24)

    g^{\beta}=\sum_{k}\frac{\partial C}{\partial a_{k}}\frac{\partial a_{k}}{\partial\hat{z}_{k}}\frac{\partial\hat{z}_{k}}{\partial\beta}=\sum_{k}\frac{\partial C}{\partial a_{k}}\theta^{\prime}\left(\hat{z}_{k}\right) (25)

With the foundation of the above formulas, we can state our training algorithm for n-BQ-NN, as given in Algorithm 1. This algorithm covers two learning modes: training from scratch and fine-tuning on a pretrained model, where the first mode means the weights are randomly initialized and the second mode means the weights are initialized from a pretrained full-precision network model. The overall quantization process is illustrated in Fig.3. The code for the training algorithm is available at https://github.com/papcjy/n-BQ-NN.

Require: a minibatch of outputs and targets (a^{L}, y), learning rate η, previous weights W^{k}, previous BN parameters (γ^{k}, β^{k}), and a constant α.
Ensure: the updated weights (W^{k})^{*} and updated BN parameters ((γ^{k})^{*}, (β^{k})^{*})
{1. Computing the parameter gradients:}
{1.1 Forward propagation:}
for k = 1 to L do
       \tilde{W}^{k}\leftarrow(1-\alpha)\operatorname{staircase}(W^{k})+\alpha W^{k}
      z^{k}\leftarrow a^{k-1}\tilde{W}^{k}
      \hat{z}^{k}\leftarrow\operatorname{BatchNorm}(z^{k},\gamma^{k},\beta^{k})
      a^{k}\leftarrow\theta(\hat{z}^{k})
end for
{1.2 Backward propagation:}
Compute g^{a^{L}}=\frac{\partial C}{\partial a^{L}} based on a^{L} and y.
for k = L to 1 do
       (g^{\gamma^{k}},g^{\beta^{k}})\leftarrow\operatorname{BackBatchNorm}(g^{a^{k}},z^{k},\gamma^{k},\beta^{k})
      \mathcal{T}^{k}\leftarrow g^{a^{k}}\theta^{\prime}(\hat{z}^{k})
      g^{a^{k-1}}\leftarrow\mathcal{T}^{k}\tilde{W}^{k}
      g^{W^{k}}\leftarrow\alpha{(\mathcal{T}^{k})}^{T}a^{k-1}
end for
{2. Updating the parameters:}
for k = 1 to L do
       ((\gamma^{k})^{*},(\beta^{k})^{*})\leftarrow\operatorname{Update}(\gamma^{k},\beta^{k},\eta,g^{\gamma^{k}},g^{\beta^{k}})
      (W^{k})^{*}\leftarrow\operatorname{Update}(W^{k},g^{W^{k}},\eta)
end for
Algorithm 1: Training algorithm for n-BQ-NN with BN. C is the cost function for the mini-batch, θ is the activation function, and L is the number of layers. The function staircase(·) specifies how to quantize the weights. BatchNorm() specifies how to batch-normalize the inputs. BackBatchNorm() specifies how to back-propagate through the BN. Update() specifies how to update the parameters when their gradients are known, using either SGD or ADAM.
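To connect Algorithm 1 with concrete code, the sketch below performs one forward/backward pass for a single fully-connected layer using the staircase helper from Section III-A; batch normalization is omitted for brevity, the activation is a ReLU, and the shapes, learning rate and plain SGD update are illustrative assumptions rather than the released implementation.

import numpy as np

def theta(x):          return np.maximum(x, 0.0)           # ReLU activation
def theta_prime(x):    return (x > 0.0).astype(x.dtype)

def train_step(W, a_prev, grad_a, alpha=0.1, lr=0.01, n_bits=3):
    # 1.1 forward propagation with the reconstructed weights of Eq. 14
    W_tilde = (1.0 - alpha) * staircase(W, n_bits) + alpha * W
    z = a_prev @ W_tilde                       # (batch, out); BN is skipped in this sketch
    a = theta(z)
    # 1.2 backward propagation (grad_a stands for dC/da of this layer)
    T = grad_a * theta_prime(z)                # error of this layer
    grad_a_prev = T @ W_tilde.T                # error propagated to the previous layer
    g_W = alpha * (a_prev.T @ T)               # Eq. 15, summed over the batch
    # 2. parameter update (plain SGD, averaged over the batch)
    W_new = W - lr * g_W / a_prev.shape[0]
    return W_new, a, grad_a_prev

W = np.random.uniform(-1, 1, size=(8, 4))
a_prev = np.random.randn(32, 8)
grad_a = np.random.randn(32, 4)
W, a, g_prev = train_step(W, a_prev, grad_a)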

III-E Activation Quantization in n-BQ-NN

The above discussion is all about the quantization of weights. Considering the integrity of our n-BQ-NN and the needs of the subsequent ablation experiments, we discuss the quantization of activations in this subsection. Returning to Section III-B, in the case of quantized activations the output of the j-th neuron of the l-th layer can be rewritten as,

\hat{a}^{l}_{j}=\operatorname{staircase}(z^{l}_{j}) (26)

where \operatorname{staircase}(\cdot) serves as the activation function.

At this point, we encounter the same problem: the error of the network \mathcal{T}_{j}^{l} becomes zero due to the term \frac{\partial\hat{a}_{j}^{l}}{\partial z_{j}^{l}} when Eq.26 is substituted into Eq.8 and Eq.9.

By considering the expectation of \frac{\partial C}{\partial z_{j}^{L}}, the error of the network reappears, which is guaranteed by Theorem 3.

Theorem 3

Let us define C=C(\hat{a}_{j},\epsilon_{j}), where \hat{a}_{j} follows Eq.26 and is chosen from (\pm 2^{-0},\pm 2^{-1},\cdots,0). Then we get the expression

\mathbb{E}_{\epsilon_{j}}\left[\frac{\partial C}{\partial z_{j}}\right]=\lambda\frac{\partial C}{\partial\hat{a}_{j}},\quad\text{if }|z_{j}|\leq 1, (27)

where ϵ_j is the noise source that influences z_j, \mathbb{E}_{\epsilon_{j}}[\cdot] denotes the expectation over ϵ_j, and λ is a constant.

Proof:

\begin{split}\mathbb{E}_{\epsilon_{j}}\left[\frac{\partial}{\partial z_{j}}C\right]&=\frac{\partial}{\partial z_{j}}\mathbb{E}_{\epsilon_{j}}[C]\\ &=\frac{\partial}{\partial z_{j}}\Big[\sum_{i}C(\hat{a}_{j}=+2^{i})P(z_{j}>\epsilon_{j}|z_{j})+\sum_{i}C(\hat{a}_{j}=-2^{i})(1-P(z_{j}>\epsilon_{j}|z_{j}))\Big]\\ &=\frac{\partial P(z_{j}>\epsilon_{j}|z_{j})}{\partial z_{j}}\Big[\sum_{i}C(\hat{a}_{j}=+2^{i})-\sum_{i}C(\hat{a}_{j}=-2^{i})\Big]\end{split} (28)

For C(\hat{a}_{j}=\pm 2^{i}), we can approximate it using the Taylor expansion.

\begin{split}&C(\hat{a}_{j}=+2^{i})=C(\hat{a}_{j}=0)+\frac{\partial C}{\partial\hat{a}_{j}}\bigg{|}_{\hat{a}_{j}=0}2^{i}+\frac{\partial^{2}C}{\partial\hat{a}_{j}^{2}}\bigg{|}_{\hat{a}_{j}=0}2^{2i}+O\left(\frac{\partial^{3}C}{\partial\hat{a}_{j}^{3}}\bigg{|}_{\hat{a}_{j}=0}2^{3i}\right)\\ &C(\hat{a}_{j}=-2^{i})=C(\hat{a}_{j}=0)-\frac{\partial C}{\partial\hat{a}_{j}}\bigg{|}_{\hat{a}_{j}=0}2^{i}+\frac{\partial^{2}C}{\partial\hat{a}_{j}^{2}}\bigg{|}_{\hat{a}_{j}=0}2^{2i}+O\left(\frac{\partial^{3}C}{\partial\hat{a}_{j}^{3}}\bigg{|}_{\hat{a}_{j}=0}2^{3i}\right)\end{split} (29)

For \frac{\partial P(z_{j}>\epsilon_{j}|z_{j})}{\partial z_{j}}, we split it into two parts.

\begin{split}\frac{\partial P(z_{j}>\epsilon_{j}|z_{j})}{\partial z_{j}}&=\underset{|z_{j}|>1}{\frac{\partial P(z_{j}>\epsilon_{j}|z_{j})}{\partial z_{j}}}+\underset{|z_{j}|\leq 1}{\frac{\partial P(z_{j}>\epsilon_{j}|z_{j})}{\partial z_{j}}}\\ &=\frac{\partial\int_{-1}^{1}\frac{1}{2}\,d\epsilon_{j}}{\partial z_{j}}+\frac{\partial\int_{-z_{j}}^{z_{j}}\frac{1}{2}\,d\epsilon_{j}}{\partial z_{j}}=\mathbf{1}_{|z_{j}|\leq 1}\end{split} (30)

Combining Eq.29 and Eq.30, Eq.28 can be rewritten as

\mathbb{E}_{\epsilon_{j}}\left[\frac{\partial C}{\partial z_{j}}\right]=\mathbf{1}_{|z_{j}|\leq 1}\left(2\sum_{i}2^{2i}\frac{\partial C}{\partial\hat{a}_{j}}\bigg{|}_{\hat{a}_{j}=0}\right) (31)

Letting 2\sum_{i}2^{2i}=\lambda, we obtain

\mathbb{E}_{\epsilon_{j}}\left[\frac{\partial C}{\partial z_{j}}\right]=\lambda\frac{\partial C}{\partial\hat{a}_{j}}\mathbf{1}_{|z_{j}|\leq 1} (32)

Under Theorem 3, we can re-express the error of the network and quantize the activations in our n-BQ-NN by rewriting Eq.8 and Eq.9 as follows,

\mathcal{T}_{j}^{L}=\frac{\partial C}{\partial z_{j}^{L}}=\lambda\frac{\partial C}{\partial\hat{a}_{j}^{L}}\mathbf{1}_{|z_{j}|\leq 1} (33)

\begin{split}\mathcal{T}_{j}^{l}=\frac{\partial C}{\partial z_{j}^{l}}=\lambda\frac{\partial C}{\partial\hat{a}_{j}^{l}}\mathbf{1}_{|z_{j}|\leq 1}&=\lambda\sum_{k}\frac{\partial C}{\partial z_{k}^{l+1}}\cdot\frac{\partial z_{k}^{l+1}}{\partial\hat{a}_{j}^{l}}\\ &=\lambda\sum_{k}\mathcal{T}_{k}^{l+1}\cdot\frac{\partial\left(\hat{W}_{kj}^{l+1}\hat{a}_{j}^{l}+b_{k}^{l+1}\right)}{\partial\hat{a}_{j}^{l}}\\ &=\lambda\sum_{k}\mathcal{T}_{k}^{l+1}\cdot\hat{W}_{kj}^{l+1}\mathbf{1}_{|z_{j}|\leq 1}\end{split} (34)
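A minimal sketch of this activation quantization in code: the forward pass applies the staircase function of Eq.26, and the backward pass follows Eqs.33-34 by multiplying the incoming gradient by λ·1_{|z|≤1}. Here λ is treated as a tunable constant and staircase is the helper sketched in Section III-A; both choices are ours.

import numpy as np

def act_forward(z, n_bits=3):
    """Quantized activation a_hat = staircase(z) of Eq. 26."""
    return staircase(z, n_bits)

def act_backward(grad_a_hat, z, lam=1.0):
    """Expected-gradient rule of Eqs. 33-34: pass the gradient through, scaled by lam,
    only where |z| <= 1; lam stands for the constant lambda of Theorem 3."""
    return lam * grad_a_hat * (np.abs(z) <= 1.0)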

IV Experiment

TABLE I: The outline of the proposed n-BQ-NN network architecture. Taking the CIFAR datasets as an example, the initial input size of the network is 32×32×3. The quantized convolution (conv quantized) contains three calculation steps: \hat{W}=\operatorname{staircase}(W), \operatorname{net}=\operatorname{conv2d}(\hat{W},x), and \operatorname{net}=\operatorname{BatchNorm}(\operatorname{net}), where the weights involved in the convolution calculation are quantized weights chosen from the entries (\pm 2^{-0},\pm 2^{-1},\cdots,0). In the convolution calculation, the multiplications are replaced by SHIFT operations during inference, because the weights are powers of 2.
type patch size/stride output size
conv quantized 3×3/1 32×32×128
conv quantized 3×3/1 32×32×128
conv quantized 3×3/1 32×32×128
pool 2×2/2 16×16×128
conv quantized 3×3/1 16×16×256
conv quantized 3×3/1 16×16×256
conv quantized 3×3/1 16×16×256
pool 2×2/2 8×8×256
conv quantized 3×3/1 8×8×512
conv quantized 1×1/1 8×8×1024
conv quantized 1×1/1 8×8×10 (100)
pool 8×8 1×1×10 (100)
softmax classifier 1×1×10 (100)

In our experiments, we use three network structures: ResNet, DenseNet and AlexNet. The network structure of our n-BQ-NN (where n can take the values 1, 2, 3, 4, 5) is similar to the architecture of All-CNN [33], which consists solely of convolution layers and Network in Network blocks [34]. Table I details the parameter settings and our network architecture, and a compact layer-specification view of the same architecture is sketched after this paragraph. In the following experiments, our training algorithm is used to train the model from scratch or fine-tune the full-precision model on five benchmark datasets: MNIST, SVHN, CIFAR-10, CIFAR-100 and ImageNet. We organize our experiments along 4 dimensions: classification accuracy compared with low-precision QNNs, quantization errors of our training method, compression ratio on different datasets, and convergence speed compared with BNN.
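For reference, the Table I architecture can be written down as a plain layer-specification list; the tuple format (type, kernel size, stride, output channels) is our own convention for illustration, not the paper's code.

# "conv_q" denotes a quantized convolution whose weights go through staircase()
# before conv2d and BatchNorm, exactly as described in the Table I caption.
N_BQ_NN_CIFAR = [
    ("conv_q", 3, 1, 128), ("conv_q", 3, 1, 128), ("conv_q", 3, 1, 128),
    ("pool",   2, 2, 128),
    ("conv_q", 3, 1, 256), ("conv_q", 3, 1, 256), ("conv_q", 3, 1, 256),
    ("pool",   2, 2, 256),
    ("conv_q", 3, 1, 512), ("conv_q", 1, 1, 1024), ("conv_q", 1, 1, 10),  # 100 for CIFAR-100
    ("pool",   8, 8, 10),          # global average pooling down to 1x1x10
    ("softmax", None, None, 10),
]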

IV-A MNIST

The MNIST dataset [35] consists of handwritten digit images with 32×32 pixels, organized into 10 classes (0 to 9). The training and test sets contain 60,000 and 10,000 images respectively. We use this dataset without data augmentation [36].

IV-B CIFAR

The two CIFAR datasets [37] consist of natural color images with 32×32 pixels, with 50,000 training and 10,000 test images each, and we hold out 5,000 training images as a validation set. CIFAR-10 (C10) consists of images organized into 10 classes and CIFAR-100 (C100) into 100 classes. We adopt a standard data augmentation scheme (random corner cropping and random flipping) that is widely used for these two datasets [38, 39, 40, 34, 41, 36, 33, 38]. We normalize the images using the channel means and standard deviations in preprocessing.

IV-C SVHN

The SVHN dataset [42] consists of 32×32 color images of house numbers collected by Google Street View, organized into 10 classes (0 to 9). There are 73,257 images in the training set, 531,131 additional training images, and 26,032 images in the test set. We divide the pixel values by 255.0 so that they lie in the [0,1] range, as in [43]. Following common practice, we do not preprocess the dataset and use no data augmentation [44, 39, 34, 36, 45].

IV-D Experiment Results

IV-D1 n-bit

TABLE II: Our n-BQ-ResNet generates extremely low-precision models with very similar accuracy compared with the full-precision ResNet-110 model on CIFAR-10.
Model Bit-width Test error
ResNet-110 ref 16 6.61%
n-BQ-ResNet 5 7.04%
n-BQ-ResNet 4 7.07%
n-BQ-ResNet 3 7.15%
n-BQ-ResNet 2 8.76%
n-BQ-ResNet 1 10.52%

As in the theoretical analysis of Section III, different quantized bit widths bring different sampling losses, and a larger bit width means a smaller sampling loss. Thus, in this experiment, we evaluate the test error rates of our n-BQ-ResNet, which is fine-tuned from the full-precision ResNet-110, when n takes different values on CIFAR-10. The experimental results in Table II are consistent with Eq.1. Therefore, 3-bit is the better choice, because \mathcal{L}(4) is close to 0 (Eq.3) when considering both the sampling loss and the conciseness of the weight representation. This result is also experimentally confirmed by the work in [26]. Thus, our n-BQ-NN is instantiated as T-BQ-NN with n=3 in the subsequent experiments.

IV-D2 Accuracy and Capacity

TABLE III: Error rates on the CIFAR-10 and CIFAR-100 datasets with standard data augmentation (translation and mirroring), and on the MNIST and SVHN datasets without data augmentation. The overall best results are in bold. "*" indicates results run by our implementation; the remaining results are cited directly from the paper listed at the front of the row.
Test error
Method Depth Params CIFAR-10 CIFAR-100 SVHN MNIST
Network in Network [34] 9 1.9M 8.81% 35.68% 2.35% 0.53%
All-CNN [33] 9 1.4M 7.25% 33.71% 3.17% 0.63%
Highway Network [38] 19 2.3M 7.72% 32.39% 2.61%* 0.67%
BNN [15] 9 1.7M 11.40% 42.13%* 2.80% 0.96%
Round Quantization 9 1.2M 85.88% 98.90% 83.72% 80.55%
T-BQ-NN 9 1.2M 7.59% 28.90% 2.29% 0.50%
TABLE IV: Fine-tuning ResNet and DenseNet with our training algorithm on CIFAR-10(100) and SVHN. The results on C10 and C100 are with data augmentation and the results on SVHN are without data augmentation.
Network Depth Bit-width Params Test error(%)
CIFAR-10 ResNet 110 16 1.7M 6.61
T-BQ-ResNet 110 3 0.3M 7.15 (+0.54)
DenseNet 100 16 0.8M 4.51
T-BQ-DenseNet 100 3 0.15M 5.31 (+0.80)
CIFAR-100 ResNet 110 16 1.7M 35.87
T-BQ-ResNet 110 3 0.3M 37.56 (+1.69)
DenseNet 100 16 0.8M 22.27
T-BQ-DenseNet 100 3 0.15M 24.10 (+1.83)
SVHN ResNet 110 16 1.7M 3.13
T-BQ-ResNet 110 3 0.3M 3.25 (+0.12)
DenseNet 100 16 0.8M 1.76
T-BQ-DenseNet 100 3 0.15M 2.10 (+0.34)
TABLE V: Deep compression method for T-BQ-ResNet and T-BQ-DenseNet. P: Pruning, Q: Quantization, H: Huffman coding.
Method Encoding bit-width Compression ratio
T-BQ-ResNet on C10 (P+Q) 3 49×
T-BQ-ResNet on C10 (P+Q+H) 2.6 57×
T-BQ-ResNet on C100 (P+Q) 3 25×
T-BQ-ResNet on C100 (P+Q+H) 2.8 27×
T-BQ-ResNet on SVHN (P+Q) 3 24×
T-BQ-ResNet on SVHN (P+Q+H) 2.8 26×
T-BQ-DenseNet on C10 (P+Q) 3 38×
T-BQ-DenseNet on C10 (P+Q+H) 2.5 46×
T-BQ-DenseNet on C100 (P+Q) 3 15×
T-BQ-DenseNet on C100 (P+Q+H) 2.8 16×
T-BQ-DenseNet on SVHN (P+Q) 3 133×
T-BQ-DenseNet on SVHN (P+Q+H) 2.1 190×
TABLE VI: Comparison of classification accuracy on the ImageNet test set with different bit widths of weights and activations. Single-crop top-1 and top-5 accuracy are given based on AlexNet. Note that the gray results are implemented by our n-BQ-NN, where the training method for 1-bit activations is introduced in Section III-E; the other results are reported by [46]. We quantize the same layers of AlexNet to low precision, as BNN [15], BC [47], TWN [48] and DoReFa-Net [27] do.
n-bit Weights n-bit Activations Inference Operation AlexNet Top-1 Accuracy AlexNet Top-5 Accuracy
1 1 XNOR 0.279 (BNN) 0.504 (BNN)
1 1 XNOR 0.348 0.601
1 32 (float) XNOR ADDER 0.368 (BC) 0.620 (BC)
1 16 (float) XNOR ADDER 0.486 0.734
2 32 (float) XNOR ADDER 0.529 (TWN) 0.766 (TWN)
2 16 (float) XNOR ADDER 0.536 0.777
3 16 (float) SHIFT ADDER 0.560 0.795
8 (float) 8 (float) MAC 0.530 (DoReFa-Net) 0.768 (DoReFa-Net)
32 (float) 32 (float) MAC 0.566 0.802

As hardware devices require a relatively simple architecture and a small number of layers, we have selected networks suited for hardware implementation for our comparative experiments. For example, BNN, with binary weights and activations, replaces most multiplications with 1-bit XNOR operations. Network in Network utilizes global average pooling over feature maps in the classification layer, which is less prone to overfitting than fully connected layers. All-CNN achieves a new architecture that consists solely of convolution layers, replacing max-pooling with a convolution layer without loss in accuracy on several benchmarks. Highway Network allows unimpeded information flow across many layers by using adaptive gating units to regulate the information flow.

Overall, T-BQ-NN performs better than most other network structures, even though these network structures have never been quantized except BNN. In this experiment, T-BQ-NN is trained by our training algorithm from scratch due to the lack of a pretrained model, with a mini-batch size of 50 and a weight decay of 0.0001. Its test error rates of 7.59% on CIFAR-10, 28.90% on CIFAR-100, 2.29% on SVHN, and 0.50% on MNIST are lower than the test error rates achieved by Network in Network, Highway Network, and BNN. In particular, T-BQ-NN makes up for the classification accuracy of BNN on CIFAR-100 to some extent. T-BQ-NN gives the best result on all listed datasets except CIFAR-10, where All-CNN is best; all results are shown in Table III.

Our model capacity is even more encouraging: the number of parameters of T-BQ-NN is significantly lower than that of the other network structures. In particular, T-BQ-NN has only 1.2M parameters, which is even lower than the 1.7M of BNN, as shown in Table III.

IV-D3 Extension

Figure 4: Comparison of the T-BQ-NN and BNN top-1 error rates on the CIFAR-10, CIFAR-100, and SVHN validation sets. The curves, from top to bottom at 900 iterations, represent T-BQ-NN on C100, BNN on C100, T-BQ-NN on C10, BNN on C10, T-BQ-NN on SVHN, and BNN on SVHN, respectively. Note that all results were run by ourselves.

One positive property of our training algorithm is that it is universal. We extend our training method to better and deeper architectures such as ResNet [16] and DenseNet [17], not just the plain CNN above. In this experiment, T-BQ-ResNet and T-BQ-DenseNet are 3-bit models that are fine-tuned by our training algorithm from the full-precision ResNet and DenseNet models.

For T-BQ-ResNet, all multiplications are converted to SHIFT and ADDER operations using 3-bit weights in all convolutional layers and shortcut connections. We use a momentum of 0.9 and a weight decay of 0.0001 [49, 44], and adopt the weight initialization and BN of [31, 32] without dropout [50]. This model is trained with a mini-batch size of 128 and a learning rate of 0.1, divided by 10 at 32k and 38k iterations, and training terminates at 64k iterations. We achieve test error rates of 7.15% on C10, 37.56% on C100, and 3.25% on SVHN using T-BQ-ResNet, a rise of only 0.54% on C10, 1.69% on C100, and 0.12% on SVHN compared with ResNet, according to Table IV.

The T-BQ-DenseNet model consists of bottleneck layers, denoted BN-ReLU-Conv(1×1)-BN-ReLU-Conv(3×3), and transition layers, denoted BN-ReLU-Conv(1×1)-averpool(2×2), both of which contain 1×1 convolutions. We use a weight decay of 0.0001 and a momentum of 0.9 [51], and adopt the weight initialization and BN without dropout. This model is trained with an initial learning rate of 0.1, divided by 10 at 50% and 75% of the total number of training epochs, using a batch size of 64 for 300 epochs on CIFAR and 40 epochs on SVHN. Comparing DenseNet and T-BQ-DenseNet, the increase in error is 0.80% (from 4.51% to 5.31%) on C10, 1.83% (from 22.27% to 24.10%) on C100, and 0.34% (from 1.76% to 2.10%) on SVHN, as shown in Table IV.

We attribute this primarily to the roughly 5× reduction in the number of parameters, from 0.8M to 0.15M for T-BQ-DenseNet and from 1.7M to 0.3M for T-BQ-ResNet, as shown in Table IV. Furthermore, a hybrid network compression solution combining three different techniques, namely network pruning [19], quantization and Huffman coding, is tested for T-BQ-ResNet and T-BQ-DenseNet at the same classification accuracy. Compared with the original full-precision ResNet-110 model, we achieve compression ratios of 57× on C10, 27× on C100 and 26× on SVHN for T-BQ-ResNet. For T-BQ-DenseNet, the compression ratios are 46× on C10, 16× on C100 and 190× on SVHN, as shown in Table V.

IV-D4 Convergence Speed

In this experiment, we train our T-BQ-NN and BNN from scratch on C10, C100, and SVHN. The results in Fig.4 indicate that T-BQ-NN not only achieves better classification accuracy than BNN but also converges much faster. We only compare our method with BNN, because the weights of the other network models in Table III are full-precision and those models are not quantized. We use the same conditions, including learning rate, batch size, and iterations, to test the error rates of BNN and T-BQ-NN. Compared with BNN, T-BQ-NN reaches its best test error with the number of epochs dropping from 500 to 150 on C10, from 1000 to 100 on MNIST, and from 1000 to 180 on C100. As a result, T-BQ-NN can be trained much more easily and quickly than BNN.

This result may be due to the fact that the straight-through estimator used by BNN contains noise, which causes unexpected deviations, while our training algorithm is based entirely on back-propagation without the effect of noise, and our weight representation is richer.

IV-E ImageNet

We further evaluate our n-BQ-NN on the ILSVRC2012 [11] image classification dataset, which consists of 1.2 million high-resolution natural images, with a validation set containing 50k images. These images are organized into 1000 object categories for training and are resized to 224×224 pixels before being fed into the network. In the following experiments, we report single-crop evaluation results using top-1 and top-5 accuracy.

AlexNet: This CNN architecture is the first that was shown to be successful on the ImageNet classification task, and consists of 5 convolutional layers and 2 fully-connected layers [1]. We use AlexNet coupled with BN, which contains a total of 61M parameters.

In training, images are randomly resized to 256×256 pixels, and then a random 224×224 crop is selected for training. We train our n-BQ-NN for 50 epochs with a batch size of 128/16 based on AlexNet/VGG-16. We use the ADAM optimizer with a learning rate of 1e-4. We replace the Local Contrast Renormalization layer with a Batch Normalization layer. At inference, we use the 224×224 center crop for forward propagation.

The ablation experiments are listed in Table VI. The baseline AlexNet model scores 56.6% top-1 accuracy and 80.2% top-5 accuracy, as reported in [52]. For the ablation studies, we strictly control the consistency of variables, including network structure, bit width and quantized layers; the only difference is the quantization method. In the experiments "1-1" vs. "1-1", "1-16" vs. "1-32" and "2-16" vs. "2-32", our n-BQ-NN achieves 6.9%, 11.8% and 0.7% accuracy improvements respectively. For "3-16" vs. "32-32", our n-BQ-NN only reduces the accuracy by 0.6%.

V Acceleration on FPGA

Figure 5: High-level block diagram of our system.

Figure 6: The overall parallel architecture with clusters of shift vector processing elements (SVPEs).

Figure 7: An SVPE array implementing the primitive 2D convolver unit. The double orange rectangles represent on-chip memory banks that buffer the weights, the single orange rectangles represent registers, the sky-blue circles represent shift operations replacing the multipliers of the VPE array, and the deep-green circles represent adders.

We evaluate our n-BQ-NN on an FPGA platform, the Xilinx ZCU102, which consists of an UltraScale FPGA, quad ARM Cortex-A53 processors and 500 MB DDR3. To measure the performance of our n-BQ-NN running on the FPGA, we obtain run-time, resource utilization, and power data by simulating and testing on Vivado 2017 at an operating frequency of 200 MHz. Our n-BQ-NN implementation involves a few design parameters: the parallelization degree (P_{n} and P_{m}), filter size (k), input feature map width (W), input feature map height (H), number of input feature channels (M), and number of output feature channels (N).

V-A Coprocessor Architecture

Fig.5 shows the block diagram of our system. In the CNN calculation process, the host computer hands off the weights and images to the coprocessor (ZCU102) and collects the predicted classification results. The transmission mode between the host computer and the ARM CPU can be switched between PCI and UDP. Before the computation, the host computer is responsible for feeding the images and reducing precision. In addition, the ARM CPU completes the calculation of the fully-connected layers, which are not suitable for FPGA parallel acceleration, while the FPGA accelerates the calculation of the convolutional layers.

We build the coprocessor with a parallel architecture, as shown in Fig. 6. The critical part of the coprocessor is the SVPE cluster interface, which has M SVPE clusters, where each SVPE cluster consists of N SVPE arrays of size k×k. The adders compute partial sums of convolutions while the SVPE arrays compute the convolutions themselves. The fetch unit is programmed to fetch images and weights from the ARM-based processing system (PS), and the load/store unit loads or stores intermediate calculation results. The AXI-HP port receives or sends the data, and the AXI-GP port receives or sends the network structure information and control signals. A key point to note is that 16-bit computational precision is used on the data buses to save bandwidth.

The basic design follows the architecture of n-BQ-NN, which converts all 16-bit weights to n-bit weights to reduce memory usage and increase the parallelization degree. On FPGAs, the shortage of DSPs is a major factor limiting the parallelization degree, which directly affects the ability to accelerate CNN computation, because the multiplications in the convolution calculation require DSPs. Instead, we implement the multiplication with a SHIFT operation that consumes Look-Up-Table (LUT) resources on FPGAs, and LUTs are more abundant than DSPs [30]. In general, our n-BQ-NN consists of 16-bit activations and n-bit weights.
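To make the SHIFT-for-multiply substitution concrete, the following C sketch illustrates the idea under an assumed weight encoding (a sign plus a signed exponent, so that w = ±2^e); the names pow2_weight_t and shift_mul are illustrative only and do not correspond to the actual hardware description.

```c
#include <stdint.h>

/* Assumed encoding of an n-bit power-of-two weight: a sign and a signed
 * exponent e, i.e. w = sign * 2^e.  The real on-chip encoding is not
 * specified here; this only illustrates replacing a multiply by a SHIFT. */
typedef struct {
    int8_t sign;  /* +1 or -1                 */
    int8_t exp;   /* exponent e, so |w| = 2^e */
} pow2_weight_t;

/* 16-bit activation times a power-of-two weight: an arithmetic shift
 * (left for e >= 0, right for e < 0) plus a sign fix-up -- no multiplier,
 * and hence no DSP, is needed. */
static inline int32_t shift_mul(int16_t act, pow2_weight_t w)
{
    int32_t v = (int32_t)act;
    int32_t shifted = (w.exp >= 0) ? (v << w.exp) : (v >> -w.exp);
    return (w.sign >= 0) ? shifted : -shifted;
}
```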

Our architecture for convolution computation is characterized by several key attributes compared with VPEs [14]. First, we organize the architecture as arrays of SVPEs, where an SVPE array is an array of 2D convolvers, each of which consists of k^2 connected SHIFT and ADDER units instead of Multiply-Accumulate (MAC) units, as shown in Fig. 7. The weights and feature maps are loaded into each PE alternately through the AXI-HP port. Before each calculation, the weights are buffered in the specified areas (the double orange rectangles in Fig. 7), and the pipelined calculation then starts once the feature maps are enabled. Modeling the SVPE and VPE arrays, we compare their resource consumption, parallelization degree and power on FPGAs, as shown in Table VII; our SVPE array consumes 3.81 W on average over the different values of n, which is less than the 5.53 W of the VPE array. Second, we reduce the off-chip memory bandwidth demand and improve the data movement between the SVPE clusters and the off-chip memory by using n-bit weights. Third, all convolvers are homogeneous when k is fixed as our primitive operator. We evaluate the improvement in the computational efficiency of n-BQ-NN in hardware in the following section.
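As a software analogue of one 2D convolver in the SVPE array (a sketch only, reusing the hypothetical shift_mul helper above), a single output pixel is produced by accumulating k^2 shifted activations, mirroring the k^2 SHIFT and ADDER units of Fig. 7; the hardware pipelines this computation across the feature map rather than looping.

```c
/* One output pixel of a k x k 2D convolver built from SHIFT and ADDER
 * units.  `window` holds the k x k patch of 16-bit activations currently
 * in the line-buffer registers, and `weights` the buffered power-of-two
 * weights for one filter. */
static int32_t svpe_conv2d_pixel(const int16_t *window,
                                 const pow2_weight_t *weights,
                                 int k)
{
    int32_t acc = 0;
    for (int i = 0; i < k * k; ++i)
        acc += shift_mul(window[i], weights[i]);  /* shift + add, no DSP multiply */
    return acc;
}
```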

V-B Computational Efficiency

TABLE VII: Comparison of VPE and SVPE array resource consumption, parallelization degree and power.
  Array                               SVPE                           VPE
  Precision (n)               1      2      3      4      5          16
  Power (W)   Signal        1.94   2.31   2.61   2.53   2.13        4.88
              Logic         1.03   1.33   1.55   1.63   1.62        0.25
              DSPs          0.08   0.09   0.09   0.08   0.06        0.40
              Total         3.05   3.73   4.25   4.24   3.81        5.53
  Used        LUTs           353    280    307    334    346          41
  resource    FFs            220    226    232    238    244         213
              DSPs             3      3      3      3      3          12
  Parallelization degree (P_n, P_m)       (8, 32)                 (4, 16)

Since the filter size (3×3) is fixed for our n-BQ-NN, resource utilization can be maximized. We can predict the performance of n-BQ-NN on FPGAs by developing an analytical model; in the following, we rely on it to compare the computational efficiency of a traditional implementation and of n-BQ-NN on FPGAs.

On the hardware, the MAC units, adders and multipliers consume DSPs. In fact, the number of DSPs depends only on the filter size and the parallelization degree [30, 12], as follows:

DSP = (k^2 + k) × P_n × P_m    (35)

We must balance the memory bandwidth between on-chip and off-chip memory and ensure that the transmission speed is greater than or equal to the computation speed in order to utilize the resources efficiently. The time to process the input data in the line buffer on the FPGA is,

T_compute = H × W × (M / P_m) × (N / P_n) × (1 / Freq)    (36)

where Freq is the operating frequency of the FPGA. Correspondingly, the time to transfer the required data (the weights and the buffered input rows), which proceeds in parallel with the computation, is,

T_transfer = (M × N × k^2 + k × W × M) / Bandwidth    (37)

We require that T_transfer ≤ T_compute. Therefore, the minimum bandwidth requirement is,

Bandwidth_min = (P_m × P_n / min(N, M)) × b_compute × Freq    (38)

where b_compute is the bit-width of the computation; we evaluate the hardware acceleration performance with a 16-bit width. We define T_init as the time to load the first k rows of the input image and the required filter weights into on-chip memory, as follows,

T_init = (M × N × k^2 × b_weight) / Bandwidth + (W × M × k × b_compute) / Bandwidth    (39)

where b_weight is the bit-width of the weights. The total number of operations is,

OPs = H × W × M × N × k^2 × 2    (40)

Then the total processing time of the convolution is,

T_total = T_compute + T_init    (41)

Finally, we can compare the computational efficiency of different models by defining the effective convolution performance as,

Perf_eff = OPs / T_total    (42)

We obtain the computational efficiency Perf_eff(n) corresponding to different weight bit widths, where n = b_weight denotes the weight bit width of our n-BQ-NN:

Perf_eff(n) = (32 × Freq × P_m × P_n × H × W × N × k^2) / (16 × H × W × N + n × M × N × k^2 + 16 × W × M × k)    (43)

Now, given a convolutional layer represented by (W, H, M, N), we obtain the computational efficiency from the design parameters (k, P_m, P_n).
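For concreteness, the following C sketch transcribes the closed form of Eq. 43 (the function name perf_eff and its argument order are our own choices, and the bandwidth is assumed to be provisioned at the minimum of Eq. 38 with b_compute = 16, as in the text):

```c
/* Effective convolution performance of Eq. (43), in operations per second,
 * for a layer (W, H, M, N) with filter size k, parallelization degree
 * (Pm, Pn), n-bit weights and 16-bit activations.  Bandwidth is assumed
 * to be provisioned at the minimum of Eq. (38) with b_compute = 16. */
static double perf_eff(double W, double H, double M, double N,
                       double k, double Pm, double Pn,
                       double n, double freq)
{
    double numerator   = 32.0 * freq * Pm * Pn * H * W * N * k * k;
    double denominator = 16.0 * H * W * N + n * M * N * k * k + 16.0 * W * M * k;
    return numerator / denominator;
}
```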

The main factor restricting the computational efficiency of CNNs on FPGAs is the parallelization degree, which is directly related to the number of DSPs when the bandwidth is provisioned reasonably. To speed up CNN inference on FPGAs, we replace the traditional VPE cluster with our SVPE cluster by converting the multiplications into SHIFT and ADDER operations. Since we no longer use multiplication, the number of DSPs is reduced as follows:

DSP = k × P_n × P_m    (44)

Since the SVPE array consumes far fewer DSPs than the VPE array, as seen by comparing Eq. 35 with Eq. 44, n-BQ-NN with the SVPE array can achieve a higher parallelization degree than CNNs with the VPE array for the same number of consumed DSPs. Based on the maximum of 2520 DSPs in Table VIII and a balanced memory bandwidth, we can design maximum parallelization degrees of (P_m = 32, P_n = 8) and (P_m = 16, P_n = 4) on the SVPE and VPE arrays, respectively, with 3×3 filters. Thanks to the SVPE array, the parallelization degree increases by 4 times, which greatly improves the computational efficiency when the total number of consumed DSPs is 768. Supposing a color image of 32 (W) × 32 (H) × 3 (M) pixels, a filter size k of 3 and a DDR bit width N of 128, the computational efficiency of our n-BQ-NN using the SVPE array improves by about 4.1 times compared with a traditional network using the VPE array, on the basis of Eq. 43.
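Plugging the numbers from this paragraph into the sketch above (a usage example only, continuing the perf_eff function defined after Eq. 43 and taking the 200 MHz operating frequency, which cancels in the ratio) reproduces the quoted factor of about 4.1:

```c
#include <stdio.h>

int main(void)
{
    const double freq = 200e6;                 /* 200 MHz operating frequency     */
    double svpe = perf_eff(32, 32, 3, 128, 3,  /* W, H, M, N, k                   */
                           32, 8, 3, freq);    /* Pm = 32, Pn = 8, 3-bit weights  */
    double vpe  = perf_eff(32, 32, 3, 128, 3,
                           16, 4, 16, freq);   /* Pm = 16, Pn = 4, 16-bit weights */
    printf("SVPE/VPE efficiency ratio: %.1f\n", svpe / vpe);  /* prints about 4.1 */
    return 0;
}
```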

V-C Performance on FPGA

Figure 8: A universal SVPE array supporting AlexNet with multiple kernel sizes (3, 5, and 11), where the yellow blocks are Shift-Accumulator units.

We evaluate our SVPE cluster implementation using AlexNet, where our n-BQ-NN contains 16-bit activations and 3-bit weights. Table VIII gives the evaluation results together with a comparison against state-of-the-art FPGA accelerators, where GOP indicates the unit for the number of operations. From the hardware perspective, we prefer to use the computational efficiency to describe the performance of the algorithm: because the total number of operations for computing a network is fixed, the execution time (s) is obtained by dividing the total operations (GOP) by the computational efficiency (GOP/s). Similar to the structure of Fig. 7, a universal SVPE array designed for the largest filter size of AlexNet is shown in Fig. 8. This experiment uses the universal SVPE array, which fits all filter sizes after a slight adjustment. Our array is designed to be reused when computing the convolutions of different layers, but the 11×11 filter of AlexNet is only used in the first convolutional layer, so for most layers the array utilization is extremely low. This also confirms the necessity of designing the network architecture with a unified 3×3 filter to improve resource utilization, as in Table I.

Compared to prior works [53, 30], we improve the average CNN performance to 957.4 GOP/s, where the work [30] is implemented with the Winograd algorithm. The baseline implements the same hardware architecture as our implementation; the only difference is that it uses a VPE cluster, because its weights and activations are both 16-bit. The computational efficiency of our implementation improves by 2.9 times compared with the baseline, which is slightly less than the 4.1 times predicted by the theoretical calculations in Section V-B. Our implementation also improves the energy efficiency to 48.9 GOP/s/W. The better energy and resource efficiency come from the novel SVPE structure.

TABLE VIII: Performance comparison for AlexNet
                                          [53]           Baseline       [30]           Our Impl.
  Precision                               32-bit fixed   16-bit fixed   16-bit fixed   16-bit fixed
  Device                                  VX485T         ZCU102         ZCU102         ZCU102
  Freq (MHz)                              100            200            200            200
  Logic cells (K)                         485.7          600            600            600
  DSPs                                    2800           2520           2520           2520
  BRAM (Kb)                               2060×18        1824×18        1824×18        1824×18
  conv1 (GOP/s)                           27.5           227.5          409.6          410.5
  conv2 (GOP/s)                           83.8           535.8          1355.6         1744.3
  conv3 (GOP/s)                           78.8           655.9          1535.7         1680.7
  conv4 (GOP/s)                           77.9           634.4          1361.7         1739.4
  conv5 (GOP/s)                           77.6           559.5          1285.7         1456.1
  CNN average (GOP/s)                     61.6           332.2          854.6          957.4
  Power (W)                               18.6           28.7           23.6           19.6
  DSP efficiency (GOP/s/DSP)              0.022          0.131          0.339          0.381
  Logic cell efficiency (GOP/s/K cells)   0.127          0.553          1.424          1.596
  Energy efficiency (GOP/s/W)             3.31           11.57          36.2           48.85
  DSP utilization                         80%            30%            63%            30%
  LUT utilization                         61%            48%            39%            73%
  FF utilization                          34%            42%            33%            68%
  BRAM utilization                        50%            50%            43%            83%

VI Conclusion and future work

In this paper, we present a novel learning framework to quantize full-precision CNN models into low-precision QNN models whose weights are constrained to powers of two. We solve the problem of gradient vanishing by adding a reconstructed gradient function to the back-propagation algorithm. To satisfy the network-structure-optimization requirements imposed by hardware limitations, we propose n-BQ-NN, a novel QNN structure that replaces multiplication with the SHIFT operation and is more suitable for inference on FPGAs. Furthermore, we design the SVPE array to replace all 16-bit multiplications with SHIFT operations in the convolution operation on FPGAs. To demonstrate the validity of our learning framework, we conduct experiments showing that the quantized ResNet, DenseNet and AlexNet models obtained through our learning framework achieve almost the same accuracies as the original full-precision models. Moreover, when our learning framework is used to train our n-BQ-NN from scratch, it achieves nearly state-of-the-art results compared with typical low-precision QNNs. We also evaluate computational efficiency and energy consumption by implementing our QNN models on the Xilinx ZCU102 platform. In our hardware experiments, our n-BQ-NN with the SVPE array executes 2.9 times faster than with the VPE array in inference, and the use of the SVPE array also reduces the average energy consumption to 68.7% of that of the 16-bit VPE array. Our future work will explore how to further decrease the accumulated quantization errors when our learning framework is applied to different CNN structures.

References

  • [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in International Conference on Neural Information Processing Systems, pp. 1097–1105, 2012.
  • [2] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.
  • [3] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2015.
  • [4] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to human-level performance in face verification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1701–1708, 2014.
  • [5] Y. Sun, X. Wang, and X. Tang, “Deep learning face representation from predicting 10,000 classes,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1891–1898, 2014.
  • [6] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440, 2015.
  • [7] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected crfs,” arXiv preprint arXiv:1412.7062, 2014.
  • [8] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, pp. 1440–1448, 2015.
  • [9] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, pp. 91–99, 2015.
  • [10] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
  • [11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255.   Ieee, 2009.
  • [12] M. Peemen, A. A. Setio, B. Mesman, H. Corporaal et al., “Memory-centric accelerator design for convolutional neural networks.” in ICCD, vol. 2013, pp. 13–19, 2013.
  • [13] P. Meloni, G. Deriu, F. Conti, I. Loi, L. Raffo, and L. Benini, “A high-efficiency runtime reconfigurable ip for cnn acceleration on a mid-range all-programmable soc,” in 2016 International Conference on ReConFigurable Computing and FPGAs, pp. 1–8.   IEEE, 2016.
  • [14] M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic, E. Cosatto, and H. P. Graf, “A massively parallel coprocessor for convolutional neural networks,” in Application-specific Systems, Architectures and Processors, 2009. ASAP 2009. 20th IEEE International Conference on, pp. 53–60.   IEEE, 2009.
  • [15] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1,” 2016.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
  • [17] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks.” in CVPR, vol. 1, no. 2, p. 3, 2017.
  • [18] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in neural information processing systems, pp. 1135–1143, 2015.
  • [19] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient convnets,” arXiv preprint arXiv:1608.08710, 2016.
  • [20] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
  • [21] J. Van Leeuwen, “On the construction of huffman trees.” in ICALP, pp. 382–410, 1976.
  • [22] J. Choi, P. I.-J. Chuang, Z. Wang, S. Venkataramani, V. Srinivasan, and K. Gopalakrishnan, “Bridging the accuracy gap for 2-bit quantized neural networks (qnn),” arXiv preprint arXiv:1807.06964, 2018.
  • [23] J. Choi, Z. Wang, S. Venkataramani, P. I.-J. Chuang, V. Srinivasan, and K. Gopalakrishnan, “Pact: Parameterized clipping activation for quantized neural networks,” arXiv preprint arXiv:1805.06085, 2018.
  • [24] Z. Liu, B. Wu, W. Luo, X. Yang, W. Liu, and K.-T. Cheng, “Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 722–737, 2018.
  • [25] B. Zhuang, C. Shen, M. Tan, L. Liu, and I. Reid, “Towards effective low-bitwidth convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7920–7928, 2018.
  • [26] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, “Incremental network quantization: Towards lossless cnns with low-precision weights,” arXiv preprint arXiv:1702.03044, 2017.
  • [27] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv preprint arXiv:1606.06160, 2016.
  • [28] C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained ternary quantization,” arXiv preprint arXiv:1612.01064, 2016.
  • [29] E. Park, D. Kim, S. Yoo, and P. Vajda, “Precision highway for ultra low-precision quantization,” arXiv preprint arXiv:1812.09818, 2018.
  • [30] L. Lu, Y. Liang, Q. Xiao, and S. Yan, “Evaluating fast algorithms for convolutional neural networks on fpgas,” in Field-Programmable Custom Computing Machines (FCCM), 2017 IEEE 25th Annual International Symposium on, pp. 101–108.   IEEE, 2017.
  • [31] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
  • [32] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, pp. 1026–1034, 2015.
  • [33] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for simplicity: The all convolutional net,” arXiv preprint arXiv:1412.6806, 2014.
  • [34] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
  • [35] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [36] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeply-supervised nets,” in Artificial Intelligence and Statistics, pp. 562–570, 2015.
  • [37] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
  • [38] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep networks,” in Advances in neural information processing systems, pp. 2377–2385, 2015.
  • [39] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, “Deep networks with stochastic depth,” in European Conference on Computer Vision, pp. 646–661.   Springer, 2016.
  • [40] G. Larsson, M. Maire, and G. Shakhnarovich, “Fractalnet: Ultra-deep neural networks without residuals,” arXiv preprint arXiv:1605.07648, 2016.
  • [41] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: Hints for thin deep nets,” arXiv preprint arXiv:1412.6550, 2014.
  • [42] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in NIPS workshop on deep learning and unsupervised feature learning, vol. 2011, no. 2, p. 5, 2011.
  • [43] S. Zagoruyko and N. Komodakis, “Wide residual networks,” arXiv preprint arXiv:1605.07146, 2016.
  • [44] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout networks,” arXiv preprint arXiv:1302.4389, 2013.
  • [45] P. Sermanet, S. Chintala, and Y. LeCun, “Convolutional neural networks applied to house numbers digit classification,” in Pattern Recognition, 2012 21st International Conference on, pp. 3288–3291.   IEEE, 2012.
  • [46] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient processing of deep neural networks: A tutorial and survey,” Proc. IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.
  • [47] M. Courbariaux, Y. Bengio, and J.-P. David, “Binaryconnect: Training deep neural networks with binary weights during propagations,” in Advances in neural information processing systems, pp. 3123–3131, 2015.
  • [48] F. Li, B. Zhang, and B. Liu, “Ternary weight networks,” arXiv preprint arXiv:1605.04711, 2016.
  • [49] S. Gross and M. Wilber, “Training and investigating residual nets,” Facebook AI Research, CA. [Online]. Available: http://torch.ch/blog/2016/02/04/resnets.html, 2016.
  • [50] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [51] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in International conference on machine learning, pp. 1139–1147, 2013.
  • [52] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural networks,” in European Conference on Computer Vision, pp. 525–542.   Springer, 2016.
  • [53] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing fpga-based accelerator design for deep convolutional neural networks,” in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 161–170.   ACM, 2015.