
FleXOR: Trainable Fractional Quantization

Dongsoo Lee*           Se Jung Kwon*           Byeongwook Kim
Yongkweon Jeon           Baeseong Park           Jeongin Yun
Samsung Research, Seoul, Republic of Korea
{dongsoo3.lee, sejung0.kwon, byeonguk.kim,
dragwon.jeon, bpbs.park, ji6373.yun}@samsung.com
*Equal contribution.
Abstract

Quantization based on binary codes is gaining attention because each quantized bit can be directly utilized for computations without dequantization using look-up tables. Previous attempts, however, only allow integer numbers of quantization bits, which restricts the search space for compression ratio and accuracy. In this paper, we propose an encryption algorithm/architecture to compress quantized weights so as to achieve fractional numbers of bits per weight. Decryption during inference is implemented by digital XOR-gate networks added into the neural network model, while the XOR gates are described using $\tanh(x)$ for backward propagation to enable gradient calculations. Experiments on MNIST, CIFAR-10, and ImageNet show that inserting XOR gates allows quantization/encryption bit decisions to be learned through training and obtains high accuracy even for fractional sub 1-bit weights. As a result, our proposed method yields smaller model size and higher accuracy than binary neural networks.

1 Introduction

Deep Neural Networks (DNNs) demand a larger number of parameters and more computations to support various tasks while adhering to ever-increasing model accuracy requirements. Because of the abundant redundancy in DNN models [9, 5, 3], numerous model compression techniques are being studied to expedite DNN inference [21, 17]. As a practical model compression scheme, parameter quantization is a popular choice because of its high compression ratio and regular formats after compression, which enable full memory bandwidth utilization.

Quantization schemes based on binary codes are gaining increasing attention since quantized weights follow specific constraints that allow simpler computations during inference. Specifically, using binary codes, a weight vector is represented as $\sum_{i=1}^{q}(\alpha_{i}{\bm{b}}_{i})$, where $q$ is the number of quantization bits, $\alpha_{i}$ is a scaling factor ($\alpha_{i}\in\mathbb{R}$), and each element of a vector ${\bm{b}}_{i}$ is a binary value in $\{-1,+1\}$. A dot product with activations is then computed as $\sum_{i=1}^{q}(\alpha_{i}\sum_{j=1}^{v}a_{j}b_{i,j})$, where $a_{j}$ is a full-precision activation and $v$ is the vector size. Note that the number of multiplications is reduced from $v$ to $q$ (fewer expensive floating-point multipliers are required for inference). Moreover, even though we do not discuss a new activation quantization method in this paper, if activations are also quantized using binary codes, then most computations are replaced with bit-wise operations (using XNOR logic and population counts) [27, 22]. Consequently, even though the representation space is constrained compared with quantization methods based on look-up tables, various inference accelerators can be designed to exploit the advantages of binary codes [22, 27]. Since a successful 1-bit weight quantization method was demonstrated in BinaryConnect [3], advances in compression-aware training algorithms for binary codes (e.g., binary weight networks [22] and LQ-Nets [29]) have produced 1-3 bit quantization with modest or negligible accuracy drop. Fundamental investigations of DNN training mechanisms using fewer quantization bits have also been actively reported [19, 2].
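To make the arithmetic concrete, the following NumPy sketch (illustrative sizes and random values, not the authors' code) reconstructs a weight vector from $q$ binary codes and verifies that the dot product can be reorganized so that only $q$ multiplications by the scaling factors are required.

    import numpy as np

    q, v = 2, 8                                    # quantization bits, vector length
    rng = np.random.default_rng(0)
    alpha = np.array([0.7, 0.2])                   # one scaling factor per bit plane
    b = rng.choice([-1, 1], size=(q, v))           # binary codes b_i in {-1, +1}
    a = rng.standard_normal(v)                     # full-precision activations

    # Reconstructed weights: w_j = sum_i alpha_i * b_{i,j}
    w = (alpha[:, None] * b).sum(axis=0)

    # Dot product computed as sum_i alpha_i * (sum_j a_j * b_{i,j}):
    # only q floating-point multiplications involve the scaling factors.
    out = sum(alpha[i] * (a * b[i]).sum() for i in range(q))
    assert np.isclose(out, a @ w)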

Previously, binary-coding-based quantization has only permitted integer numbers of quantization bits, limiting the compression/accuracy trade-off search space, especially in the range of very low quantization bits. In this paper, we propose a flexible encryption algorithm/architecture (called “FleXOR”) that enables fractional, sub 1-bit numbers of bits to represent each weight while the quantized bits are trained by gradient descent. Even though vector quantization is also a well-known scheme with a high compression ratio [24], we assume the form of binary codes. Note that the number of quantization bits can already differ across layers (e.g., [26]) to yield fractional quantization bits on average; FleXOR, in contrast, provides fractional quantization bits within each layer, while each layer can still be quantized with a different number of bits.

Figure 1: Dataflow and computation formats of binary-coding-based quantization, vector quantization, and our proposed quantization scheme.

To the best of our knowledge, our work is the first to explore model accuracy under 1 bit/weight when weights are quantized based on binary codes. Figure 1 compares the weight representations stored in memory, the conversion methods, and the computation schemes of three quantization approaches. FleXOR maintains the advantages of binary-coding-based quantization (i.e., dequantization is not necessary for the computations) while quantized weights are further compressed by encryption. Note that since our major contribution is to enable fractional sub 1-bit weight quantization, for experiments we selected models that have previously been quantized with 1 bit/weight for comparisons of model accuracy. As a result, unfortunately, the range of model selections is somewhat limited.

2 Encrypting Quantized Bits using XOR Gates

Figure 2: FleXOR components added to the quantized DNNs to compress quantized weights through encryption. Encrypted weight bits are decrypted by XOR gates to produce quantized weight bits.

The main purpose of FleXOR is to compress quantized bits into encrypted bits that can be reconstructed by XOR gates as shown in Figure 2. Suppose that $N_{out}$ bits are to be compressed into $N_{in}$ bits ($N_{out}>N_{in}$). The role of an XOR-gate network is to produce various $N_{out}$-bit combinations using $N_{in}$ bits [15]. In other words, in order to maximize the chance of generating a desirable set of quantized bits, the encryption scheme is designed such that all $2^{N_{in}}$ possible outcomes of decryption are evenly distributed in the $2^{N_{out}}$ space.

A linear Boolean function $f({\bm{x}})$ maps $f:\{0,1\}^{N_{in}}\rightarrow\{0,1\}$ and has the form $a_{1}x_{1}\oplus a_{2}x_{2}\oplus\dots\oplus a_{N_{in}}x_{N_{in}}$, where $a_{j}\in\{0,1\}$ ($1\leq j\leq N_{in}$) and $\oplus$ indicates bit-wise modulo-2 addition. In Figure 2, six binary outputs are generated through six Boolean functions using four binary inputs. Let $f_{1}({\bm{x}})$ and $f_{2}({\bm{x}})$ be two such linear Boolean functions of ${\bm{x}}=(x_{1},x_{2},\dots,x_{N_{in}})\in\{0,1\}^{N_{in}}$. The Hamming distance between $f_{1}({\bm{x}})$ and $f_{2}({\bm{x}})$ is the number of inputs on which $f_{1}({\bm{x}})$ and $f_{2}({\bm{x}})$ differ, defined as

d_{H}(f_{1},f_{2}):=w_{H}(f_{1}\oplus f_{2})=\#\{{\bm{x}}\in\{0,1\}^{N_{in}}\,|\,f_{1}({\bm{x}})\neq f_{2}({\bm{x}})\},   (1)

where $w_{H}(f)=\#\{{\bm{x}}\in\{0,1\}^{N_{in}}\,|\,f({\bm{x}})=1\}$ is the Hamming weight of a function and $\#\{\cdot\}$ denotes the size of a set [13]. The Hamming distance is a well-known method to express non-linearity between two Boolean functions [13], and an increased Hamming distance between a pair of Boolean functions results in a greater variety of outputs produced by the XOR gates. Increasing the Hamming distance is a required feature in cryptography to derive a complicated encryption structure such that inverting encrypted data becomes difficult. In digital communication, the Hamming distance between encoded signals is closely related to the amount of error correction possible.
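As a small illustration (an enumeration sketch with hypothetical coefficient rows, not tied to any trained model), the Hamming distance of Eq. (1) can be computed by evaluating both linear Boolean functions on all $2^{N_{in}}$ inputs:

    import numpy as np
    from itertools import product

    def boolean_outputs(a):
        """Evaluate f_a(x) = a_1 x_1 XOR ... XOR a_n x_n for every x in {0,1}^n."""
        xs = np.array(list(product([0, 1], repeat=len(a))))
        return (xs @ np.array(a)) % 2              # XOR of the selected inputs

    a1 = [1, 0, 1, 1]                              # f1(x) = x1 XOR x3 XOR x4
    a2 = [1, 1, 0, 0]                              # f2(x) = x1 XOR x2
    d_H = int(np.sum(boolean_outputs(a1) != boolean_outputs(a2)))
    print(d_H)                                     # 8: the functions differ on half of the 16 inputs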

FleXOR should be able to select the best of $2^{N_{in}}$ possible outputs that are randomly drawn from the larger $2^{N_{out}}$ search space. The encryption performance of the XOR gates is determined by the randomness of the $2^{N_{in}}$ output candidates, and is enhanced by the increased Hamming distance obtained with a larger $N_{out}$ (for a fixed compression ratio). Now, let $N_{tap}$ be the number of 1's in a row of ${\bm{M}}^{\oplus}$, the binary matrix describing the XOR-gate network (introduced below). Another way to enhance encryption performance is to increase $N_{tap}$, which increases the number of shuffles (through more XOR operations) of the encrypted bits used to generate quantized bits, so that the correlation between quantized bits is reduced.

Figure 3: Encrypted weight bits are sliced and reconstructed by an XOR-gate network which can be shared (in time or space). The quantized bits after the XOR gates are then finally reshaped.

In Figure 2, $y_{1}$ is represented as $x_{1}\oplus x_{3}\oplus x_{4}$, or equivalently the vector $[1\,0\,1\,1]$ denoting which inputs are selected. Concatenating such vectors, the XOR-gate network in Figure 2 can be described as a binary matrix ${\bm{M}}^{\oplus}\in\{0,1\}^{N_{out}\times N_{in}}$ (e.g., the second row of ${\bm{M}}^{\oplus}$ is $[1\,1\,0\,0]$ and the third row is $[1\,1\,1\,0]$). Then, decryption through XOR gates is simply represented as ${\bm{y}}={\bm{M}}^{\oplus}{\bm{x}}$, where ${\bm{x}}$ and ${\bm{y}}$ are the binary inputs and outputs of the XOR gates, addition is ‘XOR’, and multiplication is ‘AND’ (see Appendix for more details and examples).
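For instance, decryption by the XOR-gate network of Figure 2 is just a modulo-2 matrix-vector product; the NumPy sketch below is illustrative, with an arbitrary encrypted input vector.

    import numpy as np

    M = np.array([[1, 0, 1, 1],    # y1 = x1 XOR x3 XOR x4
                  [1, 1, 0, 0],    # y2 = x1 XOR x2
                  [1, 1, 1, 0],    # y3 = x1 XOR x2 XOR x3
                  [0, 0, 1, 1],    # y4 = x3 XOR x4
                  [0, 1, 0, 1],    # y5 = x2 XOR x4
                  [0, 1, 1, 1]])   # y6 = x2 XOR x3 XOR x4
    x = np.array([1, 0, 1, 1])     # N_in encrypted bits
    y = (M @ x) % 2                # N_out quantized bits: addition is XOR, multiplication is AND
    print(y)                       # [1 1 0 0 1 0]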

Encrypted weight bits are stored in a 1-dimensional vector format and sliced into blocks of $N_{in}$ bits as shown in Figure 3. Then, the decryption of each slice is performed by an XOR-gate network that is shared by all slices (temporally or spatially shared). Depending on the quantization scheme and the characteristics of the layers, quantized bits may need to be scaled by a scaling factor and/or reshaped. The area and latency overhead induced by XOR gates is negligible, as demonstrated in VLSI testing and parameter pruning works [25, 17, 1].

Since an XOR-gate network is shared by many weights (i.e., ${\bm{M}}^{\oplus}$ is fixed for all slices), it is difficult (if not impossible) to manually optimize an XOR-gate network for particular weight values. Hence, a random ${\bm{M}}^{\oplus}$ configuration is enough to fulfill the purpose of random number generation. In short, the XOR-gate network design is simple and straightforward.

3 FleXOR Training Algorithm for Quantization Bits Decision

Once the structure of XOR gates has been pre-determined and fixed to increase the Hamming distance of XOR outputs, we find quantized and encrypted bits by adding XOR gates into the model. In other words, we want an optimizer that understands the XOR-gate network structure so as to compute encrypted bits and scaling factors via gradient descent. For inference, we store binary encrypted weights (converted from real number encrypted weights) in memory and generate binary quantized weights through Boolean XOR operations. Activation quantization is not discussed in this paper to avoid the cases where the choice of activation quantization method affects the model accuracy.

Similar to the STE method introduced in [3], Boolean functions need to be described in a differentiable manner to obtain gradients in backward propagation. For two real-valued inputs $x_{1}$ and $x_{2}$ ($x_{1},x_{2}\in\mathbb{R}$, to be used as encrypted weights), the Boolean version of an XOR gate for forward propagation is described as (note that 0 is replaced with $-1$)

\mathcal{F}^{\oplus}(x_{1},x_{2})=(-1)\operatorname{sign}(x_{1})\operatorname{sign}(x_{2}).   (2)

For inference, we store $\operatorname{sign}(x_{1})$ and $\operatorname{sign}(x_{2})$ instead of $x_{1}$ and $x_{2}$. On the other hand, a differentiable XOR gate for backward propagation is given as

f^{\oplus}(x_{1},x_{2})=(-1)\tanh(x_{1}\cdot S_{\tanh})\tanh(x_{2}\cdot S_{\tanh}),   (3)

where $S_{\tanh}$ is a scaling factor for FleXOR. Note that $\tanh$ functions are widely used to approximate Heaviside step functions (i.e., $y(x){=}1$ if $x{>}0$, and 0 otherwise) in digital signal processing, and $S_{\tanh}$ can control the steepness. In [6, 16], $\tanh$ is also suggested as an approximation of the STE function. In our work, on the other hand, $\tanh$ is proposed to make XOR operations trainable for ‘encryption’ rather than ‘quantization.’ In the case of consecutive XOR operations, the order of inputs fed into the XOR gates should not affect the computation of partial gradients for XOR inputs. Therefore, as a simple extension of Eq. (3), a differentiable XOR-gate network with $n$ inputs can be described as

f^{\oplus}(x_{1},x_{2},\dots,x_{n})=(-1)^{n-1}\tanh(x_{1}\cdot S_{\tanh})\tanh(x_{2}\cdot S_{\tanh})\cdots\tanh(x_{n}\cdot S_{\tanh}).   (4)

Then, the partial derivative of $f^{\oplus}$ with respect to $x_{i}$ (an encrypted weight) is given as

\frac{\partial f^{\oplus}(x_{1},x_{2},\dots,x_{n})}{\partial x_{i}}=S_{\tanh}(-1)^{n-1}\left(1-\tanh^{2}(x_{i}\cdot S_{\tanh})\right)\frac{\prod_{j=1}^{n}\tanh(x_{j}\cdot S_{\tanh})}{\tanh(x_{i}\cdot S_{\tanh})}.   (5)

Note that increasing $N_{tap}$ is associated with more $\tanh$ multiplications for each XOR-gate network output. From Eq. (5), thus, increasing $N_{tap}$ may lead to the vanishing gradient problem since $|\tanh(x)|\leq 1$. To resolve this problem, we also consider a simplified alternative partial derivative expressed as

\frac{\partial f^{\oplus}(x_{1},x_{2},\dots,x_{n})}{\partial x_{i}}\approx S_{\tanh}(-1)^{n-1}\left(1-\tanh^{2}(x_{i}\cdot S_{\tanh})\right)\prod_{j\neq i}\operatorname{sign}(x_{j}).   (6)

Compared to Eq. (5), the approximation in Eq. (6) is obtained by replacing $\tanh(x_{j}\cdot S_{\tanh})$ with $\operatorname{sign}(x_{j})$. Eq. (6) shows that when we compute a partial derivative, all XOR inputs other than $x_{i}$ are assumed to be binary, i.e., the magnitude of the partial derivative is determined only by $x_{i}$. In this paper we use Eq. (6) to calculate custom gradients of encrypted weights, due to fast training computations and convergence, and Eq. (2) for forward propagation.
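A minimal PyTorch-style sketch of this forward/backward split is shown below (our own illustration under assumed values, not the released implementation): Sign^c uses $\operatorname{sign}(x)$ in the forward pass and the $\tanh$-based surrogate gradient in the backward pass, as in Algorithm 1, so that a product of Sign^c outputs yields the approximate XOR gradient of Eq. (6).

    import torch

    S_TANH = 10.0   # assumed value of the S_tanh hyper-parameter

    class SignC(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            ctx.save_for_backward(x)
            return torch.sign(x)                     # Eq. (2) building block

        @staticmethod
        def backward(ctx, grad_out):
            (x,) = ctx.saved_tensors
            # Surrogate gradient: S_tanh * (1 - tanh^2(S_tanh * x))
            return grad_out * S_TANH * (1 - torch.tanh(S_TANH * x) ** 2)

    def xor_gate(xs):
        """Differentiable XOR of n encrypted weights: (-1)^(n-1) * prod_i sign(x_i)."""
        out = torch.ones(())
        for x in xs:
            out = out * SignC.apply(x)
        return (-1) ** (len(xs) - 1) * out

    x = torch.tensor([0.03, -0.07, 0.12], requires_grad=True)
    y = xor_gate(x)                                  # binary output in {-1, +1}
    y.backward()
    print(y.item(), x.grad)                          # gradients follow Eq. (6)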

${\bm{w}}^{e}\in\mathbb{R}^{\lceil(k\cdot k\cdot C_{in}\cdot C_{out})/N_{out}\rceil\cdot N_{in}}$    ▷ Encrypted weights
${\bm{M}}^{\oplus}\in\{0,1\}^{N_{out}\times N_{in}}$    ▷ XOR gates (shared)
$\bm{\alpha}\in\mathbb{R}^{C_{out}}$    ▷ Scaling factors for each output channel

Function FleXOR_Conv($input$, $stride$, $padding$):
    for $i\leftarrow 0$ to $\lceil(k\cdot k\cdot C_{in}\cdot C_{out})/N_{out}\rceil-1$ do
        for $j\leftarrow 1$ to $N_{out}$ do
            $w^{q}_{i\cdot N_{out}+j}\leftarrow(-1)\cdot\prod_{l:\,M^{\oplus}_{j,l}=1}\left(\texttt{Sign}^{c}(w^{e}_{i\cdot N_{in}+l})\cdot(-1)\right)$    ▷ Eq. (2)
    ${\bm{\mathsfit{W}}}^{q}\leftarrow$ Reshape(${\bm{w}}^{q}$, $[k,k,C_{in},C_{out}]$)
    return Conv($input$, ${\bm{\mathsfit{W}}}^{q}$, $\bm{\alpha}$, $stride$, $padding$)    ▷ Conv. operation for binary codes

Forward Function Sign$^{c}$($x$):
    return $\operatorname{sign}(x)$

Gradient Function Sign$^{c}$($x$, $\nabla$):
    return $\nabla\cdot(1-\tanh^{2}(x\cdot S_{\tanh}))\cdot S_{\tanh}$    ▷ Eq. (6)

Algorithm 1: Pseudocode of a Conv layer with FleXOR, where the kernel size is $k\times k$ and the numbers of input and output channels are $C_{in}$ and $C_{out}$, respectively.
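The NumPy sketch below mirrors the reconstruction loop of Algorithm 1 at inference time (the layer shape, the random ${\bm{M}}^{\oplus}$ with $N_{tap}{=}2$, and the constant scaling factors are placeholder choices): each $N_{in}$-bit slice of encrypted weights is decrypted into $N_{out}$ quantized bits in $\{-1,+1\}$ and reshaped into a convolution kernel.

    import numpy as np

    rng = np.random.default_rng(0)
    k, C_in, C_out = 3, 16, 16                 # placeholder layer shape
    N_in, N_out = 8, 10                        # 0.8 bit/weight

    n_slices = -(-(k * k * C_in * C_out) // N_out)            # ceiling division
    w_enc = rng.choice([-1, 1], size=(n_slices, N_in))        # binarized encrypted weights

    M = np.zeros((N_out, N_in), dtype=int)                    # shared XOR-gate network
    for j in range(N_out):
        M[j, rng.choice(N_in, size=2, replace=False)] = 1     # N_tap = 2 per row

    w_q = np.empty((n_slices, N_out))
    for i in range(n_slices):
        for j in range(N_out):
            taps = w_enc[i, M[j] == 1]
            # XOR in {-1,+1} encoding, equivalent to the (-1) factors in Algorithm 1
            w_q[i, j] = (-1) ** (len(taps) - 1) * np.prod(taps)

    W_q = w_q.reshape(-1)[: k * k * C_in * C_out].reshape(k, k, C_in, C_out)
    alpha = np.full(C_out, 0.2)                               # per-output-channel scaling factors
    W = W_q * alpha                                           # scaled binary kernel used by Conv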

By training the whole network, including the FleXOR components, with the custom gradient computations described above, encrypted and quantized weights are obtained in a holistic manner. FleXOR operations for convolutional layers are described in Algorithm 1, where the encrypted weights (inputs of the XOR-gate network) and quantized weights (outputs of the XOR-gate network) are ${\bm{w}}^{e}$ and ${\bm{\mathsfit{W}}}^{q}$, respectively. We note that Algorithm 1 describes hardware operations (best implemented on an ASIC or FPGA) rather than instructions to be executed by CPUs or GPUs.

We first verify the basic training principles of FleXOR using LeNet-5 on the MNIST dataset. LeNet-5 consists of two convolutional layers and two fully-connected layers (specifically, 32C5-MP2-64C5-MP2-512FC-10SoftMax), and each layer is accompanied by an XOR-gate network with $N_{in}$ binary inputs and $N_{out}$ binary outputs. The quantization scheme follows a 1-bit binary code with full-precision scaling factors that are shared across the weights of the same output channel (for conv layers) or output neuron (for FC layers). Encrypted weights are randomly initialized with $\mathcal{N}(\mu{=}0,\sigma^{2}{=}0.001^{2})$. All scaling factors for quantization are initialized to $0.2$ (note that if a batch normalization layer immediately follows, the scaling factors for quantization are redundant).

Figure 4: Test accuracy and training loss (average of 6 runs) with LeNet-5 on MNIST when ${\bm{M}}^{\oplus}$ is randomly filled with $\{0,1\}$. $N_{out}$ is 10 or 20 to generate 0.4, 0.6, or 0.8 bit/weight quantization.

Using the Adam optimizer with an initial learning rate of $10^{-4}$ and a batch size of 50 without dropout, Figure 4 shows training loss and test accuracy when $S_{\tanh}{=}100$, elements of ${\bm{M}}^{\oplus}$ are randomly filled with 1 or 0, and $N_{out}$ is 10 or 20. Using the 1-bit internal quantization method and an $(N_{in},N_{out})$ encryption scheme, one weight can be represented by $N_{in}/N_{out}$ bits. Hence, Figure 4 presents training results for 0.4, 0.6, and 0.8 bits per weight. Note that for a randomly filled ${\bm{M}}^{\oplus}$, increasing $N_{out}$ (with $N_{in}$ determined correspondingly for the same compression ratio) increases the Hamming distance for any pair of rows of ${\bm{M}}^{\oplus}$ and, hence, offers the chance to produce more diversified outputs. Indeed, as shown in Figure 4, the results for $N_{out}{=}20$ present improved test accuracy and less variation compared with $N_{out}{=}10$. See Appendix for the distribution of encrypted weights at different training steps.

4 Practical FleXOR Training Techniques

In this section, we present practical training techniques for FleXOR using ResNet-32 [10] on the CIFAR-10 dataset [14]. We show compression results for ResNet-32 using fractional numbers as effective quantization bits, such as 0.4 and 1.2, that have not been available previously.

All layers, except the first and the last, are followed by FleXOR components sharing the same ${\bm{M}}^{\oplus}$ structure (thus, the storage footprint of ${\bm{M}}^{\oplus}$ is negligible). The SGD optimizer is used with a momentum of 0.9 and a weight decay factor of $10^{-5}$. The initial learning rate is 0.1, decayed by 0.5 at the 150th and 175th epochs. As the learning rate decays, $S_{\tanh}$ is empirically multiplied by 2 to cancel out the effect of weight decay on encrypted weights. The batch size is 128 and the initial scaling factors $\alpha$ are 0.2. $q$ is the number of bits representing the binary codes for quantization. We provide some useful training insights below with relevant experimental results.

1) Use small $N_{tap}$ (such as 2): A large $N_{tap}$ can induce vanishing gradients in Eq. (5) or increased approximation error in Eq. (6). In practice, hence, FleXOR training with a small $N_{tap}$ converges well with high test accuracy. Studying a training algorithm that can handle a complex XOR-gate network with large $N_{tap}$ would be an interesting research topic beyond the scope of this work. Accordingly, we use $N_{tap}{=}2$ in the remainder of this paper.

  XOR Training Method | STE                     | Analog XOR               | FleXOR
  Forward             | $\operatorname{sign}$   | $\tanh$                  | $\operatorname{sign}$
  Backward            | Identity                | $\partial(\tanh)$        | $\partial(\tanh)$
  XOR Output          | Binary ($-1$ or $+1$)   | $\mathbb{R}$, $(-1,+1)$  | Binary ($-1$ or $+1$)

Figure 5: Test accuracy comparison on ResNet-32 (for CIFAR-10) using various XOR training methods. $N_{out}{=}10$, $N_{in}{=}8$, $q{=}1$ (thus 0.8 bit/weight), and $S_{\tanh}{=}10$.

2) Use ‘$\tanh$’ rather than STE for XOR: Since forward propagation for an XOR gate only needs a $\operatorname{sign}$ function, the STE method is also applicable to XOR-gate gradient calculations. Another alternative is to use Eq. (3) for both forward and backward propagation, as if the XOR were modeled in an analog manner (real-valued XOR outputs are then quantized through STE). We compare these three XOR modeling schemes in Figure 5, with test accuracy measured when encrypted weights and XOR gates are converted to binary for inference. The FleXOR training method shows the best result because a) the $\operatorname{sign}$ function for forward propagation enables estimating the impact of binary XOR computations on the loss function, and b) $\partial(\tanh)$ for backward propagation approximates the Heaviside step function better than STE. Note that the bounded gradients of the $\tanh$ function eliminate the need for weight clipping, which is often required for quantization-aware training schemes [3, 27].

Figure 6: Test accuracy and distribution of encrypted weights (at the end of training) of ResNet-32 on CIFAR-10 using various $S_{\tanh}$ and the same $N_{out}$, $N_{in}$, and $q$ as Figure 5.

3) Optimize $S_{\tanh}$: $S_{\tanh}$ controls the smoothness of the $\tanh$ function for near-zero inputs. A large $S_{\tanh}$ yields large gradients for small inputs and, hence, results in well-clustered encrypted weight values as shown in Figure 6. Too large an $S_{\tanh}$, however, hinders encrypted weights from being finely tuned through training. For FleXOR, $S_{\tanh}$ is a hyper-parameter to be optimized empirically.

4) Learning rate and $S_{\tanh}$ warmup: The learning rate starts from 0 and increases linearly to reach the initial learning rate at a certain epoch as a warmup. Learning rate warmup is a heuristic, but it is widely accepted to improve generalization, mainly by avoiding a large learning rate in the initial phase [11, 8]. Similarly, $S_{\tanh}$ starts from 5 and increases linearly to 10 following the same warmup schedule as the learning rate.
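A short sketch of this warmup schedule follows (the 100-epoch warmup length and the targets reflect the CIFAR-10 recipe in this section; the helper itself is our illustration).

    def warmup(epoch, warmup_epochs=100, lr_target=0.1, s_start=5.0, s_target=10.0):
        """Linearly ramp the learning rate from 0 and S_tanh from 5 to their targets."""
        frac = min(epoch / warmup_epochs, 1.0)
        return lr_target * frac, s_start + (s_target - s_start) * frac

    print(warmup(0))     # (0.0, 5.0)
    print(warmup(50))    # (0.05, 7.5)
    print(warmup(100))   # (0.1, 10.0)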

Figure 7: Test accuracy of ResNet-32 on CIFAR-10 using learning rate warmup and various $q$, $N_{in}$, and $N_{out}$. The results on the right side are obtained from 5 runs.

5) Try various $q$, $N_{in}$, and $N_{out}$: Using a warmup scheme for 100 epochs and learning rate decay by 50% at the 350th, 400th, and 450th epochs, Figure 7 presents the test accuracy of ResNet-32 with various $q$, $N_{in}$, and $N_{out}$. For $q{>}1$, different ${\bm{M}}^{\oplus}$ configurations are constructed and then shared across all layers. Note that even for the 0.4 bit/weight configuration (using $q{=}1$, $N_{in}{=}8$, and $N_{out}{=}20$), high accuracy close to 89% is achieved. 0.8 bit/weight can be achieved by two different configurations (as shown on the right side of Figure 7): ($q{=}1$, $N_{in}{=}8$, $N_{out}{=}10$) or ($q{=}2$, $N_{in}{=}8$, $N_{out}{=}20$). Interestingly, these two configurations show almost the same test accuracy, which implies that FleXOR provides a linear relationship between the number of encrypted weights and model accuracy (regardless of the internal configuration). In general, lowering $q$ reduces the number of computations with quantized weights.
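As a quick reference, the effective bits/weight of a configuration follows directly from $q$, $N_{in}$, and $N_{out}$; the trivial helper below reproduces the configurations discussed above.

    def bits_per_weight(q, n_in, n_out):
        """With q internal bits and an (N_in, N_out) XOR network, each weight costs q*N_in/N_out bits."""
        return q * n_in / n_out

    print(bits_per_weight(1, 8, 20))   # 0.4
    print(bits_per_weight(1, 8, 10))   # 0.8
    print(bits_per_weight(2, 8, 20))   # 0.8 (same storage, different internal configuration)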

Table 1: Weight compression comparison of ResNet-20 and ResNet-32 on CIFAR-10. For FleXOR, we use the warmup scheme, $S_{\tanh}{=}10$, and $N_{out}{=}20$.

                        |        ResNet-20           |        ResNet-32
                        | FP      Compressed  Diff.  | FP      Compressed  Diff.
  BWN (1 bit)           | 92.68%  87.44%      -5.24  | 93.40%  89.49%      -4.51
  BinaryRelax (1 bit)   | 92.68%  87.82%      -4.86  | 93.40%  90.65%      -2.80
  LQ-Net (1 bit)        | 92.10%  90.10%      -1.90  | -       -           -
  DSQ (1 bit)           | 90.70%  90.24%      -0.56  | -       -           -
  FleXOR (1.0 bit)      | 91.87%  90.44%      -1.47  | 92.33%  91.36%      -0.97
  FleXOR (0.8 bit)      |         89.91%      -1.90  |         91.20%      -1.13
  FleXOR (0.6 bit)      |         89.16%      -2.71  |         90.43%      -1.90
  FleXOR (0.4 bit)      |         88.23%      -3.64  |         89.61%      -2.72

We compare the quantization results of ResNet-20 and ResNet-32 on CIFAR-10 using different compression schemes in Table 1 (with full-precision activations). BWN [22], BinaryRelax [28], and LQ-Net [29] propose different training algorithms for the same quantization scheme (i.e., binary codes). The main idea of these methods is to minimize quantization error and to obtain gradients from full-precision weights while the loss function is aware of quantization. Because all of the quantization schemes in Table 1 use $q{=}1$ and binary codes, the amount of computation using quantized weights is the same. FleXOR, however, allows a reduced memory footprint and bandwidth, which are critical for energy-efficient inference designs [9, 1].

Note that even though achieving the best accuracy at 1.0 bit/weight is not the main purpose of FleXOR (e.g., the XOR gates may be redundant for $N_{in}{=}N_{out}$), FleXOR shows the smallest accuracy drop for ResNet-20 and ResNet-32, as shown in Table 1. It would be an exciting research topic to study the distribution of the optimal number of quantization bits for each weight. We believe that such a distribution would be wide, with some weights requiring more than 1 bit while numerous weights need less than 1 bit, because 1) increasing $N_{in}$ and $N_{out}$ allows such distributions to be wider and enhances model accuracy even for the same compression ratio, and 2) as shown in Table 1, the model accuracy of 1-bit quantization with FleXOR is higher than that of other quantization schemes that do not include encoding schemes.

Table 2: ResNet-20 quantized by FleXOR with various ${\bm{M}}^{\oplus}$ assigned to layers ($N_{out}{=}20$ for all layers). We divide the 20 layers into three groups, excluding the first and last layers.

  $N_{in}$ (Bits/Weight)                                                                | Average Bits/Weight | Accuracy
  Layer 2-7 (13.5k params)  | Layer 8-13 (45k params)  | Layer 14-19 (180k params)      |                     |
  Fixed to be 12 (0.60)     | 12 (0.60)                | 12 (0.60)                      | 0.60                | 89.16%
  19 (0.95)                 | 19 (0.95)                | 8 (0.40)                       | 0.53                | 89.23% (+0.07)
  16 (0.80)                 | 16 (0.80)                | 8 (0.40)                       | 0.50                | 89.19% (+0.03)
  19 (0.95)                 | 16 (0.80)                | 7 (0.35)                       | 0.47                | 89.29% (+0.13)

While binary neural networks allow only 1-bit quantization as the minimum, FleXOR can assign any fractional number of quantization bits (less than 1) to different layers. Such a property is especially useful when some layers exhibit high redundancy and relatively low importance, so that a very low number of quantization bits does not degrade accuracy noticeably [4, 26]. To demonstrate mixed-precision quantization (with every layer below 1 bit) enabled by FleXOR, we conduct experiments with ResNet-20 on CIFAR-10 while employing three different XOR-gate structures (i.e., multiple configurations of ${\bm{M}}^{\oplus}$ are provided to different layer groups). Table 2 shows that FleXOR with a differently optimized ${\bm{M}}^{\oplus}$ for each layer group achieves a higher compression ratio and a smaller storage footprint compared to FleXOR with just one common ${\bm{M}}^{\oplus}$ configuration for all layers. When $N_{out}$ is fixed to 20 for all layers, due to the varied importance of each group, a small $N_{in}$ is allowed for the third group (the layers with a large number of parameters) while a relatively large $N_{in}$ is selected for the small layers. Compared to the case of $N_{in}{=}12$ for all layers (0.6 bits/weight), adaptively chosen $N_{in}$ values (i.e., 19 for layers 2-7, 16 for layers 8-13, and 7 for layers 14-19) yield higher accuracy (by 0.13%) and fewer bits/weight (by 0.13 bits/weight). As such, FleXOR facilitates a fine-grained exploration of the optimal quantization bits (given as fractional numbers determined by $N_{in}$, $N_{out}$, and $q$) that has not been available in previous binary-coding-based quantization methods.
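The ‘Average Bits/Weight’ column of Table 2 is a parameter-count-weighted mean of the per-group $N_{in}/N_{out}$; the short sketch below reproduces the 0.47 bits/weight of the last row.

    params = {"layer2-7": 13.5e3, "layer8-13": 45e3, "layer14-19": 180e3}   # parameter counts per group
    n_in = {"layer2-7": 19, "layer8-13": 16, "layer14-19": 7}               # chosen N_in per group
    N_OUT = 20

    avg_bits = sum(params[g] * n_in[g] / N_OUT for g in params) / sum(params.values())
    print(round(avg_bits, 2))   # ~0.47 bits/weight, matching the last row of Table 2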

5 Experimental Results on ImageNet

In order to show that the FleXOR principles extend to larger models, we choose ResNet-18 on ImageNet [23]. We use the SGD optimizer with a momentum of 0.9 and an initial learning rate of 0.1. The batch size is 128, the weight decay factor is $10^{-5}$, and $S_{\tanh}$ is 10. The learning rate is reduced by half at the 70th, 100th, and 130th epochs. For warmup, during the initial ten epochs, $S_{\tanh}$ and the learning rate increase linearly from 5 and 0.0, respectively, to their initial values.

Table 3: Weight compression comparison of ResNet-18 on ImageNet.

  Methods                    | Bits/Weight    | Top-1  | Top-5  | Storage Saving
  Full Precision [10]        | 32             | 69.6%  | 89.2%  | 1$\times$
  BWN [22]                   | 1              | 60.8%  | 83.0%  | $\sim$32$\times$
  ABC-Net [20]               | 1              | 62.8%  | 84.4%  | $\sim$32$\times$
  BinaryRelax [28]           | 1              | 63.2%  | 85.1%  | $\sim$32$\times$
  DSQ [7]                    | 1              | 63.7%  | -      | $\sim$32$\times$
  FleXOR ($N_{out}{=}20$)    | 0.8            | 63.8%  | 84.8%  | $\sim$40$\times$
                             | 0.63 (mixed)*  | 63.3%  | 84.5%  | $\sim$50.8$\times$
                             | 0.6            | 62.0%  | 83.7%  | $\sim$53$\times$

  * To the 4 groups of 3$\times$3 conv layers in ResNet-18 (except the first conv layer connected to the inputs), we assign 0.9, 0.8, 0.7, and 0.6 bits/weight, respectively. To the remaining 1$\times$1 conv layers (performing downsampling), we assign 0.95, 0.9, and 0.8 bits/weight, respectively.
Figure 8: Test accuracy (Top-1) of ResNet-18 on ImageNet using FleXOR.

Figure 8 depicts the test accuracy of ResNet-18 on ImageNet for ($q{=}1$, $N_{in}{=}16$, $N_{out}{=}20$) and ($q{=}1$, $N_{in}{=}12$, $N_{out}{=}20$). Refer to the Appendix for more results with $q{=}2$. Table 3 compares the model accuracy of ResNet-18 when weights are compressed by quantization (and additional encryption by FleXOR) while activations maintain full precision. Training ResNet-18 including the FleXOR components is performed successfully. In Table 3, BinaryRelax and BWN do not modify the underlying model architecture, while ABC-Net introduces a new convolution block structure for quantized network designs. FleXOR achieves the best top-1 accuracy even with only 0.8 bit/weight and demonstrates improved model accuracy as the number of bits per weight increases.

We acknowledge that there are numerous other methods to reduce the size of neural networks. For example, low-rank approximation and parameter pruning could be additionally applied to reduce the size further. We believe that such methods are orthogonal to our proposed method.

6 Conclusion

This paper proposes an encryption algorithm/architecture, FleXOR, as a framework to further compress quantized weights. Encryption is designed to produce more outputs than inputs by increasing the Hamming distance between the output functions, which are linear functions of the inputs. These output functions are implemented as a combination of XOR gates that are included in the model so that encrypted and quantized weights are found through gradient descent, using the $\tanh$ function for backward propagation. FleXOR enables fractional numbers of bits for weights and, thus, much wider trade-offs between weight storage and model accuracy. Experimental results show that ResNet models on CIFAR-10 and ImageNet can be compressed to sub 1-bit/weight with high accuracy.

Broader Impact

Due to rapid advances in developing neural networks with higher model accuracy and increasingly complicated tasks to support, the size of DNNs is growing exponentially. Our work facilitates the deployment of large DNN applications in various forms, including mobile devices, because of its powerful model compression ratio. On the positive side, a huge amount of the energy consumed to run model inference can be saved by our proposed quantization and encryption techniques. Also, many computing systems based on binary neural networks can improve their model accuracy. We expect that many useful DNN models will become available on low-cost devices. On the other hand, some common concerns about DNNs, such as privacy breaches and heavy surveillance, could be worsened by DNN devices that become more economically available through our proposed techniques.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable comments on our manuscript.

References

  • Ahn et al. [2019] D. Ahn, D. Lee, T. Kim, and J.-J. Kim. Double Viterbi: Weight encoding for high compression ratio and fast on-chip reconstruction for deep neural network. In International Conference on Learning Representations (ICLR), 2019.
  • Choi et al. [2017] Y. Choi, M. El-Khamy, and J. Lee. Towards the limit of network quantization. In International Conference on Learning Representations (ICLR), 2017.
  • Courbariaux et al. [2015] M. Courbariaux, Y. Bengio, and J.-P. David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123–3131, 2015.
  • Dong et al. [2019] Z. Dong, Z. Yao, A. Gholami, M. W. Mahoney, and K. Keutzer. HAWQ: Hessian aware quantization of neural networks with mixed-precision. In Proceedings of the IEEE International Conference on Computer Vision, pages 293–302, 2019.
  • Frankle and Carbin [2019] J. Frankle and M. Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR), 2019.
  • Gong et al. [2019a] R. Gong, X. Liu, S. Jiang, T. Li, P. Hu, J. Lin, F. Yu, and J. Yan. Differentiable soft quantization: Bridging full-precision and low-bit neural networks. arXiv:1908.05033, 2019a.
  • Gong et al. [2019b] R. Gong, X. Liu, S. Jiang, T. Li, P. Hu, J. Lin, F. Yu, and J. Yan. Differentiable soft quantization: Bridging full-precision and low-bit neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4852–4861, 2019b.
  • Gotmare et al. [2019] A. Gotmare, N. S. Keskar, C. Xiong, and R. Socher. A closer look at deep learning heuristics: Learning rate restarts, warmup and distillation. In International Conference on Learning Representations (ICLR), 2019.
  • Han et al. [2016] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations (ICLR), 2016.
  • He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • He et al. [2018] T. He, Z. Zhang, H. Zhang, Z. Zhang, J. Xie, and M. Li. Bag of tricks to train convolutional neural networks for image classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • Jung et al. [2019] S. Jung, C. Son, S. Lee, J. Son, J.-J. Han, Y. Kwak, S. J. Hwang, and C. Choi. Learning to quantize deep networks by optimizing quantization intervals with task loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4350–4359, 2019.
  • Kahn et al. [1988] J. Kahn, G. Kalai, and N. Linial. The influence of variables on boolean functions. In Proceedings of the 29th Annual Symposium on Foundations of Computer Science, SFCS ’88, pages 68–80, 1988.
  • Krizhevsky et al. [2009] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • Kwon et al. [2020] S. J. Kwon, D. Lee, B. Kim, P. Kapoor, B. Park, and G.-Y. Wei. Structured compression by weight encryption for unstructured pruning and quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1909–1918, 2020.
  • Lahoud et al. [2019] F. Lahoud, R. Achanta, P. Márquez-Neila, and S. Süsstrunk. Self-binarizing networks. arXiv:1902.00730, 2019.
  • Lee et al. [2018] D. Lee, D. Ahn, T. Kim, P. I. Chuang, and J.-J. Kim. Viterbi-based pruning for sparse matrix with fixed and high index compression ratio. In International Conference on Learning Representations (ICLR), 2018.
  • Li and Liu [2016] F. Li and B. Liu. Ternary weight networks. arXiv:1605.04711, 2016.
  • Li et al. [2017] H. Li, S. De, Z. Xu, C. Studer, H. Samet, and T. Goldstein. Training quantized nets: A deeper understanding. In Advances in Neural Information Processing Systems, pages 5813–5823, 2017.
  • Lin et al. [2017] X. Lin, C. Zhao, and W. Pan. Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, pages 345–353, 2017.
  • Polino et al. [2018] A. Polino, R. Pascanu, and D. Alistarh. Model compression via distillation and quantization. In International Conference on Learning Representations (ICLR), 2018.
  • Rastegari et al. [2016] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016.
  • Russakovsky et al. [2015] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
  • Stock et al. [2019] P. Stock, A. Joulin, R. Gribonval, B. Graham, and H. Jégou. And the bit goes down: Revisiting the quantization of neural networks. arXiv:1907.05686, 2019.
  • Touba [2006] N. A. Touba. Survey of test vector compression techniques. IEEE Design & Test of Computers, 23:294–303, 2006.
  • Wang et al. [2019] K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han. HAQ: Hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8612–8620, 2019.
  • Xu et al. [2018] C. Xu, J. Yao, Z. Lin, W. Ou, Y. Cao, Z. Wang, and H. Zha. Alternating multi-bit quantization for recurrent neural networks. In International Conference on Learning Representations (ICLR), 2018.
  • Yin et al. [2018] P. Yin, S. Zhang, J. Lyu, S. Osher, Y. Qi, and J. Xin. BinaryRelax: A relaxation approach for training deep neural networks with quantized weights. SIAM Journal on Imaging Sciences, 11(4):2205–2223, 2018.
  • Zhang et al. [2018] D. Zhang, J. Yang, D. Ye, and G. Hua. LQ-Nets: Learned quantization for highly accurate and compact deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 365–382, 2018.
  • Zhu et al. [2017] C. Zhu, S. Han, H. Mao, and W. J. Dally. Trained ternary quantization. In International Conference on Learning Representations (ICLR), 2017.

Appendix A Example of an XOR-gate Network Structure Representation

In Figure 2, outputs of an XOR-gate network are given as

y_{1}=x_{1}\oplus x_{3}\oplus x_{4}
y_{2}=x_{1}\oplus x_{2}
y_{3}=x_{1}\oplus x_{2}\oplus x_{3}
y_{4}=x_{3}\oplus x_{4}
y_{5}=x_{2}\oplus x_{4}
y_{6}=x_{2}\oplus x_{3}\oplus x_{4}.

Equivalently, the same structure as above can be represented in a matrix as

{\bm{M}}^{\oplus}=\begin{bmatrix}1&0&1&1\\ 1&1&0&0\\ 1&1&1&0\\ 0&0&1&1\\ 0&1&0&1\\ 0&1&1&1\end{bmatrix}.   (7)

Note that the elements of ${\bm{M}}^{\oplus}$ match the coefficients of $y_{i}$ ($1\leq i\leq 6$). For the two vectors ${\bm{y}}=(y_{1},y_{2},y_{3},y_{4},y_{5},y_{6})$ and ${\bm{x}}=(x_{1},x_{2},x_{3},x_{4})$, the following equation holds:

{\bm{y}}={\bm{M}}^{\oplus}\cdot{\bm{x}},   (8)

where element-wise addition and multiplication are performed by the ‘XOR’ and ‘AND’ functions, respectively. In Eq. (7), $N_{tap}$ (i.e., the number of ‘1’s in a row) is 2 or 3.

Appendix B Supplementary Data for Basic FleXOR Training Principles

A Boolean XOR gate can be modeled as $\mathcal{F}^{\oplus}(x_{1},x_{2})=(-1)\operatorname{sign}(x_{1})\operatorname{sign}(x_{2})$ if 0 is replaced with $-1$, as shown in Table 4.

  $\operatorname{sign}(x_{1})$ | $\operatorname{sign}(x_{2})$ | $\mathcal{F}^{\oplus}(x_{1},x_{2})$
  $-1$                         | $-1$                         | $-1$
  $-1$                         | $+1$                         | $+1$
  $+1$                         | $-1$                         | $+1$
  $+1$                         | $+1$                         | $-1$

Table 4: An XOR gate modeled using $\mathcal{F}^{\oplus}(x_{1},x_{2})$.

In Eq. (7), forward propagation for $y_{3}$ is expressed as

y_{3}=\mathcal{F}^{\oplus}(x_{1},x_{2},x_{3})=(-1)^{2}\operatorname{sign}(x_{1})\operatorname{sign}(x_{2})\operatorname{sign}(x_{3}),   (9)

while the partial derivative of $y_{3}$ with respect to $x_{1}$ is given (not derived from Eq. (9)) as

\frac{\partial y_{3}}{\partial x_{1}}=S_{\tanh}(-1)^{2}\left(1-\tanh^{2}(x_{1}\cdot S_{\tanh})\right)\tanh(x_{2}\cdot S_{\tanh})\tanh(x_{3}\cdot S_{\tanh}),   (10)

or as

\frac{\partial y_{3}}{\partial x_{1}}\approx S_{\tanh}(-1)^{2}\left(1-\tanh^{2}(x_{1}\cdot S_{\tanh})\right)\operatorname{sign}(x_{2})\operatorname{sign}(x_{3}).   (11)

We choose Eq. (11), instead of Eq. (10), as explained in Section 3.
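As a numeric illustration of this choice (arbitrary example inputs, with $S_{\tanh}{=}10$ as in the CIFAR-10/ImageNet recipes), the sketch below evaluates the binary forward output of Eq. (9) and the approximate gradient of Eq. (11).

    import numpy as np

    S_tanh = 10.0
    x1, x2, x3 = 0.03, -0.07, 0.12          # real-valued encrypted weights
    sign = lambda v: 1.0 if v >= 0 else -1.0

    # Forward (Eq. (9)): binary XOR output in {-1, +1}
    y3 = (-1) ** 2 * sign(x1) * sign(x2) * sign(x3)

    # Approximate partial derivative with respect to x1 (Eq. (11))
    dy3_dx1 = S_tanh * (-1) ** 2 * (1 - np.tanh(x1 * S_tanh) ** 2) * sign(x2) * sign(x3)
    print(y3, dy3_dx1)                      # -1.0 and roughly -9.15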

Figure 9: The left graph shows hyperbolic tangent curves $y=\tanh(x\cdot S_{\tanh})$ with various scaling factors $S_{\tanh}$. The right graph shows their derivatives. These graphs support the arguments of ‘Optimize $S_{\tanh}$’ in Section 4.
Figure 10: An example showing FleXOR operations for training. XOR gates are described in different ways for forward- and backward propagation. Once we obtain encrypted binary weights after training, we use digital XOR gates for inference.
Figure 11: Using the same weight storage footprint, FleXOR enables various internal quantization schemes. (Left): 1-bit internal quantization. (Right): 3-bit internal quantization with 3 different ${\bm{M}}^{\oplus}$ configurations.

As shown in Figure 9, a large $S_{\tanh}$ yields sharp transitions for near-zero inputs. Such a sharp approximation of the Heaviside step function produces large gradient values for small inputs and encourages encrypted weights to be separated into negative or positive values. Too large an $S_{\tanh}$, however, causes the same issues as a too-large learning rate.

Figure 12: Test accuracy and training loss of LeNet-5 on MNIST when the number of ‘1’s in each row of ${\bm{M}}^{\oplus}$ is fixed to be 2 ($N_{tap}{=}2$). $N_{out}$ is 10 or 20 to generate, effectively, 0.4, 0.6, or 0.8 bit/weight quantization. With a low $N_{tap}$ of ${\bm{M}}^{\oplus}$, MNIST training presents less variation in training loss and test accuracy than in Figure 4.

Figure 12 presents training loss and test accuracy when $N_{tap}{=}2$ and $N_{out}$ is 10 or 20. Compared with Figure 4, $N_{tap}{=}2$ presents improved accuracy for the high-compression configurations (e.g., $N_{in}{=}4$ and $N_{out}{=}10$). We use $N_{tap}{=}2$ for CIFAR-10 and ImageNet, since a low $N_{tap}$ avoids the vanishing gradient problem of Eq. (5) and the high approximation error of Eq. (6).

Figure 13: Distribution of encrypted weight values for the FC1 layer of LeNet-5 at different training steps using $S_{\tanh}{=}100$ and $N_{out}{=}10$. (Left): ${\bm{M}}^{\oplus}$ is randomly filled ($N_{tap}\approx N_{in}/2$). (Right): $N_{tap}{=}2$ for every row of ${\bm{M}}^{\oplus}$.

Figure 13 plots the distribution of encrypted weights at different training steps when each row of ${\bm{M}}^{\oplus}$ is randomly assigned with $\{0,1\}$ (i.e., $N_{tap}$ is $N_{in}/2$ on average) or assigned with only two 1's ($N_{tap}{=}2$). Due to gradient calculations based on $\tanh$ and a high $S_{\tanh}$, encrypted weights tend to be clustered on the left or right (near-zero encrypted weights become fewer as $N_{tap}$ increases), even without weight clipping.

Appendix C Supplementary Experimental Results on CIFAR-10 and ImageNet

In this section, we provide additional graphs and accuracy tables for ResNet models on CIFAR-10 and ImageNet. We also present experimental results from wider hyper-parameter searches, including $q{=}2$ with two separate ${\bm{M}}^{\oplus}$ configurations (with the same $N_{in}$ and $N_{out}$ for the two ${\bm{M}}^{\oplus}$ matrices).

(a) First convolution layer in Layer1
(b) Last convolution layer in Layer1
(c) First convolution layer in Layer2
(d) Last convolution layer in Layer2
Figure 14: Distributions of encrypted weights (at the end of training) in various layers of ResNet-32 on CIFAR-10 using various $S_{\tanh}$ and the same $N_{out}$, $N_{in}$, and $q$ as Figure 7. The ResNet-32 network mainly consists of three layers according to the feature map sizes: Layer1, Layer2, and Layer3.
(a) Initial Learning Rate (0.1): Test accuracy of ResNet-32 on CIFAR-10 using the learning schedule in Figure 7 and various initial learning rates (0.05, 0.1, 0.2, 0.5).
(b) No Weight Clipping: Test accuracy of ResNet-32 on CIFAR-10 using the learning schedule in Figure 7. For weight clipping, we restrict the encrypted weights to the range $(-2.0/S_{\tanh}, +2.0/S_{\tanh})$. As can be observed, the red line implies that weight clipping is not effective with FleXOR.
(c) Weight Decay Factor ($10^{-5}$): The two graphs depict test accuracy of ResNet-18 on ImageNet with or without weight decay. The learning rate of the red line (no weight decay) is reduced by half at the 100th, 130th, and 150th epochs. The learning rate of the blue line (with weight decay) is reduced by half at the 70th, 100th, and 130th epochs. With weight decay (blue graph), despite slow convergence in the early training steps, model accuracy is eventually higher than without weight decay (red graph).
Figure 15: Comparison of various hyper-parameter choices for CIFAR-10 or ImageNet.
(a) Test accuracy using $q{=}1$.
(b) Test accuracy using $q{=}2$. Compared to the plots above (Figure 16(a)), this figure shows that a combination of multiple ${\bm{M}}^{\oplus}$ for a binary code can lead to stable learning curves and higher model accuracy.
Figure 16: Test accuracy of ResNet-32 on CIFAR-10 using learning rate warmup (for 100 epochs) and $N_{out}{=}20$.
  Configuration                | Bits/Weight | ResNet-20        | ResNet-32        | Comp. Ratio
                               |             | Acc.     Diff.   | Acc.     Diff.   |
  FP                           | 32          | 91.87%           | 92.33%           | 1.0$\times$
  $N_{in}$=10, $N_{out}$=10    | 1.0         | 90.21%   -1.66   | 91.40%   -0.93   | 29.95$\times$
  $N_{in}$=9, $N_{out}$=10     | 0.9         | 90.03%   -1.84   | 91.28%   -1.05   | 31.82$\times$
  $N_{in}$=8, $N_{out}$=10     | 0.8         | 89.73%   -2.14   | 90.96%   -1.37   | 35.32$\times$
  $N_{in}$=7, $N_{out}$=10     | 0.7         | 89.88%   -1.99   | 90.67%   -1.66   | 39.68$\times$
  $N_{in}$=6, $N_{out}$=10     | 0.6         | 89.21%   -2.66   | 90.41%   -1.92   | 45.27$\times$
  $N_{in}$=5, $N_{out}$=10     | 0.5         | 88.59%   -3.28   | 89.95%   -2.38   | 52.70$\times$

Table 5: Weight compression comparison of ResNet-20 and ResNet-32 on CIFAR-10 when $N_{out}{=}10$. Parameters and recipes not described in the table are the same as in Table 1. We also present the compression ratio for fractionally quantized ResNet-32 when one scaling factor ($\alpha$) is assigned to each output channel.
                                     |        ResNet-20          |        ResNet-32
                                     | FP      Quant.    Diff.   | FP      Quant.    Diff.
  TWN (ternary)                      | 92.68%  88.65%    -4.03   | 93.40%  90.94%    -2.46
  BinaryRelax (ternary)              | 92.68%  90.07%    -1.91   | 93.40%  92.04%    -1.36
  TTQ (ternary)                      | 91.77%  91.13%    -0.64   | 92.33%  92.37%    +0.04
  LQ-Net (2 bit)                     | 92.10%  91.80%    -0.30   | -       -         -
  FleXOR ($q{=}2$, $N_{out}{=}20$)
  $N_{in}$=20, 2.0 bit/weight        | 91.87%  91.38%    -0.49   | 92.33%  92.25%    -0.08
  $N_{in}$=18, 1.8 bit/weight        |         91.00%    -0.87   |         92.27%    -0.06
  $N_{in}$=16, 1.6 bit/weight        |         90.88%    -0.99   |         92.11%    -0.22
  $N_{in}$=14, 1.4 bit/weight        |         90.90%    -0.97   |         92.02%    -0.31
  $N_{in}$=12, 1.2 bit/weight        |         90.56%    -1.31   |         91.62%    -0.71
  FleXOR ($q{=}2$, $N_{out}{=}10$)
  $N_{in}$=10, 2.0 bit/weight        | 91.87%  91.19%    -0.68   | 92.33%  92.61%    +0.28
  $N_{in}$=9, 1.8 bit/weight         |         91.44%    -0.43   |         92.09%    -0.24
  $N_{in}$=8, 1.6 bit/weight         |         91.10%    -0.77   |         92.08%    -0.25
  $N_{in}$=7, 1.4 bit/weight         |         90.94%    -0.93   |         91.74%    -0.59
  $N_{in}$=6, 1.2 bit/weight         |         90.56%    -1.31   |         91.37%    -0.96

Table 6: Weight compression comparison of ResNet-20 and ResNet-32 on CIFAR-10 using learning rate warmup (for 100 epochs) and $q{=}2$. As illustrated in Figure 11, multiple ${\bm{M}}^{\oplus}$ can be combined for multi-bit quantization schemes; the number of scaling factors is then doubled. FleXOR with $q{=}2$ and two different ${\bm{M}}^{\oplus}$ structures achieves full-precision accuracy when both $N_{in}$ and $N_{out}$ are 10.
  Methods                             | Bits/Weight          | Top-1  | Top-5
  Full Precision [10]                 | 32                   | 69.6%  | 89.2%
  TWN [18]                            | ternary              | 61.8%  | 84.2%
  ABC-Net [20]                        | 2                    | 63.7%  | 85.2%
  BinaryRelax [28]                    | ternary              | 66.5%  | 87.3%
  TTQ (1.5$\times$ Wide) [30]         | ternary              | 66.6%  | 87.2%
  LQ-Net [29]                         | 2                    | 68.0%  | 88.0%
  QIL [12]                            | 2                    | 68.1%  | 88.3%
  FleXOR ($q{=}2$, $N_{out}{=}20$)    | 1.6 (0.8$\times$2)   | 66.2%  | 86.7%
                                      | 1.2 (0.6$\times$2)   | 65.4%  | 86.0%
                                      | 0.8 (0.4$\times$2)   | 63.8%  | 85.0%

Table 7: Weight compression comparison of ResNet-18 on ImageNet when $q{=}2$. Since $q$ is 2, we also list other compression schemes that use 2-bit or ternary quantization for model compression.