
SiMaN: Sign-to-Magnitude Network Binarization

Mingbao Lin, Rongrong Ji, Zihan Xu, Baochang Zhang,
Fei Chao, Chia-Wen Lin, and Ling Shao
M. Lin, R. Ji (Corresponding Author), Z. Xu and F. Chao are with the Media Analytics and Computing Laboratory, Department of Artificial Intelligence, School of Informatics, Xiamen University, Xiamen 361005, China (e-mail: [email protected]). M. Lin and Z. Xu are also with the Tencent Youtu Lab, Shanghai 200233, China. B. Zhang is with the Zhongguancun Lab, Beijing 100190, China. C.-W. Lin is with the Department of Electrical Engineering and the Institute of Communications Engineering, National Tsing Hua University, Hsinchu 30013, Taiwan. L. Shao is with Terminus Group, China. Manuscript received April 19, 2005; revised August 26, 2015.
Abstract

Binary neural networks (BNNs) have attracted broad research interest due to their efficient storage and computational ability. Nevertheless, a significant challenge of BNNs lies in handling discrete constraints while ensuring bit entropy maximization, which typically makes their weight optimization very difficult. Existing methods relax the learning using the sign function, which simply encodes positive weights into +1s, and -1s otherwise. Alternatively, we formulate an angle alignment objective that constrains the weight binarization to \{0,+1\} to solve the challenge. In this paper, we show that our weight binarization provides an analytical solution by encoding high-magnitude weights into +1s, and 0s otherwise. Therefore, a high-quality discrete solution is established in a computationally efficient manner without the sign function. We prove that the learned weights of binarized networks roughly follow a Laplacian distribution that does not allow entropy maximization, and further demonstrate that it can be effectively solved by simply removing the \ell_{2} regularization during network training. Our method, dubbed sign-to-magnitude network binarization (SiMaN), is evaluated on CIFAR-10 and ImageNet, demonstrating its superiority over the sign-based state-of-the-arts. Our source code, experimental settings, training logs and binary models are available at https://github.com/lmbxmu/SiMaN.

Index Terms:
Binary neural network, network binarization, weight magnitude, angular alignment, network compression & acceleration, network quantization.

1 Introduction

Deep neural networks (DNNs), especially convolutional neural networks (CNNs), have been effectively used in many computer vision tasks, such as image recognition [1, 2, 3], object detection [4, 5, 6], and semantic segmentation [7, 8, 9]. Nowadays, DNNs are mostly trained on high-capacity but power-hungry graphics processing units (GPUs); however, such DNN models often fail to run on low-power devices, such as cell phones and Internet-of-Things (IoT) devices, that have become ubiquitous in modern society. As a result, substantial efforts have been invested in reducing model redundancy while retaining comparable or even better accuracy than the full model, such that the compressed model can be easily deployed on these resource-limited devices.

Typical methods for reducing the model redundancy include, but are not limited to: (1) Weight pruning discards individual weights in the filters or connections across different layers, and then reshapes the model in a sparse format [10, 11]. (2) Filter pruning resorts to directly removing all weights in a filter and the corresponding channel in the next layer [12, 13]. (3) Compact network designs, such as ShuffleNets [14, 15], MobileNets [16, 17, 18] and GhostNet [19], choose to directly build parameter-efficient neural network models. (4) Tensor decomposition approximates the weight tensor with a series of low-rank matrices, which are then reorganized in a sum-product form [20, 21] to recover the original weight tensor. (5) Low-precision quantization aims to compress the model by reducing the number of bits used to represent the weight parameters of the pre-trained models [22, 23, 24].

In particular, binary neural networks (BNNs), which quantize their weights and activations in a 1-bit binary form, have attracted increasing attention for two major reasons: 1) The memory usage of a BNN is 32× lower than that of its full-precision counterpart, since the weights of the latter are stored in a 32-bit floating-point form. 2) A significant reduction in computational complexity can be achieved by executing efficient XNOR and bitcount operations, e.g., up to 58× speed-ups on CPUs as reported in [25]. Despite these two merits, BNNs are also notorious for their significant performance degradation. For example, XNOR-Net [25] suffers an approximately 18% drop in top-1 accuracy when binarizing ResNet-18 on the ImageNet classification task [26]. The poor performance greatly hinders the deployment of BNNs in real-world applications.

One of the major obstacles in constructing a high-performing BNN is the discrete constraint imposed on the pursued binary weights, which challenges the weight optimization. Meanwhile, BNNs also require the two possible values of binarized weights to be uniformly (half-half) distributed to ensure bit entropy maximization. To this end, most existing approaches simply employ the sign function to binarize weights, where positive weights are encoded into +1s, and -1s otherwise [25, 27, 28, 29, 24]. To retain the entropy of information, recent methods, such as Bayesian optimization [30], rotation matrix [24], and weight standardization [29], learn a two-mode distribution for real-valued weights to increase the probability of encoding one half of the weights into +1s and the other half into -1s by the sign function. These strategies, however, increase the learning complexity, since the optimization involves additional training loss terms and variables. Moreover, it is unclear whether the simple usage of the sign function is the optimal encoding option for the weight binarization process.

Figure 1: (a) Early works [31, 32] suffer from a large quantization error caused by both the norm gap and the angular bias between the full-precision weights and their binarized versions. (b) Recent works [25, 29] introduce a scaling factor to reduce the norm gap but cannot reduce the angular bias, i.e., \theta. Therefore, the quantization error \|\mathbf{w}\sin\theta\|^{2} is still large when \theta is large.

Another obstacle in learning BNNs comes from the large quantization error between the full-precision weight vector \mathbf{w} and its binary vector \mathbf{b} [31, 32], as illustrated in Fig. 1(a). To solve this, state-of-the-art approaches [25, 29] introduce a per-channel learnable/optimizable scaling factor \lambda to decrease the quantization error

\mathop{\min}_{\lambda,\mathbf{b}}\|\lambda\mathbf{b}-\mathbf{w}\|^{2}. \quad (1)

However, as revealed in the earlier version of this paper [24], the introduction of \lambda only partly mitigates the quantization error by compensating for the norm gap between the full-precision weight and its binarized version, but cannot reduce the quantization error caused by an angular bias, as shown in Fig. 1(b). Apparently, with a fixed angular bias \theta, Eq. (1) reaches its minimum when \lambda\mathbf{b}-\mathbf{w} is orthogonal to \lambda\mathbf{b}, and we have

\|\mathbf{w}\sin\theta\|^{2}\leq\|\lambda\mathbf{b}-\mathbf{w}\|^{2}. \quad (2)

Thus, \|\mathbf{w}\sin\theta\|^{2} serves as a lower bound of the quantization error and cannot be diminished as long as the angular bias exists. This lower bound can be huge when the angular bias \theta is large. Though the training process updates the weights and may close the angular bias, we experimentally observe that the possibility of this case is small, as illustrated by XNOR-Net [25] in Fig. 3. Thus, it is natural to further reduce this angular error so as to minimize the quantization error and obtain better BNN performance.
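As a quick numerical illustration of this lower bound, the following minimal NumPy sketch (with a randomly drawn weight vector, purely for illustration) computes the quantization error of Eq. (1) under the optimal scaling factor and compares it with the bound of Eq. (2); the two coincide, confirming that the scaling factor removes the norm gap but not the angular bias.

import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(64)                 # a full-precision weight vector
b = np.sign(w)                              # sign-based binarization in {-1, +1}
b[b == 0] = 1.0

lam = (w @ b) / (b @ b)                     # optimal scaling factor for Eq. (1)
quant_err = np.sum((lam * b - w) ** 2)      # quantization error with the optimal lambda

cos_theta = (w @ b) / (np.linalg.norm(w) * np.linalg.norm(b))
lower_bound = np.sum(w ** 2) * (1.0 - cos_theta ** 2)   # ||w sin(theta)||^2 in Eq. (2)
print(quant_err, lower_bound)               # the two values are equal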

To solve the angular bias, the earlier version [24] proposed an angle-alignment-based learning objective, originally formulated as

\mathop{\arg\max}_{\mathbf{R}}\;\frac{\operatorname{sign}(\mathbf{R}^{T}\mathbf{w})^{T}(\mathbf{R}^{T}\mathbf{w})}{\|\operatorname{sign}(\mathbf{R}^{T}\mathbf{w})\|_{2}\,\|\mathbf{R}^{T}\mathbf{w}\|_{2}},\quad s.t.\quad\mathbf{R}^{T}\mathbf{R}=\mathbf{I}_{n}, \quad (3)

where \mathbf{R} is constrained to be an n-order rotation matrix. As shown in Fig. 2(a), by applying the sign function to the rotated weight vector \mathbf{R}^{T}\mathbf{w}, we attain the binarization of \mathbf{w}, i.e., \mathbf{b}_{w}=\operatorname{sign}(\mathbf{R}^{T}\mathbf{w}). Thus, Eq. (3) aims to learn a rotation matrix such that the angle bias between the rotated weight vector and its encoded binarization is reduced, as illustrated by RBNN [24] (conference version) in Fig. 3. Though a great reduction in quantization error was quantitatively measured in [24], the learning complexity of the rotation matrix \mathbf{R} is very high due to the non-convexity of Eq. (3). Thus, an alternating optimization approach was developed. Nevertheless, the alternating optimization results in sub-optimal binarization. Moreover, the optimization is still built upon the sign function. Note that, in Fig. 3, we also train XNOR-Net and RBNN with the two-step training paradigm [33]. We can see that the angular bias remains similar to that obtained with the commonly used from-scratch training. Thus, training BNNs with different strategies does not correct the angular bias.

Figure 2: Comparison between (a) the preliminary version of RBNN [24] and (b) the extended version, SiMaN, in this paper. RBNN learns a rotation matrix \mathbf{R} first, and then applies the sign function to binarize the rotated weight, \mathbf{b}_{w}=\operatorname{sign}(\mathbf{R}^{T}\mathbf{w})\in\{-1,+1\}^{n}. In contrast, the presented SiMaN involves the magnitude of the weight, and then discretely learns \bar{\mathbf{b}}_{w}\in\{0,+1\}^{n}.

In this paper, a novel sign-to-magnitude network binarization (SiMaN) is proposed to discretely encode DNNs, leading to improved accuracy. Within our method, we reformulate the angle alignment objective in the conference version [24], which aims to maximize the cosine similarity between the full-precision weight vector and its encoded binarization. Different from existing works that binarize weights into \{-1,+1\} by the sign function, our binarization falls into \{0,+1\}, as illustrated in Fig. 2(b). In this way, we reveal that the globally analytical binarization for our angle alignment can be found in a computationally efficient manner of \mathcal{O}(n\log n) by quantizing into +1s the high-magnitude weights, and 0s otherwise, therefore enabling weight binarization without the sign function. To the best of our knowledge, we prove for the first time that the learned real-valued weights roughly follow a Laplacian distribution, which results in around 37% of the weights being encoded into +1s. This prevents the BNN from maximizing the entropy of information. To solve this, we do not add a term to the loss function, since this would increase the optimization difficulty. Alternatively, we analyze the intrinsic numerical values of the weights, and show that the simple removal of the \ell_{2} regularization destroys the Laplacian distribution, and thus enhances the half-half weight binarization. As a result, the final binarization is obtained by encoding into +1s the weights with the top-half magnitudes, and 0s otherwise, which further reduces the computational complexity from \mathcal{O}(n\log n) to \mathcal{O}(n).

Figure 3: Cosine similarity between the full-precision weight vector and the corresponding binary vector in various layers of ResNet-20.

A preliminary conference version of this work was presented in [24]. The main contributions of this paper are as follows.

  • A new learning objective based on the angle alignment is proposed and a magnitude-based analytical solution for BNNs is developed in a computationally efficient manner.

  • We formally prove that the learned weights in BNNs follow a Laplacian distribution, which, as revealed, prevents the maximization of bit entropy.

  • A detailed analysis of the numerical values of weights shows that simply removing the \ell_{2} regularization helps maximize the bit entropy while further reducing the computational complexity.

  • Experiments on CIFAR-10 [34] and ImageNet [26] demonstrate that our sign-to-magnitude framework for network binarization outperforms the traditional sign-based binarization.

2 Related Work

Following the introduction of pioneering research [32] where the sign function and the straight-through estimator (STE) [35] are respectively adopted for the forward weight/activation binarization and backward gradient updating, BNNs have emerged as one of the most appealing approaches for the deployment of DNNs in resource-limited devices. As such, great efforts have been put into closing the gap between full-precision networks and their BNNs. In what follows, we briefly review some related works. A comprehensive overview can be found in the survey papers [36, 37].

XNOR-Net [25] introduces two scaling factors for channel-wise weights and activations to minimize the quantization error. Inspired by this, XNOR-Net++ [38] improves the performance by integrating the two scaling factors into one, which is then updated using standard gradient propagation. Beyond the scaling factors, RBNN [24] further reduces the quantization error by optimizing the angle difference between a full-precision weight vector and its binarization. Xu et al. [39] observed “dead weights” in binary neural networks and proposed to mitigate the quantization error by clipping large-magnitude weights to a fixed element. To enable gradient propagation and reduce the “gradient mismatch” caused by the STE [35], several works, such as the swish function [40], piece-wise polynomial function [28], and error decay estimator [29], formulate the forward/backward quantization as a differentiable non-linear mapping. FDA [41] estimates the gradient of the sign function in the Fourier frequency domain using a combination of sine functions for training BNNs.

Another direction circumvents the gradient approximation of the sign function by sampling from the weight distribution [42, 43]. Qin et al. [44] introduce entropy-maximizing aggregation to modulate the distribution for the maximum information entropy, and layer-wise scale recovery to restore the feature representation capacity. There are also abundant works that explore the optimization of BNNs [45, 46, 47, 48, 33] and explain their effectiveness [49]. Wang et al. [50] proposed to train BNNs under a kernel-aware optimization framework. ProxConnect (PC) [51] generalizes and improves BinaryConnect (BC) with well-established theory and algorithms. Recent works [40, 52] embed various regularization terms into the training loss to binarize the weights and control the activation ranges [53]. Hu et al. [54] added real-valued input features to the subsequent convolutional output features to enrich the information flow within a BNN. Moreover, other recent studies devise binarization-friendly structures to boost the performance. For example, Bi-Real [28] designs double residual connections with full-precision downsampling layers. XNOR-Net++ [38] replaces ReLU with PReLU. ReActNet [55] adds parameter-free shortcuts on MobileNetV1 [16] and replaces the group convolutions with regular convolutions.

3 Binary Neural Networks

For an L-layer CNN model, we denote \mathbf{W}^{i}=\{\mathbf{w}^{i}_{1},\mathbf{w}^{i}_{2},...,\mathbf{w}^{i}_{c^{i}_{out}}\}\in\mathbb{R}^{n^{i}\times c^{i}_{out}} as the real-valued weight set of the i-th layer, where \mathbf{w}^{i}_{j}\in\mathbb{R}^{n^{i}} denotes the j-th weight. The real-valued input activations of the i-th layer are represented as \mathbf{A}^{i}=\{\mathbf{a}^{i}_{1},\mathbf{a}^{i}_{2},...,\mathbf{a}^{i}_{c^{i}_{in}}\}\in\mathbb{R}^{m^{i}\times c^{i}_{in}}; here, c^{i}_{out} and c^{i}_{in} respectively represent the numbers of output and input channels, and n^{i} and m^{i} denote the size of each weight and input, respectively. Then, the convolution result can be expressed as

\mathbf{a}^{i+1}_{j}=\mathbf{w}^{i}_{j}\circledast\mathbf{A}^{i}, \quad (4)

where \circledast stands for the convolution operation. For simplicity, we omit the non-linear layer here.

BNN Training. To train a BNN, the real-valued \mathbf{w}^{i}_{j} and \mathbf{A}^{i} in Eq. (4) are quantized into binary values (\mathbf{b}_{w})^{i}_{j}\in\{-1,+1\}^{n^{i}} and (\mathbf{B}_{A})^{i}\in\{-1,+1\}^{c^{i}_{in}\times m^{i}}, respectively. As a result, the convolution result can be approximated as

\mathbf{a}^{i+1}_{j}\approx{\beta}^{i}_{j}\cdot(\mathbf{b}_{w})_{j}^{i}\circledast(\mathbf{B}_{A})^{i}, \quad (5)

where {\beta}^{i}_{j} is a channel-level scaling factor [25, 38].

For the implementation of BNN training, the forward calculation is fulfilled by conducting the convolution between (\mathbf{b}_{w})_{j}^{i} and (\mathbf{B}_{A})^{i} in Eq. (5), whereas their real-valued counterparts, \mathbf{w}_{j}^{i} and \mathbf{A}^{i}, are updated during backpropagation. To this end, following existing studies [56, 38, 24], the activation binarization in this work is simply realized by the sign function as

(\mathbf{B}_{A})^{i}=\operatorname{sign}(\mathbf{A}^{i})=\begin{cases}+1, & \text{if }\mathbf{A}^{i}\geq 0,\\ -1, & \text{otherwise.}\end{cases} \quad (6)

In the backpropagation phase, we adopt the piece-wise polynomial function [28] to approximate the gradient of a given loss \mathcal{L} w.r.t. the input activations \mathbf{A}^{i} as follows

\frac{\partial\mathcal{L}}{\partial\mathbf{A}^{i}}=\frac{\partial\mathcal{L}}{\partial(\mathbf{B}_{A})^{i}}\cdot\frac{\partial(\mathbf{B}_{A})^{i}}{\partial\mathbf{A}^{i}}\approx\frac{\partial\mathcal{L}}{\partial(\mathbf{B}_{A})^{i}}\cdot\frac{\partial F(\mathbf{A}^{i})}{\partial\mathbf{A}^{i}}, \quad (7)

where \frac{\partial F(\mathbf{A}^{i})}{\partial\mathbf{A}^{i}} is defined by

\frac{\partial F(\mathbf{A}^{i})}{\partial\mathbf{A}^{i}}=\begin{cases}2+2\mathbf{A}^{i}, & \text{if }-1\leq\mathbf{A}^{i}<0,\\ 2-2\mathbf{A}^{i}, & \text{if }\;\;0\leq\mathbf{A}^{i}<1,\\ 0, & \text{otherwise.}\end{cases} \quad (8)

Besides, the STE [35] is used to calculate the gradient of the loss \mathcal{L} w.r.t. the weight \mathbf{w}_{j}^{i} as

\frac{\partial\mathcal{L}}{\partial\mathbf{w}_{j}^{i}}=\frac{\partial\mathcal{L}}{\partial(\mathbf{b}_{w})_{j}^{i}}\cdot\frac{\partial(\mathbf{b}_{w})_{j}^{i}}{\partial\mathbf{w}_{j}^{i}}\approx\frac{\partial\mathcal{L}}{\partial(\mathbf{b}_{w})_{j}^{i}}. \quad (9)
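In practice, Eqs. (6)-(9) can be implemented as custom autograd functions. Below is a minimal PyTorch sketch (class and variable names are ours, not from the released code); the forward pass applies the sign function, while the backward pass substitutes the approximate gradients of Eq. (8) and Eq. (9).

import torch

class BinarizeActivation(torch.autograd.Function):
    # sign() in the forward pass (Eq. (6)); piece-wise polynomial gradient in the backward pass (Eqs. (7)-(8))
    @staticmethod
    def forward(ctx, a):
        ctx.save_for_backward(a)
        return torch.where(a >= 0, torch.ones_like(a), -torch.ones_like(a))

    @staticmethod
    def backward(ctx, grad_out):
        a, = ctx.saved_tensors
        grad_mask = torch.zeros_like(a)
        grad_mask = torch.where((a >= -1) & (a < 0), 2 + 2 * a, grad_mask)
        grad_mask = torch.where((a >= 0) & (a < 1), 2 - 2 * a, grad_mask)
        return grad_out * grad_mask

class BinarizeWeightSTE(torch.autograd.Function):
    # forward returns the pre-computed binary weight; backward passes the gradient straight through (Eq. (9))
    @staticmethod
    def forward(ctx, w, b_w):
        return b_w

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None

# usage: B_A = BinarizeActivation.apply(A); b = BinarizeWeightSTE.apply(w, b_w)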

BNN Inference. In practical deployment, the BNN model is accelerated using the efficient XNOR and bitcount logics embedded in the hardware. Thus, the quantized weights and activations need to be further transformed back into the \{0,1\} space. Such a transformation process can be realized by setting

(\bar{\mathbf{B}}_{A})^{i}=\big(1+(\mathbf{B}_{A})^{i}\big)/2, \quad (10)
(\bar{\mathbf{b}}_{w})^{i}_{j}=\big(1+(\mathbf{b}_{w})^{i}_{j}\big)/2. \quad (11)

Then, the approximated convolution in Eq. (5) can be replaced by the following equality

\mathbf{a}^{i+1}_{j}\approx{\beta}^{i}_{j}\cdot\big(2\cdot(\bar{\mathbf{b}}_{w})_{j}^{i}\odot(\bar{\mathbf{B}}_{A})^{i}-n^{i}\big), \quad (12)

where \odot represents the XNOR and bitcount operations that are well-fitted for real-time network inference.
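The identity behind Eq. (12) is that the inner product of two ±1 vectors equals twice the bitcount of their XNOR in \{0,1\} space minus n. A small NumPy sketch (random vectors, \beta omitted for clarity) checks this:

import numpy as np

rng = np.random.default_rng(0)
n = 128
bw_bar = rng.integers(0, 2, n)            # \bar{b}_w in {0, 1}
ba_bar = rng.integers(0, 2, n)            # \bar{B}_A in {0, 1}

bw, ba = 2 * bw_bar - 1, 2 * ba_bar - 1   # {-1, +1} counterparts (inverse of Eqs. (10)-(11))

dot = int(bw @ ba)                        # +/-1 inner product used during training
xnor_bitcount = int(np.sum(bw_bar == ba_bar))   # XNOR followed by bitcount
assert dot == 2 * xnor_bitcount - n       # the equality used in Eq. (12)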

Our Insight. In this paper, we focus on binarizing the real-valued weight \mathbf{w}_{j}^{i}. Different from most existing works [56, 57, 38, 24] that project the weights \mathbf{w}_{j}^{i} into (\mathbf{b}_{w})_{j}^{i}\in\{-1,+1\}^{n^{i}} using the sign function during training and then transform (\mathbf{b}_{w})_{j}^{i} into (\bar{\mathbf{b}}_{w})_{j}^{i}\in\{0,+1\}^{n^{i}} for inference, we seek to directly encode the weights into \bar{\mathbf{b}}_{w}\in\{0,+1\}^{n^{i}} and then devise an efficient optimization to attain the optimal solution in Sec. 4.1. We demonstrate in Sec. 4.2 that the weight \mathbf{w}_{j}^{i} roughly follows a Laplacian distribution, which inhibits the entropy maximization. We reveal that this can be easily addressed by removing the \ell_{2} regularization in Sec. 4.3.

For simplicity, the scripts "i" and "j" are omitted in what follows.

4 Weight Binarization

In this section, we specify the formulation of our weight binarization, including the binary learning objective, weight distribution, and bit entropy maximization.

4.1 Learning Objective

To achieve high-quality weight binarization, different from the conference version [24], we reformulate the learning objective in Eq. (3) as

\mathop{\arg\max}_{\bar{\mathbf{b}}_{w}}\;\frac{(\bar{\mathbf{b}}_{w})^{T}|\mathbf{w}|}{\|\bar{\mathbf{b}}_{w}\|_{2}\,\big\||\mathbf{w}|\big\|_{2}},\quad s.t.\quad\bar{\mathbf{b}}_{w}\in\{0,+1\}^{n}, \quad (13)

where |\cdot| returns the element-wise absolute value of its input.

As can be seen, our learning objective is also built on the basis of angle alignment. Nevertheless, our method differs from Eq. (3) in many aspects: First, we drop the sign function, since the variables in a binarized network must be retained in a discrete set; thus, the binarization should be built upon the concept of discrete optimization rather than the simple sign function. Second, we encode the weights into \bar{\mathbf{b}}_{w}\in\{0,+1\}^{n} rather than \mathbf{b}_{w}\in\{-1,+1\}^{n}. In Corollary 1, we show that \bar{\mathbf{b}}_{w}\in\{0,+1\}^{n} allows us to find an analytical solution in an efficient manner by transferring the high-magnitude weights to +1s, and 0s otherwise. Third, our angle alignment is independent of the rotation matrix \mathbf{R}, since we remove the sign function, which makes the rotation direction unpredictable. Lastly, we align the angle difference between the binarization and the absolute weight vector |\mathbf{w}|, instead of the weight \mathbf{w} itself. The rationale behind this is that our binarization falls into the non-negative set \{0,+1\}^{n}. Fig. 2(b) outlines our binarization process.

Note that \big\||\mathbf{w}|\big\|_{2} is irrelevant to the optimization of Eq. (13). Thus, the learning can be simplified to

\mathop{\arg\max}_{\bar{\mathbf{b}}_{w}}\;\frac{(\bar{\mathbf{b}}_{w})^{T}}{\|\bar{\mathbf{b}}_{w}\|_{2}}|\mathbf{w}|,\quad s.t.\quad\bar{\mathbf{b}}_{w}\in\{0,+1\}^{n}. \quad (14)

This is an integer programming problem [58]. Nevertheless, as demonstrated in Corollary 1, by learning in the encoding space \{0,+1\}^{n}, we can reach the global maximum in a substantially efficient fashion.

Corollary 1. For Eq. (14), the computational complexity of finding the global optimum is \mathcal{O}(n\log{n}).

Proof: For \bar{\mathbf{b}}_{w}\in\{0,+1\}^{n}, it is intuitive to see that \|\bar{\mathbf{b}}_{w}\|_{2} falls into the set \{\sqrt{1},...,\sqrt{n}\}. Considering that \|\bar{\mathbf{b}}_{w}\|_{2}=\sqrt{k} (k=1,...,n), the integer programming problem in Eq. (14) can be maximized by encoding into +1s those elements of \bar{\mathbf{b}}_{w} that correspond to the largest k entries of |\mathbf{w}|. To this end, we need to perform sorting upon |\mathbf{w}|, for which the complexity is \mathcal{O}(n\log{n}). Since k has n possible values, we need to evaluate Eq. (14) n times and then select the \bar{\mathbf{b}}_{w} that maximizes the objective function, leading to a complexity linear in n. Hence, the overall complexity is \mathcal{O}(n\log{n}). ∎

Therefore, given one filter weight \mathbf{w}\in\mathbb{R}^{n}, we can find the binarization \bar{\mathbf{b}}_{w}\in\{0,+1\}^{n} having the smallest angle with |\mathbf{w}|. Note that the \bar{\mathbf{b}}_{w} found in this way is the global optimum. Furthermore, we emphasize that the proof of Corollary 1 indicates that the binarization in our framework involves the magnitudes of the weights instead of the signs of the weights, which significantly differentiates our work from existing works. In the next two sections, we show that the overall complexity can be further reduced to \mathcal{O}(n), given the bit entropy maximization.
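A minimal NumPy sketch of this magnitude-based search is given below (the function name is ours); sorting |\mathbf{w}| once and using prefix sums evaluates the objective of Eq. (14) for every k without redundant computation.

import numpy as np

def siman_binarize(w):
    # global maximizer of Eq. (14) over {0,1}^n; the sort dominates the cost, giving O(n log n)
    mag = np.abs(w)
    order = np.argsort(-mag)                      # indices of |w| in descending order
    prefix = np.cumsum(mag[order])                # sum of the k largest magnitudes, for every k
    scores = prefix / np.sqrt(np.arange(1, w.size + 1))
    k_star = int(np.argmax(scores)) + 1           # best number of +1s
    b = np.zeros_like(w)
    b[order[:k_star]] = 1.0                       # high-magnitude weights -> +1, the rest -> 0
    return b

b = siman_binarize(np.random.default_rng(0).standard_normal(4608))
print(b.mean())    # fraction of weights encoded into +1s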

4.2 Weight Distribution

The capacity of a binarized model, often measured by the bit entropy, is maximized when the binarization is half-half distributed, i.e., one half of the weights are encoded into 0 and the other half are encoded into +1 [24, 29]. In this case, we expect to maximize our objective in Eq. (14) when the weights with the top-half magnitudes are encoded into +1s and the remaining ones are encoded into 0s. However, we reveal that it is difficult to binarize \mathbf{w} with entropy maximization due to its specific form of distribution.

Specifically, after training, w\in\mathbf{w} is widely believed to roughly obey a zero-mean Laplacian distribution, i.e., w\sim La(0,b), or a zero-mean Gaussian distribution, i.e., w\sim\mathcal{N}(0,\sigma^{2}) [22, 59, 60]. In Corollary 2, for the first time, we demonstrate its specific distribution.

Figure 4: Proportion of +1s trained (a) with and (b) without the \ell_{2} regularization across different layers in different networks (panels: Layer1.1.2, Layer2.1.2 and Layer4.1.2 of ResNet-18; Layer2.1.2, Layer2.2.2 and Layer3.2.1 of ResNet-20). The dashed blue lines denote the average proportions of +1s over all filter weights.

Corollary 2. w\in\mathbf{w} roughly follows a zero-mean Laplacian distribution.

Proof: Suppose w is encoded into +1 if |w|>t, and 0 otherwise. Note that the learning of Eq. (14) can actually be regarded as a problem of finding the centroid of a subset [61] as well. Consequently, the learning process can be calculated by the integral as

\frac{(\bar{\mathbf{b}}_{w})^{T}}{\|\bar{\mathbf{b}}_{w}\|_{2}}|\mathbf{w}|=\frac{\int^{-t}_{-\infty}|w|f(w)dw+\int_{t}^{+\infty}wf(w)dw}{\sqrt{\int^{-t}_{-\infty}f(w)dw+\int_{t}^{+\infty}f(w)dw}}=\frac{2\int_{t}^{+\infty}wf(w)dw}{\sqrt{2\int_{t}^{+\infty}f(w)dw}}, \quad (15)

where f(w) represents the probability density function of w. Then, we denote p_{+1} as the proportion of \mathbf{w} being encoded into +1s. Intuitively, its calculation can be derived as

p_{+1}=1-2\int^{t}_{0}f(w)dw. \quad (16)

To demonstrate our corollary, we first derive the theoretical values of p_{+1} when f(w) follows a Laplacian or Gaussian distribution, and then experimentally complete our final proof.

Laplacian distribution. In this case, we have f(w)=\frac{1}{2b}e^{-|w|/b}. Therefore, Eq. (15) becomes

\frac{(\bar{\mathbf{b}}_{w})^{T}}{\|\bar{\mathbf{b}}_{w}\|_{2}}|\mathbf{w}|=\frac{2\int_{t}^{+\infty}\frac{w}{2b}e^{-w/b}dw}{\sqrt{2\int_{t}^{+\infty}\frac{1}{2b}e^{-w/b}dw}}=\frac{(b+t)e^{-t/b}}{\sqrt{e^{-t/b}}}=(b+t)\sqrt{e^{-t/b}}. \quad (17)

Setting \frac{\partial(b+t)\sqrt{e^{-t/b}}}{\partial t}=0 to attain the maximum of Eq. (17), we have t=b. The proportion of +1s can then be obtained as

p_{+1}=1-2\int_{0}^{t}\frac{1}{2b}e^{-w/b}dw\approx 0.37. \quad (18)
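For completeness, the intermediate steps behind t=b and the value 0.37 are:

\frac{\partial}{\partial t}\Big[(b+t)\,e^{-t/(2b)}\Big]=e^{-t/(2b)}\Big(1-\frac{b+t}{2b}\Big)=0\;\Longrightarrow\;t=b,\qquad p_{+1}=1-2\int_{0}^{b}\frac{1}{2b}e^{-w/b}dw=e^{-1}\approx 0.368.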

Gaussian distribution. In this case, we have f(w)=\frac{1}{\sqrt{2\pi}\sigma}e^{-w^{2}/(2\sigma^{2})}. Similarly, Eq. (15) can be rewritten as

\frac{(\bar{\mathbf{b}}_{w})^{T}}{\|\bar{\mathbf{b}}_{w}\|_{2}}|\mathbf{w}|=\frac{2\int_{t}^{+\infty}\frac{w}{\sqrt{2\pi}\sigma}e^{-w^{2}/(2\sigma^{2})}dw}{\sqrt{2\int_{t}^{+\infty}\frac{1}{\sqrt{2\pi}\sigma}e^{-w^{2}/(2\sigma^{2})}dw}}=\frac{\frac{\sigma}{\sqrt{2\pi}}e^{-t^{2}/(2\sigma^{2})}}{\sqrt{\frac{1}{2}\operatorname{erfc}(\frac{t}{\sqrt{2}\sigma})}}, \quad (19)

where \operatorname{erfc}(\cdot) represents the well-known complementary error function [62].

Let m=\frac{t}{\sqrt{2}\sigma}; then Eq. (19) can be written as

\frac{(\bar{\mathbf{b}}_{w})^{T}}{\|\bar{\mathbf{b}}_{w}\|_{2}}|\mathbf{w}|=\frac{\sigma}{\sqrt{\pi}}\cdot\frac{e^{-m^{2}}}{\sqrt{\operatorname{erfc}(m)}}. \quad (20)

Setting \frac{\partial\, e^{-m^{2}}/\sqrt{\operatorname{erfc}(m)}}{\partial m}=0, we have m=\frac{t}{\sqrt{2}\sigma}\approx 0.43. Thus, we obtain t\approx 0.43\sqrt{2}\sigma, and then derive the proportion of +1s:

p_{+1}=1-2\int_{0}^{t}\frac{1}{\sqrt{2\pi}\sigma}e^{-w^{2}/(2\sigma^{2})}dw\approx 0.54. \quad (21)
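Equivalently, Eq. (21) can be evaluated in terms of the complementary error function:

p_{+1}=1-\operatorname{erf}\Big(\frac{t}{\sqrt{2}\sigma}\Big)=\operatorname{erfc}(m)\approx\operatorname{erfc}(0.43)\approx 0.54.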

As mentioned above, the trained weight w\in\mathbf{w} obeys either a Laplacian distribution (with p_{+1}\approx 0.37) or a Gaussian distribution (with p_{+1}\approx 0.54). In Fig. 4(a), we conduct an experiment which shows a practical p_{+1} of around 0.36–0.38 after training (similar phenomena can be observed in other layers and networks as well). This implies that w\in\mathbf{w} follows a Laplacian distribution, which completes our proof. ∎
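The two theoretical proportions can also be reproduced numerically. The following minimal NumPy sketch (synthetic weights only, not trained ones) maximizes Eq. (14) over the number of +1s and reports the resulting fraction for Laplacian- and Gaussian-distributed weights.

import numpy as np

def plus_one_fraction(mag):
    # fraction of weights encoded into +1 by maximizing Eq. (14), given the magnitudes |w|
    s = np.sort(mag)[::-1]
    scores = np.cumsum(s) / np.sqrt(np.arange(1, s.size + 1))
    return (np.argmax(scores) + 1) / s.size

rng = np.random.default_rng(0)
lap = np.abs(rng.laplace(scale=0.05, size=100_000))   # |w| for Laplacian weights
gau = np.abs(rng.normal(scale=0.05, size=100_000))    # |w| for Gaussian weights
print(plus_one_fraction(lap))   # close to 0.37 (Eq. (18))
print(plus_one_fraction(gau))   # close to 0.54 (Eq. (21))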

4.3 Maximizing Bit Entropy

The proof of Corollary 1 indicates that the binarization in our framework is related to the weight magnitude, i.e., |\mathbf{w}|. However, the Laplacian distribution contradicts the entropy maximization. To solve this, one naive solution is to assign +1 to the top half of the elements of the sorted |\mathbf{w}| and 0 to the remaining elements, that is,

\tilde{\mathbf{b}}_{w}=\begin{cases}+1, & \text{top half of sorted }|\mathbf{w}|,\\ 0, & \text{otherwise.}\end{cases} \quad (22)

Despite helping achieve entropy maximization, such a simple operation violates the learning objective of minimizing the angular bias in Eq. (14), since \tilde{\mathbf{b}}_{w} deviates significantly from the optimal \bar{\mathbf{b}}_{w}, as revealed in Corollary 3.

Corollary 3. Suppose \bar{\mathbf{b}}_{w} is a binarized vector with a total of k +1s and the binarized vector \tilde{\mathbf{b}}_{w} has r bits different from \bar{\mathbf{b}}_{w}. Then the angle between \bar{\mathbf{b}}_{w} and \tilde{\mathbf{b}}_{w} is bounded by \big[\arccos\sqrt{\frac{k}{k+r}},\arccos\sqrt{\frac{k-r}{k}}\big].

Proof: To find the lower bound, we need to obtain the \tilde{\mathbf{b}}_{w} such that \frac{(\bar{\mathbf{b}}_{w})^{T}\tilde{\mathbf{b}}_{w}}{\|\bar{\mathbf{b}}_{w}\|_{2}\|\tilde{\mathbf{b}}_{w}\|_{2}} is maximized. Intuitively, this is achieved when \tilde{\mathbf{b}}_{w} has ones at all the same positions as \bar{\mathbf{b}}_{w}, and r additional ones in the remaining positions, in which case \|\tilde{\mathbf{b}}_{w}\|_{2}=\sqrt{k+r} and (\bar{\mathbf{b}}_{w})^{T}\tilde{\mathbf{b}}_{w}=k. Then, we have the lower bound of \arccos\sqrt{\frac{k}{k+r}}. To obtain the upper bound, we need to minimize \frac{(\bar{\mathbf{b}}_{w})^{T}\tilde{\mathbf{b}}_{w}}{\|\bar{\mathbf{b}}_{w}\|_{2}\|\tilde{\mathbf{b}}_{w}\|_{2}}, which can be done when there are (k-r) +1s in \tilde{\mathbf{b}}_{w} at positions in common with \bar{\mathbf{b}}_{w} and the rest are set to zeros. In this case, we have \|\tilde{\mathbf{b}}_{w}\|_{2}=\sqrt{k-r} and (\bar{\mathbf{b}}_{w})^{T}\tilde{\mathbf{b}}_{w}=k-r, which leads to the upper bound of \arccos\sqrt{\frac{k-r}{k}}. ∎

According to Corollary 2 and Eq. (22), we have k\approx 0.37n and r\approx 0.13n. Then, we can derive the practical angle bounds as [\arccos\sqrt{\frac{0.37}{0.37+0.13}},\arccos\sqrt{\frac{0.37-0.13}{0.37}}]\approx[30.66^{\circ},36.35^{\circ}]. Therefore, a large angle bias occurs between the solution \tilde{\mathbf{b}}_{w} from Eq. (22) and the solution \bar{\mathbf{b}}_{w} from optimizing Eq. (14). In Sec. 5.2, we demonstrate the poor performance when simply using \tilde{\mathbf{b}}_{w}.
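These bounds depend only on the ratios k/n and r/n, so they can be checked with a few lines of NumPy (the values are those used in Sec. 4.3):

import numpy as np

def angle_bounds(k, r):
    # angle bounds (in degrees) from Corollary 3; k and r may be given as fractions of n
    lower = np.degrees(np.arccos(np.sqrt(k / (k + r))))
    upper = np.degrees(np.arccos(np.sqrt((k - r) / k)))
    return lower, upper

print(angle_bounds(0.37, 0.13))   # with the l2 regularization: about (30.66, 36.35) degrees
print(angle_bounds(0.51, 0.01))   # without the l2 regularization: about (7.97, 8.05) degrees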

Algorithm 1 Sign-to-Magnitude Network Binarization
  Input: An L-layer full-precision network with weights \mathbf{W}^{i}=\{\mathbf{w}_{1}^{i},\mathbf{w}_{2}^{i},...,\mathbf{w}^{i}_{c^{i}_{out}}\} (i=1,2,...,L), input images (activations) \mathbf{A}^{1}=\{\mathbf{a}_{1}^{1},\mathbf{a}_{2}^{1},...,\mathbf{a}^{1}_{c_{in}^{1}}\}.
  1) Forward Propagation:
  Remove the \ell_{2} regularization term.
  for i=1 to L do
     Binarize the inputs (\mathbf{B}_{A})^{i}=\operatorname{sign}(\mathbf{A}^{i}) (Eq. (6));
     for j=1 to c^{i}_{out} do
        Obtain the half-half binarization (\tilde{\mathbf{b}}_{w})^{i}_{j} (Eq. (22));
        Obtain the binarization (\mathbf{b}_{w})^{i}_{j}=2\cdot(\tilde{\mathbf{b}}_{w})^{i}_{j}-1 (the inverse of Eq. (11));
        Conduct the convolution \mathbf{a}^{i+1}_{j}\approx{\beta}^{i}_{j}\cdot(\mathbf{b}_{w})_{j}^{i}\circledast(\mathbf{B}_{A})^{i} (Eq. (5));
     end for
     \mathbf{A}^{i+1}=\{\mathbf{a}_{1}^{i+1},\mathbf{a}_{2}^{i+1},...,\mathbf{a}^{i+1}_{c^{i}_{out}}\};
  end for
  2) Backward Propagation:
  for i=L to 1 do
     Compute gradient \frac{\partial\mathcal{L}}{\partial\mathbf{A}^{i}}\approx\frac{\partial\mathcal{L}}{\partial(\mathbf{B}_{A})^{i}}\cdot\frac{\partial F(\mathbf{A}^{i})}{\partial\mathbf{A}^{i}} (Eq. (7));
     for j=1 to c^{i}_{out} do
        Compute gradient \frac{\partial\mathcal{L}}{\partial\mathbf{w}_{j}^{i}}\approx\frac{\partial\mathcal{L}}{\partial(\mathbf{b}_{w})^{i}_{j}} (Eq. (9));
     end for
  end for
  3) Weight Updating:
  for i=L to 1 do
     for j=1 to c^{i}_{out} do
        Update \mathbf{w}^{i}_{j}=\mathbf{w}^{i}_{j}-\eta\frac{\partial\mathcal{L}}{\partial\mathbf{w}_{j}^{i}}; # \eta denotes the learning rate
     end for
  end for
  Output: An L-layer binarized network with weights (\tilde{\mathbf{b}}_{W})^{i}=\{(\tilde{\mathbf{b}}_{w})^{i}_{1},(\tilde{\mathbf{b}}_{w})^{i}_{2},...,(\tilde{\mathbf{b}}_{w})^{i}_{c^{i}_{out}}\} (i=1,2,...,L).

Instead of imposing an additional loss term to regularize the ideal half-half binarization, we analyze the numerical value of each weight and reveal that simply removing the \ell_{2} regularization can explicitly maximize the bit capacity, leading to a more informative binarized network.

Let \mathcal{L}^{k}_{\mathbf{b}_{w}}=\mathop{\max}_{\mathbf{b}_{w}}\frac{(\mathbf{b}_{w})^{T}}{\sqrt{k}}|\mathbf{w}|=\frac{\sum_{i=1}^{k}\tilde{w}_{i}}{\sqrt{k}} denote the maximum value of the integer programming problem in Eq. (14) when exactly k entries are encoded into +1, where \tilde{w}_{i}\in|\mathbf{w}| corresponds to the i-th largest magnitude. At the optimal k, we have \mathcal{L}^{k+1}_{\mathbf{b}_{w}}<\mathcal{L}^{k}_{\mathbf{b}_{w}}, i.e.,

\frac{\tilde{w}_{k+1}+\sum_{i=1}^{k}\tilde{w}_{i}}{\sqrt{k+1}}<\frac{\sum_{i=1}^{k}\tilde{w}_{i}}{\sqrt{k}}. \quad (23)

We can deduce that

\tilde{w}_{k+1}<\mathcal{L}^{k}_{\mathbf{b}_{w}}(\sqrt{k+1}-\sqrt{k}). \quad (24)

For Laplacian-distributed weights, we know that k\approx 0.37n. Thus, the above inequality can be rewritten as

\tilde{w}_{k+1}<\mathcal{L}^{k}_{\mathbf{b}_{w}}(\sqrt{0.37n+1}-\sqrt{0.37n}). \quad (25)

Since n is typically in the thousands for a neural network and we statistically find that \mathcal{L}^{k}_{\mathbf{b}_{w}} ranges from 0.63 to 0.73, the product of the two terms in Eq. (25) results in an extremely small \tilde{w}_{k+1} that approximates zero (for instance, with n=4608 and \mathcal{L}^{k}_{\mathbf{b}_{w}}=0.7, the bound is only about 0.008). Thus, we need to enlarge the value of \tilde{w}_{k+1} to break the above inequality. We realize that one of the major causes of a small \tilde{w}_{k+1} lies in the \ell_{2} regularization imposed on the training of the neural network. This inspires us to remove the \ell_{2} regularization for the to-be-binarized weights \mathbf{w}.

As shown in Fig. 4(b), the removal of the \ell_{2} regularization increases the proportion of +1s in \bar{\mathbf{b}}_{w} to around 0.50–0.52. Taking 0.51 as an example, we then further enforce the ideal half-half binarization \tilde{\mathbf{b}}_{w} using Eq. (22). As a result, k\approx 0.51n and r\approx 0.01n, yielding the much smaller angle bounds of [\arccos\sqrt{\frac{0.51}{0.51+0.01}},\arccos\sqrt{\frac{0.51-0.01}{0.51}}]\approx[7.97^{\circ},8.05^{\circ}] between \tilde{\mathbf{b}}_{w} and \bar{\mathbf{b}}_{w}. This effectively increases the bit entropy and leads to a nearly optimal solution for the learning objective of Eq. (14). Besides, the half-half binarization further reduces the computational complexity from \mathcal{O}(n\log n) to \mathcal{O}(n), since we only need to find the median of |\mathbf{w}| and encode weights into +1s when their magnitudes are larger than the median, and 0s otherwise.

The forward and backward processes of SiMaN are summarized in Algorithm 1. During training, we remove the \ell_{2} regularization and adopt the binarization \mathbf{b}_{w}\in\{-1,+1\}^{n}, transformed from the half-half binarization \tilde{\mathbf{b}}_{w}\in\{0,+1\}^{n}, for the convolution in Eq. (5). After training, we obtain a network consisting of the binarized weights \tilde{\mathbf{b}}_{w} for practical deployment on hardware, where the convolution is executed using the XNOR and bitcount operations in Eq. (12).
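A minimal PyTorch sketch of the per-filter weight binarization in Algorithm 1 is shown below (function and variable names are ours; the scaling factor is set to the mean magnitude in the style of XNOR-Net, which is an assumption rather than a prescription of the paper).

import torch

def siman_binarize_filter(w):
    # half-half binarization of one filter (Eq. (22)); O(n) since only the median of |w| is needed
    mag = w.reshape(-1).abs()
    thresh = mag.median()                    # boundary between 0s and +1s
    b01 = (w.abs() > thresh).to(w.dtype)     # \tilde{b}_w in {0, 1}, kept for deployment
    b_pm1 = 2.0 * b01 - 1.0                  # {-1, +1} code used for the training convolution (Alg. 1)
    beta = mag.mean()                        # channel-level scaling factor (XNOR-Net-style choice)
    # straight-through estimator (Eq. (9)): the forward value is b_pm1, the gradient flows to w unchanged
    return (b_pm1 - w).detach() + w, b01, beta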

5 Experiments

To demonstrate the efficacy of the proposed SiMaN binarization scheme, we compare its performance with several state-of-the-art BNNs [32, 57, 25, 27, 22, 63, 28, 53, 64, 30, 52, 65, 66, 29, 23] as well as the conference version [24] on two image classification datasets, including CIFAR-10 [34] and ImageNet [26].

5.1 Datasets and Experimental Settings

CIFAR-10 [34] consists of 60,000 32×32 images from 10 classes, with 6,000 images per class. The dataset is split into 50,000 training images and 10,000 test images. Data augmentation includes random cropping and random flipping for the training images, as done in [1].

ImageNet [26] contains over 1.2 million training images and 50,000 validation images from 1,000 classes. For a fair comparison with the recent advances in [29, 23, 24], we only apply random cropping and flipping for data augmentation.

Network Structures. For CIFAR-10, we binarize ResNet-18/20 [1] and VGG-small [67]. For ImageNet, ResNet-18/34 are chosen for binarization. Following [29, 23, 24], double skip connections [28] are added to the ResNets, and we do not binarize the first and last layers of any network.

Implementation Details. We implement our SiMaN in PyTorch [68], and all experiments are conducted on NVIDIA Tesla V100 GPUs. We use a cosine learning rate scheduler with an initial learning rate of 0.1 [29, 24]. SGD is adopted as the optimizer with a momentum of 0.9. For the layers that are not binarized, the weight decay is set to 5\times 10^{-4} on CIFAR-10 and 1\times 10^{-4} on ImageNet; it is set to 0 otherwise to remove the \ell_{2} regularization for bit entropy maximization, as discussed in Sec. 4.3. We train the models from scratch for 400 epochs with a batch size of 256 on CIFAR-10, and for 150 epochs with a batch size of 512 on ImageNet.
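The per-layer removal of the \ell_{2} regularization maps directly onto optimizer parameter groups. A minimal sketch under the CIFAR-10 setting is given below; the stand-in model and the choice of which layer is binarized are illustrative assumptions, not the networks used in the paper.

import torch
import torch.nn as nn

model = nn.Sequential(                                  # stand-in model; the paper uses ResNet/VGG variants
    nn.Conv2d(3, 16, 3, padding=1, bias=False),         # first layer: kept full-precision, decayed
    nn.Conv2d(16, 16, 3, padding=1, bias=False),        # to-be-binarized layer: no weight decay
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))

binarized = [model[1].weight]
others = [p for name, p in model.named_parameters() if name != '1.weight']

optimizer = torch.optim.SGD(
    [{'params': binarized, 'weight_decay': 0.0},        # removes the l2 term for binarized weights
     {'params': others, 'weight_decay': 5e-4}],         # CIFAR-10 value for non-binarized layers
    lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=400)   # 400-epoch cosine schedule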

Note that, we only apply the classification loss during network training for fair comparison. Other training losses such as those proposed in [69, 53, 70], the variants of network structures in [71, 72, 55], and even the two-step training strategy [33] can be integrated to further boost the binarized networks’ performance. These, however, are not considered here. We aim to show the advantages of our magnitude-based optimization solution over the traditional sign-based methods under regular training loss, the same network structure and a common training strategy.

Figure 5: Illustration of (1) cosine similarity and (2) quantization error in various layers of ResNet-20.

5.2 Ablation Studies

TABLE I: Ablation studies with/without the \ell_{2} regularization and half-half binarization (ResNet-18 on ImageNet).
Method   \ell_{2} regularization   half-half   Top-1 (%)   Top-5 (%)
SiMaN1   yes   no   55.1   75.5
SiMaN2   no   no   57.3   77.4
SiMaN3   yes   yes   59.2   81.5
SiMaN   no   yes   60.1   82.3

In this subsection, we conduct ablation studies of different variants to demonstrate the efficacy of our SiMaN, and analyze the quantization error to demonstrate the superiority of our analytical discrete optimization.

SiMaN Variants. Our SiMaN is built by removing the \ell_{2} regularization and enforcing the half-half strategy in Eq. (22). To analyze their influence, in Table I, we develop three variants: (1) SiMaN1: the \ell_{2} regularization is added while the half-half binarization is removed. This variant simply implements binarization based on the proof process of Corollary 1. It results in around 37% of weights being encoded into +1s, as analyzed in Corollary 2, which fails to maximize the entropy of information and thus leads to the poorer accuracies of 55.1% top-1 and 75.5% top-5. (2) SiMaN2: both the \ell_{2} regularization and the half-half binarization are removed. It shows better top-1 (57.3%) and top-5 (77.4%) accuracies, since the removal of the \ell_{2} regularization breaks the Laplacian distribution and results in around 51% of weights being encoded into +1s, as experimentally verified in Fig. 4. (3) SiMaN3: both the \ell_{2} regularization and the half-half binarization are added. Though the performance increases, it is still limited. This is because the half-half binarization with the \ell_{2} regularization causes a large angle deviation of around 30.66°–36.35°, as analyzed in Sec. 4.3, from the optimal binarization of our learning objective in Eq. (14).

Based on SiMaN3, our SiMaN further removes the \ell_{2} regularization. On one hand, this ensures the maximal bit entropy; on the other hand, it ensures that the half-half binarization closely matches the optimal binarization (only a 7.97°–8.05° angle deviation, as analyzed in Sec. 4.3), thereby leading to the best performance in Table I.

Quantization Error. Recall from Sec. 1 that, to mitigate the quantization error, XNOR-Net [25] compensates for the norm gap between the full-precision weight and the corresponding binarization. Our conference implementation of RBNN [24] introduces a rotation matrix to reduce the angular bias. In this paper, we reformulate the angle alignment objective and derive an analytical discrete solution. To validate the theoretical claims on reducing the quantization error, we measure the practical quantization errors across different layers of ResNet-20 as a toy example in Fig. 5. Similar observations can be made for other networks as well.

In Fig. 5(1), we first measure the cosine similarity between the full-precision weight vector and the corresponding binarization. We can see that our earlier implementation of RBNN achieves a significantly higher cosine similarity than XNOR-Net, implying a smaller angular bias. On top of RBNN, our SiMaN further increases the cosine similarity across different network layers. Consequently, in Fig. 5(2), though XNOR-Net mitigates the quantization error from the norm gap, the quantization error still accumulates heavily in most layers since the angular bias remains unsolved. By weight rotation, our conference version of RBNN greatly closes the angular bias and thus effectively decreases the quantization error. Nevertheless, the alternating optimization in RBNN leads to sub-optimal binarization. Therefore, the quantization error in the top layers still remains relatively large. By contrast, the analytical optimization in SiMaN further reduces the quantization error to a very small level, providing a new perspective on BNN optimization.

5.3 Convergence

We further show the convergence ability of our SiMaN, and compare with its conference version of RBNN which implements network binarization with the sign function [24]. The experiments in Fig. 6 show that our sign-to-magnitude weight binarization has a significantly better ability to converge during BNN training than the traditional sign-based optimization on both training and validation sets, which demonstrates the feasibility of our discrete solution in learning BNNs.

Figure 6: Comparison of training and validation accuracy curves between our SiMaN and RBNN (ResNet-34 on ImageNet).
TABLE II: Comparison with the state-of-the-arts on CIFAR-10. W/A denotes the bit length of the weights and activations. Top-1 accuracy is reported.
Network Method W/A Top-1 (%)
ResNet-18 Full-precision 32/32 94.8
ResNet-18 RAD [53] 1/1 90.5
ResNet-18 IR-Net [29] 1/1 91.5
ResNet-18 RBNN [24] 1/1 92.2
ResNet-18 SiMaN (Ours) 1/1 92.5
ResNet-20 Full-precision 32/32 92.1
ResNet-20 DoReFa [57] 1/1 79.3
ResNet-20 DSQ [64] 1/1 84.1
ResNet-20 SLB [65] 1/1 85.5
ResNet-20 LNS [23] 1/1 85.8
ResNet-20 IR-Net [29] 1/1 86.5
ResNet-20 RBNN [24] 1/1 87.8
ResNet-20 SiMaN (Ours) 1/1 87.4
VGG-small Full-precision 32/32 94.1
VGG-small XNOR-Net [25] 1/1 89.8
VGG-small BNN [32] 1/1 89.9
VGG-small DoReFa [57] 1/1 90.2
VGG-small RAD [53] 1/1 90.0
VGG-small DSQ [64] 1/1 91.7
VGG-small IR-Net [29] 1/1 90.4
VGG-small RBNN [24] 1/1 91.3
VGG-small SLB [65] 1/1 92.0
VGG-small SiMaN (Ours) 1/1 92.5

5.4 Results on CIFAR-10

We first conduct detailed studies on CIFAR-10 for the proposed SiMaN, as shown in Table II. Although RBNN performs best in binarizing ResNet-20, our sign-to-magnitude binarization consistently outperforms the recent sign-based state-of-the-arts. Specifically, SiMaN outperforms RBNN [24] and SLB [65] by 0.3% and 0.5% in binarizing ResNet-18 and VGG-small, respectively. The results emphasize the importance of building discrete optimization to pursue high-quality weight binarization. Importantly, compared with the conference version, i.e., RBNN, our SiMaN increases the performance to 92.5% when binarizing VGG-small, leading to a performance gain of 1.2%. Besides, SiMaN also benefits from its easy implementation, where the median of the absolute weights acts as the boundary between 0s and +1s to ensure bit entropy maximization, while the angle alignment is also guaranteed, as analyzed in Sec. 4.3. In contrast, RBNN has to learn two complex rotation matrices and apply them at the beginning of each training epoch in order to reduce the angle bias. Note that, for ResNet-20, we realize that a smaller quantization error (see Fig. 5) does not necessarily lead to better performance. This indicates that there may exist an unexplored optimal solution that is not related to the quantization error. Nevertheless, it has been widely accepted in the literature that reducing the quantization error can improve BNN performance in most cases, which is also demonstrated by the experimental results (except for ResNet-20) in this paper. This paper addresses the quantization error problem as well.

5.5 Results on ImageNet

We also conduct similar experiments on ImageNet to validate the performance of SiMaN on a large-scale dataset. Two common networks, ResNet-18 and ResNet-34, are adopted for binarization. Table III shows the results of SiMaN and several other binarization methods. The performance of SiMaN on ImageNet also takes the lead. Specifically, with ResNet-18, SiMaN achieves 60.1% top-1 and 82.3% top-5 accuracies, with 0.2% and 0.4% improvements over its conference version, RBNN. By doubling the number of residual blocks and shrinking the per-block convolutions, ReActNet [55] achieves a 65.9% top-1 accuracy. To manifest the advantage of our discrete optimization, we further apply SiMaN to the modified network structure and obtain a better performance of 66.1%. With ResNet-34, SiMaN achieves a top-1 accuracy of 63.9% and a top-5 accuracy of 84.8%, outperforming RBNN by 0.8% and 0.4%, respectively.

The performance improvements in Table II and Table III strongly demonstrate the impact of exploring discrete optimization and the effectiveness of our magnitude-based discrete solution in constructing a high-performing BNN.

TABLE III: Comparison with the state-of-the-arts on ImageNet. W/A denotes the bit length of the weights and activations. Both top-1 and top-5 accuracies are reported. SiMaN (ReActNet setting) denotes using the same network and training setting as ReActNet [55].
Network Method W/A Top-1 (%) Top-5 (%)
ResNet-18 Full-precision 32/32 69.6 89.2
ResNet-18 BNN [32] 1/1 42.2 67.1
ResNet-18 XNOR-Net [25] 1/1 51.2 73.2
ResNet-18 DoReFa [57] 1/2 53.4 -
ResNet-18 HWGQ [22] 1/2 59.6 82.2
ResNet-18 TBN [63] 1/2 55.6 79.0
ResNet-18 Bi-Real [28] 1/1 56.4 79.5
ResNet-18 PDNN [52] 1/1 57.3 80.0
ResNet-18 BONN [30] 1/1 59.3 81.6
ResNet-18 Si-BNN [66] 1/1 59.7 81.8
ResNet-18 IR-Net [29] 1/1 58.1 80.0
ResNet-18 LNS [23] 1/1 59.4 81.7
ResNet-18 RBNN [24] 1/1 59.9 81.9
ResNet-18 SiMaN (Ours) 1/1 60.1 82.3
ResNet-18 ReActNet [55] 1/1 65.9 -
ResNet-18 SiMaN (ReActNet setting) 1/1 66.1 85.9
ResNet-34 Full-precision 32/32 73.3 91.3
ResNet-34 ABC-Net [27] 1/1 52.4 76.5
ResNet-34 Bi-Real [28] 1/1 62.2 83.9
ResNet-34 IR-Net [29] 1/1 62.9 84.1
ResNet-34 RBNN [24] 1/1 63.1 84.4
ResNet-34 SiMaN (Ours) 1/1 63.9 84.8

6 Conclusion

In this paper, we proposed a novel sign-to-magnitude network binarization (SiMaN) scheme that avoids the dependency on the sign function, so as to optimize a binary neural network for higher accuracy. Our SiMaN reformulates the angle alignment between the weight vector and its binarization, with the binarization constrained to \{0,+1\}. We proved that an analytical discrete solution can be attained in a computationally efficient manner by encoding into +1s the high-magnitude weights, and 0s otherwise. We also mathematically proved that the learned weights roughly follow a Laplacian distribution, which is harmful to bit entropy maximization. To address this problem, we showed that simply removing the \ell_{2} regularization during network training breaks the Laplacian distribution and leads to a half-half distribution of binarized weights. As a result, the complexity of our binarization can be further reduced by encoding into +1s the weights with the top-half magnitudes, and 0s otherwise. Our experimental results demonstrate the significant performance improvement of SiMaN.

Acknowledgement

This work was supported by the National Science Fund for Distinguished Young Scholars (No.62025603), the National Natural Science Foundation of China (No. U21B2037, No. 62176222, No. 62176223, No. 62176226, No. 62072386, No. 62072387, No. 62072389, and No. 62002305), Guangdong Basic and Applied Basic Research Foundation (No.2019B1515120049), and the Natural Science Foundation of Fujian Province of China (No.2021J01002).

References

  • [1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [2] Y. Wei, W. Xia, M. Lin, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan, “Hcp: A flexible cnn framework for multi-label image classification,” IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), vol. 38, no. 9, pp. 1901–1907, 2015.
  • [3] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), vol. 37, no. 9, pp. 1904–1916, 2015.
  • [4] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779–788.
  • [5] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun, “Object detection networks on convolutional feature maps,” IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), vol. 39, no. 7, pp. 1476–1481, 2016.
  • [6] X. Zhang, F. Wan, C. Liu, X. Ji, and Q. Ye, “Learning to match anchors for visual object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), vol. 44, no. 6, pp. 3096–3109, 2021.
  • [7] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3431–3440.
  • [8] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), vol. 39, no. 12, pp. 2481–2495, 2017.
  • [9] E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks for semantic segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), vol. 39, no. 4, pp. 640–651, 2017.
  • [10] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2015, pp. 1135–1143.
  • [11] J. Frankle and M. Carbin, “The lottery ticket hypothesis: Finding sparse, trainable neural networks,” in Proceedings of the International Conference on Learning Representations (ICLR), 2019.
  • [12] J.-H. Luo, H. Zhang, H.-Y. Zhou, C.-W. Xie, J. Wu, and W. Lin, “Thinet: pruning cnn filters for a thinner net,” IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), vol. 41, no. 10, pp. 2525–2538, 2018.
  • [13] M. Lin, R. Ji, Y. Wang, Y. Zhang, B. Zhang, Y. Tian, and L. Shao, “Hrank: Filter pruning using high-rank feature map,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1529–1538.
  • [14] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6848–6856.
  • [15] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “Shufflenet v2: Practical guidelines for efficient cnn architecture design,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 116–131.
  • [16] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
  • [17] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4510–4520.
  • [18] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan et al., “Searching for mobilenetv3,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 1314–1324.
  • [19] K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu, and C. Xu, “Ghostnet: More features from cheap operations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1580–1589.
  • [20] S. Lin, R. Ji, C. Chen, D. Tao, and J. Luo, “Holistic cnn compression via low-rank decomposition with knowledge transfer,” IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), vol. 41, no. 12, pp. 2889–2905, 2018.
  • [21] K. Hayashi, T. Yamaguchi, Y. Sugawara, and S.-i. Maeda, “Exploring unexplored tensor network decompositions for convolutional neural networks,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 5552–5562.
  • [22] Z. Cai, X. He, J. Sun, and N. Vasconcelos, “Deep learning with low precision by half-wave gaussian quantization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5918–5926.
  • [23] K. Han, Y. Wang, Y. Xu, C. Xu, E. Wu, and C. Xu, “Training binary neural networks through learning with noisy supervision,” in Proceedings of the International Conference on Machine Learning (ICML), 2020, pp. 4017–4026.
  • [24] M. Lin, R. Ji, Z. Xu, B. Zhang, Y. Wang, Y. Wu, F. Huang, and C.-W. Lin, “Rotated binary neural network,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2020, pp. 7474–7485.
  • [25] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural networks,” in Proceedings of the European Conference on Computer Vision (ECCV), 2016, pp. 525–542.
  • [26] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
  • [27] X. Lin, C. Zhao, and W. Pan, “Towards accurate binary convolutional neural network,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2017, pp. 345–353.
  • [28] Z. Liu, B. Wu, W. Luo, X. Yang, W. Liu, and K.-T. Cheng, “Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 722–737.
  • [29] H. Qin, R. Gong, X. Liu, M. Shen, Z. Wei, F. Yu, and J. Song, “Forward and backward information retention for accurate binary neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 2250–2259.
  • [30] J. Gu, J. Zhao, X. Jiang, B. Zhang, J. Liu, G. Guo, and R. Ji, “Bayesian optimized 1-bit cnns,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019, pp. 4909–4917.
  • [31] M. Courbariaux, Y. Bengio, and J.-P. David, “Binaryconnect: Training deep neural networks with binary weights during propagations,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2015, pp. 3123–3131.
  • [32] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1,” arXiv preprint arXiv:1602.02830, 2016.
  • [33] B. Martinez, J. Yang, A. Bulat, and G. Tzimiropoulos, “Training binary neural networks with real-to-binary convolutions,” in Proceedings of the International Conference on Learning Representations (ICLR), 2020.
  • [34] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Master’s thesis, University of Toronto, 2009.
  • [35] Y. Bengio, N. Léonard, and A. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” arXiv preprint arXiv:1308.3432, 2013.
  • [36] T. Simons and D.-J. Lee, “A review of binarized neural networks,” Electronics, vol. 8, no. 6, p. 661, 2019.
  • [37] H. Qin, R. Gong, X. Liu, X. Bai, J. Song, and N. Sebe, “Binary neural networks: A survey,” Pattern Recognition (PR), vol. 105, p. 107281, 2020.
  • [38] A. Bulat and G. Tzimiropoulos, “Xnor-net++: Improved binary neural networks,” arXiv preprint arXiv:1909.13863, 2019.
  • [39] Z. Xu, M. Lin, J. Liu, J. Chen, L. Shao, Y. Gao, Y. Tian, and R. Ji, “Recu: Reviving the dead weights in binary neural networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021, pp. 5198–5208.
  • [40] S. Darabi, M. Belbahri, M. Courbariaux, and V. P. Nia, “Bnn+: Improved binary network training,” arXiv preprint arXiv:1812.11800, 2018.
  • [41] Y. Xu, K. Han, C. Xu, Y. Tang, C. Xu, and Y. Wang, “Learning frequency domain approximation for binary neural networks,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2021, pp. 25553–25565.
  • [42] J. W. Peters and M. Welling, “Probabilistic binary neural networks,” arXiv preprint arXiv:1809.03368, 2018.
  • [43] O. Shayer, D. Levi, and E. Fetaya, “Learning discrete weights using the local reparameterization trick,” in Proceedings of the International Conference on Learning Representations (ICLR), 2018.
  • [44] H. Qin, Z. Cai, M. Zhang, Y. Ding, H. Zhao, S. Yi, X. Liu, and H. Su, “Bipointnet: Binary neural network for point clouds,” in Proceedings of the International Conference on Learning Representations (ICLR), 2022.
  • [45] C. Leng, Z. Dou, H. Li, S. Zhu, and R. Jin, “Extremely low bit neural network: Squeeze the last bit out with admm,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2018, pp. 3466–3473.
  • [46] M. Alizadeh, J. Fernández-Marqués, N. D. Lane, and Y. Gal, “An empirical study of binary neural networks’ optimisation,” in Proceedings of the International Conference on Learning Representations (ICLR), 2018.
  • [47] J. Bethge, H. Yang, M. Bornstein, and C. Meinel, “Back to simplicity: How to train accurate bnns from scratch?” arXiv preprint arXiv:1906.08637, 2019.
  • [48] K. Helwegen, J. Widdicombe, L. Geiger, Z. Liu, K.-T. Cheng, and R. Nusselder, “Latent weights do not exist: Rethinking binarized neural network optimization,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 7533–7544.
  • [49] A. G. Anderson and C. P. Berg, “The high-dimensional geometry of binary neural networks,” in Proceedings of the International Conference on Learning Representations (ICLR), 2018.
  • [50] Y. Wang, Y. Yang, F. Sun, and A. Yao, “Sub-bit neural networks: Learning to compress and accelerate binary neural networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021, pp. 5360–5369.
  • [51] T. Dockhorn, Y. Yu, E. Sari, M. Zolnouri, and V. Partovi Nia, “Demystifying and generalizing binaryconnect,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2021, pp. 13202–13216.
  • [52] J. Gu, C. Li, B. Zhang, J. Han, X. Cao, J. Liu, and D. Doermann, “Projection convolutional neural networks for 1-bit cnns via discrete back propagation,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2019, pp. 8344–8351.
  • [53] R. Ding, T.-W. Chin, Z. Liu, and D. Marculescu, “Regularizing activation distribution for training binarized deep networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 11408–11417.
  • [54] J. Hu, W. Ziheng, V. Tan, Z. Lu, M. Zeng, and E. Wu, “Elastic-link for binarized neural network,” arXiv preprint arXiv:2112.10149, 2021.
  • [55] Z. Liu, Z. Shen, M. Savvides, and K.-T. Cheng, “Reactnet: Towards precise binary neural network with generalized activation functions,” in Proceedings of the European Conference on Computer Vision (ECCV), 2020, pp. 143–159.
  • [56] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2016, pp. 4107–4115.
  • [57] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv preprint arXiv:1606.06160, 2016.
  • [58] M. Conforti, G. Cornuéjols, G. Zambelli et al., Integer Programming. Springer, 2014, vol. 271.
  • [59] R. Banner, Y. Nahshan, E. Hoffer, and D. Soudry, “Post-training 4-bit quantization of convolution networks for rapid-deployment,” arXiv preprint arXiv:1810.05723, 2018.
  • [60] K. Zhong, T. Zhao, X. Ning, S. Zeng, K. Guo, Y. Wang, and H. Yang, “Towards lower bit multiplication for convolutional neural network training,” arXiv preprint arXiv:2006.02804, 2020.
  • [61] C. M. Shakarji and V. Srinivasan, “Theory and algorithms for weighted total least-squares fitting of lines, planes, and parallel planes to support tolerancing standards,” Journal of Computing and Information Science in Engineering, vol. 13, no. 3, 2013.
  • [62] L. C. Andrews, Special Functions of Mathematics for Engineers. SPIE Press, 1998, vol. 49.
  • [63] D. Wan, F. Shen, L. Liu, F. Zhu, J. Qin, L. Shao, and H. Tao Shen, “Tbn: Convolutional neural network with ternary inputs and binary weights,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 315–332.
  • [64] R. Gong, X. Liu, S. Jiang, T. Li, P. Hu, J. Lin, F. Yu, and J. Yan, “Differentiable soft quantization: Bridging full-precision and low-bit neural networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019, pp. 4852–4861.
  • [65] Z. Yang, Y. Wang, K. Han, C. Xu, C. Xu, D. Tao, and C. Xu, “Searching for low-bit weights in quantized neural networks,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2020, pp. 4091–4102.
  • [66] P. Wang, X. He, G. Li, T. Zhao, and J. Cheng, “Sparsity-inducing binarized neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020, pp. 12192–12199.
  • [67] D. Zhang, J. Yang, D. Ye, and G. Hua, “Lq-nets: Learned quantization for highly accurate and compact deep neural networks,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 365–382.
  • [68] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 8026–8037.
  • [69] L. Hou, Q. Yao, and J. T. Kwok, “Loss-aware binarization of deep networks,” in Proceedings of the International Conference on Learning Representations (ICLR), 2016.
  • [70] Z. Wang, J. Lu, C. Tao, J. Zhou, and Q. Tian, “Learning channel-wise interactions for binary convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 568–577.
  • [71] J. Bethge, C. Bartz, H. Yang, Y. Chen, and C. Meinel, “Meliusnet: Can binary neural networks achieve mobilenet-level accuracy?” arXiv preprint arXiv:2001.05936, 2020.
  • [72] S. Zhu, X. Dong, and H. Su, “Binary ensemble neural network: More bits per network or more networks per bit?” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4923–4932.
Mingbao Lin received the Ph.D. degree in intelligence science and technology from Xiamen University, Xiamen, China, in 2022, after completing a combined M.S.-Ph.D. program, and the B.S. degree from Fuzhou University, Fuzhou, China, in 2016. He is currently a senior researcher with the Tencent Youtu Lab, Shanghai, China. He has published over ten first-author papers in top-tier journals and conferences, including IEEE TPAMI, IJCV, IEEE TIP, IEEE TNNLS, CVPR, NeurIPS, AAAI, IJCAI, and ACM MM. His current research interests include network compression & acceleration and information retrieval.
Rongrong Ji (Senior Member, IEEE) is currently a Professor and the Director of the Intelligent Multimedia Technology Laboratory, and the Dean Assistant with the School of Information Science and Engineering, Xiamen University, Xiamen, China. His work mainly focuses on innovative technologies for multimedia signal processing, computer vision, and pattern recognition, with over 100 papers published in international journals and conferences. He is a member of the ACM. He was a recipient of the ACM Multimedia Best Paper Award and the Best Thesis Award of Harbin Institute of Technology. He serves as an Associate/Guest Editor for international journals and magazines such as Neurocomputing, Signal Processing, Multimedia Tools and Applications, IEEE Multimedia Magazine, and Multimedia Systems. He also serves as a program committee member for several Tier-1 international conferences.
Zihan Xu received the B.S. degree in applied mathematics from Zhengzhou University, China, in 2019. He is currently pursuing the M.S. degree with Xiamen University, China. His research interests include computer vision and machine learning.
Baochang Zhang (Senior Member, IEEE) received the B.S., M.S., and Ph.D. degrees in computer science from the Harbin Institute of Technology, Harbin, China, in 1999, 2001, and 2006, respectively. From 2006 to 2008, he was a Research Fellow with The Chinese University of Hong Kong, Hong Kong, and also with Griffith University, Brisbane, Australia. He is currently a researcher with the Zhongguancun Lab, Beijing, China. His current research interests include pattern recognition, machine learning, face recognition, and wavelets.
Fei Chao (Member, IEEE) received the B.Sc. degree in mechanical engineering from Fuzhou University, Fuzhou, China, in 2004, the M.Sc. degree with distinction in computer science from the University of Wales, Aberystwyth, U.K., in 2005, and the Ph.D. degree in robotics from Aberystwyth University, Wales, U.K., in 2009. He is currently an Associate Professor with the School of Informatics, Xiamen University, Xiamen, China. He has authored/co-authored more than 50 peer-reviewed journal and conference papers. His current research interests include developmental robotics, machine learning, and optimization algorithms.
Chia-Wen Lin (Fellow, IEEE) received the Ph.D. degree in electrical engineering from National Tsing Hua University (NTHU), Hsinchu, Taiwan, in 2000. He is currently a Professor with the Department of Electrical Engineering and the Institute of Communications Engineering, NTHU, and also the Deputy Director of the AI Research Center of NTHU. He was with the Department of Computer Science and Information Engineering, National Chung Cheng University, Taiwan, during 2000–2007. Prior to joining academia, he worked for the Information and Communications Research Laboratories, Industrial Technology Research Institute, Hsinchu, Taiwan, during 1992–2000. His research interests include image and video processing, computer vision, and video networking. Dr. Lin served as a Distinguished Lecturer of the IEEE Circuits and Systems Society from 2018 to 2019, a Steering Committee member of the IEEE Transactions on Multimedia from 2014 to 2015, and the Chair of the Multimedia Systems and Applications Technical Committee of the IEEE Circuits and Systems Society from 2013 to 2015. His articles received the Best Paper Award of IEEE VCIP 2015 and the Young Investigator Award of VCIP 2005. He received the Outstanding Electrical Professor Award from the Chinese Institute of Electrical Engineering in 2019 and the Young Investigator Award from the Ministry of Science and Technology, Taiwan, in 2006. He is the Chair of the Steering Committee of IEEE ICME, served as a Technical Program Co-Chair for IEEE ICME 2010 and IEEE ICIP 2019, and served as a General Co-Chair for IEEE VCIP 2018. He has served as an Associate Editor of the IEEE Transactions on Image Processing, IEEE Transactions on Circuits and Systems for Video Technology, IEEE Transactions on Multimedia, IEEE Multimedia, and the Journal of Visual Communication and Image Representation.
Ling Shao (Fellow, IEEE) is the Chief Scientist of Terminus Group and the President of Terminus International. He was the founding CEO and Chief Scientist of the Inception Institute of Artificial Intelligence, Abu Dhabi, UAE. His research interests include computer vision, deep learning, medical imaging, and vision and language. He is a fellow of the IEEE, the IAPR, the BCS, and the IET.