
SiMaN: Sign-to-Magnitude Network Binarization

Mingbao Lin, Rongrong Ji, Zihan Xu, Baochang Zhang,
Fei Chao, Chia-Wen Lin, and Ling Shao
M. Lin, R. Ji (Corresponding Author), Z. Xu and F. Chao are with the Media Analytics and Computing Laboratory, Department of Artificial Intelligence, School of Informatics, Xiamen University, Xiamen 361005, China (e-mail: [email protected]). M. Lin and Z. Xu are also with the Tencent Youtu Lab, Shanghai 200233, China. B. Zhang is with the Zhongguancun Lab, Beijing 100190, China. C.-W. Lin is with the Department of Electrical Engineering and the Institute of Communications Engineering, National Tsing Hua University, Hsinchu 30013, Taiwan. L. Shao is with Terminus Group, China. Manuscript received April 19, 2005; revised August 26, 2015.
Abstract

Binary neural networks (BNNs) have attracted broad research interest due to their efficient storage and computational ability. Nevertheless, a significant challenge of BNNs lies in handling discrete constraints while ensuring bit entropy maximization, which typically makes their weight optimization very difficult. Existing methods relax the learning using the sign function, which simply encodes positive weights into +1s, and -1s otherwise. Alternatively, we formulate an angle alignment objective that constrains the weight binarization to \{0,+1\} to solve the challenge. In this paper, we show that our weight binarization provides an analytical solution by encoding high-magnitude weights into +1s, and 0s otherwise. Therefore, a high-quality discrete solution is established in a computationally efficient manner without the sign function. We prove that the learned weights of binarized networks roughly follow a Laplacian distribution that does not allow entropy maximization, and further demonstrate that it can be effectively solved by simply removing the \ell_{2} regularization during network training. Our method, dubbed sign-to-magnitude network binarization (SiMaN), is evaluated on CIFAR-10 and ImageNet, demonstrating its superiority over the sign-based state-of-the-arts. Our source code, experimental settings, training logs and binary models are available at https://github.com/lmbxmu/SiMaN.

Index Terms:
Binary neural network, network binarization, weight magnitude, angular alignment, network compression & acceleration, network quantization.

1 Introduction

Deep neural networks (DNNs), especially convolutional neural networks (CNNs), have been effectively used in many computer vision tasks, such as image recognition [1, 2, 3], object detection [4, 5, 6], and semantic segmentation [7, 8, 9]. Nowadays, DNNs are mostly trained on high-capacity but power-hungry graphics processing units (GPUs); however, such DNN models often fail to run on low-power devices, such as cell phones and Internet-of-Things (IoT) devices, that have become ubiquitous in modern society. As a result, substantial efforts have been invested in reducing model redundancy while retaining comparable or even better accuracy than the full model, such that the compressed model can be easily deployed on these resource-limited devices.

Typical methods for reducing the model redundancy include, but are not limited to: (1) Weight pruning discards individual weights in the filters or connections across different layers, and then reshapes the model in a sparse format [10, 11]. (2) Filter pruning resorts to directly removing all weights in a filter and the corresponding channel in the next layer [12, 13]. (3) Compact network designs, such as ShuffleNets [14, 15], MobileNets [16, 17, 18] and GhostNet [19], choose to directly build parameter-efficient neural network models. (4) Tensor decomposition approximates the weight tensor with a series of low-rank matrices, which are then reorganized in a sum-product form [20, 21] to recover the original weight tensor. (5) Low-precision quantization aims to compress the model by reducing the number of bits used to represent the weight parameters of the pre-trained models [22, 23, 24].

In particular, binary neural networks (BNNs), which quantize their weights and activations in a 1-bit binary form, have attracted increasing attention for two major reasons: 1) The memory usage of a BNN is 32× lower than that of its full-precision counterpart, since the weights of the latter are stored in a 32-bit floating-point form. 2) A significant reduction in computational complexity can be achieved by executing efficient XNOR and bitcount operations, e.g., up to 58× speed-ups on CPUs as reported in [25]. Despite these two merits, BNNs are also notorious for their significant performance degradation. For example, XNOR-Net [25] suffers an approximately 18% drop in top-1 accuracy when binarizing ResNet-18 on the ImageNet classification task [26]. The poor performance greatly hinders the deployment of BNNs in real-world applications.

One of the major obstacles in constructing a high-performing BNN is the discrete constraint imposed on the pursued binary weights, which challenges the weight optimization. Meanwhile, BNNs also require the two possible values of binarized weights to be uniformly (half-half) distributed to ensure bit entropy maximization. To this end, most existing approaches simply employ the sign function to binarize weights, where positive weights are encoded into +1s, and -1s otherwise [25, 27, 28, 29, 24]. To retain the entropy of information, recent methods, such as Bayesian optimization [30], rotation matrix [24], and weight standardization [29], learn a two-mode distribution for real-valued weights to increase the probability of encoding one half of the weights into +1s and the other half into -1s by the sign function. These strategies, however, increase the learning complexity, since the optimization involves additional training loss terms and variables. Moreover, it is unclear whether the simple usage of the sign function is the optimal encoding option for the weight binarization process.

Figure 1: (a) Early works [31, 32] suffer from a large quantization error caused by both the norm gap and the angular bias between the full-precision weights and their binarized versions. (b) Recent works [25, 29] introduce a scaling factor to reduce the norm gap but cannot reduce the angular bias, i.e., \theta. Therefore, the quantization error \|\mathbf{w}\sin\theta\|^{2} is still large when \theta is large.

Another obstacle in learning BNNs comes from the large quantization error between the full-precision weight vector \mathbf{w} and its binary vector \mathbf{b} [31, 32], as illustrated in Fig. 1(a). To solve this, state-of-the-art approaches [25, 29] introduce a per-channel learnable/optimizable scaling factor \lambda to decrease the quantization error

\mathop{\min}_{\lambda,\mathbf{b}}\|\lambda\mathbf{b}-\mathbf{w}\|^{2}. \quad (1)

However, as revealed in the earlier version of this paper [24], the introduction of \lambda only partly mitigates the quantization error by compensating for the norm gap between the full-precision weight and its binarized version, but cannot reduce the quantization error caused by an angular bias, as shown in Fig. 1(b). Apparently, with a fixed angular bias \theta, Eq. (1) reaches its minimum when \lambda\mathbf{b}-\mathbf{w} is orthogonal to \lambda\mathbf{b}, and we have

\|\mathbf{w}\sin\theta\|^{2}\leq\|\lambda\mathbf{b}-\mathbf{w}\|^{2}. \quad (2)

Thus, \|\mathbf{w}\sin\theta\|^{2} serves as a lower bound of the quantization error and cannot be diminished as long as the angular bias exists. This lower bound can be huge when the angular bias \theta is large. Though the training process updates the weights and may close the angular bias, we experimentally observe that the possibility of this case is small, as illustrated by XNOR-Net [25] in Fig. 3. Thus, it is natural to further reduce this angular error so as to minimize the quantization error and obtain better BNN performance.
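As a quick numerical illustration of this lower bound, the following minimal NumPy sketch (with a randomly drawn weight vector, purely for illustration) computes the quantization error of Eq. (1) under the optimal scaling factor and compares it with the bound of Eq. (2); the two coincide, confirming that the scaling factor removes the norm gap but not the angular bias.

import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(64)                 # a full-precision weight vector
b = np.sign(w)                              # sign-based binarization in {-1, +1}
b[b == 0] = 1.0

lam = (w @ b) / (b @ b)                     # optimal scaling factor for Eq. (1)
quant_err = np.sum((lam * b - w) ** 2)      # quantization error with the optimal lambda

cos_theta = (w @ b) / (np.linalg.norm(w) * np.linalg.norm(b))
lower_bound = np.sum(w ** 2) * (1.0 - cos_theta ** 2)   # ||w sin(theta)||^2 in Eq. (2)
print(quant_err, lower_bound)               # the two values are equal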

To solve the angular bias, the earlier version [24] proposed an angle-alignment-based learning objective, originally formulated as

\mathop{\arg\max}_{\mathbf{R}}\;\frac{\operatorname{sign}(\mathbf{R}^{T}\mathbf{w})^{T}(\mathbf{R}^{T}\mathbf{w})}{\|\operatorname{sign}(\mathbf{R}^{T}\mathbf{w})\|_{2}\,\|\mathbf{R}^{T}\mathbf{w}\|_{2}},\quad s.t.\quad\mathbf{R}^{T}\mathbf{R}=\mathbf{I}_{n}, \quad (3)

where \mathbf{R} is constrained to be an n-order rotation matrix. As shown in Fig. 2(a), by applying the sign function to the rotated weight vector \mathbf{R}^{T}\mathbf{w}, we attain the binarization of \mathbf{w}, i.e., \mathbf{b}_{w}=\operatorname{sign}(\mathbf{R}^{T}\mathbf{w}). Thus, Eq. (3) aims to learn a rotation matrix such that the angle bias between the rotated weight vector and its encoded binarization is reduced, as illustrated by RBNN [24] (conference version) in Fig. 3. Though a great reduction in quantization error was quantitatively measured in [24], the learning complexity of the rotation matrix \mathbf{R} is very high due to the non-convexity of Eq. (3). Thus, an alternating optimization approach was developed. Nevertheless, the alternating optimization results in sub-optimal binarization. Moreover, the optimization is still built upon the sign function. Note that, in Fig. 3, we also train XNOR-Net and RBNN with the two-step training paradigm [33]. We can see that the angular bias remains similar to that obtained with the commonly used from-scratch training. Thus, training BNNs with different strategies does not correct the angular bias.

Figure 2: Comparison between (a) the preliminary version of RBNN [24] and (b) the extended version, SiMaN, in this paper. RBNN learns a rotation matrix \mathbf{R} first, and then applies the sign function to binarize the rotated weight, \mathbf{b}_{w}=\operatorname{sign}(\mathbf{R}^{T}\mathbf{w})\in\{-1,+1\}^{n}. In contrast, the presented SiMaN involves the magnitude of the weight, and then discretely learns \bar{\mathbf{b}}_{w}\in\{0,+1\}^{n}.

In this paper, a novel sign-to-magnitude network binarization (SiMaN) is proposed to discretely encode DNNs, leading to improved accuracy. Within our method, we reformulate the angle alignment objective in the conference version [24], which aims to maximize the cosine similarity between the full-precision weight vector and its encoded binarization. Different from existing works that binarize weights into \{-1,+1\} by the sign function, our binarization falls into \{0,+1\}, as illustrated in Fig. 2(b). In this way, we reveal that the globally analytical binarization for our angle alignment can be found in a computationally efficient manner of \mathcal{O}(n\log n) by quantizing into +1s the high-magnitude weights, and 0s otherwise, therefore enabling weight binarization without the sign function. To the best of our knowledge, we prove for the first time that the learned real-valued weights roughly follow a Laplacian distribution, which results in around 37% of the weights being encoded into +1s. This prevents the BNN from maximizing the entropy of information. To solve this, we do not add a term to the loss function, since this would increase the optimization difficulty. Alternatively, we analyze the intrinsic numerical values of the weights, and show that the simple removal of the \ell_{2} regularization destroys the Laplacian distribution, and thus enhances the half-half weight binarization. As a result, the final binarization is obtained by encoding into +1s the weights with the top-half magnitudes, and 0s otherwise, which further reduces the computational complexity from \mathcal{O}(n\log n) to \mathcal{O}(n).

Figure 3: Cosine similarity between the full-precision weight vector and the corresponding binary vector in various layers of ResNet-20.

A preliminary conference version of this work was presented in [24]. The main contributions of this paper are as follows.

  • A new learning objective based on the angle alignment is proposed and a magnitude-based analytical solution for BNNs is developed in a computationally efficient manner.

  • We formally prove that the learned weights in BNNs follow a Laplacian distribution, which, as revealed, prevents the maximization of bit entropy.

  • A detailed analysis of the numerical values of weights shows that simply removing the \ell_{2} regularization helps maximize the bit entropy while further reducing the computational complexity.

  • Experiments on CIFAR-10 [34] and ImageNet [26] demonstrate that our sign-to-magnitude framework for network binarization outperforms the traditional sign-based binarization.

2 Related Work

Following the introduction of pioneering research [32] where the sign function and the straight-through estimator (STE) [35] are respectively adopted for the forward weight/activation binarization and backward gradient updating, BNNs have emerged as one of the most appealing approaches for the deployment of DNNs in resource-limited devices. As such, great efforts have been put into closing the gap between full-precision networks and their BNNs. In what follows, we briefly review some related works. A comprehensive overview can be found in the survey papers [36, 37].

XNOR-Net [25] introduces two scaling factors for channel-wise weights and activations to minimize the quantization error. Inspired by this, XNOR-Net++ [38] improves the performance by integrating the two scaling factors into one, which is then updated using standard gradient propagation. Beyond the scaling factors, RBNN [24] further reduces the quantization error by optimizing the angle difference between a full-precision weight vector and its binarization. Xu et al. [39] observed “dead weights” in binary neural networks and proposed to mitigate the quantization error by clipping large-magnitude weights to a fixed element. To enable gradient propagation and reduce the “gradient mismatch” caused by the STE [35], several works, such as the swish function [40], piece-wise polynomial function [28], and error decay estimator [29], formulate the forward/backward quantization as a differentiable non-linear mapping. FDA [41] estimates the gradient of the sign function in the Fourier frequency domain using a combination of sine functions for training BNNs.

Another direction circumvents the gradient approximation of the sign function by sampling from the weight distribution [42, 43]. Qin et al. [44] introduce entropy-maximizing aggregation to modulate the distribution for the maximum information entropy, and layer-wise scale recovery to restore the feature representation capacity. There are also abundant works that explore the optimization of BNNs [45, 46, 47, 48, 33] and explain their effectiveness [49]. Wang et al. [50] proposed to train BNNs under a kernel-aware optimization framework. ProxConnect (PC) [51] generalizes and improves BinaryConnect (BC) with well-established theory and algorithms. Recent works [40, 52] embed various regularization terms into the training loss to binarize the weights and control the activation ranges [53]. Hu et al. [54] added real-valued input features to the subsequent convolutional output features to enrich the information flow within a BNN. Moreover, other recent studies devise binarization-friendly structures to boost the performance. For example, Bi-Real [28] designs double residual connections with full-precision downsampling layers. XNOR-Net++ [38] replaces ReLU with PReLU. ReActNet [55] adds parameter-free shortcuts on MobileNetV1 [16] and replaces the group convolutions with regular convolutions.

3 Binary Neural Networks

For an L-layer CNN model, we denote \mathbf{W}^{i}=\{\mathbf{w}^{i}_{1},\mathbf{w}^{i}_{2},...,\mathbf{w}^{i}_{c^{i}_{out}}\}\in\mathbb{R}^{n^{i}\times c^{i}_{out}} as the real-valued weight set of the i-th layer, where \mathbf{w}^{i}_{j}\in\mathbb{R}^{n^{i}} denotes the j-th weight. The real-valued input activations of the i-th layer are represented as \mathbf{A}^{i}=\{\mathbf{a}^{i}_{1},\mathbf{a}^{i}_{2},...,\mathbf{a}^{i}_{c^{i}_{in}}\}\in\mathbb{R}^{m^{i}\times c^{i}_{in}}; here, c^{i}_{out} and c^{i}_{in} respectively represent the numbers of output and input channels, and n^{i} and m^{i} denote the size of each weight and input, respectively. Then, the convolution result can be expressed as

\mathbf{a}^{i+1}_{j}=\mathbf{w}^{i}_{j}\circledast\mathbf{A}^{i}, \quad (4)

where \circledast stands for the convolution operation. For simplicity, we omit the non-linear layer here.

BNN Training. To train a BNN, the real-valued \mathbf{w}^{i}_{j} and \mathbf{A}^{i} in Eq. (4) are quantized into binary values (\mathbf{b}_{w})^{i}_{j}\in\{-1,+1\}^{n^{i}} and (\mathbf{B}_{A})^{i}\in\{-1,+1\}^{c^{i}_{in}\times m^{i}}, respectively. As a result, the convolution result can be approximated as

\mathbf{a}^{i+1}_{j}\approx{\beta}^{i}_{j}\cdot(\mathbf{b}_{w})_{j}^{i}\circledast(\mathbf{B}_{A})^{i}, \quad (5)

where {\beta}^{i}_{j} is a channel-level scaling factor [25, 38].

For the implementation of BNN training, the forward calculation is fulfilled by conducting the convolution between (\mathbf{b}_{w})_{j}^{i} and (\mathbf{B}_{A})^{i} in Eq. (5), whereas their real-valued counterparts, \mathbf{w}_{j}^{i} and \mathbf{A}^{i}, are updated during backpropagation. To this end, following existing studies [56, 38, 24], the activation binarization in this work is simply realized by the sign function as

(\mathbf{B}_{A})^{i}=\operatorname{sign}(\mathbf{A}^{i})=\begin{cases}+1, & \text{if }\mathbf{A}^{i}\geq 0,\\ -1, & \text{otherwise.}\end{cases} \quad (6)

In the backpropagation phase, we adopt the piece-wise polynomial function [28] to approximate the gradient of a given loss \mathcal{L} w.r.t. the input activations \mathbf{A}^{i} as follows

\frac{\partial\mathcal{L}}{\partial\mathbf{A}^{i}}=\frac{\partial\mathcal{L}}{\partial(\mathbf{B}_{A})^{i}}\cdot\frac{\partial(\mathbf{B}_{A})^{i}}{\partial\mathbf{A}^{i}}\approx\frac{\partial\mathcal{L}}{\partial(\mathbf{B}_{A})^{i}}\cdot\frac{\partial F(\mathbf{A}^{i})}{\partial\mathbf{A}^{i}}, \quad (7)

where \frac{\partial F(\mathbf{A}^{i})}{\partial\mathbf{A}^{i}} is defined by

\frac{\partial F(\mathbf{A}^{i})}{\partial\mathbf{A}^{i}}=\begin{cases}2+2\mathbf{A}^{i}, & \text{if }-1\leq\mathbf{A}^{i}<0,\\ 2-2\mathbf{A}^{i}, & \text{if }\;\;0\leq\mathbf{A}^{i}<1,\\ 0, & \text{otherwise.}\end{cases} \quad (8)

Besides, the STE [35] is used to calculate the gradient of the loss \mathcal{L} w.r.t. the weight \mathbf{w}_{j}^{i} as

\frac{\partial\mathcal{L}}{\partial\mathbf{w}_{j}^{i}}=\frac{\partial\mathcal{L}}{\partial(\mathbf{b}_{w})_{j}^{i}}\cdot\frac{\partial(\mathbf{b}_{w})_{j}^{i}}{\partial\mathbf{w}_{j}^{i}}\approx\frac{\partial\mathcal{L}}{\partial(\mathbf{b}_{w})_{j}^{i}}. \quad (9)
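In practice, Eqs. (6)-(9) can be implemented as custom autograd functions. Below is a minimal PyTorch sketch (class and variable names are ours, not from the released code); the forward pass applies the sign function, while the backward pass substitutes the approximate gradients of Eq. (8) and Eq. (9).

import torch

class BinarizeActivation(torch.autograd.Function):
    # sign() in the forward pass (Eq. (6)); piece-wise polynomial gradient in the backward pass (Eqs. (7)-(8))
    @staticmethod
    def forward(ctx, a):
        ctx.save_for_backward(a)
        return torch.where(a >= 0, torch.ones_like(a), -torch.ones_like(a))

    @staticmethod
    def backward(ctx, grad_out):
        a, = ctx.saved_tensors
        grad_mask = torch.zeros_like(a)
        grad_mask = torch.where((a >= -1) & (a < 0), 2 + 2 * a, grad_mask)
        grad_mask = torch.where((a >= 0) & (a < 1), 2 - 2 * a, grad_mask)
        return grad_out * grad_mask

class BinarizeWeightSTE(torch.autograd.Function):
    # forward returns the pre-computed binary weight; backward passes the gradient straight through (Eq. (9))
    @staticmethod
    def forward(ctx, w, b_w):
        return b_w

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None

# usage: B_A = BinarizeActivation.apply(A); b = BinarizeWeightSTE.apply(w, b_w)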

BNN Inference. In practical deployment, the BNN model is accelerated using the efficient XNOR and bitcount logics embedded in the hardware. Thus, the quantized weights and activations need to be further transformed back into the \{0,1\} space. Such a transformation process can be realized by setting

(\bar{\mathbf{B}}_{A})^{i}=\big(1+(\mathbf{B}_{A})^{i}\big)/2, \quad (10)
(\bar{\mathbf{b}}_{w})^{i}_{j}=\big(1+(\mathbf{b}_{w})^{i}_{j}\big)/2. \quad (11)

Then, the approximated convolution in Eq. (5) can be replaced by the following equality

\mathbf{a}^{i+1}_{j}\approx{\beta}^{i}_{j}\cdot\big(2\cdot(\bar{\mathbf{b}}_{w})_{j}^{i}\odot(\bar{\mathbf{B}}_{A})^{i}-n^{i}\big), \quad (12)

where \odot represents the XNOR and bitcount operations that are well-fitted for real-time network inference.
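The identity behind Eq. (12) is that the inner product of two ±1 vectors equals twice the bitcount of their XNOR in \{0,1\} space minus n. A small NumPy sketch (random vectors, \beta omitted for clarity) checks this:

import numpy as np

rng = np.random.default_rng(0)
n = 128
bw_bar = rng.integers(0, 2, n)            # \bar{b}_w in {0, 1}
ba_bar = rng.integers(0, 2, n)            # \bar{B}_A in {0, 1}

bw, ba = 2 * bw_bar - 1, 2 * ba_bar - 1   # {-1, +1} counterparts (inverse of Eqs. (10)-(11))

dot = int(bw @ ba)                        # +/-1 inner product used during training
xnor_bitcount = int(np.sum(bw_bar == ba_bar))   # XNOR followed by bitcount
assert dot == 2 * xnor_bitcount - n       # the equality used in Eq. (12)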

Our Insight. In this paper, we focus on binarizing the real-valued weight \mathbf{w}_{j}^{i}. Different from most existing works [56, 57, 38, 24] that project the weights \mathbf{w}_{j}^{i} into (\mathbf{b}_{w})_{j}^{i}\in\{-1,+1\}^{n^{i}} using the sign function during training and then transform (\mathbf{b}_{w})_{j}^{i} into (\bar{\mathbf{b}}_{w})_{j}^{i}\in\{0,+1\}^{n^{i}} for inference, we seek to directly encode the weights into \bar{\mathbf{b}}_{w}\in\{0,+1\}^{n^{i}} and then devise an efficient optimization to attain the optimal solution in Sec. 4.1. We demonstrate in Sec. 4.2 that the weight \mathbf{w}_{j}^{i} roughly follows a Laplacian distribution, which inhibits the entropy maximization. We reveal that this can be easily addressed by removing the \ell_{2} regularization in Sec. 4.3.

For simplicity, the scripts "i" and "j" are omitted in what follows.

4 Weight Binarization

In this section, we specify the formulation of our weight binarization, including the binary learning objective, weight distribution, and bit entropy maximization.

4.1 Learning Objective

To achieve high-quality weight binarization, different from the conference version [24], we reformulate the learning objective in Eq. (3) as

\mathop{\arg\max}_{\bar{\mathbf{b}}_{w}}\;\frac{(\bar{\mathbf{b}}_{w})^{T}|\mathbf{w}|}{\|\bar{\mathbf{b}}_{w}\|_{2}\,\big\||\mathbf{w}|\big\|_{2}},\quad s.t.\quad\bar{\mathbf{b}}_{w}\in\{0,+1\}^{n}, \quad (13)

where |\cdot| returns the element-wise absolute value of its input.

As can be seen, our learning objective is also built on the basis of angle alignment. Nevertheless, our method differs from Eq. (3) in many aspects: First, we drop the sign function, since the variables in a binarized network must be retained in a discrete set; thus, the binarization should be built upon the concept of discrete optimization rather than the simple sign function. Second, we encode the weights into \bar{\mathbf{b}}_{w}\in\{0,+1\}^{n} rather than \mathbf{b}_{w}\in\{-1,+1\}^{n}. In Corollary 1, we show that \bar{\mathbf{b}}_{w}\in\{0,+1\}^{n} allows us to find an analytical solution in an efficient manner by transferring the high-magnitude weights to +1s, and 0s otherwise. Third, our angle alignment is independent of the rotation matrix \mathbf{R}, since we remove the sign function, which makes the rotation direction unpredictable. Lastly, we align the angle difference between the binarization and the absolute weight vector |\mathbf{w}|, instead of the weight \mathbf{w} itself. The rationale behind this is that our binarization falls into the non-negative set \{0,+1\}^{n}. Fig. 2(b) outlines our binarization process.

Note that \big\||\mathbf{w}|\big\|_{2} is irrelevant to the optimization of Eq. (13). Thus, the learning can be simplified to

\mathop{\arg\max}_{\bar{\mathbf{b}}_{w}}\;\frac{(\bar{\mathbf{b}}_{w})^{T}}{\|\bar{\mathbf{b}}_{w}\|_{2}}|\mathbf{w}|,\quad s.t.\quad\bar{\mathbf{b}}_{w}\in\{0,+1\}^{n}. \quad (14)

This is an integer programming problem [58]. Nevertheless, as demonstrated in Corollary 1, by learning in the encoding space \{0,+1\}^{n}, we can reach the global maximum in a substantially efficient fashion.

Corollary 1. For Eq. (14), the computational complexity of finding the global optimum is \mathcal{O}(n\log{n}).

Proof: For \bar{\mathbf{b}}_{w}\in\{0,+1\}^{n}, it is intuitive to see that \|\bar{\mathbf{b}}_{w}\|_{2} falls into the set \{\sqrt{1},...,\sqrt{n}\}. Considering that \|\bar{\mathbf{b}}_{w}\|_{2}=\sqrt{k} (k=1,...,n), the integer programming problem in Eq. (14) can be maximized by encoding into +1s those elements of \bar{\mathbf{b}}_{w} that correspond to the largest k entries of |\mathbf{w}|. To this end, we need to perform sorting upon |\mathbf{w}|, for which the complexity is \mathcal{O}(n\log{n}). Since k has n possible values, we need to evaluate Eq. (14) n times and then select the \bar{\mathbf{b}}_{w} that maximizes the objective function, leading to a complexity linear in n. Hence, the overall complexity is \mathcal{O}(n\log{n}). ∎

Therefore, given one filter weight \mathbf{w}\in\mathbb{R}^{n}, we can find the binarization \bar{\mathbf{b}}_{w}\in\{0,+1\}^{n} having the smallest angle with |\mathbf{w}|. Note that the \bar{\mathbf{b}}_{w} found in this way is the global optimum. Furthermore, we emphasize that the proof of Corollary 1 indicates that the binarization in our framework involves the magnitudes of the weights instead of the signs of the weights, which significantly differentiates our work from existing works. In the next two sections, we show that the overall complexity can be further reduced to \mathcal{O}(n), given the bit entropy maximization.
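A minimal NumPy sketch of this magnitude-based search is given below (the function name is ours); sorting |\mathbf{w}| once and using prefix sums evaluates the objective of Eq. (14) for every k without redundant computation.

import numpy as np

def siman_binarize(w):
    # global maximizer of Eq. (14) over {0,1}^n; the sort dominates the cost, giving O(n log n)
    mag = np.abs(w)
    order = np.argsort(-mag)                      # indices of |w| in descending order
    prefix = np.cumsum(mag[order])                # sum of the k largest magnitudes, for every k
    scores = prefix / np.sqrt(np.arange(1, w.size + 1))
    k_star = int(np.argmax(scores)) + 1           # best number of +1s
    b = np.zeros_like(w)
    b[order[:k_star]] = 1.0                       # high-magnitude weights -> +1, the rest -> 0
    return b

b = siman_binarize(np.random.default_rng(0).standard_normal(4608))
print(b.mean())    # fraction of weights encoded into +1s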

4.2 Weight Distribution

The capacity of a binarized model, often measured by the bit entropy, is maximized when the binarization is half-half distributed, i.e., one half of the weights are encoded into 0 and the other half are encoded into +1 [24, 29]. In this case, we expect to maximize our objective in Eq. (14) when the weights with the top-half magnitudes are encoded into +1s and the remaining ones are encoded into 0s. However, we reveal that it is difficult to binarize \mathbf{w} with entropy maximization due to its specific form of distribution.

Specifically, after training, w\in\mathbf{w} is widely believed to roughly obey a zero-mean Laplacian distribution, i.e., w\sim La(0,b), or a zero-mean Gaussian distribution, i.e., w\sim\mathcal{N}(0,\sigma^{2}) [22, 59, 60]. In Corollary 2, for the first time, we demonstrate its specific distribution.

Figure 4: Proportion of +1s trained (a) with and (b) without the \ell_{2} regularization across different layers in different networks (panels: Layer1.1.2, Layer2.1.2 and Layer4.1.2 of ResNet-18; Layer2.1.2, Layer2.2.2 and Layer3.2.1 of ResNet-20). The dashed blue lines denote the average proportions of +1s over all filter weights.

Corollary 2. w\in\mathbf{w} roughly follows a zero-mean Laplacian distribution.

Proof: Suppose w is encoded into +1 if |w|>t, and 0 otherwise. Note that the learning of Eq. (14) can actually be regarded as a problem of finding the centroid of a subset [61] as well. Consequently, the learning process can be calculated by the integral as

\frac{(\bar{\mathbf{b}}_{w})^{T}}{\|\bar{\mathbf{b}}_{w}\|_{2}}|\mathbf{w}|=\frac{\int^{-t}_{-\infty}|w|f(w)dw+\int_{t}^{+\infty}wf(w)dw}{\sqrt{\int^{-t}_{-\infty}f(w)dw+\int_{t}^{+\infty}f(w)dw}}=\frac{2\int_{t}^{+\infty}wf(w)dw}{\sqrt{2\int_{t}^{+\infty}f(w)dw}}, \quad (15)

where f(w) represents the probability density function of w. Then, we denote p_{+1} as the proportion of \mathbf{w} being encoded into +1s. Intuitively, its calculation can be derived as

p_{+1}=1-2\int^{t}_{0}f(w)dw. \quad (16)

To demonstrate our corollary, we first derive the theoretical values of p_{+1} when f(w) follows a Laplacian or Gaussian distribution, and then experimentally complete our final proof.

Laplacian distribution. In this case, we have f(w)=\frac{1}{2b}e^{-|w|/b}. Therefore, Eq. (15) becomes

\frac{(\bar{\mathbf{b}}_{w})^{T}}{\|\bar{\mathbf{b}}_{w}\|_{2}}|\mathbf{w}|=\frac{2\int_{t}^{+\infty}\frac{w}{2b}e^{-w/b}dw}{\sqrt{2\int_{t}^{+\infty}\frac{1}{2b}e^{-w/b}dw}}=\frac{(b+t)e^{-t/b}}{\sqrt{e^{-t/b}}}=(b+t)\sqrt{e^{-t/b}}. \quad (17)

Setting \frac{\partial(b+t)\sqrt{e^{-t/b}}}{\partial t}=0 to attain the maximum of Eq. (17), we have t=b. The proportion of +1s can then be obtained as

p_{+1}=1-2\int_{0}^{t}\frac{1}{2b}e^{-w/b}dw\approx 0.37. \quad (18)
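For completeness, the intermediate steps behind t=b and the value 0.37 are:

\frac{\partial}{\partial t}\Big[(b+t)\,e^{-t/(2b)}\Big]=e^{-t/(2b)}\Big(1-\frac{b+t}{2b}\Big)=0\;\Longrightarrow\;t=b,\qquad p_{+1}=1-2\int_{0}^{b}\frac{1}{2b}e^{-w/b}dw=e^{-1}\approx 0.368.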

Gaussian distribution. In this case, we have f(w)=\frac{1}{\sqrt{2\pi}\sigma}e^{-w^{2}/(2\sigma^{2})}. Similarly, Eq. (15) can be rewritten as

\frac{(\bar{\mathbf{b}}_{w})^{T}}{\|\bar{\mathbf{b}}_{w}\|_{2}}|\mathbf{w}|=\frac{2\int_{t}^{+\infty}\frac{w}{\sqrt{2\pi}\sigma}e^{-w^{2}/(2\sigma^{2})}dw}{\sqrt{2\int_{t}^{+\infty}\frac{1}{\sqrt{2\pi}\sigma}e^{-w^{2}/(2\sigma^{2})}dw}}=\frac{\frac{\sigma}{\sqrt{2\pi}}e^{-t^{2}/(2\sigma^{2})}}{\sqrt{\frac{1}{2}\operatorname{erfc}(\frac{t}{\sqrt{2}\sigma})}}, \quad (19)

where \operatorname{erfc}(\cdot) represents the well-known complementary error function [62].

Let m=\frac{t}{\sqrt{2}\sigma}; then Eq. (19) can be written as

\frac{(\bar{\mathbf{b}}_{w})^{T}}{\|\bar{\mathbf{b}}_{w}\|_{2}}|\mathbf{w}|=\frac{\sigma}{\sqrt{\pi}}\cdot\frac{e^{-m^{2}}}{\sqrt{\operatorname{erfc}(m)}}. \quad (20)

Setting \frac{\partial\, e^{-m^{2}}/\sqrt{\operatorname{erfc}(m)}}{\partial m}=0, we have m=\frac{t}{\sqrt{2}\sigma}\approx 0.43. Thus, we obtain t\approx 0.43\sqrt{2}\sigma, and then derive the proportion of +1s:

p_{+1}=1-2\int_{0}^{t}\frac{1}{\sqrt{2\pi}\sigma}e^{-w^{2}/(2\sigma^{2})}dw\approx 0.54. \quad (21)
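Equivalently, Eq. (21) can be evaluated in terms of the complementary error function:

p_{+1}=1-\operatorname{erf}\Big(\frac{t}{\sqrt{2}\sigma}\Big)=\operatorname{erfc}(m)\approx\operatorname{erfc}(0.43)\approx 0.54.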

As mentioned above, the trained weight w\in\mathbf{w} obeys either a Laplacian distribution (with p_{+1}\approx 0.37) or a Gaussian distribution (with p_{+1}\approx 0.54). In Fig. 4(a), we conduct an experiment which shows a practical p_{+1} of around 0.36–0.38 after training (similar phenomena can be observed in other layers and networks as well). This implies that w\in\mathbf{w} follows a Laplacian distribution, which completes our proof. ∎
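The two theoretical proportions can also be reproduced numerically. The following minimal NumPy sketch (synthetic weights only, not trained ones) maximizes Eq. (14) over the number of +1s and reports the resulting fraction for Laplacian- and Gaussian-distributed weights.

import numpy as np

def plus_one_fraction(mag):
    # fraction of weights encoded into +1 by maximizing Eq. (14), given the magnitudes |w|
    s = np.sort(mag)[::-1]
    scores = np.cumsum(s) / np.sqrt(np.arange(1, s.size + 1))
    return (np.argmax(scores) + 1) / s.size

rng = np.random.default_rng(0)
lap = np.abs(rng.laplace(scale=0.05, size=100_000))   # |w| for Laplacian weights
gau = np.abs(rng.normal(scale=0.05, size=100_000))    # |w| for Gaussian weights
print(plus_one_fraction(lap))   # close to 0.37 (Eq. (18))
print(plus_one_fraction(gau))   # close to 0.54 (Eq. (21))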

4.3 Maximizing Bit Entropy

The proof of Corollary 1 indicates that the binarization in our framework is related to the weight magnitude, i.e., |\mathbf{w}|. However, the Laplacian distribution contradicts the entropy maximization. To solve this, one naive solution is to assign +1 to the top half of the elements of the sorted |\mathbf{w}| and 0 to the remaining elements, that is,

\tilde{\mathbf{b}}_{w}=\begin{cases}+1, & \text{top half of sorted }|\mathbf{w}|,\\ 0, & \text{otherwise.}\end{cases} \quad (22)

Despite helping achieve entropy maximization, such a simple operation violates the learning objective of minimizing the angular bias in Eq. (14), since \tilde{\mathbf{b}}_{w} deviates significantly from the optimal \bar{\mathbf{b}}_{w}, as revealed in Corollary 3.

Corollary 3. Suppose \bar{\mathbf{b}}_{w} is a binarized vector with a total of k +1s and the binarized vector \tilde{\mathbf{b}}_{w} has r bits different from \bar{\mathbf{b}}_{w}. Then the angle between \bar{\mathbf{b}}_{w} and \tilde{\mathbf{b}}_{w} is bounded by \big[\arccos\sqrt{\frac{k}{k+r}},\arccos\sqrt{\frac{k-r}{k}}\big].

Proof: To find the lower bound, we need to obtain the \tilde{\mathbf{b}}_{w} such that \frac{(\bar{\mathbf{b}}_{w})^{T}\tilde{\mathbf{b}}_{w}}{\|\bar{\mathbf{b}}_{w}\|_{2}\|\tilde{\mathbf{b}}_{w}\|_{2}} is maximized. Intuitively, this is achieved when \tilde{\mathbf{b}}_{w} has ones at all the same positions as \bar{\mathbf{b}}_{w}, and r additional ones in the remaining positions, in which case \|\tilde{\mathbf{b}}_{w}\|_{2}=\sqrt{k+r} and (\bar{\mathbf{b}}_{w})^{T}\tilde{\mathbf{b}}_{w}=k. Then, we have the lower bound of \arccos\sqrt{\frac{k}{k+r}}. To obtain the upper bound, we need to minimize \frac{(\bar{\mathbf{b}}_{w})^{T}\tilde{\mathbf{b}}_{w}}{\|\bar{\mathbf{b}}_{w}\|_{2}\|\tilde{\mathbf{b}}_{w}\|_{2}}, which can be done when there are (k-r) +1s in \tilde{\mathbf{b}}_{w} at positions in common with \bar{\mathbf{b}}_{w} and the rest are set to zeros. In this case, we have \|\tilde{\mathbf{b}}_{w}\|_{2}=\sqrt{k-r} and (\bar{\mathbf{b}}_{w})^{T}\tilde{\mathbf{b}}_{w}=k-r, which leads to the upper bound of \arccos\sqrt{\frac{k-r}{k}}. ∎

According to Corollary 2 and Eq. (22), we have k\approx 0.37n and r\approx 0.13n. Then, we can derive the practical angle bounds as [\arccos\sqrt{\frac{0.37}{0.37+0.13}},\arccos\sqrt{\frac{0.37-0.13}{0.37}}]\approx[30.66^{\circ},36.35^{\circ}]. Therefore, a large angle bias occurs between the solution \tilde{\mathbf{b}}_{w} from Eq. (22) and the solution \bar{\mathbf{b}}_{w} from optimizing Eq. (14). In Sec. 5.2, we demonstrate the poor performance when simply using \tilde{\mathbf{b}}_{w}.
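These bounds depend only on the ratios k/n and r/n, so they can be checked with a few lines of NumPy (the values are those used in Sec. 4.3):

import numpy as np

def angle_bounds(k, r):
    # angle bounds (in degrees) from Corollary 3; k and r may be given as fractions of n
    lower = np.degrees(np.arccos(np.sqrt(k / (k + r))))
    upper = np.degrees(np.arccos(np.sqrt((k - r) / k)))
    return lower, upper

print(angle_bounds(0.37, 0.13))   # with the l2 regularization: about (30.66, 36.35) degrees
print(angle_bounds(0.51, 0.01))   # without the l2 regularization: about (7.97, 8.05) degrees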

Algorithm 1 Sign-to-Magnitude Network Binarization
  Input: An L-layer full-precision network with weights \mathbf{W}^{i}=\{\mathbf{w}_{1}^{i},\mathbf{w}_{2}^{i},...,\mathbf{w}^{i}_{c^{i}_{out}}\} (i=1,2,...,L), input images (activations) \mathbf{A}^{1}=\{\mathbf{a}_{1}^{1},\mathbf{a}_{2}^{1},...,\mathbf{a}^{1}_{c_{in}^{1}}\}.
  1) Forward Propagation:
  Remove the \ell_{2} regularization term.
  for i=1 to L do
     Binarize the inputs (\mathbf{B}_{A})^{i}=\operatorname{sign}(\mathbf{A}^{i}) (Eq. (6));
     for j=1 to c^{i}_{out} do
        Obtain the half-half binarization (\tilde{\mathbf{b}}_{w})^{i}_{j} (Eq. (22));
        Obtain the binarization (\mathbf{b}_{w})^{i}_{j}=2\cdot(\tilde{\mathbf{b}}_{w})^{i}_{j}-1 (the inverse of Eq. (11));
        Conduct the convolution \mathbf{a}^{i+1}_{j}\approx{\beta}^{i}_{j}\cdot(\mathbf{b}_{w})_{j}^{i}\circledast(\mathbf{B}_{A})^{i} (Eq. (5));
     end for
     \mathbf{A}^{i+1}=\{\mathbf{a}_{1}^{i+1},\mathbf{a}_{2}^{i+1},...,\mathbf{a}^{i+1}_{c^{i}_{out}}\};
  end for
  2) Backward Propagation:
  for i=L to 1 do
     Compute gradient \frac{\partial\mathcal{L}}{\partial\mathbf{A}^{i}}\approx\frac{\partial\mathcal{L}}{\partial(\mathbf{B}_{A})^{i}}\cdot\frac{\partial F(\mathbf{A}^{i})}{\partial\mathbf{A}^{i}} (Eq. (7));
     for j=1 to c^{i}_{out} do
        Compute gradient \frac{\partial\mathcal{L}}{\partial\mathbf{w}_{j}^{i}}\approx\frac{\partial\mathcal{L}}{\partial(\mathbf{b}_{w})^{i}_{j}} (Eq. (9));
     end for
  end for
  3) Weight Updating:
  for i=L to 1 do
     for j=1 to c^{i}_{out} do
        Update \mathbf{w}^{i}_{j}=\mathbf{w}^{i}_{j}-\eta\frac{\partial\mathcal{L}}{\partial\mathbf{w}_{j}^{i}}; # \eta denotes the learning rate
     end for
  end for
  Output: An L-layer binarized network with weights (\tilde{\mathbf{b}}_{W})^{i}=\{(\tilde{\mathbf{b}}_{w})^{i}_{1},(\tilde{\mathbf{b}}_{w})^{i}_{2},...,(\tilde{\mathbf{b}}_{w})^{i}_{c^{i}_{out}}\} (i=1,2,...,L).

Instead of imposing an additional loss term to regularize the ideal half-half binarization, we analyze the numerical value of each weight and reveal that simply removing the \ell_{2} regularization can explicitly maximize the bit capacity, leading to a more informative binarized network.

Let \mathcal{L}^{k}_{\mathbf{b}_{w}}=\mathop{\max}_{\mathbf{b}_{w}}\frac{(\mathbf{b}_{w})^{T}}{\sqrt{k}}|\mathbf{w}|=\frac{\sum_{i=1}^{k}\tilde{w}_{i}}{\sqrt{k}} denote the maximum value of the integer programming problem in Eq. (14) when exactly k entries are encoded into +1, where \tilde{w}_{i}\in|\mathbf{w}| corresponds to the i-th largest magnitude. At the optimal k, we have \mathcal{L}^{k+1}_{\mathbf{b}_{w}}<\mathcal{L}^{k}_{\mathbf{b}_{w}}, i.e.,

\frac{\tilde{w}_{k+1}+\sum_{i=1}^{k}\tilde{w}_{i}}{\sqrt{k+1}}<\frac{\sum_{i=1}^{k}\tilde{w}_{i}}{\sqrt{k}}. \quad (23)

We can deduce that

\tilde{w}_{k+1}<\mathcal{L}^{k}_{\mathbf{b}_{w}}(\sqrt{k+1}-\sqrt{k}). \quad (24)

For Laplacian-distributed weights, we know that k\approx 0.37n. Thus, the above inequality can be rewritten as

\tilde{w}_{k+1}<\mathcal{L}^{k}_{\mathbf{b}_{w}}(\sqrt{0.37n+1}-\sqrt{0.37n}). \quad (25)

Since n is typically in the thousands for a neural network and we statistically find that \mathcal{L}^{k}_{\mathbf{b}_{w}} ranges from 0.63 to 0.73, the product of the two terms in Eq. (25) results in an extremely small \tilde{w}_{k+1} that approximates zero (for instance, with n=4608 and \mathcal{L}^{k}_{\mathbf{b}_{w}}=0.7, the bound is only about 0.008). Thus, we need to enlarge the value of \tilde{w}_{k+1} to break the above inequality. We realize that one of the major causes of a small \tilde{w}_{k+1} lies in the \ell_{2} regularization imposed on the training of the neural network. This inspires us to remove the \ell_{2} regularization for the to-be-binarized weights \mathbf{w}.

As shown in Fig. 4(b), the removal of the \ell_{2} regularization increases the proportion of +1s in \bar{\mathbf{b}}_{w} to around 0.50–0.52. Taking 0.51 as an example, we then further enforce the ideal half-half binarization \tilde{\mathbf{b}}_{w} using Eq. (22). As a result, k\approx 0.51n and r\approx 0.01n, yielding the much smaller angle bounds of [\arccos\sqrt{\frac{0.51}{0.51+0.01}},\arccos\sqrt{\frac{0.51-0.01}{0.51}}]\approx[7.97^{\circ},8.05^{\circ}] between \tilde{\mathbf{b}}_{w} and \bar{\mathbf{b}}_{w}. This effectively increases the bit entropy and leads to a nearly optimal solution for the learning objective of Eq. (14). Besides, the half-half binarization further reduces the computational complexity from \mathcal{O}(n\log n) to \mathcal{O}(n), since we only need to find the median of |\mathbf{w}| and encode weights into +1s when their magnitudes are larger than the median, and 0s otherwise.

The forward and backward processes of SiMaN are summarized in Algorithm 1. During training, we remove the \ell_{2} regularization and adopt the binarization \mathbf{b}_{w}\in\{-1,+1\}^{n}, transformed from the half-half binarization \tilde{\mathbf{b}}_{w}\in\{0,+1\}^{n}, for the convolution in Eq. (5). After training, we obtain a network consisting of the binarized weights \tilde{\mathbf{b}}_{w} for practical deployment on hardware, where the convolution is executed using the XNOR and bitcount operations in Eq. (12).
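A minimal PyTorch sketch of the per-filter weight binarization in Algorithm 1 is shown below (function and variable names are ours; the scaling factor is set to the mean magnitude in the style of XNOR-Net, which is an assumption rather than a prescription of the paper).

import torch

def siman_binarize_filter(w):
    # half-half binarization of one filter (Eq. (22)); O(n) since only the median of |w| is needed
    mag = w.reshape(-1).abs()
    thresh = mag.median()                    # boundary between 0s and +1s
    b01 = (w.abs() > thresh).to(w.dtype)     # \tilde{b}_w in {0, 1}, kept for deployment
    b_pm1 = 2.0 * b01 - 1.0                  # {-1, +1} code used for the training convolution (Alg. 1)
    beta = mag.mean()                        # channel-level scaling factor (XNOR-Net-style choice)
    # straight-through estimator (Eq. (9)): the forward value is b_pm1, the gradient flows to w unchanged
    return (b_pm1 - w).detach() + w, b01, beta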

5 Experiments

To demonstrate the efficacy of the proposed SiMaN binarization scheme, we compare its performance with several state-of-the-art BNNs [32, 57, 25, 27, 22, 63, 28, 53, 64, 30, 52, 65, 66, 29, 23] as well as the conference version [24] on two image classification datasets, including CIFAR-10 [34] and ImageNet [26].

5.1 Datasets and Experimental Settings

CIFAR-10 [34] consists of 60,000 32×32 images from 10 classes, with 6,000 images per class. The dataset is split into 50,000 training images and 10,000 test images. Data augmentation includes random cropping and random flipping for the training images, as done in [1].

ImageNet [26] contains over 1.2 million training images and 50,000 validation images from 1,000 classes. For a fair comparison with the recent advances in [29, 23, 24], we only apply random cropping and flipping for data augmentation.

Network Structures. For CIFAR-10, we binarize ResNet-18/20 [1] and VGG-small [67]. For ImageNet, ResNet-18/34 are chosen for binarization. Following [29, 23, 24], double skip connections [28] are added to the ResNets, and we do not binarize the first and last layers of any network.

Implementation Details. We implement our SiMaN in PyTorch [68], and all experiments are conducted on NVIDIA Tesla V100 GPUs. We use a cosine learning rate scheduler with an initial learning rate of 0.1 [29, 24]. SGD is adopted as the optimizer with a momentum of 0.9. For the layers that are not binarized, the weight decay is set to 5\times 10^{-4} on CIFAR-10 and 1\times 10^{-4} on ImageNet; it is set to 0 otherwise to remove the \ell_{2} regularization for bit entropy maximization, as discussed in Sec. 4.3. We train the models from scratch for 400 epochs with a batch size of 256 on CIFAR-10, and for 150 epochs with a batch size of 512 on ImageNet.
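The per-layer removal of the \ell_{2} regularization maps directly onto optimizer parameter groups. A minimal sketch under the CIFAR-10 setting is given below; the stand-in model and the choice of which layer is binarized are illustrative assumptions, not the networks used in the paper.

import torch
import torch.nn as nn

model = nn.Sequential(                                  # stand-in model; the paper uses ResNet/VGG variants
    nn.Conv2d(3, 16, 3, padding=1, bias=False),         # first layer: kept full-precision, decayed
    nn.Conv2d(16, 16, 3, padding=1, bias=False),        # to-be-binarized layer: no weight decay
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))

binarized = [model[1].weight]
others = [p for name, p in model.named_parameters() if name != '1.weight']

optimizer = torch.optim.SGD(
    [{'params': binarized, 'weight_decay': 0.0},        # removes the l2 term for binarized weights
     {'params': others, 'weight_decay': 5e-4}],         # CIFAR-10 value for non-binarized layers
    lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=400)   # 400-epoch cosine schedule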

Note that, we only apply the classification loss during network training for fair comparison. Other training losses such as those proposed in [69, 53, 70], the variants of network structures in [71, 72, 55], and even the two-step training strategy [33] can be integrated to further boost the binarized networks’ performance. These, however, are not considered here. We aim to show the advantages of our magnitude-based optimization solution over the traditional sign-based methods under regular training loss, the same network structure and a common training strategy.

Figure 5: Illustration of (1) cosine similarity and (2) quantization error in various layers of ResNet-20.

5.2 Ablation Studies

TABLE I: Ablation studies with/without the \ell_{2} regularization and half-half binarization (ResNet-18 on ImageNet).
Method   \ell_{2} regularization   half-half   Top-1 (%)   Top-5 (%)
SiMaN1   yes   no   55.1   75.5
SiMaN2   no   no   57.3   77.4
SiMaN3   yes   yes   59.2   81.5
SiMaN   no   yes   60.1   82.3

In this subsection, we conduct ablation studies of different variants to demonstrate the efficacy of our SiMaN, and analyze the quantization error to demonstrate the superiority of our analytical discrete optimization.

SiMaN Variants. Our SiMaN is built by removing the \ell_{2} regularization and enforcing the half-half strategy in Eq. (22). To analyze their influence, in Table I, we develop three variants: (1) SiMaN1: the \ell_{2} regularization is added while the half-half binarization is removed. This variant simply implements binarization based on the proof process of Corollary 1. It results in around 37% of weights being encoded into +1s, as analyzed in Corollary 2, which fails to maximize the entropy of information and thus leads to the poorer accuracies of 55.1% top-1 and 75.5% top-5. (2) SiMaN2: both the \ell_{2} regularization and the half-half binarization are removed. It shows better top-1 (57.3%) and top-5 (77.4%) accuracies, since the removal of the \ell_{2} regularization breaks the Laplacian distribution and results in around 51% of weights being encoded into +1s, as experimentally verified in Fig. 4. (3) SiMaN3: both the \ell_{2} regularization and the half-half binarization are added. Though the performance increases, it is still limited. This is because the half-half binarization with the \ell_{2} regularization causes a large angle deviation of around 30.66°–36.35°, as analyzed in Sec. 4.3, from the optimal binarization of our learning objective in Eq. (14).

Based on SiMaN3, our SiMaN further removes the \ell_{2} regularization. On one hand, this ensures the maximal bit entropy; on the other hand, it ensures that the half-half binarization closely matches the optimal binarization (only a 7.97°–8.05° angle deviation, as analyzed in Sec. 4.3), thereby leading to the best performance in Table I.

Quantization Error. Recall from Sec. 1 that, to mitigate the quantization error, XNOR-Net [25] compensates for the norm gap between the full-precision weight and the corresponding binarization. Our conference implementation of RBNN [24] introduces a rotation matrix to reduce the angular bias. In this paper, we reformulate the angle alignment objective and derive an analytical discrete solution. To validate the theoretical claims on reducing the quantization error, we measure the practical quantization errors across different layers of ResNet-20 as a toy example in Fig. 5. Similar observations can be made for other networks as well.

In Fig. 5(1), we first measure the cosine similarity between the full-precision weight vector and the corresponding binarization. We can see that our earlier implementation of RBNN achieves a significantly higher cosine similarity than XNOR-Net, implying a smaller angular bias. On top of RBNN, our SiMaN further increases the cosine similarity across different network layers. Consequently, in Fig. 5(2), though XNOR-Net mitigates the quantization error from the norm gap, the quantization error still accumulates heavily in most layers since the angular bias remains unsolved. By weight rotation, our conference version of RBNN greatly closes the angular bias and thus effectively decreases the quantization error. Nevertheless, the alternating optimization in RBNN leads to sub-optimal binarization. Therefore, the quantization error in the top layers still remains relatively large. By contrast, the analytical optimization in SiMaN further reduces the quantization error to a very small level, providing a new perspective on BNN optimization.

5.3 Convergence

We further show the convergence ability of our SiMaN, and compare with its conference version of RBNN which implements network binarization with the sign function [24]. The experiments in Fig. 6 show that our sign-to-magnitude weight binarization has a significantly better ability to converge during BNN training than the traditional sign-based optimization on both training and validation sets, which demonstrates the feasibility of our discrete solution in learning BNNs.

Figure 6: Comparison of training and validation accuracy curves between our SiMaN and RBNN (ResNet-34 on ImageNet).
TABLE II: Comparison with the state-of-the-arts on CIFAR-10. W/A denotes the bit length of the weights and activations. Top-1 accuracy is reported.
Network Method W/A Top-1 (%)
ResNet-18 Full-precision 32/32 94.8
ResNet-18 RAD [53] 1/1 90.5
ResNet-18 IR-Net [29] 1/1 91.5
ResNet-18 RBNN [24] 1/1 92.2
ResNet-18 SiMaN (Ours) 1/1 92.5
ResNet-20 Full-precision 32/32 92.1
ResNet-20 DoReFa [57] 1/1 79.3
ResNet-20 DSQ [64] 1/1 84.1
ResNet-20 SLB [65] 1/1 85.5
ResNet-20 LNS [23] 1/1 85.8
ResNet-20 IR-Net [29] 1/1 86.5
ResNet-20 RBNN [24] 1/1 87.8
ResNet-20 SiMaN (Ours) 1/1 87.4
VGG-small Full-precision 32/32 94.1
VGG-small XNOR-Net [25] 1/1 89.8
VGG-small BNN [32] 1/1 89.9
VGG-small DoReFa [57] 1/1 90.2
VGG-small RAD [53] 1/1 90.0
VGG-small DSQ [64] 1/1 91.7
VGG-small IR-Net [29] 1/1 90.4
VGG-small RBNN [24] 1/1 91.3
VGG-small SLB [65] 1/1 92.0
VGG-small SiMaN (Ours) 1/1 92.5

5.4 Results on CIFAR-10

We first conduct detailed studies on CIFAR-10 for the proposed SiMaN, as shown in Table II. Although RBNN performs best in binarizing ResNet-20, our sign-to-magnitude binarization consistently outperforms the recent sign-based state-of-the-arts. Specifically, SiMaN outperforms RBNN [24] and SLB [65] by 0.3% and 0.5% in binarizing ResNet-18 and VGG-small, respectively. The results emphasize the importance of building discrete optimization to pursue high-quality weight binarization. Importantly, compared with the conference version, i.e., RBNN, our SiMaN increases the performance to 92.5% when binarizing VGG-small, leading to a performance gain of 1.2%. Besides, SiMaN also benefits from its easy implementation, where the median of the absolute weights acts as the boundary between 0s and +1s to ensure bit entropy maximization, while the angle alignment is also guaranteed, as analyzed in Sec. 4.3. In contrast, RBNN has to learn two complex rotation matrices and apply them at the beginning of each training epoch in order to reduce the angle bias. Note that, for ResNet-20, we realize that a smaller quantization error (see Fig. 5) does not necessarily lead to better performance. This indicates that there may exist an unexplored optimal solution that is not related to the quantization error. Nevertheless, it has been widely accepted in the literature that reducing the quantization error can improve BNN performance in most cases, which is also demonstrated by the experimental results (except for ResNet-20) in this paper. This paper addresses the quantization error problem as well.

5.5 Results on ImageNet

We also conduct similar experiments on ImageNet to validate the performance of SiMaN on a large-scale dataset. Two common networks, ResNet-18 and ResNet-34, are adopted for binarization. Table III shows the results of SiMaN and several other binarization methods. The performance of SiMaN on ImageNet also takes the lead. Specifically, with ResNet-18, SiMaN achieves 60.1% top-1 and 82.3% top-5 accuracies, with 0.2% and 0.4% improvements over its conference version, RBNN. By doubling the number of residual blocks and shrinking the per-block convolutions, ReActNet [55] achieves a 65.9% top-1 accuracy. To manifest the advantage of our discrete optimization, we further apply SiMaN to the modified network structure and obtain a better performance of 66.1%. With ResNet-34, SiMaN achieves a top-1 accuracy of 63.9% and a top-5 accuracy of 84.8%, outperforming RBNN by 0.8% and 0.4%, respectively.

The performance improvements in Table II and Table III strongly demonstrate the impact of exploring discrete optimization and the effectiveness of our magnitude-based discrete solution in constructing a high-performing BNN.

TABLE III: Comparison with the state-of-the-arts on ImageNet. W/A denotes the bit length of the weights and activations. Both top-1 and top-5 accuracies are reported. SiMaN (ReActNet setting) denotes using the same network and training setting as ReActNet [55].
Network Method W/A Top-1 (%) Top-5 (%)
ResNet-18 Full-precision 32/32 69.6 89.2
ResNet-18 BNN [32] 1/1 42.2 67.1
ResNet-18 XNOR-Net [25] 1/1 51.2 73.2
ResNet-18 DoReFa [57] 1/2 53.4 -
ResNet-18 HWGQ [22] 1/2 59.6 82.2
ResNet-18 TBN [63] 1/2 55.6 79.0
ResNet-18 Bi-Real [28] 1/1 56.4 79.5
ResNet-18 PDNN [52] 1/1 57.3 80.0
ResNet-18 BONN [30] 1/1 59.3 81.6
ResNet-18 Si-BNN [66] 1/1 59.7 81.8
ResNet-18 IR-Net [29] 1/1 58.1 80.0
ResNet-18 LNS [23] 1/1 59.4 81.7
ResNet-18 RBNN [24] 1/1 59.9 81.9
ResNet-18 SiMaN (Ours) 1/1 60.1 82.3
ResNet-18 ReActNet [55] 1/1 65.9 -
ResNet-18 SiMaN (ReActNet setting) 1/1 66.1 85.9
ResNet-34 Full-precision 32/32 73.3 91.3
ResNet-34 ABC-Net [27] 1/1 52.4 76.5
ResNet-34 Bi-Real [28] 1/1 62.2 83.9
ResNet-34 IR-Net [29] 1/1 62.9 84.1
ResNet-34 RBNN [24] 1/1 63.1 84.4
ResNet-34 SiMaN (Ours) 1/1 63.9 84.8

6 Conclusion

In this paper, we proposed a novel sign-to-magnitude network binarization (SiMaN) scheme that avoids the dependency on the sign function, so as to optimize a binary neural network for higher accuracy. Our SiMaN reformulates the angle alignment between the weight vector and its binarization, with the binarization constrained to \{0,+1\}. We proved that an analytical discrete solution can be attained in a computationally efficient manner by encoding into +1s the high-magnitude weights, and 0s otherwise. We also mathematically proved that the learned weights roughly follow a Laplacian distribution, which is harmful to bit entropy maximization. To address this problem, we showed that simply removing the \ell_{2} regularization during network training breaks the Laplacian distribution and leads to a half-half distribution of binarized weights. As a result, the complexity of our binarization can be further reduced by encoding into +1s the weights with the top-half magnitudes, and 0s otherwise. Our experimental results demonstrate the significant performance improvement of SiMaN.

Acknowledgement

This work was supported by the National Science Fund for Distinguished Young Scholars (No.62025603), the National Natural Science Foundation of China (No. U21B2037, No. 62176222, No. 62176223, No. 62176226, No. 62072386, No. 62072387, No. 62072389, and No. 62002305), Guangdong Basic and Applied Basic Research Foundation (No.2019B1515120049), and the Natural Science Foundation of Fujian Province of China (No.2021J01002).

References

  • [1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [2] Y. Wei, W. Xia, M. Lin, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan, “Hcp: A flexible cnn framework for multi-label image classification,” IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), vol. 38, no. 9, pp. 1901–1907, 2015.
  • [3] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), vol. 37, no. 9, pp. 1904–1916, 2015.
  • [4] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779–788.
  • [5] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun, “Object detection networks on convolutional feature maps,” IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), vol. 39, no. 7, pp. 1476–1481, 2016.
  • [6] X. Zhang, F. Wan, C. Liu, X. Ji, and Q. Ye, “Learning to match anchors for visual object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), vol. 44, no. 6, pp. 3096–3109, 2021.
  • [7] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3431–3440.
  • [8] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), vol. 39, no. 12, pp. 2481–2495, 2017.
  • [9] E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks for semantic segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), vol. 39, no. 4, pp. 640–651, 2017.
  • [10] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2015, pp. 1135–1143.
  • [11] J. Frankle and M. Carbin, “The lottery ticket hypothesis: Finding sparse, trainable neural networks,” in Proceedings of the International Conference on Learning Representations (ICLR), 2019.
  • [12] J.-H. Luo, H. Zhang, H.-Y. Zhou, C.-W. Xie, J. Wu, and W. Lin, “Thinet: pruning cnn filters for a thinner net,” IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), vol. 41, no. 10, pp. 2525–2538, 2018.
  • [13] M. Lin, R. Ji, Y. Wang, Y. Zhang, B. Zhang, Y. Tian, and L. Shao, “Hrank: Filter pruning using high-rank feature map,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1529–1538.
  • [14] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6848–6856.
  • [15] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “Shufflenet v2: Practical guidelines for efficient cnn architecture design,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 116–131.
  • [16] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
  • [17] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4510–4520.
  • [18] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan et al., “Searching for mobilenetv3,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 1314–1324.
  • [19] K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu, and C. Xu, “Ghostnet: More features from cheap operations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1580–1589.
  • [20] S. Lin, R. Ji, C. Chen, D. Tao, and J. Luo, “Holistic cnn compression via low-rank decomposition with knowledge transfer,” IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), vol. 41, no. 12, pp. 2889–2905, 2018.
  • [21] K. Hayashi, T. Yamaguchi, Y. Sugawara, and S.-i. Maeda, “Exploring unexplored tensor network decompositions for convolutional neural networks,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 5552–5562.
  • [22] Z. Cai, X. He, J. Sun, and N. Vasconcelos, “Deep learning with low precision by half-wave gaussian quantization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5918–5926.
  • [23] K. Han, Y. Wang, Y. Xu, C. Xu, E. Wu, and C. Xu, “Training binary neural networks through learning with noisy supervision,” in Proceedings of the International Conference on Machine Learning (ICML), 2020, pp. 4017–4026.
  • [24] M. Lin, R. Ji, Z. Xu, B. Zhang, Y. Wang, Y. Wu, F. Huang, and C.-W. Lin, “Rotated binary neural network,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2020, pp. 7474–7485.
  • [25] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural networks,” in Proceedings of the European Conference on Computer Vision (ECCV), 2016, pp. 525–542.
  • [26] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
  • [27] X. Lin, C. Zhao, and W. Pan, “Towards accurate binary convolutional neural network,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2017, pp. 345–353.
  • [28] Z. Liu, B. Wu, W. Luo, X. Yang, W. Liu, and K.-T. Cheng, “Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 722–737.
  • [29] H. Qin, R. Gong, X. Liu, M. Shen, Z. Wei, F. Yu, and J. Song, “Forward and backward information retention for accurate binary neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 2250–2259.
  • [30] J. Gu, J. Zhao, X. Jiang, B. Zhang, J. Liu, G. Guo, and R. Ji, “Bayesian optimized 1-bit cnns,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019, pp. 4909–4917.
  • [31] M. Courbariaux, Y. Bengio, and J.-P. David, “Binaryconnect: Training deep neural networks with binary weights during propagations,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2015, pp. 3123–3131.
  • [32] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1,” arXiv preprint arXiv:1602.02830, 2016.
  • [33] B. Martinez, J. Yang, A. Bulat, and G. Tzimiropoulos, “Training binary neural networks with real-to-binary convolutions,” in Proceedings of the International Conference on Learning Representations (ICLR), 2020.
  • [34] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Master’s thesis, University of Toronto, 2009.
  • [35] Y. Bengio, N. Léonard, and A. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” arXiv preprint arXiv:1308.3432, 2013.
  • [36] T. Simons and D.-J. Lee, “A review of binarized neural networks,” Electronics, vol. 8, no. 6, p. 661, 2019.
  • [37] H. Qin, R. Gong, X. Liu, X. Bai, J. Song, and N. Sebe, “Binary neural networks: A survey,” Pattern Recognition (PR), vol. 105, p. 107281, 2020.
  • [38] A. Bulat and G. Tzimiropoulos, “Xnor-net++: Improved binary neural networks,” arXiv preprint arXiv:1909.13863, 2019.
  • [39] Z. Xu, M. Lin, J. Liu, J. Chen, L. Shao, Y. Gao, Y. Tian, and R. Ji, “Recu: Reviving the dead weights in binary neural networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021, pp. 5198–5208.
  • [40] S. Darabi, M. Belbahri, M. Courbariaux, and V. P. Nia, “Bnn+: Improved binary network training,” arXiv preprint arXiv:1812.11800, 2018.
  • [41] Y. Xu, K. Han, C. Xu, Y. Tang, C. Xu, and Y. Wang, “Learning frequency domain approximation for binary neural networks,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2021, pp. 25553–25565.
  • [42] J. W. Peters and M. Welling, “Probabilistic binary neural networks,” arXiv preprint arXiv:1809.03368, 2018.
  • [43] O. Shayer, D. Levi, and E. Fetaya, “Learning discrete weights using the local reparameterization trick,” in Proceedings of the International Conference on Learning Representations (ICLR), 2018.
  • [44] H. Qin, Z. Cai, M. Zhang, Y. Ding, H. Zhao, S. Yi, X. Liu, and H. Su, “Bipointnet: Binary neural network for point clouds,” in Proceedings of the International Conference on Learning Representations (ICLR), 2022.
  • [45] C. Leng, Z. Dou, H. Li, S. Zhu, and R. Jin, “Extremely low bit neural network: Squeeze the last bit out with admm,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2018, pp. 3466–3473.
  • [46] M. Alizadeh, J. Fernández-Marqués, N. D. Lane, and Y. Gal, “An empirical study of binary neural networks’ optimisation,” in Proceedings of the International Conference on Learning Representations (ICLR), 2018.
  • [47] J. Bethge, H. Yang, M. Bornstein, and C. Meinel, “Back to simplicity: How to train accurate bnns from scratch?” arXiv preprint arXiv:1906.08637, 2019.
  • [48] K. Helwegen, J. Widdicombe, L. Geiger, Z. Liu, K.-T. Cheng, and R. Nusselder, “Latent weights do not exist: Rethinking binarized neural network optimization,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 7533–7544.
  • [49] A. G. Anderson and C. P. Berg, “The high-dimensional geometry of binary neural networks,” in Proceedings of the International Conference on Learning Representations (ICLR), 2018.
  • [50] Y. Wang, Y. Yang, F. Sun, and A. Yao, “Sub-bit neural networks: Learning to compress and accelerate binary neural networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021, pp. 5360–5369.
  • [51] T. Dockhorn, Y. Yu, E. Sari, M. Zolnouri, and V. Partovi Nia, “Demystifying and generalizing binaryconnect,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2021, pp. 13202–13216.
  • [52] J. Gu, C. Li, B. Zhang, J. Han, X. Cao, J. Liu, and D. Doermann, “Projection convolutional neural networks for 1-bit cnns via discrete back propagation,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2019, pp. 8344–8351.
  • [53] R. Ding, T.-W. Chin, Z. Liu, and D. Marculescu, “Regularizing activation distribution for training binarized deep networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 11408–11417.
  • [54] J. Hu, W. Ziheng, V. Tan, Z. Lu, M. Zeng, and E. Wu, “Elastic-link for binarized neural network,” arXiv preprint arXiv:2112.10149, 2021.
  • [55] Z. Liu, Z. Shen, M. Savvides, and K.-T. Cheng, “Reactnet: Towards precise binary neural network with generalized activation functions,” in Proceedings of the European Conference on Computer Vision (ECCV), 2020, pp. 143–159.
  • [56] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2016, pp. 4107–4115.
  • [57] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv preprint arXiv:1606.06160, 2016.
  • [58] M. Conforti, G. Cornuéjols, G. Zambelli et al., Integer Programming. Springer, 2014, vol. 271.
  • [59] R. Banner, Y. Nahshan, E. Hoffer, and D. Soudry, “Post-training 4-bit quantization of convolution networks for rapid-deployment,” arXiv preprint arXiv:1810.05723, 2018.
  • [60] K. Zhong, T. Zhao, X. Ning, S. Zeng, K. Guo, Y. Wang, and H. Yang, “Towards lower bit multiplication for convolutional neural network training,” arXiv preprint arXiv:2006.02804, 2020.
  • [61] C. M. Shakarji and V. Srinivasan, “Theory and algorithms for weighted total least-squares fitting of lines, planes, and parallel planes to support tolerancing standards,” Journal of Computing and Information Science in Engineering, vol. 13, no. 3, 2013.
  • [62] L. C. Andrews, Special Functions of Mathematics for Engineers. SPIE Press, 1998, vol. 49.
  • [63] D. Wan, F. Shen, L. Liu, F. Zhu, J. Qin, L. Shao, and H. Tao Shen, “Tbn: Convolutional neural network with ternary inputs and binary weights,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 315–332.
  • [64] R. Gong, X. Liu, S. Jiang, T. Li, P. Hu, J. Lin, F. Yu, and J. Yan, “Differentiable soft quantization: Bridging full-precision and low-bit neural networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019, pp. 4852–4861.
  • [65] Z. Yang, Y. Wang, K. Han, C. Xu, C. Xu, D. Tao, and C. Xu, “Searching for low-bit weights in quantized neural networks,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2020, pp. 4091–4102.
  • [66] P. Wang, X. He, G. Li, T. Zhao, and J. Cheng, “Sparsity-inducing binarized neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020, pp. 12192–12199.
  • [67] D. Zhang, J. Yang, D. Ye, and G. Hua, “Lq-nets: Learned quantization for highly accurate and compact deep neural networks,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 365–382.
  • [68] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 8026–8037.
  • [69] L. Hou, Q. Yao, and J. T. Kwok, “Loss-aware binarization of deep networks,” in Proceedings of the International Conference on Learning Representations (ICLR), 2016.
  • [70] Z. Wang, J. Lu, C. Tao, J. Zhou, and Q. Tian, “Learning channel-wise interactions for binary convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 568–577.
  • [71] J. Bethge, C. Bartz, H. Yang, Y. Chen, and C. Meinel, “Meliusnet: Can binary neural networks achieve mobilenet-level accuracy?” arXiv preprint arXiv:2001.05936, 2020.
  • [72] S. Zhu, X. Dong, and H. Su, “Binary ensemble neural network: More bits per network or more networks per bit?” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4923–4932.
Mingbao Lin received the Ph.D. degree in intelligence science and technology from Xiamen University, Xiamen, China, in 2022, after completing a combined M.S.-Ph.D. program, and the B.S. degree from Fuzhou University, Fuzhou, China, in 2016. He is currently a senior researcher with the Tencent Youtu Lab, Shanghai, China. He has published over ten first-author papers in top-tier journals and conferences, including IEEE TPAMI, IJCV, IEEE TIP, IEEE TNNLS, CVPR, NeurIPS, AAAI, IJCAI, and ACM MM. His current research interests include network compression & acceleration and information retrieval.
Rongrong Ji (Senior Member, IEEE) is currently a Professor and the Director of the Intelligent Multimedia Technology Laboratory, and the Dean Assistant with the School of Information Science and Engineering, Xiamen University, Xiamen, China. His work mainly focuses on innovative technologies for multimedia signal processing, computer vision, and pattern recognition, with over 100 papers published in international journals and conferences. He is a member of the ACM. He was a recipient of the ACM Multimedia Best Paper Award and the Best Thesis Award of Harbin Institute of Technology. He serves as an Associate/Guest Editor for international journals and magazines such as Neurocomputing, Signal Processing, Multimedia Tools and Applications, IEEE Multimedia Magazine, and Multimedia Systems. He also serves as a program committee member for several Tier-1 international conferences.
Zihan Xu received the B.S. degree in applied mathematics from Zhengzhou University, China, in 2019. He is currently pursuing the M.S. degree with Xiamen University, China. His research interests include computer vision and machine learning.
Baochang Zhang (Senior Member, IEEE) received the B.S., M.S., and Ph.D. degrees in computer science from the Harbin Institute of Technology, Harbin, China, in 1999, 2001, and 2006, respectively. From 2006 to 2008, he was a Research Fellow with The Chinese University of Hong Kong, Hong Kong, and also with Griffith University, Brisbane, Australia. He is currently a researcher with the Zhongguancun Lab, Beijing, China. His current research interests include pattern recognition, machine learning, face recognition, and wavelets.
Fei Chao (Member, IEEE) received the B.Sc. degree in mechanical engineering from Fuzhou University, Fuzhou, China, in 2004, the M.Sc. degree with distinction in computer science from the University of Wales, Aberystwyth, U.K., in 2005, and the Ph.D. degree in robotics from Aberystwyth University, Wales, U.K., in 2009. He is currently an Associate Professor with the School of Informatics, Xiamen University, Xiamen, China. He has authored/co-authored more than 50 peer-reviewed journal and conference papers. His current research interests include developmental robotics, machine learning, and optimization algorithms.
Chia-Wen Lin (Fellow, IEEE) received the Ph.D. degree in electrical engineering from National Tsing Hua University (NTHU), Hsinchu, Taiwan, in 2000. He is currently a Professor with the Department of Electrical Engineering and the Institute of Communications Engineering, NTHU, and also the Deputy Director of the AI Research Center of NTHU. He was with the Department of Computer Science and Information Engineering, National Chung Cheng University, Taiwan, during 2000–2007. Prior to joining academia, he worked for the Information and Communications Research Laboratories, Industrial Technology Research Institute, Hsinchu, Taiwan, during 1992–2000. His research interests include image and video processing, computer vision, and video networking. Dr. Lin served as a Distinguished Lecturer of the IEEE Circuits and Systems Society from 2018 to 2019, a Steering Committee member of the IEEE Transactions on Multimedia from 2014 to 2015, and the Chair of the Multimedia Systems and Applications Technical Committee of the IEEE Circuits and Systems Society from 2013 to 2015. His articles received the Best Paper Award of IEEE VCIP 2015 and the Young Investigator Award of VCIP 2005. He received the Outstanding Electrical Professor Award from the Chinese Institute of Electrical Engineering in 2019 and the Young Investigator Award from the Ministry of Science and Technology, Taiwan, in 2006. He is the Chair of the Steering Committee of IEEE ICME, served as a Technical Program Co-Chair for IEEE ICME 2010 and IEEE ICIP 2019, and served as a General Co-Chair for IEEE VCIP 2018. He has served as an Associate Editor of the IEEE Transactions on Image Processing, IEEE Transactions on Circuits and Systems for Video Technology, IEEE Transactions on Multimedia, IEEE Multimedia, and the Journal of Visual Communication and Image Representation.
Ling Shao (Fellow, IEEE) is the Chief Scientist of Terminus Group and the President of Terminus International. He was the founding CEO and Chief Scientist of the Inception Institute of Artificial Intelligence, Abu Dhabi, UAE. His research interests include computer vision, deep learning, medical imaging, and vision and language. He is a fellow of the IEEE, the IAPR, the BCS, and the IET.