
National ASIC System Engineering Research Center, Southeast University, China
{long,101011256}@seu.edu.cn

RepBNN: towards a precise Binary Neural Network with Enhanced Feature Map via Repeating

Xulong Shi, Zhi Qi, Jiaxuan Cai, Keqi Fu, Yaru Zhao, Zan Li, Xuanyu Liu, Hao Liu
Abstract

Binary neural network (BNN) is an extreme quantization version of convolutional neural networks (CNNs) in which all features and weights are mapped to just 1 bit. Although BNN saves a great deal of memory and computation, making CNNs applicable on edge or mobile devices, it suffers a drop in network performance due to the reduced representation capability after binarization. In this paper, we propose a new replaceable and easy-to-use convolution module, RepConv, which enhances feature maps by replicating the input or output along the channel dimension $\beta$ times without extra cost in the number of parameters or convolutional computation. We also define a set of RepTran rules to apply RepConv throughout BNN modules such as binary convolution, the fully connected layer and batch normalization. Experiments demonstrate that after the RepTran transformation, a set of highly cited BNNs achieve universally better performance than their original versions. For example, the Top-1 accuracy of Rep-ReCU-ResNet-20, i.e., a RepBconv-enhanced ReCU-ResNet-20[22], reaches 88.97% on CIFAR-10, which is 1.47% higher than that of the original network. And Rep-AdamBNN-ReActNet-A [14][13] achieves 71.342% Top-1 accuracy on ImageNet, a fresh state-of-the-art result for BNNs. Code and models are available at: https://github.com/imfinethanks/Rep_AdamBNN.

1 Introduction

As we all know, the applications of convolutional neural networks (CNNs) have achieved tremendous success in computer vision fields such as image classification, object detection, object tracking, and depth estimation. But this success comes at the cost of a huge amount of computation, which generally binds the application of convolutional neural networks to high-end hardware such as GPUs and TPUs. The efficient application of CNN models on mobile and embedded devices, where storage space and computing resources are limited, is still quite challenging. In order to solve this problem, various lightweight network technologies have emerged, mainly including network architecture design[21][25], network architecture search[8][12], knowledge distillation[7], pruning[16][5], and quantization[27][23].

Among these lightweight network technologies, binarization contributes an extreme version of a convolutional neural network, where all feature maps and weights are represented by just 1 bit. Binarization greatly improves the efficiency of CNN models: it not only reduces the volume of parameters and the requirement on storage capacity, but also replaces MAC operations with efficient XNOR and bit-count operations, saving a lot of computational expense. The work in [19] shows that a binary neural network achieves 32× parameter compression and 58× speedup over its full-precision counterpart.

However, the speedup of calculation and the compression of parameters by binarization have an obvious impact on the accuracy of CNN models. It is the declined representation capability of feature maps that degrades the performance of binarized CNNs. Unlike full-precision networks, the value range of binarized convolutions is so extremely restricted [15] that the accuracy loss becomes more serious. If the number of input channels is $C_{in}$, and the size of the convolution kernel is $k_h \times k_w$, then the value range of the binary convolution is $\left(-C_{in}*k_h*k_w,\ C_{in}*k_h*k_w\right)$, a total of $C_{in}*k_h*k_w+1$ quantization levels. Obviously, increasing the number of input channels $C_{in}$ can improve the feature representation capability of binary neural networks (BNNs).
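As a tiny numeric illustration of this quantization-level count (a brute-force check on a hypothetical kernel size small enough to enumerate, not code from the paper):

```python
import itertools

# With n = C_in * k_h * k_w binary (+1/-1) products, the accumulated sum
# can take exactly n + 1 distinct values between -n and n.
n = 4  # e.g. C_in = 4 with a 1x1 kernel, small enough to enumerate exhaustively
sums = {sum(bits) for bits in itertools.product((-1, 1), repeat=n)}
print(sorted(sums))        # [-4, -2, 0, 2, 4]
print(len(sums) == n + 1)  # True
```

Doubling $C_{in}$ therefore roughly doubles the number of attainable quantization levels, which is the observation RepConv exploits below.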

Inspired by this observation, we propose a new replaceable convolution module, RepConv, which reshapes the weight kernel by expanding its input-channel dimension while compressing its number of filters. The total number of parameters and the amount of related convolutional calculation remain unchanged after the reshape. To convolve with such a reshaped weight kernel, RepConv has to increase the number of both input and output channels, simply through a replication operation. In this manner, RepConv enriches the network information.

The binary version of RepConv is called RepBconv. By replacing the normal binarized convolution with RepBconv and further modifying the remaining modules, such as the first convolution layer, batch normalization, bypasses and the fully connected layer, following the transformation rules of RepTran, we translate a set of highly cited binary neural networks [22][14][13] into the corresponding RepBNNs, which have been demonstrated to achieve universally better performance than the original BNN versions.

Experimental results show that after the RepTran transformation, the Top-1 accuracy of Rep-ReCU-ResNet-20, i.e., a RepBconv-enhanced ReCU-ResNet-20[22], reaches 88.97% on CIFAR-10, which is 1.47% higher than that of the original network. And Rep-AdamBNN-ReActNet-A [14][13] achieves 71.342% Top-1 accuracy on ImageNet, a fresh state-of-the-art result for BNNs, as shown in Fig.1.

Methods            OPs (×10^8)   Top-1 (%)
Real-to-Bin[17]    1.83          65.4
Group-Net[29]      2.68          67.0
Group-Net[29]      4.13          70.5
MeliusNet[2]       2.14          65.8
MeliusNet[2]       3.25          69.2
MeliusNet[2]       5.32          71.0
ReActNet[14]       0.87          69.4
ReActNet[14]       1.63          70.1
ReActNet[14]       2.14          71.4
AdamBNN[13]        0.87          70.5
High-capacity[3]   1.37          71.2
RepBNN(Ours)       0.88          71.3
Figure 1: OPs vs. ImageNet Top-1 accuracy. The OPs of some methods are referenced from [24]. RepBNN achieves a state-of-the-art result with 71.34% Top-1 accuracy on ImageNet. With similar accuracy, RepBNN is 2.4× more efficient than ReActNet. With similar OPs, RepBNN is 0.84% more accurate than AdamBNN.

To sum up, this paper makes the following contributions:

  • We propose a new and easy-to-use convolution module named RepConv that reshapes the weight kernels and thus increases the number of input and output channels, enriching the representation capability of the binarized feature map without extra cost in convolutional operations or parameters.

  • We propose a set of universal network transformation rules, RepTran, which makes the common modules in binary neural networks adaptable to the application of RepConv.

  • We apply RepTran to a large number of state-of-the-art binary neural networks and validate them extensively on CIFAR-10 and ImageNet. The results show that our proposed structure significantly improves the performance of binary networks.

2 Related Work

Binarization greatly reduces the number of operations (OPs) and parameters, but encounters a severe drop in accuracy, which is mainly caused by the loss of information that accompanies a network with just 1-bit weights and activations. Previous work mitigated this accuracy decline either through efficient training procedures that accurately back-propagate gradients or through methods that improve the representation capability of binary neural networks.

2.1 Training

The impact of the loss on weights is hard to propagate backward through the partial derivative of the sign function, which has an infinite gradient at zero and is zero elsewhere. Thus the first BNN work [4] estimated gradients with the STE[1] instead of the sign function during back-propagation. Bi-Real Net[15] replaced the STE with a piecewise linear function, which is a second-order approximation of the sign function. IR-Net[18] employed an Error Decay Estimator (EDE) to minimize information loss during back-propagation, thus ensuring adequate updates at the beginning of training. RBNN[11] proposed a training-aware approximation of the sign function for gradient back-propagation. Real-to-Bin[17] designed a sequence of teacher-student pairs to bridge the architectural gap between real and binary networks. Lately, ReActNet[14] adopted a distribution loss that measures the distribution similarity between the binary and real-valued networks. ReCU[22] explored the effect of dead weights and introduced a rectified clamp unit (ReCU) to revive them. AdamBNN[13] expounded the influence of Adam and weight decay when training BNNs and proposed a better training strategy. These recent works promoted the performance of BNNs to a significantly higher level than the first BNN [4].

2.2 Representation Capability

Improving the representation capability of models is another major research direction of binary networks. XNOR-Net[19] used two channel-wise scaling factors for activations and weights to estimate the real values. The scaling factor here is often replaced by batch normalization in practical implementations, and this technique is widely used by subsequent work. Another popular work, Bi-Real Net[15], added residuals to each layer to make the full-precision data flow compensate the information loss in the binarized main branch of the network. When designing the transformation rules of RepTran, we have considered both batch normalization and residuals, so that our work is applicable to a wide range of available binary neural networks. Real-to-Bin[17] used a data-driven channel re-scaling gated residual, leading to superior performance. Group-Net[29] and BENN[28] combined multiple models to trade computation for accuracy. ReActNet[14], based on a modified MobileNet model, used RPReLU and RSign to explicitly shift and reshape the activation distribution, further reducing the performance gap between a BNN and its full-precision counterpart.

Applying our RepBconv technique, following the transformation rules of RepTran, to these previous works raises their accuracy to a new high.

3 The Module of RepConv

As explained in Section 1, increasing the number of input channels can improve the feature representation capability of BNNs. Inspired by this observation, we propose a new convolution module named RepConv.

Figure 2: A normal convolution process

Fig.2 shows a normal convolution process. In Fig.2, the structure of a convolution kernel is expressed as $\left(C_{out}, C_{in}, k_h, k_w\right)$, where the number of input channels is $C_{in}$ and the number of output channels is $C_{out}$.

RepConv reshapes the convolution kernel to $\left(C_{out}/\beta,\ C_{in}*\beta,\ k_h,\ k_w\right)$, with $C_{in}*\beta$ input channels and $C_{out}/\beta$ output channels respectively. Then RepConv makes $\beta^2$ copies of the output feature map and concatenates them along the channel direction, so that the final channel-wise size of the output becomes $C_{out}*\beta$. This step is called the repeat operation. The specific process of RepConv can be seen in Fig.3.
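The following PyTorch-style sketch illustrates this idea; it is a minimal full-precision illustration with hypothetical names and layer settings, not the released implementation:

```python
import torch
import torch.nn as nn

class RepConv(nn.Module):
    """Minimal sketch: convolve with a kernel reshaped from (C_out, C_in, k, k)
    to (C_out/beta, C_in*beta, k, k), then repeat the output beta^2 times
    along the channel axis so that it carries C_out*beta channels."""

    def __init__(self, c_in, c_out, kernel_size=3, stride=1, beta=2):
        super().__init__()
        assert c_out % beta == 0
        self.beta = beta
        # Same parameter count and convolutional cost as a (c_out, c_in, k, k) kernel:
        # (c_out/beta) * (c_in*beta) * k * k == c_out * c_in * k * k.
        self.conv = nn.Conv2d(c_in * beta, c_out // beta, kernel_size,
                              stride=stride, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        # x is expected to already carry c_in*beta channels, because the
        # previous layer's output was expanded by beta as well.
        out = self.conv(x)
        # Repeat operation: beta^2 copies concatenated along the channel axis.
        return out.repeat(1, self.beta ** 2, 1, 1)
```

For example, with `beta=2` a `RepConv(16, 16)` layer maps a 32-channel input to a 32-channel output, while the convolution it actually performs is a 32-to-8-channel one with the same cost as a plain 16-to-16-channel convolution.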

Figure 3: The convolution process of RepConv

Finally, at the same cost of convolutional computation, RepConv generates a feature map of $C_{out}*\beta$ channels, while its counterpart, a normal convolution, creates a feature map of just $C_{out}$ channels. Since the output of a previous layer is used as the input to the next layer, the expanded information is consistently passed on in networks using RepConv. A batch normalization right after the repeat operation of RepConv, together with a residual link that transfers the output of the previous layer to that of the current layer, makes full use of the feature map enriched by RepConv.

Our experiments demonstrate that replacing the normal convolution module with RepConv improves the accuracy of both full-precision and binary networks. More discussion of why a simple repeating operation improves the network accuracy, and of how batch normalization and the residual link affect the feature contents after RepConv, is presented in Section 5.1.3.

$\beta$ is a hyperparameter of RepConv that represents the degree of information dilation in a feature map. However, a larger $\beta$ raises the computational overhead of non-convolutional operations like batch normalization. In Section 4.4, we give an example to demonstrate that this computational overhead of RepConv is negligible compared to the overall calculation of the entire network. The discussion of how different $\beta$ values affect the accuracy is given in Section 5.1.2.

4 RepTran

The binary version of RepConv is called RepBconv, shown in Fig.4b, which consists of a regular Bconv module with reshaped weight kernels and a subsequent $\beta^2\times$ repeat operation. A complete BNN generally consists of structures such as the first full-precision convolution layer, binary convolution layers, bypasses, batch normalizations, and the last fully connected layer. As shown in Fig.4c, using RepBconv instead of the regular Bconv increases the size of the information flow by $\beta$ times in a binary convolution structure, and hence in the entire BNN.
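A minimal sketch of RepBconv in the same PyTorch style is given below; the sign-based binarization with a clipped straight-through estimator is only one simple possibility and is not necessarily the binarization scheme used by the cited BNNs:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SignSTE(torch.autograd.Function):
    """Sign in the forward pass, clipped straight-through gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        return grad_output * (x.abs() <= 1).float()

class RepBconv(nn.Module):
    """Binarize the (already beta-wide) input and the reshaped weight,
    convolve, then repeat the result beta^2 times along the channel axis."""
    def __init__(self, c_in, c_out, kernel_size=3, stride=1, beta=2):
        super().__init__()
        self.beta, self.stride, self.padding = beta, stride, kernel_size // 2
        self.weight = nn.Parameter(
            0.01 * torch.randn(c_out // beta, c_in * beta, kernel_size, kernel_size))

    def forward(self, x):
        xb = SignSTE.apply(x)
        wb = SignSTE.apply(self.weight)
        out = F.conv2d(xb, wb, stride=self.stride, padding=self.padding)
        return out.repeat(1, self.beta ** 2, 1, 1)
```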

The transformation rule for these BNN structures accompanying the usage of RepBconv is called RepTran. When the number of input and output channels is increased by RepBconv, these structures must be changed accordingly to maintain their function while controlling the extra computational cost related to the enlarged information size. This section explains how RepTran works from the first layer, through the backbone, to the last layer of the network, as illustrated in Fig.5.

(a) A normal binarized convolution structure with a regular Bconv module
(b) A RepBconv module
(c) A binary convolution structure after the replacement of Bconv with RepBconv
Figure 4: Taking a structure fragment in ReActNet as an example, the usage of the RepBconv module brings more feature information to traditional binary neural networks.
Figure 5: A diagram of a typical binary neural network consisting of a first layer, a last layer, and a backbone in between.

4.1 The Rule of RepTran in the First Layer

Most BNNs use floating-point weights and activations in the first layer to avoid a loss of network accuracy, such as ReCU[22], ReActNet[14], and RBNN[11]. There are also works like FracBNN[26] that encode the information in the first layer and binarize the rest of the network.

No matter how these first layers are implemented, RepTran simply replicates their output feature map $\beta$ times along the channel direction right before the features enter the batch normalization module. The rule of RepTran in the first layer is to prepare a $\beta$-fold block of input for the RepBconv modules in successive layers without interrupting the original structure of the first layer.
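A minimal sketch of this rule (with a hypothetical stem and channel sizes; the original first convolution is left untouched):

```python
import torch
import torch.nn as nn

class RepFirstLayer(nn.Module):
    """Hypothetical wrapper: keep the original full-precision stem, then
    replicate its output beta times along channels before batch normalization,
    so that subsequent RepBconv layers receive a beta-fold input."""
    def __init__(self, c_out=32, beta=2):
        super().__init__()
        self.beta = beta
        self.stem = nn.Conv2d(3, c_out, 3, stride=2, padding=1, bias=False)  # unchanged
        self.bn = nn.BatchNorm2d(c_out * beta)  # BN sized for the replicated channels

    def forward(self, x):
        out = self.stem(x)
        out = out.repeat(1, self.beta, 1, 1)   # beta-fold replication along channels
        return self.bn(out)
```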

4.2 The Rule of RepTran in the Backbone

As shown in Fig.5, a backbone is the main body of a typical binary neural network and consumes most of the computation. A backbone consists of a series of binarized convolution structures with the necessary residual bypasses. Some bypasses simply deliver data to Add modules, while others down-sample the data when necessary. The computational expense of a downsampling bypass that uses full-precision convolution depends on the size of its inputs and outputs.

In the main branch of a backbone, RepTran simply replaces all Bconv modules with RepBconvs. Details of this replacement are explained in Section 3. However, RepTran transforms a downsampling bypass that uses full-precision convolution differently. In order to avoid the $\beta^2$-fold convolutional calculation caused by the inputs and outputs enlarged by RepBconv, RepTran replaces the regular full-precision convolution in a downsampling bypass with a full-precision RepConv. For the other operations in the backbone, such as Sign, batch normalization and ReLU, RepTran directly increases their number of input channels by a factor of $\beta$.
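As a concrete illustration (reusing the RepConv sketch from Section 3 and hypothetical channel sizes), a downsampling bypass after RepTran could look as follows; the full-precision RepConv keeps the bypass's convolutional cost at the original level even though its input and output tensors are $\beta$ times wider:

```python
import torch
import torch.nn as nn

beta, c_in, c_out = 2, 128, 256  # hypothetical channel sizes

# Downsampling bypass after RepTran: a full-precision RepConv (stride 2)
# followed by a BN sized for the beta-wide output of the repeat operation.
downsample = nn.Sequential(
    RepConv(c_in, c_out, kernel_size=1, stride=2, beta=beta),  # RepConv sketch from Sec. 3
    nn.BatchNorm2d(c_out * beta),
)

x = torch.randn(1, c_in * beta, 28, 28)   # beta-wide input from the previous layer
print(downsample(x).shape)                # torch.Size([1, 512, 14, 14])
```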

Figure 6: Three different ways of applying RepTran in the last layer with $\beta$=2: a) the original last layer, b) RepTran takes all the input channels, c) RepTran takes $1/\beta$ of the input channels, d) RepTran takes $1/\beta^2$ of the input channels.

4.3 The Rule of RepTran in the Last Layer

Most BNNs take a full-precision fully connected layer as the last layer, which consumes floating-point operations. Given an input with $\beta$ times more channels than in the original network, the fully connected layer consequently consumes $\beta$-fold more parameters and computation.

RepTran in the last layer may be implemented in three ways. First, it uses all the channels of the input, which increases the amount of computation by $\beta$ times. Second, it uses $1/\beta$ of the input channels and fulfills the computation at the same budget as the original final layer. Third, it uses $1/\beta^2$ of the input channels, which reduces the amount of computation. Without losing generality, we exhibit the three ways to run RepTran with $\beta=2$ in the last layer in Fig.6. According to the experimental results in Section 5.1, the RepTran rule transforms the last layer in the first way, which achieves better accuracy at the cost of a small increment in computation.
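The three options can be sketched as follows (hypothetical shapes; the pooled feature vector of the $\beta$-wide backbone has $C*\beta$ entries):

```python
import torch
import torch.nn as nn

beta, C, num_classes = 2, 1024, 1000
feat = torch.randn(8, C * beta)   # pooled features from the beta-wide backbone

# Way 1: use all C*beta channels (beta-fold FC parameters and computation).
logits_1 = nn.Linear(C * beta, num_classes)(feat)
# Way 2: use the first C channels (same FC cost as the original network).
logits_2 = nn.Linear(C, num_classes)(feat[:, :C])
# Way 3: use the first C/beta channels (reduced FC cost).
logits_3 = nn.Linear(C // beta, num_classes)(feat[:, :C // beta])
```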

4.4 The Extra Computational Expense of RepTran

RepTran does not consume more convolution calculation than the original network. However, the expanded size of the information flow after applying RepConv causes extra computation in modules such as Sign and ReLU, and makes the fully connected layer (FC) and batch normalization (BN) more computationally expensive.

The BN operation can be divided into two steps, normalization and scaling, each with an equal amount of calculation in a normal binarized convolution structure. Since RepConv or RepBconv enlarges the number of output channels through replication, the normalization result based on just one set of the channels is shared by the others. Therefore the computation consumed by normalization is reduced to $1/(2\beta)$ of the original. However, the calculation amount of scaling, which is proportional to the number of channels, is increased by $\beta$ times. In summary, after RepTran, the total amount of BN calculation becomes $1/(2\beta)+\beta/2$ of the original.
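For the default setting $\beta=2$ used in this paper, this ratio evaluates to $1/(2\beta)+\beta/2 = 1/4 + 1 = 1.25$, i.e., a 25% increase in BN computation, which is consistent with the BN column of Table 1 ($1.261/1.009 \approx 1.25$).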

The statistics of OPs in ReActNet-A[14] before and after RepTran ($\beta$=2) are listed in Table 1. Following the methods in [19][14][15], we count the total number of operations as $OPs = FLOPs + BOPs/64$, where $FLOPs$ and $BOPs$ denote the amounts of floating-point and binary computation respectively. In the case of RepTran with $\beta=2$, the calculation of the FC layer is doubled and that of BN is raised by 25%. However, the total OPs of the entire ReActNet-A are about 29× the extra computational cost of FC and BN introduced by RepTran. The analysis of ReActNet-A in Table 1 demonstrates that the computational overhead of implementing RepBconv through RepTran is negligible compared to the overall calculation of the entire network.

Table 1: Statistics of OPs of ReActNet-A before and after RepTran ($\beta$=2). FC: fully connected layer. Conv: full-precision convolution. BN: batch normalization. Bconv: binarized convolution. Among them, FC, Conv, and BN are floating-point operations, and Bconv is a binary operation. $OPs = FLOPs + BOPs/64$.
                  FLOPs (×10^7)            BOPs (×10^9)   OPs (×10^8)
                  FC      Conv    BN       Bconv          without BN   with BN
ReActNet-A        0.102   1.084   1.009    4.822          0.872        0.973
Rep-ReActNet-A    0.205   1.084   1.261    4.822          0.882        1.008
$\Delta$          0.102   -       0.252    -              0.010        0.035
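As a quick sanity check of this accounting (plain arithmetic on the Table 1 entries, not code from the paper):

```python
# Recompute the OPs of Rep-ReActNet-A from its Table 1 entries: OPs = FLOPs + BOPs/64.
flops = (0.205 + 1.084 + 1.261) * 1e7   # FC + Conv + BN, in floating-point operations
bops = 4.822 * 1e9                       # binarized convolutions
print((flops + bops / 64) / 1e8)         # ~1.008, the "with BN" OPs entry
```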

5 Experiments

For a fair comparison, we use the open-source inference and training code provided by IR-Net[18], RBNN[11], FracBNN[26], ReCU[22], Bi-Real Net[15], ReActNet[14] and AdamBNN[13]. We also apply exactly the same data preprocessing settings as these recent outstanding BNN works.

5.1 Ablation Studies

5.1.1 Configuration of RepTran in the Fully Connected Layer

There are three different ways to apply RepTran in a fully connected layer, as introduced in Section 4.3. As illustrated by the benchmark on top of the basic BNN works IRNet-ResNet-20 and ReCU-ResNet-20 in Table 2, the first solution of RepTran provides the best accuracy on CIFAR-10 at only about 1k extra OPs, which is acceptable.

Table 2: Benchmark of the three different RepTran solutions in a fully connected layer on top of IRNet-ResNet-20 and ReCU-ResNet-20 @ CIFAR-10 in terms of Top-1 (%) and OPs.
          Raw                 Solution 1          Solution 2          Solution 3
          Top-1(%)  OPs       Top-1(%)  OPs       Top-1(%)  OPs       Top-1(%)  OPs
IR-Net    86.50     1069696   87.59     1070336   87.50     1069696   87.50     1069376
ReCU      87.50     1069696   88.97     1070336   88.92     1069696   88.82     1069376

5.1.2 The Choice of Hyperparameter $\beta$

The hyperparameter $\beta$ balances the computational overhead of non-convolutional operations, the storage requirement for the duplicated information, and the accuracy of RepBNNs. In the ablation study of $\beta$, each layer of the network shares the same $\beta$ value. Table 3 shows the effect of different $\beta$ values on the accuracy of IRNet-ResNet-20 and ReCU-ResNet-20 on CIFAR-10. Although $\beta=4$ achieves better accuracy, we set $\beta$ to 2 by default in RepBconv for the best trade-off between the additional OPs, the limited storage capacity and the gain in accuracy.

Table 3: The accuracy of ResNet-20 @ CIFAR-10 with different $\beta$ values on top of IRNet and ReCU.
          Raw(%)   $\beta$=2(%)   $\beta$=4(%)   $\beta$=8(%)
IR-Net    86.50    87.59          88.07          87.07
ReCU      87.50    88.97          89.59          88.72

5.1.3 The Role of Batch Normalization and Residuals for RepBconv

First, we discuss where to insert the BN module relative to RepBconv. As shown in Fig.4b, a BN can be inserted before or after the repeat operation. Table 4 provides the experimental results on IRNet-ResNet-20 and ReCU-ResNet-20. Applying batch normalization before the repeat operation downgrades the accuracy by $2\sim3\%$, whereas the accuracy is improved by about 1% if BN runs on the outputs extended by repeating.

Table 4: The accuracy of batch normalization before and after repeat in Rep-IRNet-ResNet-20 ($\beta$=2) and Rep-ReCU-ResNet-20 ($\beta$=2) on CIFAR-10.
          Original(%)   RepBNN, BN after repeat(%)   RepBNN, BN before repeat(%)
IR-Net    86.50         87.59                        82.74
ReCU      87.50         88.97                        85.30

In order to verify the role of the residual link (the bypass shown in Fig.4c) for RepBconv, we benchmark the accuracy on CIFAR-10 of RepTran-transformed IR-Net and ReCU on top of ResNet-20 and VGG respectively. The experimental results in Table 5 show that RepTran stably improves the accuracy of Rep-IRNet-ResNet-20 and Rep-ReCU-ResNet-20, but has little effect on, or even decreases, the accuracy of Rep-IRNet-VGG and Rep-ReCU-VGG, which are free of residual links.

Table 5: The accuracy of VGG and ResNet-20 before and after RepTran on CIFAR-10.
                    Raw(%)   RepBNN(%)   $\Delta$(%)
IR-Net-VGG          90.40    86.95       -3.45
ReCU-VGG            92.22    92.41       0.19
IR-Net-ResNet-20    86.50    87.59       1.09
ReCU-ResNet-20      87.50    88.97       1.47

The experimental results in Table 4 and Table 5 suggest that if we transform a BNN with residual links into a RepBNN and apply batch normalization after the repeat operation of RepBconv, the accuracy of the network will be improved.

(a) raw image
(b) feature map after RepConv
(c) feature map after batch normalization
(d) feature map after residuals
Figure 7: Visualization of the feature map in the 7th layer of Rep-IRNet-ResNet-20 ($\beta$=2). Given the raw input image in (a), the 7th layer creates a feature map of 8 channels, shown as one row of figures in (b); RepBconv then repeats the feature map $\beta^2$ times to obtain 4 rows of the same figures as in (b). After batch normalization, these features are scaled and biased differently as in (c), and with the help of the residual link from the output of the previous layer, the variance between features becomes even more obvious.

To better understand the effect of batch normalization and the residual link on RepBNNs, we visualize the feature map in the 7th layer of Rep-IRNet-ResNet-20 with $\beta$=2. As displayed in Fig.7, given the raw input image in (a), the 7th layer creates a feature map of 8 channels, shown as one row of figures in (b); RepBconv then repeats the feature map $\beta^2$ times to obtain 4 rows of the same feature contents as in (b). After batch normalization, these features are scaled and biased differently as in (c), and with the help of the residual link from the output of the previous layer, the variance between features is further enlarged. Residual links accumulate the differences after batch normalization layer by layer, so that the variances between the repeated channels are continuously enlarged, which greatly increases the diversity of the feature map in Fig.7d. In this case, an informative and diverse feature map of 32 channels finally contributes to a better performance of RepBNN than its original with feature maps of 16 channels.

5.2 RepConv in Full-precision Network

RepConv can also be applied to full-precision convolution. We replace the convolutions with RepConv in the original full-precision ResNet-20[6][9] and in a modified version with the Bi-Real[15] structure. In the Bi-Real version, there is a residual link for each convolutional layer, whereas in ResNet-20 a residual link spans every two convolutional layers. The test on CIFAR-10 in Table 6 reveals that RepConv improves the accuracy of the modified version using the Bi-Real structure but reduces the performance of the plain full-precision ResNet-20. This illustrates the potential of the RepConv structure in non-binary networks, and shows that its superior performance highly depends on dense residual links.

Table 6: Benchmark of RepConv ($\beta=2$) on full-precision ResNet-20 and its modified version with the Bi-Real structure @ CIFAR-10 in terms of accuracy.
                      Raw(%)   RepConv(%)   $\Delta$(%)
ResNet-20             91.73    90.62        -1.11
ResNet-20(Bi-Real)    90.57    90.84        0.27

5.3 Comparison with SOTA BNN Methods

To verify the advantages of RepBNN, we apply RepTran to a large number of recent outstanding binary neural networks with open-sourced code: IR-Net[18], RBNN[11], FracBNN[26], and ReCU[22] with code for CIFAR-10, and Bi-Real Net[15], ReCU[22], ReActNet[14], and AdamBNN[13] with code for ImageNet. We conduct extensive experiments on these two datasets.

5.3.1 CIFAR-10

The accuracy of the RepTran-transformed versions and of the original BNNs is shown in Table 7.

Table 7: The accuracy of RepBNN on CIFAR-10 ($\beta=2$, bit-width (W/A)=1/1).
Network      Method           Top-1(Raw)(%)   Top-1(RepBNN(Ours))(%)   $\Delta$(%)
VGG-small    RBNN             91.3            92.3                     1.0
VGG-small    IR-Net           90.4            86.9                     -3.5
VGG-small    ReCU             92.2            92.4                     0.2
ResNet-20    IR-Net           86.5            87.6                     1.1
ResNet-20    RBNN             86.5            88.6                     2.1
ResNet-20    FracBNN(1bit)    87.2            87.6                     0.4
ResNet-20    RBNN-bireal      87.8            88.7                     0.9
ResNet-20    ReCU             87.5            89.0                     1.5
ResNet-18    RBNN             92.2            93.1                     0.9
ResNet-18    ReCU             92.8            93.6                     0.8

After applying RepTran, the accuracy of ResNet-type binary networks is significantly improved. Among them, Rep-ReCU-ResNet-20 achieves an accuracy of 88.97%, far exceeding the current state-of-the-art.

5.3.2 ImageNet

We apply RepTran to Bi-Real Net[15], ReCU[22], ReActNet[14], and AdamBNN[13], and compare the accuracy with other popular BNN works[11][18] on ImageNet. The results are shown in Table 8. RepTran steadily improves the accuracy of the models; in particular, the Top-1 accuracy of Rep-AdamBNN-ReActNet-A on ImageNet reaches 71.34%, which is the current state-of-the-art.

Table 8: The accuracy of RepBNN on ImageNet ($\beta=2$, bit-width (W/A)=1/1).
Network       Method                    Top-1(%)   Top-5(%)
ResNet-18     BNN[4]                    42.2       67.1
ResNet-18     XNOR-Net[19]              51.2       73.2
ResNet-18     Bi-Real Net[15]           56.4       79.5
ResNet-18     Rep-Bi-Real Net(Ours)     57.1       79.5
ResNet-18     IR-Net[18]                58.1       80.0
ResNet-18     RBNN[11]                  59.9       81.9
ResNet-18     ReCU[22]                  61.0       82.6
ResNet-18     Rep-ReCU(Ours)            62.3       83.8
ReActNet-A    ReActNet[14]              69.4       88.6
ReActNet-A    Rep-ReActNet(Ours)        70.2       89.1
ReActNet-A    AdamBNN[13]               70.5       89.1
ReActNet-A    Rep-AdamBNN(Ours)         71.3       89.8

6 Conclusions

Binary neural network (BNN) is an extreme quantization version of convolutional neural networks (CNNs) in which all features and weights are mapped to just 1 bit. Although BNN saves a great deal of memory and computation, making CNNs applicable on edge or mobile devices, it suffers a drop in network performance due to the reduced representation capability after binarization. In this paper, we propose a new replaceable and easy-to-use convolution module, RepConv, which enhances feature maps by replicating the input or output along the channel dimension $\beta$ times without extra cost in the number of parameters or convolutional computation. We also define a set of RepTran rules to apply RepConv throughout BNN modules such as binary convolution, the fully connected layer and batch normalization. We apply RepTran to a large number of state-of-the-art binarization works, leading to a series of enhanced binary networks, named RepBNNs. These RepBNNs are validated on CIFAR-10[10] and ImageNet[20]. Among them, Rep-ReCU-ResNet-20 achieves 88.97% accuracy on CIFAR-10, and Rep-AdamBNN-ReActNet-A achieves 71.34% accuracy on ImageNet.

It is hoped that our work can bring some inspiration to the fields of lightweight network architecture design and network architecture search for binary neural networks.

References

  • [1] Bengio, Y., Léonard, N., Courville, A.: Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432 (2013)
  • [2] Bethge, J., Bartz, C., Yang, H., Chen, Y., Meinel, C.: Meliusnet: Can binary neural networks achieve mobilenet-level accuracy? arXiv preprint arXiv:2001.05936 (2020)
  • [3] Bulat, A., Martinez, B., Tzimiropoulos, G.: High-capacity expert binary networks. arXiv preprint arXiv:2010.03558 (2020)
  • [4] Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830 (2016)
  • [5] Ding, X., Ding, G., Zhou, X., Guo, Y., Han, J., Liu, J.: Global sparse momentum sgd for pruning very deep neural networks. arXiv preprint arXiv:1909.12778 (2019)
  • [6] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
  • [7] Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
  • [8] Howard, A., Sandler, M., Chu, G., Chen, L.C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., et al.: Searching for mobilenetv3. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1314–1324 (2019)
  • [9] Idelbayev, Y.: Proper ResNet implementation for CIFAR10/CIFAR100 in PyTorch. https://github.com/akamaster/pytorch_resnet_cifar10, accessed: 2022-01-xx
  • [10] Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009)
  • [11] Lin, M., Ji, R., Xu, Z., Zhang, B., Wang, Y., Wu, Y., Huang, F., Lin, C.W.: Rotated binary neural network. arXiv preprint arXiv:2009.13055 (2020)
  • [12] Liu, H., Simonyan, K., Yang, Y.: Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055 (2018)
  • [13] Liu, Z., Shen, Z., Li, S., Helwegen, K., Huang, D., Cheng, K.T.: How do adam and training strategies help bnns optimization? arXiv preprint arXiv:2106.11309 (2021)
  • [14] Liu, Z., Shen, Z., Savvides, M., Cheng, K.T.: Reactnet: Towards precise binary neural network with generalized activation functions. In: European Conference on Computer Vision. pp. 143–159. Springer (2020)
  • [15] Liu, Z., Wu, B., Luo, W., Yang, X., Liu, W., Cheng, K.T.: Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm. In: Proceedings of the European conference on computer vision (ECCV). pp. 722–737 (2018)
  • [16] Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., Zhang, C.: Learning efficient convolutional networks through network slimming. In: Proceedings of the IEEE international conference on computer vision. pp. 2736–2744 (2017)
  • [17] Martinez, B., Yang, J., Bulat, A., Tzimiropoulos, G.: Training binary neural networks with real-to-binary convolutions. arXiv preprint arXiv:2003.11535 (2020)
  • [18] Qin, H., Gong, R., Liu, X., Shen, M., Wei, Z., Yu, F., Song, J.: Forward and backward information retention for accurate binary neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2250–2259 (2020)
  • [19] Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: Xnor-net: Imagenet classification using binary convolutional neural networks. In: European conference on computer vision. pp. 525–542. Springer (2016)
  • [20] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. International journal of computer vision 115(3), 211–252 (2015)
  • [21] Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: Mobilenetv2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4510–4520 (2018)
  • [22] Xu, Z., Lin, M., Liu, J., Chen, J., Shao, L., Gao, Y., Tian, Y., Ji, R.: Recu: Reviving the dead weights in binary neural networks. arXiv preprint arXiv:2103.12369 (2021)
  • [23] Yang, Z., Wang, Y., Han, K., Xu, C., Xu, C., Tao, D., Xu, C.: Searching for low-bit weights in quantized neural networks. arXiv preprint arXiv:2009.08695 (2020)
  • [24] Yuan, C., Agaian, S.S.: A comprehensive review of binary neural network. arXiv preprint arXiv:2110.06804 (2021)
  • [25] Zhang, X., Zhou, X., Lin, M., Sun, J.: Shufflenet: An extremely efficient convolutional neural network for mobile devices. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6848–6856 (2018)
  • [26] Zhang, Y., Pan, J., Liu, X., Chen, H., Chen, D., Zhang, Z.: Fracbnn: Accurate and fpga-efficient binary neural networks with fractional activations. In: The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. pp. 171–182 (2021)
  • [27] Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., Zou, Y.: Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160 (2016)
  • [28] Zhu, S., Dong, X., Su, H.: Binary ensemble neural network: More bits per network or more networks per bit? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4923–4932 (2019)
  • [29] Zhuang, B., Shen, C., Tan, M., Liu, L., Reid, I.: Structured binary neural networks for accurate image classification and semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 413–422 (2019)