Realizing Neural Decoder at the Edge with Ensembled BNN
Abstract
In this work, we propose extreme compression techniques such as binarization and ternarization for Neural Decoders like TurboAE. These methods cut the memory footprint by a factor of about 32 and replace floating-point arithmetic with bitwise operations, with performance better than Neural Decoders quantized (to 1 or 2 bits) after training. However, because of the limited representation capability of binary and ternary networks, the performance is not as good as that of the real-valued decoder. To fill this gap, we further propose to ensemble several such weak performers at the edge to achieve performance similar to the real-valued network. These ensembled decoders retain large savings in memory and computation while achieving performance similar to the real-valued TurboAE.
Index Terms:
Neural decoding, Deep Learning, Computation and memory efficiency.
I Introduction
The future wireless communication system 6G will be equipped not only with multi-band high-speed transmission but also with energy-efficient communication, low latency, and high security. In a digital communication system, physical-layer channel codes such as LDPC, Polar, and Turbo codes [9, 3] are used as channel coding methods [4] to protect the data from corruption by channel noise. When the channel deviates from the Gaussian setting in a practical scenario, Neural Networks (NN) have been used to design the decoder in order to exploit the power of the encoder, while the encoder is kept fixed as a near-optimal code [5]. Deploying decoders for these codes demands heavy computation, which has become feasible only because of recent advancements in signal processing methods. With a surge in the number of devices in the network, the interactions among them may result in excessive signal processing at the user end and hence high power consumption. Therefore, economical energy usage for longer battery life in mobile devices has been a research direction of utmost importance [17, 10, 11]. In a noisy channel, designing the encoder has been challenging even when the decoders perform well [16]; hence the authors in [13, 12, 18, 14] proposed neural codes where the encoder and decoder are jointly trained. To overcome the problem of convergence to a local minimum in this joint optimization, [7] proposed TurboAE, a Convolutional Neural Network (CNN) based over-complete Auto Encoder (AE) that incorporates interleavers and de-interleavers to achieve the performance of State Of The Art (SOTA) channel codes under the AWGN scenario. All existing neural AEs have real-valued network parameters and perform floating-point operations during deployment. For instance, the TurboAE architecture has millions of parameters, occupying several megabytes of memory with a 32-bit floating-point representation, and the iterative decoder accounts for the bulk of these parameters. Because of this huge number of parameters, deploying such AEs in a resource-limited Internet of Things (IoT) setup is a challenging task. Furthermore, with the advent of edge computing in IoT scenarios, computation is decentralized to edge devices where the data is processed locally. Realizing a Neural Decoder such as TurboAE [7] at a user end that has limited memory and computing power is therefore not practically feasible.
I-A Contributions
In the domain of wireless communication, the channel noise is real-valued, and to date only real-valued Neural Decoders, relying entirely on floating-point operations, have been used for end-to-end training. In this work, we explore extreme compression techniques for Machine Learning-based wireless decoders such as TurboAE. We further propose techniques that make the decoder memory and computation efficient while keeping its performance close to that of the real-valued decoder. The major contributions of our work are the following:
1. We propose to use binary filters/weights/biases and binary activations in the Neural Decoder to save memory and computation at the edge. Binary Neural Networks [6] take compression to the extreme by replacing 32-bit floating-point (FP) weights and activations with 1-bit values, giving a memory reduction of 32 times; the FP multiplication and addition operations are replaced with xnor and popcount operations, which reduces the computation cost radically at inference time.
2. The performance is further improved by the use of a Ternary Neural Network (TNN), where the weights take three levels while the activations remain binary. The proposed architectures with binary and ternary weights are shown to be better than a trained network quantized to 1 or 2 bits.
3. An ensemble of multiple weak binary and ternary decoders is then proposed and is shown to perform close to the real-valued TurboAE while still providing large savings in memory and a substantial speed-up from the reduced computation, thus enabling energy efficiency and low latency in edge communication.
II Extreme compression techniques
We denote a real-valued NN by $f(\cdot;\boldsymbol{\theta})$, where $\boldsymbol{\theta}$ represents the real-valued network parameters. The output of the NN is $\mathbf{y} = f(\mathbf{x};\boldsymbol{\theta})$, where the input feature $\mathbf{x}$ can be real-valued. The neural network can be of any type: fully connected, a CNN, or a Recurrent Neural Network (RNN). As TurboAE uses CNNs for the Neural Decoder, we focus on CNNs here. For a CNN of $L$ layers, the parameters are the filters $\boldsymbol{\theta} = \{\mathbf{W}^l\}_{l=1}^{L}$, where $\mathbf{W}^l \in \mathbb{R}^{c^l_{in} \times c^l_{out} \times k}$ for layer $l$ of a one-dimensional CNN. Here $c^l_{in}$ and $c^l_{out}$ represent the number of input and output channels and $k$ is the dimension of the filter. For a one-dimensional CNN as used in TurboAE, if the input to layer $l$ has spatial dimension $D^l$, then the input to layer $l$ is $\mathbf{A}^l \in \mathbb{R}^{c^l_{in} \times D^l}$. The output of layer $l$ is $\mathbf{A}^{l+1} \in \mathbb{R}^{c^l_{out} \times D^{l+1}}$, where $D^{l+1}$ is the spatial dimension of the output. For a Binary Neural Network (BNN), the weights and activations ($\mathbf{W}^l$ and $\mathbf{A}^l$) are binarized using the $\mathrm{sign}$ function before taking the convolution:
$$\mathrm{sign}(x) = \begin{cases} +1, & x \ge 0 \\ -1, & x < 0 \end{cases} \qquad (1)$$
The binarized parameters and activations are given by:
$$\mathbf{W}^l_b = \mathrm{sign}(\mathbf{W}^l), \qquad \mathbf{A}^l_b = \mathrm{sign}(\mathbf{A}^l). \qquad (2)$$
The real-valued convolution is approximated with binary weights and activations as $\mathbf{W}^l * \mathbf{A}^l \approx \mathbf{W}^l_b \circledast \mathbf{A}^l_b$, where $\circledast$ denotes convolution performed with bitwise operations. Even though the binarized weights are used for the forward pass, only the real-valued latent weights are updated with the real-valued gradients during backpropagation. During inference, these latent weights can be dropped and a network with binary weights and activations can be deployed. The $\mathrm{sign}$ function is non-differentiable and has zero gradient almost everywhere; thus it is not suitable for backpropagation during training. Therefore a straight-through estimator [2] is used that binarizes in the forward pass but, during backpropagation, simply passes the gradients through to the previous layers. For instance, if $z = \mathrm{sign}(r)$, then $\frac{\partial C}{\partial r} = \frac{\partial C}{\partial z}\,\mathbf{1}_{|r| \le 1}$, where $C$ is the cost function of the NN. To keep the updates stable during training, the updated real-valued weights are clipped to $[-1, 1]$.
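A minimal PyTorch-style sketch of this binarization with the straight-through estimator is shown below. The class names are illustrative assumptions, the bias is left real-valued for simplicity, and this is not the authors' released implementation.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """sign() in the forward pass, clipped straight-through estimator in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        # sign() with the convention sign(0) = +1, as in Eq. (1)
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # pass the gradient through only where |x| <= 1
        return grad_out * (x.abs() <= 1).float()

binarize = BinarizeSTE.apply

class BinaryConv1d(torch.nn.Conv1d):
    """1D convolution whose latent real-valued weights and inputs are
    binarized on the fly during the forward pass."""

    def forward(self, x):
        w_b = binarize(self.weight)   # binary weights
        x_b = binarize(x)             # binary activations
        return torch.nn.functional.conv1d(
            x_b, w_b, self.bias, self.stride, self.padding)
```

After each optimizer step, the real-valued latent weights can be clipped with `p.data.clamp_(-1, 1)` to keep the updates stable, as described above.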
If a real-valued network is deployed on a 32-bit system, its binary version occupies 32 times less memory, and all floating-point operations can be converted to xnor and popcount operations. However, because of this extreme compression, the performance generally degrades significantly. Hence [8] proposed the Ternary Neural Network (TNN), where the weights are constrained to three levels. The ternarized parameter is given by:
$$\mathbf{W}^l_t = \begin{cases} +1, & \mathbf{W}^l > \Delta \\ 0, & |\mathbf{W}^l| \le \Delta \\ -1, & \mathbf{W}^l < -\Delta \end{cases} \qquad (3)$$
where $\Delta$ is a threshold computed in our architecture from the parameters of the real-valued network. The introduction of zero as a third level along with $\pm 1$ gives better representation power and therefore better performance than the BNN. The zero weights need not be stored during deployment, so the memory requirement of the TNN is the same as that of the BNN. Note that the activations are still binary, and thus the computational complexity is also the same as that of the BNN. Therefore, with a TNN, an improvement in performance over the BNN is possible without any increase in memory requirement or computation.
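A small sketch of the ternarization step is given below. The choice of a threshold proportional to the mean absolute weight follows common ternary-weight-network practice and is an assumption here, not a value taken from the text.

```python
import torch

def ternarize(w: torch.Tensor, delta_scale: float = 0.7) -> torch.Tensor:
    """Map real-valued weights to {-1, 0, +1} as in Eq. (3).
    The threshold is proportional to the mean absolute weight
    (the factor 0.7 is an illustrative assumption)."""
    delta = delta_scale * w.abs().mean()
    w_t = torch.zeros_like(w)
    w_t[w > delta] = 1.0
    w_t[w < -delta] = -1.0
    return w_t
```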
II-A Saving in computation
The convolution between the real-valued $\mathbf{W}^l$ and $\mathbf{A}^l$ at layer $l$ results in an output $\mathbf{A}^{l+1} \in \mathbb{R}^{c^l_{out} \times D^{l+1}}$. The total number of multiplications for layer $l$ is $c^l_{in} c^l_{out} k D^{l+1}$, and the number of additions is roughly the same. The total count of FLoating Point Operations (FLOPs) for layer $l$ of a real-valued 1D-CNN, the sum of the multiplications and additions, is therefore roughly twice the number of multiplications, i.e., about $2\, c^l_{in} c^l_{out} k D^{l+1}$. For the binary counterpart, as the weights and activations are constrained to $+1$ or $-1$, the 32-bit floating-point multiply-accumulate operations are replaced by 1-bit xnor-popcount operations [6]. Note that modern CPUs can perform a single floating-point multiplication and addition in one clock cycle, whereas many 1-bit xnor-popcount operations can be packed into a single clock cycle; a speedup of nearly 58 times for binary convolutions is reported in [15]. Because the filters take only $+1$ or $-1$, only a limited number of distinct filters are possible, so with a BNN the filter repetition can be exploited using dedicated hardware/software. The implementation on GPU can be made faster by using SIMD within a register (SWAR), where many binary variables are concatenated in a single register and a proportional speedup on bitwise operations like xnor is achieved.
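The operation counts above can be illustrated with a short helper. The packing factor of 64 one-bit operations per cycle is an assumption about the target register width, not a measured figure.

```python
def conv1d_flops(c_in: int, c_out: int, k: int, d_out: int) -> int:
    """Approximate FLOPs of a real-valued 1D convolution layer:
    one multiply and one add per weight per output position."""
    return 2 * c_in * c_out * k * d_out

def binary_op_count(c_in: int, c_out: int, k: int, d_out: int,
                    ops_per_cycle: int = 64) -> float:
    """Equivalent count of packed xnor-popcount operations when
    `ops_per_cycle` 1-bit operations are executed per clock cycle
    (the register width is an assumption)."""
    return conv1d_flops(c_in, c_out, k, d_out) / ops_per_cycle

# Example: one decoder-style layer with 100 input/output channels,
# kernel size 5, and output length 100.
print(conv1d_flops(100, 100, 5, 100))      # ~10 million FLOPs
print(binary_op_count(100, 100, 5, 100))   # far fewer packed binary operations
```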
III TurboAE and its binarized versions
The method of channel coding in TurboAE can be divided into three sub-problems: an encoder at the transmitter, a channel, and a decoder at the receiver. In a communication system, the encoder encodes the binary message $\mathbf{u} \in \{0,1\}^K$ of block length $K$ to produce the codeword $\mathbf{x}$ of length $N$ such that the codeword satisfies the power constraint. The code rate is $R = K/N$, where $K < N$. The i.i.d. AWGN channel corrupts the encoded bits and generates $y_i = x_i + z_i$ with $z_i \sim \mathcal{N}(0, \sigma^2)$ for $i = 1, \dots, N$. The noise level of the AWGN channel is expressed through the signal-to-noise ratio (SNR). After transmission through the channel, the decoder receives the real-valued noisy encoded bits $\mathbf{y}$ and maps them to an estimate $\hat{\mathbf{u}}$ of the actual message sequence using a decoding algorithm. Channel coding aims to minimize the Bit Error Rate (BER) or the BLock Error Rate (BLER) of the recovered message, given by $\mathrm{BER} = \frac{1}{K}\sum_{k=1}^{K}\Pr(\hat{u}_k \neq u_k)$ and $\mathrm{BLER} = \Pr(\hat{\mathbf{u}} \neq \mathbf{u})$. Naively applying deep learning models by replacing the encoder and decoder with general-purpose neural networks does not perform well. So in [7], the authors proposed TurboAE with interleaved encoding and iterative decoding using 1D convolutional neural networks. To make the Neural Decoder usable at the edge, we first propose to binarize and ternarize the iterative decoder of TurboAE and inspect its performance. We briefly describe the TurboAE architecture before discussing the proposed compression techniques.
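For concreteness, the channel and the error-rate metrics above can be written as a short sketch. The SNR-to-noise-variance mapping shown is a common convention for unit-power signals and is an assumption here, not taken from [7].

```python
import torch

def awgn(x: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Add i.i.d. Gaussian noise to a unit-power codeword,
    assuming sigma^2 = 10^(-snr_db/10)."""
    sigma = 10 ** (-snr_db / 20)
    return x + sigma * torch.randn_like(x)

def ber(u_hat: torch.Tensor, u: torch.Tensor) -> float:
    """Bit error rate between decoded bits and the transmitted message."""
    return (u_hat != u).float().mean().item()

def bler(u_hat: torch.Tensor, u: torch.Tensor) -> float:
    """Block error rate: fraction of blocks with at least one bit error."""
    return (u_hat != u).any(dim=-1).float().mean().item()
```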
Turbo code is one of the first capacity-approaching codes; it is based on a recursive systematic convolutional (RSC) code that admits an optimal decoding algorithm, namely Bahl-Cocke-Jelinek-Raviv (BCJR) [1]. To add long-range memory to the code, interleaving is used: of two copies of the input bits, the first passes through the RSC code and the second goes through the interleaver before passing through the same RSC code, as shown in Fig. 1 (left). After transmission through the channel, the code is decoded by repeating (i) and (ii) alternately: (i) soft decoding based on the signal received for the first copy, and (ii) using the de-interleaved version as a prior for decoding the second copy, as shown in Fig. 1 (right). This iterative decoding method keeps re-estimating the posterior distribution of the transmitted bits. Both the interleaved encoder and the iterative decoder are learnable, as proposed in TurboAE [7]. The interleaver and the de-interleaver shuffle and un-shuffle the input sequence with a random interleaving array known to both the encoder and the decoder. A code rate of $1/3$ is considered for the interleaved encoder, which has three learnable blocks $f_1$, $f_2$, and $f_3$. The first two take the original message bits $\mathbf{u}$ to produce $\mathbf{x}_1$ and $\mathbf{x}_2$, whereas the third block takes the interleaved message $\pi(\mathbf{u})$ to return $\mathbf{x}_3$, as shown in Fig. 1. The encoded messages are transmitted through the channel, and the received noisy messages are $\mathbf{y}_1$, $\mathbf{y}_2$, and $\mathbf{y}_3$. Our focus is mostly on the compression of the iterative decoder so that it can be deployed at edge devices; we therefore do not discuss the encoder in detail in this work. Interested readers may refer to [7] for more details on the encoder.
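For reference, the rate-1/3 interleaved encoding can be sketched as follows; `f1`, `f2`, `f3` stand in for the learnable CNN encoding blocks, `perm` is the shared interleaving index array, and the power normalization is omitted. This is an illustrative sketch, not the TurboAE implementation.

```python
import torch

def interleaved_encode(u, f1, f2, f3, perm):
    """Rate-1/3 interleaved encoding in the style of TurboAE (sketch)."""
    x1 = f1(u)             # first branch: original message
    x2 = f2(u)             # second branch: original message
    x3 = f3(u[..., perm])  # third branch: interleaved message
    return torch.cat([x1, x2, x3], dim=1)  # codeword of length N = 3K (power norm omitted)
```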
III-A Binary and Ternary iterative decoder
Considering $M$ iterations of the iterative decoder, each iteration consists of two decoders. The first decoder in an iteration takes the original noisy messages $\mathbf{y}_1$, $\mathbf{y}_2$ and the prior distribution on the transmitted bits, and returns a posterior that goes to the second decoder via interleaving along with the interleaved noisy message $\pi(\mathbf{y}_1)$ and $\mathbf{y}_3$. In the proposed binarized and ternarized TurboAE, named BinTurboAE and TernTurboAE respectively, the real-valued decoders are replaced with binary and ternary decoders. For ease of notation, we represent the complete binary decoder by $g_b$ and the ternary decoder by $g_t$. The main limitation of BinTurboAE and TernTurboAE is that they do not perform as well as the real-valued TurboAE. But in applications where a degradation in performance is acceptable in exchange for reduced computation and better energy efficiency, BinTurboAE or TernTurboAE can be deployed at edge devices. As the performance of BinTurboAE is not as good as its real counterpart, each such decoder can be thought of as a single weak learner. Instead of relying on a single weak learner, we further propose to ensemble a set of weak learners' outcomes to obtain performance as good as that of the real-valued network, but with a much lower complexity and memory requirement.
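A minimal sketch of one pass through such an iterative decoder is given below. The argument names, tensor shapes, and the final sigmoid are illustrative assumptions rather than the exact TurboAE implementation; the per-iteration decoders may be real-valued, binary, or ternary CNN blocks.

```python
import torch

def iterative_decode(dec1_list, dec2_list, y1, y2, y3, perm, inv_perm, iters=6):
    """One forward pass of a TurboAE-style iterative decoder (sketch).
    dec1_list/dec2_list hold the per-iteration CNN decoder blocks;
    perm/inv_perm are the interleaving and de-interleaving index arrays."""
    prior = torch.zeros_like(y1)  # flat prior on the transmitted bits
    for i in range(iters):
        # first decoder: non-interleaved copies plus the current prior
        post = dec1_list[i](torch.cat([y1, y2, prior], dim=1))
        # second decoder: interleaved copies plus the interleaved posterior
        post_i = dec2_list[i](torch.cat([y1[..., perm], y3, post[..., perm]], dim=1))
        prior = post_i[..., inv_perm]  # de-interleave for the next iteration
    return torch.sigmoid(prior)        # soft estimate of the message bits
```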
III-B Proposed Ensembled binary TurboAE
Considering each such decoder a weak learner, several weak learners (four in our experiments) are trained separately on the complete dataset. The idea of "ensembling" is to combine the opinions of all these weak learners to arrive at a better prediction. One of the many ways weak learners can be ensembled is Bagging [19]. In this work, we propose to ensemble BinTurboAEs with the Bagging method and denote the resulting decoder BinTurboAE-Bag; the same with TernTurboAEs is called TernTurboAE-Bag. Bagging is used in machine learning to improve stability and accuracy and to reduce variance. In the Bagging method, the decisions from each of these BinTurboAEs are averaged to obtain the final prediction, as shown in Fig. 2.
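A sketch of the bagging step is shown below, reusing the `iterative_decode` sketch from Section III-A; the dictionary layout of the ensemble members is an assumption for illustration.

```python
import torch

def bagged_decode(ensemble, y1, y2, y3, perm, inv_perm):
    """Bagging over independently trained weak decoders (e.g., four BinTurboAEs):
    average their soft bit estimates and take one hard decision at the end."""
    soft = torch.stack([
        iterative_decode(m["dec1"], m["dec2"], y1, y2, y3, perm, inv_perm)
        for m in ensemble
    ])
    p = soft.mean(dim=0)          # averaged posterior over the ensemble
    return (p > 0.5).float()      # hard decision on the message bits
```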
IV Experiments
To validate the usefulness of the proposed compression techniques, we consider the setting used in [7] to train the encoder and decoder networks. A large batch size, preferably 500 or more, is used to average out the channel noise effects. We train the encoder and decoder separately to avoid possible local optima. BinTurboAE and TernTurboAE need a smaller learning rate than the real-valued TurboAE; hence we reduce the learning rate by a factor of 10 whenever the validation loss stays saturated over several training epochs. The hyper-parameters used in our experiments are shown in Table I.
Table I: Hyper-parameters used for training.

| Hyper-parameter | Value |
|---|---|
| Loss | Binary Cross-Entropy (BCE) |
| Encoder | 2-layer 1D-CNN, kernel size 5, 100 filters per learnable encoding block |
| Decoder | 5-layer 1D-CNN, kernel size 5, 100 filters per learnable decoding block |
| Decoder iterations | 6 |
| Info feature size F | 5 |
| Batch size | 500 |
| Optimizer | Adam |
| Learning rate | 0.0001 initially, reduced by a factor of 10 when the test loss saturates over several epochs |
| Block length K | 100 |
| Number of epochs | 800 |
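Below is a minimal, self-contained training-loop sketch wired up with the Table I hyper-parameters. The tiny stand-in encoder/decoder modules, the training SNR, and the scheduler patience are assumptions for illustration only and do not reproduce the TurboAE blocks.

```python
import torch

K, BATCH, EPOCHS, RATE = 100, 500, 800, 3   # block length, batch size, epochs, 1/R branches

encoder = torch.nn.Conv1d(1, RATE, kernel_size=5, padding=2)          # placeholder encoder
decoder = torch.nn.Sequential(                                        # placeholder decoder
    torch.nn.Conv1d(RATE, 100, 5, padding=2), torch.nn.ELU(),
    torch.nn.Conv1d(100, 1, 5, padding=2))

loss_fn   = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=20)   # patience is an assumption

for epoch in range(EPOCHS):
    u = torch.randint(0, 2, (BATCH, 1, K)).float()    # random message bits
    x = encoder(2 * u - 1)                             # encoder kept fixed in this sketch
    y = x + 10 ** (-1.0 / 20) * torch.randn_like(x)    # AWGN at an illustrative SNR
    loss = loss_fn(decoder(y), u)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())   # LR /10 when the (validation) loss plateaus
```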
Table II: Memory and computation savings of the proposed decoders relative to the real-valued TurboAE.

| Model | Memory savings | Computation | Speed up | BER at fixed SNR |
|---|---|---|---|---|
| Full-precision DNN | 1x | FLOPs | 1x | |
| QuantTurboAE (q-bit) | 32/q x | FLOPs | 1x | (reported for q=4) |
| BinTurboAE | 32x | xnor-count | | |
| TernTurboAE | 32x | xnor-count | | |
| (Bin/Tern)TurboAE-Bag (n=4) | 32/n x | xnor-count | | |
IV-A Results
We present results showing performance in terms of BER vs. SNR for the proposed BinTurboAE and TernTurboAE and compare them with QuantTurboAE, a TurboAE quantized after training. For QuantTurboAE, the parameters of the trained TurboAE are quantized to different bit-widths, i.e., 8-bit, 4-bit, 2-bit, and 1-bit, giving memory savings of 4, 8, 16, and 32 times respectively compared to the real-valued TurboAE network, as shown in Table II. QuantTurboAE does not offer any saving in computation, unlike our proposed method. Post-training quantization with the larger bit-widths performs as well as the original TurboAE, but the 2-bit and 1-bit quantizations perform very poorly, as shown in Fig. 3. However, if instead of quantizing after training the network is trained with 1-bit quantization, as in BinTurboAE, it outperforms the 2-bit and 1-bit QuantTurboAEs. The Ternary network improves the BER performance further and performs similarly to a QuantTurboAE that stores each parameter with more bits, whereas TernTurboAE effectively uses only 1 bit. Therefore, compared to the real-valued TurboAE, both the binary and the ternary variants reduce the memory requirement by about 32 times and reduce the computation by converting all floating-point operations at the decoder to xnor and popcount operations. The performance gap between the proposed methods and TurboAE still exists and needs attention. To close this gap, several BinTurboAEs are ensembled as weak learners; the resulting performance is shown in Fig. 4.
The ensemble of just four BinTurboAEs implemented with the bagging method performs much better than a single BinTurboAE. BinTurboAE-Bag even outperforms the real-valued network in the low SNR region. The performance of TernTurboAE-Bag is slightly better than that of BinTurboAE-Bag, as shown in the figure. In the high SNR region, BinTurboAE-Bag performs close to the real TurboAE. This result is significant because BinTurboAE-Bag saves a lot of memory (about 8 times) and computation (FLOPs are replaced with xnor-count operations) at the edge device without compromising the BER performance.
IV-B Computation and memory savings at the edge devices
Decoding usually happens at the edge device. In the real TurboAE, the iterative decoder has a huge number of parameters that occupy a lot of memory, and it involves floating-point operations that make computation slow at the edge. Our main goal is therefore to reduce the memory requirement and the computation on the decoder side of TurboAE so that the proposed decoders are suitable for deployment at the edge. The savings for each of the proposed techniques are shown in Table II. BinTurboAE and TernTurboAE take up 32 times less memory than the real-valued TurboAE, while BinTurboAE-Bag takes four times the memory of a single BinTurboAE.
The decoder of the real TurboAE performs a very large number of FLOPs at the edge device. Even though the memory saving of a post-training quantized network can be substantial, QuantTurboAE and TurboAE do not speed up the computation because the operations are still carried out in floating point. As the Binary, Ternary, and Ensembled TurboAEs convert all the floating-point operations to bitwise operations only, their computation is extremely fast with much lower power consumption. Since many bitwise operations can be performed in a single clock cycle, the binary and ternary networks are many times faster, leading to very low latency compared with the real TurboAE network. Even though the computation in BinTurboAE-Bag is four times that of BinTurboAE, if parallel processing is available at the edge, BinTurboAE-Bag can be as fast as BinTurboAE.
V Conclusion
In summary, we propose BinTurboAE and TernTurboAE with the intention of deploying end-to-end channel coding on the targeted low-power edge devices, reducing the memory requirement by nearly 32 times and replacing floating-point computation with bitwise operations, at the cost of an acceptable performance degradation. We then propose BinTurboAE-Bag and TernTurboAE-Bag to improve the performance offered by a single BinTurboAE or TernTurboAE respectively and to approach the performance of the real-valued network. The ensembled technique implemented with four such weak learners is shown to consume about 8 times less memory and far less computing power than the real-valued TurboAE with nearly similar performance.
References
- [1] Lalit Bahl, John Cocke, Frederick Jelinek, and Josef Raviv. Optimal decoding of linear codes for minimizing symbol error rate (corresp.). IEEE Transactions on Information Theory, 20(2):284–287, 1974.
- [2] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
- [3] Claude Berrou, Alain Glavieux, and Punya Thitimajshima. Near Shannon limit error-correcting coding and decoding: Turbo-codes. 1. In Proceedings of ICC'93 - IEEE International Conference on Communications, volume 2, pages 1064–1070. IEEE, 1993.
- [4] Sepehr Dehdashtian, Matin Hashemi, and Saber Salehkaleybar. Deep-learning-based blind recognition of channel code parameters over candidate sets under AWGN and multi-path fading conditions. IEEE Wireless Communications Letters, 10(5):1041–1045, 2021.
- [5] Nghia Doan, Seyyed Ali Hashemi, and Warren J Gross. Neural successive cancellation decoding of polar codes. In 2018 IEEE 19th international workshop on signal processing advances in wireless communications (SPAWC), pages 1–5. IEEE, 2018.
- [6] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems, volume 29, 2016.
- [7] Yihan Jiang, Hyeji Kim, Himanshu Asnani, Sreeram Kannan, Sewoong Oh, and Pramod Viswanath. Turbo autoencoder: Deep learning based channel codes for point-to-point communication channels. In Advances in Neural Information Processing Systems, volume 32, 2019.
- [8] Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.
- [9] David JC MacKay and Radford M Neal. Near Shannon limit performance of low density parity check codes. Electronics Letters, 32(18):1645–1646, 1996.
- [10] Rajarshi Mahapatra, Yogesh Nijsure, Georges Kaddoum, Naveed Ul Hassan, and Chau Yuen. Energy efficiency tradeoff mechanism towards wireless green communication: A survey. IEEE Communications Surveys & Tutorials, 18(1):686–705, 2015.
- [11] Nancy Nayak, Thulasi Tholeti, Muralikrishnan Srinivasan, and Sheetal Kalyani. Green detnet: Computation and memory efficient detnet using smart compression and training. arXiv preprint arXiv:2003.09446, 2020.
- [12] Timothy J O'Shea, Tugba Erpek, and T Charles Clancy. Deep learning based MIMO communications. arXiv preprint arXiv:1707.07980, 2017.
- [13] Vishnu Raj and Sheetal Kalyani. Backpropagating through the air: Deep learning at physical layer without channel models. IEEE Communications Letters, 22(11):2278–2281, 2018.
- [14] Vishnu Raj and Sheetal Kalyani. Design of communication systems using deep learning: A variational inference perspective. IEEE Transactions on Cognitive Communications and Networking, 6(4):1320–1334, 2020.
- [15] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European conference on computer vision, pages 525–542. Springer, 2016.
- [16] Kirty Vedula, Randy Paffenroth, and D Richard Brown. Joint coding and modulation in the ultra-short blocklength regime for Bernoulli-Gaussian impulsive noise channels using autoencoders. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020, pages 5065–5069. IEEE, 2020.
- [17] Ming Zhan, Zhibo Pang, Ming Xiao, and Hong Wen. A state metrics compressed decoding technique for energy-efficient turbo decoder. EURASIP Journal on Wireless Communications and Networking, 2018(1):1–7, 2018.
- [18] Banghua Zhu, Jintao Wang, Longzhuang He, and Jian Song. Joint transceiver optimization for wireless communication PHY using neural network. IEEE Journal on Selected Areas in Communications, 37(6):1364–1373, 2019.
- [19] Shilin Zhu, Xin Dong, and Hao Su. Binary ensemble neural network: More bits per network or more networks per bit? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4923–4932, 2019.