
Realizing Neural Decoder at the Edge with Ensembled BNN

Devannagari Vikas, Nancy Nayak, and Sheetal Kalyani
Department of Electrical Engineering, Indian Institute of Technology Madras, India.
Emails: {ee19m018@smail,ee17d408@smail,skalyani@ee}.iitm.ac.in
Abstract

In this work, we propose extreme compression techniques such as binarization and ternarization for Neural Decoders like TurboAE. These methods reduce memory and computation by a factor of 64 and perform better than Neural Decoders quantized to 1 or 2 bits. However, because of the limited representation capability of binary and ternary networks, the performance does not match that of the real-valued decoder. To fill this gap, we further propose to ensemble 4 such weak performers at the edge to achieve performance similar to the real-valued network. These ensembled decoders give 16 and 64 times savings in memory and computation respectively, and help to achieve performance similar to the real-valued TurboAE.

Index Terms:
Neural decoding, Deep Learning, Computation and memory efficiency.

I Introduction

The future wireless communication system 6G will be equipped not only with multi-band high-speed transmission but also with energy-efficient communication, low latency, and high security. In a digital communication system, channel codes such as LDPC, Polar, and Turbo codes [9, 3] are used as channel coding methods [4] to protect the data from corruption by channel noise. When the channel deviates from the Gaussian setting in a practical scenario, Neural Networks (NNs) have been used to design the decoder, while the encoder is fixed as a near-optimal code, in order to exploit the power of the encoder [5]. Deploying decoders for these codes requires heavy computation, which has become possible only because of recent advancements in signal processing methods. With a surge in the number of devices in the network, the interactions among them may result in excessive signal processing at the user end and hence large power consumption. Therefore, economical energy usage for elongated battery life in mobile devices has been a research direction of utmost importance [17, 10, 11]. In a noisy channel, encoding has been challenging even though the decoders have good performance [16]; so the authors in [13, 12, 18, 14] proposed neural codes where the encoder and decoder are jointly trained. To overcome the problem of convergence to a local minimum in joint optimization, [7] proposed TurboAE, which uses a Convolutional Neural Network (CNN) based over-complete Auto Encoder (AE) incorporating interleavers and de-interleavers to achieve the performance of State Of The Art (SOTA) channel codes under the AWGN scenario. All the existing neural AEs have real-valued network parameters and perform floating-point operations during deployment. For instance, the TurboAE architecture has nearly 26e5 parameters, which take up 20.84 MB of memory with a 64-bit floating-point representation. Out of these 26e5 parameters, the encoder has nearly 1.5e5 whereas the decoder has nearly 25e5. Because of the huge number of parameters in the AEs, deploying them in a resource-limited Internet of Things (IoT) setup is a challenging task. Furthermore, with the advent of edge computing in IoT scenarios, computation is decentralized to edge devices where the data is processed locally. Realizing a Neural Decoder such as TurboAE [7] at a user end that has limited memory and computing power is not practically feasible.

I-A Contributions

In the domain of wireless communication, the channel noise is real-valued, and so far only real-valued Neural Decoders using floating-point operations have been used for end-to-end training. In this work, we explore the possibilities of using extreme compactification techniques in Machine Learning-based wireless decoders like TurboAE. We further propose techniques that allow the decoder to be memory- and computation-efficient while still performing close to the real-valued decoder. The major contributions of our work are the following:

  1. We propose to use binary filters/weights/biases and binary activations in the Neural Decoder to save memory and computation at the edge. (Footnote 1: Binary Neural Networks [6] take compression to the extreme by replacing 64-bit floating point (FP) weights and activations with 1-bit values, giving a 64-times memory reduction. The FP multiplication and addition operations are also replaced with xnor and popcount operations, which radically reduces computation cost during inference.)

  2. The performance is further improved by the use of a Ternary Neural Network (TNN) where the weights take three levels $\{-1,0,+1\}$ with binary activations. The proposed architectures with binary and ternary weights are shown to be better than those where the trained network is quantized to 2 bits or 1 bit.

  3. An ensemble of multiple weak binary or ternary decoders is then proposed and is shown to perform close to the real-valued TurboAE while achieving a 16-times saving in memory and nearly a 64-times speedup due to reduced computation, thus enabling energy efficiency and low latency in edge communication.

Before discussing different compressed versions of TurboAE, we first review the extreme compression techniques, namely the BNN and the TNN, in Sec. II and then study the impact of these techniques on TurboAE in Sec. III.

II Extreme compression techniques

We denote a real-valued NN by $g_{\bm{\phi}}(.)$ where $\bm{\phi}$ represents the real-valued network parameters. The output from the NN is given by $\mathbf{y}=g_{\bm{\phi}}(\mathbf{x})$ where $\mathbf{x}$ is the input feature vector to the NN and can be real-valued. The neural network $g_{\bm{\phi}}(.)$ can be of any type: fully connected, a CNN, or a Recurrent Neural Network (RNN). As TurboAE uses a CNN for the Neural Decoder, we now focus on CNNs. For $g_{\bm{\phi}}(.)$ a CNN of $L$ layers, the parameters are the filters of the CNN, given by $\bm{\phi}=\{\mathbf{W}_{1},\dots,\mathbf{W}_{L}\}$ where $\mathbf{W}_{l}\in\mathbb{R}^{c_{o}\times c_{i}\times k}$ for the $l^{th}$ layer of a one-dimensional CNN. Here $c_{i}$ and $c_{o}$ represent the number of input and output channels and $k$ is the dimension of the filter. For a one-dimensional CNN as used in TurboAE, if the input to the $l^{th}$ layer has spatial features of dimension $h_{in}$, then the input to the $l^{th}$ layer is $\mathbf{a}_{l}\in\mathbb{R}^{c_{i}\times h_{in}}$. The output of the $l^{th}$ layer is $\mathbf{a}_{l+1}\in\mathbb{R}^{c_{o}\times h_{out}}$ where $h_{out}$ is the dimension of the output. For a Binary Neural Network (BNN), the weights and activations ($\mathbf{W}$ and $\mathbf{a}$) are binarized using the $\mathrm{sign}$ function before taking the convolution.

$$b=\mathrm{sign}(r)=\begin{cases}+1,&\text{if }r\geq 0\\-1,&\text{otherwise}\end{cases}\qquad(1)$$

The binarized parameters $\mathbf{W}^{b}_{l}$ and $\mathbf{a}^{b}_{l}$ are given by:

$$\mathbf{W}^{b}_{l}=\mathrm{sign}(\mathbf{W}_{l}),\ \text{and}\ \mathbf{a}^{b}_{l}=\mathrm{sign}(\mathbf{a}_{l})\qquad(2)$$

The real-valued convolution is approximated with binary weights and activations as $\mathbf{W}_{l}\ast\mathbf{a}_{l}\approx\mathbf{W}^{b}_{l}\circledast\mathbf{a}^{b}_{l}$, where $\circledast$ denotes convolution performed with bitwise operations. Even though the binarized weights are used for the forward pass, only the real-valued latent weights are updated with the real-valued gradients during backpropagation. During inference, these latent weights can be dropped and a binary network with binary weights and activations can be deployed. The $\mathrm{sign}$ function is non-differentiable and has zero gradient almost everywhere; thus it is not suitable for backpropagation during training. Therefore a straight-through estimator [2] was proposed that binarizes in the forward pass but, during backpropagation, passes the gradients to the previous layers as they are. For instance, if $b=\mathrm{sign}(r)$, then $grad_{r}=grad_{b}\,\mathbf{1}_{|r|\leq 1}$ where $grad_{r}=\frac{\partial C}{\partial r}$, $grad_{b}=\frac{\partial C}{\partial b}$, and $C$ is the cost function of the NN. To have a stable update during training, the updated real-valued weights are clipped to $[-1,1]$.
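To make this concrete, the following is a minimal PyTorch-style sketch of the binarization rule in (1)-(2) together with the straight-through estimator; the class names and the use of torch.nn.functional.conv1d are our own illustrative choices, not the TurboAE implementation.

    import torch
    import torch.nn.functional as F

    class BinarizeSTE(torch.autograd.Function):
        """sign() in the forward pass; straight-through gradient clipped to |r| <= 1."""
        @staticmethod
        def forward(ctx, r):
            ctx.save_for_backward(r)
            # Eq. (1): +1 if r >= 0, -1 otherwise (torch.sign(0)=0, so handle 0 explicitly).
            return torch.where(r >= 0, torch.ones_like(r), -torch.ones_like(r))

        @staticmethod
        def backward(ctx, grad_b):
            (r,) = ctx.saved_tensors
            # grad_r = grad_b * 1_{|r| <= 1}
            return grad_b * (r.abs() <= 1).to(grad_b.dtype)

    class BinaryConv1d(torch.nn.Conv1d):
        """1D convolution whose latent real-valued weights and inputs are binarized
        in the forward pass; the optimizer still updates the latent real weights."""
        def forward(self, a):
            w_b = BinarizeSTE.apply(self.weight)   # W^b = sign(W), Eq. (2)
            a_b = BinarizeSTE.apply(a)             # a^b = sign(a), Eq. (2)
            return F.conv1d(a_b, w_b, self.bias, self.stride,
                            self.padding, self.dilation, self.groups)

    # After each optimizer step, the latent weights are clipped for stable training:
    #   with torch.no_grad(): layer.weight.clamp_(-1.0, 1.0)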

If a real-valued network $g_{\bm{\phi}}(.)$ is deployed on a 64-bit system, then its binary version will occupy 64 times less memory and all the floating-point operations can be converted to just xnor and popcount operations. However, because of this extreme compactification, the performance generally degrades significantly. So [8] proposed the Ternary Neural Network (TNN), where the weights take the three levels $\{-1,0,+1\}$. The ternarized parameter $t$ is given by:

$$t=\mathrm{tern}(r)=\begin{cases}+1,&\text{if }r>\Delta\\0,&\text{if }|r|\leq\Delta\\-1,&\text{if }r<-\Delta\end{cases}\qquad(3)$$

where $\Delta\approx 0.7\,E(|\mathbf{r}|)$ in our architecture, with $\mathbf{r}$ the set of parameters of the real network. The introduction of zero as another level along with $\{+1,-1\}$ gives better representation power and therefore better performance than a BNN. The zero weights need not be stored during deployment, so the memory requirement of a TNN is the same as that of a BNN. Note that the activations are still binary and thus the computational complexity is also the same as that of a BNN. Therefore, with a TNN, an improvement in performance over a BNN is possible without any increase in memory requirement or computation.
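A corresponding sketch of the ternarization rule in (3), assuming a single per-tensor threshold $\Delta = 0.7\,E(|\mathbf{r}|)$ as stated above (function name and the per-tensor choice are our own illustration):

    import torch

    def ternarize(r: torch.Tensor) -> torch.Tensor:
        """Map real-valued weights to the three levels {-1, 0, +1} of Eq. (3)."""
        delta = 0.7 * r.abs().mean()      # Delta ~ 0.7 * E(|r|)
        t = torch.zeros_like(r)
        t[r > delta] = 1.0                # +1 for r > Delta
        t[r < -delta] = -1.0              # -1 for r < -Delta; |r| <= Delta stays 0
        return t

During training, the same straight-through gradient used in the binary case can be applied to this non-differentiable mapping.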

Figure 1: TurboAE interleaved encoder (left), channel (middle), and TurboAE iterative decoder (right) with code rate $\frac{1}{3}$. $g_{\phi}^{v}=g_{\phi}^{b}$ for BinTurboAE and $g_{\phi}^{v}=g_{\phi}^{t}$ for TernTurboAE. Figure courtesy of [7].

II-A Saving in computation

The convolution between the real-valued $\mathbf{W}_{l}\in\mathbb{R}^{c_{o}\times c_{i}\times k}$ and $\mathbf{a}_{l}\in\mathbb{R}^{c_{i}\times h_{in}}$ at the $l^{th}$ layer results in an output $\mathbf{a}_{l+1}\in\mathbb{R}^{c_{o}\times h_{out}}$. The total number of multiplications for the $l^{th}$ layer is $c_{i}\times k\times h_{out}\times c_{o}$ and the total number of additions is $(c_{i}\times k-1)\times h_{out}\times c_{o}$. The total count of FLoating Point Operations (FLOPs) for the $l^{th}$ layer of a real-valued 1D-CNN is the sum of the numbers of multiplications and additions, which is roughly twice the number of multiplications, i.e., $2\times c_{i}\times k\times h_{out}\times c_{o}$. For the binary counterpart, as the weights and activations are constrained to $-1$ or $+1$, the 64-bit floating-point multiply-accumulate operations are replaced by 1-bit xnor-count operations [6]. Note that modern CPUs can perform a single multiplication and addition in one clock cycle, so the total number of operations in a binary network is $c_{i}\times k\times h_{out}\times c_{o}$. In recent CPUs, 64 such binary operations can be performed in a single clock cycle, giving a speedup of nearly 64 times in a binary or ternary network [15]. Because the filters take only $+1$ or $-1$, only a limited number of distinct filters are possible, so with a BNN the filter repetition can be exploited by dedicated hardware/software. The implementation on GPU can be made faster by using SIMD Within A Register (SWAR), where 64 binary variables are concatenated in a 64-bit register and a 64-times speedup on bitwise operations like xnor can be achieved.
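The following back-of-the-envelope sketch (our own illustration, not the paper's code) evaluates the operation counts above and shows a SWAR-style xnor-popcount standing in for a real-valued dot product; the layer sizes in the example are taken from Table I, and $h_{out}=100$ is an assumption.

    def conv1d_ops(c_i, c_o, k, h_out):
        """Operation counts for one real-valued vs. binary 1D convolution layer."""
        mults = c_i * k * h_out * c_o
        adds = (c_i * k - 1) * h_out * c_o           # accumulating c_i*k products
        return {"real_FLOPs": mults + adds,          # ~ 2 * c_i * k * h_out * c_o
                "binary_ops": c_i * k * h_out * c_o} # one fused xnor-count per product

    def xnor_popcount_dot(x_bits, w_bits, n):
        """Dot product of n {-1,+1} values packed as n-bit integers (bit 1 -> +1, bit 0 -> -1)."""
        matches = bin(~(x_bits ^ w_bits) & ((1 << n) - 1)).count("1")  # xnor, then popcount
        return 2 * matches - n                                         # map back to +/-1 arithmetic

    # Example: a decoder layer with c_i = c_o = 100, k = 5 (Table I) and an assumed h_out = 100
    print(conv1d_ops(100, 100, 5, 100))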

III TurboAE and its binarized versions

The method of channel coding in TurboAE can be divided into three sub-problems: an encoder $f_{\theta}(.)$ at the transmitter, a channel $c(.)$, and a decoder $g_{\phi}(.)$ at the receiver. In a communication system, the encoder $\mathbf{x}=f_{\theta}(\mathbf{u})$ encodes the binary bits $\mathbf{u}=(u_{1},\dots,u_{K})\in\{+1,-1\}^{K}$ of block length $K$ to get the codeword $\mathbf{x}=(x_{1},\dots,x_{N})$ of length $N$ such that the codeword satisfies the power constraints. The code rate is $R=\frac{K}{N}$, where $N>K$. The i.i.d. AWGN channel corrupts the encoded bits and generates $z_{i}=x_{i}+w_{i}$ with $w_{i}\sim\mathcal{N}(0,\sigma^{2})$ for $i=1,\dots,N$. The noise level of the AWGN channel is expressed by the signal-to-noise ratio $\text{SNR}=-10\log_{10}\sigma^{2}$. After transmission through the channel, the decoder $g_{\phi}(\mathbf{z})$ receives the real-valued noisy encoded bits $\mathbf{z}$ and maps them to an estimate of the actual message sequence $\hat{\mathbf{u}}=(\hat{u}_{1},\dots,\hat{u}_{K})\in\{+1,-1\}^{K}$ using a decoding algorithm. Channel coding aims to minimize the Bit Error Rate (BER) or the BLock Error Rate (BLER) of the recovered message $\hat{\mathbf{u}}$, given by $BER=\frac{1}{K}\sum_{i=1}^{K}Pr(\hat{u}_{i}\neq u_{i})$ and $BLER=Pr(\hat{\mathbf{u}}\neq\mathbf{u})$. Naively applying deep learning models by replacing the encoder and decoder with general-purpose neural networks does not perform well. So in [7], the authors proposed TurboAE with interleaved encoding and iterative decoding using 1D convolutional neural networks. To make the Neural Decoder usable at the edge, we first propose to binarize and ternarize the iterative decoder of TurboAE and inspect its performance. We briefly describe the TurboAE architecture before discussing the proposed compression techniques.
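As a small illustration of this channel model and of the error metrics above, the following NumPy sketch (our own, assuming ±1 codeword symbols) generates $z_i = x_i + w_i$ at a given SNR and computes BER and BLER:

    import numpy as np

    def awgn(x, snr_db, rng=np.random.default_rng(0)):
        """z_i = x_i + w_i with w_i ~ N(0, sigma^2) and SNR = -10 log10(sigma^2)."""
        sigma = 10.0 ** (-snr_db / 20.0)
        return x + sigma * rng.standard_normal(x.shape)

    def ber_bler(u_hat, u):
        """u, u_hat: arrays of shape (num_blocks, K) with entries in {-1, +1}."""
        errors = (u_hat != u)
        return errors.mean(), errors.any(axis=1).mean()   # BER, BLER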

Turbo code is one of the first capacity-approaching codes; it is based on a recursive systematic convolutional (RSC) code and has an optimal decoding algorithm, namely Bahl-Cocke-Jelinek-Raviv (BCJR) [1]. To add long-range memory to the code, interleaving is used: of two copies of the input bits, the first passes through the RSC code and the second goes through the interleaver before passing through the same RSC code, as shown in Fig. 1 (left). After transmission through the channel, this code is decoded by alternating (i) soft decoding based on the signal received from the first copy, and (ii) using the de-interleaved version as a prior for decoding the second copy, as shown in Fig. 1 (right). This iterative decoding method keeps re-estimating the posterior distribution of the transmitted bits. Both the interleaved encoder and the iterative decoder are learnable, as proposed in TurboAE [7]. The interleaver $\mathbf{x}^{\pi}=\pi(\mathbf{x})$ and the de-interleaver $\mathbf{x}=\pi^{-1}(\mathbf{x}^{\pi})$ shuffle and un-shuffle the input sequence with a random interleaving array known to both the encoder and the decoder. A code rate of $1/3$ is considered for the interleaved encoder $f_{\theta}$, which has three learnable blocks $f_{1,\theta}$, $f_{2,\theta}$, and $f_{3,\theta}$. The first two take the original message bits $\mathbf{u}$ to produce $\mathbf{x}_{1}$ and $\mathbf{x}_{2}$, whereas the third block takes the interleaved message $\pi(\mathbf{u})$ to return $\mathbf{x}_{3}$, as shown in Fig. 1. The encoded messages are transmitted through the channel and the received noisy messages are $\mathbf{z}_{1}$, $\mathbf{z}_{2}$, and $\mathbf{z}_{3}$. Our focus is mostly on the compression of the iterative decoder part so that it can be deployed at edge devices; thus we do not discuss the encoder in detail in this work. Interested readers may refer to [7] for more details on the encoder.

III-A Binary and Ternary iterative decoder

Considering $M(=6)$ iterations of the iterative decoder, each iteration consists of two decoders. The first decoder $g_{\phi_{i,1}}(.)$ in the $i^{th}$ iteration takes the original noisy messages $\mathbf{z}_{1},\mathbf{z}_{2}$ and the prior distribution $p$ on the transmitted bits and returns a posterior $q$ that goes, via interleaving, to the second decoder $g_{\phi_{i,2}}(.)$ along with the interleaved noisy message $\pi(\mathbf{z}_{1})$ and $\mathbf{z}_{3}$. In the proposed binarized and ternarized TurboAE, named BinTurboAE and TernTurboAE respectively, the real-valued decoders $\{g_{\phi_{1}},\dots,g_{\phi_{M}}\}$ are replaced with binary decoders $\{g^{b}_{\phi_{1}},\dots,g^{b}_{\phi_{M}}\}$ and ternary decoders $\{g^{t}_{\phi_{1}},\dots,g^{t}_{\phi_{M}}\}$ respectively. For ease of notation, we denote the complete binary decoder by $g^{b}_{\phi}$ and the ternary decoder by $g^{t}_{\phi}$. The main limitation of BinTurboAE and TernTurboAE is that they do not perform as well as the real-valued TurboAE. But in applications where a degradation in performance is acceptable in exchange for reduced computation and better energy efficiency, BinTurboAE or TernTurboAE can be deployed at edge devices. As the performance of BinTurboAE is not as good as its real counterpart, each of these can be thought of as a single weak learner. Instead of relying on a single weak learner, we further propose to ensemble a set of weak learners' outcomes to achieve performance as good as that of a real-valued network, however with much lower complexity and memory requirement.
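The iterative structure described above can be summarized by the following schematic sketch (our own PyTorch-style pseudocode; the block names g[i][0], g[i][1], pi, pi_inv and the way the inputs are concatenated are assumptions, not the released TurboAE code):

    import torch

    def iterative_decode(z1, z2, z3, g, pi, pi_inv, M=6):
        """g[i][0], g[i][1]: the two CNN decoder blocks of iteration i (binary or ternary);
        pi / pi_inv: interleaver and de-interleaver acting along the block dimension."""
        p = torch.zeros_like(z1)                                        # prior p0 on the bits
        for i in range(M):
            q = g[i][0](torch.cat([z1, z2, p], dim=1))                  # first decoder -> posterior q
            p = pi_inv(g[i][1](torch.cat([pi(z1), z3, pi(q)], dim=1)))  # second decoder, de-interleaved
        return torch.sigmoid(p)                                         # soft estimate of u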

Figure 2: Architecture of the decoder of (Bin/Tern)TurboAE-Bag: the final estimate $\hat{\mathbf{u}}$ is the aggregate of $B=4$ weak learners. $g_{\phi}^{v,i}=g_{\phi}^{b}$ for BinTurboAE and $g_{\phi}^{v,i}=g_{\phi}^{t}$ for TernTurboAE.

III-B Proposed Ensembled binary TurboAE

Considering each decoder $g^{b}_{\phi}$ a weak learner, $B$ such weak learners are trained separately on the complete dataset. The idea of "ensembling" is to combine the opinions of all these weak learners to arrive at a better prediction. One of the many ways the weak learners can be ensembled is bagging [19]. In this work, we propose to ensemble $B$ BinTurboAEs with the bagging method and denote the proposed method BinTurboAE-Bag; the same with TernTurboAE is called TernTurboAE-Bag. Bagging is used in machine learning to improve stability and accuracy and to reduce variance. In the bagging method, the decisions of the $B$ BinTurboAEs ($\{\hat{\mathbf{u}}^{1},\dots,\hat{\mathbf{u}}^{B}\}$) are averaged to get the final prediction $\hat{\mathbf{u}}=\frac{1}{B}\sum_{b=1}^{B}\hat{\mathbf{u}}^{b}$, as shown in Fig. 2.
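A minimal sketch of this bagging step, assuming `decoders` holds the $B$ separately trained BinTurboAE (or TernTurboAE) decoders and that each returns soft bit estimates in $[0,1]$ (both assumptions are our own):

    import torch

    @torch.no_grad()
    def bagged_decode(decoders, z1, z2, z3):
        """Average the B weak learners' soft outputs and hard-decide each bit."""
        u_soft = torch.stack([g(z1, z2, z3) for g in decoders]).mean(dim=0)
        return (u_soft > 0.5).float() * 2 - 1   # final estimate u_hat in {+1, -1}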

IV Experiments

To validate the usefulness of the proposed compression techniques, we consider the setting used in [7] to train the encoder and decoder networks. A large batch size, preferably greater than or equal to 500, is used to average out the channel noise effects. We train the encoder and decoder separately to avoid possible local optima. BinTurboAE and TernTurboAE need a smaller learning rate than the real-valued TurboAE; hence we reduce the learning rate by a factor of 10 whenever the validation loss saturates over several training epochs. The hyper-parameters used in our experiments are shown in Table I.
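A hedged sketch of this training setup (batch size 500, Adam at 1e-4, a 10x learning-rate cut when the validation loss plateaus); `decoder`, `train_one_epoch`, and `validate` are placeholders for the user's own model and training code, and the patience value is our own choice:

    import torch

    optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-4)        # `decoder` assumed defined
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.1, patience=10)                 # cut LR by 10x on saturation

    for epoch in range(800):                                            # number of epochs, Table I
        train_one_epoch(decoder, optimizer, batch_size=500)             # assumed helper
        val_loss = validate(decoder)                                     # assumed helper
        scheduler.step(val_loss)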

TABLE I: Hyper-parameters of TurboAE
Loss: Binary Cross-Entropy (BCE)
Encoder: 2-layer 1D-CNN, kernel size 5, 100 filters for each learnable encoding block
Decoder: 5-layer 1D-CNN, kernel size 5, 100 filters for each learnable decoding block
Decoder iterations: 6
Info feature size F: 5
Batch size: 500
Optimizer: Adam
Learning rate: initially 0.0001, reduced by a factor of 10 when the validation loss saturates over several epochs
Block length K: 100
Number of epochs: 800
TABLE II: Savings vs performances at the edge device
Model | Memory savings | Computation | Speedup | BER at SNR 0 dB
Full-precision DNN | 1x | ≈4e8 FLOPs | 1x | 1e-2
QuantTurboAE (q=4) | ≈(64/q)x = 16x | ≈4e8 FLOPs | 1x | 6e-2
BinTurboAE | ≈64x | ≈4e8 xnor-count | 64x | 1e-1
TernTurboAE | ≈64x | ≈4e8 xnor-count | 64x | 6e-2
(Bin/Tern)TurboAE-Bag (B=4) | ≈(64/B)x = 16x | ≈16e8 xnor-count | 64x | 2e-3
Figure 3: Performance of binary and ternary networks compared to the quantized and real-valued TurboAE
Figure 4: Performance of ensembled, binary, and ternary TurboAE

IV-A Results

We provide results showing the performance, in terms of BER vs. SNR, of the proposed BinTurboAE and TernTurboAE and compare them with QuantTurboAE, the TurboAE quantized to $q$ bits after training. For QuantTurboAE, the parameters of the trained TurboAE are quantized to different levels, i.e., 8-bit, 4-bit, 2-bit, and 1-bit, giving memory savings of 8, 16, 32, and 64 times respectively compared to the real-valued TurboAE network, as shown in Table II. QuantTurboAE does not offer any saving in computation, unlike our proposed methods. The 8-bit quantization after training performs as well as the original TurboAE, but the 2-bit and 1-bit quantizations perform very poorly, as shown in Fig. 3. If, instead of quantizing after training, the network is trained with 1-bit quantization as in BinTurboAE, it outperforms the 2-bit and 1-bit QuantTurboAEs. The ternary network improves the BER performance further by about 0.5 dB and performs similarly to QuantTurboAE (q=4), which uses 4 bits to store each parameter whereas TernTurboAE uses only 1 bit. Therefore, compared to the real-valued TurboAE, both the binary and the ternary variants reduce the memory requirement by about 64 times and reduce computation by converting all the floating-point operations to xnor and popcount operations at the decoder side. A performance gap between the proposed methods and TurboAE still exists, however. To close this gap, $B=4$ such BinTurboAEs are ensembled as weak learners; the resulting performance is shown in Fig. 4.

The ensemble of just $B=4$ BinTurboAEs implemented with the bagging method performs much better than a single BinTurboAE. BinTurboAE-Bag even outperforms the real-valued network in the low-SNR region by almost 1 dB. The performance of TernTurboAE-Bag is slightly better than that of BinTurboAE-Bag, as shown in the figure. In the high-SNR region, BinTurboAE-Bag performs close to the real-valued TurboAE. This result is significant, as BinTurboAE-Bag saves a lot in terms of memory requirement (about $64/B$ times) and computation (FLOPs are replaced with xnor-count operations) at the edge-device end without compromising the BER performance.

IV-B Computation and memory savings at the edge devices

Decoding usually happens at the edge device. In the real-valued TurboAE, the iterative decoder has a huge number of parameters that occupy a lot of memory; it also involves floating-point operations, making computation slow at edge devices. Our main goal is therefore to reduce the memory requirement and computation at the decoder side of TurboAE so that the proposed decoders are suitable for deployment at the edge. The savings for each of the proposed techniques are shown in Table II. BinTurboAE and TernTurboAE take up 64 times less memory than the real-valued TurboAE, and BinTurboAE-Bag takes $B$ times the memory of BinTurboAE.

The number of FLOPs in the decoder of the real-valued TurboAE at the edge device is about 4e8. Even though the memory saving of a $q$-bit quantized network is around $(64/q)$ times that of the real network, QuantTurboAE and TurboAE do not speed up the computation, as it is still performed in 64-bit floating point. Since the binary, ternary, and ensembled TurboAEs convert all 4e8 floating-point operations to bitwise operations, the computation is extremely fast with much lower power consumption. When 64 bitwise operations are performed in a single clock cycle, the binary and ternary networks are 64 times faster, leading to very low latency compared with the real-valued TurboAE. Even though the computation in BinTurboAE-Bag is $B$ times that of BinTurboAE, if parallel processing is available at the edge, BinTurboAE-Bag can be as fast as BinTurboAE.

V Conclusion

In summary, we propose BinTurboAE and TernTurboAE with the intention of deploying end-to-end channel coding on the targeted low-power edge devices, reducing the memory requirement and computation by nearly 64 times at the cost of an acceptable degradation in performance. We then propose BinTurboAE-Bag and TernTurboAE-Bag to improve on the performance of a single BinTurboAE or TernTurboAE respectively and achieve performance close to that of the real-valued network. The ensembled technique implemented with four such weak learners is shown to consume 16 times less memory and computing power than the real-valued TurboAE with nearly similar performance.

References

  • [1] Lalit Bahl, John Cocke, Frederick Jelinek, and Josef Raviv. Optimal decoding of linear codes for minimizing symbol error rate (corresp.). IEEE Transactions on information theory, 20(2):284–287, 1974.
  • [2] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  • [3] Claude Berrou, Alain Glavieux, and Punya Thitimajshima. Near Shannon limit error-correcting coding and decoding: Turbo-codes. 1. In Proceedings of ICC’93 - IEEE International Conference on Communications, volume 2, pages 1064–1070. IEEE, 1993.
  • [4] Sepehr Dehdashtian, Matin Hashemi, and Saber Salehkaleybar. Deep-learning-based blind recognition of channel code parameters over candidate sets under AWGN and multi-path fading conditions. IEEE Wireless Communications Letters, 10(5):1041–1045, 2021.
  • [5] Nghia Doan, Seyyed Ali Hashemi, and Warren J Gross. Neural successive cancellation decoding of polar codes. In 2018 IEEE 19th international workshop on signal processing advances in wireless communications (SPAWC), pages 1–5. IEEE, 2018.
  • [6] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems, volume 29, 2016.
  • [7] Yihan Jiang, Hyeji Kim, Himanshu Asnani, Sreeram Kannan, Sewoong Oh, and Pramod Viswanath. Turbo autoencoder: Deep learning based channel codes for point-to-point communication channels. In Advances in Neural Information Processing Systems, volume 32, 2019.
  • [8] Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.
  • [9] David JC MacKay and Radford M Neal. Near Shannon limit performance of low density parity check codes. Electronics Letters, 32(18):1645–1646, 1996.
  • [10] Rajarshi Mahapatra, Yogesh Nijsure, Georges Kaddoum, Naveed Ul Hassan, and Chau Yuen. Energy efficiency tradeoff mechanism towards wireless green communication: A survey. IEEE Communications Surveys & Tutorials, 18(1):686–705, 2015.
  • [11] Nancy Nayak, Thulasi Tholeti, Muralikrishnan Srinivasan, and Sheetal Kalyani. Green detnet: Computation and memory efficient detnet using smart compression and training. arXiv preprint arXiv:2003.09446, 2020.
  • [12] Timothy J O’Shea, Tugba Erpek, and T Charles Clancy. Deep learning based MIMO communications. arXiv preprint arXiv:1707.07980, 2017.
  • [13] Vishnu Raj and Sheetal Kalyani. Backpropagating through the air: Deep learning at physical layer without channel models. IEEE Communications Letters, 22(11):2278–2281, 2018.
  • [14] Vishnu Raj and Sheetal Kalyani. Design of communication systems using deep learning: A variational inference perspective. IEEE Transactions on Cognitive Communications and Networking, 6(4):1320–1334, 2020.
  • [15] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European conference on computer vision, pages 525–542. Springer, 2016.
  • [16] Kirty Vedula, Randy Paffenroth, and D Richard Brown. Joint coding and modulation in the ultra-short blocklength regime for bernoulli-gaussian impulsive noise channels using autoencoders. In IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP 2020, pages 5065–5069. IEEE, 2020.
  • [17] Ming Zhan, Zhibo Pang, Ming Xiao, and Hong Wen. A state metrics compressed decoding technique for energy-efficient turbo decoder. EURASIP Journal on Wireless Communications and Networking, 2018(1):1–7, 2018.
  • [18] Banghua Zhu, Jintao Wang, Longzhuang He, and Jian Song. Joint transceiver optimization for wireless communication PHY using neural network. IEEE Journal on Selected Areas in Communications, 37(6):1364–1373, 2019.
  • [19] Shilin Zhu, Xin Dong, and Hao Su. Binary ensemble neural network: More bits per network or more networks per bit? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4923–4932, 2019.