
Online Learning of Trellis Diagram Using Neural Network for Robust Detection and Decoding

Jie Yang, Student Member, IEEE,  Qinghe Du, Student Member, IEEE,  Yi Jiang, Member, IEEE,
Abstract

This paper studies machine learning-assisted maximum likelihood (ML) and maximum a posteriori (MAP) receivers for a communication system with memory, which can be modeled by a trellis diagram. The prerequisite of the ML/MAP receiver is to obtain the likelihood of the received samples under different state transitions of the trellis diagram, which relies on the channel state information (CSI) and the distribution of the channel noise. We propose to learn the trellis diagram in real time using an artificial neural network (ANN) trained by a pilot sequence. This approach, termed the online learning of trellis diagram (OLTD), requires neither the CSI nor the statistics of the noise, and can be incorporated into the classic Viterbi and BCJR algorithms. It is shown to significantly outperform the model-based methods in non-Gaussian channels. It requires much less training overhead than the state-of-the-art methods, and hence is more feasible for real implementations. As an illustrative example, the OLTD-based BCJR is applied to a Bluetooth low energy (BLE) receiver trained by only a 256-sample pilot sequence. Moreover, the OLTD-based BCJR can accommodate turbo equalization, while the state-of-the-art BCJRNet/ViterbiNet cannot. As an interesting by-product, we propose an enhancement to the BLE standard by introducing a bit interleaver to its physical layer; the resultant improvement of the receiver sensitivity can make it a better fit for some Internet of Things (IoT) communications.

Index Terms:
Neural Network; BCJR algorithm; Viterbi algorithm; turbo equalization; Bluetooth
The work was supported by National Natural Science Foundation of China Grant No. 61771005. Partial material of this paper was published in IEEE/CIC International Conference on Communications (ICCC) 2021, Xiamen, China. (Corresponding author: Yi Jiang) J. Yang, Q. Du, and Y. Jiang are with Key Laboratory for Information Science of Electromagnetic Waves (MoE), School of Information Science and Technology, Fudan University, Shanghai, China (E-mails: [email protected], [email protected], [email protected]).

I Introduction

Reliable detection and decoding is essential for any communication system. For a single-carrier system in an inter-symbol interference (ISI) channel, which can be modeled by a finite-state trellis diagram, the classic design of a maximum likelihood (ML) receiver relies on the likelihood function of the state transitions in the trellis diagram [1]. In this paper, we propose to learn the likelihood function using an artificial neural network (ANN) based on a pilot sequence of moderate length, requiring neither the channel state information (CSI) nor the statistics of the noise.

As related works, machine learning-assisted wireless communications have attracted broad attention in recent years [2, 3, 4, 5, 6], including machine learning-assisted channel decoding [7, 8, 9, 10, 11] and symbol detection in multi-input multi-output (MIMO) systems [12, 13, 14, 15]. Deep learning algorithms are found to be more effective in addressing the difficult problem of symbol detection with incomplete CSI [16, 17, 18, 19]. The aforementioned works, however, attempt to substitute a whole communication system with an ANN, which requires a large amount of training data (lengthy pilot sequences of tens of thousands of samples or more), far too much to be practically feasible. The notion of the model-driven method, which combines machine learning techniques with model-based expert knowledge, was introduced in [20] to better incorporate machine learning techniques into a communication system. The efficacy of this type of model-driven method is demonstrated in [21, 22].

The recent works by Shlezinger et al. [23, 24] advocate using a (relatively simple) neural network to substitute only the channel-dependent part of the Viterbi [25] and BCJR [26] receivers. The resultant algorithms, the so-termed ViterbiNet and BCJRNet, train a neural network to learn the a posteriori probability (APP) of the state transitions given the received samples, and use the finite mixture model (FMM) [27] to estimate the marginal probability density of the channel output, assuming that the channel noise is Gaussian. Therefore, the ViterbiNet and BCJRNet require only several thousand training samples, significantly fewer than in [16, 17], but still arguably too many to be practically competitive with the conventional model-based methods.

In this paper, we first consider the same problem as addressed in [23, 24]. We adopt the notion of integrating a simple neural network into a communication system as advocated by Shlezinger et al., and propose a new method, termed the online learning of trellis diagram (OLTD), that can also be integrated into the Viterbi algorithm [25] and the BCJR algorithm [26]. The resultant OLTD-based Viterbi and OLTD-based BCJR differ from the ViterbiNet and BCJRNet in that the ANN is used to learn the likelihoods of the received samples under different state transitions, rather than the APPs. Therefore, we need neither to assume the channel noise to be Gaussian nor to estimate the marginal distribution of the channel output; thus, our proposed method is simpler, more robust, and requires a substantially shorter pilot sequence.

To show the practical feasibility of the proposed OLTD method, we apply it to the physical layer (PHY) of the Bluetooth Low Energy (BLE) protocol [28]. A BLE system adopts the coded Gaussian frequency shift keying (GFSK) modulation, which belongs to the family of continuous phase modulation (CPM) [29]. By modeling the GFSK modulation process with a trellis diagram, we employ the OLTD method to learn the likelihoods of the state transitions associated with each received sample based on the 256-sample pilot sequence specified in the BLE protocol [30, Ch. 6, Part B], and then use the conventional Viterbi or BCJR algorithm to recover the information bits. This study shows that our proposed neural network-assisted receiver can work for a real wireless protocol.

We further introduce a bit interleaver to the coded GFSK system, for which we seamlessly combine the OLTD method with turbo equalization [31, 32, 33] to achieve significantly better receiver sensitivity than conventional Bluetooth. Unlike the previous neural network-based methods that unfold each iteration with one layer of the neural network [9, 11], the OLTD-based method obtains the likelihood of each connected branch in the trellis diagram only once for the whole iterative process. In contrast, the BCJRNet [24] assumes that all the coded bits are equiprobable and therefore is not suitable for turbo equalization, as explained in Section IV-B.

The contributions of this paper are summarized as follows:

  • We show that what a neural network needs to learn about the trellis diagram is the normalized likelihoods of the received samples, rather than the a posteriori probabilities of the state transitions as adopted in the BCJRNet/ViterbiNet [23, 24]. Owing to this insight, our proposed OLTD method is computationally simpler, more robust, and more versatile than the state-of-the-art methods. We also show that an ANN with only one hidden layer is sufficient for the OLTD.

  • In contrast to the BCJRNet and ViterbiNet [23, 24], our proposed OLTD does not need to compute the marginal distribution of the received samples, nor does it assume the statistics of the channel noise or the a priori probabilities of the transmitted symbols; thus, the OLTD is computationally much simpler and more robust.

  • The OLTD-based BCJR method can be seamlessly incorporated into a turbo equalizer for significantly improved performance, whereas the BCJRNet method cannot, as simulated and analyzed in Section IV-B. In this sense, our proposed method is more versatile.

  • Based on a pilot sequence of a practical length, our neural network-based method can outperform the model-based approaches in channels with non-Gaussian noise or interference, as illustrated by the numerical simulations.

  • As an interesting by-product, this study indicates a possible enhancement to the BLE standard, i.e., to introduce a bit interleaver between the convolutional encoder and the GFSK modulator for much improved reliability, which can make BLE a better candidate for some Internet of Things (IoT) communications [34].

The remainder of this paper is organized as follows: Section II introduces the system model and briefly reviews the BCJR and Viterbi algorithms. Section III explains how to train the OLTD and how the OLTD-based BCJR/Viterbi algorithms work online. Section IV introduces the OLTD-based turbo equalization and its application to a bit-interleaved coded GFSK system. In Section V, simulation results are given to verify the effectiveness of the OLTD-based methods and to show the superior performance of the proposed OLTD-based BCJR/Viterbi and OLTD-based turbo algorithms. The conclusion is given in Section VI.

II System Model and Preliminaries

II-A An ISI Channel Model

The received signal in an ISI channel can be expressed as

y_{k}=\sum_{l=0}^{L-1}h_{l}x_{k-l}+z_{k},\quad k=1,2,\ldots,N, (1)

where $L$ is the channel length, $h_{l}, l=0,\ldots,L-1$, are the channel coefficients, and $z_{k}$ denotes the i.i.d. additive noise, which is not necessarily Gaussian.

The ISI channel can be modeled as a tapped delay line, as shown in Fig. 1. Owing to the shift register structure, the channel can be modeled by a trellis diagram [1].

Figure 1: The tapped delay line model for an ISI channel with length $L=3$.

As the transmitted symbols are drawn from a set $\mathcal{X}$, the ISI channel modeled in (1) can be represented by a trellis diagram with a state set $\mathcal{S}$ of cardinality $|\mathcal{S}|=|\mathcal{X}|^{L-1}$. Denote the state at time $k$ as $s_{k}$, which corresponds to the combination of the $L-1$ symbols $x_{k-L+1},\ldots,x_{k-1}$. The state transition $s_{k}\rightarrow s_{k+1}$ is associated with the output signal $v_{k}=\sum_{l=0}^{L-1}h_{l}x_{k-l}$. As an illustrative example, consider an ISI channel with coefficients $h_{0}=0.407$, $h_{1}=0.815$, and $h_{2}=0.407$, and binary phase shift keying (BPSK) input. The trellis consists of $|\mathcal{S}|=4$ states, $r_{0}=(-1,-1)$, $r_{1}=(+1,-1)$, $r_{2}=(-1,+1)$, and $r_{3}=(+1,+1)$, as shown in Fig. 2. The numbers on the branches represent $u_{k}$ and $v_{k}$. Take the state $r_{0}$ for example: if the input symbol $u_{k}=-1$, the branch output is $v_{k}=-h_{0}-h_{1}-h_{2}=-1.629$; if $u_{k}=+1$, $v_{k}=h_{0}-h_{1}-h_{2}=-0.815$.
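The state set and branch outputs of this example trellis can be enumerated programmatically. Below is a minimal sketch (our own illustration, not the paper's code); the state ordering $(x_{k-1}, x_{k-2})$ and the function names are chosen for convenience:

```python
import itertools

# Example ISI channel from the text: h = [0.407, 0.815, 0.407], BPSK input.
h = [0.407, 0.815, 0.407]
L = len(h)

# A state is the tuple of the L-1 most recent past symbols (x_{k-1}, x_{k-2}).
states = list(itertools.product([-1, +1], repeat=L - 1))

def branch_output(state, u):
    """Noiseless output v_k = sum_l h_l x_{k-l} for input u_k given the state."""
    x = (u,) + state  # (x_k, x_{k-1}, x_{k-2})
    return sum(hl * xl for hl, xl in zip(h, x))

def next_state(state, u):
    """State transition s_k -> s_{k+1}: shift the new symbol into the register."""
    return (u,) + state[:-1]
```

For the state $r_{0}=(-1,-1)$, `branch_output` returns $-1.629$ for $u_{k}=-1$ and $-0.815$ for $u_{k}=+1$, matching the branch labels of Fig. 2.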

Figure 2: The trellis diagram of the ISI channel with length $L=3$ and BPSK input; the solid lines represent $u_{k}=1$, the dashed lines represent $u_{k}=-1$.

II-B The GFSK Modulation

The trellis diagram can also be used to model a modulation with memory. As an example, we review the GFSK, which is used in the PHY of Bluetooth [28]. The GFSK signal with bandwidth $1/T$, sampled at $t=nT$, is

x_{n}=e^{j\phi_{n}},\quad n=1,2,\ldots,N, (2)

where

\phi_{n}=\pi\mathsf{h}\sum_{m=1}^{n-1}I_{m}+2\pi\mathsf{h}I_{n}q(T). (3)

Here $I_{n}\in\{-1,+1\}$ denotes the information bits, $\mathsf{h}$ is the modulation index, and the pulse shaping function is

q(t)=\frac{1}{2T}\int_{0}^{t}\left[Q\left(2\pi B\frac{\tau-\frac{T}{2}}{\sqrt{\ln 2}}\right)-Q\left(2\pi B\frac{\tau+\frac{T}{2}}{\sqrt{\ln 2}}\right)\right]d\tau (4)

with $Q(x)=\frac{1}{\sqrt{2\pi}}\int_{x}^{\infty}e^{-\frac{\tau^{2}}{2}}d\tau$.

If $\mathsf{h}=0.5$ (as specified in [28]), (3) becomes

\phi_{n}=\theta_{n}+\pi I_{n}q(T), (5)

where $\theta_{n}=\frac{\pi}{2}\sum_{m=1}^{n-1}I_{m}$. Here we set the bandwidth-time product $BT=0.5$ (also as specified in [28]). As shown in Fig. 3, $q(t)=0$ for $t\leq 0$, and $q(t)\approx 0.5$ for $t\geq 2T$.

Figure 3: The pulse-shaping function of the GFSK signal with $BT=0.5$.
Figure 4: The phase transitions of $\theta_{n}$. The dashed lines represent the input $I_{n}=-1$ ($b_{n}=0$); the solid lines represent $I_{n}=1$ ($b_{n}=1$).

We can model the continuous phase modulation as a finite-state transition process based on the phase transitions (of $\theta_{n}$, not $\phi_{n}$) shown in Fig. 4. Denote $(r_{i},r_{j})$ as the state transition from $r_{i}\in\mathcal{S}$ at time $nT$ to $r_{j}\in\mathcal{S}$ at time $(n+1)T$ driven by $I_{n}$, where $\mathcal{S}$ stands for the set of the trellis states. The dashed branches correspond to $I_{n}=-1$, the solid branches to $I_{n}=1$, and the associated signal is

x_{n}=e^{j(\theta_{n}+\pi I_{n}q(T))}=e^{j(\theta_{n}+\frac{\pi}{4}I_{n})}, (6)

where we have used the fact that $q(T)=\frac{1}{4}$.
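The values $q(T)\approx 1/4$ and $q(t)\approx 0.5$ for $t\geq 2T$ can be checked numerically. The sketch below is our own illustration, assuming $T=1$, $BT=0.5$, and a causal pulse obtained by shifting the Gaussian pulse center to $\tau=T$ (an assumption consistent with Fig. 3; the names `Qfunc` and `q` are ours):

```python
import math

def Qfunc(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def q(t, B=0.5, T=1.0, steps=4000):
    """Trapezoidal approximation of q(t) = (1/2T) * int_0^t [Q(.) - Q(.)] dtau."""
    if t <= 0.0:
        return 0.0
    a = 2.0 * math.pi * B / math.sqrt(math.log(2.0))
    dt = t / steps
    acc = 0.0
    for i in range(steps + 1):
        tau = i * dt - T  # shift so the (assumed causal) pulse is centered at tau = T
        g = Qfunc(a * (tau - T / 2.0)) - Qfunc(a * (tau + T / 2.0))
        acc += (0.5 if i in (0, steps) else 1.0) * g * dt
    return acc / (2.0 * T)
```

Under these assumptions, `q(1.0)` evaluates to roughly 0.25 and `q(3.0)` to roughly 0.5, consistent with Fig. 3 and with (6).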

If the input to the ISI channel (1) is a GFSK signal, we can combine the GFSK modulation and the multi-path channel into a (larger) trellis diagram.

II-C Primer of The Viterbi and BCJR Algorithms

Based on the trellis diagram, the receiver can use the Viterbi algorithm or the BCJR algorithm for optimal symbol detection. We briefly review them to make this paper self-contained.

The Viterbi algorithm performs maximum likelihood (ML) detection. By exploiting the Markovian structure of the finite-memory channel, it factorizes the conditional probability density function (PDF) of the channel output given the input as

p(\mathbf{y}_{1:N}|\mathbf{x})=\prod_{k=1}^{N}p(y_{k}|\mathbf{x}_{k-L+1}^{k}), (7)

where $x_{k}=0$ for $k<0$. For a given channel output $\mathbf{y}$, we have

\mathop{\arg\max}_{x_{k}\in\mathcal{X}}\,p(\mathbf{y}_{1:N}|\mathbf{x})=\mathop{\arg\min}_{x_{k}\in\mathcal{X}}\,-\log p(\mathbf{y}_{1:N}|\mathbf{x}). (8)

Combining (7) and (8), we can rewrite (8) as

\mathop{\arg\min}_{x_{k}\in\mathcal{X}}\,-\log p(\mathbf{y}_{1:N}|\mathbf{x})=\mathop{\arg\min}_{x_{k}\in\mathcal{X}}\,-\sum_{k=1}^{N}\log p(y_{k}|s_{k},s_{k+1}), (9)

where $p(y_{k}|s_{k},s_{k+1})$ is the likelihood of $y_{k}$ associated with the state transition from $s_{k}$ to $s_{k+1}$. The Viterbi algorithm solves (9) efficiently by searching for the shortest path across the trellis diagram.
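For concreteness, the following is a minimal sketch (our own illustration, not the paper's implementation) of the shortest-path search of (9) on the example trellis of Fig. 2, with the squared error $|y_{k}-v_{k}|^{2}$ standing in for $-\log p(y_{k}|s_{k},s_{k+1})$ (a Gaussian-noise assumption):

```python
import itertools
import math

h = [0.407, 0.815, 0.407]
states = list(itertools.product([-1, +1], repeat=2))  # (x_{k-1}, x_{k-2})

def branch(s, u):
    """Noiseless branch output v_k and the next state for input u_k from state s."""
    v = h[0] * u + h[1] * s[0] + h[2] * s[1]
    return v, (u, s[0])

def viterbi(y, init=(-1, -1)):
    metric = {s: (0.0 if s == init else math.inf) for s in states}
    paths = {s: [] for s in states}
    for yk in y:
        new_metric = {s: math.inf for s in states}
        new_paths = {s: [] for s in states}
        for s in states:
            for u in (-1, +1):
                v, sn = branch(s, u)
                m = metric[s] + (yk - v) ** 2  # accumulate the branch metric
                if m < new_metric[sn]:
                    new_metric[sn] = m
                    new_paths[sn] = paths[s] + [u]
        metric, paths = new_metric, new_paths
    return paths[min(states, key=lambda s: metric[s])]  # survivor of the best end state
```

On a noiseless received sequence generated from the initial state $(-1,-1)$, the decoder recovers the transmitted symbols exactly.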

The BCJR algorithm [26] performs maximum a posteriori probability (MAP) detection. It computes the a posteriori log-likelihood ratio (LLR) of the bits $\{u_{k}\}$:

L(u_{k}|\mathbf{y})=\log\frac{p(u_{k}=1|\mathbf{y})}{p(u_{k}=0|\mathbf{y})}=\log\frac{p(u_{k}=1,\mathbf{y})}{p(u_{k}=0,\mathbf{y})}=\log\frac{\sum_{(s_{k},s_{k+1})\in\mathcal{S}^{+}}p(s_{k},s_{k+1},\mathbf{y})}{\sum_{(s_{k},s_{k+1})\in\mathcal{S}^{-}}p(s_{k},s_{k+1},\mathbf{y})}, (10)

where $\mathcal{S}^{+}$ and $\mathcal{S}^{-}$ are the sets of ordered pairs $(s_{k},s_{k+1})$ corresponding to the state transitions driven by $u_{k}=1$ and $u_{k}=0$, respectively. Writing $\mathbf{y}$ as $((y_{1},\cdots,y_{k-1}),y_{k},(y_{k+1},\cdots,y_{N}))$ and applying the chain rule for joint probabilities, we can decompose $p(s_{k},s_{k+1},\mathbf{y})$ into

p(s_{k},s_{k+1},\mathbf{y})=\alpha_{k}(s_{k})\gamma_{k}(s_{k},s_{k+1})\beta_{k+1}(s_{k+1}), (11)

where $\alpha_{k}(s_{k})\triangleq p(s_{k},y_{1:k-1})$, $\gamma_{k}(s_{k},s_{k+1})\triangleq p(y_{k},s_{k+1}|s_{k})$, and $\beta_{k+1}(s_{k+1})\triangleq p(y_{k+1:N}|s_{k+1})$. They can be recursively computed as

\alpha_{k}(s_{k})=\sum_{s_{k-1}}\alpha_{k-1}(s_{k-1})\gamma_{k-1}(s_{k-1},s_{k}), (12a)
\beta_{k}(s_{k})=\sum_{s_{k+1}}\beta_{k+1}(s_{k+1})\gamma_{k}(s_{k},s_{k+1}), (12b)
\gamma_{k}(s_{k},s_{k+1})=p(s_{k+1}|s_{k})\cdot p(y_{k}|s_{k},s_{k+1}), (12c)

with the initialization $\alpha_{0}(s_{0})=1$ and $\beta_{N}(s_{N})=1$.

Driven by uku_{k}, the probability of the state transition is

p(s_{k+1}|s_{k})=\begin{cases}p(u_{k}=1),&(s_{k},s_{k+1})\in\mathcal{S}^{+},\\ p(u_{k}=0),&(s_{k},s_{k+1})\in\mathcal{S}^{-},\\ 0,&\text{otherwise}.\end{cases} (13)
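The recursions (12) and the LLR (10) can be sketched on the example ISI trellis of Fig. 2, assuming Gaussian likelihoods with known CSI and equiprobable inputs $p(u_{k}=1)=p(u_{k}=-1)=1/2$ in (13). This is a toy illustration of our own, not the paper's implementation:

```python
import itertools
import math

h = [0.407, 0.815, 0.407]
states = list(itertools.product([-1, +1], repeat=2))  # (x_{k-1}, x_{k-2})

def branch(s, u):
    v = h[0] * u + h[1] * s[0] + h[2] * s[1]
    return v, (u, s[0])

def bcjr_llrs(y, sigma2=0.1, init=(-1, -1)):
    N = len(y)
    # gamma_k(s_k, s_{k+1}) = p(s_{k+1}|s_k) p(y_k|s_k, s_{k+1}): eqs. (12c), (13)
    gamma = []
    for yk in y:
        g = {}
        for s in states:
            for u in (-1, +1):
                v, sn = branch(s, u)
                g[(s, sn, u)] = 0.5 * math.exp(-(yk - v) ** 2 / (2.0 * sigma2))
        gamma.append(g)
    # forward recursion (12a), normalized per step for numerical stability
    alpha = [{s: (1.0 if s == init else 0.0) for s in states}]
    for k in range(N):
        a = {s: 0.0 for s in states}
        for (s, sn, u), g in gamma[k].items():
            a[sn] += alpha[k][s] * g
        tot = sum(a.values())
        alpha.append({s: a[s] / tot for s in states})
    # backward recursion (12b)
    beta = [None] * (N + 1)
    beta[N] = {s: 1.0 for s in states}
    for k in range(N - 1, -1, -1):
        b = {s: 0.0 for s in states}
        for (s, sn, u), g in gamma[k].items():
            b[s] += beta[k + 1][sn] * g
        tot = sum(b.values())
        beta[k] = {s: b[s] / tot for s in states}
    # LLR of u_k via (11) and (10)
    llrs = []
    for k in range(N):
        num = sum(alpha[k][s] * g * beta[k + 1][sn]
                  for (s, sn, u), g in gamma[k].items() if u == +1)
        den = sum(alpha[k][s] * g * beta[k + 1][sn]
                  for (s, sn, u), g in gamma[k].items() if u == -1)
        llrs.append(math.log(num / den))
    return llrs
```

On a noiseless received sequence, the sign of each LLR agrees with the corresponding transmitted symbol.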

II-D About the Likelihood Function

Examining (9) and (12), we see that the likelihood $p(y_{k}|s_{k},s_{k+1})$ plays a key role in both the Viterbi and BCJR algorithms. Indeed, the Viterbi algorithm depends solely on the likelihood, while the BCJR also exploits the a priori information on the state transition probability $p(s_{k+1}|s_{k})$, which is independent of the CSI [cf. (12c) and (13)]. Hence, $p(y_{k}|s_{k},s_{k+1})$ is the only CSI-dependent component of both algorithms.

The conventional method calculates $p(y_{k}|s_{k},s_{k+1})$ assuming known CSI and zero-mean Gaussian channel noise with variance $\sigma^{2}$, in which case the likelihood is

p(y_{k}|s_{k},s_{k+1})=\frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{|y_{k}-v_{k}|^{2}}{2\sigma^{2}}\right\}, (14)

where $v_{k}=\sum_{l=0}^{L-1}h_{l}x_{k-l}$ is the noiseless channel output associated with the state transition from $s_{k}$ to $s_{k+1}$ (cf. Fig. 2).

But when the CSI is unknown, or the distribution of the noise is unknown owing to non-Gaussian co-channel interference, the model-based likelihood (14) becomes erroneous, severely degrading the performance of the Viterbi and BCJR algorithms. To address this issue, we propose to use an ANN to learn the likelihood function $p(y_{k}|s_{k},s_{k+1})$ of the trellis diagram online, based on a pilot sequence, assuming neither the CSI nor the noise statistics.

III The OLTD-based Viterbi and BCJR

This section explains how the OLTD-based Viterbi and BCJR algorithms work. Both algorithms consist of three stages as explained in the following.

III-A Stage One: Train ANN to Learn The Trellis Diagram

As shown in Fig. 5, we construct a fully connected ANN with one hidden layer [35], whose input is the real and imaginary parts of the received sample $y_{k}$ and whose output nodes correspond to the state transitions of the trellis diagram. For example, we can use a network with 8 output nodes to learn the trellis diagram in Fig. 4, which has 8 state transitions. The hidden-layer nodes adopt the sigmoid activation function, while the output layer employs the softmax activation function; the reason for using the softmax is explained in Section III-D.

Refer to caption
Figure 5: The one-hidden layer ANN for learning the normalized likelihoods corresponding to the state transitions of the trellis diagram.

Given the pilot sequence $\{y_{k},k=1,\ldots,P\}$, we know the true transition $s_{k}\rightarrow s_{k+1}$ corresponding to each $y_{k}$; thus we label the normalized likelihood $p(y_{k}|s_{k},s_{k+1})$ corresponding to the true transition as 1 and all the others as 0, i.e., the so-called one-hot encoding. The ANN is optimized according to the minimum cross-entropy criterion. Note that the OLTD method requires neither the CSI nor the noise statistics, which is its main advantage over the model-based method.
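The training stage above can be sketched in a few dozen lines. The toy, numpy-only version below is our own stand-in for the paper's Keras model (the branch outputs, pilot length, noise level, and layer sizes here are our assumptions): a fully connected network with one sigmoid hidden layer and a softmax output, trained by cross-entropy on one-hot labels indicating which branch produced each pilot sample.

```python
import numpy as np

rng = np.random.default_rng(0)

V = np.array([-1.629, -0.815, 0.815, 1.629])   # four distinct branch outputs (toy)
P = 512                                         # pilot length (assumption)
labels = rng.integers(0, 4, size=P)             # true branch index per pilot sample
y = V[labels] + 0.1 * rng.standard_normal(P)    # noisy received pilot samples

X = y[:, None]                                  # network input: the received sample
T = np.eye(4)[labels]                           # one-hot targets

H = 32                                          # hidden-layer width (assumption)
W1 = 0.5 * rng.standard_normal((1, H)); b1 = np.zeros(H)
W2 = 0.5 * rng.standard_normal((H, 4)); b2 = np.zeros(4)

def forward(X):
    Z = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))    # sigmoid hidden layer
    A = Z @ W2 + b2
    A -= A.max(axis=1, keepdims=True)           # stabilize the softmax
    E = np.exp(A)
    return Z, E / E.sum(axis=1, keepdims=True)  # softmax = normalized likelihoods

lr = 1.0
for _ in range(5000):                           # full-batch gradient descent
    Z, Pout = forward(X)
    G = (Pout - T) / P                          # cross-entropy gradient w.r.t. logits
    W2 -= lr * Z.T @ G; b2 -= lr * G.sum(axis=0)
    Gh = (G @ W2.T) * Z * (1.0 - Z)             # backpropagate through the sigmoid
    W1 -= lr * X.T @ Gh; b1 -= lr * Gh.sum(axis=0)
```

After training, feeding a payload sample through `forward` yields the softmax vector that Stage Two passes on to the Viterbi/BCJR recursions.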

III-B Stage Two: Feed the Payload into the ANN

After the ANN is trained, the payload samples $\{y_{i},i=P+1,\ldots,P+Q\}$ are fed into the network; for each $y_{i}$, the ANN yields the normalized likelihoods $p(y_{i}|s_{i},s_{i+1})$ for all $2|\mathcal{S}|$ state transitions. The numerical examples in Section V indicate that the likelihoods yielded by the ANN are sufficiently accurate when the network is trained with a pilot of length $P$ no more than a few hundred samples.

III-C Stage Three: Feed the Likelihoods into the Viterbi or BCJR Algorithm

Given the likelihoods $p(y_{i}|s_{i},s_{i+1})$ for each received sample of the payload, the Viterbi (or BCJR) algorithm can be directly applied to obtain the ML (or MAP) detection.

Take the BCJR algorithm for example: given the likelihoods $p(y_{k}|s_{k},s_{k+1})$ associated with each sample $y_{k}$, the algorithm calculates $\gamma_{k}$ according to (12c), runs the forward recursion (12a) and the backward recursion (12b), and obtains the LLR $L(u_{k}|\mathbf{y})$ by (11) and (10). The combination of the OLTD method and the BCJR algorithm is termed the OLTD-based BCJR. The OLTD-based Viterbi is even simpler: it merely searches for the most likely trellis path according to (9).

III-D Discussions

Observe from (9) that multiplying the $p(y_{k}|s_{k},s_{k+1})$'s by a common positive factor does not affect the output of the Viterbi algorithm; neither does it affect the BCJR, as can be seen from (10) and (12). Hence, instead of learning the original likelihoods, which can take any value in $[0,\infty)$, the ANN can simply learn the normalized likelihoods

\frac{p(y_{k}|s_{k},s_{k+1})}{\sum_{(s_{k},s_{k+1})\in\mathcal{S}^{-}\cup\mathcal{S}^{+}}p(y_{k}|s_{k},s_{k+1})}\in[0,1],

which is why we can adopt the softmax as the activation function of the output layer even though the actual likelihood value may lie outside the range $[0,1]$. This insight is the underlying reason why our method is significantly simpler than the state-of-the-art method [24], which has to compute the marginal distribution $p(y_{k})$ besides using a deep neural network to learn the conditional probability $p(s_{k},s_{k+1}|y_{k})$ [24, Fig. 3].
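The scale-invariance claim admits a quick numeric check (the numbers below are made-up toys of our own): multiplying all likelihoods of a sample by a common positive factor shifts every Viterbi branch metric by the same constant and cancels exactly in the BCJR LLR ratio.

```python
import math

raw = [0.8, 0.1, 0.05, 0.05]        # p(y_k | s_k, s_{k+1}) for four branches (toy)
c = 37.2                            # an arbitrary common positive factor
scaled = [c * p for p in raw]

# The branch metrics -log p(.) differ only by the constant log(c) ...
diffs = [-math.log(s) + math.log(r) for r, s in zip(raw, scaled)]

# ... and a likelihood ratio, as inside the BCJR LLR, is exactly unchanged.
llr_raw = math.log((raw[0] + raw[1]) / (raw[2] + raw[3]))
llr_scaled = math.log((scaled[0] + scaled[1]) / (scaled[2] + scaled[3]))
```

In particular, the softmax-normalized outputs of the ANN carry all the information the Viterbi or BCJR recursion needs.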

We can gain more insight into the OLTD method by drawing an analogy to the classic least squares (LS) fitting method. An LS method optimizes the parameters of a signal model to fit the true signal; the OLTD method trains the ANN to approximate the ground truth of the normalized likelihoods, which is a one-hot vector in the absence of channel noise. The LS method uses the LS criterion; the OLTD adopts the minimum cross-entropy criterion. The LS method does not assume knowledge of the noise statistics, and neither does the OLTD, which is why the OLTD is robust against non-Gaussian noise, as shown in Section IV.

Last but not least, our proposed OLTD-based BCJR algorithm can be readily applied to turbo equalization [36], which is an important advantage over the BCJRNet [23], as we explain next.

IV The OLTD-Based Turbo Equalization for A Coded GFSK System

In this section, we apply the OLTD method to turbo equalization, for which a bit interleaver is inserted between the forward error correction (FEC) encoder and the modulator, as shown in Fig. 6. This section serves two purposes: i) it shows that the OLTD method can be incorporated seamlessly into a turbo equalizer for superb performance; ii) as an interesting byproduct of this study, we advocate a potential enhancement to the current BLE standard for significantly better receiver sensitivity.

Consider that the FEC encoder in Fig. 6 implements the convolutional code whose generator polynomials are

G_{0}(x)=1+x+x^{2}+x^{3},\quad G_{1}(x)=1+x^{2}+x^{3}, (15)

which is the convolutional code adopted in the BLE standard [30], and that the modulation is GFSK (as explained in Section II-B). The information bits are denoted as $\{u_{k}\}$, the coded bits as $\{a_{n}\}$, and the interleaved bits as $\{b_{n}\}$, which are modulated into the symbols $\{x_{n}\}$.

Figure 6: A bit-interleaved coded GFSK transmitter, which may be a potential enhancement to the PHY of BLE 5.0.

IV-A OLTD-Based Turbo Equalization

The whole procedure of turbo equalization is shown in Fig. 7, where BCJR equalization and BCJR decoding are conducted iteratively. (Readers unfamiliar with turbo equalization may refer to [36] for an excellent tutorial.)

On the equalization side, the BCJR algorithm takes the a priori probabilities of the $b_{n}$'s to calculate the state transition probabilities as

p(s_{n+1}|s_{n})=\begin{cases}p(b_{n}=0),&\text{if } b_{n}=0 \text{ drives } s_{n}\rightarrow s_{n+1},\\ p(b_{n}=1),&\text{if } b_{n}=1 \text{ drives } s_{n}\rightarrow s_{n+1},\end{cases} (16)

takes the normalized likelihoods from the OLTD algorithm, and then evaluates (12) and (11) to calculate $L(b_{n}|\mathbf{y})$ in (10).

Note that the a posteriori LLR of $\{b_{n}\}$ can also be computed as

L(b_{n}|\mathbf{y})=\log\frac{\sum_{\forall\mathbf{b}:b_{n}=1}p(\mathbf{b},\mathbf{y})}{\sum_{\forall\mathbf{b}:b_{n}=0}p(\mathbf{b},\mathbf{y})}=\log\frac{\sum_{\forall\mathbf{b}:b_{n}=1}p(\mathbf{y}|\mathbf{b})p(\mathbf{b})}{\sum_{\forall\mathbf{b}:b_{n}=0}p(\mathbf{y}|\mathbf{b})p(\mathbf{b})}. (17)

Since the bit interleaver decorrelates the neighboring coded bits, it holds that $p(\mathbf{b})=\prod_{n=1}^{N}p(b_{n})$; thus, $L(b_{n}|\mathbf{y})$ can be decomposed into

L(b_{n}|\mathbf{y})=L_{ext}(b_{n}|\mathbf{y})+L(b_{n}), (18)

where

L_{ext}(b_{n}|\mathbf{y})=\log\frac{\sum_{\forall\mathbf{b}:b_{n}=1}p(\mathbf{y}|\mathbf{b})\prod_{i=1,i\neq n}^{N}p(b_{i})}{\sum_{\forall\mathbf{b}:b_{n}=0}p(\mathbf{y}|\mathbf{b})\prod_{i=1,i\neq n}^{N}p(b_{i})} (19)

is the extrinsic information about $b_{n}$ contained in $\mathbf{y}$, and

L(b_{n})=\log\frac{p(b_{n}=1)}{p(b_{n}=0)} (20)

is called the intrinsic information. Hence, it follows from (18) that

L_{ext}(b_{n}|\mathbf{y})=L(b_{n}|\mathbf{y})-L(b_{n}),\quad n=1,2,\ldots,N. (21)

On the decoding side, after deinterleaving $L_{ext}(b_{n}|\mathbf{y})$ into $L_{ext}(a_{n}|\mathbf{y})$, the BCJR decoder takes the extrinsic information as the "received signal", i.e., $\tilde{\mathbf{y}}\triangleq L_{ext}(a_{n}|\mathbf{y})$, to update the a posteriori LLRs of the coded bits $L(a_{n}|\tilde{\mathbf{y}})$ based on the trellis diagram associated with the FEC (15). As the other input to the BCJR decoder, the uncoded bits $\{u_{k},k=1,\ldots,K\}$ are assumed to satisfy $P(u_{k}=1)=P(u_{k}=0)=1/2$.

Note from (18) that we can update the intrinsic LLRs

L(a_{n})=L(a_{n}|\tilde{\mathbf{y}})-L_{ext}(a_{n}|\mathbf{y}), (22)

which can be fed back into the BCJR equalizer after being interleaved into $L(b_{n})$. Using the relationship [cf. (20)]

p(b_{n})=\begin{cases}\frac{1}{1+e^{L(b_{n})}},&\text{if } b_{n}=0,\\ \frac{e^{L(b_{n})}}{1+e^{L(b_{n})}},&\text{if } b_{n}=1,\end{cases} (23)

and (16), we can obtain the state-transition probabilities $p(s_{n+1}|s_{n})$, which are needed for the next round of BCJR equalization.
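The LLR bookkeeping of (20) and (23) amounts to a pair of inverse conversions, sketched below (the function names are ours; the value 1.2 is an arbitrary toy LLR):

```python
import math

def llr_to_probs(L):
    p1 = math.exp(L) / (1.0 + math.exp(L))   # p(b_n = 1), eq. (23)
    return 1.0 - p1, p1                      # (p(b_n = 0), p(b_n = 1))

def probs_to_llr(p0, p1):
    return math.log(p1 / p0)                 # the intrinsic LLR, eq. (20)

p0, p1 = llr_to_probs(1.2)                   # e.g., a mildly confident "b_n = 1"
```

The probabilities returned by `llr_to_probs` are exactly what (16) needs to populate the state-transition probabilities for the next equalization round.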

Figure 7: Overview of turbo equalization. The OLTD-based turbo equalization differs from the model-based one only in that the normalized likelihoods are obtained using the OLTD algorithm.

After a prescribed number of iterations, we calculate $\tilde{\gamma}_{k}(r_{i},r_{j})$ as [36, eq. (22)]

\tilde{\gamma}_{k}(r_{i},r_{j})=\begin{cases}p(u_{k})\cdot p(a_{1,i,j}|\mathbf{y})\cdot p(a_{2,i,j}|\mathbf{y}),&\text{if }(s_{n},s_{n+1})\in\mathcal{S}^{+}\cup\mathcal{S}^{-},\\ 0,&\text{if }(s_{n},s_{n+1})\notin\mathcal{S}^{+}\cup\mathcal{S}^{-},\end{cases} (24)

for $k=1,2,\cdots,K$; the decoder then calculates the LLRs of the information bits $u_{k}$ and makes the final decisions

\tilde{u}_{k}=\begin{cases}0,&\text{if } L(u_{k}|\tilde{\mathbf{y}})<0,\\ 1,&\text{if } L(u_{k}|\tilde{\mathbf{y}})\geq 0.\end{cases} (25)

Hence, the only difference between the OLTD-based turbo equalization and the standard model-based one is how the likelihoods $p(\mathbf{y}|s_{n},s_{n+1})$ are obtained; the whole procedure in the dotted-line box of Fig. 7 is the same for both methods.

IV-B Apply BCJRNet to Turbo Equalization?

One may attempt to apply the BCJRNet to turbo equalization, but will encounter a major issue, as explained next.

In the BCJRNet, a neural network is trained to obtain the a posteriori probability $p(s_{n},s_{n+1}|y_{n})$; then $p(y_{n}|s_{n},s_{n+1})$ follows from Bayes' rule (to obtain $p(y_{n}|s_{n},s_{n+1})$, the BCJRNet algorithm actually assumes that $p(s_{n+1},s_{n})=|\mathcal{S}|^{-L}$ is a constant [23, eq. (13)]):

p(y_{n}|s_{n},s_{n+1})=\frac{p(s_{n},s_{n+1}|y_{n})p(y_{n})}{p(s_{n+1},s_{n})}. (26)

Combining (26) and (12c), we obtain

\gamma_{n}(s_{n},s_{n+1})=\frac{p(s_{n},s_{n+1}|y_{n})p(y_{n})}{p(s_{n})}. (27)

According to (10) and (11), we have

L(b_{n}|\mathbf{y})=\log\frac{\sum_{(s_{n},s_{n+1})\in\mathcal{S}^{+}}\alpha_{n}(s_{n})\gamma_{n}(s_{n},s_{n+1})\beta_{n+1}(s_{n+1})}{\sum_{(s_{n},s_{n+1})\in\mathcal{S}^{-}}\alpha_{n}(s_{n})\gamma_{n}(s_{n},s_{n+1})\beta_{n+1}(s_{n+1})}. (28)

Substituting (27) into (28) yields

L(b_{n}|\mathbf{y})=\log\frac{\sum_{(s_{n},s_{n+1})\in\mathcal{S}^{+}}\alpha_{n}(s_{n})\cdot\frac{p(s_{n},s_{n+1}|y_{n})p(y_{n})}{p(s_{n})}\cdot\beta_{n+1}(s_{n+1})}{\sum_{(s_{n},s_{n+1})\in\mathcal{S}^{-}}\alpha_{n}(s_{n})\cdot\frac{p(s_{n},s_{n+1}|y_{n})p(y_{n})}{p(s_{n})}\cdot\beta_{n+1}(s_{n+1})}=\log\frac{\sum_{(s_{n},s_{n+1})\in\mathcal{S}^{+}}\alpha_{n}(s_{n})\cdot\frac{p(s_{n},s_{n+1}|y_{n})}{p(s_{n})}\cdot\beta_{n+1}(s_{n+1})}{\sum_{(s_{n},s_{n+1})\in\mathcal{S}^{-}}\alpha_{n}(s_{n})\cdot\frac{p(s_{n},s_{n+1}|y_{n})}{p(s_{n})}\cdot\beta_{n+1}(s_{n+1})}. (29)

Given the a posteriori probability $p(s_{n},s_{n+1}|y_{n})$ learned by the BCJRNet, one can update $p(s_{n})$ through the forward recursion

p(s_{n})=\sum_{s_{n-1}\in\mathcal{S}}p(s_{n}|s_{n-1})p(s_{n-1}), (30)

where $p(s_{n}|s_{n-1})$ can be obtained according to (16) and (23). Due to this recursion, the BCJRNet-based turbo equalization is cumbersome.

More importantly, this method actually does not work, as shown by the simulation example in Section V-B. Indeed, the a posteriori probability $p(s_{n},s_{n+1}|y_{n})$ is not the suitable metric to learn, because

p(s_{n},s_{n+1}|y_{n})\propto p(y_{n}|s_{n},s_{n+1})p(s_{n},s_{n+1}) (31)

relies on the a priori information $p(s_{n},s_{n+1})$; thus, it is impossible to infer $p(s_{n},s_{n+1}|y_{n})$ from $y_{n}$ alone unless $p(s_{n},s_{n+1})$ is a constant, which is usually untrue. In contrast, the likelihood $p(y_{n}|s_{n},s_{n+1})$ relies solely on the distribution of the channel noise and the CSI, and is independent of the channel coding; thus, it can be learned from $y_{n}$ itself.
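The prior-dependence in (31) can be illustrated numerically (all numbers below are made up for illustration): the same likelihood vector combined with two different transition priors yields two different posteriors, so no network can recover the posterior from $y_{n}$ alone.

```python
# Bayes' rule (31): posterior is proportional to likelihood times prior.
# Same likelihoods, two hypothetical priors -> two different posteriors.

def posterior(likelihood, prior):
    joint = [l * p for l, p in zip(likelihood, prior)]
    z = sum(joint)
    return [j / z for j in joint]

likelihood = [0.8, 0.2]        # p(y_n | transition), fixed by channel and noise
prior_a = [0.5, 0.5]           # uncoded bits: equiprobable transitions
prior_b = [0.9, 0.1]           # coded bits: transitions no longer uniform

print(posterior(likelihood, prior_a))   # [0.8, 0.2]
print(posterior(likelihood, prior_b))   # roughly [0.973, 0.027]
```

The likelihood vector, by contrast, is the same in both cases, which is why the OLTD learns it instead of the posterior.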

V Simulation Results

In this section, we present simulation examples to validate the feasibility and superior performance of the OLTD method applied to the Viterbi algorithm, the BCJR algorithm, and the turbo equalization.

We adopt a fully-connected neural network with a single hidden layer, as shown in Fig. 5, for the OLTD. The hidden layer has 100 neurons and employs the sigmoid activation function. The number of neurons in the output layer equals the number of state transitions in the trellis diagram. The likelihood of each state transition is normalized by the softmax function to approximate the ground truth, i.e., the one-hot vector. The network is trained with the Adamax optimizer [37] to minimize the cross-entropy loss over a pilot sequence. The optimizer divides the pilot into mini-batches of 16 samples, and the initial learning rate is set to 0.01. The settings of our training conditions are summarized in Table I.

TABLE I: The Simulation Settings
NN Toolkit: Keras with TensorFlow backend
Training Processor: Intel(R) i7-6700 CPU
Training Batch Size: 16
Training Epochs: 200
No. of Hidden Layers: 1 (100 neurons)
Optimizer: Adamax
Activation Functions: Sigmoid (hidden) / Softmax (output)
Loss Function: Cross-entropy
Pilot Length: 500 (QPSK and OOK), 256 (GFSK)
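A minimal NumPy sketch of the forward pass of this network (weights are random here rather than trained; the dimensions follow the description above and Table I, with the 16-branch QPSK trellis as an example):

```python
import numpy as np

# Forward pass of the OLTD network: one hidden layer of 100 sigmoid
# neurons, softmax output over the trellis state transitions. Weights are
# random here; in the paper they are trained with Adamax on cross-entropy
# over a pilot sequence (batch size 16, initial learning rate 0.01).

rng = np.random.default_rng(0)

NUM_TRANSITIONS = 16      # e.g. the 4-state, 16-branch QPSK trellis
HIDDEN = 100
INPUT_DIM = 2             # real and imaginary parts of y_n

W1 = rng.standard_normal((INPUT_DIM, HIDDEN)) * 0.1
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, NUM_TRANSITIONS)) * 0.1
b2 = np.zeros(NUM_TRANSITIONS)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def transition_probs(y):
    """Map received samples y_n to a distribution over state transitions."""
    h = sigmoid(y @ W1 + b1)
    return softmax(h @ W2 + b2)

y = np.array([[0.7, -0.3]])        # one received sample (Re, Im)
p = transition_probs(y)
print(p.shape)                     # (1, 16); each row sums to 1
```

The softmax output plays the role of the normalized likelihood vector over the trellis branches that the BCJR/Viterbi recursions then consume.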

Three types of signals are simulated: i) uncoded Quadrature Phase Shift Keying (QPSK) transmitted over an ISI channel, ii) bit-interleaved On-Off Keying (OOK) in a Poisson channel, and iii) coded GFSK as adopted in the PHY of the BLE; all can be represented by a trellis diagram. (The Matlab$^{\rm TM}$ codes used for generating the simulation results can be found at https://github.com/JayYang-Fdu/OLTD-code.)

V-A Uncoded QPSK in an ISI Channel with Additive Noise

We first simulate an ISI channel as modelled in (1), where the input signal is uncoded QPSK, the noise is complex-valued Gaussian, and the channel coefficients are given by $h_{l}\triangleq\sqrt{e^{-\gamma l}/\sum^{L-1}_{l=0}e^{-\gamma l}}$ for $\gamma>0$. We set the channel memory length $L=2$; thus, the trellis diagram is fully-connected with 4 states and 16 branches (state transitions). Fig. 8 compares the bit error rate (BER) performance of the model-based and OLTD-based BCJR/Viterbi algorithms. The model-based method is simulated with perfect CSI, while the OLTD is trained on a 500-sample pilot. The simulation results are obtained by averaging over 10,000 Monte Carlo runs, where the channel coefficients are generated with $\gamma$ drawn at random from the range $[0.1,1]$. Fig. 8 shows that the OLTD-based method can achieve performance very close to the model-based benchmark. The Viterbi algorithm performs the same as the BCJR algorithm, since here the bits 0 and 1 are equiprobable.
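The exponentially decaying channel profile above can be generated as follows (a straightforward transcription of the definition of $h_{l}$):

```python
import math

def isi_channel_taps(L, gamma):
    """Taps h_l = sqrt(e^{-gamma*l} / sum_l e^{-gamma*l}), unit total power."""
    weights = [math.exp(-gamma * l) for l in range(L)]
    total = sum(weights)
    return [math.sqrt(w / total) for w in weights]

h = isi_channel_taps(L=2, gamma=0.5)
print(h)                              # two decaying taps
print(sum(t ** 2 for t in h))         # approximately 1.0 (unit channel power)
```

The normalization guarantees that the channel does not change the average received power, so varying $\gamma$ only redistributes energy between the taps.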

We then simulate the channel noise as complex Cauchy with the PDF

f(z_{k})=\frac{1}{\pi^{2}}\cdot\frac{\lambda^{2}}{\left(\Re\{z_{k}\}^{2}+\lambda^{2}\right)\left(\Im\{z_{k}\}^{2}+\lambda^{2}\right)}, \quad (32)

where $\Re\{\cdot\}$ and $\Im\{\cdot\}$ stand for the real and imaginary parts, respectively. The Cauchy PDF has much heavier tails than the Gaussian's.
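A quick numerical check of the heavy-tail claim (the unit-variance Gaussian baseline below is chosen purely for illustration):

```python
import math

def complex_cauchy_pdf(re, im, lam):
    """PDF (32) of the complex Cauchy noise with scale parameter lambda."""
    return (1.0 / math.pi ** 2) * lam ** 2 / (
        (re ** 2 + lam ** 2) * (im ** 2 + lam ** 2))

def gaussian_pdf(x, sigma):
    return math.exp(-x ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# At 10 scale units from the origin, the Cauchy density dwarfs the
# product of two independent unit-variance Gaussian densities.
cauchy_tail = complex_cauchy_pdf(10.0, 0.0, lam=1.0)
gauss_tail = gaussian_pdf(10.0, sigma=1.0) * gaussian_pdf(0.0, sigma=1.0)
print(cauchy_tail > 1e6 * gauss_tail)      # True: much heavier tails
```

Such tails produce frequent large outliers, which is exactly the regime where a Gaussian-matched branch metric degrades while the learned metric does not.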

Figure 8: The BER performance of the model-based and OLTD-based receiver for uncoded QPSK.
Figure 9: The BER performance of the model-based Viterbi receiver and the OLTD-Viterbi receiver.

In the Cauchy noise case, the SNR is undefined since the variance of the Cauchy distribution is unbounded. We instead simulate the BER performance by varying the value of $\gamma$. Fig. 9 shows that the OLTD-BCJR outperforms the model-based BCJR algorithm in this non-Gaussian case, which is not surprising since the OLTD requires no a priori knowledge of the statistics of the noise.

Figure 10: The BER performance of the OLTD-based Viterbi compared to the model-based method with co-channel interference.

We also simulate the ISI channel scenario where the QPSK signal is interfered with by a random 4-ary Pulse Amplitude Modulation (4-PAM) source. The received power of the interference equals that of the QPSK signal, i.e., the signal-to-interference ratio (SIR) is 0 dB. As shown in Fig. 10, the model-based Viterbi method, which assumes Gaussian noise, fails. The striking advantage of the OLTD-based approach indicates that the neural network somehow learned the “structure” of the non-Gaussian interference and hence suppressed it effectively.

Figure 11: The BER performance versus the pilot length at SNR = 10 dB.

To demonstrate the influence of the training pilot length on reception performance, we simulate the BER of the OLTD-based approach trained with pilot lengths of $\{0.5,1,2,5,10,20\}\times 10^{2}$, as shown in Fig. 11. At SNR = 10 dB in the Gaussian channel, the performance of the OLTD-based method gradually approaches, as the pilot length increases, that of the model-based method with perfect CSI. A pilot of a few hundred samples is sufficient in all three cases. Fig. 11 also illustrates, from a different perspective, that the OLTD-based method can outperform the model-based method in the two non-Gaussian noise cases.

V-B Bit-interleaved Coded OOK in a Poisson Channel

In addition to the additive noise channel, we also simulate the Poisson channel previously considered in [23]. Here, however, we consider a bit-interleaved system as shown in Fig. 6, where the information bits are FEC encoded, bit-interleaved, and then modulated by OOK before being transmitted over a Poisson channel. The channel output $y_{k}$ follows a Poisson distribution, i.e.,

p(y_{k}\,|\,s_{k},s_{k+1})=\frac{(v+1)^{y_{k}}}{y_{k}!}e^{-(v+1)},\quad y_{k}=0,1,2,\ldots \quad (33)

where $v=\sum_{l=0}^{L-1}h_{l}x_{k-l}$ with $x_{k-l}\in\{0,1\}$ and $h_{l}\triangleq\sqrt{e^{-\gamma l}/\sum^{L-1}_{l=0}e^{-\gamma l}}$ for $\gamma$ drawn at random from the range $[0.1,1]$.
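The branch likelihood (33) is straightforward to evaluate; the taps and OOK inputs below are hypothetical values for illustration:

```python
import math

def poisson_branch_likelihood(y, v):
    """p(y | s_k, s_{k+1}) from (33): Poisson with mean v + 1."""
    mean = v + 1.0
    return mean ** y * math.exp(-mean) / math.factorial(y)

# Hypothetical L = 2 example: taps h and OOK symbols (x_k, x_{k-1}) fix v.
h = [0.8, 0.6]
x = [1, 0]                                  # current and previous OOK symbols
v = h[0] * x[0] + h[1] * x[1]               # v = 0.8, so the mean is 1.8
probs = [poisson_branch_likelihood(y, v) for y in range(6)]
print(abs(sum(probs) - 1.0) < 0.05)         # True: most mass lies in y <= 5
```

Each trellis branch fixes a different $v$, so the branch metrics are Poisson pmfs with different means; the OLTD learns these pmfs directly from the pilot without knowing the Poisson model.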

For $L=2$, the Poisson channel can be represented by a fully-connected two-state trellis diagram with 4 branches. Hence, the OLTD uses a neural network with 4 outputs. Fig. 12 shows that, with no turbo iterations, the OLTD-based method and the BCJRNet have identical performance. When applied to turbo equalization, however, the BCJRNet fails to decode, as explained in Section IV-B. In contrast, the OLTD-based turbo equalization achieves significantly improved performance.

Figure 12: The BER performance of the OLTD-based turbo equalization compared to the BCJRNet-based turbo equalization in a Poisson channel.

V-C A BLE System and Its Enhancement

Figure 13: The performances of the model-based and OLTD-based BLE receivers in an AWGN channel versus the performance of the enhanced BLE system with bit-interleaving.

We simulate, in an AWGN channel, a BLE system with a bit rate of 500 kbps and a pilot of length 256, as specified in the protocol [30]. The model-based receiver consists of a BCJR demodulator and a Viterbi decoder, and assumes perfect CSI. The OLTD-based receiver differs from the model-based one only in that the likelihoods $p(y_{k}|s_{k},s_{k+1})$ [cf. (12c)] used in the BCJR algorithm are produced by a neural network trained on the 256-sample pilot sequence. According to Fig. 4, the GFSK modulation can be represented by a trellis diagram with 8 state transitions; thus, the neural network for the OLTD has 8 neurons in the output layer. The BER is averaged over 10,000 Monte Carlo trials of the channel coefficient $h(\theta)=e^{j\theta}$, with $\theta$ drawn at random from the range $[0,2\pi]$. In Fig. 13, the dash-dot line with marker $\triangle$ corresponds to the model-based method, while the dash-dot line with marker $\square$ is the OLTD-based one. They essentially overlap, which suggests that the neural network-assisted receiver can be practically feasible, at least performance-wise.

We also present the Shannon limit of the BER performance of the GFSK signal, obtained using the numerical method in [38, 39]. The large gap between the Shannon limit and the achieved performance of the BLE system motivated us to introduce a bit interleaver at the transmitter between the encoder and the GFSK modulator (cf. Fig. 6). Given the bit interleaver, the receiver can apply turbo equalization. Fig. 13 also illustrates the BER performance of the turbo receiver with no iteration, and with 1 and 2 iterations. The three dashed lines show the model-based turbo equalization with perfect CSI, and the three solid lines with markers correspond to the OLTD-based turbo algorithm with a pilot length of 256. It can be seen that introducing the bit interleaver and using turbo equalization in the receiver yields a $4\sim 5$ dB gain over the conventional receiver. Hence, bit-interleaving may be an interesting enhancement to the existing BLE protocol for significantly improved performance, which can make it more competitive for IoT communications.

In the last example, we consider the BLE system in an ISI channel with memory $L=2$. The ISI channel is described as $h_{0}\delta(n)+h_{1}\delta(n-1)$, where the channel coefficients are normalized. The combination of the GFSK modulation and the ISI channel can be modelled by a trellis diagram with 16 states. We can then also apply the OLTD-based turbo equalization, except that here the output layer of the neural network has 16 neurons. The BER performance of the model-based turbo receiver with perfect CSI and the OLTD-based turbo receiver in the ISI channel is compared in Fig. 14. It can be seen that the OLTD-based turbo equalization, trained on the same 256-sample pilot sequence, incurs about 0.5 dB loss compared with the model-based method with perfect CSI.

Figure 14: The BER performance of the OLTD-based turbo equalization in an ISI channel with memory length $L=2$.

VI Conclusions

This paper introduced a method named online learning of trellis diagram (OLTD), which uses a single-hidden-layer artificial neural network (ANN) to learn the likelihoods of the received samples under different state transitions. It can replace only the channel state-dependent part of the Viterbi algorithm and the BCJR algorithm. We applied the OLTD-based Viterbi/BCJR algorithms to a coded QPSK/GFSK system, and the simulation results show that, using a pilot sequence of only a few hundred samples, the OLTD-based methods can perform similarly to their model-based counterparts given perfect channel state information (CSI) and Gaussian noise. In contrast to the model-based approaches, the OLTD-based approach assumes neither CSI nor statistics of the noise, which makes it robust against non-Gaussian interference. In contrast to state-of-the-art machine learning-assisted methods, such as the BCJRNet and ViterbiNet, the proposed method does not assume the a priori probabilities of the coded bits, which makes it readily applicable to turbo equalization. The OLTD-based algorithms can be applied both to the standard Bluetooth system and to the enhanced one with bit-interleaving, because they require only a pilot of moderate length. As an interesting by-product of this study, introducing bit-interleaving can be a beneficial enhancement to the BLE standard.

References

  • [1] J. G. Proakis and M. Salehi, Digital communications. McGraw-Hill., 2008.
  • [2] T. O’Shea and J. Hoydis, “An Introduction to Deep Learning for the Physical Layer,” IEEE Transactions on Cognitive Communications and Networking, vol. 3, no. 4, pp. 563–575, 2017.
  • [3] Q. Mao, F. Hu, and Q. Hao, “Deep Learning for Intelligent Wireless Networks: A Comprehensive Survey,” IEEE Communications Surveys and Tutorials, vol. 20, no. 4, pp. 2595–2621, 2018.
  • [4] D. Gunduz, P. de Kerret, N. D. Sidiropoulos, D. Gesbert, C. R. Murthy, and M. van der Schaar, “Machine learning in the air,” IEEE Journal on Selected Areas in Communications, vol. 37, no. 10, pp. 2184–2199, 2019.
  • [5] N. H. Tran, W. Bao, A. Zomaya, N. H. N. Minh, and C. S. Hong, “Federated Learning over Wireless Networks: Optimization Model Design and Analysis,” in IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, pp. 1387–1395, 2019.
  • [6] S. Park, O. Simeone, and J. Kang, “Meta-Learning to Communicate: Fast End-to-End Training for Fading Channels,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5075–5079, 2020.
  • [7] E. Nachmani, Y. Be’ery, and D. Burshtein, “Learning to decode linear codes using Deep Learning,” in 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 341–346, 2016.
  • [8] A. Bennatan, Y. Choukroun, and P. Kisilev, “Deep Learning for Decoding of Linear Codes - A Syndrome-Based Approach,” in 2018 IEEE International Symposium on Information Theory (ISIT), pp. 1595–1599, 2018.
  • [9] F. Liang, C. Shen, and F. Wu, “An Iterative BP-CNN Architecture for Channel Decoding,” IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 1, pp. 144–159, 2018.
  • [10] E. Nachmani, E. Marciano, L. Lugosch, W. J. Gross, D. Burshtein, and Y. Be’ery, “Deep Learning Methods for Improved Decoding of Linear Codes,” IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 1, pp. 119–131, 2018.
  • [11] T. Gruber, S. Cammerer, J. Hoydis, and S. ten Brink, “On deep learning-based channel decoding,” in 2017 51st Annual Conference on Information Sciences and Systems (CISS), pp. 1–6, 2017.
  • [12] N. Samuel, T. Diskin, and A. Wiesel, “Deep MIMO detection,” in 2017 IEEE 18th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), pp. 1–5, 2017.
  • [13] N. Samuel, T. Diskin, and A. Wiesel, “Learning to Detect,” IEEE Transactions on Signal Processing, vol. 67, no. 10, pp. 2554–2564, 2019.
  • [14] J. Li, Q. Zhang, X. Xin, Y. Tao, Q. Tian, F. Tian, D. Chen, Y. Shen, G. Cao, Z. Gao, and J. Qian, “Deep learning-based massive MIMO CSI feedback,” in 2019 18th International Conference on Optical Communications and Networks (ICOCN), pp. 1–3, 2019.
  • [15] T. J. O’Shea, T. Erpek, and T. C. Clancy, “Deep Learning Based MIMO Communications.,” arXiv preprint arXiv:1707.07980, 2017.
  • [16] Y. Liao, N. Farsad, N. Shlezinger, Y. C. Eldar, and A. J. Goldsmith, “Deep Neural Network Symbol Detection for Millimeter Wave Communications,” in 2019 IEEE Global Communications Conference (GLOBECOM), pp. 1–6, 2019.
  • [17] N. Farsad and A. Goldsmith, “Neural Network Detection of Data Sequences in Communication Systems,” IEEE Transactions on Signal Processing, vol. 66, no. 21, pp. 5663–5678, 2018.
  • [18] F. A. Aoudia and J. Hoydis, “End-to-End Learning of Communications Systems Without a Channel Model,” in 2018 52nd Asilomar Conference on Signals, Systems, and Computers, pp. 298–303, 2018.
  • [19] H. Ye, L. Liang, G. Y. Li, and B.-H. Juang, “Deep Learning-Based End-to-End Wireless Communication Systems With Conditional GANs as Unknown Channels,” IEEE Transactions on Wireless Communications, vol. 19, no. 5, pp. 3133–3143, 2020.
  • [20] H. He, S. Jin, C.-K. Wen, F. Gao, G. Y. Li, and Z. Xu, “Model-Driven Deep Learning for Physical Layer Communications,” IEEE Wireless Communications, vol. 26, no. 5, pp. 77–83, 2019.
  • [21] X. Gao, S. Jin, C.-K. Wen, and G. Y. Li, “ComNet: Combination of deep learning and expert knowledge in OFDM receivers,” IEEE Communications Letters, vol. 22, no. 12, pp. 2627–2630, 2018.
  • [22] J. Liao, J. Zhao, F. Gao, and G. Y. Li, “A Model-Driven Deep Learning Method for Massive MIMO Detection,” IEEE Communications Letters, vol. 24, no. 8, pp. 1724–1728, 2020.
  • [23] N. Shlezinger, N. Farsad, Y. C. Eldar, and A. J. Goldsmith, “Data-Driven Factor Graphs for Deep Symbol Detection,” in 2020 IEEE International Symposium on Information Theory (ISIT), pp. 2682–2687, 2020.
  • [24] N. Shlezinger, N. Farsad, Y. C. Eldar, and A. J. Goldsmith, “ViterbiNet: A Deep Learning Based Viterbi Algorithm for Symbol Detection,” IEEE Transactions on Wireless Communications, vol. 19, no. 5, pp. 3319–3331, 2020.
  • [25] G. D. Forney, Jr., “The Viterbi algorithm,” Proceedings of the IEEE, vol. 61, no. 3, pp. 268–278, 1973.
  • [26] L. R. Bahl, J. Cocke, F. Jelinek, and J. Raviv, “Optimal decoding of linear codes for minimizing symbol error rate,” IEEE Transactions on Information Theory, vol. 20, no. 2, pp. 284–287, 1974.
  • [27] G. McLachlan and D. Peel, Finite Mixture Models. New York: Wiley, 2000.
  • [28] E. Au, “Bluetooth 5.0 and Beyond [Standards],” IEEE Vehicular Technology Magazine, vol. 14, no. 2, pp. 119–120, 2019.
  • [29] C.-E. Sundberg, “Continuous phase modulation,” IEEE Communications Magazine, vol. 24, no. 4, pp. 25–38, 1986.
  • [30] “Bluetooth Core Specification, version 5.0.” Available: https://www.bluetooth.com/zh-cn/specifications/specs/core-specification-5/.
  • [31] T. Okada and Y. Iwanami, “Turbo Equalization of GMSK Signals Using Noncoherent Frequency Detection,” IEICE Transactions on Electronics, vol. 85, no. 3, pp. 473–479, 2002.
  • [32] X. Wang and Z. Yang, “Turbo equalization for GMSK signaling over multipath channels,” in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), vol. 4, pp. 2641–2644, 2001.
  • [33] T. Okada and Y. Iwanami, “Turbo equalization of GMSK signals using limiter-discriminator,” Proc. ISSSE, 2001.
  • [34] L. Atzori, A. Iera, and G. Morabito, “The Internet of Things: A survey,” Computer Networks, vol. 54, no. 15, pp. 2787–2805, 2010.
  • [35] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
  • [36] R. Koetter, A. Singer, and M. Tuchler, “Turbo Equalization,” IEEE Signal Processing Magazine, 2004.
  • [37] D. P. Kingma and J. L. Ba, “Adam: A Method for Stochastic Optimization,” in ICLR 2015 : International Conference on Learning Representations 2015, 2015.
  • [38] L. I. Bing, F. Wei, B. M. Bai, and M. A. Xiao, “Fundamental performance limits of CPM coded modulation system,” Journal on Communications, 2014.
  • [39] D. M. Arnold, H. . Loeliger, P. O. Vontobel, A. Kavcic, and W. Zeng, “Simulation-Based Computation of Information Rates for Channels With Memory,” IEEE Transactions on Information Theory, vol. 52, no. 8, pp. 3498–3508, 2006.