Linear MIMO Precoders Design for Finite Alphabet Inputs via Model-Free Training

Chen Cao, Biqian Feng, Yongpeng Wu, Derrick Wing Kwan Ng, and Wenjun Zhang

C. Cao, B. Feng, Y. Wu, and W. Zhang are with the Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: cici_[email protected]; [email protected]; [email protected]; [email protected]). (Corresponding author: Yongpeng Wu.) D. W. K. Ng is with the School of Electrical Engineering and Telecommunications, University of New South Wales, Sydney, NSW 2052, Australia (e-mail: [email protected]).
Abstract

This paper investigates a novel method for designing linear precoders with finite alphabet inputs based on autoencoders (AE) without the knowledge of the channel model. By model-free training of the autoencoder in a multiple-input multiple-output (MIMO) system, the proposed method can effectively solve the optimization problem to design the precoders that maximize the mutual information between the channel inputs and outputs, when only the input-output information of the channel can be observed. Specifically, the proposed method regards the receiver and the precoder as two independent parameterized functions in the AE and alternately trains them using the exact and approximated gradient, respectively. Compared with previous precoders design methods, it alleviates the limitation of requiring the explicit channel model to be known. Simulation results show that the proposed method works as well as those methods under known channel models in terms of maximizing the mutual information and reducing the bit error rate.

Index Terms:
Autoencoders, deep learning, finite alphabet, linear precoders, MIMO.

I Introduction

Linear precoding is a key technology in multiple-input multiple-output (MIMO) systems, as it can improve both the transmission rate and the communication quality [1]. To achieve the MIMO system capacity, various precoding methods based on the Gaussian channel input assumption have been proposed. Among them, water-filling (WF) [2] has been proved to be the theoretically optimal precoding method, which can achieve the channel capacity. However, in practical systems, channel inputs are usually drawn from finite alphabets, such as phase-shift keying (PSK) or quadrature-amplitude modulation (QAM) signals. Therefore, methods based on the Gaussian input assumption inevitably cause performance loss, which makes it more valuable to study precoder design methods that take the impact of finite alphabet inputs into account [3]. Under this premise, the precoding design goal is usually to maximize the channel input-output mutual information [3]. For instance, [3] proposes mercury/water-filling (MWF), which is an optimal power allocation method for independent parallel additive white Gaussian noise (AWGN) channels. Also, [4] has proved that the mutual information is a concave function with respect to (w.r.t.) the squared singular values of the precoding matrix if its right singular vectors are fixed. Thus, for a general vector Gaussian channel, the design of the precoding matrix can be reduced to the design of the power allocation matrix and of the right singular matrix via the singular value decomposition (SVD) of the precoding matrix [5].

In recent years, due to the widespread use of machine learning (ML), the transmitter and receiver in a communication system based on autoencoders (AE) can be optimized in pairs, rather than the separated approach as in traditional methods, which helps achieve better performance over the whole system [6]. Particularly, [7] has proved that the process of AE training on MIMO precoders and receivers can maximize the mutual information, by properly selecting the activation function and loss function. It is worth mentioning that the time complexity in the optimization process can be reduced significantly [7], as the network training process avoids the explicit calculation of the mutual information and its gradient.

Despite the promised performance gain, the aforementioned precoding design methods assume that a complete channel model can be obtained at the transmitter. In fact, on the one hand, the channel model in practical systems is hard to derive or estimate and only the input and output data can be observed; on the other hand, channel estimation errors inevitably occur and lead to performance loss [8]. As such, [9] proposes an alternating model-free algorithm for training AEs in point-to-point single-input single-output (SISO) communication systems without channel models. Unfortunately, [9] focuses only on the AWGN and Rayleigh block-fading channels without considering the role of precoding. In addition, it does not constrain the channel inputs to be drawn from finite alphabets. Instead, it sends the message directly to the channel with a normalized energy through the neural networks (NN). Therefore, this method is difficult to apply in practical MIMO systems.

In this paper, a novel AE-based design method for linear MIMO precoders with finite alphabet inputs is presented, which can be applied when the specific channel model is unknown. In such cases, the channels are non-differentiable and thus conventional gradient-based training through backpropagation is not applicable, e.g., [7]. To circumvent this problem, we provide an iterative training algorithm to optimize the precoder and the receiver, which can be regarded as two independent parameterized functions. This model-free algorithm iterates between training receivers with the true gradients and training precoders with the estimated gradients by treating the channel inputs as random variables.

To summarize, the main contributions of this paper are as follows: i) By alleviating the limitation of existing precoding design schemes in requiring a complete channel model, precoders can be designed from directly observable channel information, which avoids the adverse influence of channel modeling errors on the precoder design; ii) The receiver and the precoder in MIMO systems are jointly optimized from the practical channel information. Simulation results show that the proposed method achieves performance similar to that of the model-based algorithm in maximizing the mutual information and performs well in reducing the bit error rate (BER).

Notations: Boldface uppercase (lowercase) letters denote matrices (column vectors). $\mathbf{I}_{n}$ denotes the $n\times n$ identity matrix. $\mathbb{R}$ ($\mathbb{C}$) is the set of real (complex) numbers. $\mathrm{Re}(\cdot)$ and $\mathrm{Im}(\cdot)$ denote the real part and the imaginary part of a complex matrix, respectively. $\mathcal{N}(m,s)$ is the Gaussian distribution with mean $m$ and covariance $s$. $\mathcal{H}(\cdot)$ denotes the information entropy of a random variable. $\mathbb{E}_{\mathbf{x}}(\cdot)$ represents the expectation w.r.t. $\mathbf{x}$. $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix. $\mathrm{Bdiag}(\cdot)$ denotes a block diagonal matrix. The gradient and Jacobian operators w.r.t. the set of parameters $\boldsymbol{\theta}$ are both denoted by $\nabla_{\boldsymbol{\theta}}$; $(\cdot)^{\mathsf{T}}$, $(\cdot)^{\mathsf{H}}$, and $\|\cdot\|_{2}$ represent the transpose, the conjugate transpose, and the $l_{2}$-norm, respectively.

II System Model

Considering a MIMO communication system with $N_{t}$ antennas at the transmitter and $N_{r}$ antennas at the receiver, the received signal can be expressed as:

$$\mathbf{y}=\mathbf{HGx}+\mathbf{n}, \tag{1}$$

where $\mathbf{H}\in\mathbb{C}^{N_{r}\times N_{t}}$ is the channel matrix, $\mathbf{G}\in\mathbb{C}^{N_{t}\times N_{t}}$ is the linear precoder, $\mathbf{n}\in\mathbb{C}^{N_{r}\times 1}$ is the circularly symmetric white Gaussian noise whose covariance matrix is $\sigma^{2}\mathbf{I}_{N_{r}}$, and $\mathbf{x}\in\mathbb{C}^{N_{t}\times 1}$ is the input signal of zero mean and covariance $\mathbb{E}\{\mathbf{x}\mathbf{x}^{\mathsf{H}}\}=\mathbf{I}_{N_{t}}$. Based on the assumption of finite alphabet inputs, the input signal $\mathbf{x}$ is equiprobably drawn from a set $\mathcal{M}$ of discrete constellations such as $M$-ary PSK or QAM with $|\mathcal{M}|=M^{N_{t}}$.
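As an illustration of the signal model in (1), the following minimal sketch enumerates the finite alphabet and passes a batch of inputs through a fixed channel. The helper names (`constellation`, `all_inputs`, `transmit`), the choice of $M$-PSK, the identity precoder, and the noise variance are assumptions made for this example only.

```python
import itertools
import numpy as np

def constellation(M):
    """Unit-energy M-PSK points (M=2 gives BPSK, M=4 gives QPSK)."""
    return np.exp(2j * np.pi * np.arange(M) / M)

def all_inputs(M, n_t):
    """All M**n_t candidate input vectors, one per column (n_t x M**n_t)."""
    pts = constellation(M)
    return np.array(list(itertools.product(pts, repeat=n_t))).T

def transmit(H, G, X, sigma2, rng):
    """Apply y = H G x + n column-wise, with n ~ CN(0, sigma2 * I)."""
    n_r, S = H.shape[0], X.shape[1]
    noise = np.sqrt(sigma2 / 2) * (rng.standard_normal((n_r, S))
                                   + 1j * rng.standard_normal((n_r, S)))
    return H @ G @ X + noise

rng = np.random.default_rng(0)
H = np.array([[2, 1], [1, 1]], dtype=complex)  # example 2x2 channel
G = np.eye(2, dtype=complex)                   # identity precoder, Tr{G^H G} = N_t
X_all = all_inputs(M=2, n_t=2)                 # BPSK: 4 candidate vectors
idx = rng.integers(0, X_all.shape[1], size=1000)
Y = transmit(H, G, X_all[:, idx], sigma2=0.5, rng=rng)
```

In the model-free setting considered later, only the pairs of transmitted and received samples produced by a routine like `transmit` would be observable, not $\mathbf{H}$ itself.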

Aiming at maximizing the mutual information between the channel input $\mathbf{x}$ and output $\mathbf{y}$, the design problem of a precoder with finite alphabet inputs can be formulated as:

$$\begin{split}&\underset{\mathbf{G}}{\mathrm{maximize}}\quad\mathcal{I}(\mathbf{x};\mathbf{y})\\ &\quad\mathrm{s.t.}\quad\mathrm{Tr}\{\mathbf{G}^{\mathsf{H}}\mathbf{G}\}\leq N_{t},\end{split} \tag{2}$$

where the mutual information can be described as [10]:

$$\mathcal{I}(\mathbf{x};\mathbf{y})=N_{t}\log_{2}M-\frac{1}{M^{N_{t}}}\sum_{m=1}^{M^{N_{t}}}\mathbb{E}_{\mathbf{n}}\left\{\log_{2}\sum_{k=1}^{M^{N_{t}}}e^{-d_{m,k}}\right\}, \tag{3}$$

with $d_{m,k}=\sigma^{-2}(\|\mathbf{HG}(\mathbf{x}_{m}-\mathbf{x}_{k})+\mathbf{n}\|_{2}^{2}-\|\mathbf{n}\|_{2}^{2})$.
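The expectation over the noise in (3) has no closed form, so in practice it is usually estimated by sampling. The sketch below is one possible Monte Carlo estimator for $M$-PSK inputs when $\mathbf{H}$ is known; the function name `mutual_information`, the number of noise samples, and the example channel are assumptions for illustration.

```python
import itertools
import numpy as np

def mutual_information(H, G, M, sigma2, n_noise=2000, seed=0):
    """Monte Carlo estimate of (3) in bits for M-PSK inputs."""
    rng = np.random.default_rng(seed)
    n_r, n_t = H.shape
    pts = np.exp(2j * np.pi * np.arange(M) / M)
    X = np.array(list(itertools.product(pts, repeat=n_t))).T  # n_t x M**n_t
    HGX = H @ G @ X
    # diff[:, m, k] = H G (x_m - x_k)
    diff = HGX[:, :, None] - HGX[:, None, :]
    noise = np.sqrt(sigma2 / 2) * (rng.standard_normal((n_noise, n_r))
                                   + 1j * rng.standard_normal((n_noise, n_r)))
    total = 0.0
    for n in noise:
        # d[m, k] = (||H G (x_m - x_k) + n||^2 - ||n||^2) / sigma2
        d = (np.sum(np.abs(diff + n[:, None, None]) ** 2, axis=0)
             - np.sum(np.abs(n) ** 2)) / sigma2
        # log sum_k exp(-d[m, k]), computed stably by factoring out the minimum
        dmin = d.min(axis=1, keepdims=True)
        lse = -dmin[:, 0] + np.log(np.sum(np.exp(-(d - dmin)), axis=1))
        total += np.mean(lse) / np.log(2)  # average over m, convert to bits
    return n_t * np.log2(M) - total / n_noise

H1 = np.array([[2, 1], [1, 1]], dtype=complex)
mi_high = mutual_information(H1, np.eye(2), M=2, sigma2=0.01)   # low noise
mi_low = mutual_information(H1, np.eye(2), M=2, sigma2=100.0)   # high noise
```

With little noise the estimate approaches the ceiling $N_{t}\log_{2}M$, and with heavy noise it falls toward zero. The cost grows with $M^{2N_{t}}$, which is one reason evaluating (3) and its gradient directly is expensive.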

Nevertheless, it is difficult to obtain the channel model $\mathbf{H}$ [12], and an inaccurate channel model may cause performance loss [8].

III Algorithm Design

The MIMO communication system based on the AE presented in this paper includes three parts: the transmitter, the channel, and the receiver, as shown in Fig. 1. First, the transmitter part is equivalent to the signal modulation and precoding process. Second, the channel corresponds to a random system [9], whose output $\mathbf{y}$ follows $p(\mathbf{y}|\mathbf{x})$, i.e., a conditional probability distribution w.r.t. the input $\mathbf{x}$. In the AE, it can be regarded as a layer of untrainable network parameters, which transmits signals to the receiver through forward propagation. Finally, the receiver recovers the transmitted information from the received signal through NNs. In this paper, both the precoding network and the receiver network are optimized by minimizing the loss function.

Figure 1: A MIMO communication system based on an AE in model-free training. The dotted box indicates that the relaxation module exists only when training precoders.

III-A Model-free Training Process Overview of Precoders Design based on AE

In the proposed method, we regard the precoder and the receiver as two independent parameterized functions to optimize: (i) the precoder is represented by the function $f_{\boldsymbol{\theta}_{P}}^{(P)}(\mathbf{x})=\boldsymbol{\theta}_{P}\mathbf{x}$, where $\boldsymbol{\theta}_{P}=\begin{bmatrix}\mathrm{Re}(\mathbf{G})&-\mathrm{Im}(\mathbf{G})\\ \mathrm{Im}(\mathbf{G})&\mathrm{Re}(\mathbf{G})\end{bmatrix}$ is the parameter matrix of the precoder; (ii) the receiver is implemented as $f_{\boldsymbol{\theta}_{R}}^{(R)}:\mathbb{C}^{N_{t}}\rightarrow\{\mathbf{p}\in\mathbb{R}_{+}^{|\mathcal{M}|}\mid\sum_{i=1}^{|\mathcal{M}|}p_{i}=1\}$, where $\boldsymbol{\theta}_{R}$ is the parameter vector of the receiver, and $\mathbf{p}$ is the probability vector over the transmitted information. Since the NN implementation is limited to real numbers, the real and imaginary parts of the complex signals involved need to be reshaped into real-valued vectors before further processing.
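The real-valued parameterization of the precoder can be checked numerically. The sketch below assumes the complex vector is stacked as $[\mathrm{Re}(\mathbf{x});\mathrm{Im}(\mathbf{x})]$, which is the ordering that matches the block structure of $\boldsymbol{\theta}_{P}$; the helper names are introduced here for illustration.

```python
import numpy as np

def to_real_matrix(G):
    """Build theta_P = [[Re G, -Im G], [Im G, Re G]] from a complex G."""
    return np.block([[G.real, -G.imag], [G.imag, G.real]])

def to_real_vector(x):
    """Stack a complex vector as [Re(x); Im(x)]."""
    return np.concatenate([x.real, x.imag])

rng = np.random.default_rng(0)
G = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
x = np.array([1 + 0j, -1 + 0j])             # one BPSK input vector
lhs = to_real_matrix(G) @ to_real_vector(x)  # real-valued precoding
rhs = to_real_vector(G @ x)                  # complex precoding, then stacking
```

Here `lhs` and `rhs` agree, so a bias-free real linear layer with this structure reproduces the complex product $\mathbf{G}\mathbf{x}$. Note also that the squared Frobenius norm of $\boldsymbol{\theta}_{P}$ equals $2\,\mathrm{Tr}\{\mathbf{G}^{\mathsf{H}}\mathbf{G}\}$, which is convenient when enforcing the power constraint on the real parameters.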

Considering that the channel follows a conditional probability distribution w.r.t. the channel input, the loss function of the system can be expressed as:

$$\begin{split}\mathcal{L}(\boldsymbol{\theta}_{P},\boldsymbol{\theta}_{R})=\;&\mathbb{E}_{\mathbf{x}}\left\{\int l(f_{\boldsymbol{\theta}_{R}}^{(R)}(\mathbf{y}),\mathbf{x})\,p(\mathbf{y}|\bar{\mathbf{x}})\,d\mathbf{y}\right\}\\ \approx\;&\frac{1}{S}\sum_{i=1}^{S}l(f_{\boldsymbol{\theta}_{R}}^{(R)}(\mathbf{y}^{(i)}),\mathbf{x}^{(i)}),\end{split} \tag{4}$$

where $\bar{\mathbf{x}}=f_{\boldsymbol{\theta}_{P}}^{(P)}(\mathbf{x})$ represents the precoded signal, $\mathbf{y}\in\mathbb{R}^{2N_{t}\times 1}$ denotes the received signal, i.e., the channel output, $l(f_{\boldsymbol{\theta}_{R}}^{(R)}(\mathbf{y}^{(i)}),\mathbf{x}^{(i)})$ is the sample loss function defined as the categorical cross-entropy (CE) [9, 7] between the $i$th input signal $\mathbf{x}^{(i)}$ and the receiver output for the $i$th received signal $\mathbf{y}^{(i)}$, and $S$ is the batch size of training samples. The approximation in (4) means that the sample mean is used to estimate the mathematical expectation.

Then, the corresponding optimization problem is formulated as:

$$\begin{split}&\underset{\boldsymbol{\theta}_{P},\boldsymbol{\theta}_{R}}{\mathrm{minimize}}\quad\mathcal{L}(\boldsymbol{\theta}_{P},\boldsymbol{\theta}_{R})\\ &\quad\mathrm{s.t.}\quad\mathrm{Tr}\left\{\mathbf{G}^{\mathsf{H}}\mathbf{G}\right\}\leq N_{t}.\end{split} \tag{5}$$

When Softmax is adopted as the activation function of the output layer, inspired by [7], we have

$$\begin{split}\mathcal{L}(\boldsymbol{\theta}_{P},\boldsymbol{\theta}_{R})&=\mathbb{E}_{\mathbf{y}}\left\{-\sum_{m=1}^{|\mathcal{M}|}p(\mathbf{x}=\mathbf{x}_{m}|\mathbf{y})\log_{2}(f_{\boldsymbol{\theta}_{R}}^{(R)}(\mathbf{y})_{m})\right\}\\ &\geq\mathcal{H}(\mathbf{x}|\mathbf{y}).\end{split} \tag{6}$$

Since $\mathcal{I}(\mathbf{x};\mathbf{y})=\mathcal{H}(\mathbf{x})-\mathcal{H}(\mathbf{x}|\mathbf{y})$ and $\mathcal{H}(\mathbf{x})$ is fixed by the equiprobable inputs, the problem in (5) is nearly the same as the problem in (2), i.e., the precoder obtained by training the AE can also maximize the mutual information.
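The step from the cross-entropy to the conditional entropy in (6) can be spelled out as follows. This is a short derivation under the equiprobable-input assumption of Section II; $D_{\mathrm{KL}}$ denotes the Kullback-Leibler divergence, a symbol introduced here for illustration, and $f(\mathbf{y})_{m}$ abbreviates $f_{\boldsymbol{\theta}_{R}}^{(R)}(\mathbf{y})_{m}$.

```latex
\begin{align*}
\mathcal{L}
&= \mathbb{E}_{\mathbf{y}}\Big\{-\sum_{m} p(\mathbf{x}_{m}|\mathbf{y})\log_{2} p(\mathbf{x}_{m}|\mathbf{y})\Big\}
 + \mathbb{E}_{\mathbf{y}}\Big\{\sum_{m} p(\mathbf{x}_{m}|\mathbf{y})\log_{2}\frac{p(\mathbf{x}_{m}|\mathbf{y})}{f(\mathbf{y})_{m}}\Big\} \\
&= \mathcal{H}(\mathbf{x}|\mathbf{y})
 + \mathbb{E}_{\mathbf{y}}\big\{D_{\mathrm{KL}}\big(p(\cdot|\mathbf{y})\,\|\,f(\mathbf{y})\big)\big\}
 \;\geq\; \mathcal{H}(\mathbf{x}|\mathbf{y}), \\
\mathcal{I}(\mathbf{x};\mathbf{y})
&= N_{t}\log_{2}M - \mathcal{H}(\mathbf{x}|\mathbf{y})
 \;\geq\; N_{t}\log_{2}M - \mathcal{L}.
\end{align*}
```

The gap is the expected divergence between the true posterior and the receiver output, so it vanishes only when the receiver reproduces the posterior exactly. In that sense, $N_{t}\log_{2}M-\mathcal{L}$ acts as a lower bound on the mutual information that both the receiver and the precoder push upward.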

The training process requires the gradient of the loss function, i.e., $\nabla_{(\boldsymbol{\theta}_{R},\boldsymbol{\theta}_{P})}\mathcal{L}=[(\nabla_{\boldsymbol{\theta}_{R}}\mathcal{L})^{\mathsf{T}},(\nabla_{\boldsymbol{\theta}_{P}}\mathcal{L})^{\mathsf{T}}]^{\mathsf{T}}$. To enable the training of the proposed framework in the absence of the channel model, inspired by [9], we present an alternating training method that uses the exact gradient for the receiver and the approximated gradient for the precoder, as shown in Algorithm 1. It should be noted that since the loss function in this algorithm is defined as a categorical CE function, the transmitted information should be one-hot encoded and then mapped to the corresponding discrete constellation points, as shown in Fig. 1. We adopt the Adam optimizer [13] for training.

Algorithm 1 Autoencoder Training
1:  Initialization: Set $\boldsymbol{\theta}_{P}$ and $\boldsymbol{\theta}_{R}$ to some random values, such that $\mathrm{Tr}\{\mathbf{G}^{\mathsf{H}}\mathbf{G}\}\leq N_{t}$.
2:  while stop criterion not met do
3:     Train the Receiver;
4:     Train the Precoder;
5:  end while

III-B Receiver Training

According to (4), the gradient of $\mathcal{L}$ w.r.t. $\boldsymbol{\theta}_{R}$ can be expressed as:

$$\begin{split}\nabla_{\boldsymbol{\theta}_{R}}\mathcal{L}&=\mathbb{E}_{\mathbf{x},\mathbf{y}}\left\{\nabla_{\boldsymbol{\theta}_{R}}l(f_{\boldsymbol{\theta}_{R}}^{(R)}(\mathbf{y}),\mathbf{x})\right\}\\ &\approx\frac{1}{S}\sum_{i=1}^{S}\nabla_{\boldsymbol{\theta}_{R}}l(f_{\boldsymbol{\theta}_{R}}^{(R)}(\mathbf{y}^{(i)}),\mathbf{x}^{(i)}).\end{split} \tag{7}$$

It can be seen that there is no need to acquire the channel model, as the calculation process only needs to sample the received signal.

Algorithm 2 Receiver Training
1:  for $i=1$ to $\mathrm{iter}_{\max}$ do
2:     Generate $S$ independent input information $\mathbf{M}$;
3:     Map $\mathbf{M}$ to finite alphabet signal $\mathbf{X}$;
4:     $\bar{\mathbf{X}}\leftarrow f_{\boldsymbol{\theta}_{P}}^{(P)}(\mathbf{X})$;
5:     Send $\bar{\mathbf{X}}$ to channel and get $\mathbf{Y}$;
6:     $\hat{\mathbf{M}}\leftarrow f_{\boldsymbol{\theta}_{R}}^{(R)}(\mathbf{Y})$;
7:     Compute loss $\mathcal{L}$ and $\nabla_{\boldsymbol{\theta}_{R}}\mathcal{L}$;
8:     Update $\boldsymbol{\theta}_{R}$ with one Adam step;
9:  end for
10:  Output: $\boldsymbol{\theta}_{R}$

Algorithm 2 is the detailed algorithm for training the receiver. First, the signal source randomly generates a batch of one-hot transmitted information $\mathbf{M}$, which is a $|\mathcal{M}|$-by-$S$ matrix, and maps it to the finite alphabet input signals $\mathbf{X}$. Next, the signals are multiplied by $\boldsymbol{\theta}_{P}$ to obtain the precoded signals $\bar{\mathbf{X}}$, and then $\bar{\mathbf{X}}$ enters the channel for transmission. The receiver acquires the channel output $\mathbf{Y}$. After $\mathbf{Y}$ is fed to the receiver network, a probability matrix $\hat{\mathbf{M}}\in\mathbb{R}^{|\mathcal{M}|\times S}$ over the transmitted information is produced and used to calculate the loss $\mathcal{L}$. Finally, $\boldsymbol{\theta}_{R}$ is updated by one step using $\nabla_{\boldsymbol{\theta}_{R}}\mathcal{L}$ from (7). As the channel model is unknown, lines 4 and 5 in Algorithm 2 only carry out forward propagation and do not need to record the gradient.
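To make the receiver update concrete, the sketch below performs lines 6 to 8 of Algorithm 2 for a deliberately simplified receiver. Two substitutions are assumptions of this example and not the setup of the paper: the receiver is a single linear layer followed by Softmax instead of the multi-layer network of Section III-D, and a plain gradient-descent step replaces the Adam step.

```python
import numpy as np

def softmax(z):
    """Column-wise Softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def receiver_step(W, b, Y, M_onehot, lr=0.05):
    """One exact-gradient step on the categorical CE (in bits), as in (7).

    Y: real-valued received batch (features x S);
    M_onehot: one-hot transmitted information (|M| x S).
    W and b are updated in place; the loss before the update is returned.
    """
    S = Y.shape[1]
    P = softmax(W @ Y + b)                      # probability matrix M_hat
    loss = -np.mean(np.sum(M_onehot * np.log2(P + 1e-12), axis=0))
    dZ = (P - M_onehot) / (S * np.log(2))       # gradient w.r.t. the logits
    W -= lr * dZ @ Y.T
    b -= lr * dZ.sum(axis=1, keepdims=True)
    return loss
```

Only the received batch `Y` and the known labels enter the update, which mirrors the observation that the receiver gradient can be computed without any channel model.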

III-C Precoder Training

According to (4), the gradient of $\mathcal{L}$ w.r.t. $\boldsymbol{\theta}_{P}$ is:

$$\nabla_{\boldsymbol{\theta}_{P}}\mathcal{L}=\mathbb{E}_{\mathbf{x}}\left\{\int l(f_{\boldsymbol{\theta}_{R}}^{(R)}(\mathbf{y}),\mathbf{x})\nabla_{\bar{\mathbf{x}}}p(\mathbf{y}|\bar{\mathbf{x}})\nabla_{\boldsymbol{\theta}_{P}}f_{\boldsymbol{\theta}_{P}}^{(P)}(\mathbf{x})\,d\mathbf{y}\right\}. \tag{8}$$

Since the channel model $p(\mathbf{y}|\bar{\mathbf{x}})$ is unknown and $\nabla_{\bar{\mathbf{x}}}p(\mathbf{y}|\bar{\mathbf{x}})$ cannot be calculated, inspired by [9], we resort to another approach that relaxes the channel input $\bar{\mathbf{x}}$ into a random variable $\tilde{\mathbf{x}}$, which follows the distribution $\pi_{\bar{\mathbf{x}}}=\delta(\tilde{\mathbf{x}}-\bar{\mathbf{x}})$, where $\delta$ refers to the delta distribution. The position of the relaxation operation in the algorithm is shown in the dotted box in Fig. 1. Then, the relaxed system loss can be expressed as:

$$\hat{\mathcal{L}}=\mathbb{E}_{\mathbf{x}}\left\{\iint l(f_{\boldsymbol{\theta}_{R}}^{(R)}(\mathbf{y}),\mathbf{x})\,p(\mathbf{y}|\tilde{\mathbf{x}})\,\pi_{\bar{\mathbf{x}}}(\tilde{\mathbf{x}})\,d\tilde{\mathbf{x}}\,d\mathbf{y}\right\}. \tag{9}$$

Besides, $\nabla_{\boldsymbol{\theta}_{P}}\mathcal{L}$ can be approximated by the following expression:

$$\begin{split}\nabla_{\boldsymbol{\theta}_{P}}\hat{\mathcal{L}}&=\mathbb{E}_{\mathbf{x}}\left\{\iint l(f_{\boldsymbol{\theta}_{R}}^{(R)}(\mathbf{y}),\mathbf{x})\,p(\mathbf{y}|\tilde{\mathbf{x}})\,\nabla_{\boldsymbol{\theta}_{P}}\pi_{\bar{\mathbf{x}}}(\tilde{\mathbf{x}})\,d\tilde{\mathbf{x}}\,d\mathbf{y}\right\}\\ &=\mathbb{E}_{\mathbf{x},\tilde{\mathbf{x}},\mathbf{y}}\left\{l(f_{\boldsymbol{\theta}_{R}}^{(R)}(\mathbf{y}),\mathbf{x})\nabla_{\boldsymbol{\theta}_{P}}f_{\boldsymbol{\theta}_{P}}^{(P)}(\mathbf{x})\nabla_{\bar{\mathbf{x}}}\log\pi_{\bar{\mathbf{x}}}(\tilde{\mathbf{x}})\right\}\\ &\approx\frac{1}{S}\sum_{i=1}^{S}l(f_{\boldsymbol{\theta}_{R}}^{(R)}(\mathbf{y}^{(i)}),\mathbf{x}^{(i)})\cdot\nabla_{\boldsymbol{\theta}_{P}}\log(\hat{\pi}_{\bar{\mathbf{x}},\sigma_{\pi}}(\tilde{\mathbf{x}}^{(i)}))\big|_{\bar{\mathbf{x}}=f_{\boldsymbol{\theta}_{P}}^{(P)}(\mathbf{x}^{(i)})},\end{split} \tag{10}$$

where $\hat{\pi}_{\bar{\mathbf{x}},\sigma_{\pi}}$ represents a differentiable approximation of the probability distribution $\pi_{\bar{\mathbf{x}}}$ with variance $\sigma_{\pi}^{2}$, which is needed because $\pi_{\bar{\mathbf{x}}}$ itself is non-differentiable. As such, the function that must be differentiated w.r.t. $\boldsymbol{\theta}_{P}$ is transformed from the unknown channel model $p(\mathbf{y}|\mathbf{x})$ to the known distribution $\hat{\pi}$. Through the estimated gradient in (10), we can complete the training process of the precoders without the need of the channel model.
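For a Gaussian choice of $\hat{\pi}$, the last line of (10) has a simple closed form. The sketch below assumes the real-valued relaxation $\tilde{\mathbf{x}}=\sqrt{1-\sigma_{\pi}^{2}}\,\boldsymbol{\theta}_{P}\mathbf{x}+\mathbf{w}$ with $\mathbf{w}\sim\mathcal{N}(0,\sigma_{\pi}^{2}\mathbf{I})$, the form used in the numerical section; the function name and argument layout are hypothetical, and no variance-reduction baseline is applied.

```python
import numpy as np

def precoder_gradient(theta, X, X_tilde, losses, sigma_pi):
    """Sample estimate of the relaxed gradient in (10) for a Gaussian pi-hat.

    theta: real precoder matrix (2N_t x 2N_t); X: real inputs (2N_t x S);
    X_tilde: relaxed channel inputs that were actually sent (2N_t x S);
    losses: per-sample losses l observed at the receiver (length S).
    """
    a = np.sqrt(1.0 - sigma_pi ** 2)
    S = X.shape[1]
    resid = X_tilde - a * theta @ X   # recovers the injected perturbation w
    # For log N(x_tilde; a*theta*x, sigma^2 I), the gradient w.r.t. theta
    # is (a / sigma^2) * resid * x^T; weight it by the loss and average.
    return (a / sigma_pi ** 2) * (resid * losses) @ X.T / S
```

The estimator needs only the transmitted perturbations and the scalar losses reported by the receiver. As a sanity check, for a loss that is linear in $\tilde{\mathbf{x}}$, say $l=\mathbf{c}^{\mathsf{T}}\tilde{\mathbf{x}}$, its expectation equals the true gradient $\sqrt{1-\sigma_{\pi}^{2}}\,\mathbf{c}\mathbf{x}^{\mathsf{T}}$.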

Algorithm 3 Precoder Training
1:  for $i=1$ to $\mathrm{iter}_{\max}$ do
2:     Generate $S$ independent input information $\mathbf{M}$;
3:     Map $\mathbf{M}$ to finite alphabet signal $\mathbf{X}$;
4:     $\bar{\mathbf{X}}\leftarrow f_{\boldsymbol{\theta}_{P}}^{(P)}(\mathbf{X})$;
5:     Relax $\bar{\mathbf{X}}$ to random variable $\tilde{\mathbf{X}}$;
6:     Send $\tilde{\mathbf{X}}$ to channel and acquire $\mathbf{Y}$;
7:     $\hat{\mathbf{M}}\leftarrow f_{\boldsymbol{\theta}_{R}}^{(R)}(\mathbf{Y})$;
8:     Compute per-sample loss $l$ and $\nabla_{\boldsymbol{\theta}_{P}}\hat{\mathcal{L}}$;
9:     Update $\boldsymbol{\theta}_{P}$ with one Adam step;
10:     Scale $\boldsymbol{\theta}_{P}$, such that $\mathrm{Tr}\{\mathbf{G}^{\mathsf{H}}\mathbf{G}\}\leq N_{t}$;
11:  end for
12:  Output: $\boldsymbol{\theta}_{P}$

Algorithm 3 is the detailed algorithm for training the precoder. The signal source randomly generates the transmitted information $\mathbf{M}$ and maps it to the input signals $\mathbf{X}$. After precoding, $\bar{\mathbf{X}}$ is obtained. For the approximated gradient calculation, it is necessary to relax $\bar{\mathbf{X}}$ into random variables $\tilde{\mathbf{X}}$, which are then sent to the channel for transmission to obtain the received signals $\mathbf{Y}$. The receiver then establishes the probability matrix of the received signals over the transmitted information and calculates the per-sample loss $l$. However, $l$ is not differentiated directly w.r.t. $\boldsymbol{\theta}_{P}$; as shown in (10), $\hat{\pi}_{\bar{\mathbf{x}},\sigma_{\pi}}$ is used to calculate the approximated gradient instead. Then, $\boldsymbol{\theta}_{P}$ is updated by one step through $\nabla_{\boldsymbol{\theta}_{P}}\hat{\mathcal{L}}$ and rescaled so that it still satisfies the power constraint of the precoder. In order to ensure that $\nabla_{\boldsymbol{\theta}_{P}}\hat{\mathcal{L}}$ can be estimated and that the receiver parameters are not affected, lines 6 and 7 in Algorithm 3 only carry out forward propagation and do not need to record the gradient. In other words, the channel in this algorithm is only used to observe its input and the corresponding output, and the complete channel model itself does not participate in the precoder design process.
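The scaling step of Algorithm 3 can be realized by a simple projection onto the power constraint. The sketch below is one possible realization and assumes that $\boldsymbol{\theta}_{P}$ keeps the block structure $\begin{bmatrix}\mathrm{Re}(\mathbf{G})&-\mathrm{Im}(\mathbf{G})\\ \mathrm{Im}(\mathbf{G})&\mathrm{Re}(\mathbf{G})\end{bmatrix}$, so that its squared Frobenius norm equals $2\,\mathrm{Tr}\{\mathbf{G}^{\mathsf{H}}\mathbf{G}\}$; the function name is hypothetical.

```python
import numpy as np

def scale_precoder(theta, n_t):
    """Rescale theta_P so that Tr{G^H G} <= N_t; feasible inputs are unchanged."""
    # With the real block structure, ||theta_P||_F^2 = 2 * Tr{G^H G}.
    power = 0.5 * np.sum(theta ** 2)
    if power > n_t:
        theta = theta * np.sqrt(n_t / power)
    return theta
```

Rescaling only when the constraint is violated leaves the direction of the update intact and lets the precoder use less than the full power budget if training prefers it.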

Figure 2: Network structure: (a) Precoder structure; (b) Receiver structure, where the blue and orange boxes indicate the “equalization” network and the decision network, respectively.

III-D Network Structure

Fig. 2 shows the network structure of the algorithm proposed in this paper, where Fig. 2(a) is the precoding part and Fig. 2(b) is the receiver part. As shown in the figure, the precoding matrix $\mathbf{G}$ is implemented by a linear layer without bias, whose parameters are $\boldsymbol{\theta}_{P}$. The network structure of the receiver is mainly composed of two parts: the first one is the “equalization” network, which consists of two linear layers with a $\tanh$ activation function between them; the second one is the decision network, which is composed of multiple linear layers with ReLU activations between them and Softmax at the output layer. It should be noted that since the proposed method is used in the case of unknown channel models, the receiver may show poor performance [9] without prior information. Therefore, in order to improve the accuracy of the receiver, the “equalization” network is added to extract part of the implicit prior channel information $\hat{\mathbf{H}}\in\mathbb{C}^{N_{t}\times S}$ from the received signals. Then, it calculates the product of $\frac{1}{\|\hat{\mathbf{H}}\|_{2}^{2}}\begin{bmatrix}\mathrm{Re}(\hat{\mathbf{H}})&\mathrm{Im}(\hat{\mathbf{H}})\\ -\mathrm{Im}(\hat{\mathbf{H}})&\mathrm{Re}(\hat{\mathbf{H}})\end{bmatrix}$ and the received signal $\mathbf{y}$, similar to the channel equalization in traditional communication systems. Next, the received signals after “equalization” are used to establish the probability matrix $\hat{\mathbf{M}}$ over the input information through the decision network, and finally the loss function and the corresponding gradient are calculated to train the whole network. It should be noted that $\hat{\mathbf{H}}$ is just an intermediate variable of the receiver, which does not correspond strictly to the true channel model. Instead, it can be regarded as implicit channel information, which explains why such a structure can improve the performance of the receiver. The method itself does not need to assume the channel model in advance.

III-E Improvement and Some Applications of the Proposed Linear Precoders Design Method

As described in Section III-A, when Softmax and the categorical CE are adopted, the sizes of the layers in the AE grow exponentially with $N_{t}$, which is very expensive in large systems. Therefore, inspired by [7], we use $N_{t}\log_{2}(M)$ bits to represent one training sample. Then, the loss function and the activation function at the last layer of the receiver are adjusted to the binary CE function and Sigmoid accordingly. Such a choice may make the process of training the AE equivalent to maximizing a lower bound of the mutual information [7]. Even so, the proposed method is easier to generalize to more complex scenarios, such as MIMO multiple access channels (MAC) with finite discrete inputs, MIMO orthogonal frequency-division multiplexing (MIMO-OFDM) systems, etc. The accurate channel models of these systems are complex, while the channel model assumptions in traditional research methods are relatively simple, or the channel model contains some non-differentiable components (such as preamble insertion in MIMO-OFDM), so the simulation results may differ greatly from the actual performance. The proposed method, which does not require knowledge of an accurate channel model, would have inherent advantages in these systems and may alleviate this problem to some extent.
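The saving from bit-wise labels can be illustrated with a small sketch. It compares the output sizes for an assumed configuration ($N_{t}=4$ with QPSK) and shows one way to build the bit targets and evaluate the binary CE; the helper names and the bit ordering (most significant bit first) are assumptions of this example.

```python
import numpy as np

def bit_labels(indices, n_bits):
    """Binary targets (n_bits x S) for integer message indices, MSB first."""
    shifts = np.arange(n_bits - 1, -1, -1)[:, None]
    return ((indices[None, :] >> shifts) & 1).astype(float)

def binary_cross_entropy(P, B, eps=1e-12):
    """Binary CE in bits, summed over bit positions and averaged over the batch."""
    per_bit = -(B * np.log2(P + eps) + (1 - B) * np.log2(1 - P + eps))
    return per_bit.sum(axis=0).mean()

n_t, M = 4, 4                        # e.g. 4 transmit antennas with QPSK
n_bits = int(n_t * np.log2(M))       # 8 Sigmoid outputs
n_classes = M ** n_t                 # 256 Softmax outputs

B = bit_labels(np.array([0, 5, 255]), n_bits)
uninformed = binary_cross_entropy(np.full(B.shape, 0.5), B)  # n_bits bits
```

The Sigmoid output layer grows linearly in $N_{t}$ while the Softmax layer grows as $M^{N_{t}}$. A receiver that outputs 0.5 for every bit incurs a loss of exactly $N_{t}\log_{2}(M)$ bits, the same ceiling as the one-hot formulation.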

For example, in a $K$-user MIMO MAC communication system, consider the signal model:

$$\mathbf{y}=\mathbf{H}_{1}\mathbf{G}_{1}\mathbf{x}_{1}+\mathbf{H}_{2}\mathbf{G}_{2}\mathbf{x}_{2}+\cdots+\mathbf{H}_{K}\mathbf{G}_{K}\mathbf{x}_{K}+\mathbf{n}, \tag{11}$$

where $\mathbf{H}_{i}\in\mathbb{C}^{N_{r}\times N_{t}}$ represents the complex channel matrix between the $i$th transmitter and the receiver; $\mathbf{G}_{i}\in\mathbb{C}^{N_{t}\times N_{t}}$ is the precoding matrix of the $i$th user; $\mathbf{x}=[\mathbf{x}_{1}^{\mathsf{T}},\mathbf{x}_{2}^{\mathsf{T}},\cdots,\mathbf{x}_{K}^{\mathsf{T}}]^{\mathsf{T}}$ contains the signals of all transmitters, where the $\mathbf{x}_{i}$ of different users are assumed to be independent of each other; and $\mathbf{n}\in\mathbb{C}^{N_{r}\times 1}$ is the receiver noise with $\mathbf{n}\sim\mathcal{CN}(\mathbf{0},\sigma^{2}\mathbf{I})$. There are $N_{r}$ antennas at the receiver and each user has $N_{t}$ transmit antennas.

According to [14], the boundary of the constellation-constrained capacity region can be characterized by the solution of the sum rate optimization problem. Then, the proposed precoder design method can be applied as follows: i) $N_{t}K\log_{2}(M)$ bits are used to represent one training target; ii) the parameter $\boldsymbol{\theta}_{P}$ of the precoders is adjusted to $\mathrm{Bdiag}(\boldsymbol{\theta}_{P_{1}},\cdots,\boldsymbol{\theta}_{P_{K}})$, where $\boldsymbol{\theta}_{P_{i}}=\begin{bmatrix}\mathrm{Re}(\mathbf{G}_{i})&-\mathrm{Im}(\mathbf{G}_{i})\\ \mathrm{Im}(\mathbf{G}_{i})&\mathrm{Re}(\mathbf{G}_{i})\end{bmatrix}$; iii) the receiver function correspondingly becomes $f_{\boldsymbol{\theta}_{R}}^{(R)}:\mathbb{R}^{2N_{t}K}\rightarrow\mathbb{R}^{N_{t}K\log_{2}(M)}$. Since the channel model of each user is unknown, the receiver and precoder parts still need the true and estimated gradients for training, respectively. It should be noted that because the optimal precoders of different users depend on each other [14], we should iteratively optimize one user’s precoder at a time with the others fixed.
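The block-diagonal parameter in step ii) can be assembled as in the sketch below. It assumes that each user's signal is stacked as $[\mathrm{Re}(\mathbf{x}_{i});\mathrm{Im}(\mathbf{x}_{i})]$ and that the users' real vectors are concatenated in order; the helper names and the random example precoders are introduced for illustration only.

```python
import numpy as np

def to_real_matrix(G):
    """theta_{P_i} = [[Re G_i, -Im G_i], [Im G_i, Re G_i]]."""
    return np.block([[G.real, -G.imag], [G.imag, G.real]])

def block_diag(blocks):
    """Place square blocks along the diagonal of an otherwise zero matrix."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    pos = 0
    for b in blocks:
        k = b.shape[0]
        out[pos:pos + k, pos:pos + k] = b
        pos += k
    return out

rng = np.random.default_rng(0)
K, n_t = 3, 2
Gs = [rng.standard_normal((n_t, n_t)) + 1j * rng.standard_normal((n_t, n_t))
      for _ in range(K)]
xs = [np.array([1, -1], dtype=complex) for _ in range(K)]   # BPSK per user

theta_P = block_diag([to_real_matrix(G) for G in Gs])        # (2*n_t*K) square
x_real = np.concatenate([np.concatenate([x.real, x.imag]) for x in xs])
precoded = theta_P @ x_real
```

The zero off-diagonal blocks encode that each user precodes only its own signal, so updating one diagonal block while freezing the others corresponds to optimizing one user's precoder at a time.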

Figure 3: Mutual information for Gaussian and finite alphabet channel inputs. (a) Comparison under channel $\mathbf{H}_{1}$, BPSK; (b) Comparison under channel $\mathbf{H}_{2}$, BPSK; (c) Comparison under channel $\mathbf{H}_{1}$, QPSK; (d) Comparison under channel $\mathbf{H}_{2}$, QPSK.

IV Numerical Results

In this section, simulation results are provided to evaluate the performance of the proposed method of linear precoders design, in terms of maximizing the mutual information and the impacts on the BER of the MIMO communication system, compared with some existing design approaches.

In the receiver part of the AE, the decision network is parameterized with 3 dense layers of size 128 each in this experiment, while the “equalization” part has 2 dense layers of size 64 each. The maximum number of iterations of Algorithm 1 is set to 5000 and the batch size is $S=32$. The learning rates of the Adam steps for the receiver and precoder training are both set to $10^{-4}$. Also, the SNR is defined as $\frac{\mathrm{Tr}\{\mathbf{H}^{\mathsf{H}}\mathbf{H}\}}{N_{t}\sigma^{2}}$.
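Given the SNR definition above, the noise variance used in a simulation follows by inverting the formula. The short sketch below does this for an example $2\times 2$ channel; the function name `noise_variance` is hypothetical.

```python
import numpy as np

def noise_variance(H, snr_db):
    """Solve SNR = Tr{H^H H} / (N_t * sigma^2) for sigma^2."""
    n_t = H.shape[1]
    snr = 10.0 ** (snr_db / 10.0)
    return np.trace(H.conj().T @ H).real / (n_t * snr)

H1 = np.array([[2, 1], [1, 1]], dtype=complex)
sigma2 = noise_variance(H1, snr_db=0.0)   # Tr{H^H H} = 7 and N_t = 2
```

For this channel the trace is 7, so an SNR of 0 dB corresponds to $\sigma^{2}=3.5$ and every additional 10 dB divides the noise variance by ten.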

When training the precoder, we utilize a normal distribution with variance $\sigma_{\pi}^{2}$ centered on the scaled precoded signal to relax the channel input, i.e., $\tilde{\mathbf{x}}=\sqrt{1-\sigma_{\pi}^{2}}\,\bar{\mathbf{x}}+\mathbf{w}$, where $\mathbf{w}\sim\mathcal{N}(0,\sigma_{\pi}^{2}\mathbf{I})$ and the variance $\sigma_{\pi}^{2}$ is selected to suit different scenarios. The smaller the variance is, the more accurate the estimated gradient can be, but the slower the convergence, which is reflected in the fluctuation of the mutual information in the low SNR region. However, when the variance is larger, the algorithm may converge to a local optimum, and the mutual information in the high SNR regime suffers a significant loss. Therefore, in this paper, $\sigma_{\pi}$ is selected from $[0.01,0.15]$.
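The relaxation itself is a one-line sampling step. The sketch below implements the expression above on real-valued precoded signals; the function name `relax` and the example values are assumptions.

```python
import numpy as np

def relax(X_bar, sigma_pi, rng):
    """Draw x_tilde = sqrt(1 - sigma_pi^2) * x_bar + w, w ~ N(0, sigma_pi^2 I)."""
    a = np.sqrt(1.0 - sigma_pi ** 2)
    W = sigma_pi * rng.standard_normal(X_bar.shape)
    return a * X_bar + W

rng = np.random.default_rng(0)
X_bar = np.array([[1.0, -1.0, 1.0], [1.0, 1.0, -1.0]])  # toy precoded batch
X_tilde = relax(X_bar, sigma_pi=0.15, rng=rng)
```

The factor $\sqrt{1-\sigma_{\pi}^{2}}$ compensates for the power added by the perturbation: for an entry of $\bar{\mathbf{x}}$ with unit power, the relaxed entry has expected power $(1-\sigma_{\pi}^{2})+\sigma_{\pi}^{2}=1$.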

Fig. 3 shows the mutual information results of several precoder design methods with different channels and modulation orders, where $\mathbf{H}_{1}=\begin{bmatrix}2&1\\ 1&1\end{bmatrix}$ and $\mathbf{H}_{2}=\begin{bmatrix}1&0.5j&0.3\\ -0.5j&1.5&-0.1j\\ 0.3&0.1j&0.5\end{bmatrix}$. We regard the DNN-based precoder design scheme with the complete channel model proposed by [7] as the model-aware method, and compare it with the model-free method proposed in this paper. Fig. 3 shows that, in the low SNR region, the mutual information maximization problem reduces approximately to a power allocation problem. As the SNR increases, the power allocation method MWF suffers a large loss because it gives up the part of the search space related to the right singular matrix of the precoder. In this case, the AE-based precoder design methods show a large gain, and the model-free method is almost consistent with the model-aware method, since the estimated gradient can be brought close to the true one by adjusting $\sigma_{\pi}$. It can also be seen that the performance of the model-free method with Sigmoid is slightly worse than that with Softmax, because it maximizes a lower bound of the mutual information rather than the mutual information itself.

Figure 4: BER with different precoding methods under 𝐇1\mathbf{H}_{1}, BPSK.

Fig. 4 shows the BER performance of the proposed method with $\mathbf{H}_{1}$ and BPSK modulation. In the simulation, the (648, 486) low-density parity-check (LDPC) code of IEEE 802.11 is adopted for error correction and the maximum a posteriori probability (MAP) criterion is used for detection. The number of iterations between the MAP detector and the LDPC decoder is 5. Both the model-free and the model-aware methods are trained at SNR $=2$ dB. We observe that the model-free methods, with either the Softmax or the Sigmoid activation function, closely approach the BER of the model-aware method, and all of them achieve a significant gain over MWF and over the case without a precoder. This is consistent with the conclusion drawn from the mutual information results.

V Conclusion

In this paper, we solved the linear precoding design problem for finite alphabet inputs in MIMO systems without a channel model. Taking advantage of model-free training of the AE-based network, the receiver and the precoder were trained alternately using the exact and the estimated gradients, respectively. The proposed method obtained not only the precoder that maximizes the mutual information between the channel inputs and outputs, but also the corresponding receiver, alleviating the strict requirement of existing methods for perfect channel state information. The simulation results showed that this model-free design method matched the performance of the design method with a complete channel model in terms of mutual information and bit error rate, offering practical insights for linear MIMO precoder design.

References

  • [1] N. Fatema, G. Hua, Y. Xiang, D. Peng, and I. Natgunanathan, “Massive MIMO linear precoding: A survey,” IEEE Syst. J., vol. 12, no. 4, pp. 3920–3931, Dec. 2018.
  • [2] G. Scutari, D. P. Palomar, and S. Barbarossa, “The MIMO iterative waterfilling algorithm,” IEEE Trans. Signal Process., vol. 57, no. 5, pp. 1917–1935, May 2009.
  • [3] A. Lozano, A. M. Tulino, and S. Verdú, “Optimum power allocation for parallel Gaussian channels with arbitrary input distributions,” IEEE Trans. Inf. Theory, vol. 52, no. 7, pp. 3033–3051, Jul. 2006.
  • [4] M. Payaró and D. P. Palomar, “On optimal precoding in linear vector Gaussian channels with arbitrary input distribution,” in Proc. IEEE Int. Symp. Inf. Theory, Seoul, Korea (South), 2009, pp. 1085–1089.
  • [5] A. A. Lu, X. Gao, Y. R. Zheng, and C. Xiao, “Linear precoder design for SWIPT in MIMO broadcasting systems with discrete input signals: Manifold optimization approach,” IEEE Trans. Commun., vol. 65, no. 7, pp. 2877–2888, Apr. 2017.
  • [6] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Trans. Cogn. Commun. Netw., vol. 3, no. 4, pp. 563–575, Dec. 2017.
  • [7] S. Jing and C. Xiao, “Linear MIMO precoders with finite alphabet inputs via stochastic optimization and deep neural networks (DNNs),” IEEE Trans. Signal Process., vol. 69, pp. 4269–4281, Jul. 2021.
  • [8] S. Dörner, S. Cammerer, J. Hoydis, and S. ten Brink, “Deep learning based communication over the air,” IEEE J. Sel. Topics Signal Process., vol. 12, no. 1, pp. 132–143, Feb. 2018.
  • [9] F. A. Aoudia and J. Hoydis, “Model-free training of end-to-end communication systems,” IEEE J. Sel. Areas Commun., vol. 37, no. 11, pp. 2503–2516, Nov. 2019.
  • [10] C. Xiao, Y. R. Zheng, and Z. Ding, “Globally optimal linear precoders for finite alphabet signals over complex vector Gaussian channels,” IEEE Trans. Signal Process., vol. 59, no. 7, pp. 3301–3314, Jul. 2011.
  • [11] Y. R. Zheng, M. Wang, W. Zeng, and C. Xiao, “Practical linear precoder design for finite alphabet multiple-input multiple-output orthogonal frequency division multiplexing with experiment validation,” IET Commun., vol. 7, no. 9, pp. 836–847, Jun. 2013.
  • [12] T. Ketseoglou and E. Ayanoglu, “Zero-forcing per-group precoding for robust optimized downlink massive MIMO performance,” IEEE Trans. Commun., vol. 67, no. 10, pp. 6816–6828, Jul. 2019.
  • [13] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. Int. Conf. Learn. Representations, Banff, Canada, 2014.
  • [14] M. Wang, W. Zeng, and C. Xiao, “Linear precoding for MIMO multiple access channels with finite discrete inputs,” IEEE Trans. Wireless Commun., vol. 10, no. 11, pp. 3934–3942, Nov. 2011.