
Federated Learning for Channel Estimation in Conventional and RIS-Assisted Massive MIMO

Ahmet M. Elbir, Senior Member, IEEE, and Sinem Coleri, Senior Member, IEEE. The work of Sinem Coleri was supported by the Scientific and Technological Research Council of Turkey with European CHIST-ERA grant 119E350. A. M. Elbir is with the Department of Electrical and Electronics Engineering, Duzce University, Duzce, Turkey, and with the University of Hertfordshire, Hatfield, UK (e-mail: [email protected]). S. Coleri is with the Department of Electrical and Electronics Engineering, Koc University, Istanbul, Turkey (e-mail: [email protected]).
Abstract

Machine learning (ML) has attracted great research interest for physical layer design problems, such as channel estimation, thanks to its low complexity and robustness. Channel estimation via ML requires model training on a dataset, which usually includes the received pilot signals as input and the channel data as output. In previous works, model training is mostly done via centralized learning (CL), where the whole training dataset is collected from the users at the base station (BS). This approach introduces a huge communication overhead for data collection. In this paper, to address this challenge, we propose a federated learning (FL) framework for channel estimation. We design a convolutional neural network (CNN) trained on the local datasets of the users without sending them to the BS. We develop FL-based channel estimation schemes for both conventional and RIS (reconfigurable intelligent surface) assisted massive MIMO (multiple-input multiple-output) systems, where a single CNN is trained on two different datasets covering both scenarios. We evaluate the performance for noisy and quantized model transmission and show that the proposed approach provides approximately 16 times lower overhead than CL, while maintaining satisfactory performance close to that of CL. Furthermore, the proposed architecture exhibits lower estimation error than the state-of-the-art ML-based schemes.

Index Terms:
Channel estimation, Federated learning, Machine learning, Centralized learning, Massive MIMO.

I Introduction

Compared to the cellular communication systems in lower frequency bands, millimeter wave (mm-Wave) signals, in the 30-300 GHz frequency range, encounter a more complex propagation environment characterized by higher scattering, severe penetration losses, and higher path loss for fixed transmitter and receiver gains [1, 2, 3]. These losses are compensated by providing beamforming power gain through a massive number of antennas at both the transmitter and the receiver with a multiple-input multiple-output (MIMO) architecture. However, such a large antenna array requires a dedicated radio-frequency (RF) chain for each antenna, resulting in an expensive system architecture and high power consumption. In order to address this issue and reduce the number of digital RF components, hybrid analog and baseband beamforming architectures have been introduced, wherein a small number of phase-only analog beamformers are employed [4]. As a result, the combination of high-dimensional analog and low-dimensional baseband beamformers significantly reduces the number of RF chains while maintaining sufficient beamforming gain [4].

Even with a reduced number of RF chains, the hybrid beamforming architecture combined with mm-Wave transmission still comes at a considerable cost in energy consumption and hardware complexity [5]. In order to address these issues and provide a greener and more suitable solution to enhance wireless network performance, reconfigurable intelligent surfaces (RISs), also known as intelligent reflecting surfaces, have been envisaged as a promising solution with low cost and complexity [6, 7, 5, 8, 9]. An RIS is an electromagnetic 2-D surface composed of a large number of passive reconfigurable meta-material elements, which reflect the incoming signal by introducing a pre-determined phase shift. This phase shift can be controlled via external signals by the base station (BS) through a backhaul control link. As a result, the incoming signal from the BS can be manipulated in real time and reflected towards the users. Hence, the use of an RIS improves the received signal energy at distant users and expands the coverage of the BS.

In both conventional and RIS-assisted massive MIMO scenarios, the performance of the system architecture strongly relies on the accuracy of the instantaneous channel state information (CSI), given the highly dynamic nature of the mm-Wave channel [10]. Thus, channel estimation accuracy plays an important role in the design of the analog and digital beamformers in conventional massive MIMO [11, 12], and in the design of the reflecting beamformer phase shifts of the RIS elements in the RIS-assisted scenario [13, 8]. Furthermore, RIS-assisted massive MIMO involves signal reception through multiple channels (e.g., BS-RIS, RIS-user and BS-user), which makes the channel estimation task more challenging and interesting. As a result, several channel estimation schemes have been proposed for massive MIMO and RIS-assisted scenarios, based on compressed sensing [13], angle-domain processing [14] and coordinated pilot assignment [15]. The performance of these analytical approaches strongly depends on an uncorrupted antenna array output to achieve reliable channel estimation accuracy. In order to provide robustness against imperfections/corruptions in the array data, data-driven techniques such as machine learning (ML) have been proposed; they uncover the non-linear relationships in the data/signals with lower computational complexity, achieve better parameter inference performance, and tolerate imperfections in the data. As listed below, ML is more efficient than model-based techniques that largely rely on mathematical models:

  • A learning model constructs a non-linear mapping between the raw input data and the desired output to approximate a problem from a model-free perspective. Thus, its prediction performance is robust against the corruptions/imperfections in the wireless channel data.

  • ML learns the feature patterns, which can easily be updated for new data and adapted to environmental changes. In the long run, this results in a lower computational complexity than model-based optimization.

  • ML-based solutions have significantly reduced run-times because of parallel processing capabilities. On the other hand, it is not straightforward to achieve parallel implementations of conventional optimization and signal processing algorithms.

Figure 1: Model training and testing stages for (a) CL and (b) FL. During training, CL involves the transmission of the datasets $\mathcal{D}_{k\in\mathcal{K}}$ from the users to the server, whereas in FL the users send only the model updates $\mathbf{g}_{k\in\mathcal{K}}(\boldsymbol{\theta}_t)$. In the test stage of both CL and FL, the server broadcasts the trained learning model $\boldsymbol{\theta}_{\mathrm{Trained}}$ to the users.

In massive MIMO and RIS-assisted systems, ML-based methods have been shown to provide higher spectral efficiency and lower computational complexity for problems such as channel estimation [16, 17, 18], hybrid beamforming [19, 20, 21] and angle-of-arrival (AoA) estimation [22, 23].

In the ML context, the channel estimation problem is solved by training a model, e.g., a neural network (NN), on the local datasets collected by the users [16, 17, 18]. The trained model provides a non-linear mapping between the input data, usually chosen as the received pilot signals, and the output data, i.e., the channel data. Previous works mostly consider centralized learning (CL) schemes where the whole dataset, i.e., the input-output data pairs, is transmitted to the BS (via the RIS in the RIS-assisted scenario) for model training, as illustrated in Fig. 1a. Once the model is trained at the BS, the model parameters are sent to the users, which can then perform the channel estimation task by feeding the model with the received pilot data. However, this approach involves a huge communication overhead, i.e., transmitting the whole dataset from the users to the BS. For example, in LTE (long term evolution), a single frame of 5 MHz bandwidth and 10 ms duration can carry only 6,000 complex symbols [24], whereas the size of the whole dataset can be on the order of hundreds of thousands of symbols [21, 20, 16, 17]. As a result, CL-based techniques demand huge bandwidth.

In order to deal with the high communication overhead of CL schemes, federated learning (FL) schemes have recently been proposed [25, 26]. In FL, instead of sending the whole dataset, only the model updates, i.e., the gradients of the model parameters, are transmitted, as illustrated in Fig. 1b. As a result, the communication overhead is reduced. In the literature, FL has been considered for scheduling and power allocation in wireless sensor networks [27], trajectory planning in UAV (unmanned aerial vehicle) networks [28], task fetching and offloading in vehicular networks [29, 26], image classification [24, 30], and massive MIMO hybrid beamforming design [31]. All of these studies accommodate multiple edge devices exchanging model updates with a parameter server to train a global model. In the aforementioned works, FL has mostly been used for image classification/object detection problems in different networking schemes under the assumption that perfect CSI is available. Motivated by the fact that the acquisition of CSI is critical in massive MIMO systems and that FL has not been considered directly for the channel estimation problem, in this work we leverage FL for channel estimation, which has previously been studied in the context of CL-based training [16, 17, 18, 32]. Compared to CL, FL is more applicable to distributed devices, such as mobile phones. Furthermore, training the same model with FL, rather than CL, significantly reduces the communication overhead during training while maintaining satisfactory channel estimation performance close to that of CL. To the best of our knowledge, this is the first work on the use of FL for channel estimation.

In this paper, we propose an FL-based model training approach for the channel estimation problem in both conventional and RIS-assisted massive MIMO systems. We design a convolutional neural network (CNN), which is located at the BS and trained on the local datasets. For these datasets, where the input is the received pilot signal and the output is the channel matrix, a CNN is more convenient than recurrent NNs (RNNs), which are designed to predict the future CSI from previous channels based on sequential data [33]. The proposed approach has three stages, namely, data collection, training and prediction. In the first stage, each user collects its training dataset and stores it for model training, a step that is not explicitly discussed in the previous ML-based works [16, 17, 18, 34]. In the second stage, each user uses its own local dataset to compute the model updates and sends them to the BS (in the RIS-assisted scenario, the model parameters computed at the users are transmitted to the BS via the RIS), where the model updates are aggregated to train a global model. The main advantage of the proposed FL approach is the reduction in communication overhead. This overhead is proportional to the dimensionality of the channel matrix, which can be higher in RIS-assisted systems than in conventional MIMO due to the large number of RIS elements. Apart from that, the proposed approach reduces the computation time and increases robustness against data corruptions. One of the main challenges in FL-based channel estimation is the non-i.i.d. (not independent and identically distributed) structure of the training data. FL is known to converge faster if the local datasets are i.i.d. [35]. Since the channel estimation dataset is non-i.i.d. because of the distribution of the user locations, FL is expected to converge more slowly. In order to improve the performance in the non-i.i.d. scenario, using deeper and wider learning models helps to provide better feature extraction and representation [31]. Thus, we perform a hyper-parameter optimization to achieve satisfactory performance.

The main contributions of this paper can be summarized as follows:

  1. We propose an FL-based channel estimation approach for both conventional and RIS-assisted massive MIMO systems. Different from conventional centralized model learning techniques, the proposed FL framework provides decentralized learning, which significantly reduces the communication overhead compared to CL-based techniques while maintaining satisfactory channel estimation performance close to that of CL.

  2. In order to estimate both the direct (BS-user) and cascaded (BS-RIS-user) channels in the RIS-assisted scenario, the input and output data of each communication link are combined; hence, a single CNN architecture is designed instead of using different NNs for each task.

  3. We prove the convergence of FL and demonstrate its superior performance over CL in terms of communication overhead and channel estimation accuracy via extensive numerical simulations for different numbers of users, while considering the quantization and corruption of the gradient and model data as well as the loss of a portion of the model data during transmission.

Throughout the paper, the identity matrix of size $N\times N$ is denoted by $\mathbf{I}_N$. $(\cdot)^{\mathsf{T}}$ and $(\cdot)^{\mathsf{H}}$ denote the transpose and conjugate transpose operations, respectively. For a matrix $\mathbf{A}$ and a vector $\mathbf{a}$, $[\mathbf{A}]_{i,j}$ and $[\mathbf{a}]_i$ denote the $(i,j)$th element of the matrix $\mathbf{A}$ and the $i$th element of the vector $\mathbf{a}$, respectively. $\mathbb{E}\{\cdot\}$ denotes the statistical expectation of its argument and $\angle\{\cdot\}$ returns the angle of a complex quantity. $\|\mathbf{A}\|_{\mathcal{F}}$ and $\|\mathbf{a}\|_2$ denote the Frobenius norm and the $l_2$-norm, respectively. $\otimes$ is the Hadamard (element-wise) multiplication and $\nabla_{\mathbf{a}}$ represents the gradient with respect to $\mathbf{a}$. A convolutional layer with $N$ $D\times D$ 2-D kernels is denoted by $N$@$D\times D$.

II System Model

We consider a multi-user MIMO-OFDM (orthogonal frequency division multiplexing) system with $M$ subcarriers, where the BS has $N_{\mathrm{BS}}$ antennas to communicate with $K$ users, each of which has $N_{\mathrm{MS}}$ antennas. In the downlink, the BS first precodes $K$ data symbols $\mathbf{s}[m]=[s_1[m],s_2[m],\dots,s_K[m]]^{\mathsf{T}}\in\mathbb{C}^K$ at each subcarrier ($m\in\mathcal{M}=\{1,\dots,M\}$) by applying the subcarrier-dependent baseband precoders $\mathbf{F}_{\mathrm{BB}}[m]=[\mathbf{f}_{\mathrm{BB}_1}[m],\mathbf{f}_{\mathrm{BB}_2}[m],\dots,\mathbf{f}_{\mathrm{BB}_K}[m]]\in\mathbb{C}^{K\times K}$. Then, the signal is transformed to the time domain via an $M$-point inverse discrete Fourier transform (IDFT). After adding the cyclic prefix (CP), the BS employs the subcarrier-independent analog precoder $\mathbf{F}_{\mathrm{RF}}\in\mathbb{C}^{N_{\mathrm{BS}}\times K}$ to form the transmitted signal. Since $\mathbf{F}_{\mathrm{RF}}$ consists of analog phase shifters, the RF precoder has unit-modulus entries, i.e., $|[\mathbf{F}_{\mathrm{RF}}]_{i,j}|^2=1$. Additionally, the power constraint $\sum_{m=1}^{M}\|\mathbf{F}_{\mathrm{RF}}\mathbf{F}_{\mathrm{BB}}[m]\|_{\mathcal{F}}^2=MK$ is enforced by normalizing the baseband precoders $\{\mathbf{F}_{\mathrm{BB}}[m]\}_{m\in\mathcal{M}}$. Thus, the transmitted signal becomes $\mathbf{x}[m]=\mathbf{F}_{\mathrm{RF}}\sum_{k=1}^{K}\mathbf{f}_{\mathrm{BB}_k}[m]s_k[m]$.

II-A Channel Model

Before reception at the users, the transmitted signal passes through the mm-Wave channel, which can be represented by a geometric model with limited scattering [11]. Let $\mathbf{H}_k[m]$ denote the $N_{\mathrm{MS}}\times N_{\mathrm{BS}}$ mm-Wave channel matrix between the BS and the $k$th user. Then, $\mathbf{H}_k[m]$ includes the contributions of $L$ paths, where the $l$th path of the $k$th user has time delay $\tau_{k,l}$, relative AoA $\bar{\phi}_{k,l}\in\Theta$ ($\Theta=[-\frac{\pi}{2},\frac{\pi}{2}]$), angle-of-departure (AoD) $\phi_{k,l}\in\Theta$, and complex path gain $\alpha_{k,l}$. Let $p(\tau)$ denote a pulse-shaping function for $T_{\mathrm{s}}$-spaced signaling evaluated at $\tau$ seconds. Then, the delay-$d$ mm-Wave MIMO channel matrix in the time domain is given by

$\bar{\mathbf{H}}_k[d] = \sqrt{\frac{N_{\mathrm{BS}}N_{\mathrm{MS}}}{L}}\sum_{l=1}^{L}\alpha_{k,l}\,p(dT_{\mathrm{s}}-\tau_{k,l})\,\mathbf{a}_{\mathrm{MS}}(\bar{\phi}_{k,l})\mathbf{a}_{\mathrm{BS}}^{\mathsf{H}}(\phi_{k,l}), \qquad (1)$

where $\mathbf{a}_{\mathrm{MS}}(\bar{\phi}_{k,l})$ and $\mathbf{a}_{\mathrm{BS}}(\phi_{k,l})$ are the $N_{\mathrm{MS}}\times 1$ and $N_{\mathrm{BS}}\times 1$ steering vectors representing the array responses of the antenna arrays at the users and the BS, respectively. Let $\lambda_m=\frac{c_0}{f_m}$ be the wavelength of subcarrier $m$ at frequency $f_m$. Since the operating frequency is much higher than the bandwidth in mm-Wave systems and the subcarrier frequencies are close to each other (i.e., $f_{m_1}\approx f_{m_2}$ for $m_1,m_2\in\mathcal{M}$), we use a single operating wavelength $\lambda=\lambda_1=\dots=\lambda_M=\frac{c_0}{f_c}$, where $c_0$ is the speed of light and $f_c$ is the central carrier frequency [11, 12]. This approximation also allows a single frequency-independent analog beamformer to be used for all subcarriers. Then, for a uniform linear array (ULA), the array response of the antenna array at the BS is

$\mathbf{a}_{\mathrm{BS}}(\phi)=\big[1, e^{j\frac{2\pi}{\lambda}d_{\mathrm{BS}}\sin(\phi)},\dots,e^{j\frac{2\pi}{\lambda}(N_{\mathrm{BS}}-1)d_{\mathrm{BS}}\sin(\phi)}\big]^{\mathsf{T}}, \qquad (2)$

where $d_{\mathrm{BS}}=\lambda/2$ is the antenna spacing. The $n$th element of $\mathbf{a}_{\mathrm{MS}}(\bar{\phi})$ is defined similarly to $\mathbf{a}_{\mathrm{BS}}(\phi)$ as $[\mathbf{a}_{\mathrm{MS}}(\bar{\phi})]_n=e^{j\pi(n-1)\sin(\bar{\phi})}$, $n=1,\dots,N_{\mathrm{MS}}$. After taking the $M$-point DFT of the delay-$d$ channel model in (1), the channel matrix of the $k$th user at subcarrier $m$ becomes

$\mathbf{H}_k[m]=\sum_{d=0}^{D-1}\bar{\mathbf{H}}_k[d]e^{-j\frac{2\pi m}{M}d}, \qquad (3)$

where $D\leq M$ is the CP length. The frequency-domain channel in (3) is used in MIMO-OFDM systems, where the orthogonality of the subcarriers holds, i.e., $\|\mathbf{H}_k^{\mathsf{H}}[m_1]\mathbf{H}_k[m_2]\|_{\mathcal{F}}^2=0$ for $m_1,m_2\in\mathcal{M}$ and $m_1\neq m_2$.
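For concreteness, the following minimal NumPy sketch generates one channel realization according to (1)-(3); the parameter values and the sinc pulse shape are illustrative assumptions made here, not specifications from the paper.

```python
import numpy as np

def ula_steering(n_ant, phi):
    """ULA steering vector with lambda/2 spacing, cf. (2)."""
    return np.exp(1j * np.pi * np.arange(n_ant) * np.sin(phi))

def mmwave_channel(n_bs=32, n_ms=4, L=3, M=16, D=4, Ts=1.0):
    """Return H[m] for m = 0..M-1 via the delay-d model (1) and the M-point DFT (3)."""
    alpha = (np.random.randn(L) + 1j * np.random.randn(L)) / np.sqrt(2)  # path gains
    tau = np.random.uniform(0, D * Ts, L)                # path delays
    aoa = np.random.uniform(-np.pi / 2, np.pi / 2, L)    # AoA at the user
    aod = np.random.uniform(-np.pi / 2, np.pi / 2, L)    # AoD at the BS
    p = lambda t: np.sinc(t / Ts)                        # assumed pulse-shaping function

    H_time = np.zeros((D, n_ms, n_bs), dtype=complex)    # delay-d channel taps, (1)
    for d in range(D):
        for l in range(L):
            H_time[d] += (alpha[l] * p(d * Ts - tau[l])
                          * np.outer(ula_steering(n_ms, aoa[l]),
                                     ula_steering(n_bs, aod[l]).conj()))
    H_time *= np.sqrt(n_bs * n_ms / L)

    # Frequency-domain channel at each subcarrier, (3)
    return np.stack([sum(H_time[d] * np.exp(-2j * np.pi * m * d / M) for d in range(D))
                     for m in range(M)])
```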

With the aforementioned block-fading channel model [11], the received signal at the $k$th user at subcarrier $m$ before analog processing is $\tilde{\mathbf{y}}_k[m]=\sqrt{\rho}\,\mathbf{H}_k[m]\mathbf{x}[m]$, i.e.,

$\tilde{\mathbf{y}}_k[m]=\sqrt{\rho}\,\mathbf{H}_k[m]\mathbf{F}_{\mathrm{RF}}\mathbf{F}_{\mathrm{BB}}[m]\mathbf{s}[m]+\mathbf{n}[m], \qquad (4)$

where $\rho$ represents the average received power and $\mathbf{n}[m]\sim\mathcal{CN}(0,\sigma^2\mathbf{I}_{N_{\mathrm{MS}}})$ is the additive white Gaussian noise (AWGN) vector. At the $k$th user, the received signal is first processed by the analog combiner $\mathbf{w}_{\mathrm{RF},k}\in\mathbb{C}^{N_{\mathrm{MS}}}$. Then, the cyclic prefix is removed from the processed signal and an $M$-point DFT is applied to obtain the signal in the frequency domain. The received baseband signal then becomes

$\bar{y}_k[m]=\sqrt{\rho}\,\mathbf{w}_{\mathrm{RF},k}^{\mathsf{H}}\mathbf{H}_k[m]\mathbf{F}_{\mathrm{RF}}\mathbf{F}_{\mathrm{BB}}[m]\mathbf{s}[m]+\mathbf{w}_{\mathrm{RF},k}^{\mathsf{H}}\mathbf{n}[m], \qquad (5)$

where the analog combiner $\mathbf{w}_{\mathrm{RF},k}$ satisfies the constraint $[\mathbf{w}_{\mathrm{RF},k}\mathbf{w}_{\mathrm{RF},k}^{\mathsf{H}}]_{i,i}=1$, similar to the RF precoder. Once the received symbols $\bar{y}_k[m]$ are obtained at the $k$th user, they are demodulated according to the respective modulation scheme, and the information bits are recovered for each subcarrier. To accurately recover the data streams $\mathbf{s}[m]$ in (5), the channel matrix $\mathbf{H}_k[m]$ must be estimated. This is usually done by using pilot signals in the preamble stage [36, 16], wherein the beamformers $\mathbf{F}_{\mathrm{RF}}$, $\mathbf{F}_{\mathrm{BB}}$ and $\mathbf{w}_{\mathrm{RF},k}$ are designed accordingly (see Section III-C).

II-B Problem Description

The aim of this work is to estimate the channel matrix $\mathbf{H}_k[m]$ via FL, as illustrated in Fig. 1b. To this end, the global NN for channel estimation (henceforth called ChannelNet), located at the BS, is trained on the local datasets of the users. Let $\mathcal{D}_k$ denote the local dataset of the $k$th user, containing the input-output pairs $\mathcal{D}_k^{(i)}=(\mathcal{X}_k^{(i)},\mathcal{Y}_k^{(i)})$ for $i=1,\dots,\mathsf{D}_k$, where $\mathsf{D}_k=|\mathcal{D}_k|$ is the size of the local dataset $\mathcal{D}_k$ (the sizes of $\mathcal{X}_k^{(i)}$ and $\mathcal{Y}_k^{(i)}$ depend on the size of the channel matrix and are given explicitly in Sec. III-C and Sec. III-D for the conventional and RIS-assisted massive MIMO scenarios, respectively). Here, $\mathcal{X}_k^{(i)}$ represents the $i$th input data, i.e., the received pilot signals, and $\mathcal{Y}_k^{(i)}$ denotes the $i$th output/label data, i.e., the channel matrix, for $k\in\mathcal{K}=\{1,\dots,K\}$. Thus, for an input-output pair $(\mathcal{X},\mathcal{Y})$, ChannelNet constructs a non-linear relationship between the input and output data as $f(\mathcal{X}|\boldsymbol{\theta})=\mathcal{Y}$, where $\boldsymbol{\theta}\in\mathbb{R}^P$ denotes the learnable parameters.

Figure 2: (a) Training data collection and (b) channel estimation with the trained model.

III Federated Learning for Channel Estimation

In this section, we present the proposed FL-based channel estimation scheme, which comprises three stages: training data collection, model training and prediction. First, we present the training data collection stage, in which each user collects its own training dataset from the received pilot signals. After presenting the FL-based model training scheme, we discuss how the input and output label data are determined for the massive MIMO and RIS-assisted scenarios. Once the learning model is trained, it can be used for channel estimation in the prediction stage.

III-A Training Data Collection

In Fig. 2, we present the communication interval at the user for two consecutive data transmission blocks. At the beginning of each transmission block, the received pilot signals are acquired and processed for channel estimation. This can be done by employing one of the analytical channel estimation techniques, based for instance on compressed sensing [37, 38], angle-domain processing [14] or coordinated pilot assignment [15]. The analytical approach is used only in the training data collection stage, which is relatively short compared to the prediction stage [32]. Hence, the use of ML/FL in the prediction stage becomes more advantageous than the analytical techniques in the long term.

It is also worth mentioning that the training data can be obtained from offline datasets prepared by collecting data from field measurements. In [39], the authors present a channel estimation dataset obtained with electromagnetic simulation tools. While this approach can also be followed, offline collected data may not always reflect the channel characteristics and the imperfections of the mm-Wave channel. In this work, we evaluate the performance of the proposed approach on datasets whose labels are selected as both the true and the estimated channel data. For the estimated channel, we assume that the training data are collected, as described in Fig. 2, by employing the angle-domain channel estimation (ADCE) technique [14], which performs close to the minimum mean-squared-error (MMSE) estimator.

After channel estimation, the training data can be collected by storing the received pilot data $\mathbf{G}_k[m]$ and the estimated channel data $\hat{\mathbf{H}}_k[m]$ in the internal memory of the user (we discuss how $\mathbf{G}_k[m]$ is obtained in Sec. III-C). Then, the user feeds back the estimated channel data to the BS via uplink transmission. As a result, the local dataset $\mathcal{D}_k$ is collected at the $k$th user after $i=1,\dots,\mathsf{D}_k$ transmission blocks, as sketched below. This approach allows us to collect training data across different channel coherence times, which can be very short due to the dynamic nature of the mm-Wave channel, e.g., in indoor and vehicular communications [10].
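The following high-level Python sketch summarizes this collection loop at one user; `receive_and_process_pilots` and `adce_estimate` are hypothetical placeholders standing in for the pilot processing of Sec. III-C and the ADCE estimator [14], not functions defined in the paper.

```python
# Local dataset collection at the k-th user (Fig. 2a): one (input, label) pair
# is stored per transmission block and kept locally, never sent to the BS.
def collect_local_dataset(num_blocks, receive_and_process_pilots, adce_estimate):
    dataset = []
    for i in range(num_blocks):
        G = receive_and_process_pilots(i)   # input: processed pilot block G_k[m], (23)
        H_hat = adce_estimate(G)            # label: channel estimated by ADCE [14]
        dataset.append((G, H_hat))
    return dataset
```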

The above process constitutes the first stage of the proposed FL-based channel estimation framework. Once the training data are collected, the global model is trained (see, e.g., Fig. 1b). After training, each user can estimate its own channel via the trained NN by simply feeding it with $\mathbf{G}_k[m]$ and obtaining $\hat{\mathbf{H}}_k[m]$, as illustrated in Fig. 2b.

III-B FL-based Model Training

We begin by introducing the training concept in conventional CL, then develop FL-based model training.

In CL-based model training for channel estimation [16, 17, 32, 18, 34], the global NN is trained by collecting the local datasets $\{\mathcal{D}_k\}_{k\in\mathcal{K}}$ from the users, as illustrated in Fig. 1a. Once the BS has collected the whole dataset $\mathcal{D}$, training is performed by solving the following problem

$\operatorname*{minimize}_{\boldsymbol{\theta}} \;\; \mathcal{L}(\boldsymbol{\theta}) \quad \mathrm{subject\ to:}\;\; f(\mathcal{X}^{(i)}|\boldsymbol{\theta})=\mathcal{Y}^{(i)},\; i=1,\dots,\mathsf{D}, \qquad (6)$

where $\mathsf{D}=|\mathcal{D}|$ is the number of training samples and $\mathcal{L}(\boldsymbol{\theta})$ denotes the loss function defined as

$\mathcal{L}(\boldsymbol{\theta})=\frac{1}{\mathsf{D}}\sum_{i=1}^{\mathsf{D}}\|f(\mathcal{X}^{(i)}|\boldsymbol{\theta})-\mathcal{Y}^{(i)}\|_{\mathcal{F}}^2, \qquad (7)$

which is the MSE between the label data $\mathcal{Y}^{(i)}$ and the prediction of the NN, $f(\mathcal{X}^{(i)}|\boldsymbol{\theta})$.

On the other hand, in FL, the local datasets $\{\mathcal{D}_k\}_{k\in\mathcal{K}}$ are kept at the users and are not transmitted to the BS. Hence, FL-based model training is performed at the user side as

$\operatorname*{minimize}_{\boldsymbol{\theta}} \;\; \bar{\mathcal{L}}(\boldsymbol{\theta})=\frac{1}{K}\sum_{k=1}^{K}\mathcal{L}_k(\boldsymbol{\theta}) \quad \mathrm{subject\ to:}\;\; f(\mathcal{X}_k^{(i)}|\boldsymbol{\theta})=\mathcal{Y}_k^{(i)},\; i=1,\dots,\mathsf{D}_k,\; k\in\mathcal{K}, \qquad (8)$

where $\mathcal{L}_k(\boldsymbol{\theta})=\frac{1}{\mathsf{D}_k}\sum_{i=1}^{\mathsf{D}_k}\|f(\mathcal{X}_k^{(i)}|\boldsymbol{\theta})-\mathcal{Y}_k^{(i)}\|_{\mathcal{F}}^2$. Notice that the FL-based training problem in (8) is solved at the users, while the CL problem in (6) is handled at the BS. To solve (6) and (8) efficiently, gradient descent (GD) is employed and the problems are solved iteratively. In CL, the gradient is computed over the whole dataset as $\mathbf{g}(\boldsymbol{\theta}_t)=\nabla\mathcal{L}(\boldsymbol{\theta}_t)$ and the parameter update is performed as

$\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_t-\eta\,\mathbf{g}(\boldsymbol{\theta}_t), \qquad (9)$

where $\eta$ is the learning rate.

In FL, each user computes its gradient individually as $\mathbf{g}_k(\boldsymbol{\theta}_t)=\nabla\mathcal{L}_k(\boldsymbol{\theta}_t)$ to solve (8), then sends it to the BS, where the model parameters are updated as

$\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_t-\eta\frac{1}{K}\sum_{k=1}^{K}\mathbf{g}_k(\boldsymbol{\theta}_t). \qquad (10)$

Transmitting the gradients to the BS is more energy-efficient than directly transmitting the model parameters as in the FedAvg algorithm [35]. The main reason is that the gradients include only the model updates obtained from the GD algorithm, whereas model transmission includes data already known from the previous iteration. Hence, model transmission wastes a significant amount of transmit power at all the users [31, 30, 40].
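A simplified PyTorch sketch of one such FL communication round, corresponding to (10), is given below; noise, quantization and scheduling effects are omitted, so it is only an idealized illustration of the gradient aggregation step.

```python
import torch

def local_gradient(model, loss_fn, X_k, Y_k):
    """Gradient g_k(theta_t) of the local loss L_k evaluated at the current model."""
    model.zero_grad()
    loss_fn(model(X_k), Y_k).backward()
    return [p.grad.detach().clone() for p in model.parameters()]

def fl_round(model, loss_fn, local_datasets, lr=1e-3):
    """One FL round as in (10): the BS averages the user gradients and updates theta."""
    grads = [local_gradient(model, loss_fn, X_k, Y_k) for X_k, Y_k in local_datasets]
    K = len(local_datasets)
    with torch.no_grad():
        for i, p in enumerate(model.parameters()):
            p -= lr * sum(g[i] for g in grads) / K
```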

The gradients $\mathbf{g}_{k\in\mathcal{K}}(\boldsymbol{\theta}_t)$ are sent to the BS over the wireless channel, which corrupts them during transmission. Therefore, the corrupted model parameters and gradients at the $t$th iteration are given as [24, 41]

$\tilde{\boldsymbol{\theta}}_t=\boldsymbol{\theta}_t+\Delta\boldsymbol{\theta}_t, \qquad (11)$
$\mathbf{g}_k(\tilde{\boldsymbol{\theta}}_t)=\mathbf{g}_k(\boldsymbol{\theta}_t)+\Delta\mathbf{g}_k(\boldsymbol{\theta}_t), \qquad (12)$
$\tilde{\mathbf{g}}_k(\tilde{\boldsymbol{\theta}}_t)=\mathbf{g}_k(\tilde{\boldsymbol{\theta}}_t)+\Delta\mathbf{g}_k(\tilde{\boldsymbol{\theta}}_t), \qquad (13)$

where $\tilde{\boldsymbol{\theta}}_t$ represents the noisy model parameters received at the users, $\mathbf{g}_k(\tilde{\boldsymbol{\theta}}_t)$ is the gradient vector computed at the user based on $\tilde{\boldsymbol{\theta}}_t$, and $\tilde{\mathbf{g}}_k(\tilde{\boldsymbol{\theta}}_t)$ denotes the noisy gradient vector received at the BS. $\Delta\boldsymbol{\theta}_t$, $\Delta\mathbf{g}_k(\boldsymbol{\theta}_t)$ and $\Delta\mathbf{g}_k(\tilde{\boldsymbol{\theta}}_t)$ represent the noise terms added to $\boldsymbol{\theta}_t$, $\mathbf{g}_k(\boldsymbol{\theta}_t)$ and $\mathbf{g}_k(\tilde{\boldsymbol{\theta}}_t)$, respectively. Then, the model update rule is given by

$\tilde{\boldsymbol{\theta}}_{t+1}=\tilde{\boldsymbol{\theta}}_t-\eta\frac{1}{K}\sum_{k=1}^{K}\tilde{\mathbf{g}}_k(\tilde{\boldsymbol{\theta}}_t), \qquad (14)$

which can be rewritten as

$\tilde{\boldsymbol{\theta}}_{t+1}=\big[\boldsymbol{\theta}_t+\Delta\boldsymbol{\theta}_t\big]-\eta\sum_{k=1}^{K}\frac{\big[\mathbf{g}_k(\boldsymbol{\theta}_t)+\Delta\mathbf{g}_k(\boldsymbol{\theta}_t)+\Delta\mathbf{g}_k(\tilde{\boldsymbol{\theta}}_t)\big]}{K} = \underbrace{\boldsymbol{\theta}_t-\eta\sum_{k=1}^{K}\frac{\mathbf{g}_k(\boldsymbol{\theta}_t)}{K}}_{\boldsymbol{\theta}_{t+1}}+\underbrace{\Delta\boldsymbol{\theta}_t-\eta\sum_{k=1}^{K}\frac{\big[\Delta\mathbf{g}_k(\boldsymbol{\theta}_t)+\Delta\mathbf{g}_k(\tilde{\boldsymbol{\theta}}_t)\big]}{K}}_{\Delta} = \boldsymbol{\theta}_{t+1}+\Delta, \qquad (15)$

where $\Delta$ corresponds to the overall noise term added to $\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_t-\eta\frac{1}{K}\sum_{k=1}^{K}\mathbf{g}_k(\boldsymbol{\theta}_t)$. Now, let us consider the statistics of $\Delta$. Without loss of generality, the noise terms due to wireless transmission in (11) and (13), i.e., $\Delta\boldsymbol{\theta}_t$ and $\Delta\mathbf{g}_k(\tilde{\boldsymbol{\theta}}_t)$, can be modeled as AWGN with variances $\sigma_{\boldsymbol{\theta}}^2$ and $\tilde{\sigma}_k^2$, respectively [42, 41]. Furthermore, we model $\Delta\mathbf{g}_k(\boldsymbol{\theta}_t)$ in (12) as AWGN with variance $\sigma_k^2$ due to the linearity of the gradient and of the NN layers (many NN layers, such as convolutional, fully connected, normalization and dropout layers, perform linear operations, whereas pooling and ReLU layers are non-linear [43, 41, 42]). Hence, the overall noise term $\Delta$ can be viewed as AWGN with variance $\sigma_\Delta^2=\sigma_{\boldsymbol{\theta}}^2+\eta\frac{\sum_{k=1}^{K}(\tilde{\sigma}_k^2+\sigma_k^2)}{K}$.

In order to solve (8) effectively in the presence of noisy model parameters, we define a regularized loss function $\tilde{\mathcal{L}}_k(\boldsymbol{\theta})$ as

$\tilde{\mathcal{L}}_k(\boldsymbol{\theta})=\mathcal{L}_k(\boldsymbol{\theta})+\sigma_\Delta^2\|\mathbf{g}_k(\boldsymbol{\theta})\|^2, \qquad (16)$

which is widely used in stochastic optimization [44]. (16) can be obtained via a first-order Taylor expansion of the expectation-based loss $\mathbb{E}\{\|\mathcal{L}_k(\boldsymbol{\theta}+\Delta)\|^2\}$, which can be approximately written as

$\mathbb{E}\{\|\mathcal{L}_k(\boldsymbol{\theta}+\Delta)\|^2\}\approx\mathbb{E}\{\|\mathcal{L}_k(\boldsymbol{\theta})+\Delta\nabla\mathcal{L}_k(\boldsymbol{\theta})\|^2\} \approx\mathbb{E}\{\|\mathcal{L}_k(\boldsymbol{\theta})\|^2\}+\mathbb{E}\{\|\Delta\|^2\}\,\mathbb{E}\{\|\nabla\mathcal{L}_k(\boldsymbol{\theta})\|^2\} \approx\mathbb{E}\{\|\mathcal{L}_k(\boldsymbol{\theta})\|^2\}+\sigma_\Delta^2\|\mathbf{g}_k(\boldsymbol{\theta})\|^2, \qquad (17)$

where the first term corresponds to the minimization of the loss function with perfect estimation and the second term is the additional cost due to the noise [44, 41]. Using (16), the regularized version of the FL-based training problem in (8) is given by

$\operatorname*{minimize}_{\boldsymbol{\theta}} \;\; \bar{\mathcal{L}}(\boldsymbol{\theta})=\frac{1}{K}\sum_{k=1}^{K}\tilde{\mathcal{L}}_k(\boldsymbol{\theta}) \quad \mathrm{subject\ to:}\;\; f(\mathcal{X}_k^{(i)}|\boldsymbol{\theta})=\mathcal{Y}_k^{(i)},\; i=1,\dots,\mathsf{D}_k,\; k\in\mathcal{K}, \qquad (18)$

which can be effectively solved via GD in the presence of noisy model updates as

$\tilde{\boldsymbol{\theta}}_{t+1}=\tilde{\boldsymbol{\theta}}_t-\eta\nabla\bar{\mathcal{L}}(\tilde{\boldsymbol{\theta}}), \qquad (19)$

where $\nabla\bar{\mathcal{L}}(\tilde{\boldsymbol{\theta}})=\frac{1}{K}\sum_{k=1}^{K}\bar{\mathbf{g}}_k(\tilde{\boldsymbol{\theta}}_t)$ and $\bar{\mathbf{g}}_k(\tilde{\boldsymbol{\theta}}_t)=\nabla\tilde{\mathcal{L}}_k(\tilde{\boldsymbol{\theta}}_t)=\nabla\big[\mathcal{L}_k(\tilde{\boldsymbol{\theta}}_t)+\sigma_\Delta^2\|\mathbf{g}_k(\tilde{\boldsymbol{\theta}}_t)\|^2\big]$.

Due to the noisy gradient transmission, $\bar{\mathcal{L}}(\boldsymbol{\theta})$ converges more slowly than $\mathcal{L}(\boldsymbol{\theta})$. In the following theorem, we prove the convergence of $\bar{\mathcal{L}}(\boldsymbol{\theta})$. While the convergence of the regularized loss function has been studied in other FL works [42, 41], those works consider model transmission, whereas here we investigate the gradient transmission approach. The convergence analysis also differs from previous gradient transmission-based works, e.g., [30, 24], which rely on a sparsity assumption on the gradient vector that may not always be satisfied.

Theorem 1: Let $\boldsymbol{\theta}_0$ and $\boldsymbol{\theta}_\star$ be the initial and optimal model parameters, respectively. Then, FL-based model training converges with rate $\mathcal{O}(1/t)$ as

$\bar{\mathcal{L}}(\boldsymbol{\theta}_t)-\bar{\mathcal{L}}(\boldsymbol{\theta}_\star)\leq\frac{\|\boldsymbol{\theta}_0-\boldsymbol{\theta}_\star\|^2}{2\eta t}, \qquad (20)$

with the learning rate $\eta\leq\frac{1}{(1+\sigma_\Delta^2)\beta}$ for some $\beta\geq 0$.

Proof: See Appendix A.∎

In practice, the convergence of the learning model is subject to wireless factors, such as the SNR of the transmitted/received model updates. In particular, convergence becomes slower due to packet errors during training [45]. Furthermore, the channel statistics change in each communication round, which entails CSI acquisition in every round. While some recent works assume that a single communication round between the server and the clients takes a single channel coherence time [31, 30, 24], in [46] FL-based training is completed within a single long coherence time, which is approximately composed of 40 small-scale fading channel coherence intervals.

III-C FL for Channel Estimation in Massive MIMO

Here, we discuss how the input and output of ChannelNet are determined for the massive MIMO scenario.

The input of ChannelNet is the set of received pilot signals in the preamble stage. Consider the downlink received signal model in (5) and assume that the BS activates only a single RF chain at a time. Let $\overline{\mathbf{f}}_u[m]\in\mathbb{C}^{N_{\mathrm{BS}}}$ be the resulting beamformer vector and let $\overline{s}_u[m]$ be the pilot signals, where $u=1,\dots,M_{\mathrm{BS}}$ and $m\in\mathcal{M}$. At the receiver side, each user activates its RF chain $M_{\mathrm{MS}}$ times and applies the beamformer vectors $\overline{\mathbf{w}}_v[m]$, $v=1,\dots,M_{\mathrm{MS}}$, to process the received pilots [16]. Hence, the total channel use in the channel acquisition process is $M_{\mathrm{BS}}\lceil\frac{M_{\mathrm{MS}}}{N_{\mathrm{RF}}}\rceil$. Therefore, the received pilot signal at the $k$th user becomes

$\overline{\mathbf{Y}}_k[m]=\overline{\mathbf{W}}^{\mathsf{H}}[m]\mathbf{H}_k[m]\overline{\mathbf{F}}[m]\overline{\mathbf{S}}[m]+\widetilde{\mathbf{N}}_k[m], \qquad (21)$

where $\overline{\mathbf{F}}[m]=[\overline{\mathbf{f}}_1[m],\dots,\overline{\mathbf{f}}_{M_{\mathrm{BS}}}[m]]$ and $\overline{\mathbf{W}}[m]=[\overline{\mathbf{w}}_1[m],\dots,\overline{\mathbf{w}}_{M_{\mathrm{MS}}}[m]]$ are the $N_{\mathrm{BS}}\times M_{\mathrm{BS}}$ and $N_{\mathrm{MS}}\times M_{\mathrm{MS}}$ beamformer matrices, respectively, $\overline{\mathbf{S}}[m]=\mathrm{diag}\{\overline{s}_1[m],\dots,\overline{s}_{M_{\mathrm{BS}}}[m]\}$ denotes the pilot signals, and $\widetilde{\mathbf{N}}_k[m]=\overline{\mathbf{W}}^{\mathsf{H}}\overline{\mathbf{N}}_k[m]$ is the effective noise matrix, where $\overline{\mathbf{N}}_k[m]\sim\mathcal{N}(0,\sigma^2\mathbf{I}_{M_{\mathrm{MS}}})$. Without loss of generality, we assume that $\overline{\mathbf{F}}[m]=\overline{\mathbf{F}}$, $\overline{\mathbf{W}}[m]=\overline{\mathbf{W}}$ and $\overline{\mathbf{S}}[m]=\mathbf{I}_{M_{\mathrm{BS}}}$ for all $m\in\mathcal{M}$. Then, the received signal in (21) becomes

$\overline{\mathbf{Y}}_k[m]=\overline{\mathbf{W}}^{\mathsf{H}}\mathbf{H}_k[m]\overline{\mathbf{F}}+\widetilde{\mathbf{N}}_k[m]. \qquad (22)$

Using $\overline{\mathbf{Y}}_k[m]$, we define the input of ChannelNet, $\mathbf{G}_k[m]$, as

$\mathbf{G}_k[m]=\mathbf{T}_{\mathrm{MS}}\overline{\mathbf{Y}}_k[m]\mathbf{T}_{\mathrm{BS}}, \qquad (23)$

where $\mathbf{T}_{\mathrm{MS}}=\overline{\mathbf{W}}$ if $M_{\mathrm{MS}}<N_{\mathrm{MS}}$ and $\mathbf{T}_{\mathrm{MS}}=(\overline{\mathbf{W}}\overline{\mathbf{W}}^{\mathsf{H}})^{-1}\overline{\mathbf{W}}$ if $M_{\mathrm{MS}}\geq N_{\mathrm{MS}}$, while $\mathbf{T}_{\mathrm{BS}}=\overline{\mathbf{F}}^{\mathsf{H}}$ if $M_{\mathrm{BS}}<N_{\mathrm{BS}}$ and $\mathbf{T}_{\mathrm{BS}}=\overline{\mathbf{F}}^{\mathsf{H}}(\overline{\mathbf{F}}\overline{\mathbf{F}}^{\mathsf{H}})^{-1}$ if $M_{\mathrm{BS}}\geq N_{\mathrm{BS}}$. Here, $\mathbf{T}_{\mathrm{BS}}$ and $\mathbf{T}_{\mathrm{MS}}$ remove the effect of the unitary matrices $\overline{\mathbf{F}}$ and $\overline{\mathbf{W}}$ in (22), respectively. Since ChannelNet accepts real-valued data, we construct the final form of the input $\mathcal{X}_k$ as a three-"channel" tensor. The first and second "channels" of $\mathcal{X}_k$ are the real and imaginary parts of $\mathbf{G}_k[m]$, i.e., $[\mathcal{X}_k]_1=\operatorname{Re}\{\mathbf{G}_k[m]\}$ and $[\mathcal{X}_k]_2=\operatorname{Im}\{\mathbf{G}_k[m]\}$, respectively. The third "channel" is $[\mathcal{X}_k]_3=\angle\{\mathbf{G}_k[m]\}$. We note that using a three-"channel" input (i.e., the real, imaginary and angle information of $\mathbf{G}_k[m]$) provides better feature representation [19, 18, 36]. As a result, the size of the input data is $N_{\mathrm{MS}}\times N_{\mathrm{BS}}\times 3$.

The output of ChannelNet is a $2N_{\mathrm{BS}}N_{\mathrm{MS}}\times 1$ real-valued vector given by

$\mathcal{Y}_k=\big[\mathrm{vec}\{\operatorname{Re}\{\mathbf{H}_k[m]\}\}^{\mathsf{T}},\mathrm{vec}\{\operatorname{Im}\{\mathbf{H}_k[m]\}\}^{\mathsf{T}}\big]^{\mathsf{T}}. \qquad (24)$

As a result, ChannelNet maps the received pilot signals $\mathbf{G}_k[m]$ to the channel matrix $\mathbf{H}_k[m]$.
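As a rough illustration of this construction, the NumPy sketch below simulates one received pilot block per (22) and forms a single training pair per (23)-(24). The choice of $\overline{\mathbf{W}}$ and $\overline{\mathbf{F}}$ (DFT columns) and the case $M_{\mathrm{MS}}\geq N_{\mathrm{MS}}$, $M_{\mathrm{BS}}\geq N_{\mathrm{BS}}$ are assumptions made here only to obtain a self-contained example.

```python
import numpy as np

def dft_beams(n_ant, n_beams):
    # Columns of a DFT matrix used as unit-modulus training beams (an assumption).
    return np.exp(2j * np.pi * np.outer(np.arange(n_ant), np.arange(n_beams)) / n_beams)

def training_pair(H, m_ms, m_bs, sigma=0.1):
    """Form (X_k, Y_k) from one channel realization H (N_MS x N_BS), cf. (22)-(24)."""
    n_ms, n_bs = H.shape
    W, F = dft_beams(n_ms, m_ms), dft_beams(n_bs, m_bs)
    N = sigma / np.sqrt(2) * (np.random.randn(m_ms, m_bs) + 1j * np.random.randn(m_ms, m_bs))
    Y = W.conj().T @ H @ F + N                              # received pilots, (22)
    T_ms = np.linalg.inv(W @ W.conj().T) @ W                # M_MS >= N_MS branch of T_MS
    T_bs = F.conj().T @ np.linalg.inv(F @ F.conj().T)       # M_BS >= N_BS branch of T_BS
    G = T_ms @ Y @ T_bs                                     # ChannelNet input matrix, (23)
    X = np.stack([G.real, G.imag, np.angle(G)], axis=-1)    # N_MS x N_BS x 3 input
    label = np.concatenate([H.real.flatten('F'), H.imag.flatten('F')])  # 2*N_BS*N_MS, (24)
    return X, label
```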

III-D FL for Channel Estimation in RIS-Assisted Massive MIMO

In this part, we examine the channel estimation problem in RIS-assisted massive MIMO, which is shown in Fig. 3. First, we present the received signal model including both the direct (BS-user) and cascaded (BS-RIS-user) channels. (Channel estimation is required to design the passive beamformer weights; although the BS-user, BS-RIS and RIS-user channels can be estimated separately [47], estimating the direct and cascaded channels is sufficient for beamformer design [48, 49].) Then, we show how the input-output pairs of ChannelNet are obtained for the RIS-assisted scenario.

Figure 3: RIS-assisted mm-Wave massive MIMO scenario.

We consider downlink channel estimation, where the BS has $N_{\mathrm{BS}}$ antennas to serve $K$ single-antenna users with the assistance of an RIS composed of $N_{\mathrm{RIS}}$ reflective elements, as shown in Fig. 3. The incoming signal from the BS is reflected by the RIS, where each RIS element introduces a phase shift $\varphi_n$, $n=1,\dots,N_{\mathrm{RIS}}$. This phase shift can be adjusted through PIN (positive-intrinsic-negative) diodes, which are controlled by the RIS controller connected to the BS over the backhaul link. As a result, the RIS allows the users to receive the signal transmitted from the BS when they are distant from the BS or when the direct link is blocked. Let $\overline{\mathbf{S}}_{\mathrm{RIS}}\in\mathbb{C}^{N_{\mathrm{BS}}\times M_{\mathrm{BS}}}$ ($N_{\mathrm{BS}}\leq M_{\mathrm{BS}}$) be the pilot signals transmitted from the BS; then the received signal at the $k$th user becomes

$\mathbf{y}_k=(\mathbf{h}_{\mathrm{B},k}^{\mathsf{H}}+\boldsymbol{\psi}^{\mathsf{H}}\mathbf{V}_k^{\mathsf{H}})\overline{\mathbf{S}}_{\mathrm{RIS}}+\mathbf{n}_k, \qquad (25)$

where $\mathbf{y}_k=[y_{1,k},\dots,y_{M_{\mathrm{BS}},k}]$ and $\mathbf{n}_k=[n_{1,k},\dots,n_{M_{\mathrm{BS}},k}]$ are $1\times M_{\mathrm{BS}}$ row vectors and $\mathbf{h}_{\mathrm{B},k}\in\mathbb{C}^{N_{\mathrm{BS}}}$ represents the channel of the communication link between the BS and the $k$th user. $\boldsymbol{\psi}=[\psi_1,\dots,\psi_{N_{\mathrm{RIS}}}]^{\mathsf{T}}\in\mathbb{C}^{N_{\mathrm{RIS}}}$ is the reflecting beamformer vector, whose $n$th entry is $\psi_n=a_ne^{j\varphi_n}$, where $a_n\in\{0,1\}$ denotes the on/off state of the $n$th RIS element and $\varphi_n\in[0,2\pi]$ is the phase shift introduced by the RIS. In practice, the RIS elements cannot be perfectly turned on/off; hence, they can be modeled as $a_n=1-\epsilon_1$ (ON) and $a_n=0+\epsilon_0$ (OFF) for small $\epsilon_1,\epsilon_0\geq 0$, which represent the insertion loss of the reflecting elements [18]. In (25), $\mathbf{V}_k\in\mathbb{C}^{N_{\mathrm{BS}}\times N_{\mathrm{RIS}}}$ denotes the cascaded channel of the BS-RIS-user link and it can be written in terms of the BS-RIS and RIS-user channels as

$\mathbf{V}_k=\mathbf{H}_{\mathrm{B}}\boldsymbol{\Lambda}_k, \qquad (26)$

where $\mathbf{H}_{\mathrm{B}}\in\mathbb{C}^{N_{\mathrm{BS}}\times N_{\mathrm{RIS}}}$ is the channel between the BS and the RIS, defined similarly to (1) as

$\mathbf{H}_{\mathrm{B}}=\sqrt{\frac{N_{\mathrm{BS}}N_{\mathrm{RIS}}}{L_{\mathrm{RIS}}}}\sum_{l=1}^{L_{\mathrm{RIS}}}\alpha_l^{\mathrm{RIS}}\mathbf{a}_{\mathrm{BS}}(\phi_l^{\mathrm{BS}})\mathbf{a}_{\mathrm{RIS}}(\phi_l^{\mathrm{RIS}})^{\mathsf{H}}, \qquad (27)$

where $L_{\mathrm{RIS}}$ and $\alpha_l^{\mathrm{RIS}}$ are the number of received paths and the complex gain, respectively. $\mathbf{a}_{\mathrm{BS}}(\phi_l^{\mathrm{BS}})\in\mathbb{C}^{N_{\mathrm{BS}}}$ and $\mathbf{a}_{\mathrm{RIS}}(\phi_l^{\mathrm{RIS}})\in\mathbb{C}^{N_{\mathrm{RIS}}}$ are the steering vectors corresponding to the BS and the RIS with the AoA and AoD angles $\phi_l^{\mathrm{BS}}$ and $\phi_l^{\mathrm{RIS}}$, respectively. In (26), $\boldsymbol{\Lambda}_k=\mathrm{diag}\{\mathbf{h}_{\mathrm{S},k}\}$, where $\mathbf{h}_{\mathrm{S},k}\in\mathbb{C}^{N_{\mathrm{RIS}}}$ represents the channel between the RIS and the $k$th user. $\mathbf{h}_{\mathrm{S},k}$ and $\mathbf{h}_{\mathrm{B},k}$ have a similar structure and are defined as

$\mathbf{h}_{\mathrm{B},k}=\sqrt{\frac{N_{\mathrm{BS}}}{L_{\mathrm{B}}}}\sum_{l=1}^{L_{\mathrm{B}}}\alpha_{k,l}^{\mathrm{B}}\mathbf{a}_{\mathrm{BS}}(\phi_{k,l}^{\mathrm{B}}), \qquad (28)$
$\mathbf{h}_{\mathrm{S},k}=\sqrt{\frac{N_{\mathrm{RIS}}}{L_{\mathrm{S}}}}\sum_{l=1}^{L_{\mathrm{S}}}\alpha_{k,l}^{\mathrm{S}}\mathbf{a}_{\mathrm{RIS}}(\phi_{k,l}^{\mathrm{S}}), \qquad (29)$

where $L_{\mathrm{B}}$, $\alpha_{k,l}^{\mathrm{B}}$ and $\mathbf{a}_{\mathrm{BS}}(\phi_{k,l}^{\mathrm{B}})$ ($L_{\mathrm{S}}$, $\alpha_{k,l}^{\mathrm{S}}$ and $\mathbf{a}_{\mathrm{RIS}}(\phi_{k,l}^{\mathrm{S}})$) are the number of paths, the complex gain and the steering vector for the BS-user (RIS-user) communication link, respectively.

In order to estimate the direct channel $\mathbf{h}_{\mathrm{B},k}$, we assume that all the RIS elements are turned off, i.e., $a_n=0$ for $n=1,\dots,N_{\mathrm{RIS}}$. Then, the $1\times M_{\mathrm{BS}}$ received signal at the $k$th user becomes

$\mathbf{y}_{\mathrm{B},k}=\mathbf{h}_{\mathrm{B},k}^{\mathsf{H}}\overline{\mathbf{S}}_{\mathrm{RIS}}+\mathbf{n}_{\mathrm{B},k}. \qquad (30)$

The direct BS-user channel $\mathbf{h}_{\mathrm{B},k}$ can then be estimated from the received pilot signal $\mathbf{y}_{\mathrm{B},k}$ via the LS and MMSE estimators as $\mathbf{h}_{\mathrm{B},k}^{\mathrm{LS}}=\big(\mathbf{y}_{\mathrm{B},k}\overline{\mathbf{S}}_{\mathrm{RIS}}^{\dagger}\big)^{\mathsf{H}}$ and $\mathbf{h}_{\mathrm{B},k}^{\mathrm{MMSE}}=\mathbf{h}_{\mathrm{B},k}^{\mathrm{LS}}\mathbf{R}_{\mathrm{B},k}\big(\mathbf{R}_{\mathrm{B},k}+\frac{1}{\sigma^2}\overline{\mathbf{S}}_{\mathrm{RIS}}\overline{\mathbf{S}}_{\mathrm{RIS}}^{\mathsf{H}}\big)^{-1}$, where $\mathbf{R}_{\mathrm{B},k}=\mathbb{E}\{\mathbf{h}_{\mathrm{B},k}\mathbf{h}_{\mathrm{B},k}^{\mathsf{H}}\}$ [16].
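A minimal NumPy sketch of the LS step is shown below (the MMSE estimator additionally needs the covariance $\mathbf{R}_{\mathrm{B},k}$ and the noise level, so it is omitted); the synthetic, noise-free self-check is purely illustrative.

```python
import numpy as np

def direct_channel_ls(y_bk, S_ris):
    """LS estimate h_LS = (y_B,k S_RIS^+)^H of the BS-user channel, cf. (30)."""
    return (y_bk @ np.linalg.pinv(S_ris)).conj().T

# Noise-free self-check: LS recovers h_B,k whenever S_RIS has full row rank
# (N_BS <= M_BS, as assumed in the text above).
n_bs, m_bs = 8, 16
S_ris = (np.random.randn(n_bs, m_bs) + 1j * np.random.randn(n_bs, m_bs)) / np.sqrt(2)
h = (np.random.randn(n_bs, 1) + 1j * np.random.randn(n_bs, 1)) / np.sqrt(2)
y = h.conj().T @ S_ris            # received pilots as in (30), noise omitted
assert np.allclose(direct_channel_ls(y, S_ris), h)
```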

Next, we consider the cascaded channel estimation. We assume that the RIS elements are turned on one by one while all the other elements are turned off. This is done by the BS requesting the RIS, via a micro-controller device over the backhaul link, to turn on a single RIS element at a time. Then, the reflecting beamformer vector in the $n$th frame becomes $\boldsymbol{\psi}^{(n)}=[0,\dots,0,\psi_n,0,\dots,0]^{\mathsf{T}}$, where $a_{\tilde{n}}=0$ for $\tilde{n}=1,\dots,N_{\mathrm{RIS}}$, $\tilde{n}\neq n$, and the received signal is given by

$\mathbf{y}_{\mathrm{C},k}^{(n)}=(\mathbf{h}_{\mathrm{B},k}^{\mathsf{H}}+{\mathbf{v}_k^{(n)}}^{\mathsf{H}})\overline{\mathbf{S}}_{\mathrm{RIS}}+\mathbf{n}_k, \qquad (31)$

where $\mathbf{v}_k^{(n)}\in\mathbb{C}^{N_{\mathrm{BS}}}$ is the $n$th column of $\mathbf{V}_k$, i.e., $\mathbf{v}_k^{(n)}=\mathbf{V}_k\boldsymbol{\psi}^{(n)}$ with $\psi_n=1$. Using the estimate of $\mathbf{h}_{\mathrm{B},k}$ from (30), (31) can be solved for $\mathbf{v}_k^{(n)}$, $n=1,\dots,N_{\mathrm{RIS}}$, and the cascaded channel $\mathbf{V}_k$ can be estimated. The received data for $n=1,\dots,N_{\mathrm{RIS}}$ can then be stacked into $\mathbf{Y}_{\mathrm{C},k}=[{\mathbf{y}_{\mathrm{C},k}^{(1)}}^{\mathsf{T}},\dots,{\mathbf{y}_{\mathrm{C},k}^{(N_{\mathrm{RIS}})}}^{\mathsf{T}}]^{\mathsf{T}}\in\mathbb{C}^{N_{\mathrm{RIS}}\times M_{\mathrm{BS}}}$. In order to train ChannelNet for the RIS-assisted massive MIMO scenario, we select the input-output data pairs as $\{\mathbf{y}_{\mathrm{B},k},\mathbf{h}_{\mathrm{B},k}\}$ and $\{\mathbf{Y}_{\mathrm{C},k},\mathbf{V}_k\}$ for the direct and cascaded channels, respectively. To jointly learn both channels, a single input is constructed to train a single NN as $\boldsymbol{\Upsilon}_k=[\mathbf{y}_{\mathrm{B},k}^{\mathsf{T}},\mathbf{Y}_{\mathrm{C},k}^{\mathsf{T}}]^{\mathsf{T}}\in\mathbb{C}^{(N_{\mathrm{RIS}}+1)\times M_{\mathrm{BS}}}$. Following the same strategy as in the previous scenario, the three "channels" of the input data are constructed as $[\mathcal{X}_k]_1=\operatorname{Re}\{\boldsymbol{\Upsilon}_k\}$, $[\mathcal{X}_k]_2=\operatorname{Im}\{\boldsymbol{\Upsilon}_k\}$ and $[\mathcal{X}_k]_3=\angle\{\boldsymbol{\Upsilon}_k\}$, respectively. We define the output data as $\boldsymbol{\Sigma}_k=[\mathbf{h}_{\mathrm{B},k},\mathbf{V}_k]\in\mathbb{C}^{N_{\mathrm{BS}}\times(N_{\mathrm{RIS}}+1)}$; hence, the output label is the $2N_{\mathrm{BS}}(N_{\mathrm{RIS}}+1)\times 1$ real-valued vector

$\mathcal{Y}_k=\big[\mathrm{vec}\{\operatorname{Re}\{\boldsymbol{\Sigma}_k\}\}^{\mathsf{T}},\mathrm{vec}\{\operatorname{Im}\{\boldsymbol{\Sigma}_k\}\}^{\mathsf{T}}\big]^{\mathsf{T}}. \qquad (32)$

Consequently, the sizes of $\mathcal{X}_k$ and $\mathcal{Y}_k$ are $(N_{\mathrm{RIS}}+1)\times M_{\mathrm{BS}}\times 3$ and $2N_{\mathrm{BS}}(N_{\mathrm{RIS}}+1)\times 1$, respectively.
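A small NumPy sketch of this joint input/label construction (shapes only; the arrays themselves would come from the pilot signals in (30)-(31) and the corresponding channels):

```python
import numpy as np

def ris_training_pair(y_Bk, Y_Ck, h_Bk, V_k):
    """Stack the direct and cascaded pilots into Upsilon_k and vectorize [h_B,k, V_k]."""
    Upsilon = np.vstack([y_Bk, Y_Ck])                 # (N_RIS + 1) x M_BS
    X = np.stack([Upsilon.real, Upsilon.imag, np.angle(Upsilon)], axis=-1)
    Sigma = np.hstack([h_Bk.reshape(-1, 1), V_k])     # N_BS x (N_RIS + 1)
    Y = np.concatenate([Sigma.real.flatten('F'), Sigma.imag.flatten('F')])  # (32)
    return X, Y   # sizes (N_RIS+1) x M_BS x 3 and 2 * N_BS * (N_RIS+1)
```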

III-E Neural Network Architecture and Training

We design a single CNN, i.e., ChannelNet, trained on two different datasets for the conventional and RIS-assisted massive MIMO applications. The proposed architecture is a CNN with 10 layers. The first layer is the input layer, which accepts input data of size $N_{\mathrm{MS}}\times N_{\mathrm{BS}}\times 3$ and $(N_{\mathrm{RIS}}+1)\times M_{\mathrm{BS}}\times 3$ for the conventional and RIS-assisted massive MIMO scenarios, respectively. The $\{2,4,6\}$th layers are convolutional layers with $N_{\mathrm{SF}}=128$ filters, each employing a $3\times 3$ kernel for 2-D spatial feature extraction. The $\{3,5,7\}$th layers are normalization layers. The eighth layer is a fully connected layer with $N_{\mathrm{FCL}}=1024$ units, whose main purpose is feature mapping. The ninth layer is a dropout layer with probability $\kappa=1/2$. The dropout layer applies an $N_{\mathrm{FCL}}\times 1$ mask to the weights of the fully connected layer, whose elements are drawn uniformly at random from $\{0,1\}$. As a result, at each iteration of FL training, a different randomly selected subset of the fully connected layer weights is updated. Thus, the dropout layer reduces the size of $\boldsymbol{\theta}_t$ and $\mathbf{g}_k(\boldsymbol{\theta}_t)$, thereby reducing the model transmission overhead. Finally, the last layer is the output regression layer, yielding the channel estimate of size $2N_{\mathrm{MS}}N_{\mathrm{BS}}\times 1$ and $2N_{\mathrm{BS}}(N_{\mathrm{RIS}}+1)\times 1$ for the conventional and RIS-assisted massive MIMO applications, respectively.
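As a point of reference, a PyTorch sketch of this architecture is given below; the paper does not specify the normalization type, activation functions or padding, so batch normalization, ReLU activations and "same" padding are assumptions made purely to obtain a runnable example (PyTorch expects the three "channels" first, i.e., inputs of shape 3 x height x width).

```python
import torch.nn as nn

def channelnet(in_h, in_w, out_dim, n_sf=128, n_fcl=1024, dropout=0.5):
    """Sketch of ChannelNet: three conv+norm blocks, a fully connected layer with
    dropout, and an output regression layer."""
    return nn.Sequential(
        nn.Conv2d(3, n_sf, 3, padding=1), nn.BatchNorm2d(n_sf), nn.ReLU(),
        nn.Conv2d(n_sf, n_sf, 3, padding=1), nn.BatchNorm2d(n_sf), nn.ReLU(),
        nn.Conv2d(n_sf, n_sf, 3, padding=1), nn.BatchNorm2d(n_sf), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(n_sf * in_h * in_w, n_fcl), nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(n_fcl, out_dim),   # output regression layer
    )

# Conventional massive MIMO: channelnet(N_MS, N_BS, 2 * N_MS * N_BS)
# RIS-assisted scenario:     channelnet(N_RIS + 1, M_BS, 2 * N_BS * (N_RIS + 1))
```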

During FL-based training, the datasets collected at the users are used to compute the model updates as in Section III-B, which are then transmitted to the BS. The model updates collected at the BS are aggregated as in (10), and the updated model is broadcast to the users for the next iteration. This process is repeated for $t=1,\dots,T$ communication rounds until convergence.

Figure 4: The proposed CNN architecture for channel estimation.

IV Communication Overhead and Complexity

IV-A Communication Overhead

Communication overhead can be defined as the amount of data transmitted during model training. Let $\mathcal{T}_{\mathrm{FL}}$ and $\mathcal{T}_{\mathrm{CL}}$ denote the communication overhead of FL and CL, respectively. Then, $\mathcal{T}_{\mathrm{CL}}$ for the conventional and RIS-assisted scenarios is

$\mathcal{T}_{\mathrm{CL}}=\begin{cases}(3N_{\mathrm{MS}}N_{\mathrm{BS}}+2N_{\mathrm{MS}}N_{\mathrm{BS}})\mathsf{D}, & \mathrm{mMIMO}\\(3(N_{\mathrm{RIS}}+1)M_{\mathrm{BS}}+2N_{\mathrm{BS}}(N_{\mathrm{RIS}}+1))\mathsf{D}, & \mathrm{RIS},\end{cases} \qquad (35)$

which is the number of symbols in the uplink transmission of the training datasets from the users to the BS. In contrast, the communication overhead of FL comprises the transmission of $\mathbf{g}_k(\boldsymbol{\theta}_t)$ in the uplink and $\boldsymbol{\theta}_t$ in the downlink for $t=1,\dots,T$. Hence, $\mathcal{T}_{\mathrm{FL}}$ is given by

$\mathcal{T}_{\mathrm{FL}}=\begin{cases}2PTK, & \mathrm{mMIMO}\\2PTK, & \mathrm{RIS}.\end{cases} \qquad (38)$

We can see that the dominant terms in (35) and (38) are $\mathsf{D}$ and $P$, which are the number of training data pairs and the number of NN parameters, respectively. While $\mathsf{D}$ can be adjusted according to the amount of data available at the users, $P$ is usually fixed during model training. Here, $P$ is computed as $P=\underbrace{N_{\mathrm{CL}}(CN_{\mathrm{SF}}W_xW_y)}_{\mathrm{convolutional\ layers}}+\underbrace{\kappa N_{\mathrm{SF}}W_xW_yN_{\mathrm{FCL}}}_{\mathrm{fully\ connected\ layer}}$, where $N_{\mathrm{CL}}=3$ is the number of convolutional layers, $C=3$ is the number of spatial "channels" and $W_x=W_y=3$ is the 2-D kernel size. As a result, we have $P=600{,}192$. Since the number of samples in the training dataset is usually larger than the number of model parameters, it is expected that $\mathcal{T}_{\mathrm{FL}}<\mathcal{T}_{\mathrm{CL}}$ [35, 30, 31] (see Fig. 11).
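The parameter count can be reproduced with a one-line calculation from the quantities just defined:

```python
# P = N_CL * (C * N_SF * Wx * Wy) + kappa * N_SF * Wx * Wy * N_FCL
N_CL, C, N_SF, Wx, Wy, kappa, N_FCL = 3, 3, 128, 3, 3, 0.5, 1024
P = N_CL * (C * N_SF * Wx * Wy) + kappa * N_SF * Wx * Wy * N_FCL
print(int(P))   # 600192, i.e., P = 600,192 as stated above
```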

TABLE I: Convolutional Layers Settings
$l$   $D_{x}^{(l)}$   $D_{y}^{(l)}$   $W_{x}^{(l)}$   $W_{y}^{(l)}$   $N_{\mathrm{SF}}^{(l-1)}$   $N_{\mathrm{SF}}^{(l)}$
2   $N_{\mathrm{MS}}$   $N_{\mathrm{BS}}$   3   3   3   128
4   $N_{\mathrm{MS}}$   $N_{\mathrm{BS}}$   3   3   128   128
6   $N_{\mathrm{MS}}$   $N_{\mathrm{BS}}$   3   3   128   128
Figure 5: Complexity order for CNN, MMSE and LS for channel estimation.
Figure 6: Validation RMSE (a) and channel estimation NMSE (b) with respect to $K$ in the massive MIMO scenario.

IV-B Computational Complexity

We further examine the computational complexity of the proposed CNN architecture. The time complexity of the convolutional layers can be written as [16, 36]

\mathcal{C}_{\mathrm{CL}}=\mathcal{O}\bigg(\sum_{l=1}^{N_{\mathrm{CL}}}D_{x}^{(l)}D_{y}^{(l)}W_{x}^{(l)}W_{y}^{(l)}N_{\mathrm{SF}}^{(l-1)}N_{\mathrm{SF}}^{(l)}\bigg),   (39)

where $D_{x}^{(l)}$ and $D_{y}^{(l)}$ are the column and row sizes of the output feature map, and $W_{x}^{(l)}$ and $W_{y}^{(l)}$ are the 2-D filter sizes of the $l$-th layer. $N_{\mathrm{SF}}^{(l-1)}$ and $N_{\mathrm{SF}}^{(l)}$ denote the number of input and output feature maps of the $l$-th layer, respectively. Table I lists the parameters of each convolutional layer for an $N_{\mathrm{MS}}\times N_{\mathrm{BS}}\times 3$ input. Thus, the complexity of the three convolutional layers with $128$ spatial filters of size $3\times 3$ approximately becomes

\mathcal{C}_{\mathrm{CL}}\approx\mathcal{O}\big(3\cdot 9\cdot 128^{2}N_{\mathrm{MS}}N_{\mathrm{BS}}\big).   (40)

Similarly, the time complexity of the fully connected layer is

\mathcal{C}_{\mathrm{FCL}}=\mathcal{O}\big(D_{x}D_{y}\kappa N_{\mathrm{FCL}}\big),   (41)

where $N_{\mathrm{FCL}}=1024$ is the number of units with dropout probability $\kappa=1/2$, and $D_{x}=128N_{\mathrm{MS}}N_{\mathrm{BS}}$ and $D_{y}=1$ are the 2-D input sizes of the fully connected layer. Then, the time complexity of the fully connected layer is approximately

\mathcal{C}_{\mathrm{FCL}}\approx\mathcal{O}\big(4\cdot 128^{2}N_{\mathrm{MS}}N_{\mathrm{BS}}\big).   (42)

Hence, the total time complexity of ChannelNet is $\mathcal{C}=\mathcal{C}_{\mathrm{CL}}+\mathcal{C}_{\mathrm{FCL}}$, which is approximately

\mathcal{C}\approx\mathcal{O}\big(3\cdot 9\cdot 128^{2}N_{\mathrm{MS}}N_{\mathrm{BS}}+4\cdot 128^{2}N_{\mathrm{MS}}N_{\mathrm{BS}}\big),   (43)

which further simplifies to $\mathcal{O}\big(31\cdot 128^{2}N_{\mathrm{MS}}N_{\mathrm{BS}}\big)$. Since the computation of the pseudo-inverse of the received pilot data is required in the testing stage, the complexity orders of LS and MMSE estimation are $\mathcal{O}\big(N_{\mathrm{MS}}^{2}N_{\mathrm{BS}}^{2}\big)$ and $\mathcal{O}\big(N_{\mathrm{MS}}^{3}N_{\mathrm{BS}}^{3}\big)$, respectively [16, 50].

Fig. 5 shows the time complexity comparison of CNN, MMSE and LS with respect to $N_{\mathrm{MS}}N_{\mathrm{BS}}$. We see that ChannelNet has higher complexity than LS. As the number of antennas, i.e., $N_{\mathrm{MS}}N_{\mathrm{BS}}$, increases, the complexity of MMSE approaches that of ChannelNet and exceeds it after approximately $N_{\mathrm{MS}}N_{\mathrm{BS}}\geq 720$. While the complexity of ChannelNet is comparable with that of the conventional techniques, it can be run more efficiently on parallel processors, e.g., GPUs, which can significantly reduce the computation time [16, 50, 32]. However, the implementation with GPUs is not straightforward for the other algorithms, and it requires algorithm-dependent processor configuration.
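The crossover point quoted above can be checked numerically from the leading terms of (43) and the LS/MMSE orders; the short sketch below only compares these orders and is not meant to model actual run times.

import numpy as np

N = np.arange(1, 2001, dtype=float)   # N = N_MS * N_BS
c_cnn = 31 * 128**2 * N               # ChannelNet order from (43)
c_ls = N**2                           # LS order
c_mmse = N**3                         # MMSE order
crossover = int(N[np.argmax(c_mmse > c_cnn)])
print(crossover)                      # ~713, i.e., MMSE exceeds the CNN around N_MS*N_BS ≈ 720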

Figure 7: Validation RMSE (a) and channel estimation NMSE (b) with respect to $\mathrm{SNR}_{\boldsymbol{\theta}}$ in the massive MIMO scenario.
Figure 8: Validation RMSE (a) and channel estimation NMSE (b) with respect to $\zeta\in[0,0.5]$ in the massive MIMO scenario.

V Numerical Simulations

The goal of the simulations is to compare the proposed FL-based channel estimation approach with the state-of-the-art ML-based channel estimation techniques SF-CNN [16] and MLP [17], as well as with MMSE and LS estimation, in terms of the normalized MSE (NMSE), defined by $\mathrm{NMSE}=\frac{1}{J_{T}KM}\sum_{i=1}^{J_{T}}\sum_{k=1}^{K}\sum_{m=1}^{M}\frac{\|\mathbf{H}_{k}[m]-\hat{\mathbf{H}}_{k}^{(i)}[m]\|_{\mathcal{F}}^{2}}{\|\mathbf{H}_{k}[m]\|_{\mathcal{F}}^{2}}$, where $J_{T}=100$ is the number of Monte Carlo trials. We also present the validation RMSE of the training process, defined by $\mathrm{RMSE}=\left(\frac{1}{|\mathcal{D}_{\mathrm{val}}|}\sum_{i=1}^{|\mathcal{D}_{\mathrm{val}}|}\|f(\widetilde{\mathcal{X}}^{(i)}|\boldsymbol{\theta})-\widetilde{\mathcal{Y}}^{(i)}\|_{\mathcal{F}}^{2}\right)^{1/2}$, where $\widetilde{\mathcal{X}}^{(i)}$ and $\widetilde{\mathcal{Y}}^{(i)}$ denote the input-output pairs in the validation dataset $\mathcal{D}_{\mathrm{val}}$, which includes $20\%$ of the whole dataset $\mathcal{D}$; hence, $|\mathcal{D}_{\mathrm{val}}|=0.2|\mathcal{D}|$.
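Both metrics translate directly into a few lines of Python; the array shapes below (trials, users and subcarriers stacked along leading axes) are illustrative assumptions, not a prescribed data layout.

import numpy as np

def nmse(H_true, H_est):
    # mean over trials/users/subcarriers of ||H - H_hat||_F^2 / ||H||_F^2
    err = np.linalg.norm(H_true - H_est, axis=(-2, -1))**2
    ref = np.linalg.norm(H_true, axis=(-2, -1))**2
    return np.mean(err / ref)

def validation_rmse(pred, target):
    # pred, target: stacked network outputs and labels over the validation set
    sq = np.sum(np.abs(pred - target)**2, axis=tuple(range(1, pred.ndim)))
    return np.sqrt(np.mean(sq))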

The local dataset of each user includes $N=100$ different channel realizations for $K=8$ users. In the massive MIMO scenario, the numbers of antennas at the BS and the users are $N_{\mathrm{BS}}=128$ and $N_{\mathrm{MS}}=32$, respectively, and we select $M=16$ and $L=5$. For the RIS-assisted scenario, $N_{\mathrm{BS}}=N_{\mathrm{RIS}}=64$. Hence, both scenarios have the same number of input elements, i.e., $128\cdot 32=64\cdot 64$. In both scenarios, the location of each user is selected such that $\phi_{k,l}\in\Phi_{k}$ and $\bar{\phi}_{k,l}\in\bar{\Psi}_{k}$, where $\Phi_{k}$ and $\bar{\Psi}_{k}$ are the equally divided subregions of the angular domain $\Theta$, i.e., $\Theta=\bigcup_{k\in\mathcal{K}}\Phi_{k}=\bigcup_{k\in\mathcal{K}}\bar{\Psi}_{k}$. The pilot data are generated as $\overline{\mathbf{S}}=\mathbf{I}_{M_{\mathrm{BS}}}$ and $\overline{\mathbf{S}}_{\mathrm{RIS}}=\mathbf{I}_{M_{\mathrm{BS}}}$ with $M_{\mathrm{BS}}=N_{\mathrm{BS}}$ and $M_{\mathrm{MS}}=N_{\mathrm{MS}}$. We select $\overline{\mathbf{F}}[m]$ and $\overline{\mathbf{W}}[m]$ as the first $M_{\mathrm{BS}}$ columns of an $N_{\mathrm{BS}}\times N_{\mathrm{BS}}$ DFT matrix and the first $M_{\mathrm{MS}}$ columns of an $N_{\mathrm{MS}}\times N_{\mathrm{MS}}$ DFT matrix, respectively [16]. During training, we add AWGN to the input data at three SNR levels, i.e., $\mathrm{SNR}=\{20,25,30\}$ dB, with $G_{\mathrm{mMIMO}}=20$ and $G_{\mathrm{RIS}}=20M$ realizations, in order to provide robust performance against noisy inputs [19, 18] in both scenarios. As a result, both training datasets have the same number of input-output pairs, i.e., $\textsf{D}_{\mathrm{mMIMO}}=3MKNG_{\mathrm{mMIMO}}=3\cdot 16\cdot 8\cdot 100\cdot 20=768{,}000$ and $\textsf{D}_{\mathrm{RIS}}=3KNG_{\mathrm{RIS}}=3\cdot 8\cdot 100\cdot 320=768{,}000$, respectively. The proposed ChannelNet model is realized and trained in MATLAB on a PC with a 2304-core GPU. For CL, we use the SGD algorithm with a momentum of 0.9, a mini-batch size of $M_{B}=128$, and a learning rate of 0.001. For FL, we train ChannelNet for $T=100$ communication rounds. Once the training is completed, the labels of the validation data (i.e., $20\%$ of the whole dataset) are used in the prediction stage. During the prediction stage, each user estimates its own channel by feeding ChannelNet with $\mathbf{G}_{k}[m]$ ($\boldsymbol{\Upsilon}_{k}$) and obtains $\hat{\mathbf{H}}_{k}[m]$ ($\hat{\mathbf{h}}_{\mathrm{B},k}$ and $\hat{\mathbf{V}}_{k}$) at the output for the massive MIMO (RIS) scenario, respectively. The source codes of the FL-based channel estimation scheme can be found at https://sites.google.com/view/elbir/publications.
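As a simple illustration of the noisy-input augmentation described above, the sketch below adds AWGN to a real-valued input array at the three training SNR levels; the random seed and the array handling are arbitrary choices for reproducibility, not part of the original setup.

import numpy as np

def add_awgn(G, snr_db, rng):
    # scale the noise power to the empirical signal power of G
    sig_pow = np.mean(np.abs(G)**2)
    noise_pow = sig_pow / 10**(snr_db / 10)
    return G + np.sqrt(noise_pow) * rng.standard_normal(G.shape)

rng = np.random.default_rng(0)

def augment(G, snr_levels=(20, 25, 30)):
    # three noisy copies per channel realization, as in the training-set construction above
    return [add_awgn(G, s, rng) for s in snr_levels]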

Figure 9: Validation RMSE (a) and channel estimation NMSE (b) for different quantization levels in the massive MIMO scenario.
Figure 10: Channel estimation NMSE for different algorithms in the massive MIMO scenario.
Figure 11: Communication overhead for FL- and CL-based model training.

V-A Channel Estimation in Massive MIMO

Figure 12: Validation RMSE (a) and channel estimation NMSE (b) with respect to $\mathrm{SNR}_{\boldsymbol{\theta}}$ in the RIS-assisted massive MIMO scenario.

In Fig. 6, we present the training performance (Fig. 6a) and the channel estimation NMSE (Fig. 6b) of the proposed FL approach for different numbers of users. In this scenario, we fix the total dataset size D by selecting $G=20\cdot\frac{8}{K}$. As $K$ increases, the training performance improves and approaches that of CL, since the model updates superposed at the BS become more robust against the noise. As $K$ decreases, the corruption in the model aggregation increases due to the diversity in the training dataset.

Fig. 7 shows the training and channel estimation performance for different noise levels added to the transmitted gradient and model data when $K=8$. Here, we add AWGN to both $\mathbf{g}_{k}(\boldsymbol{\theta}_{t})$ and $\boldsymbol{\theta}_{t}$ according to $\mathrm{SNR}_{\boldsymbol{\theta}}=20\log_{10}\frac{||\mathbf{g}_{k}(\boldsymbol{\theta}_{t})||_{2}^{2}}{\sigma_{\boldsymbol{\theta}}^{2}}$. We observe in Fig. 7a that the training diverges for low $\mathrm{SNR}_{\boldsymbol{\theta}}$ (e.g., $\mathrm{SNR}_{\boldsymbol{\theta}}\leq 5$ dB) due to the corruption of the model parameters. The corresponding channel estimation performance is presented in Fig. 7b for the cases in which ChannelNet converges; at least $\mathrm{SNR}_{\boldsymbol{\theta}}=15$ dB is required to obtain reasonable channel estimation performance, e.g., $\mathrm{NMSE}\leq 0.001$.
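A minimal sketch of this corruption model is given below; since the text does not specify how $\sigma_{\boldsymbol{\theta}}^{2}$ is distributed over the $P$ entries, a per-entry noise variance of $\sigma_{\boldsymbol{\theta}}^{2}/P$ is assumed here purely for illustration.

import numpy as np

def corrupt_model(g, snr_theta_db, rng=np.random.default_rng(1)):
    # sigma_theta^2 solved from SNR_theta = 20*log10(||g||_2^2 / sigma_theta^2)
    sigma2 = np.linalg.norm(g)**2 / 10**(snr_theta_db / 20)
    return g + np.sqrt(sigma2 / g.size) * rng.standard_normal(g.shape)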

Fig. 8 shows the training and channel estimation performance in case of an impulsive noise causing the loss of gradient and model data. In this experiment, we multiply $\mathbf{g}_{k}(\boldsymbol{\theta}_{t})$ and $\boldsymbol{\theta}_{t}$ element-wise with $\mathbf{u}\in\mathbb{R}^{P}$, i.e., $\mathbf{g}_{k}(\boldsymbol{\theta}_{t})\odot\mathbf{u}$ and $\boldsymbol{\theta}_{t}\odot\mathbf{u}$, where $\lfloor 100\zeta\rfloor\%$ of the elements of $\mathbf{u}$ are $0$ and the remaining elements are $1$. This allows us to simulate the case in which a portion of the gradient/model data is completely lost during transmission. We observe that the loss of model data significantly affects both training and channel estimation accuracy. Therefore, reliable channel estimation demands at most $5\%$ parameter loss during transmission.
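The masking operation can be sketched as follows; reading $\lfloor 100\zeta\rfloor$ as a percentage of the $P$ entries is our interpretation of the description above, and the random selection of the lost entries is an assumption.

import numpy as np

def drop_parameters(g, zeta, rng=np.random.default_rng(2)):
    # zero out floor(100*zeta) percent of the entries of g (the mask u above)
    n_lost = int(np.floor(100 * zeta)) * g.size // 100
    u = np.ones(g.size)
    u[rng.choice(g.size, size=n_lost, replace=False)] = 0.0
    return g * u.reshape(g.shape)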

Figure 13: Validation RMSE (a) and channel estimation NMSE (b) for different quantization levels in the RIS-assisted massive MIMO scenario.

Fig. 9 shows the training and channel estimation performance when the transmitted data (i.e., $\mathbf{g}_{k}(\boldsymbol{\theta}_{t})$ and $\boldsymbol{\theta}_{t}$) are quantized with $B$ bits. As expected, the performance improves as $B$ increases, and at least $B=5$ bits are required to obtain a reasonable channel estimation performance. Compared to the results in Fig. 7, quantization has more influence on the accuracy than $\mathrm{SNR}_{\boldsymbol{\theta}}$.
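Since the quantizer itself is not specified in the text, the sketch below uses a symmetric uniform $B$-bit quantizer over the dynamic range of the transmitted vector as a stand-in.

import numpy as np

def quantize(g, B):
    # uniform quantizer with 2^B levels over [-max|g|, max|g|]
    g_max = np.max(np.abs(g)) + 1e-12
    step = 2 * g_max / 2**B
    q = np.clip(np.round(g / step), -(2**(B - 1)), 2**(B - 1) - 1)
    return q * step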

In Fig. 10, we present the channel estimation NMSE for different algorithms when $K=8$. We train ChannelNet with both the CL and FL frameworks and observe that CL closely follows the MMSE performance. CL provides better performance than FL since it has access to the whole dataset at once. Nevertheless, FL achieves satisfactory channel estimation performance despite decentralized training. Specifically, FL and CL have similar NMSE for $\mathrm{SNR}\leq 25$ dB, and the performance of FL saturates in the high SNR regime. This is because the learning model loses precision due to FL training and cannot perform better, which is a common problem in ML-based techniques [16, 18]. In order to improve the performance, over-training can be employed so that more precision is obtained. However, this introduces overfitting, i.e., the model memorizes the data and, hence, cannot perform well for different inputs. Fig. 10 also compares training with perfect (true channel data) and imperfect (estimated channel via ADCE) labels. The use of imperfect labels causes a slight performance degradation, while still providing lower NMSE than SF-CNN and MLP. The other algorithms exhibit similar behavior but perform worse than ChannelNet. This is because SF-CNN and MLP have convolutional-only and fully-connected-only layers, respectively, whereas ChannelNet includes both structures and, hence, exhibits better feature extraction and data mapping performance.

According to the analysis in Sec. IV-A, the communication overheads of FL and CL are $2PTK=2\cdot 600{,}192\cdot 100\cdot 8\approx 960\times 10^{6}$ and $(3N_{\mathrm{MS}}N_{\mathrm{BS}}+2N_{\mathrm{MS}}N_{\mathrm{BS}})\textsf{D}=(5\cdot 128\cdot 32)\cdot 768{,}000\approx 16\times 10^{9}$, respectively. This clearly shows the effectiveness of FL over CL. We also present the number of transmitted symbols during training with respect to data transmission blocks in Fig. 11, where we assume that 1000 data symbols are transmitted in each transmission block. We can see that it takes about $1\times 10^{6}$ data blocks to complete the gradient/model transmission in FL (see, e.g., Fig. 1b), whereas CL-based training demands approximately $16\times 10^{6}$ data blocks to transmit the training dataset (see, e.g., Fig. 1a). Therefore, the communication overhead of FL is approximately 16 times lower than that of CL.

V-B Channel Estimation in RIS-assisted Massive MIMO

In Fig. 12, we present the validation RMSE and the channel estimation NMSE. We compute the NMSE of the direct channel and the cascaded channel together and present the results in a single plot. Results similar to the massive MIMO scenario are obtained: model training diverges when $\mathrm{SNR}_{\boldsymbol{\theta}}\leq 5$ dB and the channel estimation NMSE becomes relatively small when $\mathrm{SNR}_{\boldsymbol{\theta}}\geq 15$ dB.

Fig. 13 shows the validation RMSE and the channel estimation NMSE for different quantization levels. A small number of bits causes a loss of precision in the channel estimation NMSE. Similar to the massive MIMO scenario, at least $B\geq 5$ bits are required to obtain satisfactory channel estimation performance at high SNR, i.e., $\mathrm{SNR}\geq 20$ dB.

VI Conclusions

In this paper, we propose an FL framework for channel estimation in conventional and RIS-assisted massive MIMO systems. We evaluate the performance of the proposed approach via several numerical simulations for different numbers of users and when the gradient/model parameters are quantized and corrupted by noise. We show that at least 5-bit quantization and 15 dB SNR on the model parameters are required for reliable channel estimation performance, i.e., $\mathrm{NMSE}\leq 0.001$. We further analyze the scenario in which a portion of the gradient/model parameters is completely lost and observe that FL exhibits satisfactory performance under at most $5\%$ information loss. We also examine the channel estimation performance of the proposed CNN architecture with both perfect and imperfect labels. A slight performance degradation is observed for imperfect labels as compared to the perfect CSI case. Nevertheless, the performance in the imperfect-label scenario strongly depends on the accuracy of the channel estimation algorithm employed during training dataset collection. Furthermore, the proposed CNN architecture provides lower NMSE than the state-of-the-art NN architectures. Apart from the channel estimation performance, the FL-based approach enjoys approximately 16 times lower transmission overhead as compared to CL-based training. As future work, we plan to develop compression-based techniques for both the training data and the model parameters to further reduce the communication overhead.

Appendix A Proof of Theorem 1

We first make the following assumptions, which are needed to ensure convergence and are typical for $\ell_{2}$-norm regularized linear regression, logistic regression, and softmax classifiers [42, 41, 35].

Assumption 1: The loss function $\mathcal{L}(\boldsymbol{\theta})$ is convex, i.e., $\mathcal{L}((1-\lambda)\boldsymbol{\theta}+\lambda\boldsymbol{\theta}^{\prime})\leq(1-\lambda)\mathcal{L}(\boldsymbol{\theta})+\lambda\mathcal{L}(\boldsymbol{\theta}^{\prime})$ for $\lambda\in[0,1]$ and arbitrary $\boldsymbol{\theta}$ and $\boldsymbol{\theta}^{\prime}$.

Assumption 2: $\mathcal{L}(\boldsymbol{\theta})$ is $L$-Lipschitz, i.e., $||\mathcal{L}(\boldsymbol{\theta})-\mathcal{L}(\boldsymbol{\theta}^{\prime})||\leq L||\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}||$ for arbitrary $\boldsymbol{\theta}$ and $\boldsymbol{\theta}^{\prime}$.

Assumption 3: $\mathcal{L}(\boldsymbol{\theta})$ is $\beta$-smooth, i.e., $||\nabla\mathcal{L}(\boldsymbol{\theta})-\nabla\mathcal{L}(\boldsymbol{\theta}^{\prime})||\leq\beta||\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}||$ for arbitrary $\boldsymbol{\theta}$ and $\boldsymbol{\theta}^{\prime}$.

In order to prove Theorem 1, we first establish the smoothness of $\bar{\mathcal{L}}(\boldsymbol{\theta})$ in the following lemma.

Lemma 1: $\bar{\mathcal{L}}(\boldsymbol{\theta})$ is a $\bar{\beta}$-smooth function, i.e., $||\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})-\nabla\bar{\mathcal{L}}(\boldsymbol{\theta}^{\prime})||\leq\bar{\beta}||\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}||$, where $\bar{\beta}=(1+\sigma_{\Delta}^{2})\beta$.

Proof: Using (16), we get

||\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})-\nabla\bar{\mathcal{L}}(\boldsymbol{\theta}^{\prime})||
=||\nabla\big(\mathcal{L}(\boldsymbol{\theta})+\sigma_{\Delta}^{2}||\nabla\mathcal{L}(\boldsymbol{\theta})||^{2}\big)-\nabla\big(\mathcal{L}(\boldsymbol{\theta}^{\prime})+\sigma_{\Delta}^{2}||\nabla\mathcal{L}(\boldsymbol{\theta}^{\prime})||^{2}\big)||
=||\big(\nabla\mathcal{L}(\boldsymbol{\theta})+\sigma_{\Delta}^{2}\nabla||\nabla\mathcal{L}(\boldsymbol{\theta})||^{2}\big)-\big(\nabla\mathcal{L}(\boldsymbol{\theta}^{\prime})+\sigma_{\Delta}^{2}\nabla||\nabla\mathcal{L}(\boldsymbol{\theta}^{\prime})||^{2}\big)||
=||\nabla\mathcal{L}(\boldsymbol{\theta})-\nabla\mathcal{L}(\boldsymbol{\theta}^{\prime})+\sigma_{\Delta}^{2}\big(\nabla\mathrm{tr}\{\nabla\mathcal{L}(\boldsymbol{\theta})^{\textsf{T}}\nabla\mathcal{L}(\boldsymbol{\theta})\}-\nabla\mathrm{tr}\{\nabla\mathcal{L}(\boldsymbol{\theta}^{\prime})^{\textsf{T}}\nabla\mathcal{L}(\boldsymbol{\theta}^{\prime})\}\big)||
=||\nabla\mathcal{L}(\boldsymbol{\theta})-\nabla\mathcal{L}(\boldsymbol{\theta}^{\prime})+\sigma_{\Delta}^{2}\big(\nabla\mathcal{L}(\boldsymbol{\theta})-\nabla\mathcal{L}(\boldsymbol{\theta}^{\prime})\big)||
=(1+\sigma_{\Delta}^{2})||\nabla\mathcal{L}(\boldsymbol{\theta})-\nabla\mathcal{L}(\boldsymbol{\theta}^{\prime})||.   (44)

By incorporating (44), Assumption 3 and $1+\sigma_{\Delta}^{2}\geq 0$, we get

||\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})-\nabla\bar{\mathcal{L}}(\boldsymbol{\theta}^{\prime})||\leq\bar{\beta}||\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}||,   (45)

where $\bar{\beta}=(1+\sigma_{\Delta}^{2})\beta$. ∎

Using (45), Assumptions 2 and 3 imply that $\bar{\mathcal{L}}(\boldsymbol{\theta})$ is twice differentiable with $\nabla^{2}\bar{\mathcal{L}}(\boldsymbol{\theta})\preceq\bar{\beta}\mathbf{I}_{P}$. Using this fact, a quadratic expansion around $\bar{\mathcal{L}}(\boldsymbol{\theta})$ yields

\bar{\mathcal{L}}(\boldsymbol{\theta}^{\prime})\leq\bar{\mathcal{L}}(\boldsymbol{\theta})+\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})^{\textsf{T}}(\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta})+\frac{1}{2}\nabla^{2}\bar{\mathcal{L}}(\boldsymbol{\theta})||\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta}||^{2}
\leq\bar{\mathcal{L}}(\boldsymbol{\theta})+\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})^{\textsf{T}}(\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta})+\frac{1}{2}\bar{\beta}||\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta}||^{2}.   (46)

Substituting the GD update $\boldsymbol{\theta}^{\prime}=\boldsymbol{\theta}-\eta\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})$ into (46), we get

\bar{\mathcal{L}}(\boldsymbol{\theta}^{\prime})\leq\bar{\mathcal{L}}(\boldsymbol{\theta})+\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})^{\textsf{T}}(\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta})+\frac{1}{2}\bar{\beta}||\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta}||^{2}
=\bar{\mathcal{L}}(\boldsymbol{\theta})+\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})^{\textsf{T}}(\boldsymbol{\theta}-\eta\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})-\boldsymbol{\theta})+\frac{1}{2}\bar{\beta}||\boldsymbol{\theta}-\eta\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})-\boldsymbol{\theta}||^{2}
=\bar{\mathcal{L}}(\boldsymbol{\theta})-\eta\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})^{\textsf{T}}\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})+\frac{1}{2}\bar{\beta}||\eta\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})||^{2}
=\bar{\mathcal{L}}(\boldsymbol{\theta})-\eta||\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})||^{2}+\frac{1}{2}\bar{\beta}\eta^{2}||\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})||^{2}
=\bar{\mathcal{L}}(\boldsymbol{\theta})-\Big(1-\frac{\bar{\beta}\eta}{2}\Big)\eta||\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})||^{2},   (47)

which bounds the GD update $\bar{\mathcal{L}}(\boldsymbol{\theta}^{\prime})$ in terms of $\bar{\mathcal{L}}(\boldsymbol{\theta})$. Now, let us bound $\bar{\mathcal{L}}(\boldsymbol{\theta}^{\prime})$ in terms of the optimal objective value $\bar{\mathcal{L}}(\boldsymbol{\theta}_{\star})$. Using Assumption 1, we have

\bar{\mathcal{L}}(\boldsymbol{\theta}_{\star})\geq\bar{\mathcal{L}}(\boldsymbol{\theta})+\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})^{\textsf{T}}(\boldsymbol{\theta}_{\star}-\boldsymbol{\theta}),
\bar{\mathcal{L}}(\boldsymbol{\theta})\leq\bar{\mathcal{L}}(\boldsymbol{\theta}_{\star})+\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})^{\textsf{T}}(\boldsymbol{\theta}-\boldsymbol{\theta}_{\star}).   (48)

Furthermore, using $\eta\leq\frac{1}{\bar{\beta}}$, we have $-(1-\frac{\bar{\beta}\eta}{2})=\frac{1}{2}\bar{\beta}\eta-1\leq\frac{1}{2}\bar{\beta}(1/\bar{\beta})-1=\frac{1}{2}-1=-\frac{1}{2}$. Thus, (47) becomes

\bar{\mathcal{L}}(\boldsymbol{\theta}^{\prime})\leq\bar{\mathcal{L}}(\boldsymbol{\theta})-\frac{\eta}{2}||\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})||^{2}.   (49)

By plugging (48) into (49), we get

\bar{\mathcal{L}}(\boldsymbol{\theta}^{\prime})\leq\bar{\mathcal{L}}(\boldsymbol{\theta}_{\star})+\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})^{\textsf{T}}(\boldsymbol{\theta}-\boldsymbol{\theta}_{\star})-\frac{\eta}{2}||\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})||^{2},   (50)

which can be rewritten as

\bar{\mathcal{L}}(\boldsymbol{\theta}^{\prime})-\bar{\mathcal{L}}(\boldsymbol{\theta}_{\star})\leq\frac{1}{2\eta}\big(2\eta\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})^{\textsf{T}}(\boldsymbol{\theta}-\boldsymbol{\theta}_{\star})-\eta^{2}||\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})||^{2}\big).   (51)

By adding $\frac{1}{2\eta}(||\boldsymbol{\theta}-\boldsymbol{\theta}_{\star}||^{2}-||\boldsymbol{\theta}-\boldsymbol{\theta}_{\star}||^{2})$ to the right hand side of (51), we get

\bar{\mathcal{L}}(\boldsymbol{\theta}^{\prime})-\bar{\mathcal{L}}(\boldsymbol{\theta}_{\star})\leq\frac{1}{2\eta}\big(||\boldsymbol{\theta}-\boldsymbol{\theta}_{\star}||^{2}-||\boldsymbol{\theta}-\boldsymbol{\theta}_{\star}-\eta\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})||^{2}\big),   (52)
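For completeness, the step from (51) to (52) uses the expansion
||\boldsymbol{\theta}-\boldsymbol{\theta}_{\star}-\eta\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})||^{2}=||\boldsymbol{\theta}-\boldsymbol{\theta}_{\star}||^{2}-2\eta\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})^{\textsf{T}}(\boldsymbol{\theta}-\boldsymbol{\theta}_{\star})+\eta^{2}||\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})||^{2},
so that the added-and-subtracted term $||\boldsymbol{\theta}-\boldsymbol{\theta}_{\star}||^{2}$ collects the remaining terms of (51) into the negative of this quantity.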

Substituting the GD update $\boldsymbol{\theta}^{\prime}=\boldsymbol{\theta}-\eta\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})$ into (52), we have

\bar{\mathcal{L}}(\boldsymbol{\theta}^{\prime})-\bar{\mathcal{L}}(\boldsymbol{\theta}_{\star})\leq\frac{1}{2\eta}\big(||\boldsymbol{\theta}-\boldsymbol{\theta}_{\star}||^{2}-||\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta}_{\star}||^{2}\big).   (53)

Now, replacing $\boldsymbol{\theta}^{\prime}$ by $\boldsymbol{\theta}_{i}$ and summing over $i=1,\dots,t$ yields

\sum_{i=1}^{t}\big(\bar{\mathcal{L}}(\boldsymbol{\theta}_{i})-\bar{\mathcal{L}}(\boldsymbol{\theta}_{\star})\big)\leq\sum_{i=1}^{t}\frac{1}{2\eta}\big(||\boldsymbol{\theta}_{i-1}-\boldsymbol{\theta}_{\star}||^{2}-||\boldsymbol{\theta}_{i}-\boldsymbol{\theta}_{\star}||^{2}\big)
=\frac{1}{2\eta}\big(||\boldsymbol{\theta}_{0}-\boldsymbol{\theta}_{\star}||^{2}-||\boldsymbol{\theta}_{t}-\boldsymbol{\theta}_{\star}||^{2}\big)\leq\frac{1}{2\eta}||\boldsymbol{\theta}_{0}-\boldsymbol{\theta}_{\star}||^{2},   (54)

where the summation on the right hand side telescopes since consecutive terms cancel each other. Since $\bar{\mathcal{L}}(\boldsymbol{\theta}_{t})$ is decreasing in $t$, we have

\bar{\mathcal{L}}(\boldsymbol{\theta}_{t})-\bar{\mathcal{L}}(\boldsymbol{\theta}_{\star})\leq\frac{1}{t}\sum_{i=1}^{t}\big(\bar{\mathcal{L}}(\boldsymbol{\theta}_{i})-\bar{\mathcal{L}}(\boldsymbol{\theta}_{\star})\big).   (55)

Inserting (54) into (55), we finally have

\bar{\mathcal{L}}(\boldsymbol{\theta}_{t})-\bar{\mathcal{L}}(\boldsymbol{\theta}_{\star})\leq\frac{1}{2\eta t}||\boldsymbol{\theta}_{0}-\boldsymbol{\theta}_{\star}||^{2}.   (56)

References

  • [1] R. W. Heath, N. González-Prelcic, S. Rangan, W. Roh, and A. M. Sayeed, “An overview of signal processing techniques for millimeter wave MIMO systems,” IEEE J. Sel. Topics Signal Process., vol. 10, no. 3, pp. 436–453, 2016.
  • [2] F. Rusek, D. Persson, B. K. Lau, E. G. Larsson, T. L. Marzetta, O. Edfors, and F. Tufvesson, “Scaling up MIMO: Opportunities and challenges with very large arrays,” IEEE Signal Process. Mag., vol. 30, no. 1, pp. 40–60, 2013.
  • [3] J. G. Andrews, S. Buzzi, W. Choi, S. V. Hanly, A. Lozano, A. C. K. Soong, and J. C. Zhang, “What will 5G be?” IEEE J. Sel. Areas Commun., vol. 32, no. 6, pp. 1065–1082, 2014.
  • [4] O. E. Ayach, S. Rajagopal, S. Abu-Surra, Z. Pi, and R. W. Heath, “Spatially sparse precoding in millimeter wave MIMO systems,” IEEE Trans. Wireless Commun., vol. 13, no. 3, pp. 1499–1513, 2014.
  • [5] Q. Wu and R. Zhang, “Towards Smart and Reconfigurable Environment: Intelligent Reflecting Surface Aided Wireless Network,” IEEE Commun. Mag., vol. 58, no. 1, pp. 106–112, January 2020.
  • [6] A. M. Elbir and K. V. Mishra, “A Survey of Deep Learning Architectures for Intelligent Reflecting Surfaces,” arXiv, Sep 2020. [Online]. Available: https://arxiv.org/abs/2009.02540v3
  • [7] C. Huang, S. Hu, G. C. Alexandropoulos, A. Zappone, C. Yuen, R. Zhang, M. Di Renzo, and M. Debbah, “Holographic MIMO Surfaces for 6G Wireless Networks: Opportunities, Challenges, and Trends,” IEEE Wireless Commun., vol. 27, no. 5, pp. 118–125, Jul 2020.
  • [8] C. Huang, A. Zappone, G. C. Alexandropoulos, M. Debbah, and C. Yuen, “Reconfigurable Intelligent Surfaces for Energy Efficiency in Wireless Communication,” IEEE Trans. Wireless Commun., vol. 18, no. 8, pp. 4157–4170, Jun 2019.
  • [9] C. Huang, R. Mo, and C. Yuen, “Reconfigurable Intelligent Surface Assisted Multiuser MISO Systems Exploiting Deep Reinforcement Learning,” IEEE J. Sel. Areas Commun., vol. 38, no. 8, pp. 1839–1850, Jun 2020.
  • [10] E. Björnson, L. Van der Perre, S. Buzzi, and E. G. Larsson, “Massive MIMO in sub-6 GHz and mmWave: Physical, practical, and use-case differences,” IEEE Wireless Commun., vol. 26, no. 2, pp. 100–108, 2019.
  • [11] A. Alkhateeb and R. W. Heath, “Frequency selective hybrid precoding for limited feedback millimeter wave systems,” IEEE Trans. Commun., vol. 64, no. 5, pp. 1801–1818, 2016.
  • [12] F. Sohrabi and W. Yu, “Hybrid analog and digital beamforming for mmWave OFDM large-scale antenna arrays,” IEEE Journal on Selected Areas in Communications, vol. 35, no. 7, pp. 1432–1443, 2017.
  • [13] A. Taha, M. Alrabeiah, and A. Alkhateeb, “Enabling large intelligent surfaces with compressive sensing and deep learning,” arXiv preprint arXiv:1904.10136, 2019.
  • [14] D. Fan, F. Gao, Y. Liu, Y. Deng, G. Wang, Z. Zhong, and A. Nallanathan, “Angle Domain Channel Estimation in Hybrid Millimeter Wave Massive MIMO Systems,” IEEE Trans. Wireless Commun., vol. 17, no. 12, pp. 8165–8179, Dec 2018.
  • [15] H. Yin, D. Gesbert, M. Filippou, and Y. Liu, “A Coordinated Approach to Channel Estimation in Large-Scale Multiple-Antenna Systems,” IEEE J. Sel. Areas Commun., vol. 31, no. 2, pp. 264–273, February 2013.
  • [16] P. Dong, H. Zhang, G. Y. Li, I. S. Gaspar, and N. NaderiAlizadeh, “Deep CNN-Based Channel Estimation for mmWave Massive MIMO Systems,” IEEE J. Sel. Topics Signal Process., vol. 13, no. 5, pp. 989–1000, Sep. 2019.
  • [17] H. Huang, J. Yang, H. Huang, Y. Song, and G. Gui, “Deep learning for super-resolution channel estimation and doa estimation based massive mimo system,” IEEE Trans. Veh. Technol., vol. 67, no. 9, pp. 8549–8560, Sept 2018.
  • [18] A. M. Elbir, A. Papazafeiropoulos, P. Kourtessis, and S. Chatzinotas, “Deep Channel Learning for Large Intelligent Surfaces Aided mm-Wave Massive MIMO Systems,” IEEE Wireless Commun. Lett., vol. 9, no. 9, pp. 1447–1451, 2020.
  • [19] A. M. Elbir, “CNN-based precoder and combiner design in mmWave MIMO systems,” IEEE Commun. Lett., vol. 23, no. 7, pp. 1240–1243, 2019.
  • [20] A. M. Elbir and K. V. Mishra, “Joint antenna selection and hybrid beamformer design using unquantized and quantized deep learning networks,” IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 1677–1688, March 2020.
  • [21] A. M. Elbir and A. Papazafeiropoulos, “Hybrid Precoding for Multi-User Millimeter Wave Massive MIMO Systems: A Deep Learning Approach,” IEEE Trans. Veh. Technol., vol. 69, no. 1, p. 552–563, 2020.
  • [22] A. M. Elbir, K. V. Mishra, and Y. C. Eldar, “Cognitive radar antenna selection via deep learning,” IET Radar, Sonar & Navigation, vol. 13, pp. 871–880, 2019.
  • [23] A. M. Elbir, “DeepMUSIC: Multiple Signal Classification via Deep Learning,” IEEE Sensors Letters, vol. 4, no. 4, pp. 1–4, 2020.
  • [24] M. M. Amiri and D. Gündüz, “Federated Learning Over Wireless Fading Channels,” IEEE Trans. Wireless Commun., vol. 19, no. 5, pp. 3546–3557, 2020.
  • [25] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated Learning: Challenges, Methods, and Future Directions,” IEEE Signal Process. Mag., vol. 37, no. 3, pp. 50–60, 2020.
  • [26] A. M. Elbir and S. Coleri, “Federated Learning for Vehicular Networks,” arXiv preprint arXiv:2006.01412, 2020.
  • [27] M. M. Wadu, S. Samarakoon, and M. Bennis, “Federated learning under channel uncertainty: Joint client scheduling and resource allocation,” arXiv preprint arXiv:2002.00802, 2020.
  • [28] T. Zeng, O. Semiari, M. Mozaffari, M. Chen, W. Saad, and M. Bennis, “Federated Learning in the Sky: Joint Power Allocation and Scheduling with UAV Swarms,” arXiv preprint arXiv:2002.08196, 2020.
  • [29] S. Batewela, C. Liu, M. Bennis, H. A. Suraweera, and C. S. Hong, “Risk-sensitive task fetching and offloading for vehicular edge computing,” IEEE Commun. Lett., vol. 24, no. 3, pp. 617–621, 2020.
  • [30] M. Mohammadi Amiri and D. Gündüz, “Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air,” IEEE Trans. Signal Process., vol. 68, pp. 2155–2169, 2020.
  • [31] A. M. Elbir and S. Coleri, “Federated Learning for Hybrid Beamforming in mm-Wave Massive MIMO,” IEEE Commun. Lett., pp. 1–1, 2020.
  • [32] A. M. Elbir, K. V. Mishra, M. R. B. Shankar, and B. Ottersten, “Online and Offline Deep Learning Strategies For Channel Estimation and Hybrid Beamforming in Multi-Carrier mm-Wave Massive MIMO Systems,” arXiv preprint arXiv:1912.10036, 2019.
  • [33] J. Yuan, H. Q. Ngo, and M. Matthaiou, “Machine Learning-Based Channel Prediction in Massive MIMO With Channel Aging,” IEEE Trans. Wireless Commun., vol. 19, no. 5, pp. 2960–2973, Feb 2020.
  • [34] H. Ye, G. Y. Li, and B. Juang, “Power of deep learning for channel estimation and signal detection in OFDM systems,” IEEE Commun. Lett., vol. 7, no. 1, pp. 114–117, 2018.
  • [35] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas, “Communication-Efficient Learning of Deep Networks from Decentralized Data,” arXiv, Feb 2016. [Online]. Available: https://arxiv.org/abs/1602.05629v3
  • [36] A. M. Elbir, “A Deep Learning Framework for Hybrid Beamforming Without Instantaneous CSI Feedback,” IEEE Trans. Veh. Technol., pp. 1–1, 2020.
  • [37] W. U. Bajwa, J. Haupt, G. Raz, and R. Nowak, “Compressed channel sensing,” in Annual Conference on Information Sciences and Systems, March 2008, pp. 5–10.
  • [38] Z. Marzi, D. Ramasamy, and U. Madhow, “Compressive Channel Estimation and Tracking for Large Arrays in mm-Wave Picocells,” IEEE J. Sel. Topics Signal Process., vol. 10, no. 3, pp. 514–527, April 2016.
  • [39] A. Klautau, P. Batista, N. González-Prelcic, Y. Wang, and R. W. Heath, “5G MIMO Data for Machine Learning: Application to Beam-Selection Using Deep Learning,” in 2018 Information Theory and Applications Workshop (ITA), 2018, pp. 1–9.
  • [40] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” in Advances in Neural Information Processing Systems, 2017, pp. 1709–1720.
  • [41] F. Ang, L. Chen, N. Zhao, Y. Chen, W. Wang, and F. R. Yu, “Robust Federated Learning With Noisy Communication,” IEEE Trans. Commun., vol. 68, no. 6, pp. 3452–3464, Mar 2020.
  • [42] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the Convergence of FedAvg on Non-IID Data,” in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=HJxNAnVtDS
  • [43] Y. Lecun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
  • [44] C. M. Bishop, “Training with Noise is Equivalent to Tikhonov Regularization,” Neural Comput., vol. 7, no. 1, pp. 108–116, Jan 1995.
  • [45] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A Joint Learning and Communications Framework for Federated Learning over Wireless Networks,” IEEE Trans. Wireless Commun., p. 1, Oct 2020.
  • [46] T. T. Vu, D. T. Ngo, N. H. Tran, H. Q. Ngo, M. N. Dao, and R. H. Middleton, “Cell-Free Massive MIMO for Wireless Federated Learning,” IEEE Trans. Wireless Commun., vol. 19, no. 10, pp. 6377–6392, Jun 2020.
  • [47] L. Wei, C. Huang, G. C. Alexandropoulos, C. Yuen, Z. Zhang, and M. Debbah, "Channel Estimation for RIS-Empowered Multi-User MISO Wireless Communications," arXiv, Aug 2020. [Online]. Available: https://arxiv.org/abs/2008.01459v1
  • [48] S. Lin, B. Zheng, G. C. Alexandropoulos, M. Wen, M. Di Renzo, and F. Chen, “Reconfigurable Intelligent Surfaces with Reflection Pattern Modulation: Beamforming Design and Performance Analysis,” IEEE Trans. Wireless Commun., p. 1, Oct 2020.
  • [49] D. Mishra and H. Johansson, “Channel Estimation and Low-complexity Beamforming Design for Passive Intelligent Surface Assisted MISO Wireless Energy Transfer,” ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4659–4663, Dec 2017.
  • [50] A. M. Elbir and K. V. Mishra, “Sparse array selection across arbitrary sensor geometries with deep transfer learning,” IEEE Trans. on Cogn. Commun. Netw., pp. 1–1, 2020.
Ahmet M. Elbir (IEEE Senior Member) received the Ph.D. degree from Middle East Technical University in 2016. He is a Senior Researcher at Duzce University, Duzce, Turkey, and a Research Fellow at the University of Hertfordshire, Hatfield, UK.
Sinem Coleri (IEEE Senior Member) received the Ph.D. degree from the University of California at Berkeley in 2005. She is a Faculty Member with the Department of Electrical and Electronics Engineering, Koc University, Turkey.