
Federated Learning for Channel Estimation in Conventional and RIS-Assisted Massive MIMO

Ahmet M. Elbir, Senior Member, IEEE, and Sinem Coleri, Senior Member, IEEE. The work of Sinem Coleri was supported by the Scientific and Technological Research Council of Turkey with European CHIST-ERA grant 119E350. A. M. Elbir is with the Department of Electrical and Electronics Engineering, Duzce University, Duzce, Turkey, and with the University of Hertfordshire, Hatfield, UK (e-mail: [email protected]). S. Coleri is with the Department of Electrical and Electronics Engineering, Koc University, Istanbul, Turkey (e-mail: [email protected]).
Abstract

Machine learning (ML) has attracted great research interest for physical layer design problems, such as channel estimation, thanks to its low complexity and robustness. Channel estimation via ML requires model training on a dataset, which usually includes the received pilot signals as input and the channel data as output. In previous works, model training is mostly done via centralized learning (CL), where the whole training dataset is collected from the users at the base station (BS). This approach introduces a huge communication overhead for data collection. In this paper, to address this challenge, we propose a federated learning (FL) framework for channel estimation. We design a convolutional neural network (CNN) trained on the local datasets of the users without sending them to the BS. We develop FL-based channel estimation schemes for both conventional and RIS (reconfigurable intelligent surface) assisted massive MIMO (multiple-input multiple-output) systems, where a single CNN is trained on two different datasets covering both scenarios. We evaluate the performance for noisy and quantized model transmission and show that the proposed approach provides approximately 16 times lower overhead than CL, while maintaining satisfactory performance close to that of CL. Furthermore, the proposed architecture exhibits lower estimation error than the state-of-the-art ML-based schemes.

Index Terms:
Channel estimation, Federated learning, Machine learning, Centralized learning, Massive MIMO.

I Introduction

Compared to the cellular communication systems in lower frequency bands, millimeter wave (mm-Wave) signals, in the 30-300 GHz frequency range, encounter a more complex propagation environment characterized by higher scattering, severe penetration losses, and higher path loss for fixed transmitter and receiver gains [1, 2, 3]. These losses are compensated by providing beamforming power gain through a massive number of antennas at both the transmitter and the receiver with a multiple-input multiple-output (MIMO) architecture. However, such a large antenna array requires a dedicated radio-frequency (RF) chain for each antenna, resulting in an expensive system architecture and high power consumption. In order to address this issue and reduce the number of digital RF components, hybrid analog and baseband beamforming architectures have been introduced, wherein a small number of phase-only analog beamformers are employed [4]. As a result, the combination of high-dimensional analog and low-dimensional baseband beamformers significantly reduces the number of RF chains while maintaining sufficient beamforming gain [4].

Even with a reduced number of RF chains, the hybrid beamforming architecture combined with mm-Wave transmission still comes at a considerable cost in energy consumption and hardware complexity [5]. In order to address these issues and provide a greener and more suitable solution to enhance wireless network performance, reconfigurable intelligent surfaces (RISs), also known as intelligent reflecting surfaces, have been envisaged as a promising solution with low cost and complexity [6, 7, 5, 8, 9]. An RIS is an electromagnetic 2-D surface composed of a large number of passive reconfigurable meta-material elements, which reflect the incoming signal by introducing a pre-determined phase shift. This phase shift can be controlled via external signals by the base station (BS) through a backhaul control link. As a result, the incoming signal from the BS can be manipulated in real time and reflected towards the users. Hence, the use of an RIS improves the received signal energy at distant users and expands the coverage of the BS.

In both conventional and RIS-assisted massive MIMO scenarios, the performance of the system architecture strongly relies on the accuracy of the instantaneous channel state information (CSI), given the highly dynamic nature of the mm-Wave channel [10]. Thus, channel estimation accuracy plays an important role in the design of the analog and digital beamformers in conventional massive MIMO [11, 12], and in the design of the reflecting beamformer phase shifts of the RIS elements in the RIS-assisted scenario [13, 8]. Furthermore, RIS-assisted massive MIMO involves signal reception through multiple channels (e.g., BS-RIS, RIS-user and BS-user), which makes the channel estimation task more challenging and interesting. As a result, several channel estimation schemes have been proposed for massive MIMO and RIS-assisted scenarios, based on compressed sensing [13], angle-domain processing [14] and coordinated pilot assignment [15]. The performance of these analytical approaches strongly depends on an uncorrupted antenna array output to achieve reliable channel estimation accuracy. In order to provide robustness against imperfections/corruptions in the array data, data-driven techniques such as machine learning (ML) have been proposed; they uncover the non-linear relationships in the data/signals with lower computational complexity, achieve better parameter inference performance, and tolerate imperfections in the data. As listed below, ML is more efficient than model-based techniques that largely rely on mathematical models:

  • A learning model constructs a non-linear mapping between the raw input data and the desired output to approximate a problem from a model-free perspective. Thus, its prediction performance is robust against the corruptions/imperfections in the wireless channel data.

  • ML learns the feature patterns, which can easily be updated for new data and adapted to environmental changes. In the long run, this results in a lower computational complexity than model-based optimization.

  • ML-based solutions have significantly reduced run-times because of parallel processing capabilities. On the other hand, it is not straightforward to achieve parallel implementations of conventional optimization and signal processing algorithms.

Figure 1: Model training and testing stages for (a) CL and (b) FL. During training, CL involves the transmission of the datasets $\mathcal{D}_{k\in\mathcal{K}}$ from the users to the server, whereas in FL the users send only the model updates $\mathbf{g}_{k\in\mathcal{K}}(\boldsymbol{\theta}_t)$. In the test stage of both CL and FL, the server broadcasts the trained learning model $\boldsymbol{\theta}_{\mathrm{Trained}}$ to the users.

In massive MIMO and RIS-assisted systems, ML-based methods have been shown to provide higher spectral efficiency and lower computational complexity for problems such as channel estimation [16, 17, 18], hybrid beamforming [19, 20, 21] and angle-of-arrival (AoA) estimation [22, 23].

In the ML context, the channel estimation problem is solved by training a model, e.g., a neural network (NN), on the local datasets collected by the users [16, 17, 18]. The trained model provides a non-linear mapping between the input data, usually chosen as the received pilot signals, and the output data, i.e., the channel data. Previous works mostly consider centralized learning (CL) schemes where the whole dataset, i.e., the input-output data pairs, is transmitted to the BS (via the RIS in the RIS-assisted scenario) for model training, as illustrated in Fig. 1a. Once the model is trained at the BS, the model parameters are sent to the users, which can then perform the channel estimation task by feeding the model with the received pilot data. However, this approach involves a huge communication overhead, i.e., transmitting the whole dataset from the users to the BS. For example, in LTE (long term evolution), a single frame of 5 MHz bandwidth and 10 ms duration can carry only 6,000 complex symbols [24], whereas the size of the whole dataset can be on the order of hundreds of thousands of symbols [21, 20, 16, 17]. As a result, CL-based techniques demand huge bandwidth.

In order to deal with the high communication overhead of CL schemes, federated learning (FL) schemes have recently been proposed [25, 26]. In FL, instead of sending the whole dataset, only the model updates, i.e., the gradients of the model parameters, are transmitted, as illustrated in Fig. 1b. As a result, the communication overhead is reduced. In the literature, FL has been considered for scheduling and power allocation in wireless sensor networks [27], trajectory planning in UAV (unmanned aerial vehicle) networks [28], task fetching and offloading in vehicular networks [29, 26], image classification [24, 30], and massive MIMO hybrid beamforming design [31]. All of these studies accommodate multiple edge devices exchanging model updates with a parameter server to train a global model. In the aforementioned works, FL has mostly been used for image classification/object detection problems in different networking schemes under the assumption that perfect CSI is available. Motivated by the fact that the acquisition of CSI is critical in massive MIMO systems and that FL has not been considered directly for the channel estimation problem, in this work we leverage FL for channel estimation, which has previously been studied in the context of CL-based training [16, 17, 18, 32]. Compared to CL, FL is more applicable to distributed devices, such as mobile phones. Furthermore, training the same model with FL, rather than CL, significantly reduces the communication overhead during training while maintaining satisfactory channel estimation performance close to that of CL. To the best of our knowledge, this is the first work on the use of FL for channel estimation.

In this paper, we propose an FL-based model training approach for the channel estimation problem in both conventional and RIS-assisted massive MIMO systems. We design a convolutional neural network (CNN), which is located at the BS and trained on the local datasets. For these datasets, where the input is the received pilot signal and the output is the channel matrix, a CNN is more convenient than recurrent NNs (RNNs), which are designed to predict the future CSI from previous channels based on sequential data [33]. The proposed approach has three stages, namely, data collection, training and prediction. In the first stage, each user collects its training dataset and stores it for model training, a step that is not explicitly discussed in the previous ML-based works [16, 17, 18, 34]. In the second stage, each user uses its own local dataset to compute the model updates and sends them to the BS (in the RIS-assisted scenario, the model parameters computed at the users are transmitted to the BS via the RIS), where the model updates are aggregated to train a global model. The main advantage of the proposed FL approach is the reduction in communication overhead. This overhead is proportional to the dimensionality of the channel matrix, which can be higher in RIS-assisted systems than in conventional MIMO due to the large number of RIS elements. Apart from that, the proposed approach reduces the computation time and increases robustness against data corruptions. One of the main challenges in FL-based channel estimation is the non-i.i.d. (not independent and identically distributed) structure of the training data. FL is known to converge faster if the local datasets are i.i.d. [35]. Since the channel estimation dataset is non-i.i.d. because of the distribution of the user locations, FL is expected to converge more slowly. In order to improve the performance in the non-i.i.d. scenario, using deeper and wider learning models helps to provide better feature extraction and representation [31]. Thus, we perform a hyper-parameter optimization to achieve satisfactory performance.

The main contributions of this paper can be summarized as follows:

  1. We propose an FL-based channel estimation approach for both conventional and RIS-assisted massive MIMO systems. Different from conventional centralized model learning techniques, the proposed FL framework provides decentralized learning, which significantly reduces the communication overhead compared to CL-based techniques while maintaining satisfactory channel estimation performance close to that of CL.

  2. In order to estimate both the direct (BS-user) and cascaded (BS-RIS-user) channels in the RIS-assisted scenario, the input and output data of each communication link are combined; hence, a single CNN architecture is designed instead of using different NNs for each task.

  3. We prove the convergence of FL and demonstrate its superior performance over CL in terms of communication overhead and channel estimation accuracy via extensive numerical simulations for different numbers of users, while considering the quantization and corruption of the gradient and model data as well as the loss of a portion of the model data during transmission.

Throughout the paper, the identity matrix of size $N\times N$ is denoted by $\mathbf{I}_N$. $(\cdot)^{\mathsf{T}}$ and $(\cdot)^{\mathsf{H}}$ denote the transpose and conjugate transpose operations, respectively. For a matrix $\mathbf{A}$ and a vector $\mathbf{a}$, $[\mathbf{A}]_{i,j}$ and $[\mathbf{a}]_i$ denote the $(i,j)$th element of the matrix $\mathbf{A}$ and the $i$th element of the vector $\mathbf{a}$, respectively. $\mathbb{E}\{\cdot\}$ denotes the statistical expectation of its argument and $\angle\{\cdot\}$ returns the angle of a complex quantity. $\|\mathbf{A}\|_{\mathcal{F}}$ and $\|\mathbf{a}\|_2$ denote the Frobenius norm and the $l_2$-norm, respectively. $\otimes$ is the Hadamard (element-wise) multiplication and $\nabla_{\mathbf{a}}$ represents the gradient with respect to $\mathbf{a}$. A convolutional layer with $N$ $D\times D$ 2-D kernels is denoted by $N$@$D\times D$.

II System Model

We consider a multi-user MIMO-OFDM (orthogonal frequency division multiplexing) system with $M$ subcarriers, where the BS has $N_{\mathrm{BS}}$ antennas to communicate with $K$ users, each of which has $N_{\mathrm{MS}}$ antennas. In the downlink, the BS first precodes $K$ data symbols $\mathbf{s}[m]=[s_1[m],s_2[m],\dots,s_K[m]]^{\mathsf{T}}\in\mathbb{C}^K$ at each subcarrier ($m\in\mathcal{M}=\{1,\dots,M\}$) by applying the subcarrier-dependent baseband precoders $\mathbf{F}_{\mathrm{BB}}[m]=[\mathbf{f}_{\mathrm{BB}_1}[m],\mathbf{f}_{\mathrm{BB}_2}[m],\dots,\mathbf{f}_{\mathrm{BB}_K}[m]]\in\mathbb{C}^{K\times K}$. Then, the signal is transformed to the time domain via an $M$-point inverse discrete Fourier transform (IDFT). After adding the cyclic prefix (CP), the BS employs the subcarrier-independent analog precoder $\mathbf{F}_{\mathrm{RF}}\in\mathbb{C}^{N_{\mathrm{BS}}\times K}$ to form the transmitted signal. Since $\mathbf{F}_{\mathrm{RF}}$ consists of analog phase shifters, the RF precoder has unit-modulus entries, i.e., $|[\mathbf{F}_{\mathrm{RF}}]_{i,j}|^2=1$. Additionally, the power constraint $\sum_{m=1}^{M}\|\mathbf{F}_{\mathrm{RF}}\mathbf{F}_{\mathrm{BB}}[m]\|_{\mathcal{F}}^2=MK$ is enforced by normalizing the baseband precoders $\{\mathbf{F}_{\mathrm{BB}}[m]\}_{m\in\mathcal{M}}$. Thus, the transmitted signal becomes $\mathbf{x}[m]=\mathbf{F}_{\mathrm{RF}}\sum_{k=1}^{K}\mathbf{f}_{\mathrm{BB}_k}[m]s_k[m]$.

II-A Channel Model

Before reception at the users, the transmitted signal passes through the mm-Wave channel, which can be represented by a geometric model with limited scattering [11]. Let $\mathbf{H}_k[m]$ denote the $N_{\mathrm{MS}}\times N_{\mathrm{BS}}$ mm-Wave channel matrix between the BS and the $k$th user. Then, $\mathbf{H}_k[m]$ includes the contributions of $L$ paths, where the $l$th path of the $k$th user has time delay $\tau_{k,l}$, relative AoA $\bar{\phi}_{k,l}\in\Theta$ ($\Theta=[-\frac{\pi}{2},\frac{\pi}{2}]$), angle-of-departure (AoD) $\phi_{k,l}\in\Theta$, and complex path gain $\alpha_{k,l}$. Let $p(\tau)$ denote a pulse-shaping function for $T_{\mathrm{s}}$-spaced signaling evaluated at $\tau$ seconds. Then, the delay-$d$ mm-Wave MIMO channel matrix in the time domain is given by

$\bar{\mathbf{H}}_k[d] = \sqrt{\frac{N_{\mathrm{BS}}N_{\mathrm{MS}}}{L}}\sum_{l=1}^{L}\alpha_{k,l}\,p(dT_{\mathrm{s}}-\tau_{k,l})\,\mathbf{a}_{\mathrm{MS}}(\bar{\phi}_{k,l})\mathbf{a}_{\mathrm{BS}}^{\mathsf{H}}(\phi_{k,l}), \qquad (1)$

where $\mathbf{a}_{\mathrm{MS}}(\bar{\phi}_{k,l})$ and $\mathbf{a}_{\mathrm{BS}}(\phi_{k,l})$ are the $N_{\mathrm{MS}}\times 1$ and $N_{\mathrm{BS}}\times 1$ steering vectors representing the array responses of the antenna arrays at the users and the BS, respectively. Let $\lambda_m=\frac{c_0}{f_m}$ be the wavelength of subcarrier $m$ at frequency $f_m$. Since the operating frequency is much higher than the bandwidth in mm-Wave systems and the subcarrier frequencies are close to each other (i.e., $f_{m_1}\approx f_{m_2}$ for $m_1,m_2\in\mathcal{M}$), we use a single operating wavelength $\lambda=\lambda_1=\dots=\lambda_M=\frac{c_0}{f_c}$, where $c_0$ is the speed of light and $f_c$ is the central carrier frequency [11, 12]. This approximation also allows a single frequency-independent analog beamformer to be used for all subcarriers. Then, for a uniform linear array (ULA), the array response of the antenna array at the BS is

$\mathbf{a}_{\mathrm{BS}}(\phi)=\big[1, e^{j\frac{2\pi}{\lambda}d_{\mathrm{BS}}\sin(\phi)},\dots,e^{j\frac{2\pi}{\lambda}(N_{\mathrm{BS}}-1)d_{\mathrm{BS}}\sin(\phi)}\big]^{\mathsf{T}}, \qquad (2)$

where $d_{\mathrm{BS}}=\lambda/2$ is the antenna spacing. The $n$th element of $\mathbf{a}_{\mathrm{MS}}(\bar{\phi})$ is defined similarly to $\mathbf{a}_{\mathrm{BS}}(\phi)$ as $[\mathbf{a}_{\mathrm{MS}}(\bar{\phi})]_n=e^{j\pi(n-1)\sin(\bar{\phi})}$, $n=1,\dots,N_{\mathrm{MS}}$. After taking the $M$-point DFT of the delay-$d$ channel model in (1), the channel matrix of the $k$th user at subcarrier $m$ becomes

$\mathbf{H}_k[m]=\sum_{d=0}^{D-1}\bar{\mathbf{H}}_k[d]e^{-j\frac{2\pi m}{M}d}, \qquad (3)$

where $D\leq M$ is the CP length. The frequency-domain channel in (3) is used in MIMO-OFDM systems, where the orthogonality of the subcarriers holds, i.e., $\|\mathbf{H}_k^{\mathsf{H}}[m_1]\mathbf{H}_k[m_2]\|_{\mathcal{F}}^2=0$ for $m_1,m_2\in\mathcal{M}$ and $m_1\neq m_2$.
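For concreteness, the following minimal NumPy sketch generates one channel realization according to (1)-(3); the parameter values and the sinc pulse shape are illustrative assumptions made here, not specifications from the paper.

```python
import numpy as np

def ula_steering(n_ant, phi):
    """ULA steering vector with lambda/2 spacing, cf. (2)."""
    return np.exp(1j * np.pi * np.arange(n_ant) * np.sin(phi))

def mmwave_channel(n_bs=32, n_ms=4, L=3, M=16, D=4, Ts=1.0):
    """Return H[m] for m = 0..M-1 via the delay-d model (1) and the M-point DFT (3)."""
    alpha = (np.random.randn(L) + 1j * np.random.randn(L)) / np.sqrt(2)  # path gains
    tau = np.random.uniform(0, D * Ts, L)                # path delays
    aoa = np.random.uniform(-np.pi / 2, np.pi / 2, L)    # AoA at the user
    aod = np.random.uniform(-np.pi / 2, np.pi / 2, L)    # AoD at the BS
    p = lambda t: np.sinc(t / Ts)                        # assumed pulse-shaping function

    H_time = np.zeros((D, n_ms, n_bs), dtype=complex)    # delay-d channel taps, (1)
    for d in range(D):
        for l in range(L):
            H_time[d] += (alpha[l] * p(d * Ts - tau[l])
                          * np.outer(ula_steering(n_ms, aoa[l]),
                                     ula_steering(n_bs, aod[l]).conj()))
    H_time *= np.sqrt(n_bs * n_ms / L)

    # Frequency-domain channel at each subcarrier, (3)
    return np.stack([sum(H_time[d] * np.exp(-2j * np.pi * m * d / M) for d in range(D))
                     for m in range(M)])
```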

With the aforementioned block-fading channel model [11], the received signal at the $k$th user at subcarrier $m$ before analog processing is $\tilde{\mathbf{y}}_k[m]=\sqrt{\rho}\,\mathbf{H}_k[m]\mathbf{x}[m]$, i.e.,

$\tilde{\mathbf{y}}_k[m]=\sqrt{\rho}\,\mathbf{H}_k[m]\mathbf{F}_{\mathrm{RF}}\mathbf{F}_{\mathrm{BB}}[m]\mathbf{s}[m]+\mathbf{n}[m], \qquad (4)$

where $\rho$ represents the average received power and $\mathbf{n}[m]\sim\mathcal{CN}(0,\sigma^2\mathbf{I}_{N_{\mathrm{MS}}})$ is the additive white Gaussian noise (AWGN) vector. At the $k$th user, the received signal is first processed by the analog combiner $\mathbf{w}_{\mathrm{RF},k}\in\mathbb{C}^{N_{\mathrm{MS}}}$. Then, the cyclic prefix is removed from the processed signal and an $M$-point DFT is applied to obtain the signal in the frequency domain. The received baseband signal then becomes

$\bar{y}_k[m]=\sqrt{\rho}\,\mathbf{w}_{\mathrm{RF},k}^{\mathsf{H}}\mathbf{H}_k[m]\mathbf{F}_{\mathrm{RF}}\mathbf{F}_{\mathrm{BB}}[m]\mathbf{s}[m]+\mathbf{w}_{\mathrm{RF},k}^{\mathsf{H}}\mathbf{n}[m], \qquad (5)$

where the analog combiner $\mathbf{w}_{\mathrm{RF},k}$ satisfies the constraint $[\mathbf{w}_{\mathrm{RF},k}\mathbf{w}_{\mathrm{RF},k}^{\mathsf{H}}]_{i,i}=1$, similar to the RF precoder. Once the received symbols $\bar{y}_k[m]$ are obtained at the $k$th user, they are demodulated according to the respective modulation scheme, and the information bits are recovered for each subcarrier. To accurately recover the data streams $\mathbf{s}[m]$ in (5), the channel matrix $\mathbf{H}_k[m]$ must be estimated. This is usually done by using pilot signals in the preamble stage [36, 16], wherein the beamformers $\mathbf{F}_{\mathrm{RF}}$, $\mathbf{F}_{\mathrm{BB}}$ and $\mathbf{w}_{\mathrm{RF},k}$ are designed accordingly (see Section III-C).

II-B Problem Description

The aim of this work is to estimate the channel matrix $\mathbf{H}_k[m]$ via FL, as illustrated in Fig. 1b. To this end, the global NN for channel estimation (henceforth called ChannelNet), located at the BS, is trained on the local datasets of the users. Let $\mathcal{D}_k$ denote the local dataset of the $k$th user, containing the input-output pairs $\mathcal{D}_k^{(i)}=(\mathcal{X}_k^{(i)},\mathcal{Y}_k^{(i)})$ for $i=1,\dots,\mathsf{D}_k$, where $\mathsf{D}_k=|\mathcal{D}_k|$ is the size of the local dataset $\mathcal{D}_k$ (the sizes of $\mathcal{X}_k^{(i)}$ and $\mathcal{Y}_k^{(i)}$ depend on the size of the channel matrix and are given explicitly in Sec. III-C and Sec. III-D for the conventional and RIS-assisted massive MIMO scenarios, respectively). Here, $\mathcal{X}_k^{(i)}$ represents the $i$th input data, i.e., the received pilot signals, and $\mathcal{Y}_k^{(i)}$ denotes the $i$th output/label data, i.e., the channel matrix, for $k\in\mathcal{K}=\{1,\dots,K\}$. Thus, for an input-output pair $(\mathcal{X},\mathcal{Y})$, ChannelNet constructs a non-linear relationship between the input and output data as $f(\mathcal{X}|\boldsymbol{\theta})=\mathcal{Y}$, where $\boldsymbol{\theta}\in\mathbb{R}^P$ denotes the learnable parameters.

Figure 2: (a) Training data collection and (b) channel estimation with the trained model.

III Federated Learning for Channel Estimation

In this section, we present the proposed FL-based channel estimation scheme, which comprises three stages: training data collection, model training and prediction. First, we present the training data collection stage, in which each user collects its own training dataset from the received pilot signals. After presenting the FL-based model training scheme, we discuss how the input and output label data are determined for the massive MIMO and RIS-assisted scenarios. Once the learning model is trained, it can be used for channel estimation in the prediction stage.

III-A Training Data Collection

In Fig. 2, we present the communication interval at the user for two consecutive data transmission blocks. At the beginning of each transmission block, the received pilot signals are acquired and processed for channel estimation. This can be done by employing one of the analytical channel estimation techniques, based for instance on compressed sensing [37, 38], angle-domain processing [14] or coordinated pilot assignment [15]. The analytical approach is used only in the training data collection stage, which is relatively short compared to the prediction stage [32]. Hence, the use of ML/FL in the prediction stage becomes more advantageous than the analytical techniques in the long term.

It is also worth mentioning that the training data can be obtained from offline datasets prepared by collecting data from field measurements. In [39], the authors present a channel estimation dataset obtained with electromagnetic simulation tools. While this approach can also be followed, offline collected data may not always reflect the channel characteristics and the imperfections of the mm-Wave channel. In this work, we evaluate the performance of the proposed approach on datasets whose labels are selected as both the true and the estimated channel data. For the estimated channel, we assume that the training data are collected, as described in Fig. 2, by employing the angle-domain channel estimation (ADCE) technique [14], which performs close to the minimum mean-squared-error (MMSE) estimator.

After channel estimation, the training data can be collected by storing the received pilot data $\mathbf{G}_k[m]$ and the estimated channel data $\hat{\mathbf{H}}_k[m]$ in the internal memory of the user (we discuss how $\mathbf{G}_k[m]$ is obtained in Sec. III-C). Then, the user feeds back the estimated channel data to the BS via uplink transmission. As a result, the local dataset $\mathcal{D}_k$ is collected at the $k$th user after $i=1,\dots,\mathsf{D}_k$ transmission blocks, as sketched below. This approach allows us to collect training data across different channel coherence times, which can be very short due to the dynamic nature of the mm-Wave channel, e.g., in indoor and vehicular communications [10].
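The following high-level Python sketch summarizes this collection loop at one user; `receive_and_process_pilots` and `adce_estimate` are hypothetical placeholders standing in for the pilot processing of Sec. III-C and the ADCE estimator [14], not functions defined in the paper.

```python
# Local dataset collection at the k-th user (Fig. 2a): one (input, label) pair
# is stored per transmission block and kept locally, never sent to the BS.
def collect_local_dataset(num_blocks, receive_and_process_pilots, adce_estimate):
    dataset = []
    for i in range(num_blocks):
        G = receive_and_process_pilots(i)   # input: processed pilot block G_k[m], (23)
        H_hat = adce_estimate(G)            # label: channel estimated by ADCE [14]
        dataset.append((G, H_hat))
    return dataset
```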

The above process constitutes the first stage of the proposed FL-based channel estimation framework. Once the training data are collected, the global model is trained (see, e.g., Fig. 1b). After training, each user can estimate its own channel via the trained NN by simply feeding it with $\mathbf{G}_k[m]$ and obtaining $\hat{\mathbf{H}}_k[m]$, as illustrated in Fig. 2b.

III-B FL-based Model Training

We begin by introducing the training concept in conventional CL, then develop FL-based model training.

In CL-based model training for channel estimation [16, 17, 32, 18, 34], the global NN is trained by collecting the local datasets $\{\mathcal{D}_k\}_{k\in\mathcal{K}}$ from the users, as illustrated in Fig. 1a. Once the BS has collected the whole dataset $\mathcal{D}$, training is performed by solving the following problem

$\operatorname*{minimize}_{\boldsymbol{\theta}} \;\; \mathcal{L}(\boldsymbol{\theta}) \quad \mathrm{subject\ to:}\;\; f(\mathcal{X}^{(i)}|\boldsymbol{\theta})=\mathcal{Y}^{(i)},\; i=1,\dots,\mathsf{D}, \qquad (6)$

where $\mathsf{D}=|\mathcal{D}|$ is the number of training samples and $\mathcal{L}(\boldsymbol{\theta})$ denotes the loss function defined as

$\mathcal{L}(\boldsymbol{\theta})=\frac{1}{\mathsf{D}}\sum_{i=1}^{\mathsf{D}}\|f(\mathcal{X}^{(i)}|\boldsymbol{\theta})-\mathcal{Y}^{(i)}\|_{\mathcal{F}}^2, \qquad (7)$

which is the MSE between the label data $\mathcal{Y}^{(i)}$ and the prediction of the NN, $f(\mathcal{X}^{(i)}|\boldsymbol{\theta})$.

On the other hand, in FL, the local datasets $\{\mathcal{D}_k\}_{k\in\mathcal{K}}$ are kept at the users and are not transmitted to the BS. Hence, FL-based model training is performed at the user side as

$\operatorname*{minimize}_{\boldsymbol{\theta}} \;\; \bar{\mathcal{L}}(\boldsymbol{\theta})=\frac{1}{K}\sum_{k=1}^{K}\mathcal{L}_k(\boldsymbol{\theta}) \quad \mathrm{subject\ to:}\;\; f(\mathcal{X}_k^{(i)}|\boldsymbol{\theta})=\mathcal{Y}_k^{(i)},\; i=1,\dots,\mathsf{D}_k,\; k\in\mathcal{K}, \qquad (8)$

where $\mathcal{L}_k(\boldsymbol{\theta})=\frac{1}{\mathsf{D}_k}\sum_{i=1}^{\mathsf{D}_k}\|f(\mathcal{X}_k^{(i)}|\boldsymbol{\theta})-\mathcal{Y}_k^{(i)}\|_{\mathcal{F}}^2$. Notice that the FL-based training problem in (8) is solved at the users, while the CL problem in (6) is handled at the BS. To solve (6) and (8) efficiently, gradient descent (GD) is employed and the problems are solved iteratively. In CL, the gradient is computed over the whole dataset as $\mathbf{g}(\boldsymbol{\theta}_t)=\nabla\mathcal{L}(\boldsymbol{\theta}_t)$ and the parameter update is performed as

$\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_t-\eta\,\mathbf{g}(\boldsymbol{\theta}_t), \qquad (9)$

where $\eta$ is the learning rate.

In FL, each user computes its gradient individually as $\mathbf{g}_k(\boldsymbol{\theta}_t)=\nabla\mathcal{L}_k(\boldsymbol{\theta}_t)$ to solve (8), then sends it to the BS, where the model parameters are updated as

$\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_t-\eta\frac{1}{K}\sum_{k=1}^{K}\mathbf{g}_k(\boldsymbol{\theta}_t). \qquad (10)$

Transmitting the gradients to the BS is more energy-efficient than directly transmitting the model parameters as in the FedAvg algorithm [35]. The main reason is that the gradients include only the model updates obtained from the GD algorithm, whereas model transmission includes data already known from the previous iteration. Hence, model transmission wastes a significant amount of transmit power at all the users [31, 30, 40].
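A simplified PyTorch sketch of one such FL communication round, corresponding to (10), is given below; noise, quantization and scheduling effects are omitted, so it is only an idealized illustration of the gradient aggregation step.

```python
import torch

def local_gradient(model, loss_fn, X_k, Y_k):
    """Gradient g_k(theta_t) of the local loss L_k evaluated at the current model."""
    model.zero_grad()
    loss_fn(model(X_k), Y_k).backward()
    return [p.grad.detach().clone() for p in model.parameters()]

def fl_round(model, loss_fn, local_datasets, lr=1e-3):
    """One FL round as in (10): the BS averages the user gradients and updates theta."""
    grads = [local_gradient(model, loss_fn, X_k, Y_k) for X_k, Y_k in local_datasets]
    K = len(local_datasets)
    with torch.no_grad():
        for i, p in enumerate(model.parameters()):
            p -= lr * sum(g[i] for g in grads) / K
```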

The gradients $\mathbf{g}_{k\in\mathcal{K}}(\boldsymbol{\theta}_t)$ are sent to the BS over the wireless channel, which corrupts them during transmission. Therefore, the corrupted model parameters and gradients at the $t$th iteration are given as [24, 41]

$\tilde{\boldsymbol{\theta}}_t=\boldsymbol{\theta}_t+\Delta\boldsymbol{\theta}_t, \qquad (11)$
$\mathbf{g}_k(\tilde{\boldsymbol{\theta}}_t)=\mathbf{g}_k(\boldsymbol{\theta}_t)+\Delta\mathbf{g}_k(\boldsymbol{\theta}_t), \qquad (12)$
$\tilde{\mathbf{g}}_k(\tilde{\boldsymbol{\theta}}_t)=\mathbf{g}_k(\tilde{\boldsymbol{\theta}}_t)+\Delta\mathbf{g}_k(\tilde{\boldsymbol{\theta}}_t), \qquad (13)$

where $\tilde{\boldsymbol{\theta}}_t$ represents the noisy model parameters received at the users, $\mathbf{g}_k(\tilde{\boldsymbol{\theta}}_t)$ is the gradient vector computed at the user based on $\tilde{\boldsymbol{\theta}}_t$, and $\tilde{\mathbf{g}}_k(\tilde{\boldsymbol{\theta}}_t)$ denotes the noisy gradient vector received at the BS. $\Delta\boldsymbol{\theta}_t$, $\Delta\mathbf{g}_k(\boldsymbol{\theta}_t)$ and $\Delta\mathbf{g}_k(\tilde{\boldsymbol{\theta}}_t)$ represent the noise terms added to $\boldsymbol{\theta}_t$, $\mathbf{g}_k(\boldsymbol{\theta}_t)$ and $\mathbf{g}_k(\tilde{\boldsymbol{\theta}}_t)$, respectively. Then, the model update rule is given by

$\tilde{\boldsymbol{\theta}}_{t+1}=\tilde{\boldsymbol{\theta}}_t-\eta\frac{1}{K}\sum_{k=1}^{K}\tilde{\mathbf{g}}_k(\tilde{\boldsymbol{\theta}}_t), \qquad (14)$

which can be rewritten as

$\tilde{\boldsymbol{\theta}}_{t+1}=\big[\boldsymbol{\theta}_t+\Delta\boldsymbol{\theta}_t\big]-\eta\sum_{k=1}^{K}\frac{\big[\mathbf{g}_k(\boldsymbol{\theta}_t)+\Delta\mathbf{g}_k(\boldsymbol{\theta}_t)+\Delta\mathbf{g}_k(\tilde{\boldsymbol{\theta}}_t)\big]}{K} = \underbrace{\boldsymbol{\theta}_t-\eta\sum_{k=1}^{K}\frac{\mathbf{g}_k(\boldsymbol{\theta}_t)}{K}}_{\boldsymbol{\theta}_{t+1}}+\underbrace{\Delta\boldsymbol{\theta}_t-\eta\sum_{k=1}^{K}\frac{\big[\Delta\mathbf{g}_k(\boldsymbol{\theta}_t)+\Delta\mathbf{g}_k(\tilde{\boldsymbol{\theta}}_t)\big]}{K}}_{\Delta} = \boldsymbol{\theta}_{t+1}+\Delta, \qquad (15)$

where $\Delta$ corresponds to the overall noise term added to $\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_t-\eta\frac{1}{K}\sum_{k=1}^{K}\mathbf{g}_k(\boldsymbol{\theta}_t)$. Now, let us consider the statistics of $\Delta$. Without loss of generality, the noise terms due to wireless transmission in (11) and (13), i.e., $\Delta\boldsymbol{\theta}_t$ and $\Delta\mathbf{g}_k(\tilde{\boldsymbol{\theta}}_t)$, can be modeled as AWGN with variances $\sigma_{\boldsymbol{\theta}}^2$ and $\tilde{\sigma}_k^2$, respectively [42, 41]. Furthermore, we model $\Delta\mathbf{g}_k(\boldsymbol{\theta}_t)$ in (12) as AWGN with variance $\sigma_k^2$ due to the linearity of the gradient and of the NN layers (many NN layers, such as convolutional, fully connected, normalization and dropout layers, perform linear operations, whereas pooling and ReLU layers are non-linear [43, 41, 42]). Hence, the overall noise term $\Delta$ can be viewed as AWGN with variance $\sigma_\Delta^2=\sigma_{\boldsymbol{\theta}}^2+\eta\frac{\sum_{k=1}^{K}(\tilde{\sigma}_k^2+\sigma_k^2)}{K}$.

In order to solve (8) effectively in the presence of noisy model parameters, we define a regularized loss function $\tilde{\mathcal{L}}_k(\boldsymbol{\theta})$ as

$\tilde{\mathcal{L}}_k(\boldsymbol{\theta})=\mathcal{L}_k(\boldsymbol{\theta})+\sigma_\Delta^2\|\mathbf{g}_k(\boldsymbol{\theta})\|^2, \qquad (16)$

which is widely used in stochastic optimization [44]. (16) can be obtained via a first-order Taylor expansion of the expectation-based loss $\mathbb{E}\{\|\mathcal{L}_k(\boldsymbol{\theta}+\Delta)\|^2\}$, which can be approximately written as

$\mathbb{E}\{\|\mathcal{L}_k(\boldsymbol{\theta}+\Delta)\|^2\}\approx\mathbb{E}\{\|\mathcal{L}_k(\boldsymbol{\theta})+\Delta\nabla\mathcal{L}_k(\boldsymbol{\theta})\|^2\} \approx\mathbb{E}\{\|\mathcal{L}_k(\boldsymbol{\theta})\|^2\}+\mathbb{E}\{\|\Delta\|^2\}\,\mathbb{E}\{\|\nabla\mathcal{L}_k(\boldsymbol{\theta})\|^2\} \approx\mathbb{E}\{\|\mathcal{L}_k(\boldsymbol{\theta})\|^2\}+\sigma_\Delta^2\|\mathbf{g}_k(\boldsymbol{\theta})\|^2, \qquad (17)$

where the first term corresponds to the minimization of the loss function with perfect estimation and the second term is the additional cost due to the noise [44, 41]. Using (16), the regularized version of the FL-based training problem in (8) is given by

$\operatorname*{minimize}_{\boldsymbol{\theta}} \;\; \bar{\mathcal{L}}(\boldsymbol{\theta})=\frac{1}{K}\sum_{k=1}^{K}\tilde{\mathcal{L}}_k(\boldsymbol{\theta}) \quad \mathrm{subject\ to:}\;\; f(\mathcal{X}_k^{(i)}|\boldsymbol{\theta})=\mathcal{Y}_k^{(i)},\; i=1,\dots,\mathsf{D}_k,\; k\in\mathcal{K}, \qquad (18)$

which can be effectively solved via GD in the presence of noisy model updates as

$\tilde{\boldsymbol{\theta}}_{t+1}=\tilde{\boldsymbol{\theta}}_t-\eta\nabla\bar{\mathcal{L}}(\tilde{\boldsymbol{\theta}}), \qquad (19)$

where $\nabla\bar{\mathcal{L}}(\tilde{\boldsymbol{\theta}})=\frac{1}{K}\sum_{k=1}^{K}\bar{\mathbf{g}}_k(\tilde{\boldsymbol{\theta}}_t)$ and $\bar{\mathbf{g}}_k(\tilde{\boldsymbol{\theta}}_t)=\nabla\tilde{\mathcal{L}}_k(\tilde{\boldsymbol{\theta}}_t)=\nabla\big[\mathcal{L}_k(\tilde{\boldsymbol{\theta}}_t)+\sigma_\Delta^2\|\mathbf{g}_k(\tilde{\boldsymbol{\theta}}_t)\|^2\big]$.

Due to the noisy gradient transmission, $\bar{\mathcal{L}}(\boldsymbol{\theta})$ converges more slowly than $\mathcal{L}(\boldsymbol{\theta})$. In the following theorem, we prove the convergence of $\bar{\mathcal{L}}(\boldsymbol{\theta})$. While the convergence of the regularized loss function has been studied in other FL works [42, 41], those works consider model transmission, whereas here we investigate the gradient transmission approach. The convergence analysis also differs from previous gradient transmission-based works, e.g., [30, 24], which rely on a sparsity assumption on the gradient vector that may not always be satisfied.

Theorem 1: Let $\boldsymbol{\theta}_0$ and $\boldsymbol{\theta}_\star$ be the initial and optimal model parameters, respectively. Then, FL-based model training converges with rate $\mathcal{O}(1/t)$ as

$\bar{\mathcal{L}}(\boldsymbol{\theta}_t)-\bar{\mathcal{L}}(\boldsymbol{\theta}_\star)\leq\frac{\|\boldsymbol{\theta}_0-\boldsymbol{\theta}_\star\|^2}{2\eta t}, \qquad (20)$

with the learning rate $\eta\leq\frac{1}{(1+\sigma_\Delta^2)\beta}$ for some $\beta\geq 0$.

Proof: See Appendix A.∎

In practice, the convergence of the learning model is subject to wireless factors, such as the SNR of the transmitted/received model updates. In particular, convergence becomes slower due to packet errors during training [45]. Furthermore, the channel statistics change in each communication round, which entails CSI acquisition in every round. While some recent works assume that a single communication round between the server and the clients takes a single channel coherence time [31, 30, 24], in [46] FL-based training is completed within a single long coherence time, which is approximately composed of 40 small-scale fading channel coherence intervals.

III-C FL for Channel Estimation in Massive MIMO

Here, we discuss how the input and output of ChannelNet are determined for the massive MIMO scenario.

The input of ChannelNet is the set of received pilot signals in the preamble stage. Consider the downlink received signal model in (5) and assume that the BS activates only a single RF chain at a time. Let $\overline{\mathbf{f}}_u[m]\in\mathbb{C}^{N_{\mathrm{BS}}}$ be the resulting beamformer vector and let $\overline{s}_u[m]$ be the pilot signals, where $u=1,\dots,M_{\mathrm{BS}}$ and $m\in\mathcal{M}$. At the receiver side, each user activates its RF chain $M_{\mathrm{MS}}$ times and applies the beamformer vectors $\overline{\mathbf{w}}_v[m]$, $v=1,\dots,M_{\mathrm{MS}}$, to process the received pilots [16]. Hence, the total channel use in the channel acquisition process is $M_{\mathrm{BS}}\lceil\frac{M_{\mathrm{MS}}}{N_{\mathrm{RF}}}\rceil$. Therefore, the received pilot signal at the $k$th user becomes

$\overline{\mathbf{Y}}_k[m]=\overline{\mathbf{W}}^{\mathsf{H}}[m]\mathbf{H}_k[m]\overline{\mathbf{F}}[m]\overline{\mathbf{S}}[m]+\widetilde{\mathbf{N}}_k[m], \qquad (21)$

where $\overline{\mathbf{F}}[m]=[\overline{\mathbf{f}}_1[m],\dots,\overline{\mathbf{f}}_{M_{\mathrm{BS}}}[m]]$ and $\overline{\mathbf{W}}[m]=[\overline{\mathbf{w}}_1[m],\dots,\overline{\mathbf{w}}_{M_{\mathrm{MS}}}[m]]$ are the $N_{\mathrm{BS}}\times M_{\mathrm{BS}}$ and $N_{\mathrm{MS}}\times M_{\mathrm{MS}}$ beamformer matrices, respectively, $\overline{\mathbf{S}}[m]=\mathrm{diag}\{\overline{s}_1[m],\dots,\overline{s}_{M_{\mathrm{BS}}}[m]\}$ denotes the pilot signals, and $\widetilde{\mathbf{N}}_k[m]=\overline{\mathbf{W}}^{\mathsf{H}}\overline{\mathbf{N}}_k[m]$ is the effective noise matrix, where $\overline{\mathbf{N}}_k[m]\sim\mathcal{N}(0,\sigma^2\mathbf{I}_{M_{\mathrm{MS}}})$. Without loss of generality, we assume that $\overline{\mathbf{F}}[m]=\overline{\mathbf{F}}$, $\overline{\mathbf{W}}[m]=\overline{\mathbf{W}}$ and $\overline{\mathbf{S}}[m]=\mathbf{I}_{M_{\mathrm{BS}}}$ for all $m\in\mathcal{M}$. Then, the received signal in (21) becomes

$\overline{\mathbf{Y}}_k[m]=\overline{\mathbf{W}}^{\mathsf{H}}\mathbf{H}_k[m]\overline{\mathbf{F}}+\widetilde{\mathbf{N}}_k[m]. \qquad (22)$

Using $\overline{\mathbf{Y}}_k[m]$, we define the input of ChannelNet, $\mathbf{G}_k[m]$, as

$\mathbf{G}_k[m]=\mathbf{T}_{\mathrm{MS}}\overline{\mathbf{Y}}_k[m]\mathbf{T}_{\mathrm{BS}}, \qquad (23)$

where $\mathbf{T}_{\mathrm{MS}}=\overline{\mathbf{W}}$ if $M_{\mathrm{MS}}<N_{\mathrm{MS}}$ and $\mathbf{T}_{\mathrm{MS}}=(\overline{\mathbf{W}}\overline{\mathbf{W}}^{\mathsf{H}})^{-1}\overline{\mathbf{W}}$ if $M_{\mathrm{MS}}\geq N_{\mathrm{MS}}$, while $\mathbf{T}_{\mathrm{BS}}=\overline{\mathbf{F}}^{\mathsf{H}}$ if $M_{\mathrm{BS}}<N_{\mathrm{BS}}$ and $\mathbf{T}_{\mathrm{BS}}=\overline{\mathbf{F}}^{\mathsf{H}}(\overline{\mathbf{F}}\overline{\mathbf{F}}^{\mathsf{H}})^{-1}$ if $M_{\mathrm{BS}}\geq N_{\mathrm{BS}}$. Here, $\mathbf{T}_{\mathrm{BS}}$ and $\mathbf{T}_{\mathrm{MS}}$ remove the effect of the unitary matrices $\overline{\mathbf{F}}$ and $\overline{\mathbf{W}}$ in (22), respectively. Since ChannelNet accepts real-valued data, we construct the final form of the input $\mathcal{X}_k$ as a three-"channel" tensor. The first and second "channels" of $\mathcal{X}_k$ are the real and imaginary parts of $\mathbf{G}_k[m]$, i.e., $[\mathcal{X}_k]_1=\operatorname{Re}\{\mathbf{G}_k[m]\}$ and $[\mathcal{X}_k]_2=\operatorname{Im}\{\mathbf{G}_k[m]\}$, respectively. The third "channel" is $[\mathcal{X}_k]_3=\angle\{\mathbf{G}_k[m]\}$. We note that using a three-"channel" input (i.e., the real, imaginary and angle information of $\mathbf{G}_k[m]$) provides better feature representation [19, 18, 36]. As a result, the size of the input data is $N_{\mathrm{MS}}\times N_{\mathrm{BS}}\times 3$.

The output of ChannelNet is a $2N_{\mathrm{BS}}N_{\mathrm{MS}}\times 1$ real-valued vector given by

$\mathcal{Y}_k=\big[\mathrm{vec}\{\operatorname{Re}\{\mathbf{H}_k[m]\}\}^{\mathsf{T}},\mathrm{vec}\{\operatorname{Im}\{\mathbf{H}_k[m]\}\}^{\mathsf{T}}\big]^{\mathsf{T}}. \qquad (24)$

As a result, ChannelNet maps the received pilot signals $\mathbf{G}_k[m]$ to the channel matrix $\mathbf{H}_k[m]$.
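As a rough illustration of this construction, the NumPy sketch below simulates one received pilot block per (22) and forms a single training pair per (23)-(24). The choice of $\overline{\mathbf{W}}$ and $\overline{\mathbf{F}}$ (DFT columns) and the case $M_{\mathrm{MS}}\geq N_{\mathrm{MS}}$, $M_{\mathrm{BS}}\geq N_{\mathrm{BS}}$ are assumptions made here only to obtain a self-contained example.

```python
import numpy as np

def dft_beams(n_ant, n_beams):
    # Columns of a DFT matrix used as unit-modulus training beams (an assumption).
    return np.exp(2j * np.pi * np.outer(np.arange(n_ant), np.arange(n_beams)) / n_beams)

def training_pair(H, m_ms, m_bs, sigma=0.1):
    """Form (X_k, Y_k) from one channel realization H (N_MS x N_BS), cf. (22)-(24)."""
    n_ms, n_bs = H.shape
    W, F = dft_beams(n_ms, m_ms), dft_beams(n_bs, m_bs)
    N = sigma / np.sqrt(2) * (np.random.randn(m_ms, m_bs) + 1j * np.random.randn(m_ms, m_bs))
    Y = W.conj().T @ H @ F + N                              # received pilots, (22)
    T_ms = np.linalg.inv(W @ W.conj().T) @ W                # M_MS >= N_MS branch of T_MS
    T_bs = F.conj().T @ np.linalg.inv(F @ F.conj().T)       # M_BS >= N_BS branch of T_BS
    G = T_ms @ Y @ T_bs                                     # ChannelNet input matrix, (23)
    X = np.stack([G.real, G.imag, np.angle(G)], axis=-1)    # N_MS x N_BS x 3 input
    label = np.concatenate([H.real.flatten('F'), H.imag.flatten('F')])  # 2*N_BS*N_MS, (24)
    return X, label
```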

III-D FL for Channel Estimation in RIS-Assisted Massive MIMO

In this part, we examine the channel estimation problem in RIS-assisted massive MIMO, which is shown in Fig. 3. First, we present the received signal model including both the direct (BS-user) and cascaded (BS-RIS-user) channels. (Channel estimation is required to design the passive beamformer weights; although the BS-user, BS-RIS and RIS-user channels can be estimated separately [47], estimating the direct and cascaded channels is sufficient for beamformer design [48, 49].) Then, we show how the input-output pairs of ChannelNet are obtained for the RIS-assisted scenario.

Figure 3: RIS-assisted mm-Wave massive MIMO scenario.

We consider downlink channel estimation, where the BS has $N_{\mathrm{BS}}$ antennas to serve $K$ single-antenna users with the assistance of an RIS composed of $N_{\mathrm{RIS}}$ reflective elements, as shown in Fig. 3. The incoming signal from the BS is reflected by the RIS, where each RIS element introduces a phase shift $\varphi_n$, $n=1,\dots,N_{\mathrm{RIS}}$. This phase shift can be adjusted through PIN (positive-intrinsic-negative) diodes, which are controlled by the RIS controller connected to the BS over the backhaul link. As a result, the RIS allows the users to receive the signal transmitted from the BS when they are distant from the BS or when the direct link is blocked. Let $\overline{\mathbf{S}}_{\mathrm{RIS}}\in\mathbb{C}^{N_{\mathrm{BS}}\times M_{\mathrm{BS}}}$ ($N_{\mathrm{BS}}\leq M_{\mathrm{BS}}$) be the pilot signals transmitted from the BS; then the received signal at the $k$th user becomes

$\mathbf{y}_k=(\mathbf{h}_{\mathrm{B},k}^{\mathsf{H}}+\boldsymbol{\psi}^{\mathsf{H}}\mathbf{V}_k^{\mathsf{H}})\overline{\mathbf{S}}_{\mathrm{RIS}}+\mathbf{n}_k, \qquad (25)$

where $\mathbf{y}_k=[y_{1,k},\dots,y_{M_{\mathrm{BS}},k}]$ and $\mathbf{n}_k=[n_{1,k},\dots,n_{M_{\mathrm{BS}},k}]$ are $1\times M_{\mathrm{BS}}$ row vectors and $\mathbf{h}_{\mathrm{B},k}\in\mathbb{C}^{N_{\mathrm{BS}}}$ represents the channel of the communication link between the BS and the $k$th user. $\boldsymbol{\psi}=[\psi_1,\dots,\psi_{N_{\mathrm{RIS}}}]^{\mathsf{T}}\in\mathbb{C}^{N_{\mathrm{RIS}}}$ is the reflecting beamformer vector, whose $n$th entry is $\psi_n=a_ne^{j\varphi_n}$, where $a_n\in\{0,1\}$ denotes the on/off state of the $n$th RIS element and $\varphi_n\in[0,2\pi]$ is the phase shift introduced by the RIS. In practice, the RIS elements cannot be perfectly turned on/off; hence, they can be modeled as $a_n=1-\epsilon_1$ (ON) and $a_n=0+\epsilon_0$ (OFF) for small $\epsilon_1,\epsilon_0\geq 0$, which represent the insertion loss of the reflecting elements [18]. In (25), $\mathbf{V}_k\in\mathbb{C}^{N_{\mathrm{BS}}\times N_{\mathrm{RIS}}}$ denotes the cascaded channel of the BS-RIS-user link and it can be written in terms of the BS-RIS and RIS-user channels as

$\mathbf{V}_k=\mathbf{H}_{\mathrm{B}}\boldsymbol{\Lambda}_k, \qquad (26)$

where $\mathbf{H}_{\mathrm{B}}\in\mathbb{C}^{N_{\mathrm{BS}}\times N_{\mathrm{RIS}}}$ is the channel between the BS and the RIS, defined similarly to (1) as

$\mathbf{H}_{\mathrm{B}}=\sqrt{\frac{N_{\mathrm{BS}}N_{\mathrm{RIS}}}{L_{\mathrm{RIS}}}}\sum_{l=1}^{L_{\mathrm{RIS}}}\alpha_l^{\mathrm{RIS}}\mathbf{a}_{\mathrm{BS}}(\phi_l^{\mathrm{BS}})\mathbf{a}_{\mathrm{RIS}}(\phi_l^{\mathrm{RIS}})^{\mathsf{H}}, \qquad (27)$

where $L_{\mathrm{RIS}}$ and $\alpha_l^{\mathrm{RIS}}$ are the number of received paths and the complex gain, respectively. $\mathbf{a}_{\mathrm{BS}}(\phi_l^{\mathrm{BS}})\in\mathbb{C}^{N_{\mathrm{BS}}}$ and $\mathbf{a}_{\mathrm{RIS}}(\phi_l^{\mathrm{RIS}})\in\mathbb{C}^{N_{\mathrm{RIS}}}$ are the steering vectors corresponding to the BS and the RIS with the AoA and AoD angles $\phi_l^{\mathrm{BS}}$ and $\phi_l^{\mathrm{RIS}}$, respectively. In (26), $\boldsymbol{\Lambda}_k=\mathrm{diag}\{\mathbf{h}_{\mathrm{S},k}\}$, where $\mathbf{h}_{\mathrm{S},k}\in\mathbb{C}^{N_{\mathrm{RIS}}}$ represents the channel between the RIS and the $k$th user. $\mathbf{h}_{\mathrm{S},k}$ and $\mathbf{h}_{\mathrm{B},k}$ have a similar structure and are defined as

$\mathbf{h}_{\mathrm{B},k}=\sqrt{\frac{N_{\mathrm{BS}}}{L_{\mathrm{B}}}}\sum_{l=1}^{L_{\mathrm{B}}}\alpha_{k,l}^{\mathrm{B}}\mathbf{a}_{\mathrm{BS}}(\phi_{k,l}^{\mathrm{B}}), \qquad (28)$
$\mathbf{h}_{\mathrm{S},k}=\sqrt{\frac{N_{\mathrm{RIS}}}{L_{\mathrm{S}}}}\sum_{l=1}^{L_{\mathrm{S}}}\alpha_{k,l}^{\mathrm{S}}\mathbf{a}_{\mathrm{RIS}}(\phi_{k,l}^{\mathrm{S}}), \qquad (29)$

where $L_{\mathrm{B}}$, $\alpha_{k,l}^{\mathrm{B}}$ and $\mathbf{a}_{\mathrm{BS}}(\phi_{k,l}^{\mathrm{B}})$ ($L_{\mathrm{S}}$, $\alpha_{k,l}^{\mathrm{S}}$ and $\mathbf{a}_{\mathrm{RIS}}(\phi_{k,l}^{\mathrm{S}})$) are the number of paths, the complex gain and the steering vector for the BS-user (RIS-user) communication link, respectively.

In order to estimate the direct channel $\mathbf{h}_{\mathrm{B},k}$, we assume that all the RIS elements are turned off, i.e., $a_n=0$ for $n=1,\dots,N_{\mathrm{RIS}}$. Then, the $1\times M_{\mathrm{BS}}$ received signal at the $k$th user becomes

$\mathbf{y}_{\mathrm{B},k}=\mathbf{h}_{\mathrm{B},k}^{\mathsf{H}}\overline{\mathbf{S}}_{\mathrm{RIS}}+\mathbf{n}_{\mathrm{B},k}. \qquad (30)$

The direct BS-user channel $\mathbf{h}_{\mathrm{B},k}$ can then be estimated from the received pilot signal $\mathbf{y}_{\mathrm{B},k}$ via the LS and MMSE estimators as $\mathbf{h}_{\mathrm{B},k}^{\mathrm{LS}}=\big(\mathbf{y}_{\mathrm{B},k}\overline{\mathbf{S}}_{\mathrm{RIS}}^{\dagger}\big)^{\mathsf{H}}$ and $\mathbf{h}_{\mathrm{B},k}^{\mathrm{MMSE}}=\mathbf{h}_{\mathrm{B},k}^{\mathrm{LS}}\mathbf{R}_{\mathrm{B},k}\big(\mathbf{R}_{\mathrm{B},k}+\frac{1}{\sigma^2}\overline{\mathbf{S}}_{\mathrm{RIS}}\overline{\mathbf{S}}_{\mathrm{RIS}}^{\mathsf{H}}\big)^{-1}$, where $\mathbf{R}_{\mathrm{B},k}=\mathbb{E}\{\mathbf{h}_{\mathrm{B},k}\mathbf{h}_{\mathrm{B},k}^{\mathsf{H}}\}$ [16].
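A minimal NumPy sketch of the LS step is shown below (the MMSE estimator additionally needs the covariance $\mathbf{R}_{\mathrm{B},k}$ and the noise level, so it is omitted); the synthetic, noise-free self-check is purely illustrative.

```python
import numpy as np

def direct_channel_ls(y_bk, S_ris):
    """LS estimate h_LS = (y_B,k S_RIS^+)^H of the BS-user channel, cf. (30)."""
    return (y_bk @ np.linalg.pinv(S_ris)).conj().T

# Noise-free self-check: LS recovers h_B,k whenever S_RIS has full row rank
# (N_BS <= M_BS, as assumed in the text above).
n_bs, m_bs = 8, 16
S_ris = (np.random.randn(n_bs, m_bs) + 1j * np.random.randn(n_bs, m_bs)) / np.sqrt(2)
h = (np.random.randn(n_bs, 1) + 1j * np.random.randn(n_bs, 1)) / np.sqrt(2)
y = h.conj().T @ S_ris            # received pilots as in (30), noise omitted
assert np.allclose(direct_channel_ls(y, S_ris), h)
```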

Next, we consider the cascaded channel estimation. We assume that the RIS elements are turned on one by one while all the other elements are turned off. This is done by the BS requesting the RIS, via a micro-controller device over the backhaul link, to turn on a single RIS element at a time. Then, the reflecting beamformer vector in the $n$th frame becomes $\boldsymbol{\psi}^{(n)}=[0,\dots,0,\psi_n,0,\dots,0]^{\mathsf{T}}$, where $a_{\tilde{n}}=0$ for $\tilde{n}=1,\dots,N_{\mathrm{RIS}}$, $\tilde{n}\neq n$, and the received signal is given by

$\mathbf{y}_{\mathrm{C},k}^{(n)}=(\mathbf{h}_{\mathrm{B},k}^{\mathsf{H}}+{\mathbf{v}_k^{(n)}}^{\mathsf{H}})\overline{\mathbf{S}}_{\mathrm{RIS}}+\mathbf{n}_k, \qquad (31)$

where $\mathbf{v}_k^{(n)}\in\mathbb{C}^{N_{\mathrm{BS}}}$ is the $n$th column of $\mathbf{V}_k$, i.e., $\mathbf{v}_k^{(n)}=\mathbf{V}_k\boldsymbol{\psi}^{(n)}$ with $\psi_n=1$. Using the estimate of $\mathbf{h}_{\mathrm{B},k}$ from (30), (31) can be solved for $\mathbf{v}_k^{(n)}$, $n=1,\dots,N_{\mathrm{RIS}}$, and the cascaded channel $\mathbf{V}_k$ can be estimated. The received data for $n=1,\dots,N_{\mathrm{RIS}}$ can then be stacked into $\mathbf{Y}_{\mathrm{C},k}=[{\mathbf{y}_{\mathrm{C},k}^{(1)}}^{\mathsf{T}},\dots,{\mathbf{y}_{\mathrm{C},k}^{(N_{\mathrm{RIS}})}}^{\mathsf{T}}]^{\mathsf{T}}\in\mathbb{C}^{N_{\mathrm{RIS}}\times M_{\mathrm{BS}}}$. In order to train ChannelNet for the RIS-assisted massive MIMO scenario, we select the input-output data pairs as $\{\mathbf{y}_{\mathrm{B},k},\mathbf{h}_{\mathrm{B},k}\}$ and $\{\mathbf{Y}_{\mathrm{C},k},\mathbf{V}_k\}$ for the direct and cascaded channels, respectively. To jointly learn both channels, a single input is constructed to train a single NN as $\boldsymbol{\Upsilon}_k=[\mathbf{y}_{\mathrm{B},k}^{\mathsf{T}},\mathbf{Y}_{\mathrm{C},k}^{\mathsf{T}}]^{\mathsf{T}}\in\mathbb{C}^{(N_{\mathrm{RIS}}+1)\times M_{\mathrm{BS}}}$. Following the same strategy as in the previous scenario, the three "channels" of the input data are constructed as $[\mathcal{X}_k]_1=\operatorname{Re}\{\boldsymbol{\Upsilon}_k\}$, $[\mathcal{X}_k]_2=\operatorname{Im}\{\boldsymbol{\Upsilon}_k\}$ and $[\mathcal{X}_k]_3=\angle\{\boldsymbol{\Upsilon}_k\}$, respectively. We define the output data as $\boldsymbol{\Sigma}_k=[\mathbf{h}_{\mathrm{B},k},\mathbf{V}_k]\in\mathbb{C}^{N_{\mathrm{BS}}\times(N_{\mathrm{RIS}}+1)}$; hence, the output label is the $2N_{\mathrm{BS}}(N_{\mathrm{RIS}}+1)\times 1$ real-valued vector

$\mathcal{Y}_k=\big[\mathrm{vec}\{\operatorname{Re}\{\boldsymbol{\Sigma}_k\}\}^{\mathsf{T}},\mathrm{vec}\{\operatorname{Im}\{\boldsymbol{\Sigma}_k\}\}^{\mathsf{T}}\big]^{\mathsf{T}}. \qquad (32)$

Consequently, the sizes of $\mathcal{X}_k$ and $\mathcal{Y}_k$ are $(N_{\mathrm{RIS}}+1)\times M_{\mathrm{BS}}\times 3$ and $2N_{\mathrm{BS}}(N_{\mathrm{RIS}}+1)\times 1$, respectively.
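A small NumPy sketch of this joint input/label construction (shapes only; the arrays themselves would come from the pilot signals in (30)-(31) and the corresponding channels):

```python
import numpy as np

def ris_training_pair(y_Bk, Y_Ck, h_Bk, V_k):
    """Stack the direct and cascaded pilots into Upsilon_k and vectorize [h_B,k, V_k]."""
    Upsilon = np.vstack([y_Bk, Y_Ck])                 # (N_RIS + 1) x M_BS
    X = np.stack([Upsilon.real, Upsilon.imag, np.angle(Upsilon)], axis=-1)
    Sigma = np.hstack([h_Bk.reshape(-1, 1), V_k])     # N_BS x (N_RIS + 1)
    Y = np.concatenate([Sigma.real.flatten('F'), Sigma.imag.flatten('F')])  # (32)
    return X, Y   # sizes (N_RIS+1) x M_BS x 3 and 2 * N_BS * (N_RIS+1)
```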

III-E Neural Network Architecture and Training

We design a single CNN, i.e., ChannelNet, trained on two different datasets for the conventional and RIS-assisted massive MIMO applications. The proposed architecture is a CNN with 10 layers. The first layer is the input layer, which accepts input data of size $N_{\mathrm{MS}}\times N_{\mathrm{BS}}\times 3$ and $(N_{\mathrm{RIS}}+1)\times M_{\mathrm{BS}}\times 3$ for the conventional and RIS-assisted massive MIMO scenarios, respectively. The $\{2,4,6\}$th layers are convolutional layers with $N_{\mathrm{SF}}=128$ filters, each employing a $3\times 3$ kernel for 2-D spatial feature extraction. The $\{3,5,7\}$th layers are normalization layers. The eighth layer is a fully connected layer with $N_{\mathrm{FCL}}=1024$ units, whose main purpose is feature mapping. The ninth layer is a dropout layer with probability $\kappa=1/2$. The dropout layer applies an $N_{\mathrm{FCL}}\times 1$ mask to the weights of the fully connected layer, whose elements are drawn uniformly at random from $\{0,1\}$. As a result, at each iteration of FL training, a different randomly selected subset of the fully connected layer weights is updated. Thus, the dropout layer reduces the size of $\boldsymbol{\theta}_t$ and $\mathbf{g}_k(\boldsymbol{\theta}_t)$, thereby reducing the model transmission overhead. Finally, the last layer is the output regression layer, yielding the channel estimate of size $2N_{\mathrm{MS}}N_{\mathrm{BS}}\times 1$ and $2N_{\mathrm{BS}}(N_{\mathrm{RIS}}+1)\times 1$ for the conventional and RIS-assisted massive MIMO applications, respectively.
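As a point of reference, a PyTorch sketch of this architecture is given below; the paper does not specify the normalization type, activation functions or padding, so batch normalization, ReLU activations and "same" padding are assumptions made purely to obtain a runnable example (PyTorch expects the three "channels" first, i.e., inputs of shape 3 x height x width).

```python
import torch.nn as nn

def channelnet(in_h, in_w, out_dim, n_sf=128, n_fcl=1024, dropout=0.5):
    """Sketch of ChannelNet: three conv+norm blocks, a fully connected layer with
    dropout, and an output regression layer."""
    return nn.Sequential(
        nn.Conv2d(3, n_sf, 3, padding=1), nn.BatchNorm2d(n_sf), nn.ReLU(),
        nn.Conv2d(n_sf, n_sf, 3, padding=1), nn.BatchNorm2d(n_sf), nn.ReLU(),
        nn.Conv2d(n_sf, n_sf, 3, padding=1), nn.BatchNorm2d(n_sf), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(n_sf * in_h * in_w, n_fcl), nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(n_fcl, out_dim),   # output regression layer
    )

# Conventional massive MIMO: channelnet(N_MS, N_BS, 2 * N_MS * N_BS)
# RIS-assisted scenario:     channelnet(N_RIS + 1, M_BS, 2 * N_BS * (N_RIS + 1))
```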

During FL-based training, the datasets collected at the users are used to compute the model updates as in Section III-B, which are then transmitted to the BS. The model updates collected at the BS are aggregated as in (10), and the updated model is broadcast to the users for the next iteration. This process is repeated for $t=1,\dots,T$ communication rounds until convergence.

Figure 4: The proposed CNN architecture for channel estimation.

IV Communication Overhead and Complexity

IV-A Communication Overhead

Communication overhead can be defined as the amount of data transmitted during model training. Let $\mathcal{T}_{\mathrm{FL}}$ and $\mathcal{T}_{\mathrm{CL}}$ denote the communication overhead of FL and CL, respectively. Then, $\mathcal{T}_{\mathrm{CL}}$ for the conventional and RIS-assisted scenarios is

$\mathcal{T}_{\mathrm{CL}}=\begin{cases}(3N_{\mathrm{MS}}N_{\mathrm{BS}}+2N_{\mathrm{MS}}N_{\mathrm{BS}})\mathsf{D}, & \mathrm{mMIMO}\\(3(N_{\mathrm{RIS}}+1)M_{\mathrm{BS}}+2N_{\mathrm{BS}}(N_{\mathrm{RIS}}+1))\mathsf{D}, & \mathrm{RIS},\end{cases} \qquad (35)$

which is the number of symbols in the uplink transmission of the training datasets from the users to the BS. In contrast, the communication overhead of FL comprises the transmission of $\mathbf{g}_k(\boldsymbol{\theta}_t)$ in the uplink and $\boldsymbol{\theta}_t$ in the downlink for $t=1,\dots,T$. Hence, $\mathcal{T}_{\mathrm{FL}}$ is given by

$\mathcal{T}_{\mathrm{FL}}=\begin{cases}2PTK, & \mathrm{mMIMO}\\2PTK, & \mathrm{RIS}.\end{cases} \qquad (38)$

We can see that the dominant terms in (35) and (38) are $\mathsf{D}$ and $P$, which are the number of training data pairs and the number of NN parameters, respectively. While $\mathsf{D}$ can be adjusted according to the amount of data available at the users, $P$ is usually fixed during model training. Here, $P$ is computed as $P=\underbrace{N_{\mathrm{CL}}(CN_{\mathrm{SF}}W_xW_y)}_{\mathrm{convolutional\ layers}}+\underbrace{\kappa N_{\mathrm{SF}}W_xW_yN_{\mathrm{FCL}}}_{\mathrm{fully\ connected\ layer}}$, where $N_{\mathrm{CL}}=3$ is the number of convolutional layers, $C=3$ is the number of spatial "channels" and $W_x=W_y=3$ is the 2-D kernel size. As a result, we have $P=600{,}192$. Since the number of samples in the training dataset is usually larger than the number of model parameters, it is expected that $\mathcal{T}_{\mathrm{FL}}<\mathcal{T}_{\mathrm{CL}}$ [35, 30, 31] (see Fig. 11).
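The parameter count can be reproduced with a one-line calculation from the quantities just defined:

```python
# P = N_CL * (C * N_SF * Wx * Wy) + kappa * N_SF * Wx * Wy * N_FCL
N_CL, C, N_SF, Wx, Wy, kappa, N_FCL = 3, 3, 128, 3, 3, 0.5, 1024
P = N_CL * (C * N_SF * Wx * Wy) + kappa * N_SF * Wx * Wy * N_FCL
print(int(P))   # 600192, i.e., P = 600,192 as stated above
```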

TABLE I: Convolutional Layers Settings
$l$   $D_{x}^{(l)}$   $D_{y}^{(l)}$   $W_{x}^{(l)}$   $W_{y}^{(l)}$   $N_{\mathrm{SF}}^{(l-1)}$   $N_{\mathrm{SF}}^{(l)}$
2   $N_{\mathrm{MS}}$   $N_{\mathrm{BS}}$   3   3   3   128
4   $N_{\mathrm{MS}}$   $N_{\mathrm{BS}}$   3   3   128   128
6   $N_{\mathrm{MS}}$   $N_{\mathrm{BS}}$   3   3   128   128
Figure 5: Complexity order for CNN, MMSE and LS for channel estimation.
Figure 6: Validation RMSE (a) and channel estimation NMSE (b) with respect to $K$ in the massive MIMO scenario.

IV-B Computational Complexity

We further examine the computational complexity of the proposed CNN architecture. The time complexity of the convolutional layers can be written as [16, 36]

\mathcal{C}_{\mathrm{CL}}=\mathcal{O}\bigg(\sum_{l=1}^{N_{\mathrm{CL}}}D_{x}^{(l)}D_{y}^{(l)}W_{x}^{(l)}W_{y}^{(l)}N_{\mathrm{SF}}^{(l-1)}N_{\mathrm{SF}}^{(l)}\bigg),   (39)

where $D_{x}^{(l)}$ and $D_{y}^{(l)}$ are the column and row sizes of the output feature map, and $W_{x}^{(l)}$ and $W_{y}^{(l)}$ are the 2-D filter sizes of the $l$-th layer. $N_{\mathrm{SF}}^{(l-1)}$ and $N_{\mathrm{SF}}^{(l)}$ denote the number of input and output feature maps of the $l$-th layer, respectively. Table I lists the parameters of each convolutional layer for an $N_{\mathrm{MS}}\times N_{\mathrm{BS}}\times 3$ input. Thus, the complexity of the three convolutional layers with $128$ spatial filters of size $3\times 3$ approximately becomes

\mathcal{C}_{\mathrm{CL}}\approx\mathcal{O}\big(3\cdot 9\cdot 128^{2}N_{\mathrm{MS}}N_{\mathrm{BS}}\big).   (40)

Similarly, the time complexity of the fully connected layer is

\mathcal{C}_{\mathrm{FCL}}=\mathcal{O}\big(D_{x}D_{y}\kappa N_{\mathrm{FCL}}\big),   (41)

where $N_{\mathrm{FCL}}=1024$ is the number of units with dropout probability $\kappa=1/2$, and $D_{x}=128N_{\mathrm{MS}}N_{\mathrm{BS}}$ and $D_{y}=1$ are the 2-D input sizes of the fully connected layer. Then, the time complexity of the fully connected layer is approximately

\mathcal{C}_{\mathrm{FCL}}\approx\mathcal{O}\big(4\cdot 128^{2}N_{\mathrm{MS}}N_{\mathrm{BS}}\big).   (42)

Hence, the total time complexity of ChannelNet is $\mathcal{C}=\mathcal{C}_{\mathrm{CL}}+\mathcal{C}_{\mathrm{FCL}}$, which is approximately

\mathcal{C}\approx\mathcal{O}\big(3\cdot 9\cdot 128^{2}N_{\mathrm{MS}}N_{\mathrm{BS}}+4\cdot 128^{2}N_{\mathrm{MS}}N_{\mathrm{BS}}\big),   (43)

which further simplifies to $\mathcal{O}\big(31\cdot 128^{2}N_{\mathrm{MS}}N_{\mathrm{BS}}\big)$. Since the computation of the pseudo-inverse of the received pilot data is required in the testing stage, the complexity orders of LS and MMSE estimation are $\mathcal{O}\big(N_{\mathrm{MS}}^{2}N_{\mathrm{BS}}^{2}\big)$ and $\mathcal{O}\big(N_{\mathrm{MS}}^{3}N_{\mathrm{BS}}^{3}\big)$, respectively [16, 50].

Fig. 5 shows the time complexity comparison of CNN, MMSE and LS with respect to $N_{\mathrm{MS}}N_{\mathrm{BS}}$. We see that ChannelNet has higher complexity than LS. As the number of antennas, i.e., $N_{\mathrm{MS}}N_{\mathrm{BS}}$, increases, the complexity of MMSE approaches that of ChannelNet and exceeds it after approximately $N_{\mathrm{MS}}N_{\mathrm{BS}}\geq 720$. While the complexity of ChannelNet is comparable with that of the conventional techniques, it can be run more efficiently on parallel processors, e.g., GPUs, which can significantly reduce the computation time [16, 50, 32]. However, the implementation with GPUs is not straightforward for the other algorithms, and it requires algorithm-dependent processor configuration.
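The crossover point quoted above can be checked numerically from the leading terms of (43) and the LS/MMSE orders; the short sketch below only compares these orders and is not meant to model actual run times.

import numpy as np

N = np.arange(1, 2001, dtype=float)   # N = N_MS * N_BS
c_cnn = 31 * 128**2 * N               # ChannelNet order from (43)
c_ls = N**2                           # LS order
c_mmse = N**3                         # MMSE order
crossover = int(N[np.argmax(c_mmse > c_cnn)])
print(crossover)                      # ~713, i.e., MMSE exceeds the CNN around N_MS*N_BS ≈ 720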

Figure 7: Validation RMSE (a) and channel estimation NMSE (b) with respect to $\mathrm{SNR}_{\boldsymbol{\theta}}$ in the massive MIMO scenario.
Figure 8: Validation RMSE (a) and channel estimation NMSE (b) with respect to $\zeta\in[0,0.5]$ in the massive MIMO scenario.

V Numerical Simulations

The goal of the simulations is to compare the proposed FL-based channel estimation approach with the state-of-the-art ML-based channel estimation techniques SF-CNN [16] and MLP [17], as well as with MMSE and LS estimation, in terms of the normalized MSE (NMSE), defined by $\mathrm{NMSE}=\frac{1}{J_{T}KM}\sum_{i=1}^{J_{T}}\sum_{k=1}^{K}\sum_{m=1}^{M}\frac{\|\mathbf{H}_{k}[m]-\hat{\mathbf{H}}_{k}^{(i)}[m]\|_{\mathcal{F}}^{2}}{\|\mathbf{H}_{k}[m]\|_{\mathcal{F}}^{2}}$, where $J_{T}=100$ is the number of Monte Carlo trials. We also present the validation RMSE of the training process, defined by $\mathrm{RMSE}=\left(\frac{1}{|\mathcal{D}_{\mathrm{val}}|}\sum_{i=1}^{|\mathcal{D}_{\mathrm{val}}|}\|f(\widetilde{\mathcal{X}}^{(i)}|\boldsymbol{\theta})-\widetilde{\mathcal{Y}}^{(i)}\|_{\mathcal{F}}^{2}\right)^{1/2}$, where $\widetilde{\mathcal{X}}^{(i)}$ and $\widetilde{\mathcal{Y}}^{(i)}$ denote the input-output pairs in the validation dataset $\mathcal{D}_{\mathrm{val}}$, which includes $20\%$ of the whole dataset $\mathcal{D}$; hence, $|\mathcal{D}_{\mathrm{val}}|=0.2|\mathcal{D}|$.
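Both metrics translate directly into a few lines of Python; the array shapes below (trials, users and subcarriers stacked along leading axes) are illustrative assumptions, not a prescribed data layout.

import numpy as np

def nmse(H_true, H_est):
    # mean over trials/users/subcarriers of ||H - H_hat||_F^2 / ||H||_F^2
    err = np.linalg.norm(H_true - H_est, axis=(-2, -1))**2
    ref = np.linalg.norm(H_true, axis=(-2, -1))**2
    return np.mean(err / ref)

def validation_rmse(pred, target):
    # pred, target: stacked network outputs and labels over the validation set
    sq = np.sum(np.abs(pred - target)**2, axis=tuple(range(1, pred.ndim)))
    return np.sqrt(np.mean(sq))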

The local dataset of each user includes $N=100$ different channel realizations for $K=8$ users. In the massive MIMO scenario, the numbers of antennas at the BS and the users are $N_{\mathrm{BS}}=128$ and $N_{\mathrm{MS}}=32$, respectively, and we select $M=16$ and $L=5$. For the RIS-assisted scenario, $N_{\mathrm{BS}}=N_{\mathrm{RIS}}=64$. Hence, both scenarios have the same number of input elements, i.e., $128\cdot 32=64\cdot 64$. In both scenarios, the location of each user is selected such that $\phi_{k,l}\in\Phi_{k}$ and $\bar{\phi}_{k,l}\in\bar{\Psi}_{k}$, where $\Phi_{k}$ and $\bar{\Psi}_{k}$ are the equally divided subregions of the angular domain $\Theta$, i.e., $\Theta=\bigcup_{k\in\mathcal{K}}\Phi_{k}=\bigcup_{k\in\mathcal{K}}\bar{\Psi}_{k}$. The pilot data are generated as $\overline{\mathbf{S}}=\mathbf{I}_{M_{\mathrm{BS}}}$ and $\overline{\mathbf{S}}_{\mathrm{RIS}}=\mathbf{I}_{M_{\mathrm{BS}}}$ with $M_{\mathrm{BS}}=N_{\mathrm{BS}}$ and $M_{\mathrm{MS}}=N_{\mathrm{MS}}$. We select $\overline{\mathbf{F}}[m]$ and $\overline{\mathbf{W}}[m]$ as the first $M_{\mathrm{BS}}$ columns of an $N_{\mathrm{BS}}\times N_{\mathrm{BS}}$ DFT matrix and the first $M_{\mathrm{MS}}$ columns of an $N_{\mathrm{MS}}\times N_{\mathrm{MS}}$ DFT matrix, respectively [16]. During training, we add AWGN to the input data at three SNR levels, i.e., $\mathrm{SNR}=\{20,25,30\}$ dB, with $G_{\mathrm{mMIMO}}=20$ and $G_{\mathrm{RIS}}=20M$ realizations, in order to provide robust performance against noisy inputs [19, 18] in both scenarios. As a result, both training datasets have the same number of input-output pairs, i.e., $\textsf{D}_{\mathrm{mMIMO}}=3MKNG_{\mathrm{mMIMO}}=3\cdot 16\cdot 8\cdot 100\cdot 20=768{,}000$ and $\textsf{D}_{\mathrm{RIS}}=3KNG_{\mathrm{RIS}}=3\cdot 8\cdot 100\cdot 320=768{,}000$, respectively. The proposed ChannelNet model is realized and trained in MATLAB on a PC with a 2304-core GPU. For CL, we use the SGD algorithm with a momentum of 0.9, a mini-batch size of $M_{B}=128$, and a learning rate of 0.001. For FL, we train ChannelNet for $T=100$ communication rounds. Once the training is completed, the labels of the validation data (i.e., $20\%$ of the whole dataset) are used in the prediction stage. During the prediction stage, each user estimates its own channel by feeding ChannelNet with $\mathbf{G}_{k}[m]$ ($\boldsymbol{\Upsilon}_{k}$) and obtains $\hat{\mathbf{H}}_{k}[m]$ ($\hat{\mathbf{h}}_{\mathrm{B},k}$ and $\hat{\mathbf{V}}_{k}$) at the output for the massive MIMO (RIS) scenario, respectively. The source codes of the FL-based channel estimation scheme can be found at https://sites.google.com/view/elbir/publications.
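As a simple illustration of the noisy-input augmentation described above, the sketch below adds AWGN to a real-valued input array at the three training SNR levels; the random seed and the array handling are arbitrary choices for reproducibility, not part of the original setup.

import numpy as np

def add_awgn(G, snr_db, rng):
    # scale the noise power to the empirical signal power of G
    sig_pow = np.mean(np.abs(G)**2)
    noise_pow = sig_pow / 10**(snr_db / 10)
    return G + np.sqrt(noise_pow) * rng.standard_normal(G.shape)

rng = np.random.default_rng(0)

def augment(G, snr_levels=(20, 25, 30)):
    # three noisy copies per channel realization, as in the training-set construction above
    return [add_awgn(G, s, rng) for s in snr_levels]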

Figure 9: Validation RMSE (a) and channel estimation NMSE (b) for different quantization levels in the massive MIMO scenario.
Figure 10: Channel estimation NMSE for different algorithms in the massive MIMO scenario.
Figure 11: Communication overhead for FL- and CL-based model training.

V-A Channel Estimation in Massive MIMO

Figure 12: Validation RMSE (a) and channel estimation NMSE (b) with respect to $\mathrm{SNR}_{\boldsymbol{\theta}}$ in the RIS-assisted massive MIMO scenario.

In Fig. 6, we present the training performance (Fig. 6a) and the channel estimation NMSE (Fig. 6b) of the proposed FL approach for different numbers of users. In this scenario, we fix the total dataset size D by selecting $G=20\cdot\frac{8}{K}$. As $K$ increases, the training performance improves and approaches that of CL, since the model updates superposed at the BS become more robust against the noise. As $K$ decreases, the corruption in the model aggregation increases due to the diversity in the training dataset.

Fig. 7 shows the training and channel estimation performance for different noise levels added to the transmitted gradient and model data when $K=8$. Here, we add AWGN to both $\mathbf{g}_{k}(\boldsymbol{\theta}_{t})$ and $\boldsymbol{\theta}_{t}$ according to $\mathrm{SNR}_{\boldsymbol{\theta}}=20\log_{10}\frac{||\mathbf{g}_{k}(\boldsymbol{\theta}_{t})||_{2}^{2}}{\sigma_{\boldsymbol{\theta}}^{2}}$. We observe in Fig. 7a that the training diverges for low $\mathrm{SNR}_{\boldsymbol{\theta}}$ (e.g., $\mathrm{SNR}_{\boldsymbol{\theta}}\leq 5$ dB) due to the corruption of the model parameters. The corresponding channel estimation performance is presented in Fig. 7b for the cases in which ChannelNet converges; at least $\mathrm{SNR}_{\boldsymbol{\theta}}=15$ dB is required to obtain reasonable channel estimation performance, e.g., $\mathrm{NMSE}\leq 0.001$.
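A minimal sketch of this corruption model is given below; since the text does not specify how $\sigma_{\boldsymbol{\theta}}^{2}$ is distributed over the $P$ entries, a per-entry noise variance of $\sigma_{\boldsymbol{\theta}}^{2}/P$ is assumed here purely for illustration.

import numpy as np

def corrupt_model(g, snr_theta_db, rng=np.random.default_rng(1)):
    # sigma_theta^2 solved from SNR_theta = 20*log10(||g||_2^2 / sigma_theta^2)
    sigma2 = np.linalg.norm(g)**2 / 10**(snr_theta_db / 20)
    return g + np.sqrt(sigma2 / g.size) * rng.standard_normal(g.shape)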

Fig. 8 shows the training and channel estimation performance in case of an impulsive noise causing the loss of gradient and model data. In this experiment, we multiply $\mathbf{g}_{k}(\boldsymbol{\theta}_{t})$ and $\boldsymbol{\theta}_{t}$ element-wise with $\mathbf{u}\in\mathbb{R}^{P}$, i.e., $\mathbf{g}_{k}(\boldsymbol{\theta}_{t})\odot\mathbf{u}$ and $\boldsymbol{\theta}_{t}\odot\mathbf{u}$, where $\lfloor 100\zeta\rfloor\%$ of the elements of $\mathbf{u}$ are $0$ and the remaining elements are $1$. This allows us to simulate the case in which a portion of the gradient/model data is completely lost during transmission. We observe that the loss of model data significantly affects both training and channel estimation accuracy. Therefore, reliable channel estimation demands at most $5\%$ parameter loss during transmission.
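The masking operation can be sketched as follows; reading $\lfloor 100\zeta\rfloor$ as a percentage of the $P$ entries is our interpretation of the description above, and the random selection of the lost entries is an assumption.

import numpy as np

def drop_parameters(g, zeta, rng=np.random.default_rng(2)):
    # zero out floor(100*zeta) percent of the entries of g (the mask u above)
    n_lost = int(np.floor(100 * zeta)) * g.size // 100
    u = np.ones(g.size)
    u[rng.choice(g.size, size=n_lost, replace=False)] = 0.0
    return g * u.reshape(g.shape)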

Figure 13: Validation RMSE (a) and channel estimation NMSE (b) for different quantization levels in the RIS-assisted massive MIMO scenario.

Fig. 9 shows the training and channel estimation performance when the transmitted data (i.e., $\mathbf{g}_{k}(\boldsymbol{\theta}_{t})$ and $\boldsymbol{\theta}_{t}$) are quantized with $B$ bits. As expected, the performance improves as $B$ increases, and at least $B=5$ bits are required to obtain a reasonable channel estimation performance. Compared to the results in Fig. 7, quantization has more influence on the accuracy than $\mathrm{SNR}_{\boldsymbol{\theta}}$.
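Since the quantizer itself is not specified in the text, the sketch below uses a symmetric uniform $B$-bit quantizer over the dynamic range of the transmitted vector as a stand-in.

import numpy as np

def quantize(g, B):
    # uniform quantizer with 2^B levels over [-max|g|, max|g|]
    g_max = np.max(np.abs(g)) + 1e-12
    step = 2 * g_max / 2**B
    q = np.clip(np.round(g / step), -(2**(B - 1)), 2**(B - 1) - 1)
    return q * step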

In Fig. 10, we present the channel estimation NMSE for different algorithms when $K=8$. We train ChannelNet with both the CL and FL frameworks and observe that CL closely follows the MMSE performance. CL provides better performance than FL since it has access to the whole dataset at once. Nevertheless, FL achieves satisfactory channel estimation performance despite decentralized training. Specifically, FL and CL have similar NMSE for $\mathrm{SNR}\leq 25$ dB, and the performance of FL saturates in the high SNR regime. This is because the learning model loses precision due to FL training and cannot perform better, which is a common problem in ML-based techniques [16, 18]. In order to improve the performance, over-training can be employed so that more precision is obtained. However, this introduces overfitting, i.e., the model memorizes the data and, hence, cannot perform well for different inputs. Fig. 10 also compares training with perfect (true channel data) and imperfect (estimated channel via ADCE) labels. The use of imperfect labels causes a slight performance degradation, while still providing lower NMSE than SF-CNN and MLP. The other algorithms exhibit similar behavior but perform worse than ChannelNet. This is because SF-CNN and MLP have convolutional-only and fully-connected-only layers, respectively, whereas ChannelNet includes both structures and, hence, exhibits better feature extraction and data mapping performance.

According to the analysis in Sec. IV-A, the communication overheads of FL and CL are $2PTK=2\cdot 600{,}192\cdot 100\cdot 8\approx 960\times 10^{6}$ and $(3N_{\mathrm{MS}}N_{\mathrm{BS}}+2N_{\mathrm{MS}}N_{\mathrm{BS}})\textsf{D}=(5\cdot 128\cdot 32)\cdot 768{,}000\approx 16\times 10^{9}$, respectively. This clearly shows the effectiveness of FL over CL. We also present the number of transmitted symbols during training with respect to data transmission blocks in Fig. 11, where we assume that 1000 data symbols are transmitted in each transmission block. We can see that it takes about $1\times 10^{6}$ data blocks to complete the gradient/model transmission in FL (see, e.g., Fig. 1b), whereas CL-based training demands approximately $16\times 10^{6}$ data blocks to transmit the training dataset (see, e.g., Fig. 1a). Therefore, the communication overhead of FL is approximately 16 times lower than that of CL.

V-B Channel Estimation in RIS-assisted Massive MIMO

In Fig. 12, we present the validation RMSE and the channel estimation NMSE. We compute the NMSE of the direct channel and the cascaded channel together and present the results in a single plot. Results similar to the massive MIMO scenario are obtained: model training diverges when $\mathrm{SNR}_{\boldsymbol{\theta}}\leq 5$ dB and the channel estimation NMSE becomes relatively small when $\mathrm{SNR}_{\boldsymbol{\theta}}\geq 15$ dB.

Fig. 13 shows the validation RMSE and the channel estimation NMSE for different quantization levels. A small number of bits causes a loss of precision in the channel estimation NMSE. Similar to the massive MIMO scenario, at least $B\geq 5$ bits are required to obtain satisfactory channel estimation performance at high SNR, i.e., $\mathrm{SNR}\geq 20$ dB.

VI Conclusions

In this paper, we propose an FL framework for channel estimation in conventional and RIS-assisted massive MIMO systems. We evaluate the performance of the proposed approach via several numerical simulations for different numbers of users and when the gradient/model parameters are quantized and corrupted by noise. We show that at least 5-bit quantization and 15 dB SNR on the model parameters are required for reliable channel estimation performance, i.e., $\mathrm{NMSE}\leq 0.001$. We further analyze the scenario in which a portion of the gradient/model parameters is completely lost and observe that FL exhibits satisfactory performance under at most $5\%$ information loss. We also examine the channel estimation performance of the proposed CNN architecture with both perfect and imperfect labels. A slight performance degradation is observed for imperfect labels as compared to the perfect CSI case. Nevertheless, the performance in the imperfect-label scenario strongly depends on the accuracy of the channel estimation algorithm employed during training dataset collection. Furthermore, the proposed CNN architecture provides lower NMSE than the state-of-the-art NN architectures. Apart from the channel estimation performance, the FL-based approach enjoys approximately 16 times lower transmission overhead as compared to CL-based training. As future work, we plan to develop compression-based techniques for both the training data and the model parameters to further reduce the communication overhead.

Appendix A Proof of Theorem 1

We first make the following assumptions, which are needed to ensure convergence and are typical for $\ell_{2}$-norm regularized linear regression, logistic regression, and softmax classifiers [42, 41, 35].

Assumption 1: The loss function $\mathcal{L}(\boldsymbol{\theta})$ is convex, i.e., $\mathcal{L}((1-\lambda)\boldsymbol{\theta}+\lambda\boldsymbol{\theta}^{\prime})\leq(1-\lambda)\mathcal{L}(\boldsymbol{\theta})+\lambda\mathcal{L}(\boldsymbol{\theta}^{\prime})$ for $\lambda\in[0,1]$ and arbitrary $\boldsymbol{\theta}$ and $\boldsymbol{\theta}^{\prime}$.

Assumption 2: $\mathcal{L}(\boldsymbol{\theta})$ is $L$-Lipschitz, i.e., $||\mathcal{L}(\boldsymbol{\theta})-\mathcal{L}(\boldsymbol{\theta}^{\prime})||\leq L||\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}||$ for arbitrary $\boldsymbol{\theta}$ and $\boldsymbol{\theta}^{\prime}$.

Assumption 3: $\mathcal{L}(\boldsymbol{\theta})$ is $\beta$-smooth, i.e., $||\nabla\mathcal{L}(\boldsymbol{\theta})-\nabla\mathcal{L}(\boldsymbol{\theta}^{\prime})||\leq\beta||\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}||$ for arbitrary $\boldsymbol{\theta}$ and $\boldsymbol{\theta}^{\prime}$.

In order to prove Theorem 1, we first establish the smoothness of $\bar{\mathcal{L}}(\boldsymbol{\theta})$ in the following lemma.

Lemma 1: $\bar{\mathcal{L}}(\boldsymbol{\theta})$ is a $\bar{\beta}$-smooth function, i.e., $||\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})-\nabla\bar{\mathcal{L}}(\boldsymbol{\theta}^{\prime})||\leq\bar{\beta}||\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}||$, where $\bar{\beta}=(1+\sigma_{\Delta}^{2})\beta$.

Proof: Using (16), we get

||\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})-\nabla\bar{\mathcal{L}}(\boldsymbol{\theta}^{\prime})||
=||\nabla\big(\mathcal{L}(\boldsymbol{\theta})+\sigma_{\Delta}^{2}||\nabla\mathcal{L}(\boldsymbol{\theta})||^{2}\big)-\nabla\big(\mathcal{L}(\boldsymbol{\theta}^{\prime})+\sigma_{\Delta}^{2}||\nabla\mathcal{L}(\boldsymbol{\theta}^{\prime})||^{2}\big)||
=||\big(\nabla\mathcal{L}(\boldsymbol{\theta})+\sigma_{\Delta}^{2}\nabla||\nabla\mathcal{L}(\boldsymbol{\theta})||^{2}\big)-\big(\nabla\mathcal{L}(\boldsymbol{\theta}^{\prime})+\sigma_{\Delta}^{2}\nabla||\nabla\mathcal{L}(\boldsymbol{\theta}^{\prime})||^{2}\big)||
=||\nabla\mathcal{L}(\boldsymbol{\theta})-\nabla\mathcal{L}(\boldsymbol{\theta}^{\prime})+\sigma_{\Delta}^{2}\big(\nabla\mathrm{tr}\{\nabla\mathcal{L}(\boldsymbol{\theta})^{\textsf{T}}\nabla\mathcal{L}(\boldsymbol{\theta})\}-\nabla\mathrm{tr}\{\nabla\mathcal{L}(\boldsymbol{\theta}^{\prime})^{\textsf{T}}\nabla\mathcal{L}(\boldsymbol{\theta}^{\prime})\}\big)||
=||\nabla\mathcal{L}(\boldsymbol{\theta})-\nabla\mathcal{L}(\boldsymbol{\theta}^{\prime})+\sigma_{\Delta}^{2}\big(\nabla\mathcal{L}(\boldsymbol{\theta})-\nabla\mathcal{L}(\boldsymbol{\theta}^{\prime})\big)||
=(1+\sigma_{\Delta}^{2})||\nabla\mathcal{L}(\boldsymbol{\theta})-\nabla\mathcal{L}(\boldsymbol{\theta}^{\prime})||.   (44)

By incorporating (44), Assumption 3 and $1+\sigma_{\Delta}^{2}\geq 0$, we get

||\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})-\nabla\bar{\mathcal{L}}(\boldsymbol{\theta}^{\prime})||\leq\bar{\beta}||\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}||,   (45)

where $\bar{\beta}=(1+\sigma_{\Delta}^{2})\beta$. ∎

Using (45), Assumptions 2 and 3 imply that $\bar{\mathcal{L}}(\boldsymbol{\theta})$ is twice differentiable with $\nabla^{2}\bar{\mathcal{L}}(\boldsymbol{\theta})\preceq\bar{\beta}\mathbf{I}_{P}$. Using this fact, a quadratic expansion around $\bar{\mathcal{L}}(\boldsymbol{\theta})$ yields

\bar{\mathcal{L}}(\boldsymbol{\theta}^{\prime})\leq\bar{\mathcal{L}}(\boldsymbol{\theta})+\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})^{\textsf{T}}(\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta})+\frac{1}{2}\nabla^{2}\bar{\mathcal{L}}(\boldsymbol{\theta})||\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta}||^{2}
\leq\bar{\mathcal{L}}(\boldsymbol{\theta})+\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})^{\textsf{T}}(\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta})+\frac{1}{2}\bar{\beta}||\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta}||^{2}.   (46)

Substituting the GD update $\boldsymbol{\theta}^{\prime}=\boldsymbol{\theta}-\eta\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})$ into (46), we get

\bar{\mathcal{L}}(\boldsymbol{\theta}^{\prime})\leq\bar{\mathcal{L}}(\boldsymbol{\theta})+\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})^{\textsf{T}}(\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta})+\frac{1}{2}\bar{\beta}||\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta}||^{2}
=\bar{\mathcal{L}}(\boldsymbol{\theta})+\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})^{\textsf{T}}(\boldsymbol{\theta}-\eta\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})-\boldsymbol{\theta})+\frac{1}{2}\bar{\beta}||\boldsymbol{\theta}-\eta\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})-\boldsymbol{\theta}||^{2}
=\bar{\mathcal{L}}(\boldsymbol{\theta})-\eta\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})^{\textsf{T}}\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})+\frac{1}{2}\bar{\beta}||\eta\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})||^{2}
=\bar{\mathcal{L}}(\boldsymbol{\theta})-\eta||\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})||^{2}+\frac{1}{2}\bar{\beta}\eta^{2}||\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})||^{2}
=\bar{\mathcal{L}}(\boldsymbol{\theta})-\Big(1-\frac{\bar{\beta}\eta}{2}\Big)\eta||\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})||^{2},   (47)

which bounds the GD update $\bar{\mathcal{L}}(\boldsymbol{\theta}^{\prime})$ in terms of $\bar{\mathcal{L}}(\boldsymbol{\theta})$. Now, let us bound $\bar{\mathcal{L}}(\boldsymbol{\theta}^{\prime})$ in terms of the optimal objective value $\bar{\mathcal{L}}(\boldsymbol{\theta}_{\star})$. Using Assumption 1, we have

\bar{\mathcal{L}}(\boldsymbol{\theta}_{\star})\geq\bar{\mathcal{L}}(\boldsymbol{\theta})+\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})^{\textsf{T}}(\boldsymbol{\theta}_{\star}-\boldsymbol{\theta}),
\bar{\mathcal{L}}(\boldsymbol{\theta})\leq\bar{\mathcal{L}}(\boldsymbol{\theta}_{\star})+\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})^{\textsf{T}}(\boldsymbol{\theta}-\boldsymbol{\theta}_{\star}).   (48)

Furthermore, using $\eta\leq\frac{1}{\bar{\beta}}$, we have $-(1-\frac{\bar{\beta}\eta}{2})=\frac{1}{2}\bar{\beta}\eta-1\leq\frac{1}{2}\bar{\beta}(1/\bar{\beta})-1=\frac{1}{2}-1=-\frac{1}{2}$. Thus, (47) becomes

\bar{\mathcal{L}}(\boldsymbol{\theta}^{\prime})\leq\bar{\mathcal{L}}(\boldsymbol{\theta})-\frac{\eta}{2}||\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})||^{2}.   (49)

By plugging (48) into (49), we get

\bar{\mathcal{L}}(\boldsymbol{\theta}^{\prime})\leq\bar{\mathcal{L}}(\boldsymbol{\theta}_{\star})+\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})^{\textsf{T}}(\boldsymbol{\theta}-\boldsymbol{\theta}_{\star})-\frac{\eta}{2}||\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})||^{2},   (50)

which can be rewritten as

\bar{\mathcal{L}}(\boldsymbol{\theta}^{\prime})-\bar{\mathcal{L}}(\boldsymbol{\theta}_{\star})\leq\frac{1}{2\eta}\big(2\eta\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})^{\textsf{T}}(\boldsymbol{\theta}-\boldsymbol{\theta}_{\star})-\eta^{2}||\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})||^{2}\big).   (51)

By adding $\frac{1}{2\eta}(||\boldsymbol{\theta}-\boldsymbol{\theta}_{\star}||^{2}-||\boldsymbol{\theta}-\boldsymbol{\theta}_{\star}||^{2})$ to the right hand side of (51), we get

\bar{\mathcal{L}}(\boldsymbol{\theta}^{\prime})-\bar{\mathcal{L}}(\boldsymbol{\theta}_{\star})\leq\frac{1}{2\eta}\big(||\boldsymbol{\theta}-\boldsymbol{\theta}_{\star}||^{2}-||\boldsymbol{\theta}-\boldsymbol{\theta}_{\star}-\eta\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})||^{2}\big),   (52)
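For completeness, the step from (51) to (52) uses the expansion
||\boldsymbol{\theta}-\boldsymbol{\theta}_{\star}-\eta\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})||^{2}=||\boldsymbol{\theta}-\boldsymbol{\theta}_{\star}||^{2}-2\eta\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})^{\textsf{T}}(\boldsymbol{\theta}-\boldsymbol{\theta}_{\star})+\eta^{2}||\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})||^{2},
so that the added-and-subtracted term $||\boldsymbol{\theta}-\boldsymbol{\theta}_{\star}||^{2}$ collects the remaining terms of (51) into the negative of this quantity.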

Substituting the GD update $\boldsymbol{\theta}^{\prime}=\boldsymbol{\theta}-\eta\nabla\bar{\mathcal{L}}(\boldsymbol{\theta})$ into (52), we have

\bar{\mathcal{L}}(\boldsymbol{\theta}^{\prime})-\bar{\mathcal{L}}(\boldsymbol{\theta}_{\star})\leq\frac{1}{2\eta}\big(||\boldsymbol{\theta}-\boldsymbol{\theta}_{\star}||^{2}-||\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta}_{\star}||^{2}\big).   (53)

Now, replacing $\boldsymbol{\theta}^{\prime}$ by $\boldsymbol{\theta}_{i}$ and summing over $i=1,\dots,t$ yields

\sum_{i=1}^{t}\big(\bar{\mathcal{L}}(\boldsymbol{\theta}_{i})-\bar{\mathcal{L}}(\boldsymbol{\theta}_{\star})\big)\leq\sum_{i=1}^{t}\frac{1}{2\eta}\big(||\boldsymbol{\theta}_{i-1}-\boldsymbol{\theta}_{\star}||^{2}-||\boldsymbol{\theta}_{i}-\boldsymbol{\theta}_{\star}||^{2}\big)
=\frac{1}{2\eta}\big(||\boldsymbol{\theta}_{0}-\boldsymbol{\theta}_{\star}||^{2}-||\boldsymbol{\theta}_{t}-\boldsymbol{\theta}_{\star}||^{2}\big)\leq\frac{1}{2\eta}||\boldsymbol{\theta}_{0}-\boldsymbol{\theta}_{\star}||^{2},   (54)

where the summation on the right hand side telescopes since consecutive terms cancel each other. Since $\bar{\mathcal{L}}(\boldsymbol{\theta}_{t})$ is decreasing in $t$, we have

\bar{\mathcal{L}}(\boldsymbol{\theta}_{t})-\bar{\mathcal{L}}(\boldsymbol{\theta}_{\star})\leq\frac{1}{t}\sum_{i=1}^{t}\big(\bar{\mathcal{L}}(\boldsymbol{\theta}_{i})-\bar{\mathcal{L}}(\boldsymbol{\theta}_{\star})\big).   (55)

Inserting (54) into (55), we finally have

\bar{\mathcal{L}}(\boldsymbol{\theta}_{t})-\bar{\mathcal{L}}(\boldsymbol{\theta}_{\star})\leq\frac{1}{2\eta t}||\boldsymbol{\theta}_{0}-\boldsymbol{\theta}_{\star}||^{2}.   (56)

References

  • [1] R. W. Heath, N. González-Prelcic, S. Rangan, W. Roh, and A. M. Sayeed, “An overview of signal processing techniques for millimeter wave MIMO systems,” IEEE J. Sel. Topics Signal Process., vol. 10, no. 3, pp. 436–453, 2016.
  • [2] F. Rusek, D. Persson, B. K. Lau, E. G. Larsson, T. L. Marzetta, O. Edfors, and F. Tufvesson, “Scaling up MIMO: Opportunities and challenges with very large arrays,” IEEE Signal Process. Mag., vol. 30, no. 1, pp. 40–60, 2013.
  • [3] J. G. Andrews, S. Buzzi, W. Choi, S. V. Hanly, A. Lozano, A. C. K. Soong, and J. C. Zhang, “What will 5G be?” IEEE J. Sel. Areas Commun., vol. 32, no. 6, pp. 1065–1082, 2014.
  • [4] O. E. Ayach, S. Rajagopal, S. Abu-Surra, Z. Pi, and R. W. Heath, “Spatially sparse precoding in millimeter wave MIMO systems,” IEEE Trans. Wireless Commun., vol. 13, no. 3, pp. 1499–1513, 2014.
  • [5] Q. Wu and R. Zhang, “Towards Smart and Reconfigurable Environment: Intelligent Reflecting Surface Aided Wireless Network,” IEEE Commun. Mag., vol. 58, no. 1, pp. 106–112, January 2020.
  • [6] A. M. Elbir and K. V. Mishra, “A Survey of Deep Learning Architectures for Intelligent Reflecting Surfaces,” arXiv, Sep 2020. [Online]. Available: https://arxiv.org/abs/2009.02540v3
  • [7] C. Huang, S. Hu, G. C. Alexandropoulos, A. Zappone, C. Yuen, R. Zhang, M. Di Renzo, and M. Debbah, “Holographic MIMO Surfaces for 6G Wireless Networks: Opportunities, Challenges, and Trends,” IEEE Wireless Commun., vol. 27, no. 5, pp. 118–125, Jul 2020.
  • [8] C. Huang, A. Zappone, G. C. Alexandropoulos, M. Debbah, and C. Yuen, “Reconfigurable Intelligent Surfaces for Energy Efficiency in Wireless Communication,” IEEE Trans. Wireless Commun., vol. 18, no. 8, pp. 4157–4170, Jun 2019.
  • [9] C. Huang, R. Mo, and C. Yuen, “Reconfigurable Intelligent Surface Assisted Multiuser MISO Systems Exploiting Deep Reinforcement Learning,” IEEE J. Sel. Areas Commun., vol. 38, no. 8, pp. 1839–1850, Jun 2020.
  • [10] E. Björnson, L. Van der Perre, S. Buzzi, and E. G. Larsson, “Massive MIMO in sub-6 GHz and mmWave: Physical, practical, and use-case differences,” IEEE Wireless Commun., vol. 26, no. 2, pp. 100–108, 2019.
  • [11] A. Alkhateeb and R. W. Heath, “Frequency selective hybrid precoding for limited feedback millimeter wave systems,” IEEE Trans. Commun., vol. 64, no. 5, pp. 1801–1818, 2016.
  • [12] F. Sohrabi and W. Yu, “Hybrid analog and digital beamforming for mmWave OFDM large-scale antenna arrays,” IEEE Journal on Selected Areas in Communications, vol. 35, no. 7, pp. 1432–1443, 2017.
  • [13] A. Taha, M. Alrabeiah, and A. Alkhateeb, “Enabling large intelligent surfaces with compressive sensing and deep learning,” arXiv preprint arXiv:1904.10136, 2019.
  • [14] D. Fan, F. Gao, Y. Liu, Y. Deng, G. Wang, Z. Zhong, and A. Nallanathan, “Angle Domain Channel Estimation in Hybrid Millimeter Wave Massive MIMO Systems,” IEEE Trans. Wireless Commun., vol. 17, no. 12, pp. 8165–8179, Dec 2018.
  • [15] H. Yin, D. Gesbert, M. Filippou, and Y. Liu, “A Coordinated Approach to Channel Estimation in Large-Scale Multiple-Antenna Systems,” IEEE J. Sel. Areas Commun., vol. 31, no. 2, pp. 264–273, February 2013.
  • [16] P. Dong, H. Zhang, G. Y. Li, I. S. Gaspar, and N. NaderiAlizadeh, “Deep CNN-Based Channel Estimation for mmWave Massive MIMO Systems,” IEEE J. Sel. Topics Signal Process., vol. 13, no. 5, pp. 989–1000, Sep. 2019.
  • [17] H. Huang, J. Yang, H. Huang, Y. Song, and G. Gui, “Deep learning for super-resolution channel estimation and doa estimation based massive mimo system,” IEEE Trans. Veh. Technol., vol. 67, no. 9, pp. 8549–8560, Sept 2018.
  • [18] A. M. Elbir, A. Papazafeiropoulos, P. Kourtessis, and S. Chatzinotas, “Deep Channel Learning for Large Intelligent Surfaces Aided mm-Wave Massive MIMO Systems,” IEEE Wireless Commun. Lett., vol. 9, no. 9, pp. 1447–1451, 2020.
  • [19] A. M. Elbir, “CNN-based precoder and combiner design in mmWave MIMO systems,” IEEE Commun. Lett., vol. 23, no. 7, pp. 1240–1243, 2019.
  • [20] A. M. Elbir and K. V. Mishra, “Joint antenna selection and hybrid beamformer design using unquantized and quantized deep learning networks,” IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 1677–1688, March 2020.
  • [21] A. M. Elbir and A. Papazafeiropoulos, “Hybrid Precoding for Multi-User Millimeter Wave Massive MIMO Systems: A Deep Learning Approach,” IEEE Trans. Veh. Technol., vol. 69, no. 1, p. 552–563, 2020.
  • [22] A. M. Elbir, K. V. Mishra, and Y. C. Eldar, “Cognitive radar antenna selection via deep learning,” IET Radar, Sonar & Navigation, vol. 13, pp. 871–880, 2019.
  • [23] A. M. Elbir, “DeepMUSIC: Multiple Signal Classification via Deep Learning,” IEEE Sensors Letters, vol. 4, no. 4, pp. 1–4, 2020.
  • [24] M. M. Amiri and D. Gündüz, “Federated Learning Over Wireless Fading Channels,” IEEE Trans. Wireless Commun., vol. 19, no. 5, pp. 3546–3557, 2020.
  • [25] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated Learning: Challenges, Methods, and Future Directions,” IEEE Signal Process. Mag., vol. 37, no. 3, pp. 50–60, 2020.
  • [26] A. M. Elbir and S. Coleri, “Federated Learning for Vehicular Networks,” arXiv preprint arXiv:2006.01412, 2020.
  • [27] M. M. Wadu, S. Samarakoon, and M. Bennis, “Federated learning under channel uncertainty: Joint client scheduling and resource allocation,” arXiv preprint arXiv:2002.00802, 2020.
  • [28] T. Zeng, O. Semiari, M. Mozaffari, M. Chen, W. Saad, and M. Bennis, “Federated Learning in the Sky: Joint Power Allocation and Scheduling with UAV Swarms,” arXiv preprint arXiv:2002.08196, 2020.
  • [29] S. Batewela, C. Liu, M. Bennis, H. A. Suraweera, and C. S. Hong, “Risk-sensitive task fetching and offloading for vehicular edge computing,” IEEE Commun. Lett., vol. 24, no. 3, pp. 617–621, 2020.
  • [30] M. Mohammadi Amiri and D. Gündüz, “Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air,” IEEE Trans. Signal Process., vol. 68, pp. 2155–2169, 2020.
  • [31] A. M. Elbir and S. Coleri, “Federated Learning for Hybrid Beamforming in mm-Wave Massive MIMO,” IEEE Commun. Lett., pp. 1–1, 2020.
  • [32] A. M. Elbir, K. V. Mishra, M. R. B. Shankar, and B. Ottersten, “Online and Offline Deep Learning Strategies For Channel Estimation and Hybrid Beamforming in Multi-Carrier mm-Wave Massive MIMO Systems,” arXiv preprint arXiv:1912.10036, 2019.
  • [33] J. Yuan, H. Q. Ngo, and M. Matthaiou, “Machine Learning-Based Channel Prediction in Massive MIMO With Channel Aging,” IEEE Trans. Wireless Commun., vol. 19, no. 5, pp. 2960–2973, Feb 2020.
  • [34] H. Ye, G. Y. Li, and B. Juang, “Power of deep learning for channel estimation and signal detection in OFDM systems,” IEEE Commun. Lett., vol. 7, no. 1, pp. 114–117, 2018.
  • [35] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas, “Communication-Efficient Learning of Deep Networks from Decentralized Data,” arXiv, Feb 2016. [Online]. Available: https://arxiv.org/abs/1602.05629v3
  • [36] A. M. Elbir, “A Deep Learning Framework for Hybrid Beamforming Without Instantaneous CSI Feedback,” IEEE Trans. Veh. Technol., pp. 1–1, 2020.
  • [37] W. U. Bajwa, J. Haupt, G. Raz, and R. Nowak, “Compressed channel sensing,” in Annual Conference on Information Sciences and Systems, March 2008, pp. 5–10.
  • [38] Z. Marzi, D. Ramasamy, and U. Madhow, “Compressive Channel Estimation and Tracking for Large Arrays in mm-Wave Picocells,” IEEE J. Sel. Topics Signal Process., vol. 10, no. 3, pp. 514–527, April 2016.
  • [39] A. Klautau, P. Batista, N. González-Prelcic, Y. Wang, and R. W. Heath, “5G MIMO Data for Machine Learning: Application to Beam-Selection Using Deep Learning,” in 2018 Information Theory and Applications Workshop (ITA), 2018, pp. 1–9.
  • [40] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” in Advances in Neural Information Processing Systems, 2017, pp. 1709–1720.
  • [41] F. Ang, L. Chen, N. Zhao, Y. Chen, W. Wang, and F. R. Yu, “Robust Federated Learning With Noisy Communication,” IEEE Trans. Commun., vol. 68, no. 6, pp. 3452–3464, Mar 2020.
  • [42] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the Convergence of FedAvg on Non-IID Data,” in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=HJxNAnVtDS
  • [43] Y. Lecun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
  • [44] C. M. Bishop, “Training with Noise is Equivalent to Tikhonov Regularization,” Neural Comput., vol. 7, no. 1, pp. 108–116, Jan 1995.
  • [45] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A Joint Learning and Communications Framework for Federated Learning over Wireless Networks,” IEEE Trans. Wireless Commun., p. 1, Oct 2020.
  • [46] T. T. Vu, D. T. Ngo, N. H. Tran, H. Q. Ngo, M. N. Dao, and R. H. Middleton, “Cell-Free Massive MIMO for Wireless Federated Learning,” IEEE Trans. Wireless Commun., vol. 19, no. 10, pp. 6377–6392, Jun 2020.
  • [47] L. Wei, C. Huang, G. C. Alexandropoulos, C. Yuen, Z. Zhang, and M. Debbah, "Channel Estimation for RIS-Empowered Multi-User MISO Wireless Communications," arXiv, Aug 2020. [Online]. Available: https://arxiv.org/abs/2008.01459v1
  • [48] S. Lin, B. Zheng, G. C. Alexandropoulos, M. Wen, M. Di Renzo, and F. Chen, “Reconfigurable Intelligent Surfaces with Reflection Pattern Modulation: Beamforming Design and Performance Analysis,” IEEE Trans. Wireless Commun., p. 1, Oct 2020.
  • [49] D. Mishra and H. Johansson, “Channel Estimation and Low-complexity Beamforming Design for Passive Intelligent Surface Assisted MISO Wireless Energy Transfer,” ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4659–4663, Dec 2017.
  • [50] A. M. Elbir and K. V. Mishra, “Sparse array selection across arbitrary sensor geometries with deep transfer learning,” IEEE Trans. on Cogn. Commun. Netw., pp. 1–1, 2020.
Ahmet M. Elbir (IEEE Senior Member) received the Ph.D. degree from Middle East Technical University in 2016. He is a Senior Researcher at Duzce University, Duzce, Turkey, and a Research Fellow at the University of Hertfordshire, Hatfield, UK.
Sinem Coleri (IEEE Senior Member) received the Ph.D. degree from the University of California at Berkeley in 2005. She is a Faculty Member with the Department of Electrical and Electronics Engineering, Koc University, Turkey.