
Communication-Efficient Federated Learning over MIMO Multiple Access Channels

Yo-Seb Jeon, Mohammad Mohammadi Amiri, and Namyoon Lee Yo-Seb Jeon is with the Department of Electrical Engineering, POSTECH, Pohang, Gyeongbuk 37673, Republic of Korea (e-mail: [email protected])Namyoon Lee is with the School of Electrical Engineering, Korea University, Seoul, Republic of Korea (e-mail: [email protected]).Mohammad Mohammadi Amiri is with the Media Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA (e-mail: [email protected]).
Abstract

Communication efficiency is of great importance for wireless federated learning systems. In this paper, we propose a communication-efficient strategy for federated learning over multiple-input multiple-output (MIMO) multiple access channels (MACs). The proposed strategy comprises two components. When sending a locally computed gradient, each device compresses its high-dimensional local gradient into multiple lower-dimensional gradient vectors using block sparsification. When receiving a superposition of the compressed local gradients via a MIMO-MAC, a parameter server (PS) performs joint MIMO detection and sparse local-gradient recovery. Inspired by the turbo decoding principle, our joint detection-and-recovery algorithm accurately recovers the high-dimensional local gradients by iteratively exchanging beliefs between the MIMO detection and the sparse local-gradient recovery processes. We then analyze the reconstruction error of the proposed algorithm and its impact on the convergence rate of federated learning. Simulations show that our gradient compression and joint detection-and-recovery methods reduce the communication cost significantly while achieving almost the same classification accuracy as the case without any compression.

Index Terms:
Federated learning, distributed machine learning, gradient compression, gradient reconstruction, compressed sensing.

I Introduction

Federated learning is a distributed machine learning technique for training a global model at a parameter server (PS) through collaboration with wireless devices, each with its own local training data set, where the local data never leaves the devices [1, 2, 3, 4, 5]. In federated learning, each wireless device updates a local model based on its local training data set and then sends the local model update (e.g., a gradient vector of the local model) to the PS. The PS then updates a global model by aggregating the local model updates and broadcasts the updated global model to the devices. By repeating this collaboration between the PS and the devices, the PS is able to train the global model without directly accessing the devices’ data, thereby enhancing the privacy of data generated at the devices. Thanks to this advantage, federated learning has received a great deal of attention as a viable solution for privacy-sensitive machine learning applications at the wireless edge [3, 4, 5, 6, 7].

One of the primary research goals in federated learning is to improve communication efficiency in transmitting the local model updates from the devices. The primary reason is that the communication overhead required by these local updates increases with the number of parameters representing the global model, which is typically enormous. To reduce the communication overhead, gradient compression methods based on lossy compression of the local gradient vectors computed at the devices have been widely considered in the literature [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]. Two representative approaches for gradient compression are gradient quantization and gradient sparsification. In gradient quantization, local gradients are quantized and then transmitted using digital transmission [8, 9, 10, 11, 12, 13]. The most widely adopted technique is one-bit quantization, in which the device only transmits the sign of each local gradient entry. It is shown in [10] that aggregating the signs of different local gradient vectors based on majority voting achieves the same variance reduction as transmission without quantization. This scalar quantization technique is extended in [12, 13] to exploit vector quantization. In these works, a high-dimensional local gradient vector is partitioned and then quantized using vector quantizers such as lattice quantizers [12] and Grassmannian quantizers [13]. Gradient sparsification, in contrast, sparsifies the gradient by removing its less significant entries. Digital transmission with gradient sparsification is studied in [14, 15, 16] by adopting an encoding function to leverage the sparsity of the gradient vector. Analog transmission with gradient sparsification is developed in [17, 18] in which the sparsified gradient vector is compressed by projection onto a lower-dimensional space as in compressed sensing (CS) [20, 21]. Gradient compression that exploits the advantages of both gradient quantization and sparsification is also studied in [19] by leveraging the idea of quantized CS. It is reported in [19] that gradient compression based on quantized CS not only reduces communication overhead for federated learning but also mitigates gradient reconstruction error at the PS.

Federated learning over a wireless multiple-access channel (MAC) is another promising solution to reduce communication overhead, particularly when a large number of devices participate in federated learning [22, 23, 24, 25, 26, 27, 28]. In this solution, multiple devices simultaneously transmit their local model updates (e.g., local gradient vectors) via shared radio resources; thereby, the communication overhead required for the local update transmission does not increase with the number of devices. The most widely adopted approach for federated learning over the wireless MAC is over-the-air computation, which exploits the superposition property of the wireless MAC [22]. To further improve the reliability of communication under channel fading, federated learning over a wireless multiple-input multiple-output (MIMO) MAC has recently been studied by developing efficient multi-antenna transmission and/or reception techniques for the PS and/or the wireless devices [23, 24, 25, 26, 27, 28]. These studies demonstrate that the use of multiple antennas improves model recovery at the PS by mitigating channel fading and interference effects. For example, a multi-antenna technique proposed in [27] enables almost perfect reconstruction of the local gradient vectors when the PS is equipped with a massive number of antennas. One major challenge, however, is that the communication overhead in the wireless MIMO MAC is still considerable if no gradient compression is applied at the wireless devices. Moreover, despite the potential of both gradient compression and simultaneous transmission over the wireless MIMO MAC to reduce communication overhead, none of the existing studies has investigated communication-efficient federated learning that takes advantage of both techniques.

In this work, we study communication-efficient federated learning over a wireless MIMO MAC operating with gradient compression at the wireless devices. We particularly consider the analog gradient compression technique in [17, 18] in which a local gradient vector is sparsified and then projected onto a lower-dimensional space as in CS [20, 21]. Under this consideration, we reformulate the reconstruction problem of the local gradient vectors from uplink received signals at the PS into a large-scale CS problem with a double-sided linear transformation. To be specific, the PS must reconstruct a sparse matrix 𝐆{\bf G} from its noisy linear measurement 𝐘=𝐇𝐆𝖳𝐀𝖳+𝐙{\bf Y}={\bf H}{\bf G}^{\sf T}{\bf A}^{\sf T}+{\bf Z}, where 𝐆{\bf G} is a local gradient matrix whose columns are the local gradient vectors sent from different devices, 𝐇{\bf H} is a MIMO channel matrix, 𝐀{\bf A} is a random measurement matrix used for the compression, and 𝐙{\bf Z} is a channel noise matrix. Existing CS algorithms to solve this problem require significant computational complexity since the number of entries of 𝐆{\bf G} is proportional to both the number of devices and the number of global model parameters, which can be on the order of millions in federated learning applications. To overcome this limitation, we propose a computationally-efficient algorithm to solve the gradient reconstruction problem at the PS based on a divide-and-conquer strategy. We then analyze the reconstruction error of the proposed algorithm and its impact on the convergence rate of federated learning. Simulation results demonstrate that the proposed algorithm enables almost lossless reconstruction of the local gradient vectors at the PS while achieving a significant performance gain over the existing CS algorithms. The major contributions of this paper are summarized as follows:

  • We propose a computationally-efficient algorithm to solve the gradient reconstruction problem at the PS for enabling communication-efficient federated learning over a wireless MIMO MAC. Our key observation is that the gradient reconstruction problem in the form of 𝐘=𝐇𝐆𝖳𝐀𝖳+𝐙{\bf Y}={\bf H}{\bf G}^{\sf T}{\bf A}^{\sf T}+{\bf Z} can be decomposed into parallel small-scale CS recovery problems if a compressed gradient matrix 𝐗=𝐀𝐆{\bf X}={\bf A}{\bf G} is explicitly estimated from MIMO detection. Inspired by this observation, the proposed algorithm first estimates the compressed gradient matrix based on the minimum-mean-square-error (MMSE) MIMO detection, then solves small-scale CS recovery problems in parallel by employing the expectation-maximization generalized-approximate-message-passing (EM-GAMP) algorithm in [29]. A major drawback of our divide-and-conquer strategy is that MIMO detection error propagates into the subsequent CS recovery process. To alleviate this drawback, the proposed algorithm also leverages a turbo decoding principle that allows the MIMO detection and the CS recovery processes to iteratively exchange their beliefs on the estimate of the compressed gradient matrix. One prominent feature of the proposed algorithm is that the accuracy of the gradient reconstruction at the PS improves as the iteration continues because, in each iteration, the belief from uplink received signals at the PS is complemented by the sparse property of the local gradient vectors.

  • We analyze the reconstruction error of the proposed algorithm and its impact on the convergence rate of federated learning. In this analysis, we model each local gradient vector as an independent and identically distributed (IID) random vector whose distribution is assumed to be known at the PS a priori. Under this consideration, we characterize a mean-square-error (MSE) upper bound in reconstructing a global gradient vector when employing the proposed gradient reconstruction algorithm. Furthermore, our analysis demonstrates that the reconstruction error reduces as the gradient compression ratio at the devices decreases and the orthogonality of the channels from the devices to the PS increases. We also characterize the convergence rate of federated learning operating with a stochastic gradient descent (SGD) algorithm. Our result shows that if the reconstruction error at the PS is below a certain level, federated learning converges to a stationary point of a smooth loss function at the rate of 𝒪(1T)\mathcal{O}\big{(}\frac{1}{\sqrt{T}}\big{)}, where TT is the number of total iterations of the SGD algorithm. The trade-off between communication overhead and the convergence rate of federated learning is also captured in our analysis.

  • Using simulations, we evaluate the performance of federated learning over a wireless MIMO MAC for an image classification task using the MNIST dataset [30]. These simulations compare the classification accuracy and the normalized MSE of the proposed gradient reconstruction algorithm with those of the existing CS recovery algorithms under the same compression ratio and a similar computational complexity. Simulation results demonstrate that the proposed algorithm achieves almost the same classification accuracy as perfect reconstruction while providing superior performance over the existing CS algorithms. It is also shown that the proposed algorithm is more robust to an increase in the gradient compression ratio than the other algorithms. We also investigate the effect of the number of turbo iterations and the sparsification level of the gradient compression on the performance of the proposed algorithm. Based on these simulation results, we demonstrate that using the proposed algorithm at the PS enables accurate reconstruction of the local gradient vectors while effectively reducing the communication overhead by taking advantage of the gradient compression and simultaneous transmission over the wireless MIMO MAC.

Notation

Upper-case and lower-case boldface letters denote matrices and column vectors, respectively. 𝔼[]\mathbb{E}[\cdot] is the statistical expectation, ()\mathbb{P}(\cdot) is the probability, and ()𝖳(\cdot)^{\sf T} is the transpose. |𝒜||\mathcal{A}| is the cardinality of set 𝒜\mathcal{A}. (𝐚)i({\bf a})_{i} represents the ii-th entry of vector 𝐚{\bf a}. 𝐚=𝐚𝖳𝐚\|{\bf a}\|\!=\!\sqrt{{\bf a}^{\sf T}{\bf a}} is the Euclidean norm of a real vector 𝐚{\bf a}. 𝒩(𝝁,𝐑)\mathcal{N}({\bm{\mu}},{\bf R}) represents a multivariate Gaussian distribution with mean vector 𝝁{\bm{\mu}} and covariance matrix 𝐑{\bf R}, while 𝒩(𝐱;𝝁,𝐑)\mathcal{N}({\bf x};{\bm{\mu}},{\bf R}) denotes the probability density function of a Gaussian random vector 𝐱{\bf x} with mean vector 𝝁{\bm{\mu}} and covariance matrix 𝐑{\bf R}. 𝟎n{\bf 0}_{n} and 𝟏n{\bf 1}_{n} are nn-dimensional vectors whose elements are zero and one, respectively.

Figure 1: An illustration of uplink communication of federated learning over a wireless MIMO multiple access channel.

II System Model

In this section, we present a federated learning framework considered in our work. We then describe uplink transmission and reception procedures in federated learning over a wireless MIMO MAC.

II-A Federated Learning Framework

We consider federated learning over a wireless MIMO MAC in which a PS equipped with UU antennas trains a global model (e.g., deep neural network) by collaborating with KK single-antenna wireless devices, as illustrated in Fig. 1. We assume that the global model is fully represented by a parameter vector 𝐰N¯{\bf w}\in\mathbb{R}^{\bar{N}}, where N¯\bar{N} is the number of the parameters (e.g., the weights and biases of a deep neural network). We also assume that the training data samples to optimize the global model are available only at the wireless devices and differ across the devices. Let 𝒟k\mathcal{D}_{k} be a local training data set available at device kk, which consists of |𝒟k||\mathcal{D}_{k}| training data samples. A local loss function at device k𝒦={1,,K}k\in\mathcal{K}=\{1,\ldots,K\} for the parameter vector 𝐰{\bf w} is defined as

Fk(𝐰)=1|𝒟k|𝐮𝒟kf(𝐰;𝐮),\displaystyle F_{k}({\bf w})=\frac{1}{|\mathcal{D}_{k}|}\sum_{{\bf u}\in\mathcal{D}_{k}}f({\bf w};{\bf u}), (1)

where f(𝐰;𝐮)f({\bf w};{\bf u}) is a loss function computed with a training data sample 𝐮𝒟k{\bf u}\in\mathcal{D}_{k} for the parameter vector 𝐰{\bf w}. Then a global loss function for the parameter vector 𝐰{\bf w} is represented as

F(𝐰)=1j=1K|𝒟j|𝐮j𝒟jf(𝐰;𝐮)=1j=1K|𝒟j|k=1K|𝒟k|Fk(𝐰).\displaystyle F({\bf w})=\frac{1}{\sum_{j=1}^{K}|\mathcal{D}_{j}|}\sum_{{\bf u}\in\cup_{j}\mathcal{D}_{j}}f({\bf w};{\bf u})=\frac{1}{\sum_{j=1}^{K}|\mathcal{D}_{j}|}\sum_{k=1}^{K}|\mathcal{D}_{k}|F_{k}({\bf w}). (2)

We focus on a scenario where a gradient-based algorithm (e.g., SGD algorithm) is adopted to optimize the parameter vector for minimizing the global loss function in (2). We denote TT as the number of total iterations of the optimization algorithm, and 𝐰tN¯{\bf w}_{t}\in{\mathbb{R}}^{\bar{N}} as the parameter vector at iteration tt. Each iteration of the optimization algorithm corresponds to one communication round consisting of a pair of downlink and uplink communications. During the downlink communication, the PS broadcasts the current parameter vector to the wireless devices. Then, during the uplink communication, the wireless devices transmit the gradient of the parameter vector computed based on their local training data sets to the server. The major focus of our work is to investigate uplink transmission and reception strategies at the server for federated learning; thereby, for simplicity of the analysis, we assume that the downlink communication is error-free (the framework under consideration and the proposed approach can be readily extended to a noisy downlink scenario [35]). Under this assumption, all the devices have a globally consistent parameter vector 𝐰t{\bf w}_{t} for all t{1,,T}t\in\{1,\ldots,T\}. In what follows, we describe uplink transmission at the wireless devices and uplink reception at the PS during each uplink communication.

II-B Uplink Transmission at Wireless Devices

For the uplink transmission, each device computes a local gradient vector for the current parameter vector based on its local training data set. A local gradient vector computed at device kk is given by

Fk(t)(𝐰t)=1|𝒟k(t)|𝐮𝒟k(t)f(𝐰t;𝐮),\displaystyle\nabla F_{k}^{(t)}({\bf w}_{t})=\frac{1}{|\mathcal{D}_{k}^{(t)}|}\sum_{{\bf u}\in\mathcal{D}_{k}^{(t)}}\nabla f({\bf w}_{t};{\bf u}), (3)

where 𝒟k(t)𝒟k\mathcal{D}_{k}^{(t)}\subset\mathcal{D}_{k} is a mini-batch randomly drawn from 𝒟k\mathcal{D}_{k}, and \nabla is a gradient operator. We assume that the mini-batch size at device kk is fixed as Dkmini=|𝒟k(t)|D_{k}^{\rm mini}=|\mathcal{D}_{k}^{(t)}| for all t{1,,T}t\in\{1,\ldots,T\}. Since direct transmission of the local gradient vector in (3) imposes large communication overhead when N¯1\bar{N}\gg 1, we assume that each device applies lossy compression to its local gradient vector before uplink transmission. Details of the gradient compression technique considered in our work will be elaborated in Sec. III-A. Let 𝐱k(t)=[xk(t)[1],,xk(t)[M¯]]𝖳M¯{\bf x}_{k}^{(t)}=\big{[}x_{k}^{(t)}[1],\cdots,x_{k}^{(t)}[\bar{M}]\big{]}^{\sf T}\in\mathbb{R}^{\bar{M}} be a compressed gradient vector determined at device kk with M¯<N¯\bar{M}<\bar{N}. Then an uplink signal vector transmitted from device kk at iteration tt is denoted by 𝐱~k(t)=Pk(t)𝐱k(t)\tilde{\bf x}_{k}^{(t)}=\sqrt{P_{k}^{(t)}}{\bf x}_{k}^{(t)}, where Pk(t)P_{k}^{(t)} is a power scaling factor for device kk at iteration tt. In particular, the power scaling factor is set to be Pk(t)=M¯𝐱k(t)2P_{k}^{(t)}=\frac{\bar{M}}{\|{\bf x}_{k}^{(t)}\|^{2}}, so that an average power of transmitting each entry of the uplink signal vector becomes one (i.e., 𝔼[|(𝐱~k(t))m|2]=1\mathbb{E}[|(\tilde{\bf x}_{k}^{(t)})_{m}|^{2}]=1, m\forall m). All KK devices simultaneously transmit their uplink signal vectors over a wireless MAC by utilizing M¯\bar{M} shared radio resources that are orthogonal in either the time or the frequency domain. We assume that the power scaling factor is separately conveyed to the PS in an error-free fashion because the overhead to transmit this scalar value is negligible compared to communication overhead to transmit the compressed gradient vector.

II-C Uplink Reception at Parameter Server

Once the uplink transmission of the wireless devices ends, the PS reconstructs the local gradient vectors from uplink received signals. Let 𝐱(t)[m]=[x1(t)[m],x2(t)[m],,xK(t)[m]]𝖳{\bf x}^{(t)}[m]=\big{[}x_{1}^{(t)}[m],x_{2}^{(t)}[m],\cdots,x_{K}^{(t)}[m]\big{]}^{\sf T} be the aggregation of the compressed gradient entries sent by KK devices at the mm-th resource, where xk[m]x_{k}[m] is the mm-th element of 𝐱k(t){\bf x}_{k}^{(t)}. Then an uplink received vector at the PS for the mm-th resource is given by

𝐲(t)[m]=𝐇(t)(𝐏(t))1/2𝐱(t)[m]+𝐳(t)[m]=𝐇~(t)𝐱(t)[m]+𝐳(t)[m],\displaystyle{\bf y}^{(t)}[m]={\bf H}^{(t)}({\bf P}^{(t)})^{1/2}{\bf x}^{(t)}[m]+{\bf z}^{(t)}[m]=\tilde{\bf H}^{(t)}{\bf x}^{(t)}[m]+{\bf z}^{(t)}[m], (4)

where 𝐏(t)=𝖽𝗂𝖺𝗀(P1(t),,PK(t)){\bf P}^{(t)}={\sf diag}\big{(}P_{1}^{(t)},\ldots,P_{K}^{(t)}\big{)} is a power allocation matrix, 𝐇(t)U×K{\bf H}^{(t)}\in\mathbb{R}^{U\times K} is a MIMO channel matrix from KK devices to the server, 𝐇~(t)𝐇(t)(𝐏(t))1/2\tilde{\bf H}^{(t)}\triangleq{\bf H}^{(t)}({\bf P}^{(t)})^{1/2}, and 𝐳(t)[m]𝒩(𝟎U,σ2𝐈U){\bf z}^{(t)}[m]\sim\mathcal{N}({\bf 0}_{U},\sigma^{2}{\bf I}_{U}) is a Gaussian noise vector. We assume that the channel matrix remains constant within one communication round (i.e., block-fading assumption) and is perfectly known at the PS via uplink channel estimation. Based on the received signal vectors {𝐲(t)[m]}m\{{\bf y}^{(t)}[m]\}_{m}, the PS reconstructs the local gradient vectors, {𝐠k(t)}k=1K\{{\bf g}_{k}^{(t)}\}_{k=1}^{K}, sent from all the KK wireless devices. Details of the gradient reconstruction algorithm developed in our work will be elaborated in Sec. IV.
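The per-resource observation model in (4) can be simulated directly; the sketch below is a minimal NumPy illustration under assumed toy dimensions (U = 16 antennas, K = 8 devices) and a Gaussian channel model, none of which are specified by the paper.

import numpy as np

rng = np.random.default_rng(1)
U, K, sigma2 = 16, 8, 0.01                      # antennas, devices, noise variance (assumptions)
H = rng.normal(size=(U, K))                      # channel matrix H^{(t)} (toy model)
P = np.diag(rng.uniform(0.5, 2.0, size=K))       # power allocation matrix P^{(t)}
H_tilde = H @ np.sqrt(P)                         # effective channel H~ = H P^{1/2}

x_m = rng.normal(size=K)                         # compressed gradient entries of the K devices
z_m = rng.normal(scale=np.sqrt(sigma2), size=U)  # Gaussian noise z^{(t)}[m]
y_m = H_tilde @ x_m + z_m                        # received vector y^{(t)}[m] as in (4)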

Let 𝐠^k(t)\hat{\bf g}_{k}^{(t)} be the reconstruction of the local gradient vector 𝐠k(t){\bf g}_{k}^{(t)} at the PS. Then the PS computes a global gradient vector from the reconstructed local gradient vectors as follows:

𝐠^avg(t)=k=1Kρk𝐠^k(t),\displaystyle\hat{\bf g}_{\rm avg}^{(t)}=\sum_{k=1}^{K}\rho_{k}\hat{\bf g}_{k}^{(t)}, (5)

where ρkDkminij=1KDjmini\rho_{k}\triangleq\frac{D_{k}^{\rm mini}}{\sum_{j=1}^{K}D_{j}^{\rm mini}} is assumed to be known at the PS a priori. The global gradient vector in (5) is utilized to update the parameter vector 𝐰t{\bf w}_{t}. For example, if the PS adopts a gradient descent algorithm to optimize the parameter vector, the corresponding update rule is given by

𝐰t+1𝐰tηt𝐠^avg(t),\displaystyle{\bf w}_{t+1}\leftarrow{\bf w}_{t}-\eta_{t}\hat{\bf g}_{\rm avg}^{(t)}, (6)

where ηt>0\eta_{t}>0 is a learning rate at iteration tt.
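A minimal sketch of the aggregation step (5) and the gradient-descent update (6) at the PS is given below, assuming NumPy arrays; the function name ps_update and the array shapes are our own choices.

import numpy as np

def ps_update(w_t, g_hat, D_mini, eta_t):
    """Weighted aggregation (5) followed by the gradient-descent step (6).

    w_t    : current parameter vector, shape (N_bar,)
    g_hat  : reconstructed local gradients, shape (K, N_bar)
    D_mini : mini-batch sizes D_k^mini of the K devices
    """
    rho = np.asarray(D_mini) / np.sum(D_mini)   # rho_k = D_k^mini / sum_j D_j^mini
    g_avg = rho @ np.asarray(g_hat)             # global gradient estimate, eq. (5)
    return w_t - eta_t * g_avg                  # updated parameter vector, eq. (6)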

III Communication-Efficient Federated Learning over MIMO MAC

One major problem of the uplink transmission in federated learning over a wireless MIMO MAC is that direct transmission of the local gradient vector in (3) imposes considerable communication overhead when the number of the model parameters is large (i.e., N¯1\bar{N}\gg 1). To alleviate this problem, in this section, we present a gradient compression technique for wireless devices, which takes advantage of both gradient compression and simultaneous transmission over the wireless MIMO MAC. We then formulate a gradient reconstruction problem at the PS when employing the gradient compression technique at the devices and shed light on the challenges of solving this problem.

III-A Gradient Compression at Wireless Devices

The key idea of the gradient compression technique is to sparsify a local gradient vector in a block-wise fashion and then compress the sparsified vector as in CS [20, 21]. This technique is a simple variant of the gradient compression technique in [17, 18, 19]. Our compression technique consists of two steps, block sparsification and dimensionality reduction, as elaborated below.

Block Sparsification: The first step of our compression technique is to divide the local gradient vector into BB sub-vectors with dimension N=N¯BN=\frac{\bar{N}}{B}, then sparsify each sub-vector by dropping some gradient entries with small magnitudes. Meanwhile, to compensate for information loss caused by the gradient dropping, the dropped gradient entries are stored at each device and then added to the local gradient vector at the next iteration. Under the above strategy, the local gradient vector that needs to be sent by device kk at iteration tt is determined as

𝐠¯k(t)=Fk(t)(𝐰t)+𝚫k(t),\displaystyle\bar{\bf g}_{k}^{(t)}=\nabla F_{k}^{(t)}({\bf w}_{t})+{\bm{\Delta}}_{k}^{(t)}, (7)

where 𝚫k(t){\bm{\Delta}}_{k}^{(t)} is a residual gradient vector stored by device kk before iteration tt. Then the bb-th local gradient sub-vector at device kk is defined as

𝐠¯k,b(t)\displaystyle\bar{\bf g}_{k,b}^{(t)} =[g¯k,k,b(1)(t),,g¯k,k,b(N)(t)]𝖳,\displaystyle=\big{[}\bar{g}_{k,\mathcal{I}_{k,b}(1)}^{(t)},\cdots,\bar{g}_{k,\mathcal{I}_{k,b}(N)}^{(t)}\big{]}^{\sf T}, (8)

where g¯k,n(t)\bar{g}_{k,n}^{(t)} is the nn-th entry of 𝐠¯k(t)\bar{\bf g}_{k}^{(t)}, and k,b¯{1,,N¯}\mathcal{I}_{k,b}\subset\bar{\mathcal{I}}\triangleq\{1,\ldots,\bar{N}\} is a set of parameter indices associated with the bb-th sub-vector at device kk. We assume that k,1,,k,B\mathcal{I}_{k,1},\ldots,\mathcal{I}_{k,B} are mutually exclusive subsets of ¯\bar{\mathcal{I}} such that b=1Bk,b=¯\bigcup_{b=1}^{B}\mathcal{I}_{k,b}=\bar{\mathcal{I}} and known a priori at the PS. After constructing BB sub-vectors, each local gradient sub-vector is sparsified by setting all but the top-SS entries with the largest magnitudes as zeros, where SS is a sparsification level such that SNS\ll N. We define SratioS/NS_{\rm ratio}\triangleq S/N as a sparsification ratio. As a result, device kk attains BB local gradient sub-vectors {𝐠k,b(t)}\{{\bf g}_{k,b}^{(t)}\}, each of which is an SS-sparse vector. We denote the block sparsification step described above as 𝖡𝗅𝗈𝖼𝗄𝖲𝗉𝖺𝗋𝗌𝖾(𝐠¯k(t)){\sf BlockSparse}\big{(}\bar{\bf g}_{k}^{(t)}\big{)}. In the sequel, we will show that increasing the number of the gradient sub-vectors, BB, reduces the computational complexity of the gradient reconstruction process at the PS. However, if BB is set too large, the SBSB non-zero entries in {𝐠k,b(t)}\{{\bf g}_{k,b}^{(t)}\} can significantly differ from the top-SBSB entries of 𝐠¯k(t)\bar{\bf g}_{k}^{(t)}; this mismatch may lead to a decrease in the performance of federated learning. Therefore, in practical systems, BB can be determined according to a target learning performance and the computing power of the PS.

After the block sparsification, the residual gradient vector for the next iteration, 𝚫k(t+1){\bm{\Delta}}_{k}^{(t+1)}, is updated as

𝚫k(t+1)𝐠¯k(t)𝖢𝗈𝗇𝖼𝖺𝗍𝖾𝗇𝖺𝗍𝖾({𝐠k,b(t)}b=1B),\displaystyle{\bm{\Delta}}_{k}^{(t+1)}\leftarrow\bar{\bf g}_{k}^{(t)}-{\sf Concatenate}\big{(}\{{\bf g}_{k,b}^{(t)}\}_{b=1}^{B}\big{)}, (9)

where 𝖢𝗈𝗇𝖼𝖺𝗍𝖾𝗇𝖺𝗍𝖾({𝐠k,b(t)}b=1B)=[gk,1cc,,gk,N¯cc]𝖳N¯{\sf Concatenate}\big{(}\{{\bf g}_{k,b}^{(t)}\}_{b=1}^{B}\big{)}=[g_{k,1}^{\rm cc},\cdots,g_{k,\bar{N}}^{\rm cc}]^{\sf T}\in\mathbb{R}^{\bar{N}} such that gk,k,b(n)cc=(𝐠k,b(t))ng_{k,\mathcal{I}_{k,b}(n)}^{\rm cc}=({\bf g}_{k,b}^{(t)})_{n} for all b,nb,n.
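The block sparsification and error-feedback steps above (Steps 5–7 of Algorithm 1) can be sketched as follows; this is a minimal NumPy illustration that assumes a simple contiguous index partition for k,b\mathcal{I}_{k,b}, whereas the paper only requires the subsets to be mutually exclusive and known at the PS.

import numpy as np

def block_sparse(g_bar, B, S):
    """Split g_bar into B sub-vectors of length N = N_bar/B, keep the top-S magnitudes
    in each sub-vector, and return the sparse sub-vectors and the residual (9)."""
    N_bar = g_bar.size
    N = N_bar // B
    subvecs, concat = [], np.zeros(N_bar)
    for b in range(B):
        block = g_bar[b * N:(b + 1) * N]
        sparse = np.zeros(N)
        keep = np.argsort(np.abs(block))[-S:]    # indices of the top-S magnitudes
        sparse[keep] = block[keep]
        subvecs.append(sparse)
        concat[b * N:(b + 1) * N] = sparse       # Concatenate({g_{k,b}})
    residual = g_bar - concat                    # Delta_k^{(t+1)} in (9)
    return subvecs, residual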

Dimensionality Reduction: The second step of our compression technique is to compress each sparse sub-vector by random projection onto a lower dimensional space as in CS [20, 21]. Let 𝐱k,b(t){\bf x}_{k,b}^{(t)} be a compressed gradient sub-vector generated from 𝐠k,b(t){\bf g}_{k,b}^{(t)}. Then 𝐱k,b(t){\bf x}_{k,b}^{(t)} is determined as

𝐱k,b(t)=𝐀(t)𝐠k,b(t),\displaystyle{\bf x}_{k,b}^{(t)}={\bf A}^{(t)}{\bf g}_{k,b}^{(t)}, (10)

where 𝐀(t)M×N{\bf A}^{(t)}\in\mathbb{R}^{M\times N} is a random measurement matrix with M<NM<N at iteration tt. In this work, we set 𝐀(t){\bf A}^{(t)} as an IID random matrix with (𝐀(t))m,n𝒩(0,1/M)({\bf A}^{(t)})_{m,n}\sim\mathcal{N}(0,1/M) which provides promising properties (e.g., restricted isometry property, null space property) for guaranteeing accurate recovery of a sparse signal [21]. Furthermore, 𝐀(t){\bf A}^{(t)} is randomly and independently drawn for each iteration, in order to avoid a null-space problem caused by a fixed random measurement matrix, as discussed in [31].

By concatenating these BB compressed sub-vectors, a compressed gradient vector for device kk at iteration tt is finally obtained as

𝐱k(t)=[(𝐱k,1(t))𝖳,,(𝐱k,B(t))𝖳]𝖳M¯.\displaystyle{\bf x}_{k}^{(t)}=\big{[}({\bf x}_{k,1}^{(t)})^{\sf T},\cdots,({\bf x}_{k,B}^{(t)})^{\sf T}\big{]}^{\sf T}\in\mathbb{R}^{\bar{M}}. (11)

The dimension of the compressed vector is M¯=BM\bar{M}=BM; thereby, the compression ratio of our technique is R=NM>1R=\frac{N}{M}>1. The overall procedure of our gradient compression technique is illustrated in Fig. 1 and summarized in Steps 5–8 of Algorithm 1.
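A minimal NumPy sketch of the dimensionality-reduction step (10)–(11) is given below under toy dimensions of our own choosing (B = 10, N = 1000, S = 40, M = 200, i.e., R = 5); the measurement matrix has IID N(0, 1/M) entries as specified above, and the synthetic sparse sub-vectors are placeholders.

import numpy as np

def compress(subvecs, M, rng):
    """Project each S-sparse sub-vector onto M < N dimensions with a shared A^{(t)}."""
    N = subvecs[0].size
    A = rng.normal(scale=np.sqrt(1.0 / M), size=(M, N))   # (A^{(t)})_{m,n} ~ N(0, 1/M)
    x_blocks = [A @ g_b for g_b in subvecs]                # x_{k,b} = A^{(t)} g_{k,b}, eq. (10)
    return np.concatenate(x_blocks), A                     # x_k in (11), dimension M_bar = B*M

# Toy usage with synthetic S-sparse sub-vectors (placeholders, not computed gradients).
rng = np.random.default_rng(2)
B, N, S, M = 10, 1000, 40, 200
subvecs = []
for _ in range(B):
    g_b = np.zeros(N)
    support = rng.choice(N, size=S, replace=False)
    g_b[support] = rng.normal(size=S)
    subvecs.append(g_b)
x_k, A = compress(subvecs, M, rng)
print(x_k.size)   # M_bar = B * M = 2000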

A significant advantage of our gradient compression technique is that it requires RR times lower communication overhead than the direct transmission of the local gradient vector, which utilizes N¯\bar{N} radio resources per device. Another prominent advantage is that the communication overhead of our technique does not increase with the number of devices participating in federated learning because all the devices utilize the same resources available in the wireless MAC. Therefore, our compression technique effectively reduces the communication overhead of federated learning with a large number of devices.

Algorithm 1 Communication-Efficient Federated Learning Framework

Input: 𝐰1N¯{\bf w}_{1}\in\mathbb{R}^{\bar{N}}, 𝐀M×N{\bf A}\in\mathbb{R}^{M\times N}, {k,b}k,b\{\mathcal{I}_{k,b}\}_{k,b}
    
Output: 𝐰TN¯{\bf w}_{T}\in\mathbb{R}^{\bar{N}}

1:  Initialize 𝚫k(1)=𝟎N¯{\bf\Delta}_{k}^{(1)}={\bf 0}_{\bar{N}}, k\forall k.
2:  for t=1t=1 to TT do
3:     At the wireless devices:
4:     for Each device k𝒦k\in\mathcal{K} do
5:        𝐠¯k(t)=Fk(t)(𝐰t)+𝚫k(t)\bar{\bf g}_{k}^{(t)}=\nabla F_{k}^{(t)}\big{(}{\bf w}_{t}\big{)}+{\bf\Delta}_{k}^{(t)}.
6:        {𝐠k,b(t)}b=1B=𝖡𝗅𝗈𝖼𝗄𝖲𝗉𝖺𝗋𝗌𝖾(𝐠¯k(t))\{{\bf g}_{k,b}^{(t)}\}_{b=1}^{B}={\sf BlockSparse}\big{(}\bar{\bf g}_{k}^{(t)}\big{)}.
7:        𝚫k(t+1)=𝐠¯k(t)𝖢𝗈𝗇𝖼𝖺𝗍𝖾𝗇𝖺𝗍𝖾({𝐠k,b(t)}b=1B){\bf\Delta}_{k}^{(t+1)}=\bar{\bf g}_{k}^{(t)}-{\sf Concatenate}\big{(}\{{\bf g}_{k,b}^{(t)}\}_{b=1}^{B}\big{)}.
8:        𝐱k,b(t)=𝐀(t)𝐠k,b(t){\bf x}_{k,b}^{(t)}={\bf A}^{(t)}{\bf g}_{k,b}^{(t)}, b\forall b.
9:        Transmit 𝐱~k(t)=Pk(t)𝐱k(t)\tilde{\bf x}_{k}^{(t)}=\sqrt{P_{k}^{(t)}}{\bf x}_{k}^{(t)} to the PS via M¯\bar{M} shared radio resources.
10:     end for
11:     At the parameter server:
12:     Construct 𝐘b(t){\bf Y}_{b}^{(t)} from {𝐲(t)[m]}m=1M¯\big{\{}{\bf y}^{(t)}[m]\big{\}}_{m=1}^{\bar{M}}, b\forall b.
13:     {𝐠^k,b(t)}k𝒦=𝖦𝗋𝖺𝖽𝖱𝖾𝖼𝗈𝗇𝗌𝗍(𝐘b(t))\{\hat{\bf g}_{k,b}^{(t)}\}_{k\in\mathcal{K}}={\sf GradReconst}\big{(}{\bf Y}_{b}^{(t)}\big{)} from Algorithm 3, b\forall b.
14:     𝐠^k(t)=𝖢𝗈𝗇𝖼𝖺𝗍𝖾𝗇𝖺𝗍𝖾({𝐠^k,b(t)}b=1B)\hat{\bf g}_{k}^{(t)}={\sf Concatenate}\big{(}\{\hat{\bf g}_{k,b}^{(t)}\}_{b=1}^{B}\big{)}, k\forall k.
15:     𝐠^avg(t)=k=1Kρk𝐠^k(t)\hat{\bf g}_{\rm avg}^{(t)}=\sum_{k=1}^{K}\rho_{k}\hat{\bf g}_{k}^{(t)}.
16:     𝐰t+1=𝐰tηt𝐠^avg(t){\bf w}_{t+1}={\bf w}_{t}-\eta_{t}\hat{\bf g}_{\rm avg}^{(t)}.
17:     Broadcast 𝐰t+1{\bf w}_{t+1} to the wireless devices.
18:  end for

Remark 1 (Digital Transmission Strategy): In this work, we focus only on analog transmission at the wireless devices in which the uplink signal vector 𝐱~k(t)\tilde{\bf x}_{k}^{(t)} is transmitted without digital modulation. Nevertheless, digital transmission of the uplink signal vector is also feasible by utilizing a digital symbol modulator as a scalar quantizer. When employing our gradient compression technique, the distribution of the uplink signal vector can be modeled as 𝐱~k(t)𝒩(𝟎M¯,𝐈M¯)\tilde{\bf x}_{k}^{(t)}\sim\mathcal{N}({\bf 0}_{\bar{M}},{\bf I}_{\bar{M}}) for large NN, which will be justified in Sec. IV-B. Consequently, a complex-valued signal vector, defined as 𝐱~c,k(t)=(𝐱~k(t))1:M¯/2+j(𝐱~k(t))M¯/2+1:M¯\tilde{\bf x}_{{\rm c},k}^{(t)}=\big{(}\tilde{\bf x}_{k}^{(t)}\big{)}_{1:\bar{M}/2}+j\big{(}\tilde{\bf x}_{k}^{(t)}\big{)}_{\bar{M}/2+1:\bar{M}}, follows 𝐱~c,k(t)𝒞𝒩(𝟎M¯/2,0.5𝐈M¯/2)\tilde{\bf x}_{{\rm c},k}^{(t)}\sim\mathcal{CN}({\bf 0}_{\bar{M}/2},0.5{\bf I}_{\bar{M}/2}) for large NN. This fact implies that each entry of 𝐱~c,k(t)\tilde{\bf x}_{{\rm c},k}^{(t)} can be effectively modulated using quadrature amplitude modulation (QAM). Therefore, our gradient compression technique can be compatible with commercial digital communication systems by utilizing the above digital transmission strategy.
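As an illustration of the real-to-complex mapping described in Remark 1, a minimal NumPy sketch is given below; the function name real_to_complex is ours, and the subsequent QAM mapping of the complex entries is intentionally left out since the paper does not specify a particular constellation.

import numpy as np

def real_to_complex(x_tilde):
    """Map a real uplink vector of even length M_bar to M_bar/2 complex symbols:
    the first half forms the in-phase part and the second half the quadrature part."""
    M_bar = x_tilde.size
    assert M_bar % 2 == 0
    return x_tilde[:M_bar // 2] + 1j * x_tilde[M_bar // 2:]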

III-B Gradient Reconstruction at Parameter Server

When employing the compression technique in Sec. III-A, the PS faces a new challenge in enabling accurate reconstruction of the global gradient vector in (5). The underlying reason is that the local gradient vectors are not only compressed, but also simultaneously sent via a wireless MIMO MAC that causes different channel fading across devices. To shed light on this challenge, we formulate a gradient reconstruction problem at the PS when employing the compression technique in Sec. III-A.

Let b{(b1)M+1,,bM}\mathcal{M}_{b}\triangleq\{(b-1)M+1,\ldots,bM\} be a set of resource indices associated with the transmission of the bb-th compressed sub-vector. Then an uplink received signal matrix associated with the bb-th compressed sub-vector is given by 𝐘b(t)=[𝐲(t)[b(1)],,𝐲(t)[b(M)]]{\bf Y}_{b}^{(t)}=\left[{\bf y}^{(t)}[\mathcal{M}_{b}(1)],\cdots,{\bf y}^{(t)}[\mathcal{M}_{b}(M)]\right]. From (4) and (11), the uplink received signal matrix 𝐘b(t){\bf Y}_{b}^{(t)} is rewritten as

𝐘b(t)=𝐇~(t)(𝐗b(t))𝖳+𝐙b(t)=(a)𝐇~(t)(𝐆b(t))𝖳(𝐀(t))𝖳+𝐙b(t),\displaystyle{\bf Y}_{b}^{(t)}=\tilde{\bf H}^{(t)}({\bf X}_{b}^{(t)})^{\sf T}+{\bf Z}_{b}^{(t)}\overset{(a)}{=}\tilde{\bf H}^{(t)}({\bf G}_{b}^{(t)})^{\sf T}({\bf A}^{(t)})^{\sf T}+{\bf Z}_{b}^{(t)}, (12)

where 𝐗b(t)=[𝐱1,b(t),,𝐱K,b(t)]{\bf X}_{b}^{(t)}=\big{[}{\bf x}_{1,b}^{(t)},\cdots,{\bf x}_{K,b}^{(t)}\big{]}, 𝐙b(t)=[𝐳(t)[b(1)],,𝐳(t)[b(M)]]{\bf Z}_{b}^{(t)}=\left[{\bf z}^{(t)}[\mathcal{M}_{b}(1)],\cdots,{\bf z}^{(t)}[\mathcal{M}_{b}(M)]\right], 𝐆b(t)=[𝐠1,b(t),,𝐠K,b(t)]{\bf G}_{b}^{(t)}=\big{[}{\bf g}_{1,b}^{(t)},\cdots,{\bf g}_{K,b}^{(t)}\big{]}, and the equality (a) holds because 𝐗b(t)=𝐀(t)𝐆b(t){\bf X}_{b}^{(t)}={\bf A}^{(t)}{\bf G}_{b}^{(t)} from (10). Note that 𝐆b(t){\bf G}_{b}^{(t)} is an SKSK-sparse matrix because 𝐠1,b(t),,𝐠K,b(t){\bf g}_{1,b}^{(t)},\ldots,{\bf g}_{K,b}^{(t)} are SS-sparse vectors by the sparsification step during the gradient compression. Therefore, the gradient reconstruction problem reduces to BB parallel CS recovery problems, each of which aims at finding an SKSK-sparse matrix 𝐆b(t)N×K{\bf G}_{b}^{(t)}\in\mathbb{R}^{N\times K} from a noisy linear measurement 𝐘b(t)U×M{\bf Y}_{b}^{(t)}\in\mathbb{R}^{U\times M}. The overall procedures of gradient reconstruction and parameter update at the PS are illustrated in Fig. 1 and also summarized in Steps 12–17 in Algorithm 1, where 𝖦𝗋𝖺𝖽𝖱𝖾𝖼𝗈𝗇𝗌𝗍(){\sf GradReconst}(\cdot) denotes the gradient reconstruction algorithm chosen by the PS.

III-C Limitations of Conventional Approaches

The CS recovery problem in the form of (12) has also been studied for various applications [32, 33, 34]. The simplest approach to solving this problem is to convert the received signal matrix in (12) into an equivalent vector form:

vec(𝐘b(t))=vec(𝐇~(t)(𝐆b(t))𝖳(𝐀(t))𝖳)+vec(𝐙b(t))=(𝐀(t)𝐇~(t))vec((𝐆b(t))𝖳)+vec(𝐙b(t)).\displaystyle{\rm vec}\big{(}{\bf Y}_{b}^{(t)}\big{)}={\rm vec}\big{(}\tilde{\bf H}^{(t)}({\bf G}_{b}^{(t)})^{\sf T}({\bf A}^{(t)})^{\sf T}\big{)}+{\rm vec}\big{(}{\bf Z}_{b}^{(t)}\big{)}=({\bf A}^{(t)}\otimes\tilde{\bf H}^{(t)}){\rm vec}\big{(}({\bf G}_{b}^{(t)})^{\sf T}\big{)}+{\rm vec}\big{(}{\bf Z}_{b}^{(t)}\big{)}. (13)

Based on this form, an existing compressed sensing algorithm can be applied to recover a sparse vector vec((𝐆b(t))𝖳)NK{\rm vec}\big{(}({\bf G}_{b}^{(t)})^{\sf T}\big{)}\in\mathbb{R}^{NK} from its noisy linear measurement vec(𝐘b(t))MU{\rm vec}\big{(}{\bf Y}_{b}^{(t)}\big{)}\in\mathbb{R}^{MU}. This approach, however, requires tremendous computational complexity because the dimensions of both vec((𝐆b(t))𝖳){\rm vec}\big{(}({\bf G}_{b}^{(t)})^{\sf T}\big{)} and vec(𝐘b(t)){\rm vec}\big{(}{\bf Y}_{b}^{(t)}\big{)} are extremely large in many federated learning applications. For example, Table I in Sec. VI shows that the orthogonal matching pursuit (OMP) algorithm to solve the Kronecker-form problem in (13) requires a complexity order of 6×10156\times 10^{15} in our simulation with B=10B=10, R=5R=5, and S/N=0.04S/N=0.04. Another existing approach is to directly solve a matrix-form problem in (12) by modifying conventional CS algorithms such as OMP [32, 33]. This approach, however, still requires significant computational complexity that comes from large-scale matrix multiplications. For example, Table I in Sec. VI shows that the OMP algorithm to solve the matrix-form problem in (12) requires a complexity order of 4×10154\times 10^{15} in our simulation with B=10B=10, R=5R=5, and S/N=0.04S/N=0.04. To overcome the limitations of the existing CS approaches, in the following section, we will present a new low-complexity approach to solve the CS recovery problem in (12).
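The equivalence between the matrix form (12) and the Kronecker-structured vector form (13) can be checked numerically; the following NumPy sketch (toy dimensions of our own choosing, column-major vectorization) verifies that vec(H~ G^T A^T) equals (A ⊗ H~) vec(G^T).

import numpy as np

rng = np.random.default_rng(3)
U, K, M, N = 4, 3, 5, 8                                    # toy dimensions (assumptions)
H_tilde = rng.normal(size=(U, K))
A = rng.normal(size=(M, N))
G = rng.normal(size=(N, K))

lhs = (H_tilde @ G.T @ A.T).flatten(order="F")             # vec(Y_b) without noise
rhs = np.kron(A, H_tilde) @ (G.T).flatten(order="F")       # (A ⊗ H~) vec(G^T) as in (13)
print(np.allclose(lhs, rhs))                                # True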

IV Proposed Gradient Reconstruction Algorithm

When employing the gradient compression technique in Sec. III-A, it is essential to develop a computationally-efficient solution for reconstructing the global gradient vector at the PS while minimizing reconstruction error. To achieve this goal, in this section, we propose a novel low-complexity algorithm for solving the gradient reconstruction problem in Sec. III-B, on the basis of a divide-and-conquer strategy combined with a turbo decoding principle. In the rest of this section, we omit the index tt for ease of notation.

IV-A Basic Principle: Divide-and-Conquer Strategy

The basic principle of the proposed algorithm is to decompose the gradient reconstruction problem in (12) into KK small-scale CS recovery problems by applying MIMO detection to explicitly estimate a compressed gradient matrix 𝐗b{\bf X}_{b}. To be more specific, the output of the MIMO detection applied to detect 𝐗b{\bf X}_{b} from a received signal matrix in (12) can be expressed as

𝐗^b=𝐗b+𝐍b,\displaystyle\hat{\bf X}_{b}={\bf X}_{b}+{\bf N}_{b}, (14)

where 𝐍b{\bf N}_{b} is a MIMO detection error. Then the kk-th column of 𝐗^b\hat{\bf X}_{b}, namely 𝐱^k,b\hat{\bf x}_{k,b}, is expressed as

𝐱^k,b=𝐱k,b+𝐧k,b=𝐀𝐠k,b+𝐧k,b,\displaystyle\hat{\bf x}_{k,b}={\bf x}_{k,b}+{\bf n}_{k,b}={\bf A}{\bf g}_{k,b}+{\bf n}_{k,b}, (15)

where 𝐧k,b{\bf n}_{k,b} is the kk-th column of 𝐍b{\bf N}_{b}. Since every local gradient sub-vector sent by each device is an SS-sparse vector as explained in Sec. III-A, reconstructing 𝐠k,bN{\bf g}_{k,b}\in\mathbb{R}^{N} from 𝐱^k,bM\hat{\bf x}_{k,b}\in\mathbb{R}^{M} is exactly a noisy CS recovery problem which finds an NN-dimensional sparse vector from an MM-dimensional observation. Solving this problem in parallel for all k𝒦k\in\mathcal{K} and b{1,,B}b\in\{1,\ldots,B\} requires much smaller complexity than directly solving the CS recovery problem in (12) which finds a KNKN-dimensional unknown vector from a UMUM-dimensional observation.
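The following sketch illustrates this divide-and-conquer idea in NumPy under our own simplifications: a pseudo-inverse (zero-forcing style) detector stands in for the MMSE detector of Module A, and the per-column outputs are the small-scale CS observations in (15) that a sparse recovery solver would then process.

import numpy as np

def split_into_cs_problems(Y_b, H_tilde):
    """Detect X_b from Y_b = H~ X_b^T + Z_b, then return one noisy CS observation per device.
    NOTE: a pseudo-inverse detector is used here purely for illustration; the paper's
    Module A uses MMSE detection with prior information."""
    X_hat_T = np.linalg.pinv(H_tilde) @ Y_b      # estimate of X_b^T, shape (K, M)
    X_hat = X_hat_T.T                            # estimate of X_b, shape (M, K)
    # Column k is x_hat_{k,b} = A g_{k,b} + n_{k,b} as in (15); each column defines an
    # independent small-scale sparse recovery problem.
    return [X_hat[:, k] for k in range(X_hat.shape[1])]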

Despite the advantage in reducing the computational complexity, a major drawback of our strategy is that error in the MIMO detection (e.g., 𝐍b{\bf N}_{b} in (14)) propagates into the subsequent CS recovery process. To alleviate this drawback, we leverage a turbo decoding principle that allows the MIMO detection and the CS recovery processes to exchange their beliefs about the entries of the compressed gradient matrix. The proposed algorithm based on the above principle is illustrated in Fig. 3.

Figure 2: Comparison between the empirical cumulative distribution function (CDF) of the local gradient entries sampled from simulation and the CDF of the Bernoulli Gaussian-mixture distribution.

IV-B Statistical Modeling

To enable the turbo decoding principle based on a Bayesian approach, we establish a statistical model for local gradient sub-vectors as well as for compressed gradient sub-vectors, as done in [19]. For statistical modeling of the local gradient sub-vectors, we focus on the fact that every local gradient sub-vector is sparse, while its non-zero entries are arbitrarily distributed. We also use the fact that the entries of each local gradient sub-vector are likely to be independent because they are associated with a randomly drawn subset of the model parameters, as explained in Sec. III-A. Motivated by these facts, we model each local gradient sub-vector as an IID random vector whose entry follows the Bernoulli Gaussian-mixture distribution which is well known for its suitability and generality for modeling a sparse random vector [29]. Note that the probability density function of the Bernoulli Gaussian-mixture distribution with parameter 𝜽=(λ0,{λ,μ,ϕ}=1L){\bm{\theta}}=\big{(}\lambda_{0},\{\lambda_{\ell},\mu_{\ell},\phi_{\ell}\}_{\ell=1}^{L}\big{)} is given by

𝒢(g;𝜽)=λ0δ(g)+=1Lλ2πϕexp((gμ)22ϕ),\displaystyle\mathcal{BG}(g;{\bm{\theta}})=\lambda_{0}\delta(g)+\sum_{\ell=1}^{L}\frac{\lambda_{\ell}}{\sqrt{2\pi\phi_{\ell}}}\exp\left(-\frac{(g-\mu_{\ell})^{2}}{2\phi_{\ell}}\right), (16)

where =0Lλ=1\sum_{\ell=0}^{L}\lambda_{\ell}=1. As can be seen in (16), the above distribution is suitable for modeling the local gradient sub-vector because the sparsity of this sub-vector is effectively captured by the Bernoulli model with parameter λ0\lambda_{0}, while the distribution of non-zero local gradient entries is well approximated by the Gaussian-mixture model with LL components. How to tune the parameter 𝜽{\bm{\theta}} for each gradient sub-vector will be discussed in Sec. IV-D.
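For concreteness, a minimal NumPy sketch of the Bernoulli Gaussian-mixture model in (16) is given below; the sampler bg_sample and the particular parameter values are our own illustrations, not taken from the paper.

import numpy as np

def bg_sample(n, lam0, lam, mu, phi, rng):
    """Draw n IID samples: zero with probability lam0, otherwise from Gaussian
    component l (mean mu[l], variance phi[l]) with probability lam[l]."""
    comp = rng.choice(len(lam) + 1, size=n, p=np.concatenate(([lam0], lam)))
    g = np.zeros(n)
    nz = comp > 0
    g[nz] = rng.normal(np.asarray(mu)[comp[nz] - 1],
                       np.sqrt(np.asarray(phi)[comp[nz] - 1]))
    return g

rng = np.random.default_rng(4)
g = bg_sample(10000, lam0=0.96, lam=[0.02, 0.02], mu=[0.1, -0.1], phi=[0.01, 0.01], rng=rng)
print(np.mean(g == 0.0))    # ≈ lam0, the fraction of exactly-zero (dropped) entries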

To further justify the Bernoulli Gaussian-mixture model, in Fig. 2, we compare the empirical cumulative distribution function (CDF) of the entries of the local gradient sub-vector obtained from simulation with the CDF of the Bernoulli Gaussian-mixture distribution whose parameters are determined by Algorithm 3 (in this simulation, we employ Algorithm 3 with R=3R=3 and T=30T=30; further details of the simulation are described in Sec. VI). Fig. 2 shows that the empirical CDF is very similar to the corresponding model-based CDF. This result implies that the local gradient sub-vector can be well modeled by an IID random vector with the Bernoulli Gaussian-mixture distribution.

On the basis of the Bernoulli Gaussian-mixture model, we also determine a statistical model for the compressed gradient entries. First of all, as can be seen from (10), the mm-th entry of the compressed gradient sub-vector 𝐱k,b{\bf x}_{k,b} is given by (𝐱k,b)m=n=1N(𝐀)m,n(𝐠k,b)n({\bf x}_{k,b})_{m}=\sum_{n=1}^{N}({\bf A})_{m,n}({\bf g}_{k,b})_{n}. This implies that xk[m]x_{k}[m] is nothing but the sum of NN independent random variables because the random measurement matrix 𝐀{\bf A} has IID entries, while the local gradient sub-vector 𝐠k,b{\bf g}_{k,b} can be modeled as an IID random vector. Therefore, for large NN, every compressed gradient entry xk[m]x_{k}[m] can be modeled as a Gaussian random variable by the central limit theorem.

Figure 3: Block diagram of the proposed gradient reconstruction algorithm applied to block bb.

IV-C Module A: MIMO Detection

In Module A of the proposed algorithm depicted in Fig. 3, the compressed gradient matrix 𝐗b{\bf X}_{b} is estimated from the uplink received signal matrix 𝐘b{\bf Y}_{b}. To minimize the MSE of the estimate of 𝐗b{\bf X}_{b}, we employ the MMSE MIMO detection based on a Gaussian model justified in Sec. IV-B. The input of Module A is the prior information about the mean and variance of xk[m]x_{k}[m], denoted by x^kpri[m]\hat{x}_{k}^{\rm pri}[m]\in\mathbb{R} and νxk[m]pri>0\nu_{x_{k}[m]}^{\rm pri}>0, respectively, k𝒦,mb\forall k\in\mathcal{K},m\in\mathcal{M}_{b}. This information is initialized at the beginning of the algorithm and is then passed from Module B during the turbo iterations of the algorithm. As discussed in Sec. IV-B, the prior distribution of 𝐱[m]{\bf x}[m] can be modeled as 𝐱[m]𝒩(𝐱^pri[m],𝐑𝐱[m]pri){\bf x}[m]\sim\mathcal{N}\big{(}\hat{\bf x}^{\rm pri}[m],{\bf R}_{{\bf x}[m]}^{\rm pri}\big{)} for large NN by the central limit theorem, where 𝐱^pri[m]=[x^1pri[m],,x^Kpri[m]]𝖳\hat{\bf x}^{\rm pri}[m]=\big{[}\hat{x}_{1}^{\rm pri}[m],\cdots,\hat{x}_{K}^{\rm pri}[m]\big{]}^{\sf T}, 𝐑𝐱[m]pri=𝖽𝗂𝖺𝗀(𝝂𝐱[m]pri){\bf R}_{{\bf x}[m]}^{\rm pri}={\sf diag}\big{(}{\bm{\nu}}_{{\bf x}[m]}^{\rm pri}\big{)}, and 𝝂𝐱[m]pri=[νx1[m]pri,,νxK[m]pri]𝖳{\bm{\nu}}_{{\bf x}[m]}^{\rm pri}=\big{[}\nu_{x_{1}[m]}^{\rm pri},\cdots,\nu_{x_{K}[m]}^{\rm pri}\big{]}^{\sf T}. Note that 𝐑𝐱[m]pri{\bf R}_{{\bf x}[m]}^{\rm pri} is diagonal because {xk[m]}k𝒦\{x_{k}[m]\}_{k\in\mathcal{K}} are associated with different parameters in the global model. Then, the posterior distribution of 𝐱[m]{\bf x}[m] for a given observation 𝐲[m]=𝐇~𝐱[m]+𝐯[m]{\bf y}[m]=\tilde{\bf H}{\bf x}[m]+{\bf v}[m] with 𝐯[m]𝒩(𝟎U,σ2𝐈U){\bf v}[m]\sim\mathcal{N}({\bf 0}_{U},\sigma^{2}{\bf I}_{U}) is determined as p(𝐱[m]|𝐲[m])=𝒩(𝐱[m];𝐱^post[m],𝐑𝐱[m]post)p\big{(}{\bf x}[m]\big{|}{\bf y}[m]\big{)}=\mathcal{N}({\bf x}[m];\hat{\bf x}^{\rm post}[m],{\bf R}_{{\bf x}[m]}^{\rm post}), where the posterior mean and covariance are computed as [36]

𝐱^post[m]\displaystyle\hat{\bf x}^{\rm post}[m] =𝐱^pri[m]+𝐑𝐱[m]pri𝐇~𝖳𝛀𝐱[m](𝐲[m]𝐇~𝐱^pri[m]),\displaystyle=\hat{\bf x}^{\rm pri}[m]+{\bf R}_{{\bf x}[m]}^{\rm pri}\tilde{\bf H}^{\sf T}{\bm{\Omega}}_{{\bf x}[m]}\big{(}{\bf y}[m]-\tilde{\bf H}\hat{\bf x}^{\rm pri}[m]\big{)}, (17)
𝐑𝐱[m]post\displaystyle{\bf R}_{{\bf x}[m]}^{\rm post} =𝐑𝐱[m]pri𝐑𝐱[m]pri𝐇~𝖳𝛀𝐱[m]𝐇~𝐑𝐱[m]pri,\displaystyle={\bf R}_{{\bf x}[m]}^{\rm pri}-{\bf R}_{{\bf x}[m]}^{\rm pri}\tilde{\bf H}^{\sf T}{\bm{\Omega}}_{{\bf x}[m]}\tilde{\bf H}{\bf R}_{{\bf x}[m]}^{\rm pri}, (18)

respectively, where 𝛀𝐱[m]=(𝐇~𝐑𝐱[m]pri𝐇~𝖳+σ2𝐈U)1{\bm{\Omega}}_{{\bf x}[m]}=\big{(}\tilde{\bf H}{\bf R}_{{\bf x}[m]}^{\rm pri}\tilde{\bf H}^{\sf T}+\sigma^{2}{\bf I}_{U}\big{)}^{-1}. The kk-th entry of 𝐱^post[m]\hat{\bf x}^{\rm post}[m], denoted by x^kpost[m]\hat{x}_{k}^{\rm post}[m], is the MMSE estimate of xk[m]x_{k}[m] for a given observation 𝐲[m]{\bf y}[m]. Meanwhile, the kk-th diagonal entry of 𝐑𝐱[m]post{\bf R}_{{\bf x}[m]}^{\rm post}, denoted by νxk[m]post\nu_{x_{k}[m]}^{\rm post}, represents the MSE between xk[m]x_{k}[m] and x^kpost[m]\hat{x}_{k}^{\rm post}[m].

After computing the MMSE estimates of the compressed gradient entries, Module A passes its knowledge about these values to Module B. As illustrated in Fig. 3, the prior information utilized in Module A has been passed from Module B; thereby, Module A extracts the extrinsic information by excluding the contribution of the prior information from the posterior information. In particular, the extrinsic distribution of xk[m]x_{k}[m] can be determined by assuming that its prior and posterior distributions are given by p(xk[m])=𝒩(xk[m];x^kpri[m],νxk[m]pri)p\big{(}x_{k}[m]\big{)}=\mathcal{N}(x_{k}[m];\hat{x}_{k}^{\rm pri}[m],\nu_{x_{k}[m]}^{\rm pri}) and p(xk[m]|𝐲[m])=𝒩(xk[m];x^kpost[m],νxk[m]post)p\big{(}x_{k}[m]\big{|}{\bf y}[m]\big{)}=\mathcal{N}(x_{k}[m];\hat{x}_{k}^{\rm post}[m],\nu_{x_{k}[m]}^{\rm post}), respectively. Then, by Bayes’ rule, the extrinsic distribution of xk[m]x_{k}[m] is given by p(𝐲[m]|xk[m])=𝒩(xk[m];x^kext[m],νxk[m]ext)p\big{(}{\bf y}[m]\big{|}x_{k}[m]\big{)}=\mathcal{N}(x_{k}[m];\hat{x}_{k}^{\rm ext}[m],\nu_{x_{k}[m]}^{\rm ext}), where the extrinsic mean and variance are computed as [37, 38]

x^kext[m]=x^kpost[m]νxk[m]prix^kpri[m]νxk[m]postνxk[m]priνxk[m]post,andνxk[m]ext=νxk[m]priνxk[m]postνxk[m]priνxk[m]post,\displaystyle\hat{x}_{k}^{\rm ext}[m]=\frac{\hat{x}_{k}^{\rm post}[m]\nu_{x_{k}[m]}^{\rm pri}-\hat{x}_{k}^{\rm pri}[m]\nu_{x_{k}[m]}^{\rm post}}{\nu_{x_{k}[m]}^{\rm pri}-\nu_{x_{k}[m]}^{\rm post}},~{}~{}\text{and}~{}~{}\nu_{x_{k}[m]}^{\rm ext}=\frac{\nu_{x_{k}[m]}^{\rm pri}\nu_{x_{k}[m]}^{\rm post}}{\nu_{x_{k}[m]}^{\rm pri}-\nu_{x_{k}[m]}^{\rm post}}, (19)

respectively. Then the values of {x^kext[m],νxk[m]ext}k,m\{\hat{x}_{k}^{\rm ext}[m],\nu_{x_{k}[m]}^{\rm ext}\}_{k,m} are passed to Module B which utilizes these values as the prior information about {xk[m]}k,m\{x_{k}[m]\}_{k,m}.

IV-D Module B: CS Recovery

In Module B of the proposed algorithm, each local gradient sub-vector is reconstructed from the prior information about the compressed gradient entries. To this end, the extrinsic mean and variance of xk[m]x_{k}[m] passed from Module A are harnessed as the prior mean and variance of xk[m]x_{k}[m], respectively, i.e., xk[m]𝒩(x^kpri[m],νxk[m]pri)x_{k}[m]\sim\mathcal{N}(\hat{x}_{k}^{\rm pri}[m],\nu_{x_{k}[m]}^{\rm pri}), where x^kpri[m]=x^kext[m]\hat{x}_{k}^{\rm pri}[m]=\hat{x}_{k}^{\rm ext}[m] and νxk[m]pri=νxk[m]ext\nu_{x_{k}[m]}^{\rm pri}=\nu_{x_{k}[m]}^{\rm ext}, k𝒦,mb\forall k\in\mathcal{K},m\in\mathcal{M}_{b}. Then the prior mean and covariance of 𝐱k,b{\bf x}_{k,b} are determined as 𝐱^k,bpri=[x^kpri[b(1)],,x^kpri[b(M)]]𝖳\hat{\bf x}_{k,b}^{\rm pri}=\big{[}\hat{x}_{k}^{\rm pri}[\mathcal{M}_{b}(1)],\cdots,\hat{x}_{k}^{\rm pri}[\mathcal{M}_{b}(M)]\big{]}^{\sf T} and 𝐑𝐱k,bpri=𝖽𝗂𝖺𝗀(𝝂𝐧k,bpri){\bf R}_{{\bf x}_{k,b}}^{\rm pri}={\sf diag}\big{(}{\bm{\nu}}_{{\bf n}_{k,b}}^{\rm pri}\big{)}, respectively, where 𝝂𝐧k,bpri=[νxk[b(1)]pri,,νxk[b(M)]pri]𝖳{\bm{\nu}}_{{\bf n}_{k,b}}^{\rm pri}=\big{[}{\nu}_{x_{k}[\mathcal{M}_{b}(1)]}^{\rm pri},\cdots,{\nu}_{x_{k}[\mathcal{M}_{b}(M)]}^{\rm pri}\big{]}^{\sf T}. Note that 𝐑𝐱k,bpri{\bf R}_{{\bf x}_{k,b}}^{\rm pri} is diagonal because xk[m1]x_{k}[m_{1}] and xk[m2]x_{k}[m_{2}] are associated with different parameters in the global model for m1m2bm_{1}\neq m_{2}\in\mathcal{M}_{b}. For large NN with the IID random measurement matrix 𝐀{\bf A} in (10), all the entries of 𝐱k,b{\bf x}_{k,b} are statistically identical by the central limit theorem. Motivated by this fact, we approximate the prior covariance as 𝐑𝐱k,bpriν¯𝐱k,bpri𝐈M{\bf R}_{{\bf x}_{k,b}}^{\rm pri}\approx\bar{\nu}_{{\bf x}_{k,b}}^{\rm pri}{\bf I}_{M}, where ν¯𝐱k,bpri1Mmbνxk[m]pri\bar{\nu}_{{\bf x}_{k,b}}^{\rm pri}\triangleq\frac{1}{M}\sum_{m\in\mathcal{M}_{b}}{\nu}_{x_{k}[m]}^{\rm pri}. Then, with some tedious manipulations from (17)–(19), the prior mean 𝐱^k,bpri\hat{\bf x}_{k,b}^{\rm pri} is rewritten as an additive white Gaussian noise (AWGN) observation of 𝐱k,b{\bf x}_{k,b}, i.e.,

𝐱^k,bpri=𝐱k,b+𝐧k,bpri=𝐀𝐠k,b+𝐧k,bpri,\displaystyle\hat{\bf x}_{k,b}^{\rm pri}={\bf x}_{k,b}+{\bf n}_{k,b}^{\rm pri}={\bf A}{\bf g}_{k,b}+{\bf n}_{k,b}^{\rm pri}, (20)

where 𝐧k,bpri𝒩(𝟎M,ν¯𝐱k,bpri𝐈M){\bf n}_{k,b}^{\rm pri}\sim\mathcal{N}\big{(}{\bf 0}_{M},\bar{\nu}_{{\bf x}_{k,b}}^{\rm pri}{\bf I}_{M}\big{)} is an effective noise independent of 𝐱k,b{\bf x}_{k,b}, as also derived in [38].

To reconstruct the local gradient sub-vector 𝐠k,b{\bf g}_{k,b} from the prior mean 𝐱^k,bpri\hat{\bf x}_{k,b}^{\rm pri} in (20), we need to solve a noisy CS recovery problem. Unfortunately, finding the exact MMSE solution to this problem is infeasible as the exact distribution of the local gradient sub-vector is unknown at the PS. To circumvent this challenge, we employ the EM-GAMP algorithm in [29] based on the statistical model in Sec. IV-B. This algorithm iteratively computes an approximate MMSE solution to the CS recovery problem by modeling the distribution of an unknown sparse vector using the Bernoulli Gaussian-mixture model; the parameters of this model are also updated iteratively based on the EM principle. As discussed in Sec. IV-B, the local gradient sub-vectors are well modeled by the Bernoulli Gaussian-mixture distribution; thereby, the EM-GAMP algorithm is a powerful means of solving our CS recovery problem in (20). This algorithm also computes the posterior mean and variance of each compressed gradient entry as a byproduct; thereby, it can be directly combined with the turbo decoding principle in the proposed algorithm. In Algorithm 2, we summarize the EM-GAMP algorithm employed in Module B of the proposed algorithm, where we omit the indices kk and bb for ease of notation.

The major steps in Algorithm 2 are described below. In Step 3, the pseudo-prior mean and variance of xmx_{m} are determined. In Step 4, the posterior mean and variance of xmx_{m} are computed based on (20) and xm𝒩(p^m,νpm)x_{m}\sim\mathcal{N}(\hat{p}_{m},{\nu}_{p_{m}}). In Step 6, the estimate of gng_{n} and the corresponding error variance are determined. In Step 7, the posterior mean and variance of gng_{n} are computed under the assumptions of gn𝒢(g;𝜽)g_{n}\sim\mathcal{BG}(g;{\bm{\theta}}) and r^n=gn+ξn\hat{r}_{n}=g_{n}+\xi_{n} with ξn𝒩(0,νrn)\xi_{n}\sim\mathcal{N}(0,{\nu}_{r_{n}}) as derived in [29], where βn,0=λ0𝒩(0;r^n,νrn)\beta_{n,0}=\lambda_{0}\mathcal{N}(0;\hat{r}_{n},\nu_{r_{n}}), βn,=λ𝒩(r^n;μ,νrn+ϕ)\beta_{n,\ell}=\lambda_{\ell}\mathcal{N}(\hat{r}_{n};\mu_{\ell},\nu_{r_{n}}+\phi_{\ell}),

λn,0\displaystyle\lambda_{n,0}^{\prime} =βn,0βn,0+i=1Lβn,i,λn,=βn,βn,0+i=1Lβn,i,μn,\displaystyle=\frac{\beta_{n,0}}{\beta_{n,0}+\sum_{i=1}^{L}\beta_{n,i}},~{}~{}\lambda_{n,\ell}^{\prime}=\frac{\beta_{n,\ell}}{\beta_{n,0}+\sum_{i=1}^{L}\beta_{n,i}},~{}~{}{\mu}_{n,\ell}^{\prime} =r^nϕ+μνrnνrn+ϕ,ϕn,=νrnϕνrn+ϕ,\displaystyle=\frac{\hat{r}_{n}\phi_{\ell}+\mu_{\ell}\nu_{r_{n}}}{\nu_{r_{n}}+\phi_{\ell}},~{}~{}{\phi}_{n,\ell}^{\prime}=\frac{\nu_{r_{n}}\phi_{\ell}}{\nu_{r_{n}}+\phi_{\ell}},

for all {1,,L}\ell\in\{1,\ldots,L\}. In Step 8, the parameters of the Bernoulli Gaussian-mixture model are computed based on the EM principle as derived in [29], where λ0′′=1Nn=1Nλn,0\lambda_{0}^{\prime\prime}=\frac{1}{N}\sum_{n=1}^{N}\lambda_{n,0}^{\prime},

λ′′1Nn=1Nλn,,μ′′n=1Nλn,μn,n=1Nλn,,ϕ′′n=1Nλn,(|μμn,|2+ϕn,)n=1Nλn,.\displaystyle\lambda_{\ell}^{\prime\prime}\approx\frac{1}{N}\sum_{n=1}^{N}\lambda_{n,\ell}^{\prime},~{}~{}\mu_{\ell}^{\prime\prime}\approx\frac{\sum_{n=1}^{N}\lambda_{n,\ell}^{\prime}\mu_{n,\ell}^{\prime}}{\sum_{n=1}^{N}\lambda_{n,\ell}^{\prime}},~{}~{}\phi_{\ell}^{\prime\prime}\approx\frac{\sum_{n=1}^{N}\lambda_{n,\ell}^{\prime}\left(|\mu_{\ell}-\mu_{n,\ell}^{\prime}|^{2}+\phi_{n,\ell}^{\prime}\right)}{\sum_{n=1}^{N}\lambda_{n,\ell}^{\prime}}. (21)

The asymptotic convergence of the EM-GAMP algorithm was studied in [39], in which the authors showed that under certain conditions, the performance of this algorithm asymptotically coincides with that of the original GAMP algorithm with perfect knowledge of the input distribution. We refer interested readers to [39] for more details. Meanwhile, the complexity order of the EM-GAMP algorithm is given by 𝒪(MN)\mathcal{O}(MN) in terms of the number of real multiplications (see Step 3 and Step 6 of Algorithm 2).

Algorithm 2 EM-GAMP Algorithm

function 𝖤𝖬𝖦𝖠𝖬𝖯(𝐠^,𝝂𝐠,𝜽=(λ0,{λ,μ,ϕ}=1L),𝐱^pri,ν¯𝐱pri){\sf EMGAMP}\big{(}\hat{\bf g},{\bm{\nu}}_{{\bf g}},{\bm{\theta}}=(\lambda_{0},\{\lambda_{\ell},\mu_{\ell},\phi_{\ell}\}_{\ell=1}^{L}),\hat{\bf x}^{\rm pri},\bar{\nu}_{{\bf x}}^{\rm pri}\big{)}

1:  Set s^m=0\hat{s}_{m}=0, g^nold=g^n\hat{g}_{n}^{\rm old}=\hat{g}_{n}, and Am,n=(𝐀)m,nA_{m,n}=({\bf A})_{m,n}, m,n\forall m,n.
2:  for i=1i=1 to IGAMPI_{\rm GAMP} do
3:     Set p^m=n=1NAm,ng^nνpms^m\hat{p}_{m}=\sum_{n=1}^{N}A_{m,n}\hat{g}_{n}-\nu_{p_{m}}\hat{s}_{m} and νpm=n=1N|Am,n|2νgn\nu_{p_{m}}=\sum_{n=1}^{N}|A_{m,n}|^{2}\nu_{g_{n}}, m\forall m.
4:     Compute x^post[m]=(p^mν¯𝐱pri+x^pri[m]νpm)/(νpm+ν¯𝐱pri)\hat{x}^{\rm post}[m]=({\hat{p}_{m}\bar{\nu}_{\bf x}^{\rm pri}+\hat{x}^{\rm pri}[m]\nu_{p_{m}}})/({\nu_{p_{m}}+\bar{\nu}_{\bf x}^{\rm pri}}) and νx[m]post=(1/νpm+1/ν¯𝐱pri)1\nu_{x[m]}^{\rm post}=\big{(}1/{\nu_{p_{m}}+1/\bar{\nu}_{\bf x}^{\rm pri}}\big{)}^{-1}, m\forall m.
5:     Set s^m=(x^post[m]p^m)/νpm\hat{s}_{m}=(\hat{x}^{\rm post}[m]-\hat{p}_{m})/\nu_{p_{m}} and νsm=(1νx[m]post/νpm)/νpm\nu_{s_{m}}=(1-\nu_{x[m]}^{\rm post}/\nu_{p_{m}})/\nu_{p_{m}}, m\forall m.
6:     Set r^n=g^n+νrnm=1MAm,ns^m\hat{r}_{n}=\hat{g}_{n}+\nu_{r_{n}}\sum_{m=1}^{M}A_{m,n}\hat{s}_{m} and νrn=(m=1M|Am,n|2νsm)1\nu_{r_{n}}=\big{(}\sum_{m=1}^{M}|A_{m,n}|^{2}\nu_{s_{m}}\big{)}^{-1}, n\forall n.
7:     Compute g^n==1Lλn,μn,\hat{g}_{n}=\sum_{\ell=1}^{L}\lambda_{n,\ell}^{\prime}{\mu}_{n,\ell}^{\prime} and νgn==1Lλn,(ϕn,+(μn,)2)(g^n)2\nu_{g_{n}}=\sum_{\ell=1}^{L}\lambda_{n,\ell}^{\prime}\big{(}{\phi}_{n,\ell}^{\prime}+({\mu}_{n,\ell}^{\prime})^{2}\big{)}-(\hat{g}_{n})^{2}, n\forall n.
8:     Update 𝜽(λ0′′,{λ′′,μ′′,ϕ′′}=1L){\bm{\theta}}\leftarrow(\lambda_{0}^{\prime\prime},\{\lambda_{\ell}^{\prime\prime},\mu_{\ell}^{\prime\prime},\phi_{\ell}^{\prime\prime}\}_{\ell=1}^{L}) from (21).
9:     Break if n=1N(g^noldg^n)2<τGAMPn=1N(g^nold)2\sum_{n=1}^{N}(\hat{g}_{n}^{\rm old}-\hat{g}_{n})^{2}<\tau_{\rm GAMP}\sum_{n=1}^{N}(\hat{g}_{n}^{\rm old})^{2}.
10:     Set g^nold=g^n\hat{g}_{n}^{\rm old}=\hat{g}_{n}, n\forall n.
11:  end for
12:  Set 𝐠^=[g^1,,g^N]𝖳\hat{\bf g}=[\hat{g}_{1},\cdots,\hat{g}_{N}]^{\sf T} and 𝝂𝐠=[νg1,,νgN]𝖳{\bm{\nu}}_{{\bf g}}=[\nu_{g_{1}},\cdots,\nu_{g_{N}}]^{\sf T}.
13:  Set 𝐱^post=[x^post[1],,x^post[M]]𝖳\hat{\bf x}^{\rm post}=[\hat{x}^{\rm post}[1],\cdots,\hat{x}^{\rm post}[M]]^{\sf T} and 𝝂𝐱post=[νx[1]post,,νx[M]post]𝖳{\bm{\nu}}_{\bf x}^{\rm post}=[\nu_{x[1]}^{\rm post},\cdots,\nu_{x[M]}^{\rm post}]^{\sf T}.
14:  return 𝐠^,𝝂𝐠,𝜽,𝐱^post,𝝂𝐱post\hat{\bf g},{\bm{\nu}}_{{\bf g}},{\bm{\theta}},\hat{\bf x}^{\rm post},{\bm{\nu}}_{{\bf x}}^{\rm post}.
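The Bernoulli Gaussian-mixture denoiser used in Step 7 of Algorithm 2 can be sketched as follows; this is a minimal NumPy illustration of the posterior mean and variance formulas above (the EM parameter update (21) and the surrounding GAMP message passing are omitted), with function names of our own choosing.

import numpy as np

def gaussian_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def bg_denoise(r_hat, nu_r, lam0, lam, mu, phi):
    """Posterior mean/variance of g_n given r_hat_n = g_n + N(0, nu_r_n) and the
    Bernoulli Gaussian-mixture prior (16); r_hat and nu_r are length-N arrays."""
    lam, mu, phi = map(np.asarray, (lam, mu, phi))
    beta0 = lam0 * gaussian_pdf(0.0, r_hat, nu_r)                              # beta_{n,0}
    beta = lam[None, :] * gaussian_pdf(r_hat[:, None], mu[None, :],
                                       nu_r[:, None] + phi[None, :])          # beta_{n,l}
    norm = beta0 + beta.sum(axis=1)
    lam_p = beta / norm[:, None]                                               # lambda'_{n,l}
    mu_p = (r_hat[:, None] * phi[None, :] + mu[None, :] * nu_r[:, None]) \
           / (nu_r[:, None] + phi[None, :])                                    # mu'_{n,l}
    phi_p = nu_r[:, None] * phi[None, :] / (nu_r[:, None] + phi[None, :])      # phi'_{n,l}
    g_hat = (lam_p * mu_p).sum(axis=1)                                         # Step 7 mean
    nu_g = (lam_p * (phi_p + mu_p ** 2)).sum(axis=1) - g_hat ** 2              # Step 7 variance
    return g_hat, nu_g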

Once the EM-GAMP algorithm applied to 𝐱^k,bpri\hat{\bf x}_{k,b}^{\rm pri} converges, Module B acquires not only the approximate MMSE estimate of 𝐠^k,b\hat{\bf g}_{k,b}, but also the posterior mean and variance of compressed gradient entries, given by {x^kpost[m],νxk[m]post}mb\{\hat{x}_{k}^{\rm post}[m],{\nu}_{x_{k}[m]}^{\rm post}\}_{m\in\mathcal{M}_{b}} (see Step 4 in Algorithm 2). As done in Module A, the extrinsic mean and variance, denoted by x^kext[m]\hat{x}_{k}^{\rm ext}[m] and νxk[m]ext{\nu}_{x_{k}[m]}^{\rm ext}, respectively, are computed from x^kpost[m]\hat{x}_{k}^{\rm post}[m] and νxk[m]post{\nu}_{x_{k}[m]}^{\rm post} according to (19), k𝒦,mb\forall k\in\mathcal{K},m\in\mathcal{M}_{b}. Then these values are passed to Module A which utilizes these values as the prior information about {xk[m]}k,m\{x_{k}[m]\}_{k,m}.

Algorithm 3 Proposed Gradient Reconstruction Algorithm

function ${\sf GradReconst}\big({\bf Y}_{b}\big)$

1: Initialize $\hat{\bf g}_{k,b}, {\bm \nu}_{{\bf g}_{k,b}}, {\bm \theta}_{k,b}, \hat{x}_{k}^{\rm pri}[m], \nu_{x_{k}[m]}^{\rm pri}$, $\forall k, m\in\mathcal{M}_{b}$.
2: for $i=1$ to $I_{\rm Turbo}$ do
3:   for $m\in\mathcal{M}_{b}$ do
4:     Set $\hat{\bf x}^{\rm pri}[m]=\big[\hat{x}_{1}^{\rm pri}[m],\cdots,\hat{x}_{K}^{\rm pri}[m]\big]^{\sf T}$ and ${\bf R}_{{\bf x}[m]}^{\rm pri}={\sf diag}\big(\big[\nu_{x_{1}[m]}^{\rm pri},\ldots,\nu_{x_{K}[m]}^{\rm pri}\big]\big)$.
5:     Compute $\hat{\bf x}^{\rm post}[m]$ from (17) and ${\bf R}_{{\bf x}[m]}^{\rm post}$ from (18).
6:   end for
7:   Update $\hat{x}_{k}^{\rm pri}[m]\leftarrow\hat{x}_{k}^{\rm ext}[m]$ and $\nu_{x_{k}[m]}^{\rm pri}\leftarrow\nu_{x_{k}[m]}^{\rm ext}$ from (19), $\forall k, m\in\mathcal{M}_{b}$.
8:   for $k=1$ to $K$ do
9:     Set $\hat{\bf x}_{k,b}^{\rm pri}=\big[\hat{x}_{k}^{\rm pri}[(b-1)M+1],\cdots,\hat{x}_{k}^{\rm pri}[bM]\big]^{\sf T}$ and $\bar{\nu}_{{\bf x}_{k,b}}^{\rm pri}=\frac{1}{M}\sum_{m\in\mathcal{M}_{b}}\nu_{x_{k}[m]}^{\rm pri}$.
10:    $\big\{\hat{\bf g}_{k,b}, {\bm \nu}_{{\bf g}_{k,b}}, {\bm \theta}_{k,b}, \hat{\bf x}_{k,b}^{\rm post}, {\bm \nu}_{{\bf x}_{k,b}}^{\rm post}\big\}={\sf EMGAMP}\big(\hat{\bf g}_{k,b}, {\bm \nu}_{{\bf g}_{k,b}}, {\bm \theta}_{k,b}, \hat{\bf x}_{k,b}^{\rm pri}, \bar{\nu}_{{\bf x}_{k,b}}^{\rm pri}\big)$.
11:   end for
12:   Update $\nu_{x_{k}[m]}^{\rm pri}\leftarrow\nu_{x_{k}[m]}^{\rm ext}$ and $\hat{x}_{k}^{\rm pri}[m]\leftarrow\hat{x}_{k}^{\rm ext}[m]$ from (19), $\forall k, m\in\mathcal{M}_{b}$.
13: end for
14: return $\hat{\bf g}_{1,b}, \hat{\bf g}_{2,b}, \ldots, \hat{\bf g}_{K,b}$.

IV-E Summary

In Algorithm 3, we summarize the operations of the proposed algorithm, which iterates between Module A and Module B based on the turbo decoding principle. We denote by $I_{\rm Turbo}$ the total number of iterations of the proposed algorithm. As can be seen in Algorithm 3, Module A estimates the compressed gradient entries from the observed uplink received signals, while Module B refines these estimates by exploiting their sparse structure. Since Module A and Module B utilize different sources of information for estimating the compressed gradient entries, the two modules complement each other by iteratively exchanging their extrinsic beliefs about these entries. Recall that reducing the estimation error of the compressed gradient entries decreases the effective noise in the CS recovery process for reconstructing the local gradient sub-vectors, as discussed in Sec. IV-A. Therefore, the use of the turbo decoding principle improves the reconstruction accuracy of the local gradient sub-vectors at the PS.

V Performance Analysis

In this section, we analyze the reconstruction error of the proposed algorithm and its impact on the convergence rate of federated learning over a wireless MIMO MAC.

V-A Reconstruction Error Analysis

In this analysis, we characterize the MSE of the global gradient vector reconstructed by the proposed algorithm. To this end, we model each local gradient sub-vector as an IID random vector, which can be justified using the Bernoulli Gaussian-mixture distribution, as discussed in Sec. IV-B. Meanwhile, we assume that the distribution of the local gradient sub-vectors can differ across devices and blocks, while the mean and variance of each gradient sub-vector are known at the PS. (To realize this assumption, the means and variances of the local gradient sub-vectors need to be conveyed from the devices to the PS; fortunately, the overhead of transmitting this information is negligible compared to that of transmitting the compressed gradient vectors.) Under these assumptions, the extrinsic means and variances passed from Module A of the proposed algorithm are characterized in the following lemma:

Lemma 1

Let $\hat{x}_{k}^{(t)}[m]$ be the extrinsic mean of $x_{k}^{(t)}[m]$ computed from Module A of the proposed algorithm for a given observation ${\bf y}^{(t)}[m]$. For large $N$, the extrinsic mean vector $\hat{\bf x}_{k,b}^{(t)}=\big[\hat{x}_{k}^{(t)}[\mathcal{M}_{b}(1)],\cdots,\hat{x}_{k}^{(t)}[\mathcal{M}_{b}(M)]\big]^{\sf T}$ is an AWGN observation of ${\bf x}_{k,b}^{(t)}$ with noise variance $\nu_{k,b}^{(t)}$, i.e.,

$\hat{\bf x}_{k,b}^{(t)}={\bf x}_{k,b}^{(t)}+{\bf n}_{k,b}^{(t)}$ with ${\bf n}_{k,b}^{(t)}\sim\mathcal{N}({\bf 0}_{M},\nu_{k,b}^{(t)}{\bf I}_{M})$,   (22)

where

$\nu_{k,b}^{(t)}=\frac{1}{{\sf SNR}_{k}^{(t)}}\Bigg(\|{\bf h}_{k}^{(t)}\|^{2}-\sum_{i=1}^{E_{k,b}^{(t)}}\big|\big({\bf u}_{k,b,i}^{(t)}\big)^{\sf T}{\bf h}_{k}^{(t)}\big|^{2}\frac{\lambda_{k,b,i}^{(t)}}{\lambda_{k,b,i}^{(t)}+\sigma^{2}}\Bigg)^{-1}$,   (23)

${\sf SNR}_{k}^{(t)}=\frac{P_{k}}{\sigma^{2}}$, $\lambda_{k,b,i}^{(t)}$ is the $i$-th largest eigenvalue of ${\bm \Phi}_{k,b}^{(t)}$, ${\bf u}_{k,b,i}^{(t)}$ is the eigenvector associated with $\lambda_{k,b,i}^{(t)}$, $E_{k,b}^{(t)}$ is the rank of ${\bm \Phi}_{k,b}^{(t)}$, ${\bm \Phi}_{k,b}^{(t)}=R\sum_{j\neq k}\zeta_{j,b}^{(t)}P_{j}^{(t)}{\bf h}_{j}^{(t)}\big({\bf h}_{j}^{(t)}\big)^{\sf T}$, ${\bf h}_{k}^{(t)}$ is the $k$-th column of ${\bf H}^{(t)}$, and $\zeta_{j,b}^{(t)}$ is the variance of each entry of ${\bf g}_{j,b}^{(t)}$.

Proof:

See Appendix A. ∎
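As an illustration, the following NumPy sketch evaluates $\nu_{k,b}^{(t)}$ in (23) from the eigendecomposition of ${\bm \Phi}_{k,b}^{(t)}$. The channel, power, and variance values in the example are arbitrary assumptions (only $U=64$, $K=32$, and $\sigma^{2}=1$ follow the simulation settings in Sec. VI), and the function name is ours.

```python
# A NumPy sketch evaluating the extrinsic noise variance in (23) for one device and block.
import numpy as np

def extrinsic_noise_variance(H, k, P, zeta_b, R, sigma2):
    """nu_{k,b} from (23): H is U-by-K, P and zeta_b are length-K vectors, for block b."""
    U, K = H.shape
    h_k = H[:, k]
    # Multi-device interference covariance Phi_{k,b} (rank at most K-1).
    Phi = np.zeros((U, U))
    for j in range(K):
        if j != k:
            Phi += R * zeta_b[j] * P[j] * np.outer(H[:, j], H[:, j])
    lam, Umat = np.linalg.eigh(Phi)                       # eigenvalues (ascending)
    proj = (Umat.T @ h_k)**2                              # |u_i^T h_k|^2
    shrink = np.sum(proj * lam / (lam + sigma2))          # zero eigenvalues contribute nothing
    snr_k = P[k] / sigma2
    return 1.0 / (snr_k * (h_k @ h_k - shrink))

# Example: nu decreases as device k's channel becomes more orthogonal to the others'.
rng = np.random.default_rng(0)
U, K, R, sigma2 = 64, 32, 5.0, 1.0
H = rng.standard_normal((U, K))
nu = extrinsic_noise_variance(H, k=0, P=np.ones(K), zeta_b=np.full(K, 1e-3), R=R, sigma2=sigma2)
```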

Utilizing the result in Lemma 1, we characterize an MSE upper bound for the global gradient vector reconstructed by the proposed algorithm. This result is given in the following theorem:

Theorem 1

In the asymptotic regime of $N\rightarrow\infty$ and $N/M\rightarrow R$ for some $R\geq 1$, the global gradient vector $\hat{\bf g}_{\rm avg}^{(t)}$ reconstructed by the proposed algorithm with $I_{\rm Turbo}=1$ satisfies the following MSE bound:

$\mathbb{E}\big[\|{\bf g}_{\rm avg}^{(t)}-\hat{\bf g}_{\rm avg}^{(t)}\|^{2}\big]\leq\sum_{k=1}^{K}\sum_{b=1}^{B}\rho_{k}^{2}N\zeta_{k,b}^{(t)}\left(1-\frac{\zeta_{k,b}^{(t)}}{R\zeta_{k,b}^{(t)}+\nu_{k,b}^{(t)}}\right)$,   (24)

where ${\bf g}_{\rm avg}^{(t)}=\sum_{k=1}^{K}\rho_{k}{\bf g}_{k}^{(t)}$ is the global gradient vector at iteration $t$.

Proof:

See Appendix B. ∎
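For illustration, the bound in (24) can be evaluated numerically as in the sketch below; the weights, variances, and noise levels are placeholder assumptions supplied by the caller, and the function name is ours.

```python
# A short sketch evaluating the MSE upper bound (24) from per-block variances and noise levels.
import numpy as np

def mse_upper_bound(rho, zeta, nu, N, R):
    """rho: (K,), zeta and nu: (K, B) arrays; returns the right-hand side of (24)."""
    zeta = np.asarray(zeta, dtype=float)
    nu = np.asarray(nu, dtype=float)
    per_block = zeta * (1.0 - zeta / (R * zeta + nu))        # elementwise over (K, B)
    return float(np.sum((np.asarray(rho)**2)[:, None] * N * per_block))

# Example: the bound shrinks as R -> 1 (less compression) or as nu -> 0 (better channels).
K, B, N = 4, 2, 1000
bound = mse_upper_bound(rho=np.full(K, 1 / K), zeta=np.full((K, B), 1e-3),
                        nu=np.full((K, B), 1e-4), N=N, R=3.0)
```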

Theorem 1 shows that the reconstruction error of the proposed algorithm decreases as the compression ratio $R$ approaches one (i.e., no compression). Since the communication overhead increases as the compression ratio decreases, Theorem 1 clearly reveals the trade-off between the communication overhead at the devices and the reconstruction accuracy at the PS in federated learning over a wireless MIMO MAC. Another important observation is that $\nu_{k,b}^{(t)}$ in (24) decreases not only as the signal-to-noise ratio (SNR) of device $k$ increases, but also as $\big|\big({\bf u}_{k,b,i}^{(t)}\big)^{\sf T}{\bf h}_{k}^{(t)}\big|^{2}$ decreases, as can be seen from (23). This implies that $\nu_{k,b}^{(t)}$ decreases with the orthogonality between the channel of device $k$ and the other devices' channels, because $\big|\big({\bf u}_{k,b,i}^{(t)}\big)^{\sf T}{\bf h}_{k}^{(t)}\big|^{2}\rightarrow 0$, $\forall i\in\{1,\ldots,E_{k,b}^{(t)}\}$, as $\big|\big({\bf h}_{j}^{(t)}\big)^{\sf T}{\bf h}_{k}^{(t)}\big|^{2}\rightarrow 0$, $\forall j\neq k$. Therefore, our analytical result demonstrates that the reconstruction error of the proposed algorithm depends on both the gradient compression ratio and the wireless environment (e.g., SNR and channel distribution).

V-B Convergence Rate Analysis

Now, we characterize the convergence rate of federated learning over the wireless MIMO MAC when employing an SGD algorithm. To this end, we make the following assumptions:

Assumption 1: The loss function $F({\bf w})$ is $\beta$-smooth and lower bounded by some constant $F({\bf w}^{\star})$, i.e., $F({\bf w})\geq F({\bf w}^{\star})$, $\forall{\bf w}\in\mathbb{R}^{\bar{N}}$.

Assumption 2: For a given parameter vector ${\bf w}_{t}$, the global gradient vector in (5) is unbiased, i.e., $\mathbb{E}\big[{\bf g}_{\rm avg}^{(t)}\big|{\bf w}_{t}\big]=\nabla F({\bf w}_{t})$, while its variance is upper bounded as $\mathbb{E}\big[\|{\bf g}_{\rm avg}^{(t)}-\nabla F({\bf w}_{t})\|^{2}\big|{\bf w}_{t}\big]\leq\psi\|\nabla F({\bf w}_{t})\|^{2}$ with a constant $\psi>0$, for all $t\in\{1,\ldots,T\}$.

Rationale: Assumptions 1–2 are typical for analyzing the convergence rate of the SGD algorithm for a smooth loss function, as considered in [10, 19]. In particular, the variance scaling factor $\psi$ depends on both the mini-batch size and the sparsification level $S$ employed at each device, and it decreases as the mini-batch size and the sparsification level increase.

Under Assumptions 1–2, we characterize the convergence rate of federated learning, as given in the following theorem:

Theorem 2

Let $\epsilon_{\rm max}=\max_{t}\frac{\|{\bf g}_{\rm avg}^{(t)}-\hat{\bf g}_{\rm avg}^{(t)}\|^{2}}{\|{\bf g}_{\rm avg}^{(t)}\|^{2}}$ be the maximum normalized reconstruction error of the global gradient vector at the PS. Under Assumptions 1–2, if $\epsilon_{\rm max}<\frac{1}{1+\psi}$, federated learning based on SGD with a learning rate $\eta_{t}=\frac{\eta}{\sqrt{t}}$ satisfies the following bound:

$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}\|\nabla F({\bf w}_{t})\|^{2}\right]\leq\frac{1}{\sqrt{T}}\left[\frac{F({\bf w}_{1})-F({\bf w}^{\star})}{c_{1}\eta-c_{2}\eta^{2}}\right]$,   (25)

for any $\eta<\frac{c_{1}}{c_{2}}$, where $c_{1}=\frac{1-\epsilon_{\rm max}(1+\psi)}{2}$ and $c_{2}=\beta(1+\epsilon_{\rm max})(1+\psi)$.

Proof:

See Appendix C. ∎

Theorem 2 shows that federated learning converges to a stationary point of a smooth loss function at a rate of $\mathcal{O}\big(\frac{1}{\sqrt{T}}\big)$, which is of the same order as the original SGD algorithm [10], provided that the maximum normalized reconstruction error at the PS is below the threshold $\frac{1}{1+\psi}$. Theorem 2 also shows that the convergence rate of federated learning improves as the reconstruction error of the global gradient vector at the PS decreases, because $c_{1}\eta-c_{2}\eta^{2}$ increases as $\epsilon_{\rm max}$ decreases. Fortunately, the reconstruction error of the global gradient vector can be effectively reduced by employing the proposed gradient reconstruction algorithm, as will be demonstrated in Sec. VI. Therefore, our analytical and numerical results together imply that the convergence rate of federated learning can be improved by using the proposed algorithm. As discussed in Theorem 1, the reconstruction error of the proposed algorithm decreases as the compression ratio decreases; thereby, Theorem 2 also reveals the trade-off between the communication overhead and the convergence rate of federated learning.
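As a worked example of the conditions in Theorem 2, the following sketch computes $c_{1}$, $c_{2}$, and the admissible learning-rate range $\eta<c_{1}/c_{2}$ for illustrative (assumed) values of $\beta$, $\psi$, and $\epsilon_{\rm max}$; none of these numbers come from the paper.

```python
# A quick sketch of the learning-rate condition in Theorem 2 (all values are assumptions).
def admissible_learning_rate(beta, psi, eps_max):
    if eps_max >= 1.0 / (1.0 + psi):
        raise ValueError("Theorem 2 requires eps_max < 1/(1+psi).")
    c1 = (1.0 - eps_max * (1.0 + psi)) / 2.0
    c2 = beta * (1.0 + eps_max) * (1.0 + psi)
    return c1, c2, c1 / c2

c1, c2, eta_max = admissible_learning_rate(beta=10.0, psi=0.5, eps_max=0.02)
# Any fixed eta with 0 < eta < eta_max satisfies the bound (25).
```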

VI Simulation Results

In this section, we use simulations to evaluate the performance of federated learning over a wireless MIMO MAC when employing the communication-efficient federated learning framework in Algorithm 1. We set $K=32$, $U=64$, and $\sigma^{2}=1$, while modeling the entries of the MIMO channel matrix as IID Gaussian random variables with zero mean and unit variance. We assume that the downlink communication is error free unless specified otherwise. For the gradient compression technique in Sec. III-A, we set $M=\lfloor N/R\rfloor$ and $S=\lfloor S_{\rm ratio}N\rfloor$, where $S_{\rm ratio}$ is the sparsification ratio. For the gradient reconstruction at the PS, we consider the following algorithms:

  • Proposed algorithm: This is given in Algorithm 3. To initialize the algorithm, we set $\hat{x}_{k}^{\rm pri}[m]=0$, $\nu_{x_{k}[m]}^{\rm pri}=1/P_{k}^{(t)}$, and ${\bm \nu}_{{\bf g}_{k,b}^{(t)}}=1/(RP_{k}^{(t)}){\bf 1}_{N}$. (This initialization uses $\mathbb{E}\big[|({\bf x}_{k,b}^{(t)})_{m}|^{2}\big]\approx\|{\bf x}_{k}^{(t)}\|^{2}/\bar{M}=1/P_{k}^{(t)}$ and $\mathbb{E}\big[|({\bf g}_{k,b}^{(t)})_{n}|^{2}\big]\approx\mathbb{E}\big[|({\bf x}_{k,b}^{(t)})_{m}|^{2}\big]/R\approx 1/(RP_{k}^{(t)})$.) An initial estimate $\hat{\bf g}_{k,b}^{(t)}$ is randomly drawn from $\mathcal{N}({\bf 0}_{N},{\bm \nu}_{{\bf g}_{k,b}^{(t)}})$. The parameters ${\bm \theta}_{k,b}^{(t)}$ of the Bernoulli Gaussian-mixture model for ${\bf g}_{k,b}^{(t)}$ are initialized with $\lambda_{0}=0.9$, $L=3$, $\lambda_{\ell}=\frac{1-\lambda_{0}}{L}$, $\mu_{\ell}=\hat{g}_{\rm min}+\frac{2\ell-1}{2L}(\hat{g}_{\rm max}-\hat{g}_{\rm min})$, $\phi_{\ell}=\frac{1}{12}\left(\frac{\hat{g}_{\rm max}-\hat{g}_{\rm min}}{L}\right)^{2}$, $\forall\ell\in\{1,\ldots,L\}$, where $\hat{g}_{\rm max}$ and $\hat{g}_{\rm min}$ are the largest and the smallest entries of $\hat{\bf g}_{k,b}^{(t)}$, respectively (a sketch of this initialization is given after this list). Other parameters are set as $\tau_{\rm GAMP}=10^{-5}$, $I_{\rm GAMP}=30$, and $I_{\rm Turbo}=2$, unless otherwise specified.

  • LMMSE-OMP: This is a variation of the proposed algorithm that employs the LMMSE MIMO detector followed by the OMP algorithm in [20, 21, 40], without harnessing the turbo decoding principle. In this algorithm, the distribution of the compressed gradient vector is modeled as ${\bf x}^{(t)}[m]\sim\mathcal{N}({\bf 0},{\bf P}^{(t)})$. The OMP algorithm is set to stop after $S$ iterations.

  • 2D-OMP: This is the OMP algorithm proposed in [33], which directly solves the matrix-form problem in (12). This algorithm is set to stop after $SK$ iterations.

  • Kron-OMP: This is the OMP algorithm in [20, 21, 40] that solves the Kronecker-form problem in (13). This algorithm is set to stop after $SK$ iterations.
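As referenced in the description of the proposed algorithm above, the following sketch reproduces its initialization (the prior means/variances and the Bernoulli Gaussian-mixture parameters). Here $N$, $M$, $P_k$, and $R$ are assumed given, and the function name is ours.

```python
# A sketch of the initialization described for the proposed algorithm.
import numpy as np

def init_proposed(N, M, P_k, R, lam0=0.9, L=3, rng=np.random.default_rng(0)):
    x_pri = np.zeros(M)                                   # \hat{x}^{pri}_k[m] = 0
    nu_x_pri = 1.0 / P_k                                  # prior variance 1/P_k
    nu_g = np.full(N, 1.0 / (R * P_k))                    # prior variance of the gradient entries
    g_hat = rng.normal(0.0, np.sqrt(nu_g))                # initial estimate drawn from N(0, nu_g)
    g_min, g_max = g_hat.min(), g_hat.max()
    ell = np.arange(1, L + 1)
    lam = np.full(L, (1.0 - lam0) / L)                    # equal mixture weights (1 - lam0)/L
    mu = g_min + (2 * ell - 1) / (2 * L) * (g_max - g_min)    # component means on a uniform grid
    phi = np.full(L, ((g_max - g_min) / L)**2 / 12.0)     # variance of a uniform bin of width (g_max-g_min)/L
    return x_pri, nu_x_pri, g_hat, nu_g, (lam0, lam, mu, phi)
```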

The computational complexities of the above algorithms are characterized in Table I. Note that in our simulations, different numbers of blocks, $B$, are adopted for different algorithms in order to ensure that they have a similar complexity order, as specified in Table I. Meanwhile, the same sub-vector index sets are adopted by every device (i.e., $\mathcal{I}_{k,b}=\mathcal{I}_{b}$, $\forall k$).

TABLE I: The number of real multiplications required by the gradient reconstruction algorithms.
Algorithm | Complexity order | Value in simulation for $R=3$, $S_{\rm ratio}=4\%$ | Value in simulation for $R=5$, $S_{\rm ratio}=4\%$
Proposed algorithm ($B=10$) | $\left(U^{3}M+NMKI_{\rm GAMP}\right)I_{\rm Turbo}B$ | $1.90\times 10^{10}$ | $1.14\times 10^{10}$
LMMSE-OMP ($B=10$) | $\left(U^{3}M+KS^{4}/4+NMKS\right)B$ | $1.97\times 10^{10}$ | $1.23\times 10^{10}$
2D-OMP ($B=100$) | $\left\{(KS)^{4}/4+(U+N)MK^{2}S\right\}B$ | $3.67\times 10^{10}$ | $3.82\times 10^{10}$
Kron-OMP ($B=300$) | $\left\{(KS)^{4}/4+UNMK^{2}S\right\}B$ | $4.12\times 10^{10}$ | $2.21\times 10^{10}$

In this simulation, we consider an image classification task using the MNIST dataset, in which a handwritten digit (from 0 to 9) in a $28\times 28$ grayscale image is classified using a neural network [30]. We particularly adopt a neural network that consists of $784$ input nodes, a single hidden layer with $20$ hidden nodes, and $10$ output nodes. The activation functions of the hidden layer and the output layer are the ReLU and the softmax functions, respectively. The weights of this neural network at iteration $t$ are mapped into a parameter vector ${\bf w}_{t}$ with $\bar{N}=15910$. To train the network, we consider a mini-batch SGD algorithm with mini-batch size $|\mathcal{D}_{k}^{(t)}|=10$, a fixed learning rate $\eta_{t}=0.2$, and the cross-entropy loss function. The local training data set of device $k$, $\mathcal{D}_{k}$, is determined by randomly drawing $1000$ training data samples labeled with digit $d_{k}=\big\lfloor\frac{k-1}{K/10}\big\rfloor$ from the $60000$ training data samples in the MNIST dataset; this corresponds to a non-IID setting because each device has information about only one digit. For performance evaluation, we consider two metrics: the classification accuracy computed over the $10000$ test data samples in the MNIST dataset, and the normalized MSE (NMSE) of the global gradient vector, defined as $\mathbb{E}[\epsilon_{t}]$ with $\epsilon_{t}=\frac{\|{\bf g}_{\rm avg}^{(t)}-\hat{\bf g}_{\rm avg}^{(t)}\|^{2}}{\|{\bf g}_{\rm avg}^{(t)}\|^{2}}$.
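As a sanity check on the model size, the following NumPy sketch builds the described 784–20–10 network with ReLU and softmax activations and verifies that the number of trainable weights equals $\bar{N}=15910$; the weight initialization shown is an arbitrary assumption.

```python
# A NumPy sketch of the 784-20-10 network used in the simulations.
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    h = np.maximum(0.0, x @ W1 + b1)                      # hidden layer with ReLU
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)               # softmax output probabilities

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((784, 20)) * 0.01, np.zeros(20)
W2, b2 = rng.standard_normal((20, 10)) * 0.01, np.zeros(10)
n_params = W1.size + b1.size + W2.size + b2.size          # 784*20 + 20 + 20*10 + 10 = 15910
probs = mlp_forward(rng.standard_normal((5, 784)), W1, b1, W2, b2)   # 5 example images
```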

Figure 4: Performance comparison of different gradient reconstruction algorithms when $R=5$ and $S_{\rm ratio}=4\%$. (a) Classification accuracy; (b) Normalized MSE.

In Fig. 4, we evaluate the classification accuracy and the NMSE of federated learning with different gradient reconstruction algorithms when $R=5$ and $S_{\rm ratio}=4\%$. In Fig. 4(a), we also plot the classification accuracy of federated learning based on over-the-air computation (FL-AirComp) developed in [28], assuming no device selection and no intelligent reflecting surface between the PS and the wireless devices. In the noisy downlink scenario, we assume that the parameter vector received by device $k$ at iteration $t$ is modeled as ${\bf w}_{k}^{(t)}=\epsilon_{\rm DL}{\bf w}_{t}+\sqrt{1-\epsilon_{\rm DL}^{2}}\,{\bf e}_{k,t}$, where $\epsilon_{\rm DL}=0.7$, ${\bf e}_{k,t}$ is an independent Gaussian random vector distributed as $\mathcal{N}\big({\bf 0}_{\bar{N}},{\sf diag}(|w_{t,1}|^{2},\ldots,|w_{t,\bar{N}}|^{2})\big)$, and $w_{t,i}$ is the $i$-th entry of ${\bf w}_{t}$. Fig. 4(a) shows that for both the error-free and the noisy downlink scenarios, the proposed algorithm achieves almost the same classification accuracy as perfect reconstruction with no compression. Meanwhile, Fig. 4(b) shows that the proposed algorithm enables almost lossless (below $-17$ dB NMSE) reconstruction of the global gradient vector at the PS. Unlike the proposed algorithm, the conventional CS algorithms (2D-OMP and Kron-OMP) cause a degradation in classification accuracy (see Fig. 4(a)) due to a severe loss of reconstruction accuracy (see Fig. 4(b)). Although both the proposed algorithm and LMMSE-OMP adopt the divide-and-conquer strategy described in Sec. IV, the proposed algorithm outperforms LMMSE-OMP; this gain is attained by harnessing the turbo decoding principle together with the EM-GAMP algorithm. FL-AirComp provides a classification accuracy similar to that of the proposed algorithm, but this framework assumes no gradient compression at the devices; thereby, FL-AirComp requires $R=5$ times higher communication overhead than the federated learning framework considered for the proposed algorithm.
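The noisy-downlink model used in this comparison can be simulated with the short sketch below, where $\epsilon_{\rm DL}=0.7$ as stated in the text and the parameter-vector values are placeholders of our own choosing.

```python
# A small sketch of the noisy-downlink model: a scaled global model plus entrywise-matched noise.
import numpy as np

def noisy_downlink(w_t, eps_dl=0.7, rng=np.random.default_rng(0)):
    e = rng.normal(0.0, np.abs(w_t))                      # e_{k,t} ~ N(0, diag(|w_{t,i}|^2))
    return eps_dl * w_t + np.sqrt(1.0 - eps_dl**2) * e

w_k = noisy_downlink(np.linspace(-0.1, 0.1, 15910))       # received model at one device
```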

Figure 5: Classification accuracy of federated learning with different compression ratios when $S_{\rm ratio}=4\%$.
Figure 6: Average NMSE of the proposed algorithm with different numbers of turbo iterations when $S_{\rm ratio}=1\%$.

In Fig. 5, we evaluate the classification accuracy of federated learning with different compression ratios when $S_{\rm ratio}=4\%$. Fig. 5 shows that the classification accuracies of all the gradient reconstruction algorithms degrade as the compression ratio increases. This result follows from the fact that the smaller the projected dimension in (10), the more difficult the reconstruction of the local gradient vectors at the PS. Since the communication overhead of federated learning decreases with the compression ratio, this result also reveals the trade-off between the communication overhead and the learning performance of federated learning. Despite this trade-off, the proposed algorithm provides almost the same accuracy even at a compression ratio of $R=5$, implying that the proposed algorithm is the most robust against an increase in the compression ratio. Therefore, the proposed algorithm minimizes the communication overhead of federated learning required to achieve a certain level of classification accuracy.

Figure 7: Performance of the proposed gradient reconstruction algorithm with different sparsification ratios. (a) Average classification accuracy; (b) Average NMSE.

In Fig. 6, we evaluate the average NMSE of the proposed algorithm with different numbers of turbo iterations when $S_{\rm ratio}=1\%$. The NMSE is averaged over the first 20 iterations. Fig. 6 shows that the reconstruction accuracy of the proposed algorithm improves with the number of turbo iterations. This improvement comes from the information exchange between Module A and Module B of the proposed algorithm; during this exchange, the information from the uplink received signals and the sparse property of the local gradient vectors complement each other to improve the reconstruction accuracy. Since the benefit of this exchange diminishes as it repeats, the performance improvement of the proposed algorithm becomes marginal after a few (about three) turbo iterations.

In Fig. 7, we evaluate the average classification accuracy and NMSE of the proposed gradient reconstruction algorithm with different sparsification ratios. These performance measures are averaged over the first 50 iterations. Fig. 7(a) shows that there exists an optimal sparsification ratio that provides the highest classification accuracy. This result is due to the trade-off between the amount of local gradient information conveyed to the PS and the reconstruction accuracy of the local gradient vectors at the PS. More specifically, increasing the sparsification ratio increases the number of non-zero gradient entries conveyed to the PS per iteration but, at the same time, decreases the sparsity of the local gradient vectors, which degrades the reconstruction accuracy, as shown in Fig. 7(b). These results imply that the performance of the proposed algorithm can be maximized by optimizing the sparsification level, which would be an interesting direction for future work.

VII Conclusion

In this paper, we have studied communication-efficient federated learning over a wireless MIMO MAC in which the communication overhead for local gradient transmission is effectively reduced by employing gradient compression at the wireless devices and simultaneous transmission via shared radio resources. Under this scenario, we have presented a turbo-iterative algorithm that solves the computationally demanding gradient reconstruction problem at the PS based on a divide-and-conquer strategy. One prominent feature of the presented algorithm is that the gradient reconstruction accuracy at the PS improves as the iterations continue: in each iteration, the belief obtained from the uplink received signals at the PS is complemented by the sparse property of the local gradient vectors. We have also characterized the reconstruction error of the presented algorithm and a sufficient condition that guarantees the convergence of federated learning. Using simulations, we have demonstrated that the presented algorithm achieves almost the same performance as a perfect-reconstruction scenario while providing superior performance compared to existing CS algorithms.

An important direction for future research is to optimize device scheduling and power control for uplink communication in federated learning, taking into account the different locations of devices in wireless networks. It would also be important to study downlink communication of federated learning to enable accurate and efficient broadcasting of the global model to wireless devices.

Appendix A Proof of Lemma 1

Suppose that the $b$-th local gradient sub-vector at device $k$, ${\bf g}_{k,b}^{(t)}$, follows an IID random distribution with mean $\mu_{g_{k,b}}^{(t)}$ and variance $\zeta_{k,b}^{(t)}$. Since ${\bf A}^{(t)}$ is an IID random matrix with $({\bf A}^{(t)})_{m,n}\sim\mathcal{N}(0,1/M)$, the projection in (10) implies that ${\bf x}_{k,b}^{(t)}\sim\mathcal{N}({\bf 0}_{M},R\zeta_{k,b}^{(t)}{\bf I}_{M})$ for large $N$ by the central limit theorem. Then from (17) and (18), the posterior mean and variance of $x_{k}^{(t)}[m]$ for $m\in\mathcal{M}_{b}$ are computed as

$\hat{x}_{k}^{\rm post}[m]=\nu_{x_{k}[m]}^{\rm post}\left\{\sum_{j=1}^{K}\chi_{k,b,j}^{(t)}x_{j}^{(t)}[m]+\chi_{k,b,{\bf z}}^{(t)}[m]\right\}$, and $\nu_{x_{k}[m]}^{\rm post}=\frac{R\zeta_{k,b}^{(t)}}{1+R\zeta_{k,b}^{(t)}\chi_{k,b,k}^{(t)}}$,   (26)

respectively, where $\chi_{k,b,j}^{(t)}=P_{k}^{(t)}\big({\bf h}_{k}^{(t)}\big)^{\sf T}\big({\bm \Phi}_{k,b}^{(t)}+\sigma^{2}{\bf I}_{U}\big)^{-1}{\bf h}_{j}^{(t)}$, $\chi_{k,b,{\bf z}}^{(t)}[m]=\sqrt{P_{k}^{(t)}}\big({\bf h}_{k}^{(t)}\big)^{\sf T}\big({\bm \Phi}_{k,b}^{(t)}+\sigma^{2}{\bf I}_{U}\big)^{-1}{\bf z}^{(t)}[m]$, ${\bm \Phi}_{k,b}^{(t)}=R\sum_{j\neq k}\zeta_{j,b}^{(t)}P_{j}^{(t)}{\bf h}_{j}^{(t)}\big({\bf h}_{j}^{(t)}\big)^{\sf T}$, and ${\bf h}_{k}^{(t)}$ is the $k$-th column of ${\bf H}^{(t)}$. By plugging $\hat{x}_{k}^{\rm post}[m]$, $\nu_{x_{k}[m]}^{\rm post}$, $\hat{x}_{k}^{\rm pri}[m]=0$, and $\nu_{x_{k}[m]}^{\rm pri}=R\zeta_{k,b}^{(t)}$ into (19), the extrinsic mean and variance of $x_{k}^{(t)}[m]$ for $m\in\mathcal{M}_{b}$ are computed as

$\hat{x}_{k}^{(t)}[m]=x_{k}^{(t)}[m]+\frac{1}{\chi_{k,b,k}^{(t)}}\Bigg\{\sum_{j\neq k}\chi_{k,b,j}^{(t)}x_{j}^{(t)}[m]+\chi_{k,b,{\bf z}}^{(t)}[m]\Bigg\}$, and $\nu_{k,b}^{(t)}=\frac{1}{\chi_{k,b,k}^{(t)}}$,   (27)

respectively. From the eigendecomposition of ${\bm \Phi}_{k,b}^{(t)}$, the extrinsic variance $\nu_{k,b}^{(t)}$ can be rewritten as

$\nu_{k,b}^{(t)}=\frac{1}{{\sf SNR}_{k}^{(t)}}\Bigg(\|{\bf h}_{k}^{(t)}\|^{2}-\sum_{i=1}^{E_{k,b}^{(t)}}\big|\big({\bf u}_{k,b,i}^{(t)}\big)^{\sf T}{\bf h}_{k}^{(t)}\big|^{2}\frac{\lambda_{k,b,i}^{(t)}}{\lambda_{k,b,i}^{(t)}+\sigma^{2}}\Bigg)^{-1}$,   (28)

where ${\sf SNR}_{k}^{(t)}=\frac{P_{k}}{\sigma^{2}}$, $\lambda_{k,b,i}^{(t)}$ is the $i$-th largest eigenvalue of ${\bm \Phi}_{k,b}^{(t)}$, ${\bf u}_{k,b,i}^{(t)}$ is the eigenvector associated with $\lambda_{k,b,i}^{(t)}$, and $E_{k,b}^{(t)}$ is the rank of ${\bm \Phi}_{k,b}^{(t)}$. Since ${\bf x}_{k,b}^{(t)}$ is an IID Gaussian vector for large $N$, the extrinsic mean vector $\hat{\bf x}_{k,b}^{(t)}=\big[\hat{x}_{k}^{(t)}[\mathcal{M}_{b}(1)],\cdots,\hat{x}_{k}^{(t)}[\mathcal{M}_{b}(M)]\big]^{\sf T}$ becomes an AWGN observation of ${\bf x}_{k,b}^{(t)}$ whose noise variance is the extrinsic variance $\nu_{k,b}^{(t)}$ [38]. This completes the proof.
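The equivalence between the extrinsic variance $1/\chi_{k,b,k}^{(t)}$ in (27) and the eigendecomposition form in (28) can be checked numerically as in the sketch below; all concrete dimensions and values are illustrative assumptions.

```python
# A numerical check that 1/chi_{k,b,k} (direct form) matches the eigendecomposition form (28).
import numpy as np

rng = np.random.default_rng(1)
U, K, R, sigma2 = 16, 6, 3.0, 1.0
H = rng.standard_normal((U, K))
P = np.ones(K)
zeta = np.full(K, 1e-2)
k = 0
h_k = H[:, k]
Phi = sum(R * zeta[j] * P[j] * np.outer(H[:, j], H[:, j]) for j in range(K) if j != k)

# Direct form: nu = 1 / (P_k h_k^T (Phi + sigma^2 I)^{-1} h_k).
chi_kk = P[k] * h_k @ np.linalg.solve(Phi + sigma2 * np.eye(U), h_k)
nu_direct = 1.0 / chi_kk

# Eigendecomposition form (28); zero eigenvalues contribute nothing to the sum.
lam, Umat = np.linalg.eigh(Phi)
proj = (Umat.T @ h_k)**2
nu_eig = (sigma2 / P[k]) / (h_k @ h_k - np.sum(proj * lam / (lam + sigma2)))

assert np.isclose(nu_direct, nu_eig)
```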

Appendix B Proof of Theorem 1

Suppose that the $b$-th local gradient sub-vector at device $k$, ${\bf g}_{k,b}^{(t)}$, follows an IID random distribution with mean $\mu_{g_{k,b}}^{(t)}$ and variance $\zeta_{k,b}^{(t)}$. Under the assumption that the distributions of $\{{\bf g}_{k,b}^{(t)}\}_{k}$ are known at the PS a priori, the EM-GAMP algorithm in Module B of the proposed algorithm reduces to the GAMP algorithm in [43], which does not have the EM step (i.e., Step 8 in Algorithm 2). In the asymptotic regime of $N\rightarrow\infty$ and $N/M\rightarrow R\geq 1$, the behavior of the GAMP algorithm applied to reconstruct ${\bf g}_{k,b}^{(t)}$ from $\hat{\bf x}_{k,b}^{(t)}$ is characterized by a scalar state evolution. If this state evolution has a unique fixed point, the output of the GAMP algorithm, $\hat{\bf g}_{k,b}^{(t)}$, converges to the MMSE estimate of ${\bf g}_{k,b}^{(t)}$ for a given observation $\hat{\bf x}_{k,b}^{(t)}$ [41, 42, 43]. Therefore, the MSE between ${\bf g}_{k,b}^{(t)}$ and $\hat{\bf g}_{k,b}^{(t)}$ is less than or equal to the MSE of the linear MMSE estimate of ${\bf g}_{k,b}^{(t)}$, i.e.,

$\mathbb{E}\big[\|{\bf g}_{k,b}^{(t)}-\hat{\bf g}_{k,b}^{(t)}\|^{2}\big]\leq\mathbb{E}\big[\|{\bf g}_{k,b}^{(t)}-\hat{\bf g}_{{\rm LMMSE},k,b}^{(t)}\|^{2}\big]$,   (29)

where $\hat{\bf g}_{{\rm LMMSE},k,b}^{(t)}$ is the linear MMSE estimate of ${\bf g}_{k,b}^{(t)}$ for a given observation $\hat{\bf x}_{k,b}^{(t)}$ in (22). From (22) with ${\bf x}_{k,b}^{(t)}={\bf A}^{(t)}{\bf g}_{k,b}^{(t)}$, the MSE of the linear MMSE estimate is computed as [36]

$\mathbb{E}\big[\|{\bf g}_{k,b}^{(t)}-\hat{\bf g}_{{\rm LMMSE},k,b}^{(t)}\|^{2}\big]=N\zeta_{k,b}^{(t)}-\big(\zeta_{k,b}^{(t)}\big)^{2}{\sf Tr}\left[({\bf A}^{(t)})^{\sf T}\left(\zeta_{k,b}^{(t)}{\bf A}^{(t)}({\bf A}^{(t)})^{\sf T}+\nu_{k,b}^{(t)}{\bf I}_{M}\right)^{-1}{\bf A}^{(t)}\right]\overset{(a)}{=}N\zeta_{k,b}^{(t)}\left[1-\frac{\zeta_{k,b}^{(t)}}{R\zeta_{k,b}^{(t)}+\nu_{k,b}^{(t)}}\right]$,   (30)

where the equality (a) follows from ${\bf A}^{(t)}({\bf A}^{(t)})^{\sf T}\rightarrow R{\bf I}_{M}$ as $N\rightarrow\infty$ under the assumption that ${\bf A}^{(t)}$ is an IID random matrix with $({\bf A}^{(t)})_{m,n}\sim\mathcal{N}(0,1/M)$. Since the $k$-th local gradient vector is reconstructed as $\hat{\bf g}_{k}^{(t)}={\sf Concatenate}\big(\{\hat{\bf g}_{k,b}^{(t)}\}_{b=1}^{B}\big)$, the MSE between $\hat{\bf g}_{\rm avg}^{(t)}=\sum_{k=1}^{K}\rho_{k}\hat{\bf g}_{k}^{(t)}$ and ${\bf g}_{\rm avg}^{(t)}=\sum_{k=1}^{K}\rho_{k}{\bf g}_{k}^{(t)}$ is expressed as

$\mathbb{E}\big[\|{\bf g}_{\rm avg}^{(t)}-\hat{\bf g}_{\rm avg}^{(t)}\|^{2}\big]=\mathbb{E}\left[\left\|\sum_{k=1}^{K}\rho_{k}\big({\bf g}_{k}^{(t)}-\hat{\bf g}_{k}^{(t)}\big)\right\|^{2}\right]\overset{(b)}{=}\sum_{k=1}^{K}\rho_{k}^{2}\sum_{b=1}^{B}\mathbb{E}\big[\|{\bf g}_{k,b}^{(t)}-\hat{\bf g}_{k,b}^{(t)}\|^{2}\big]$,   (31)

where the equality (b) holds because different parameters of the global model are simultaneously transmitted by different devices when $\mathcal{I}_{k,b}\neq\mathcal{I}_{j,b}$ for $k\neq j$. Applying (29) and (30) to (31) yields the result in (24).
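The step (a) above can be illustrated numerically: for an IID sensing matrix with $\mathcal{N}(0,1/M)$ entries, the exact LMMSE MSE in the first expression of (30) is close to the asymptotic expression obtained by substituting ${\bf A}^{(t)}({\bf A}^{(t)})^{\sf T}$ with $R{\bf I}_{M}$, as in the sketch below (all parameter values are our assumptions).

```python
# A sketch comparing the exact LMMSE MSE in (30) with its asymptotic form.
import numpy as np

rng = np.random.default_rng(2)
N, R = 1500, 3
M = N // R
zeta, nu = 1e-2, 1e-3
A = rng.normal(0.0, np.sqrt(1.0 / M), size=(M, N))

# Exact LMMSE MSE; note Tr[A^T B^{-1} A] = sum(A * B^{-1} A) elementwise.
middle = np.linalg.solve(zeta * (A @ A.T) + nu * np.eye(M), A)
mse_exact = N * zeta - zeta**2 * np.sum(A * middle)

# Asymptotic form obtained by replacing A A^T with R I_M.
mse_asym = N * zeta * (1.0 - zeta / (R * zeta + nu))

rel_gap = abs(mse_exact - mse_asym) / mse_asym            # small for these parameters
```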

Appendix C Proof of Theorem 2

Under Assumption 1, the improvement of the loss function at iteration $t$ is upper bounded as

$F({\bf w}_{t+1})-F({\bf w}_{t})\leq\nabla F({\bf w}_{t})^{\sf T}({\bf w}_{t+1}-{\bf w}_{t})+\frac{\beta}{2}\|{\bf w}_{t+1}-{\bf w}_{t}\|^{2}$
$\overset{(a)}{=}-\eta_{t}\nabla F({\bf w}_{t})^{\sf T}{\bf g}_{\rm avg}^{(t)}-\eta_{t}\nabla F({\bf w}_{t})^{\sf T}{\bf e}_{\rm avg}^{(t)}+\eta_{t}^{2}\frac{\beta}{2}\|{\bf g}_{\rm avg}^{(t)}+{\bf e}_{\rm avg}^{(t)}\|^{2}$
$\overset{(b)}{\leq}-\eta_{t}\nabla F({\bf w}_{t})^{\sf T}{\bf g}_{\rm avg}^{(t)}+\frac{\eta_{t}}{2}\big(\|\nabla F({\bf w}_{t})\|^{2}+\|{\bf e}_{\rm avg}^{(t)}\|^{2}\big)+\eta_{t}^{2}\beta\big(\|{\bf g}_{\rm avg}^{(t)}\|^{2}+\|{\bf e}_{\rm avg}^{(t)}\|^{2}\big)$,   (32)

where the equality (a) follows from (6) with ${\bf e}_{\rm avg}^{(t)}=\hat{\bf g}_{\rm avg}^{(t)}-{\bf g}_{\rm avg}^{(t)}$, and the inequality (b) follows from $\pm{\bf a}^{\sf T}{\bf b}\leq\frac{1}{2}(\|{\bf a}\|^{2}+\|{\bf b}\|^{2})$. Let $\epsilon_{t}=\|\hat{\bf g}_{\rm avg}^{(t)}-{\bf g}_{\rm avg}^{(t)}\|^{2}/\|{\bf g}_{\rm avg}^{(t)}\|^{2}$ be the normalized reconstruction error of the global gradient vector at iteration $t$. Then the inequality in (32) can be rewritten as

$F({\bf w}_{t+1})-F({\bf w}_{t})\leq-\eta_{t}\nabla F({\bf w}_{t})^{\sf T}{\bf g}_{\rm avg}^{(t)}+\frac{\eta_{t}}{2}\big(\|\nabla F({\bf w}_{t})\|^{2}+\epsilon_{t}\|{\bf g}_{\rm avg}^{(t)}\|^{2}\big)+\eta_{t}^{2}\beta(1+\epsilon_{t})\|{\bf g}_{\rm avg}^{(t)}\|^{2}$.   (33)

Since $\mathbb{E}\big[{\bf g}_{\rm avg}^{(t)}\big|{\bf w}_{t}\big]=\nabla F({\bf w}_{t})$ and $\mathbb{E}\big[\|{\bf g}_{\rm avg}^{(t)}\|^{2}\big|{\bf w}_{t}\big]\leq(1+\psi)\|\nabla F({\bf w}_{t})\|^{2}$ under Assumption 2, taking the expectation of both sides of (33) conditioned on ${\bf w}_{t}$ yields

$\mathbb{E}\big[F({\bf w}_{t+1})-F({\bf w}_{t})\big|{\bf w}_{t}\big]\leq-\frac{\eta_{t}}{2}\big\{1-\epsilon_{t}(1+\psi)\big\}\|\nabla F({\bf w}_{t})\|^{2}+\eta_{t}^{2}\beta(1+\epsilon_{t})(1+\psi)\|\nabla F({\bf w}_{t})\|^{2}\leq-(c_{1}\eta_{t}-c_{2}\eta_{t}^{2})\|\nabla F({\bf w}_{t})\|^{2}$,   (34)

where $c_{1}=\frac{1-\epsilon_{\rm max}(1+\psi)}{2}$, $c_{2}=\beta(1+\epsilon_{\rm max})(1+\psi)>0$, $\epsilon_{\rm max}=\max_{t}\epsilon_{t}$, and the last inequality follows from $\epsilon_{t}\leq\epsilon_{\rm max}$.

The gap between the initial loss at ${\bf w}_{1}$ and the minimum loss is lower bounded as

$F({\bf w}_{1})-F({\bf w}^{\star})\geq F({\bf w}_{1})-\mathbb{E}[F({\bf w}_{T+1})]=\sum_{t=1}^{T}\mathbb{E}\left[F({\bf w}_{t})-F({\bf w}_{t+1})\right]\overset{(c)}{\geq}\mathbb{E}\left[\sum_{t=1}^{T}\left(c_{1}\eta_{t}-c_{2}\eta_{t}^{2}\right)\|\nabla F({\bf w}_{t})\|^{2}\right]$,   (35)

where the inequality (c) follows from (34). Plugging the adaptive learning rate $\eta_{t}=\frac{\eta}{\sqrt{t}}>0$ into (35) yields [11]

$F({\bf w}_{1})-F({\bf w}^{\star})\geq\mathbb{E}\left[\sum_{t=1}^{T}\left(\frac{c_{1}\eta}{\sqrt{t}}-\frac{c_{2}\eta^{2}}{t}\right)\|\nabla F({\bf w}_{t})\|^{2}\right]$.   (36)

If $c_{1}>0$, from $\frac{1}{t}\leq\frac{1}{\sqrt{t}}$ and $\frac{1}{\sqrt{T}}\leq\frac{1}{\sqrt{t}}$ for $1\leq t\leq T$, the right-hand side of (36) is further lower bounded as

$F({\bf w}_{1})-F({\bf w}^{\star})\geq\left(c_{1}\eta-c_{2}\eta^{2}\right)\mathbb{E}\left[\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\|\nabla F({\bf w}_{t})\|^{2}\right]$.   (37)

Since $c_{1}>0$ for $\epsilon_{\rm max}<\frac{1}{1+\psi}$, the inequality in (37) can be rewritten as (25) for any $\eta<\frac{c_{1}}{c_{2}}$ with $\epsilon_{\rm max}<\frac{1}{1+\psi}$.

References

  • [1] J. Konec̆ný, B. H. McMahan, and D. Ramage, “Federated optimization: Distributed optimization beyond the datacenter,” arXiv:1511.03575 [cs.LG], Nov. 2015 [Online] Available: https://arxiv.org/abs/1511.03575
  • [2] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proc. 20th Int. Conf. Artif. Intell. Statist. (AISTATS), 2017, pp. 1273–1282.
  • [3] S. Niknam, H. S. Dhillon, and J. H. Reed, “Federated learning for wireless communications: Motivation, opportunities and challenges,” IEEE Commun. Magazine, vol. 58, no. 1, pp. 19–25, Jan. 2020.
  • [4] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, “Toward an intelligent edge: Wireless communication meets machine learning,” IEEE Commun. Magazine, vol. 58, no. 1, pp. 19–25, Jan. 2020.
  • [5] D. Gündüz, D. B. Kurka, M. Jankowski, M. M. Amiri, E. Ozfatura, and S. Sreekumar, “Communicate to learn at the edge,” IEEE Commun. Magazine, vol. 58, no. 12, pp. 14–19, Dec. 2020.
  • [6] S. Samarakoon, M. Bennis, W. Saad, and M. Debbah, “Distributed federated learning for ultra-reliable low-latency vehicular communications,” IEEE Trans. Commun., vol. 68, no. 2, pp. 1146–1159, Feb. 2020.
  • [7] M. Chen, O. Semiari, W. Saad, X. Liu, and C. Yin, “Federated echo state learning for minimizing breaks in presence in wireless virtual reality networks,” IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 177–191, Jan. 2020.
  • [8] J. Konec̆ný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” arXiv:1610.05492v2 [cs.LG], Oct. 2016 [Online] Available: https://arxiv.org/abs/1610.05492
  • [9] D. Alistarh, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Randomized quantization for communication-optimal stochastic gradient descent,” in Proc. Advances Neural Inf. Process. Syst. (NeurIPS), 2017, pp. 1709–1720.
  • [10] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, “signSGD: Compressed optimisation for non-convex problems,” in Proc. Int. Conf. Mach. Learn. (ICML), 2018, pp. 560–569.
  • [11] S. Lee, C. Park, S.-N. Hong, Y. C. Eldar, and N. Lee, “Bayesian federated learning over wireless networks,” arXiv:2012.15486 [eess.SP], Dec. 2020 [Online] Available: https://arxiv.org/abs/2012.15486
  • [12] N. Shlezinger, M. Chen, Y. C. Eldar, H. V. Poor, and S. Cui, “UVeQFed: Universal vector quantization for federated learning,” IEEE Trans. Signal Process., vol. 69, pp. 500–514, 2021.
  • [13] Y. Du, S. Yang, and K. Huang, “High-dimensional stochastic gradient quantization for communication-efficient edge learning,” IEEE Trans. Signal Process., vol. 68, pp. 2128–2142, 2020.
  • [14] A. F. Aji and K. Heafield, “Sparse communication for distributed gradient descent,” in Proc. Empirical Methods in Natural Language Process. (EMNLP), 2017.
  • [15] J. Wangni, J. Wang, J. Liu, and T. Zhang, “Gradient sparsification for communication-efficient distributed optimization,” in Proc. Advances Neural Inf. Process. Syst. (NeurIPS), 2018, pp. 1299–1309.
  • [16] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep gradient compression: Reducing the communication bandwidth for distributed training,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2018, pp. 1–13.
  • [17] M. M. Amiri and D. Gündüz, “Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air,” IEEE Trans. Signal Process., vol. 68, pp. 2155–2169, Mar. 2020.
  • [18] M. M. Amiri and D. Gündüz, “Federated learning over wireless fading channels,” IEEE Trans. Wireless Commun., vol. 19, no. 5, pp. 3546–3557, May 2020.
  • [19] Y. Oh, N. Lee, Y.-S. Jeon, and H. V. Poor, “Communication-efficient federated learning via quantized compressed sensing,” arXiv:2111.15071 [cs.DC], Dec. 2021 [Online]. Available: https://arxiv.org/abs/2111.15071
  • [20] Y. C. Eldar and G. Kutyniok, Compressed Sensing: Theory and Applications, Cambridge, U.K.: Cambridge University Press, 2012.
  • [21] S. Foucart and H. Rauhut, A Mathematical Introduction to Compressive Sensing, Boston, MA: Birkhäuser, 2013.
  • [22] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-the-air computation,” IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 2022–2035, Mar. 2020.
  • [23] M. M. Amiri, T. M. Duman, and D. Gündüz, “Collaborative machine learning at the wireless edge with blind transmitters,” in Proc. IEEE Global Conf. Sig. Inf. Process. (GlobalSIP), Nov. 2019, pp. 1–6.
  • [24] D. Wen, G. Zhu, and K. Huang, “Reduced-dimension design of MIMO over-the-air computing for data aggregation in clustered IoT networks,” IEEE Trans. Wireless Commun., vol. 18, no. 11, pp. 5255–5268, Nov. 2019.
  • [25] G. Zhu, Y. Wang, and K. Huang, “Broadband analog aggregation for low-latency federated edge learning,” IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 491–506, Jan. 2020.
  • [26] T. T. Vu, D. T. Ngo, N. H. Tran, H. Q. Ngo, M. N. Dao, and R. H. Middleton, “Cell-free massive MIMO for wireless federated learning,” IEEE Trans. Wireless Commun., vol. 19, no. 10, pp. 6377–6392, Oct. 2020.
  • [27] Y.-S. Jeon, M. M. Amiri, J. Li, and H. V. Poor, “A compressive sensing approach for federated learning over massive MIMO communication systems,” IEEE Trans. Wireless Commun., vol. 20, no. 3, pp. 1990–2004, Mar. 2021.
  • [28] Z. Wang, J. Qiu, Y. Zhou, Y. Shi, L. Fu, W. Chen, and K. B. Letaief, “Federated Learning via Intelligent Reflecting Surface,” IEEE Trans. Wireless Commun., vol. 21, no. 2, pp. 808–822, Feb. 2022.
  • [29] J. P. Vila and P. Schniter, “Expectation-maximization Gaussian-mixture approximate message passing,” IEEE Trans. Signal Process., vol. 61, no. 19, pp. 4658–4672, Oct. 2013.
  • [30] Y. LeCun, C. Cortes, and C. Burges, “The MNIST database of handwritten digits,” [Online]. Available: http://yann.lecun.com/exdb/mnist/
  • [31] A. Abdi, Y. M. Saidutta, and F. Fekri, “Analog compression and communication for federated learning over wireless MAC,” in Proc. IEEE Workshop Signal Process. Adv. Wirel. Commun. (SPAWC), May 2020, pp. 1–5.
  • [32] Z. Y. Peng, “An iterative method for the least squares symmetric solution of the linear matrix equation AXB = C,” Appl. Math. Comput., vol. 170, no. 1, pp. 711–723, 2005.
  • [33] T. Wimalajeewa, Y. C. Eldar, and P. K. Varshney, “Recovery of sparse matrices via matrix sketching,” arXiv:1311.2448 [cs.NA], Nov. 2013 [Online] Available: http://arxiv.org/abs/1311.2448
  • [34] D. Cohen, D. Cohen, Y. C. Eldar, and A. M. Haimovich, “SUMMeR: Sub-Nyquist MIMO radar,” IEEE Trans. Signal Process., vol. 66, no. 16, pp. 4315–4330, Aug. 2018.
  • [35] M. M. Amiri, D. Gündüz, S. R. Kulkarni, and H. V. Poor, “Convergence of federated learning over a noisy downlink,” IEEE Trans. Wireless Commun., vol. 21, no. 3, pp. 1422–1437, Mar. 2022.
  • [36] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory, NJ: Prentice Hall, 1993.
  • [37] Q. Guo and D. D. Huang, “A concise representation for the soft-in soft-out LMMSE detector,” IEEE Commun. Letts., vol. 15, no. 5, pp. 566–568, May 2011.
  • [38] L. Liu, Y. Chi, C. Yuen, Y. L. Guan, and Y. Li, “Capacity-achieving MIMO-NOMA: Iterative LMMSE detection,” IEEE Trans. Signal Process., vol. 67, no. 7, pp. 1758–1773, Apr. 2019.
  • [39] U. S. Kamilov, S. Rangan, A. K. Fletcher, and M. Unser, “Approximate message passing with consistent parameter estimation and applications to sparse learning,” IEEE Trans. Inf. Theory, vol. 60, no. 5, pp. 2969–2985, May 2014.
  • [40] J. A. Tropp, “Greed is good: Algorithmic results for sparse approximation,” IEEE Trans. Inf. Theory, vol. 50, no. 10, pp. 2231–2242, Oct. 2004.
  • [41] D. Guo and S. Verdú, “Randomly spread CDMA: Asymptotics via statistical physics,” IEEE Trans. Inf. Theory, vol. 51, no. 6, pp. 1983–2010, Jun. 2005.
  • [42] M. Bayati and A. Montanari, “The dynamics of message passing on dense graphs with applications to compressed sensing,” IEEE Trans. Inf. Theory, vol. 57, no. 2, pp. 764–785, Feb. 2011.
  • [43] S. Rangan, “Generalized approximate message passing for estimation with random linear mixing,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Aug. 2011, pp. 2168–2172.